
A single small character, “0”, actually caused the complete collapse of Station B (Bilibili).

You may still remember that night: the all-night carnival of jokes about Station B’s “office building losing power”, “servers exploding”, and “programmers deleting the database and running away”. (doge)

A year later, the “true culprit” behind it has finally been revealed by Bilibili itself:

Unexpectedly, it was this simple piece of code that knocked Station B offline for two or three hours and left its programmers sleepless, tearing their hair out all night.

You may ask: isn’t this just an ordinary function for finding the greatest common divisor? How could it be so destructive?

All the twists and turns behind the scenes boil down, in the end, to one sentence: this “0” is really nothing to take lightly.

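The function at the heart of it looks roughly like this (a sketch in Lua, reconstructed from the fragments quoted later in this article; the exact original may differ slightly):

-- Recursive greatest common divisor via the Euclidean algorithm
-- (reconstruction; see Bilibili's analysis linked at the end for the original).
local function _gcd(a, b)
    if b == 0 then
        return a
    end
    return _gcd(b, a % b)
end

print(_gcd(24, 18))  --> 6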

For more details, let’s take a look at the “Accident Report” together.

The “bloody case” caused by the string “0”

Let’s start with the root cause of the tragedy: the gcd function posted at the beginning.

Friends with a bit of programming background will recognize this as a recursive function that computes the greatest common divisor using the Euclidean algorithm (repeated division with remainder).

Unlike the way we compute the greatest common divisor by hand, this algorithm goes like this:

Take a simple example: a=24, b=18; find the greatest common divisor of a and b;

a divided by b leaves a remainder of 6, so set a=18, b=6, and continue;

18 divided by 6 leaves a remainder of 0, so 6 is the greatest common divisor of 24 and 18.

In other words, a and b are repeatedly divided, keeping the remainder each time, until b=0. At that point, in the function:

if b==0 then return a end

This conditional statement takes effect and the result is returned.

With this mathematical principle in mind, look at the code above again: it seems perfectly fine.


But what if the input b is the string “0”?

Bilibili’s technical analysis mentions that the code involved in this incident was written in Lua, a language with the following characteristics:

It is dynamically typed: by common practice, variables are not declared with a type; values are simply assigned to them directly.

When Lua performs arithmetic on a numeric string, it tries to convert the string into a number.

In Lua, the arithmetic operation n % 0 evaluates to nan (Not a Number).
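Here is a quick demonstration of the three behaviors above (assuming Lua 5.1 / LuaJIT semantics, the dialect used by OpenResty; in Lua 5.3+ the integer form of n % 0 raises an error instead of producing nan):

-- Equality does NOT coerce: the string "0" is not equal to the number 0.
print("0" == 0)   --> false

-- Arithmetic DOES coerce numeric strings to numbers.
print("0" + 1)    --> 1

-- Taking a number modulo zero yields nan rather than raising an error.
print(7 % 0)      --> nan   (may display as -nan on some platforms)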

Let’s simulate this process:

1. When b is the string “0”, the gcd function performs no type check on it. When the conditional is evaluated, “0” is not equal to 0, so the line return _gcd(b, a % b) is taken; since a % “0” evaluates to nan, the call returns _gcd(“0”, nan).

2. _gcd(“0”, nan) then executes, and its return value in turn becomes _gcd(nan, nan).

From this point on, the b=0 condition in the conditional can never be satisfied, so the recursion never terminates: an infinite loop.

In other words, the program spins in circles chasing a result it can never reach, the CPU gets pinned at 100%, and other user requests naturally cannot be processed.
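Below is a guarded reproduction of that runaway recursion (a hypothetical sketch; the call counter exists only so the example terminates instead of pinning a CPU):

-- Guarded reproduction of the runaway recursion; the counter is not in the
-- original code and is only here so this sketch stops after a few calls.
local calls = 0

local function _gcd(a, b)
    calls = calls + 1
    if calls > 5 then
        return "still looping..."
    end
    print(a, b)
    if b == 0 then
        return a
    end
    return _gcd(b, a % b)
end

print(_gcd(10, "0"))
-- Printed (Lua 5.1 / LuaJIT):
--   10    0      <- b is the string "0"; "0" == 0 is false, so recursion continues
--   0     nan    <- 10 % "0" coerces "0" to 0, and 10 % 0 is nan
--   nan   nan    <- every further call is _gcd(nan, nan)
--   nan   nan
--   nan   nan
--   still looping...

Note that return _gcd(b, a % b) is a proper tail call in Lua, which is why the process spins forever burning CPU instead of crashing with a stack overflow.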


So the question is: how did this “0” get in?

The official statement is:

In a certain release mode, an application’s instance weight is temporarily adjusted to 0, and the weight the registry returns to the SLB (load balancer) is the string “0”. This release mode is used only in the production environment and used extremely rarely, and the problem was not triggered during SLB’s earlier grayscale rollout.

During the balance_by_lua phase, the SLB passes the service IP, Port, and Weight stored in shared memory as parameters to the lua-resty-balancer module to select an upstream server. When a node’s weight is “0”, the input parameter b received by the _gcd function in the balancer module may be “0”.
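To see how a single string weight can poison the whole computation, consider the kind of weight-reduction loop a weighted round-robin balancer performs (a hypothetical sketch of the mechanism, not the actual lua-resty-balancer code):

-- Hypothetical sketch: folding node weights into an accumulated gcd,
-- the way a weighted round-robin balancer typically reduces weights.
local function _gcd(a, b)
    if b == 0 then return a end
    return _gcd(b, a % b)
end

-- A node in the special release mode reports its weight as the *string* "0".
local bad_weight = "0"

local g = _gcd(bad_weight, 0)  -- b == 0, so the branch returns a: the string "0" escapes
print(g, type(g))              --> 0    string

-- Folding in the next node's weight then means calling _gcd(10, g), i.e. _gcd(10, "0"):
--   _gcd(10, "0") -> _gcd("0", 10 % "0") -> _gcd("0", nan) -> _gcd(nan, nan) -> ...
-- which is exactly the non-terminating recursion traced above.

In other words, the string “0” first slips out through the b == 0 branch unchanged, and only blows up once it comes back around as the second argument.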

How the bug was located

With the benefit of hindsight, the root cause of Station B’s total collapse looks almost anticlimactically simple.

But from the perspective of the programmers involved, things are really not that simple.

At 22:52 that night, when most programmers had only just gotten off work or had not yet gotten off work (doge), Station B’s ops team received an alarm that the service was unavailable and immediately suspected a problem with infrastructure such as the machine room, the network, the layer-4 LB, or the layer-7 SLB.

They immediately pulled the relevant engineers into an emergency voice call and began working the problem.

Five minutes later, ops found that the CPU usage of the layer-7 SLB in the primary machine room, which carries all of the online business, had reached 100% and it could not process user requests. After ruling out the other facilities, the fault was pinned to this layer.

(A layer-7 SLB performs load balancing based on application-layer information such as URLs; load balancing uses algorithms to distribute client requests across a server cluster, thereby reducing the pressure on individual servers.)

Amid all the firefighting, a small side episode also emerged: programmers working remotely from home could get on the VPN but could not reach the intranet, so they had to call the person in charge of intranet access and be let in through a “green channel” before everyone was finally online (one of the intranet domain names was proxied by the faulty SLB).


By this point 25 minutes had passed, and the scramble to fix things began.

First, ops hot-restarted the SLB, with no recovery; then they tried rejecting user traffic and cold-restarting the SLB, but the CPU stayed at 100% and the service still did not recover.

Next, ops found that a large number of SLB requests in the multi-active machine room were timing out, even though its CPU was not overloaded. Just as they were about to restart that SLB, the internal chat group reported that the main site had recovered, and features such as video playback, recommendations, comments, and the Dynamics feed were basically back to normal.

By then it was 23:23, 31 minutes after the incident began.

It is worth mentioning that the recovery of these features was actually thanks to the “high-availability disaster-recovery architecture” that netizens were busy complaining about while the incident was unfolding.


As for why this line of defense did not kick in at first, you and I may well deserve part of the “credit”.

To put it simply: once everyone found they could not open Station B, they started refreshing frantically. CDN back-to-origin retries plus user retries drove Station B’s traffic up more than 4x and pushed the number of connections up 100-fold, into the tens of millions, overloading the multi-active SLB entirely.


However, not all services had a multi-active architecture, so the matter was far from fully resolved.

Over the next half hour the team tried a great deal, including rolling back the Lua code released over roughly the previous two weeks, but the remaining services still did not recover.

By midnight there was no other option: “never mind how the bug got in, restore the service first.”

Simple and crude: ops spent the next hour rebuilding a brand-new SLB cluster from scratch.

At 1 am, the new cluster was finally built:

On one side, people switched core business traffic such as live streaming, e-commerce, comics, and payments over to the new cluster one by one, restoring all services (completed at 1:50 a.m., temporarily ending the nearly 3-hour outage);

On the other side, others continued analyzing the cause of the bug.

Once they produced detailed flame-graph data with an analysis tool, the troublesome “0” finally left a clue:

The CPU hotspot was clearly concentrated in a call into the lua-resty-balancer module, and that module’s _gcd function was returning an unexpected value on certain executions: NaN.

At the same time, they also found the triggering condition: a certain container IP had weight = 0.

They suspected that this function had triggered a bug in the JIT compiler, and that the resulting computation error fell into an infinite loop, driving the SLB’s CPU to 100%.

So JIT compilation was turned off globally, temporarily containing the risk. By the time everything settled down it was almost 4 a.m., and everyone could finally get some sleep for the time being.
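For reference, LuaJIT’s built-in jit module is what makes a global switch-off like this possible (a sketch; the article does not detail exactly how Bilibili applied the change):

-- Disable the JIT compiler process-wide; in an OpenResty deployment this would
-- typically be run early, e.g. from an init_by_lua_block.
if jit then
    jit.off()
end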

Nobody was idle the next day either. After reproducing the bug in an offline environment, they found it was not a JIT-compiler problem at all, but a special release mode of the service in which a container instance’s weight would be 0, and that 0 came in string form.

As described earlier, in the dynamically typed Lua the string “0” was coerced into a number during the arithmetic operation, the code went down a branch it should never have taken, and the resulting infinite loop caused Station B’s unprecedented crash.

Is recursion to blame, or the weakly typed language?

Many netizens still have vivid memories of the incident. Some recalled thinking their phone was broken, only to find that switching to a computer did not help either; others remembered the incident shooting onto the trending-search list within 5 minutes.

Everyone was surprised that such a simple infinite loop could bring down such a large website.

However, some pointed out that infinite loops themselves are not uncommon; what is rare is for the problem to occur at the SLB layer, in the request-distribution path, where it cannot simply be fixed with a quick restart the way a backend problem often can.


To avoid this kind of situation, some argue that recursion should be used with caution, and that when it is used it is best to add a counter and return directly once it exceeds a value the business could never realistically reach, as in the sketch below.
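A minimal sketch of that suggestion (hypothetical; the type normalization and depth cap are illustrative choices, not the fix Bilibili actually shipped):

-- Defensive variant: normalize the inputs and cap the recursion depth.
local MAX_DEPTH = 100  -- far deeper than any legitimate weight reduction needs

local function safe_gcd(a, b, depth)
    depth = depth or 0
    a, b = tonumber(a), tonumber(b)             -- the string "0" becomes the number 0
    if not a or not b or a ~= a or b ~= b then  -- x ~= x catches nan
        return nil, "invalid input"
    end
    if depth > MAX_DEPTH then
        return nil, "recursion limit exceeded"
    end
    if b == 0 then
        return a
    end
    return safe_gcd(b, a % b, depth + 1)
end

print(safe_gcd(24, 18))   --> 6
print(safe_gcd(10, "0"))  --> 10  (the string "0" is normalized before recursing)

Either guard on its own, the tonumber normalization or the depth cap, would have been enough to break this particular failure mode.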

Others think recursion is not to blame, and that the real culprit is the weakly typed language.

This is also where the joking nickname “the scheming ‘0’” comes from.


In addition, because the outage dragged on so long and affected so much, Station B compensated all users with one day of premium membership.

Someone did the math and reckoned that these 7 lines of code cost Station B’s boss roughly 15.75 million yuan. (doge)


What’s your take on this bug?

Reference link:

[1] “2021.07.13 We collapsed like this” by Bilibili Technology

https://mp.weixin.qq.com/s/nGtC5lBX_Iaj57HIdXq3Qg
