Thanks to IT Home netizen Sancu for the tip!

A single character, “0”, managed to bring all of station B crashing down.

I wonder if you still remember that night when station B went down and the whole internet threw an all-night party over “a power outage in the building”, “exploding servers” and “a programmer deleting the database and running away”. (manual dog head)

A year on, Ah B has finally revealed the “real culprit” behind it all.

Unexpectedly, it turned out to be an utterly simple snippet of code, one that took station B down for two or three hours and kept its programmers up all night tearing their hair out.

You may ask: isn’t this just an ordinary function for finding the greatest common divisor? How could it be that destructive? Untangle the whole chain of events and it really boils down to one sentence: this “0” caused nothing but trouble.

For the details, let’s go through the “accident report” together.

The “bloody case” set off by the string “0”

Let’s start with the root cause of the tragedy: the gcd function posted at the beginning. Anyone who has picked up a little programming will recognize it as a recursive function that computes the greatest common divisor by the Euclidean algorithm (repeated division with remainder).

Unlike the way we would compute a greatest common divisor by hand, the algorithm goes like this:

For a simple example, take a=24 and b=18 and find their greatest common divisor;

a divided by b leaves a remainder of 6, so let a=18, b=6 and continue;

18 divided by 6 leaves a remainder of 0, so 6 is the greatest common divisor of 24 and 18.

In other words, a and b are repeatedly divided to take the remainder until b=0, at which point this line in the function:

if b==0 then return a end

This judgment statement takes effect and the result is returned. With that mathematical principle in mind, the code looks perfectly fine at first glance:
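(The screenshot of the function from the original article isn’t reproduced here, so below is a minimal Lua sketch consistent with the fragments the postmortem quotes, namely the base case above and the recursive call “return _gcd(b, a % b)”. It illustrates the logic rather than reproducing station B’s exact source.)

```lua
-- A minimal sketch of the gcd function in question, reconstructed from the
-- quoted fragments; not station B's verbatim code.
local _gcd
_gcd = function(a, b)
    if b == 0 then
        return a
    end
    return _gcd(b, a % b)
end

print(_gcd(24, 18))  --> 6, matching the worked example above
```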

But what if the input b is the string “0”?

As station B’s technical post-mortem explains, the code in question is written in Lua, and Lua has the following characteristics (a quick demo follows the list):

  • It is a dynamically typed language. In common practice, variables are not declared with types; you simply assign values to them.

  • When Lua performs an arithmetic operation on a numeric string, it tries to convert that string into a number.

  • In Lua, the result of the arithmetic operation n % 0 is nan (Not a Number).
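A quick standalone demonstration of those three behaviors (assuming Lua 5.1 / LuaJIT, the dialect the OpenResty environment mentioned below runs on; newer Lua versions with integer arithmetic behave differently):

```lua
-- Dynamic typing: no type declarations, just assignment.
local w = "0"                -- a weight that happens to arrive as a string

-- Coercion happens in arithmetic ...
print(10 % w)                --> nan: "0" is coerced to the number 0, and 10 % 0 is nan

-- ... but not in comparisons.
print(w == 0)                --> false: the string "0" is not equal to the number 0

-- And nan is equal to nothing, not even itself.
local nan = 0 / 0
print(nan == 0, nan == nan)  --> false   false
```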

Let’s simulate this process:

1. When b is the string “0”, the gcd function does no type checking on it. At the judgment statement, the string “0” is not equal to the number 0, so the line “return _gcd(b, a % b)” runs instead; a % “0” evaluates to nan, so the call becomes _gcd(“0”, nan).

2. _gcd(“0”, nan) executes in turn, and its return value becomes _gcd(nan, nan).

And that is the end of it: the condition b == 0 in the judgment statement can never be reached again, so an infinite loop appears. The program spins in circles like mad, burning 100% of the CPU on a result it can never obtain, and other users’ requests naturally never get processed.
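Putting the pieces together, the spiral looks like this (a hand-run trace under the same Lua 5.1 / LuaJIT assumption, using the _gcd sketch above; the 12 is just an arbitrary stand-in for another node’s weight):

```lua
-- _gcd(12, "0")  -> "0" ~= 0, so it returns _gcd("0", 12 % "0")  = _gcd("0", nan)
-- _gcd("0", nan) -> nan ~= 0, so it returns _gcd(nan, "0" % nan) = _gcd(nan, nan)
-- _gcd(nan, nan) -> nan ~= 0, and so on forever: b == 0 can never fire again.
print(12 % "0")      --> nan: the first step that poisons the recursion
print((0 / 0) == 0)  --> false: once nan is in play, the base case is unreachable
```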

So the question becomes: how did this “0” get in there in the first place?

The official statement is:

In a certain release mode, an application instance’s weight is temporarily adjusted to 0, and the weight the service registry returns to the SLB (load balancer) is the string “0”. This release mode is used only in the production environment and only very rarely, so the problem was not triggered during the SLB’s earlier grayscale rollout.

In the balance_by_lua phase, the SLB passes the service IP, Port and Weight stored in shared memory to the lua-resty-balancer module as parameters for selecting the upstream server. When a node’s weight is “0”, the input parameter b received by the _gcd function inside the balancer module may be “0”.
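To make that path concrete, here is an illustrative-only sketch (every name below is hypothetical, not station B’s actual SLB code): the weight read out of shared memory is whatever the registry sent, and if nothing normalizes it, the string “0” travels straight on to the balancer module, whose _gcd eventually receives it as b.

```lua
-- Hypothetical node list after the registry update: one instance is in that
-- special release mode, so its weight arrives as the string "0".
local upstream_nodes = {
  { ip = "10.0.0.1", port = 8080, weight = 10  },
  { ip = "10.0.0.2", port = 8080, weight = "0" },  -- instance being released
}

-- Building the weight table that gets handed to the balancer. With no
-- tonumber()/type check here, the string "0" passes through untouched and is
-- what the balancer's _gcd later sees as its argument b.
local weights = {}
for _, node in ipairs(upstream_nodes) do
  weights[node.ip .. ":" .. node.port] = node.weight
end
```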

How the bug was tracked down

Playing “Zhuge Liang after the fact”, that is, with the benefit of hindsight, the root cause of station B’s total collapse looks almost obvious. But from the perspective of the programmers on the scene that night, things were nowhere near that simple.

At 22:52 that evening, when most programmers had just gotten off work or hadn’t yet (doge), station B’s ops team received an alert that the service was unavailable. Their first suspicion fell on infrastructure problems: the machine room, the network, the layer-4 LB, the layer-7 SLB and so on.

They immediately pulled the relevant engineers into an emergency voice call to start handling it. Five minutes later, ops found that the CPU usage of the layer-7 SLB in the primary machine room, which carries all online services, had hit 100% and user requests could not be processed; after ruling out the other facilities, the failure was pinned to this layer.

(A layer-7 SLB performs load balancing based on application-layer information such as URLs. Load balancing uses algorithms to distribute client requests across a server cluster, thereby relieving the pressure on individual servers.)

Amid all the firefighting there was a small side plot: the programmers working remotely from home could not get onto the intranet and had to call the person in charge of intranet access to open a “green channel” before everyone could get online (because one of the intranet domain names was proxied by the faulty SLB).

By this point 25 minutes had passed, and the scramble to fix it began.

First, ops tried a hot restart of the SLB, with no recovery; then they cut off user traffic and did a cold restart, but the CPU stayed at 100% and the service still did not come back.

Next, ops found that a large number of SLB requests in the multi-active machine room were timing out, even though its CPU was not overloaded. Just as they were about to restart that SLB, the internal chat group reported that the main site had recovered, with video playback, recommendations, comments, dynamics and other functions basically back to normal.

It was 23:23, 31 minutes after the incident began.

It is worth mentioning that the recovery of these functions was actually thanks to the “high-availability disaster-recovery architecture” that netizens were busy complaining about at the time of the incident.

As for why this line of defense didn’t work at first, you and I may well have played a part in that.

Simply put, when everyone found they couldn’t open station B, they started refreshing frantically. CDN back-to-origin retries plus user retries pushed station B’s traffic up more than 4x, the number of connections suddenly jumped 100x to the tens of millions, and the SLB in the multi-active machine room was overloaded as well.

However, not all services have a multi-active architecture, so the matter was far from over. Over the next half hour the team tried a lot of things, including rolling back the Lua code released over the previous two weeks or so, but the remaining services still did not recover.

By midnight there was nothing else for it: “Never mind where the bug came from, restore the service first.” Simple and crude: ops spent an hour building a brand-new SLB cluster from scratch.

At 1:00 a.m. the new cluster was finally up. On one front, people switched core business traffic such as live streaming, e-commerce, comics and payments over to the new cluster and restored all services (everything was done by 1:50 a.m., temporarily bringing the nearly three-hour outage to an end);

On the other front, they kept digging into the cause of the bug. After capturing detailed flame-graph data with their analysis tools, the troublesome “0” finally left a clue: the CPU hotspot was clearly concentrated in calls into the lua-resty-balancer module, and the module’s _gcd function was returning an unexpected value on certain executions: nan.

At the same time, they found the triggering condition: a container IP with weight = 0. They suspected the function was tripping a bug in the JIT compiler, producing a bad computation that fell into an infinite loop and drove the SLB CPU to 100%, so they turned JIT compilation off globally to sidestep the risk for the time being. By the time everything was settled it was almost 4 a.m., and everyone finally got some sleep.

The next day nobody sat idle. After repeatedly reproducing the bug in an offline environment, they found it was not a JIT compiler problem at all: a special release mode of the service causes the container instance’s weight to be 0, and that 0 arrives in the form of a string.

As described earlier, in the dynamically typed Lua this string “0” gets coerced to a number during arithmetic, the code takes a branch it should never take, an infinite loop results, and station B ends up with its unprecedented mega-crash.

Is this recursion’s fault, or the weakly typed language’s?

Many netizens still remember the incident vividly. Some recalled assuming their phone was broken and switching to a computer, only to find that didn’t help either; others still remembered the outage hitting the trending searches within 5 minutes.

Everyone was surprised that such a simple infinite loop could bring down such a large website. But as some pointed out, infinite loops themselves are not rare; what is rare is one blowing up at the SLB layer, right in the middle of request distribution, where it is not the kind of backend problem a quick restart can fix.

To avoid this kind of situation, some argued that recursion should be used with caution, and that if it is used, it is better to add a counter and bail out once it passes a value the business could never legitimately reach.

Others felt recursion was not to blame, and that the real culprit was the weakly typed language. This is also where the joking nickname “the scheming ‘0’” came from.
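For what it’s worth, the two suggestions are easy to combine. The sketch below is a reader-level illustration, not station B’s actual fix: normalize the inputs with tonumber() and cap the recursion depth so that even a bad input cannot spin forever.

```lua
-- A defensive gcd variant (illustrative only): validate inputs, cap the depth.
local function safe_gcd(a, b, depth)
  a, b = tonumber(a), tonumber(b)        -- the string "0" becomes the number 0
  if not a or not b then
    return nil, "non-numeric input"
  end
  depth = depth or 0
  if depth > 64 then                     -- far deeper than any real weight needs
    return nil, "recursion cap exceeded"
  end
  if b == 0 then
    return a
  end
  return safe_gcd(b, a % b, depth + 1)
end

print(safe_gcd(12, "0"))    --> 12: the string weight is coerced up front
print(safe_gcd(12, "abc"))  --> nil  non-numeric input
```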

In addition, because the outage dragged on so long and affected so much, station B handed every user a one-day premium membership at the time.

Someone did the math and reckoned that these 7 lines of code cost station B’s boss roughly 15,750,000 yuan. (manual dog head)

So, what would you like to say about this bug?

Reference link:

[1] “2021.07.13: This is how we crashed”, Bilibili Technology

https://mp.weixin.qq.com/s/nGtC5lBX_Iaj57HIdXq3Qg
