I wonder whether our dear readers still remember that on July 13 last year, something big happened at Station B (Bilibili): it went down without warning. (If you’ve forgotten, you can read this article.)
As for why it went down, nobody had a clue at the time. That didn’t stop the idle speculation, which came in every flavor: a power outage, a fire, a programmer running rm -rf /* and fleeing... People said whatever came to mind.
Later, after Station B’s engineers stayed up until 2:00 in the morning and gradually fixed the server problem, the matter more or less came to an end.
I thought this outage would simply become one more joke in our online lives, just like the countless crashes on Weibo, leaving us only a “big member” (premium membership) to remember it by.
Unexpectedly, on July 13 this year, Station B published an article specifically to tell us what happened that night.
We read the article too, and good grief: the whole of Station B went down because of a single badly written line of code??? So today Shichao will walk everyone through the incident from Station B’s perspective. Don’t worry, there will be no obscure jargon and no baffling slang, so even beginners can follow along.
Incident recap
The incident began at 22:52 on July 13, 2021.
The site reliability engineers (SREs) and Station B’s customer service both received a flood of alerts that the website could not be opened.
The colleagues responsible for handling such incidents had already left work, so they immediately tried to log in to the company intranet from home through the VPN.
It turned out that the VPN was down too... They couldn’t get into the system at all. In the end, they only got in by going through the company’s “green channel”. You don’t suppose this green channel was Sunflower (a remote desktop program), do you?
Once the green channel was open and the teams responsible for the various services were in place, Station B began to analyze and locate the problem. The faulty component was obvious: the CPU of the Layer 7 SLB (the load balancer that distributes requests from many users across many services) in the main data center for online business was running at 100%.
Put simply, the CPU was being hogged by some culprit of unknown origin, leaving it unable to handle real traffic.
Station B’s first move was the same thing we all do when a phone or computer freezes.
Just restart it. Restarting solves 90% of problems, right?
Unfortunately, this time Station B fell into the other 10%.
So much for restoring service: after the main data center restarted, the CPU still ran at 100%. The other data centers, however, were improving. They were sluggish, but their CPUs were not maxed out.
Services that had been deployed multi-active (served from multiple sites at the same time) were slowly recovering. So rebooting did not completely solve the problem. But since this problem had never appeared before, could newly added code be to blame?
As time ticked by, analysis tools helped narrow the problem down to recently launched functions written in Lua (a programming language, like Python or Java).
Station B then began a tense round of rollbacks.
This turned up a few suspicious spots, but the servers kept hanging, and “recovery” was still some way off.
There was no choice: the business had to come back up first. So the team split in two. One group kept troubleshooting to find the cause, while the other started building a new SLB service from scratch.
After a nerve-racking hour, the new SLB was configured successfully, and the traffic originally routed to the main site slowly began to migrate over.
Fortunately, this time it worked.
At two in the morning, three hours after the crash, Station B’s business finally recovered.
The culprit
That is the story of what happened at Station B that night. The surface problem was solved and the business was restored.
But what was the root cause? If it is not found, the same mine will go off a second time sooner or later.
The colleagues in charge of troubleshooting did not disappoint. Once the time pressure had eased, they found the truth. No aliens, no fire, no power outage: very different from what netizens had imagined. The root cause of Station B’s collapse was simply that a function for computing the greatest common divisor had not been written carefully enough...
Let’s take a look at this “root of all evil” first.
It is a typical recursive function, one that calls itself. It takes two numbers, a and b, and computes the remainder of a divided by b. The function terminates when b equals 0; otherwise it calls itself and runs again.
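A minimal sketch of the logic just described, written in Python rather than Lua purely for illustration. The name gcd and the sample inputs are assumptions, not Station B’s actual code.

```python
def gcd(a, b):
    """Recursive greatest common divisor, following the description above."""
    if b == 0:            # termination condition: stop when b equals 0
        return a
    return gcd(b, a % b)  # otherwise call itself again with the remainder


print(gcd(12, 18))  # 6
```

With well-behaved numeric input, b shrinks on every call and eventually reaches 0, so the recursion always ends.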
It looks perfectly fine: the termination condition of the recursion (b = 0) is clear, and there is no complicated logic. But since things got this far... there must be a big problem somewhere. Readers who know a bit of programming may have already spotted it:
Which 0 is the 0 you passed in? That’s right: in programming languages, the number 0 and the string ‘0’ are not the same thing. To keep the computer from mixing things up, statically typed languages like C and Java require us to declare a variable’s type when we create it:
is it an integer, a decimal, or a character? Lua, however, is a very “clever” language and has no such requirement. It does the dirty work for us, assigning variable types automatically according to what the program needs.
C example:
    int a = 0;     // define an integer a and assign it 0
    char b = '0';  // define a character b and assign it '0'
Lua example:
    -- a is the number 0, b is the string '0'
    a = 0
    b = '0'
So is the value passed in as parameter b the number 0 or the string ‘0’? If upstream data validation is not done properly, the string ‘0’ can reach this function when it is executed.
And then the mine goes off. The string ‘0’ is not equal to the number 0, so the function’s termination condition is never met.
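The failed comparison can be seen in a few lines. This is a Python sketch; Python’s == also treats the number 0 and the string "0" as unequal, which is the behavior the article describes for Lua. The helper name reaches_base_case is hypothetical.

```python
def reaches_base_case(b):
    """Mirror the termination check described above: stop only when b equals 0."""
    return b == 0


print(reaches_base_case(0))         # True: the number 0 ends the recursion
print(reaches_base_case("0"))       # False: the string "0" does not
print(reaches_base_case(int("0")))  # True: converting first restores the check
```

The last line hints at the kind of input validation that would have kept the string out of the function in the first place.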
So the program takes the recursive branch and calls itself again. In the calculation that follows, Lua’s “cleverness” suddenly kicks in. Lua scratches its head: why would anyone do arithmetic with the string ‘0’? They must want this parameter treated as a number.
So an implicit type conversion takes place, and the string ‘0’ becomes the number 0.
And then, as we all learned in elementary school math... a division by 0 happens. If good old C did this with integers, it would probably just throw a Floating point exception and abort. But Lua is different. As a new-age “smart” language, it gracefully returns nan (Not a Number).
And the program keeps running. Worse still, nan is not equal to 0 either, so the termination condition still cannot be met. After a few rounds of this, the function _gcd(a, b), meant to compute the greatest common divisor of a and b, turns into an unstoppable _gcd(nan, nan).
Once on that road it can never stop, and it eats up the CPU outright.
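The chain of events can be replayed with a small Python model. This is a sketch under stated assumptions, not Lua and not Station B’s code: lua_mod is a hypothetical helper that imitates the behavior described above (numeric strings are coerced, and a remainder by zero yields nan instead of an error), and trace_gcd caps the number of calls with a hypothetical max_steps so the demonstration ends.

```python
import math


def lua_mod(a, b):
    """Hypothetical model of the remainder behavior the article describes."""
    a, b = float(a), float(b)  # numeric strings such as "0" are coerced
    if b == 0 or math.isnan(a) or math.isnan(b):
        return math.nan        # no error is raised; nan comes back instead
    return a % b


def trace_gcd(a, b, max_steps=5):
    """Follow the recursion for at most max_steps calls.

    Returns (argument pairs seen, result); result is None if the
    termination condition was never reached within max_steps.
    """
    calls = []
    for _ in range(max_steps):
        calls.append((a, b))
        if b == 0:  # same check as before; "0" == 0 and nan == 0 are both False
            return calls, a
        a, b = b, lua_mod(a, b)
    return calls, None


calls, result = trace_gcd(6, "0")
print(calls)   # starts at (6, '0'), then ('0', nan), then (nan, nan) repeated
print(result)  # None: the base case was never reached
```

Without the step cap, the loop would spin on (nan, nan) indefinitely, which matches the runaway CPU usage described above.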
Being too smart is not always a good thing...
And so the maxed-out CPU dragged down the other services in one go. Remember how the Station B engineers mentioned earlier could not get in through the VPN from home to rescue things? Exactly: logging in to the intranet itself relies on services that also have to be processed through the intranet...
It’s like snapping the key off inside the lock, so naturally everything collapsed.
After the collapse
Finally, if you are interested in the technical details, Shichao recommends reading the “2021.07.13” review that Station B published. Besides the cause and course of the incident, it gives a more professional and comprehensive summary of the technical improvements and lessons that followed.
To be honest, a chance like this is quite rare. Plenty of apps crash every year, but very few companies are willing to publish the details for peers to learn from and for the general public to enjoy.
By sharing this and facing its “scars”, Station B also lets us see the most real side of Internet operations, the kind of experience no textbook will teach. Oh, and on the very night that article was published, Station B quietly went down again...
Perhaps because the team had learned from last year’s experience, this time Station B fixed the problem before most people even noticed.
The post Station B revealed that last year’s server crash was because of this? -Server, Station B, Crash – Fast Technology (Media of Drive Home) – Technology changes the future appeared first on Gamingsym.