Paul Buchheit, the creator of Gmail, said that the best products are the ones people, once they have used them, cannot imagine living without. Gmail, with nearly 2 billion users worldwide, has always lived up to that sentence. If you look for an equivalent in China, WeChat is the most fitting example.
Someone living in China without a WeChat account is now unusual enough to be a story, but that ubiquity is also the burden of a national product. On the morning of June 16, WeChat Pay briefly malfunctioned and immediately became a trending search topic. Any mishap causes collective discomfort, and that forces caution: WeChat cannot afford to be a product that changes its functions lightly.
Yet WeChat still has to change on its own initiative to keep up with the times, and for its development team this is a road with very little room for trial and error. People cannot go back to a time before WeChat, and WeChat had better not remind them what that was like.
Something like that happened in 2013, when a construction crew in Shanghai cut off what were then “only” 300 million users from sending and receiving messages for nearly five hours. The same bottom line was tested again on the eve of the Spring Festival in 2020. If the 2013 incident was an accident that WeChat suffered passively, the test two years ago was one it had to take on.
At that time, WeChat was leaving physical servers behind and was midway through moving everything to virtual machines and the cloud. In a “Spring Festival Guarantee” stress test in mid-January, the WeChat team pushed the newly expanded virtual servers hard, and the servers hit their limit when the number of simultaneous accesses was only half of what was expected. New Year’s Eve that year fell on January 24. If the problem was not resolved within two weeks, WeChat as a whole might go down on a large scale again before the New Year’s bell rang.
In the end, the undercurrent never surfaced. When people think back to WeChat on that day, some occasionally remember that it was the first time exclusive red envelope covers were launched. Everything was fine.
After Tencent’s “930” reorganization (announced on September 30, 2018), open source collaboration and moving self-developed businesses onto the cloud became Tencent’s new strategic direction, and also WeChat’s opportunity to migrate to the cloud. WeChat is Tencent’s most cautious business, as its place in Tencent’s cloud migration sequence shows: last. WeChat replaced its physical machines with virtual machines over two years, then gradually moved away from its internally developed cloud platform and turned to the more open K8S. For WeChat, which has become the backdrop of daily life, this was a huge change that could not be publicized. Only now that the migration of WeChat’s infrastructure to the cloud is nearly complete has the complicated road behind it come into view.
Physical machine, Yard, and that old WeChat
In hindsight, 2013 quietly drew a line in WeChat’s history.
In mid-January of that year, the WeChat team announced on Weibo that WeChat had finally passed 300 million users, making it the communication app with the most downloads and users in the world at that time. This was still a few days short of the second anniversary of WeChat’s launch. In less than two years, People Nearby and Shake had brought WeChat its first wave of users on the early heat of the mobile Internet, and in 2012 Moments and video chat appeared.
By 2013, apart from the orange red-envelope icon in the chat window, the WeChat we know today had basically taken shape.
While one product rose, another faded: Tencent Soso was sold in 2013. Launched in 2006 in the footsteps of Google and Baidu, the product ended in failure and, seven years later, was packaged up and folded into Sogou. Tencent’s search business paused for the time being, and the uncertainty turned into more investment in the star business. Wen Jie, an engineer who had led the construction of Tencent Soso’s entire architecture and stayed through its sale, joined the WeChat Technical Architecture Department that same year as a core member.
WeChat strives to look simple on the surface, but the tens of billions of messages sent and received each day and the tens of thousands of servers behind them are another story. WeChat’s server capacity has to be sized for peak load, yet CPU utilization is far from always at peak. Message traffic is highest at around 9:00 pm; a few hours later, in the early morning, CPU utilization is only 3%, a drop of as much as 15 times. The vast majority of server computing power is wasted.
WeChat needed less than four months to add its third 100 million users, and an imminent explosion in growth was foreseeable. A new resource allocation logic was about to emerge within WeChat, and Wen Jie and the entire technical architecture department would lead this transformative research and development. At the end of 2013, the self-developed cloud platform Yard began to appear in internal discussions.
Yard is an acronym of four English words, Yet Another Resource Dispatcher, which together mean “just another resource distribution system.” It can also be described as a container management system: once Yard uses container technology to finely isolate a WeChat server’s CPU, multiple functional modules can be split up and deployed on the same server.
This makes a more efficient mix of online and offline workloads possible. When online traffic surges, offline tasks can quickly give up server resources. Under Yard, CPU utilization in the WeChat cluster reached more than 40%.
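The idea of mixing online and offline work can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Yard's actual scheduler: the `Server` class, the `average_utilization` helper, the 100-core server, the traffic curve, and the offline backlog are all hypothetical. The only rule it encodes is the one described above: online traffic is served first, and offline tasks get whatever is left.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Server:
    """A toy server whose CPU cores are shared by online and offline work."""
    cores: int

    def schedule(self, online_demand: int, offline_backlog: int) -> Tuple[int, int]:
        """Return (online_cores, offline_cores) for one time slot.

        Online traffic is served first; offline tasks only receive the
        remaining cores, so a traffic spike squeezes them out automatically.
        """
        online = min(online_demand, self.cores)
        offline = min(offline_backlog, self.cores - online)
        return online, offline


def average_utilization(server: Server, online_curve: List[int],
                        offline_backlog: int) -> float:
    """Average fraction of cores in use across all time slots."""
    used = 0
    for demand in online_curve:
        online, offline = server.schedule(demand, offline_backlog)
        used += online + offline
    return used / (server.cores * len(online_curve))


# Hypothetical numbers: cores needed by online traffic in six time slots,
# from the early-morning trough to the evening peak and back.
server = Server(cores=100)
online_curve = [3, 3, 10, 20, 45, 20]

online_only = average_utilization(server, online_curve, offline_backlog=0)
mixed = average_utilization(server, online_curve, offline_backlog=40)
print(f"online only: {online_only:.1%}, mixed: {mixed:.1%}")
```

With these made-up inputs the mixed deployment keeps far more of the machine busy, and a call such as `server.schedule(90, 40)` shows offline work shrinking to the 10 cores that a spike leaves free.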
The method worked, and Yard carried WeChat through its next growth explosion. At the end of 2016, the combined monthly active users of Weixin and WeChat reached 889 million. That year, China had only 731 million Internet users.
But when WeChat completed the most important journey of user growth and began to pay more attention to the breadth of its business, Yard’s disadvantages also began to appear.
In early 2014, WeChat was still three years away from launching its first mini program, and there was not even WeChat Pay yet. The platform had not yet opened its doors to outside guests, and Yard’s developers gave little thought to compatibility with external technical tools. In fact, Yard was born with a very specific goal: flexible, virtualized scheduling of server CPU and storage to cut costs and raise efficiency. In other words, Yard was built to solve one clearly defined problem, and it grew out of needs tightly bound to WeChat’s original infrastructure.
But as more businesses poured in, Yard, which is not open source, came to look like a non-standard product.
WeChat’s business expanded rapidly within a few years and reached into more fields. Each team has its own preferences for the tools it relies on, and the resulting customization requests brought a lot of unnecessary work. Big data teams mostly lean toward Hadoop or Spark; teams doing AI training lean toward TensorFlow or PyTorch. But each of these frameworks had to be manually adapted when first connected to Yard, and the same work had to be repeated after every framework upgrade. The more new tools were introduced, the more Yard’s limits in openness were exposed.
After the 930 reorganization, moving off physical machines became the start of the cloud migration, but it was only the first step. With the whole infrastructure moving to the cloud, WeChat was bound to move to an open source environment this time, and the Kubernetes ecosystem looked like the best fit.
Wind direction
Yard really began to land inside WeChat around 2013 and 2014, which was also the beginning of WeChat’s move toward the cloud. In those same years, the global open source trend finally began to warm up.
At that time Linux, another penguin in the northern hemisphere, was in the limelight. Satya Nadella, appointed Microsoft’s new CEO in 2014, held up the slogan “Microsoft loves Linux” soon after taking office. In the same year GitHub, six years after its launch, was hosting more than 10 million repositories and was gradually becoming the living room for developers at technology giants such as Microsoft and Google.
In early 2013, a draft of the White House’s “Open Data Policy” was posted on GitHub. Never before had a government policy document been hosted on a private company’s servers. Although a policy document is not a code project to be reworked or built upon, it carried great symbolic weight. GitHub and the open source ideas behind it came to the fore, along with its co-founder Chris Wanstrath.
Previously Microsoft, and indeed the mainstream of the technology industry, stood on the opposite side from open source, just as Windows and Linux had long been at odds over security. But this is also the charm of technology: in an era when every scenario tends toward virtualization, the advantages of open source became plain, and once a consensus is reached the shift happens almost instantly.
From giants to indie developers, the idea of open source is clearly heating up. Collaborating on code, and even making the very act of writing code community-based, is becoming the new way of project management in the information world.
Also in 2013, the first version of the Docker project was published on GitHub under the Apache 2.0 license. Docker opened a new chapter for containers as a virtualization technology. Before that, as hardware grew more powerful, surplus hardware capacity became an increasingly prominent problem, and hardware virtualization was the first solution. Traditional virtual machine technology virtualizes a set of hardware, runs a complete operating system (Guest OS) on it, and then runs the required application processes on that system. But the Guest OS itself takes up a lot of memory and has to be installed again on every virtual machine, which is very heavy. In contrast, application processes packaged in a container run directly on the host kernel; the container has no kernel of its own, and hardware virtualization is unnecessary. This kind of packaging and isolation is lighter and scales more elastically.
With containers, hardware virtualization, that is, a virtual machine plus a memory-hungry Guest OS, is no longer a prerequisite for efficient resource allocation. But containers are only a technical means; in the end the technology has to solve problems on the application side. A huge cluster of container infrastructure therefore needs a higher-level scheduling tool on top of it.
At DockerCon Europe in October 2017, Docker CTO Solomon Hykes announced that, in addition to its own scheduling engine Swarm, the next version of Docker would for the first time support an external scheduling platform: Google’s Kubernetes.
Kubernetes, also written K8S (for the eight letters between the “K” and the “s”), is an open source system for the automated deployment, elastic scaling, and management of containerized applications. Its main function is container orchestration for production environments. In June 2014, Google cloud computing expert Eric Brewer unveiled the new open source tool at an event in San Francisco. After iterating to v1.0 on July 22, 2015, K8S was officially released.
Docker, which brought containers into the mainstream, took the initiative to embrace K8S three years later. The move shook the industry no less than the phrase “Microsoft loves Linux.” It meant that in the market for container scheduling tools, K8S had won its battle with Swarm and Mesos and become the industry standard.
To some extent, WeChat’s Yard resembles Windows: both were closed-source works that put technology first but looked entirely inward. But times had changed. Once WeChat grew into a platform and the businesses connected to it became ever more complex, a shift from closed source to open source was inevitable. Coincidentally, Microsoft acquired GitHub for $7.5 billion in 2018, and WeChat decided to start switching from Yard to K8S that same year.
The process did not happen overnight. Migrating to K8S requires a supporting hardware environment, which the Tencent team responsible for the cloud environment had been building since 2018. At the same time, with the 930 reorganization as the dividing line, Tencent began to change how it provisioned servers, from physical machines to CVM virtual machines.
As mentioned earlier, virtual machines have no performance advantage over physical machines. The value of leaving physical machines behind lies in lower costs: there is no depreciation and no need to buy physical servers or set up dedicated machine rooms, which saves hundreds of millions of dollars. This step was completed in 2020. From then on, a Yard running entirely on the cloud began to migrate to K8S.
Turn to K8S
When Yard began to take shape in 2014, K8S had not yet appeared. At design time, WeChat positioned Yard purely to meet its own needs, with no call for generalization or further cloudification. Because the move was between two seemingly unrelated systems, each with a great deal of complex functionality, compatibility became the most important issue in the migration.
One of the most typical conflicts: when two functional modules are deployed on one server under the K8S architecture, they must be completely isolated from each other. This is a basic security assumption of K8S and of current cloud platforms in general. Yard’s early design did not particularly emphasize it. Yard’s core deployment logic served WeChat alone, and two functional modules on one machine could communicate with each other through shared memory and other means.
In mid-2020, during the migration of an internal performance tool, the entire platform suffered one large-scale outage.
“At that time, there were 20 or 30 services running on it, and suddenly all of them went abnormal. My phone and WeCom (enterprise WeChat) were bombarded, and everyone was looking for me.” There were only a few minutes in which to respond. For Lucienduan, an engineer at the WeChat Payment Platform Architecture Center, this mine, stepped on in advance, was a rare “dark cloud” moment in the whole experience.
The accident was eventually traced to a task that had not been written to standard. One inconspicuous line of faulty code overloaded the gateway and brought it down.
In the early days of the K8S migration, the process was immature, and the entire architecture team had to work under this large potential risk.
Fortunately, this operational error was one of only a few accidents, and it did not affect WeChat’s external users. That is also the bottom line WeChat drew for the cloud migration. The 1 billion people using WeChat do not need to know what is happening behind the green chat window in their hands, but replacing the self-developed Yard with K8S has to happen while WeChat keeps running normally.
Therefore, early in the migration, the WeChat team ran smoke tests in advance. Every WeChat function built on Yard had to be run on K8S beforehand to screen out obvious problems.
Establishing compatibility was the first step in moving from Yard to K8S. Next came aligning every function across the two systems, including support for disaster recovery across three parks, a capability that came out of one of the most conspicuous lessons in WeChat’s product history.
On July 22, 2013, the main optical fiber to WeChat’s Shanghai data center was accidentally cut, paralyzing 2,000 servers at once. Until then, WeChat had deployed the three mutually redundant service instances of each core module of its messaging system in the same machine room. The weakness of this redundancy design went unnoticed during WeChat’s early rapid growth, but in that accident message sending and receiving and the Moments service were interrupted on a large scale for nearly 5 hours.
Tencent Qingyuan Data Center. Source: WeChat Team
After the accident, WeChat began to distribute its servers, and the disaster recovery model of placing machine rooms in three different buildings was born. This was also a key point in aligning K8S with Yard.
“Whether K8S could support the three parks well was the first consideration at the time.” To be cautious, the WeChat team set a clear requirement for the migration: every step had to be reversible back to Yard. “The capacity of the Yard platform must be able to withstand, at any time, the traffic brought by a rollback from the K8S platform, to ensure the business suffers no loss,” the WeChat team said.
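The rollback requirement amounts to a capacity invariant: at every step, Yard must be able to take back everything K8S is currently serving. A minimal Python sketch of that invariant follows. It assumes a single abstract unit of traffic, and the `MigrationState` class and its method names are hypothetical, not part of WeChat's tooling.

```python
from dataclasses import dataclass


@dataclass
class MigrationState:
    """Toy model of a staged migration from Yard to K8S."""
    yard_capacity: int  # traffic units Yard is able to serve
    yard_load: int      # traffic currently served by Yard
    k8s_load: int       # traffic currently served by K8S

    def rollback_safe(self) -> bool:
        """True if Yard could absorb all traffic now running on K8S."""
        return self.yard_load + self.k8s_load <= self.yard_capacity

    def shift_to_k8s(self, amount: int) -> None:
        """Move up to `amount` units of traffic from Yard to K8S."""
        amount = min(amount, self.yard_load)
        self.yard_load -= amount
        self.k8s_load += amount

    def release_yard_capacity(self, amount: int) -> bool:
        """Shrink Yard only if the rollback guarantee still holds afterwards."""
        if self.yard_load + self.k8s_load > self.yard_capacity - amount:
            return False  # refusing: a rollback would overload Yard
        self.yard_capacity -= amount
        return True

    def roll_back(self) -> None:
        """Send all K8S traffic back to Yard."""
        if not self.rollback_safe():
            raise RuntimeError("Yard cannot absorb the rollback traffic")
        self.yard_load += self.k8s_load
        self.k8s_load = 0


state = MigrationState(yard_capacity=100, yard_load=80, k8s_load=0)
state.shift_to_k8s(30)                       # Yard serves 50, K8S serves 30
too_much = state.release_yard_capacity(30)   # refused: 80 units need a home
allowed = state.release_yard_capacity(20)    # accepted: capacity drops to 80
state.roll_back()                            # Yard serves all 80 again
```

The design choice worth noting is that shifting traffic never breaks the invariant by itself; only shrinking Yard can, which is why the guard sits on `release_yard_capacity`.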
The rest is what K8S can bring to WeChat after replacing Yard.
Coder to Owner
In the DevOps era, software is developed and deployed weekly or even daily, but the separation of development from operation and maintenance had become an obvious efficiency problem within WeChat. Although Dev and Ops are written as one word, the actual work was done by two teams. After the development team finished writing and packaging the code, it handed it to the operations team to deploy and put online. As a result, operations staff were unfamiliar with the code logic, and developers did not know how releases went out. Such problems came up frequently within WeChat, and urgent issues often needed many people to resolve.
“This kind of thing has lowered the R&D efficiency of the entire team,” many people in the WeChat business team mentioned at the same time.
This is where WeChat developers saw the most obvious change after migrating to K8S. Full-stack deployment largely merges the operations role into the developer’s. Besides writing code, WeChat’s development team can now handle scaling, release, and module deployment as well, greatly shortening the path from development to launch. In the words of WeChat infrastructure engineer edselwang, “The people who write business code go from being pure Coders to being Owners of a business module.”
And because K8S has more comprehensive virtualization support, once the entire R&D system is on the cloud, node deployment is decoupled from the virtual machine, and the CI/CD (continuous integration/continuous deployment) process can be fully implemented as a pipeline-like automatic delivery process. This also brings what can be understood as a “self-healing” ability.
edselwang gave an example. If a node deployed on a virtual machine breaks, operations staff have to move it manually to another virtual machine, because virtual machines cannot migrate nodes directly. If the node is deployed on the K8S platform, the system reschedules it automatically, with no manual work.
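The automatic rescheduling in this example can be pictured as a reconciliation step: compare where instances are with which machines are healthy, and move the stranded ones. The Python sketch below is a toy model of that idea only. It does not use the Kubernetes API, and the `reconcile` function, the instance names, and the node names are hypothetical.

```python
from typing import Dict, Iterable


def reconcile(placement: Dict[str, str], healthy_nodes: Iterable[str]) -> Dict[str, str]:
    """Return a new placement in which no instance sits on an unhealthy node.

    `placement` maps an instance name to the node it runs on. Instances on
    healthy nodes stay put; each stranded instance moves to the healthy node
    that currently holds the fewest instances (ties broken by node name).
    """
    load = {node: 0 for node in healthy_nodes}
    if not load:
        raise ValueError("no healthy node available")

    # Count the instances that are already on healthy nodes.
    for node in placement.values():
        if node in load:
            load[node] += 1

    new_placement = {}
    for instance, node in placement.items():
        if node in load:
            new_placement[instance] = node
            continue
        # The node is gone: pick the least-loaded healthy node instead.
        target = min(sorted(load), key=load.get)
        load[target] += 1
        new_placement[instance] = target
    return new_placement


# Hypothetical cluster: node-b has just failed.
before = {"pay-1": "node-a", "pay-2": "node-b", "msg-1": "node-b"}
after = reconcile(before, healthy_nodes=["node-a", "node-c"])
print(after)
```

A real orchestrator runs a loop like this continuously and also weighs resource requests and placement constraints; the sketch keeps only the part that replaces the manual transfer between virtual machines.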
During the red envelope rush on New Year’s Eve, the entire WeChat operations team used to work overtime in front of the servers handling scheduling; with everything on the cloud, this will be easier.
On a larger scale, WeChat was not the first within Tencent to move to K8S. Tang Daosheng, who built up QQ, moved to the new CSIG division after the 930 reorganization. The IEG division, home to many star game studios, also began moving its architecture to the cloud a few years ago.
Tencent’s overall K8S environment was built before WeChat’s migration. This means that once WeChat has left Yard, its infrastructure R&D will be further integrated into Tencent Cloud’s native facilities, in both resource scheduling and tool adaptation, and the decision-making cost of launching new business naturally becomes lower.
Such a complex infrastructure ultimately points to a more advanced productivity tool that unlocks human value.
Stephen Liu, head of WeChat’s technical architecture, expects a fully cloud-native WeChat to eventually become “autonomous driving” in the sense of resource scheduling.
“If WeChat was Level 0 before 2014, it is now Level 1 after Yard. After 2021, I think we should be at Level 2.” Stephen Liu envisions a future in which scheduling for WeChat’s Spring Festival guarantee is handled entirely by the system, which requires a completely cloud-native WeChat.
2019 was the last year WeChat applied for physical servers. With the usual depreciation period of four to five years, barring accidents the last batch of physical servers will go out of warranty around the end of 2023, exactly 10 years after work on Yard began. Only then will WeChat truly be on the cloud in its entirety.
Everything is quiet, WeChat has become the new WeChat.