Author: Vivo Internet Container Team – Pan Liangbiao

This article is based on the speech delivered by Mr. Pan Liangbiao at the "2022 vivo Developer Conference". Reply with [2022 VDC] on the official account to obtain related materials for the Internet technology track topics.

Since 2018, vivo has built a one-stop cloud-native machine learning platform based on containers. It supports the algorithm middle platform, provides algorithm engineers with capabilities such as data management, model training, model management, and model deployment, and empowers businesses such as advertising, recommendation, and search. The platform has successfully reduced costs and improved efficiency for the algorithm business, and the value of cloud native and containers has begun to emerge. Based on the pilot results of the machine learning platform and the value analysis of the algorithm scenario, vivo upgraded its internal strategy and decided to build an industry-leading container ecosystem based on the cloud-native concept to achieve large-scale cost reduction and efficiency improvement.

This article introduces in detail vivo's specific practices in building highly available container clusters, including high-availability construction of container clusters, automated container cluster operation and maintenance, container platform architecture upgrades, container platform capability enhancements, and container ecosystem integration. At present, vivo's container product capability matrix is gradually maturing, and work will continue to focus on three directions: comprehensive containerization, embracing cloud native, and online-offline hybrid deployment.


Cloud native and containers are currently hot topics, and Kubernetes has become the de facto standard in the field of container orchestration.

In the process of implementing cloud native and containers internally, enterprises at home and abroad encounter various problems and challenges depending on their own business scenarios and development stages. This article shares vivo's exploration and implementation practice in the field of cloud-native containers, in the hope of providing readers with some reference and help.

1. Container technology and cloud native concept

First, a brief introduction to container technology and the cloud-native concept.

1.1 Introduction to Container Technology

Container technology is not new. From the birth of chroot in Unix in 1979 to the present, it has developed for more than 40 years and gone through four stages: the technology germination period, the technology explosion period, the commercial exploration period, and the commercial expansion period.

Each stage solved a different set of problems: environment isolation, software distribution and orchestration, commercial service forms, and scale and scenario expansion.

Compared with virtual machines, containers avoid the overhead of a virtualized guest operating system and therefore offer better performance. Containers also have very obvious advantages in system resource consumption, startup time, cluster scale, high-availability strategies, and other aspects.

According to the 2020 CNCF China Cloud Native Survey Report, 68% of the surveyed Chinese companies have already used container technology in the production environment.

From the perspective of industry development, both cloud vendors and major technology companies are building their own next-generation infrastructure based on container technology to promote digital innovation in enterprises. Container technology has been widely recognized and popularized.

1.2 Introduction to Cloud Native Concepts

Container technology has given birth to the cloud-native thought trend, and the cloud-native ecosystem has promoted the development of container technology. So what exactly is the definition and meaning of cloud native?

There is actually no standard definition for cloud native. If you have to give it a definition, the industry has two opinions:

  • One definition comes from Pivotal, the company that proposed the concept of cloud-native applications and a pioneer and pathfinder in this area. Pivotal's official website introduces cloud native with four key points: DevOps, continuous delivery, microservices, and containers.

  • Another definition comes from CNCF. CNCF was established in 2015. It is an open source organization whose purpose is to support the open source community to develop key cloud-native components, including Kubernetes and Prometheus monitoring.

It divides cloud native into 3 core technologies and 2 core ideas.

However, regardless of the definition, containers are the foundation and the core technical means for the implementation of cloud native.

1.3 Cloud Native Value Analysis

Any technology and concept must have actual business value. From the three dimensions of efficiency, cost, and quality, the technical value of cloud native and containers can be summarized as follows:

  • Efficiency: enables fast continuous delivery and deployment, portable image packaging, and second-level elastic scaling of compute.

  • Cost: enables on-demand allocation without waste, unified scheduling with high packing density, and hybrid deployment with less fragmentation.

  • Quality: enables observable running status, self-healing on failure, and unified cluster O&M management.

2. Exploration and practice of vivo container technology

The introduction of new technologies brings new value and inevitably introduces new problems. Next, we will introduce vivo’s exploration and practice in container technology.

2.1 Pilot Exploration

In vivo's algorithm scenarios, the machine learning platform is responsible for algorithm model iteration and is the core of the Internet algorithm business. The early platform was built on a traditional architecture, which had deficiencies in efficiency, cost, performance, and experience and could not satisfy the rapidly growing demands of the algorithm business. Based on this, we first piloted containers in the algorithm scenario. Since 2018, we have built vivo's one-stop cloud-native machine learning platform on containers. It supports the company's algorithm middle platform, provides algorithm engineers with data management, model training, model management, and model deployment capabilities, and empowers services such as advertising, recommendation, and search.

vivo’s cloud-native machine learning platform has the following five advantages:

  • Full scenario coverage: the business is supported end to end, covering recommendation, advertising, and search scenarios.

  • Good experience: queuing time is short and the user experience is excellent; P99 task queuing time is under 45 minutes.

  • Low cost: scheduling works well and resource utilization is high; average CPU utilization exceeds 45%.

  • High efficiency: the network scale is large and training runs fast, at 830 million samples per hour.

  • Excellent results: algorithm iteration is stable and the training success rate is high, exceeding 95%.

Vivo’s cloud-native machine learning platform has successfully reduced costs and improved efficiency for algorithms, making the value of cloud-native and containers shine.

2.2 Value Mining

Based on the previous pilot results of the machine learning platform, we deeply analyzed and explored the value of containers and cloud-native. Combined with the situation of vivo, we found that containers and cloud-native are the best solutions for enterprises to reduce costs and improve efficiency on a large scale.

1) In terms of cost reduction

Currently, the utilization rate of our internal server resources is low. Taking the CPU utilization rate as an example, the current average utilization rate of vivo servers is around 25%. Compared with the industry-leading level of 40% to 50%, there is still a lot of room for improvement.

The advantages of containers in resource isolation, unified scheduling, and online-offline hybrid deployment are all effective technical means to improve resource ROI.

2) In terms of efficiency improvement

We currently face business pain points in areas such as middleware version upgrades, machine migration, test environment management, burst traffic handling, and environment consistency for global deployment.

Cloud-native technologies and architectures such as fast container delivery, elastic self-service operation and maintenance, microservices, and service mesh are powerful measures for improving efficiency.

2.3 Strategic upgrade

After the pilot practice and value analysis of algorithm scenarios, we have upgraded our internal strategy and decided to build an industry-leading container ecosystem based on the cloud-native concept to achieve large-scale cost reduction and efficiency improvement.

To better support this strategy and embrace cloud native, we also re-planned and upgraded the internal technical architecture and introduced new platforms and capabilities such as a unified traffic access platform, a container operation and maintenance management platform, a unified name service, and container monitoring, to support the comprehensive construction and promotion of the container ecosystem within the company.

2.4 Challenges

2.4.1 Cluster Challenges

To provide production-grade container services at scale, the availability of container clusters is the first thing that faces many challenges. The following introduces the four major challenges encountered while building production clusters during vivo's containerization.

  • Rapid growth in cluster size: vivo's cluster fleet spans tens of thousands of host nodes and dozens of managed clusters, with single-cluster scale reaching 2,000+ and instance counts exceeding 100,000, which poses great challenges to cluster performance and machine management.

  • Cluster O&M, operations, and standardization: because cluster management was not standardized in the early days, problems such as black-screen (command-line) operations and human misoperations emerged one after another, and cluster operators were overwhelmed by firefighting every day.

  • Container monitoring architecture and observability: with the rapid growth of cluster size, the container monitoring components come under great pressure, which puts higher requirements on the collection, storage, and display of container monitoring data.

  • Online Kubernetes version upgrades: faced with the rapid iteration of Kubernetes versions, upgrading clusters in production is like changing an aircraft's engine in mid-flight.

Our solutions to these challenges are high availability, observability, standardization, and automation. The technical solutions for container monitoring and lossless Kubernetes version upgrades have already been introduced in detail on the vivo official account; this article focuses on cluster high availability and O&M automation.

2.4.2 Platform challenges

In addition to cluster stability, the platform itself faces various challenges. Because the container platform and its surrounding ecosystem capabilities were not yet mature, businesses faced high adaptation and migration costs. In summary, we encountered four main challenges:

  • Container IP changes: Kubernetes was originally designed for stateless workloads, so natively a container's IP changes with every release. This is unfriendly to traditional businesses that partially rely on fixed IPs, and the cost of transforming such businesses is relatively high.

  • Adaptation and compatibility with the surrounding ecosystem: including the publishing system, the middleware and microservice platform, internal development frameworks, the traffic access layer, and so on.

  • User habits: vivo has a relatively mature publishing platform, and users are accustomed to publishing by computer room and to operating resource allocation and release separately.

  • Value output: improvements in O&M and R&D efficiency are difficult to quantify, and the cost advantage of containers is hard to measure in the short term.

The above challenges push us to open up the ecology around the container, and at the same time enhance the product capabilities of the container platform to adapt to various business scenarios and reduce the migration cost of users.

2.5 Best Practices

2.5.1 High-availability construction of container clusters

Next, we introduce vivo's best practices in building highly available container clusters. We build the container cluster availability assurance system along three dimensions: fault prevention, fault discovery, and fault recovery.

1. In terms of fault prevention, we build along three aspects: process tools, disaster recovery, and infrastructure:

  • Process tools: mainly fault plans and fault drills, with standardized, white-screen (GUI-based), and automated operations achieved through the construction of an O&M management platform.

  • Disaster recovery capability: mainly building cross-fault-domain disaster recovery for businesses, ensuring cross-cluster scheduling and fast one-click migration of services and business traffic when a cluster fails.

  • Infrastructure: mainly shielding users from the underlying clusters; one computer room hosts multiple clusters, and one service is deployed across multiple clusters at the same time, so that a single cluster failure does not affect the business.

2. In terms of fault discovery, we mainly rely on measures such as a self-built monitoring dashboard, daily cluster inspection, core component monitoring, and external dial testing of clusters to detect and handle faults in time and reduce the impact on the business.

3. In terms of fault recovery, we mainly rely on pre-established fault plans to recover quickly and stop losses in time, and conduct fault reviews afterwards to continuously improve our fault prevention and discovery mechanisms and accumulate valuable experience.

In addition, cluster observability is an important basis for the availability guarantee. We built our own SLO dashboard to monitor the status of the cluster in real time; only by knowing the cluster's operating status well can we remain steady and respond calmly to change.
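
To make the dial-testing idea above concrete, the following is a minimal sketch of an external probe, not vivo's actual implementation: it periodically requests the Kubernetes API server's standard /readyz health endpoint and reports latency. The target address is a placeholder, and it assumes the probe host trusts the cluster CA and that unauthenticated access to the health endpoints is allowed (the Kubernetes default).

```go
// Minimal external dial-test probe sketch: periodically request the API
// server health endpoint and report latency. In a real system the result
// would also feed the SLO dashboard and alerting.
package main

import (
	"fmt"
	"net/http"
	"time"
)

func main() {
	// Hypothetical API server address; /readyz is a standard health endpoint.
	target := "https://kube-apiserver.example.internal:6443/readyz"
	client := &http.Client{Timeout: 5 * time.Second}

	for {
		start := time.Now()
		resp, err := client.Get(target)
		latency := time.Since(start)

		if err != nil {
			fmt.Printf("probe error: %v (latency %v)\n", err, latency)
		} else {
			// A non-200 status means the control plane is not fully ready.
			fmt.Printf("probe status %d, latency %v\n", resp.StatusCode, latency)
			resp.Body.Close()
		}
		time.Sleep(30 * time.Second)
	}
}
```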

2.5.2 Container Cluster Automated O&M

In addition to the stability of the container clusters themselves, for O&M automation we built a container multi-cluster management platform that standardizes cluster configuration and brings core O&M scenarios onto white-screen (GUI-based) workflows, improving O&M efficiency.

Our container cluster management platform manages cloud native in a cloud-native way; simply put, it uses the Kubernetes operator mechanism to implement "k8s on k8s".
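
To illustrate what "k8s on k8s" can look like in practice, here is a minimal sketch of an operator built with the controller-runtime library: a management cluster describes each member cluster with a custom resource, and a reconciler converges it. The group/version/kind, the status.phase field, and the reconcile details are illustrative assumptions, not vivo's actual API or code.

```go
// Sketch of a controller-runtime reconciler for a hypothetical "Cluster"
// custom resource; the CRD group/version/kind and status fields are
// illustrative assumptions.
package controllers

import (
	"context"
	"time"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// clusterGVK identifies the hypothetical Cluster custom resource.
var clusterGVK = schema.GroupVersionKind{
	Group:   "infra.example.internal",
	Version: "v1",
	Kind:    "Cluster",
}

// ClusterReconciler drives a member cluster toward its declared spec.
type ClusterReconciler struct {
	client.Client
}

func (r *ClusterReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	cluster := &unstructured.Unstructured{}
	cluster.SetGroupVersionKind(clusterGVK)
	if err := r.Get(ctx, req.NamespacedName, cluster); err != nil {
		// The resource may have been deleted; nothing to do.
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	// In a real operator this is where the declared spec (version, node
	// list, add-ons) would be compared with the member cluster's observed
	// state and converged, e.g. by driving IaaS and node provisioning.
	phase, _, _ := unstructured.NestedString(cluster.Object, "status", "phase")
	if phase != "Ready" {
		// Requeue until the member cluster reports Ready.
		return ctrl.Result{RequeueAfter: 30 * time.Second}, nil
	}
	return ctrl.Result{}, nil
}

// SetupWithManager registers the reconciler to watch Cluster objects.
func (r *ClusterReconciler) SetupWithManager(mgr ctrl.Manager) error {
	obj := &unstructured.Unstructured{}
	obj.SetGroupVersionKind(clusterGVK)
	return ctrl.NewControllerManagedBy(mgr).For(obj).Complete(r)
}
```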

At present, the platform has achieved unified management of multiple clusters; cluster deployment is automated and standardized, and it connects to the underlying IaaS layer so that cluster node management is fully process-driven and visualized. The platform also helps us discover cluster problems and hidden risks in time.

Performing daily O&M and operations through the platform not only improves efficiency but also provides audit capabilities: operation and change logs can be traced back, making it easy to locate problems.

2.5.3 Container platform architecture upgrade

In order to adapt to the rapid internal popularization and promotion of business containerization, we have upgraded vivo’s container platform architecture.

The new architecture is divided into four layers. Containers plus Kubernetes form the unified base: it connects downward to the company's IaaS-layer infrastructure, provides container products and platform capabilities upward, and exposes open APIs that upper layers can call to customize their own logic.

On top of the APIs are the various service types supported by containers, including online services, middleware services, big data computing, algorithm training, real-time computing, and so on; the top layer empowers vivo Internet's various businesses.

Based on this container platform architecture, business services gain resource-isolated deployment, fast delivery, and on-demand usage, as well as better elastic scalability. For the platform, we can schedule resources in a unified way and implement time-sharing reuse, online-offline hybrid deployment, and similar techniques to improve resource utilization.

2.5.4 Enhancement of Container Platform Capabilities

vivo's internal containerization scenarios are relatively diverse. To let businesses adopt and use containers with confidence and at low cost, during the rollout we adapted containers and enhanced their native capabilities through a combination of open source and self-developed work.

The following briefly shares six product capability enhancements:

  • Cloud-native workload enhancements: based on the open-source OpenKruise, native workloads such as Deployments and StatefulSets are enhanced with extended capabilities such as in-place upgrades, release pausing, streamed (batched) release, and configuration priorities (a configuration sketch follows after this list).

  • Lossless service publishing enhancements: based on the internal framework and self-developed platform work, lossless traffic publishing is achieved for protocols and frameworks such as HTTP and RPC.

  • Container image security: customized development based on the open-source Harbor provides container image security scanning and gating capabilities.

  • Container image acceleration: custom extensions based on the open-source Dragonfly2 improve image distribution performance in large-scale clusters by more than 80%.

  • IP fixation enhancements: self-developed support based on stateful services and CNI covers IP black/white list and stateful service scenarios, reducing the cost of business access transformation.

  • Multi-cluster management enhancements: functional optimization and extension based on the open-source Karmada improve business disaster recovery capabilities and support horizontal expansion beyond a single cluster.
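
As a concrete illustration of the OpenKruise-based workload enhancement in the first item above, the sketch below generates a CloneSet manifest that enables in-place upgrades together with a paused, partitioned (batched) release. The service name, labels, and image are placeholders; the field names follow the publicly documented CloneSet v1alpha1 API, not vivo's internal configuration.

```go
// Sketch: emit an OpenKruise CloneSet manifest with in-place upgrade,
// a partitioned (batched) rollout and release pausing enabled.
package main

import (
	"fmt"

	"sigs.k8s.io/yaml"
)

func main() {
	cloneSet := map[string]interface{}{
		"apiVersion": "apps.kruise.io/v1alpha1",
		"kind":       "CloneSet",
		"metadata":   map[string]interface{}{"name": "demo-app"}, // placeholder name
		"spec": map[string]interface{}{
			"replicas": 10,
			"selector": map[string]interface{}{
				"matchLabels": map[string]interface{}{"app": "demo-app"},
			},
			"template": map[string]interface{}{
				"metadata": map[string]interface{}{
					"labels": map[string]interface{}{"app": "demo-app"},
				},
				"spec": map[string]interface{}{
					"containers": []interface{}{
						map[string]interface{}{
							"name":  "app",
							"image": "registry.example.internal/demo-app:v2", // placeholder image
						},
					},
				},
			},
			"updateStrategy": map[string]interface{}{
				// Update the container in place instead of recreating the Pod.
				"type": "InPlaceIfPossible",
				// Keep 8 Pods on the old revision: only 2 are updated in this batch.
				"partition": 8,
				// Pause the rollout until operators resume it.
				"paused": true,
			},
		},
	}

	out, err := yaml.Marshal(cloneSet)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```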

Of course, while fully enjoying the benefits of open source, we also continue to participate in open-source collaboration and give back to the community. In the process of using and extending these projects, we submit the problems we find and the experience we accumulate in production back to communities such as Dragonfly2 and Karmada.

2.5.5 Container CICD integration

In addition to the enhancement of platform capabilities, the container platform, as a PaaS platform, needs to be connected with the surrounding ecology to enable better business migration and use. The most important thing is the connection with the release system, which is the CICD platform.

Almost every technology company will have its own CICD, which is a DevOps automated tool that can perform business construction and orchestrate deployment pipelines.

The underlying architecture of vivo's CICD platform is based on Jenkins + Spinnaker. The entire container build and deployment process is as follows:

  • First, the user creates and saves the pipeline configuration for the release process on the CICD platform.

  • Next, the CI stage integrates with the internal GitLab to pull the code, compiles the code and builds the image with Jenkins, and, after security scanning, pushes the built image to the image repository of the development environment.

  • Finally, in the CD stage, the CICD platform calls the APIs provided by the container platform to perform deployments in the development, testing, pre-release, and production environments (a sketch of such a call follows below).
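
To make the CD stage concrete, the snippet below sketches how a pipeline step might call a container platform's deployment API over HTTP. The endpoint, request fields, and authentication header are hypothetical placeholders, not vivo's actual API.

```go
// Hypothetical CD step: ask the container platform to roll out an image to
// a given environment. Endpoint, payload fields and auth are placeholders.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

type deployRequest struct {
	Service     string `json:"service"`     // service name
	Environment string `json:"environment"` // dev / test / pre / prod
	Image       string `json:"image"`       // image built in the CI stage
}

func main() {
	body, err := json.Marshal(deployRequest{
		Service:     "demo-app",
		Environment: "test",
		Image:       "registry.example.internal/demo-app:20240101-abcdef",
	})
	if err != nil {
		panic(err)
	}

	req, err := http.NewRequest(http.MethodPost,
		"https://container-platform.example.internal/api/v1/deployments", // hypothetical endpoint
		bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	req.Header.Set("Content-Type", "application/json")
	req.Header.Set("Authorization", "Bearer <token>") // placeholder credential

	client := &http.Client{Timeout: 10 * time.Second}
	resp, err := client.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("deploy request status:", resp.Status)
}
```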

2.5.6 Unified traffic access

Next, we introduce the integration of the business traffic access layer, the most important link in the container ecosystem.

In the early days, vivo implemented both north-south and east-west traffic forwarding based on Nginx. This worked well in virtual machine and physical machine scenarios, but with the comprehensive internal promotion of containers, the traditional Nginx architecture could no longer keep up.

The main reasons are that the number of business instances in container scenarios grows by an order of magnitude compared with virtual machines and physical machines, and the frequent IP changes and status synchronization during container releases put great pressure on the Nginx cluster. When business request volume is very high, refreshing and reloading configuration files at the access layer causes business jitter, which is unacceptable to us.

Against this background, we built a cloud-native traffic access layer based on APISIX to meet the needs of comprehensive containerization. After more than a year of construction, our unified traffic access platform now supports containerized access well and has good scalability.
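
Part of what makes APISIX suitable here is that routes and upstreams are changed dynamically through its Admin API (persisted in etcd) instead of by rewriting and reloading configuration files as with classic Nginx. The following sketch registers a route via the Admin API; the admin address, API key, route path, and upstream nodes are placeholders.

```go
// Sketch: dynamically register a route in APISIX through its Admin API so
// that no configuration reload is needed.
package main

import (
	"bytes"
	"fmt"
	"net/http"
	"time"
)

func main() {
	route := []byte(`{
	  "uri": "/demo-app/*",
	  "upstream": {
	    "type": "roundrobin",
	    "nodes": { "10.0.0.11:8080": 1, "10.0.0.12:8080": 1 }
	  }
	}`)

	// Admin API endpoint and key are deployment-specific placeholders.
	req, err := http.NewRequest(http.MethodPut,
		"http://apisix-admin.example.internal:9180/apisix/admin/routes/demo-app",
		bytes.NewReader(route))
	if err != nil {
		panic(err)
	}
	req.Header.Set("Content-Type", "application/json")
	req.Header.Set("X-API-KEY", "<admin-key>")

	client := &http.Client{Timeout: 5 * time.Second}
	resp, err := client.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("route upsert status:", resp.Status)
}
```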

2.6 Practical achievements

2.6.1 Improvement of Product Capability Matrix

After years of polishing and construction, vivo's container product capability matrix is approaching completeness. The entire product capability matrix is divided into four layers:

  • Basic service layer: contains three types of services: image management, cluster operation and maintenance, and cluster monitoring.

  • Capability layer: contains six core capabilities, namely cluster scheduling, the CaaS API, container configuration, container service monitoring and alerting, container logs, and platform extensibility.

  • Platform layer: contains two major platform capabilities, namely CI and CD.

  • Business layer: currently covers all business scenarios of vivo Internet.

2.6.2 Outstanding Achievements in Service Access

Next, we will introduce the promotion of vivo containers in detail.

At present, containers mainly cover four major scenarios inside vivo: Internet online services, online algorithm services, big data computing, and AI algorithm training. The following gives a brief introduction in terms of access scale and value.

  • Internet online services: every internal business line has a large number of services running on containers, such as vivo Mall, Account, Browser, Quick App, and Weather; 600+ services have been onboarded.

  • Online algorithm services: 500+ services and 3,000+ servers have been onboarded, covering the recommendation, advertising, and search business lines.

  • Big data computing services: including offline computing such as Spark, real-time computing such as Flink, OLAP, and other scenarios; 20+ clusters have been onboarded.

  • AI algorithm training: mainly provides GPU and CPU heterogeneous computing for business scenarios such as TensorFlow and MPI, with a computing power of more than 100,000 cores and a number of GPU cards.

After businesses are containerized, the cost reduction and efficiency improvement effects are very obvious, including but not limited to scaling efficiency, elasticity, business self-healing capability, and resource cost.

2.7 Practice Summary

Based on our exploration and practice, it can be summarized as thinking in four dimensions: technical value, promotion strategy, platform construction, and cloud-native breakthrough.

  • Find value:Focus on new technologies, but not obsessed with technology itself, must combine business pain points and value.

  • Set strategy:Bottom-up small-scale pilot exploration will generate actual business value and affect top-down strategic adjustments.

  • Build the platform: where a relatively complete platform and capabilities already exist, find the entry point for containers and pursue integration and co-construction, avoiding tearing everything down and rebuilding; for new capabilities that must be built from 0 to 1, incubate and innovate decisively.

  • Seeking a breakthrough:In the process of business containerization, we have done a lot of compatibility and adaptation for fast containerization. In order to better reduce costs and improve efficiency, in the future, we hope to guide users to achieve a breakthrough from using cloud native to making good use of cloud native.

In general, technology serves the business, and enterprises should find suitable solutions based on their own status quo and create value for the business.

3. Vivo’s future outlook on cloud native

3.1 Development of vivo infrastructure

Looking back over the past 10 years to see where we are headed, the development of vivo's infrastructure has gone through three stages:

  • Phase one: the traditional R&D and O&M stage, from 2011 to 2018, evolving from the early model in which development and operations were separated to a virtualization solution based on OpenStack + KVM.

  • Phase two: the application architecture Internet-ization stage, from 2018 to 2020, when containerization began to rise within vivo.

  • Phase three: the cloud-native infrastructure evolution stage, from 2021 to the present, in which cloud native and containers are being applied and promoted in more scenarios within vivo, such as online-offline hybrid deployment.

3.2 Vivo’s Cloud Native Future Outlook

Return to first principles: do the right things and do things right. Do not follow trends blindly; stay determined, stay grounded in value, view the development of new technologies objectively, make bold hypotheses, verify them carefully, and gain true knowledge through practice.

vivo's cloud-native future will develop in three directions: full containerization, embracing cloud native, and online-offline hybrid deployment.

  • Our vision is: develop once, run everywhere, and achieve ultimate efficiency and cost optimization through automated operation and maintenance!

  • For developers: we hope everyone can be like a blue whale swimming in the sea, carrying our business applications: build once, distribute everywhere, with flexible scheduling and operation and maintenance.

  • For managers:We hope to achieve cost optimization while pursuing efficiency.


END

