In March 2020, Dewu's technical team completed a three-month reconstruction of the entire trading system and delivered the Wucaishi (Five-Colored Stone) project, moving the business system into the microservices era. Once the system was split into services, each service could be owned by a dedicated team, but the dependencies between services became more complicated and the demands on supporting infrastructure such as service governance grew accordingly.
Service monitoring is a key part of service governance and stability engineering: it helps detect problems early, gauge system capacity, and analyze faults. From the end of 2019 to the present, Dewu's application service monitoring system has gone through three stages of evolution, and today the entire application microservice monitoring stack has been fully migrated to the cloud-native observability framework OpenTelemetry.
Looking back over the past ten years, competition in the application service monitoring space has been fierce and related products have emerged one after another: Zipkin, open-sourced by Twitter in 2012; Pinpoint, open-sourced by Naver, South Korea's largest search engine and portal; Jaeger, open-sourced by Uber in recent years; and SkyWalking, open-sourced in China by Wu Sheng.
Some credit all of this to the paper Google published in 2010 describing its internal large-scale distributed tracing system, Dapper, whose design ideas are regarded as the ancestor of all distributed call-chain tracing systems. In fact, as early as twenty years ago (2002), eBay, then the world's largest e-commerce platform, already had a call-chain tracing system named CAL (Centralized Application Logging). In 2011, Wu Qimin, a former senior architect at eBay's China R&D Center, joined Dianping, and after deeply absorbing the design ideas of CAL, led the development and open-sourcing of CAT (Centralized Application Tracking).
As an open-source system led primarily by Chinese developers, CAT is well adapted to local needs. With its simple architecture and out-of-the-box experience, CAT became the first application monitoring system we used.
From 0 to 1: real-time application monitoring based on CAT
Before the Wucaishi project was delivered, the system was monitored only at the infrastructure level; introducing CAT filled the blind spot in application-level monitoring. It provides performance reports across multiple dimensions, health checks, and exception statistics, which greatly helps fault troubleshooting, and it also offers simple real-time alerting.
CAT aggregates metrics at minute-level granularity, and its UI makes it clear that it offers rich reporting and troubleshooting capabilities.
However, as the company's business grew, the granularity of microservices inevitably became finer, and we found that CAT could no longer keep up with our usage scenarios:
- No visualized full-link view:
Troubleshooting and day-to-day performance analysis scenarios are becoming increasingly complex. The internal call chain of a core scenario is usually intricate and changes frequently; from the traffic point of view, we need complete visibility into its sources, upstream and downstream links, asynchronous calls, and so on, which is beyond what CAT can offer.
- Lack of chart customization capabilities:
Although CAT provides multi-dimensional report analysis, its customization capabilities are very limited. At that time, the industry was converging on Grafana + Prometheus for customizable charting, which CAT could not take advantage of. Meanwhile, with the rise of the cloud-native observability project OpenTracing, within less than half a year we gradually retired CAT and moved toward the OpenTracing ecosystem.
Going further: full-link sampled monitoring based on OpenTracing
OpenTracing defines a complete set of protocol standards for full-link tracing (Trace) but does not itself provide an implementation. In the OpenTracing protocol, a Trace is modeled as a directed acyclic graph (DAG) of Spans; the specification illustrates this with an example of the causal relationships among 8 spans within a single trace:
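As a minimal code illustration of that model (a sketch against the io.opentracing API, not our production instrumentation), a root span and its children form exactly such a DAG:

```java
import io.opentracing.Span;
import io.opentracing.Tracer;
import io.opentracing.util.GlobalTracer;

public class DagExample {
    public static void main(String[] args) {
        Tracer tracer = GlobalTracer.get();             // any OpenTracing-compatible tracer, e.g. Jaeger

        Span root = tracer.buildSpan("checkout").start();          // the trace's root span
        Span child = tracer.buildSpan("query-inventory")
                .asChildOf(root)                                    // ChildOf reference: an edge in the DAG
                .start();
        child.finish();

        Span sibling = tracer.buildSpan("create-order")
                .asChildOf(root)
                .start();
        sibling.finish();

        root.finish();                                              // three spans, one trace, a simple DAG
    }
}
```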
At that time, the open-source community around OpenTracing was extremely active. We used Jaeger for data collection, and call chains were displayed as Gantt charts:
In the OpenTracing ecosystem we adopted head-based sampling for traces. For metrics, OpenTracing does not define a specification, but the Google SRE Book's chapter on monitoring distributed systems describes four golden signals:
- Throughput (traffic): for example, requests per second. A typical implementation is a counter incremented on every completed request; throughput per second is the rate of change of that counter over a time window.
- Latency: the time it takes to process a request.
- Errors (error rate / error count): for example, HTTP 500 responses; in some cases even an HTTP 200 must be classified as an "error" according to specific business logic.
- Saturation: utilization of server hardware resources such as CPU, memory, and network.
We therefore decided to use the Micrometer library to instrument the throughput, latency, and error rate of each component, and in this way monitor the performance of DB, RPC, and other components. The second stage of our monitoring can thus be described as metrics-centric application performance monitoring, supplemented by call-chain monitoring.
3.1 Threading an Endpoint label through metric instrumentation to aid performance analysis
When instrumenting metrics, we introduced an "Endpoint" label on all metrics. This label distinguishes the associated DB, cache, message queue, and remote calls by the traffic entry point that triggered them. By keying on the traffic entry, all component metrics of an instance are tied together (a minimal sketch follows the list below), which basically satisfies monitoring needs in the following scenarios:
- Troubleshooting RPC calls: the caller sees not only the downstream interface but also the entry-point interface that triggered the call.
- Interface latency analysis: from the metrics, a per-time-window latency breakdown can be reconstructed to quickly identify the most time-consuming components.
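A minimal Micrometer sketch of this idea (metric and tag names such as "endpoint" are illustrative, not our actual conventions): each DB or RPC timer carries the traffic entry as a tag, so the golden signals can be broken down per entry point.

```java
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;

import java.util.concurrent.TimeUnit;

public class EndpointMetricsExample {
    public static void main(String[] args) {
        MeterRegistry registry = new SimpleMeterRegistry();

        // Suppose the current traffic entry (HTTP route / RPC method) has been resolved from context.
        String endpoint = "POST /order/create";

        // One timer per component call: count gives throughput, the recorded duration gives latency,
        // and a status tag separates errors from successes.
        Timer dbTimer = Timer.builder("db.client.requests")
                .tag("endpoint", endpoint)            // the label that threads the entry point through
                .tag("operation", "insert")
                .tag("status", "success")
                .register(registry);
        dbTimer.record(12, TimeUnit.MILLISECONDS);

        System.out.println(registry.getMeters());
    }
}
```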
3.2 On the choice of solution
You may ask: there are ready-made APM products for link monitoring, such as Zipkin, Pinpoint, and SkyWalking, so why did we choose to instrument ourselves with OpenTracing + Prometheus? There were two main factors:
First, at that time CAT could not satisfy full-link monitoring or some customized report analysis, and delivery of the Wucaishi project for the trading link was drawing to a close. Rushing to integrate a large external APM product without sufficient verification would have introduced stability risks to the services, which was not a rational choice within an extremely limited time frame.
Second, the monitoring component was released together with a unified base framework. At the same time, the full-link shadow-database routing component developed by another team relied on OpenTracing's in-band data propagation (baggage) mechanism and was tightly coupled with the monitoring component. The base framework coordinated the monitoring, load-testing, and other modules, and with the Spring Boot Starter mechanism these functions were available out of the box and, to a degree, seamlessly integrated. Pinpoint and SkyWalking, which rely on bytecode enhancement, could not be integrated with the base framework as cleanly; developing them in parallel would have added management and maintenance costs on both the base framework and the Java agent, slowing iteration.
Over the following two years, application service monitoring came to cover nearly 70% of the components used by Dewu's technology department, providing strong support for the Dewu App's 99.97% annual SLA in 2021. In hindsight, the OpenTracing + Prometheus ecosystem solved call-chain monitoring for our distributed systems well, Grafana enabled flexible metric dashboards, and integration with the base framework let business teams use it all out of the box. However, as the business developed rapidly, the shortcomings of this second-stage architecture, full-link sampled monitoring based on OpenTracing, gradually became apparent.
3.3 Architecture Features
- Experience level
- Metrics: coverage is broad and dimensions are fine-grained; statistics can be analyzed clearly along every dimension of every module, which basically meets monitoring and flexible dashboarding needs. But it is, undeniably, time-series aggregated data that cannot be drilled down to individual requests. If a handful of slow operations occur at some point in time, once throughput is high enough the average latency curve stays flat with no visible spike, which weakens our ability to discover problems.
- Traces: with a 1% sampling rate, business services basically suffer no performance problems from sending call chains, but it is often impossible to find the right sampled trace for a slow or failed request. We once considered switching from head sampling to tail sampling, but that meant a very high SDK transformation cost and back-tracking the sampling decision in complex call scenarios (such as asynchronous calls), and there was still no guarantee that every slow or failed operation would yield a complete call chain.
- Integration: both the business side and the base framework build with Maven, and the Spring Boot Starter "all in one" out-of-the-box integration greatly reduces integration cost, but it also plants hidden dependency-conflict hazards.
Project iteration level
Iteration cycles diverged and conflicted. Integrating with the base framework was the best choice at the time for rapidly promoting and landing full-link monitoring, and through it the adoption rate among Java services once approached 100%. But against the backdrop of rapid business growth, the base framework iterated far more slowly than the business did, which indirectly constrained the iteration of the whole monitoring system.
Data governance costs gradually rose. Because the base framework and business systems naturally iterate at different rhythms, and every business system also has its own pace, the base framework versions across all back-end services were uneven.
Although each iteration of the monitoring system tried to preserve maximum backward compatibility, the data differences caused by version divergence greatly constrained the iteration of the monitoring portal system Tianyan (SkyEye) over this nearly two-year cycle; developers kept compromising on data and taking roundabout routes to deliver many features.
Contingency plans relied on Spring's automatic bean assembly. This was easy for business teams to understand and to change, but it lacked fine-grained plans, such as degrading specific logic at runtime.
Starting from the second half of 2021, in order to balance the above benefits and risks, we decided to decouple the monitoring collection agent from the base framework and iterate it independently. Before that, under the push of the CNCF (Cloud Native Computing Foundation), OpenTracing had already merged with OpenCensus into a new project: OpenTelemetry.
One step further: full-link application performance monitoring based on OpenTelemetry
OpenTelemetry is positioned to unify telemetry data collection and semantic conventions in the observability field. With the support of the CNCF, and with more and more people following and contributing over the past two years, the whole system has become more mature and stable.
We had in fact started following the OpenTelemetry project at the end of 2020, but it was still in its infancy then: the Trace and Metrics APIs were still in Alpha, with many unstable factors. From mid-year to year-end we also took part, to varying degrees, in discussions of related issues in the OpenTelemetry community, development of the telemetry modules, alignment of the underlying data protocol, and fixes for some bugs. Over that half year, as more and more people participated, the related APIs and SDKs gradually stabilized.
OpenTelemetry architecture (picture from opentelemetry.io)
4.1 Entering the era of Trace2.0
OpenTelemetry is committed to unifying the three pillars of observability: Metrics, Traces, and Logs. At the telemetry API level it provides a unified context so that the SDK implementation layer is decoupled. Take the relationship between Metrics and Traces as an example: in our view, OpenTelemetry's Metrics implementation includes support for the OpenMetrics standard protocol, and Exemplar-format data builds the bridge between Traces and Metrics:
OpenMetrics is a specification based on the Prometheus format, with finer-grained adjustments, and is basically backward compatible with the Prometheus format.
Before this, metric data could not be precisely associated with a specific trace or set of traces; it could only be roughly correlated, by timestamp, with traces within a time range. The flaw in that approach is that the metrics collector vmagent pulls every 10s~30s, so a metric's timestamp depends on collection time and does not match the time of the traced call.
Exemplar data appends the trace ID and span ID of the current context to the end of a histogram sample line, as follows:
shadower_virtual_field_map_operation_seconds_bucket{holder="Filter:Factory",key="WebMvcMetricsFilter",operation="get",tcl="AppClassLoader",value="Servlet3FilterMappingResolverFactory",le="0.2"} 3949.0 1654575981.216 # {span_id="48f29964fceff582",trace_id="c0a80355629ed36bcd8fb1c6c89dedfe"} 1.0 1654575979.751
To collect Exemplar-format metrics while avoiding the high-cardinality problem caused by the histogram bucket label "le", we made secondary modifications to the metrics collector vmagent: it additionally filters out metrics carrying Exemplar data and sends them to Kafka asynchronously in batches. After Flink consumes them and writes them into ClickHouse, the SkyEye monitoring portal provides the query interface and UI.
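For illustration only (our vmagent changes are written in Go and are not shown here), the consuming side essentially has to split off the exemplar suffix after the "#" marker and extract trace_id/span_id before writing rows to ClickHouse; a rough Java sketch under those assumptions:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ExemplarParser {
    // Matches the exemplar suffix of an OpenMetrics histogram line, e.g.
    // ... 3949.0 1654575981.216 # {span_id="...",trace_id="..."} 1.0 1654575979.751
    private static final Pattern EXEMPLAR =
            Pattern.compile("#\\s*\\{[^}]*span_id=\"([0-9a-f]+)\"[^}]*trace_id=\"([0-9a-f]+)\"[^}]*\\}");

    public static void main(String[] args) {
        String line = "shadower_bucket{le=\"0.2\"} 3949.0 1654575981.216 "
                + "# {span_id=\"48f29964fceff582\",trace_id=\"c0a80355629ed36bcd8fb1c6c89dedfe\"} 1.0 1654575979.751";
        Matcher m = EXEMPLAR.matcher(line);
        if (m.find()) {
            String spanId = m.group(1);
            String traceId = m.group(2);
            System.out.println("exemplar -> trace=" + traceId + " span=" + spanId);
        }
    }
}
```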
UI diagram of quantile line statistics and Exemplar data association
At the data reporting layer, the OpenTelemetry Java SDK uses an MPSC (multi-producer, single-consumer) queue that performs better than the JDK's native blocking queues. It pads memory regions with long fields, trading space for time to avoid false sharing and reduce write contention under concurrency, thereby improving performance.
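As a simplified illustration of that padding trick (not the SDK's actual source), fields written by different threads are separated by unused long fields so that they land on different cache lines:

```java
import java.util.concurrent.atomic.AtomicLongFieldUpdater;

// Simplified illustration of cache-line padding: p1..p7 and q1..q7 are never read; they only keep
// 'producerIndex' and 'consumerIndex' apart so that producer threads and the consumer thread do not
// repeatedly invalidate each other's cache line (false sharing).
public class PaddedIndexes {
    protected long p1, p2, p3, p4, p5, p6, p7;          // padding before
    protected volatile long producerIndex;
    protected long q1, q2, q3, q4, q5, q6, q7;          // padding between
    protected volatile long consumerIndex;

    private static final AtomicLongFieldUpdater<PaddedIndexes> PROD =
            AtomicLongFieldUpdater.newUpdater(PaddedIndexes.class, "producerIndex");

    long claimSlot() {
        return PROD.getAndIncrement(this);               // producers contend only on this field
    }
}
```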
During peak traffic, the flame graph shows that the trace-data send queue accounts for less than 2% of CPU on average, and the services' overall daily CPU level is almost indistinguishable from running with zero sampling. After multiple rounds of load tests and comparisons, we therefore decided to enable full trace reporting on the client side in production, achieving 100% full-link sampling for the first time in Dewu's technical history and putting an end to troubleshooting difficulties caused by low sampling rates. At this point, in its third stage, Dewu's full-link tracing officially entered the Trace2.0 era.
Thanks to OpenTelemetry's pluggable API design, we forked the OpenTelemetry Java Instrumentation project as Shadower Java and extended it with many features:
4.2 A control plane to manage client-side collection behavior
The control plane delivers configuration items to clients through a watch mechanism, including:
- Real-time dynamic sampling control
- Behavior control of the Arthas diagnostic tool
- Real-time global degradation plan
- Runtime switches for telemetry components
- Real-time switch for collecting RPC component input parameters
- Real-time degradation control for high-cardinality metric labels
- Contingency plan management by probe version
- Gray-release access strategy based on authorization counts
The introduction of the control plane fills the gap of having no degradation plan, provides more flexible configuration, and supports rapidly changing the data collection scheme under different traffic scenarios:
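As one hedged example of what "real-time dynamic sampling control" can look like (a sketch against the OpenTelemetry SDK Sampler API; the control-plane listener and names are assumptions, not our actual implementation):

```java
import io.opentelemetry.api.common.Attributes;
import io.opentelemetry.api.trace.SpanKind;
import io.opentelemetry.context.Context;
import io.opentelemetry.sdk.trace.data.LinkData;
import io.opentelemetry.sdk.trace.samplers.Sampler;
import io.opentelemetry.sdk.trace.samplers.SamplingResult;

import java.util.List;

// A Sampler whose ratio can be swapped at runtime when the control plane pushes a new value.
public class DynamicRatioSampler implements Sampler {
    private volatile Sampler delegate = Sampler.traceIdRatioBased(1.0);   // start with 100% sampling

    // Called by the (hypothetical) control-plane config listener.
    public void updateRatio(double ratio) {
        this.delegate = Sampler.traceIdRatioBased(ratio);
    }

    @Override
    public SamplingResult shouldSample(Context parentContext, String traceId, String name,
                                       SpanKind spanKind, Attributes attributes, List<LinkData> parentLinks) {
        return delegate.shouldSample(parentContext, traceId, name, spanKind, attributes, parentLinks);
    }

    @Override
    public String getDescription() {
        return "DynamicRatioSampler";
    }
}
```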
4.3 An independent launcher module
To resolve the long-standing dependency conflicts that business teams faced from integration with the base framework, as well as the data-format fragmentation and compatibility problems caused by many coexisting versions, we developed Promise, a general-purpose javaagent launcher. Combined with remote storage, it supports configurable download, update, installation, and startup of any javaagent:
```ini
[plugins]
enables = shadower,arthas,pyroscope,chaos-agent
[shadower]
artifact_key = /javaagent/shadower-%s-final.jar
boot_class = com.shizhuang.apm.javaagent.bootstrap.AgentBootStrap
classloader = system
default_version = 115.16
[arthas]
artifact_key = /tools/arthas-bin.zip
;boot_class = com.taobao.arthas.agent334.AgentBootstrap
boot_artifact = arthas-agent.jar
premain_args = .attachments/arthas/arthas-core.jar;;ip=127.0.0.1
[pyroscope]
artifact_key = /tools/pyroscope.jar
[chaos-agent]
artifact_key = /javaagent/chaos-agent.jar
boot_class = com.chaos.platform.agent.DewuChaosAgentBootstrap
classloader = system
apply_envs = dev,test,local,pre,xdw
```
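For intuition only, here is a heavily simplified sketch of how a launcher agent of this kind might chain another javaagent at startup. The class name and config keys come from the configuration above; the local path, download logic, and error handling are assumptions rather than Promise's actual code:

```java
import java.io.File;
import java.lang.instrument.Instrumentation;
import java.util.jar.JarFile;

public final class PromiseLauncher {
    // The launcher itself is attached with -javaagent:promise.jar; it then boots the configured agents.
    public static void premain(String args, Instrumentation inst) throws Exception {
        // 1. Download the artifact referenced by artifact_key from remote storage (omitted here).
        File agentJar = new File("/tmp/shadower-115.16-final.jar");   // hypothetical local path

        // 2. "classloader = system" in the config: make the agent classes visible to the system loader.
        inst.appendToSystemClassLoaderSearch(new JarFile(agentJar));

        // 3. Invoke the configured boot_class premain, passing the Instrumentation instance through.
        Class<?> boot = Class.forName("com.shizhuang.apm.javaagent.bootstrap.AgentBootStrap");
        boot.getMethod("premain", String.class, Instrumentation.class).invoke(null, "", inst);
    }
}
```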
4.4 Extension based on Otel API
4.4.1 Rich component metrics
During the second, OpenTracing-based stage, we used the Endpoint label to thread together the metric instrumentation of multiple components. This useful feature carried over into the third stage: we designed a complete metrics instrumentation SDK on top of the underlying Prometheus SDK and, taking advantage of bytecode injection, optimized and enriched more component libraries. (At that time the main OpenTelemetry SDK version was 1.3.x, and the Metrics SDK was still in Alpha.)
OTel's Java Instrumentation mainly uses WeakConcurrentMap as the container for propagating asynchronous trace context and associating thread contexts. Since OTel enhances many popular component libraries, WeakConcurrentMap is used very frequently, so monitoring it helps troubleshoot memory leaks caused by the probe. Once its growth rate reaches the threshold we set, an alarm fires and we can intervene early with the relevant contingency plan to prevent an online failure.
Partial self-monitoring panel
4.4.2 Extended trace context propagation protocol
- Introducing RPC IDs
To better correlate upstream and downstream applications and give every request an "identity", we extended the TextMapPropagator interface so that every request on the link knows where it came from, which is crucial for troubleshooting cross-region and cross-environment calls (a sketch of such a propagator follows the list below).
In addition, for cross-service scenarios, we borrowed the RpcID model from Alibaba EagleEye's call chain and added an RpcID field. The trailing number of this field is incremented on every cross-service call, and for a downstream application the field itself gains one more level of hierarchy:
This field serves the following purposes:
- It supports a simplified call-chain view: when querying bloated traces (for example those involving cache and DB calls with more than 2000 spans), only RPC call nodes and their hierarchical relationships are shown.
- It keeps the trace faithful: the client-side reporting queue is not unbounded, and when a client makes calls very frequently and the queue backlog exceeds the threshold, data is dropped, leaving the trace incomplete. This is expected behavior, but without the RpcID field the view cannot place the lost nodes, and the whole call hierarchy becomes confused and distorted.
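A hedged sketch of such a propagator against the OpenTelemetry API (the header name "x-rpc-id" and the context key are illustrative; the real increment and hierarchy rules are more involved):

```java
import io.opentelemetry.context.Context;
import io.opentelemetry.context.ContextKey;
import io.opentelemetry.context.propagation.TextMapGetter;
import io.opentelemetry.context.propagation.TextMapPropagator;
import io.opentelemetry.context.propagation.TextMapSetter;

import java.util.Collection;
import java.util.Collections;

// Carries a hypothetical "x-rpc-id" header alongside the standard trace headers.
public class RpcIdPropagator implements TextMapPropagator {
    private static final String HEADER = "x-rpc-id";
    private static final ContextKey<String> RPC_ID_KEY = ContextKey.named("rpc-id");

    @Override
    public Collection<String> fields() {
        return Collections.singletonList(HEADER);
    }

    @Override
    public <C> void inject(Context context, C carrier, TextMapSetter<C> setter) {
        String rpcId = context.get(RPC_ID_KEY);
        if (rpcId != null) {
            setter.set(carrier, HEADER, rpcId);       // e.g. "0.1.2": downstream learns its position
        }
    }

    @Override
    public <C> Context extract(Context context, C carrier, TextMapGetter<C> getter) {
        String rpcId = getter.get(carrier, HEADER);
        return rpcId == null ? context : context.with(RPC_ID_KEY, rpcId);
    }
}
```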
- Custom Trace ID
To make retrieval on the trace-detail page efficient, we extended the TraceID generation logic: the first 8 hex characters are the instance IP, the middle 8 are the current timestamp, and the last 16 are random (a generation sketch follows the layout below).
A 32-character custom traceId: c0a8006b62583a724327993efd1865d8
c0a8006b | 62583a72 | 4327993efd1865d8
high 8 (IP) | middle 8 (timestamp) | low 16 (random)
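A hedged sketch of this scheme (the hex widths and field order follow the description above; the exact encoding used in production is not shown):

```java
import java.util.concurrent.ThreadLocalRandom;

public class CustomTraceIdGenerator {
    // Builds a 32-hex-char trace id: 8 chars of IPv4, 8 chars of epoch seconds, 16 random chars.
    public static String generate(String ipv4) {
        StringBuilder sb = new StringBuilder(32);
        for (String part : ipv4.split("\\.")) {                 // e.g. 192.168.0.107 -> c0 a8 00 6b
            sb.append(String.format("%02x", Integer.parseInt(part)));
        }
        long epochSeconds = System.currentTimeMillis() / 1000;  // e.g. 0x62583a72
        sb.append(String.format("%08x", epochSeconds));
        sb.append(String.format("%016x", ThreadLocalRandom.current().nextLong() & Long.MAX_VALUE));
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(generate("192.168.0.107"));          // c0a8006b + timestamp + random
    }
}
```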
This has two advantages:
Parsing the timestamp back out of the TraceID and locking the query to a time range improves retrieval efficiency in the ClickHouse store, and also helps decide whether a given trace should be looked up in hot or cold storage.
Binding the instance IP makes it easy to associate a trace's entry traffic with the instance it belongs to. In extreme cases where the nodes on the link cannot be retrieved at all, the source can still be traced back from the instance and the time.
- Asynchronous call recognition
Business systems make heavy use of asynchronous calls to improve throughput and make full use of hardware resources. On top of the asynchronous context propagation implemented by OTel, we additionally added an "async_flag" field to mark the calling relationship of the current node relative to its parent, so that asynchronous calls can be quickly identified in the presentation layer.
4.4.3 Clearer call chain structure
Some operations in the components supported by OTel involve no network call or occur very frequently, such as the MVC processing steps or acquiring a database connection. Such nodes generally add little to the main view of a trace's details, so we adjusted their generation logic to keep the main structure of the trace focused on "cross-service" calls. At the same time, we enhanced the details of key internal methods of some core components and attach them to their parent spans as "events", which makes finer-grained troubleshooting convenient:
Key internal events on an RPC call
Connection-acquisition event on a DB call
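For instance, rather than emitting a child span for acquiring a DB connection, the instrumentation can attach an event to the current span; a hedged OTel API sketch (the event name and attribute are illustrative, not our actual schema):

```java
import io.opentelemetry.api.common.AttributeKey;
import io.opentelemetry.api.common.Attributes;
import io.opentelemetry.api.trace.Span;

public class ConnectionEventExample {
    static void onConnectionAcquired(long costMillis) {
        // Attach the detail to the span that is current on this thread, instead of creating a new span.
        Span.current().addEvent("db.connection.acquired",
                Attributes.of(AttributeKey.longKey("cost_ms"), costMillis));
    }
}
```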
4.4.4 Profiling support
1) Integrated thread-stack analysis. By integrating tools such as Arthas, we can conveniently view the real-time thread stacks of an instance, while controlling the sampling interval so that frequent captures do not affect the performance of the business itself.
2) Last-mile diagnosis of high latency through Pyroscope integration. Pyroscope builds on async-profiler and also supports OTel integration, but so far it does not implement a complete lifecycle for a profiling run, and profiling inevitably costs some performance. We therefore extended the official Pyroscope lifecycle to add a "stop" behavior, and we use a timing-wheel algorithm to watch the duration of specific operations: when the expected threshold is reached, profiling is triggered, and it stops when the operation finishes or the maximum duration is exceeded.
For how this is applied to performance diagnosis, please look forward to a follow-up article dedicated to diagnostics.
Looking back at the three milestones of Dewu's evolution in application monitoring and collection: the first stage, CAT, was a 0-to-1 process that gave application services a way to observe themselves and let business teams truly understand the running state of their services for the first time. From the second stage onward, as the business developed rapidly, business teams needed monitoring not merely to exist but to be refined and accurate.
Under the pressure of rapid iteration, the tension between features and architecture evolution, together with the growth of the observability field in the cloud-native community, pushed us into the third stage of evolution based on the OpenTelemetry ecosystem, which has delivered excellent results at both the feature and the product level. We are now about to begin the next stage: deeply combining call chains with the related diagnostic tools so that, building on the third stage, Dewu's full-link tracing officially enters the era of performance analysis and diagnosis.
The Dewu monitoring team provides a one-stop observability platform and is responsible for link tracing, the time-series database, and the logging system, covering custom dashboards, application dashboards, business monitoring, intelligent alerting, AIOps, and other troubleshooting and analysis capabilities.
- Dapper, a Large-Scale Distributed Systems Tracing Infrastructure
https://storage.googleapis.com/pub-tools-public-publication-data/pdf/36356.pdf
- In-depth Analysis of Dianping's Open-Source Distributed Monitoring Platform CAT (Alibaba Cloud Developer Community)
https://developer.aliyun.com/article/269295
- A Brief History of OpenTelemetry (So Far) (CNCF blog)
https://www.cncf.io/blog/2019/05/21/a-brief-history-of-opentelemetry-so-far/
- The OpenMetrics project: creating a standard for exposing metrics data
https://openmetrics.io/
*Text / Ku Feng Xinyuan
Follow Dewu Tech: we publish technical content every Monday, Wednesday, and Friday evening at 18:30.
If you find this article helpful, please comment, share, and like.