From enterprise-level CF platform to cloud native (2): Observability-driven governance—from monitoring large screens to precise decision-making systems
Drawing on six years of hands-on experience as an enterprise platform architect, this article analyzes the central role of observability in microservice governance, from data silos to unified OpenTelemetry standards, and shows how to build a governance system capable of precise decision-making.
From the era of enterprise-level CF platforms to the present, cloud native projects have moved from early exploration to large-scale implementation. The biggest change during this period was not the update of the tool chain, but a fundamental shift in the understanding of the nature of governance.
The first article in this series traced the evolution path from the enterprise-level CF platform to cloud native and introduced the concept of "Observability-Driven Governance (ODG)". This article analyzes that concept in depth and explains why observability is not an accessory to governance, but the foundation of the entire governance system.
Practice in the era of enterprise-level CF platforms showed that even with industry-leading monitoring tools, troubleshooting remains a war of attrition if the data is not correlated. Multiple subsequent cloud native projects have confirmed this: tooling differences do exist, but the bigger issue is how organizations understand and use observability data. Many organizations equate observability with "monitoring": deploy some tools, set some alerts, and consider the work done. This misunderstanding leads to wasted resources and persistent inefficiency in operations.
Central to the ODG philosophy is a repositioning of observability: not as an after-the-fact troubleshooting tool, but as the basis for decisions made beforehand. Circuit breaker thresholds, rate limiting rules, grayscale strategies: all of these governance decisions require data support. Without data, governance degenerates into empiricism and gut feeling; with low-quality data, governance makes wrong decisions based on wrong assumptions.
Opening: Why observability is the foundation of governance
In early industry practice, the observability tooling on enterprise-level platforms (such as the early enterprise-level CF platform and Cloud Foundry) was already fairly complete: Dynatrace provided out-of-the-box APM capabilities, Kibana could search logs, and Prometheus could collect metrics. Yet whenever a problem occurred in production, troubleshooting was still a war of attrition.
The problem is not a lack of data, but a lack of correlation between the data.
A typical scenario: a core transaction interface occasionally times out, and the error rate fluctuates between 0.5% and 2%. The operations team first checks the application performance data in Dynatrace and confirms that response times are indeed abnormal. They then switch to Kibana, search the relevant logs by timestamp, and find a pile of error stacks. But to follow the complete call chain of a specific request, they have to jump back and forth between three systems and manually piece the information together. It finally takes nearly three hours to locate the source of the problem: a misconfigured connection pool for a downstream database.
This scenario is typical because it exposes the fundamental contradiction of observability construction: we have plenty of data, but we lack effective information. Dynatrace's metrics say response times are abnormal, Kibana's logs give the specific error messages, but there is no bridge between them. The Trace ID is a string in the log and an identifier in a different format in Dynatrace, and engineers have to establish the mapping manually.
What is even more frustrating is that this problem recurs in different forms every day. In the morning the order service times out, in the afternoon the inventory service's connection pool runs out, in the evening the payment service throws occasional exceptions. The troubleshooting path is always similar: spot an anomaly in the metrics system → search details in the log system → manually correlate traces → jump back and forth to verify. Much of an engineer's time goes into mechanical operations between systems instead of real problem analysis.
This situation is not unique. Industry observations show that between 2018 and 2020, the average time it took to deal with production issues in a typical enterprise-level project was usually: simple configuration errors took an average of 45 minutes, complex issues involving multi-service interaction took an average of 2.5 hours, and cross-team coordinated failures often took more than half a day. Most of this time is spent on “finding data” and “correlating data” instead of analyzing root causes.
In one internal survey, engineers were asked to record how their troubleshooting time was distributed over a week. On average, about 40% of each investigation went into deciding "which system to check", 35% into "establishing correlations between different systems", and only 25% into actually analyzing the root cause. In other words, more than 70% of the time was spent on data navigation rather than on the problem itself.
What is even more worrying is the hidden cost of this inefficient troubleshooting. When engineers switch between multiple systems, their attention is constantly interrupted and key clues are easily missed; manually correlating data is error-prone. In practice, engineers have confused the Trace IDs of two different requests and headed off in a completely wrong direction; and when collaborating across teams that use different tools and data standards, communication costs are extremely high.
From a technology evolution perspective, this phenomenon persists across different cloud native projects. Surprisingly, even in 2024, the observability construction of many enterprises is still stuck at the "monitoring big screen" stage: beautiful dashboards, scattered data sources, and engineers still struggling across multiple systems whenever a failure occurs.
The core insight of the ODG philosophy is: the quality of governance decisions depends on the quality and correlation of the observed data. What should the circuit breaker threshold be? Are the rate limiting rules reasonable? Can a grayscale release proceed to the next stage? These decisions require accurate, contextual data. Without high-quality observability, governance becomes guesswork based on experience rather than data-based decision-making.
This insight comes from repeated verification in large-scale microservice practice. In a typical enterprise-level project, the team set various governance rules according to industry "best practices": the circuit breaker threshold uniformly at a 50% error rate, the rate limit uniformly at 1000 QPS, the timeout uniformly at 30 seconds. The rules seemed reasonable, but problems occurred frequently in actual operation: normal fluctuations triggered circuit breakers, burst traffic was incorrectly throttled, and timeouts did not match the actual processing capacity of downstream services. The root cause is that these thresholds were not derived from the actual operating characteristics of each service, but copied from common empirical values.
The ODG method requires collecting data and analyzing its characteristics first, and only then setting rules. The circuit breaker threshold should be based on the service's actual error rate distribution; the rate limiting threshold should be based on the service's actual capacity and load pattern; the timeout should be based on the actual response times of downstream dependencies. These seemingly simple adjustments often yield significant stability improvements.
The Observability Dilemma in the Age of Enterprise-Grade CF Platforms
Three pillars, three data silos
In an enterprise-level CF platform environment, observability data is roughly divided into three categories, each managed by different tools:
Metrics: Mainly collected with Prometheus and stored in a self-built or hosted time series database. Metrics data is low-volume and highly aggregated, well suited for trend views and alerting. PromQL queries are flexible, but dimensions are limited and it is hard to associate metrics with specific requests or users.
Logs: Usually ELK (Elasticsearch + Logstash + Kibana) or the company’s internally developed log platform is used. Logs contain rich contextual information, but are large in size, slow to query, and have inconsistent formats—some services use JSON, some use plain text, some put the Trace ID in the header, and some in the message field.
This format chaos makes log analysis hugely challenging. In a 2019 legacy system case, there were 12 microservices and every service had a different log format. Service A used JSON like {"timestamp":"2020-01-01T10:00:00Z","level":"ERROR","message":"..."}, Service B used plain text like 2020-01-01 10:00:00 [ERROR] ..., and Services C and D each used their own custom key-value formats. There was no unified structured logging; parsing relied entirely on regular expressions.
To find relevant information in these logs, you need to maintain a complex set of parsing rules, and the rules need to be updated every time a new service is added or the format of an old service is modified. There is a dedicated engineer in the team responsible for maintaining this set of parsing configurations, which in itself is a waste of resources.
Traces: In the early days this mainly relied on Dynatrace APM's automatic collection; later some teams began trying Zipkin or Jaeger. Trace data can show the full path of a request, but sampling is a problem: full collection is too expensive, while sampling may miss exactly the abnormal requests that matter.
These three types of data are independent of each other. Metrics tells you “the system has slowed down”, Logs tells you “an error was reported here”, and Traces tells you “the request went through these services”, but there is no automatic correlation between the three. When a failure occurs, engineers need to:
- Find abnormal time points in Metrics
- Switch to Logs and search error logs based on time range
- Find a suspicious Trace ID, then switch to the trace system to view the call chain
- If a service anomaly is found in the trace, return to Logs to search for that service's detailed logs
- Repeat this until a complete picture of the fault is pieced together
This kind of cross-system troubleshooting is not only inefficient but also makes it easy to lose context along the way. Industry statistics suggest that around 2019, a typical cross-service investigation required an average of 15-20 switches between 3-4 systems.
The “black box” limitations of Dynatrace APM
The enterprise-class CF platform environment comes standard with Dynatrace APM, which provides powerful automatic discovery capabilities and OneAgent non-intrusive collection. For monolithic applications or simple service topologies, Dynatrace can indeed quickly locate problems.
However, in the microservice scenario, the limitations of Dynatrace gradually become apparent:
Automatic discovery ≠ Accurate correlation. Dynatrace can automatically identify service call relationships, but this relationship is based on network traffic analysis rather than business semantics. When multiple requests are processed concurrently, it is difficult to distinguish which Metrics peak corresponds to which business scenario.
Sampling strategy is rigid. Dynatrace's smart sampling controls storage costs, but for sporadic anomalies (such as a specific error occurring a few times per hour), smart sampling often misses the critical samples. A typical scenario: users report occasional payment failures, but the relevant traces simply do not appear in Dynatrace's sampled data, and the issue cannot be reproduced until the sampling rate is manually increased.
Customization capabilities are limited. Although Dynatrace’s Dashboard and alarm rules are rich, if you want to implement some customized correlation analysis (such as associating business order numbers with technical trace IDs), you need to rely on Dynatrace’s API for secondary development, which is expensive.
The core problem is: Dynatrace is a closed commercial product, and its data storage and analysis logic are black boxes. When deep integration with other systems (such as a self-developed governance platform) is required, the export interfaces are often limiting.
Data explosion and storage cost dilemma
As the number of microservices grows, the amount of observability data increases exponentially. A typical project in the late era of the enterprise-level CF platform has more than 80 microservices and an average daily log volume of more than 500GB. Even if the trace data is sampled at 1%, it amounts to dozens of GB every day.
This amount of data brings several practical problems:
Storage costs are out of control. To support log queries, the Elasticsearch cluster requires a large number of nodes, and its storage and compute costs become the second largest expenditure after the applications' own compute resources.
Query performance degrades. Query delays on the log platform often exceed 10 seconds during peak periods, and engineers have to wait to troubleshoot problems, greatly reducing efficiency.
Sampling dilemma deepens. To control costs, the team was forced to sample more aggressively, which in turn meant data was missing for exactly the abnormal scenarios that mattered, forming a vicious cycle.
Several optimization measures were tried at the time: tiered sampling by service importance, tiered log storage (hot/warm/cold), and adding Elasticsearch nodes. These measures alleviated the symptoms, but did not solve the underlying problem: the data remained siloed.
Real troubleshooting scenarios
The following describes a typical case in 2019 to illustrate the dilemma at that time.
One weekday afternoon, a monitoring alert reported that the P95 latency of the core transaction service increased from the normal 200ms to 800ms. The engineer on duty started the investigation:
Phase 1 (0-15 minutes): Check Dynatrace and find that the latency increase is concentrated on one service instance. The instance's CPU and memory metrics show no obvious abnormality. A GC problem is suspected, but Dynatrace's GC metrics look normal.
Phase 2 (15-35 minutes): Switch to the log platform and search the service's ERROR logs within the time range. Many connection timeout errors are found, but it is unclear which downstream service the connections target. The Trace ID format in the logs does not match the trace identifier in Dynatrace, so they cannot be correlated directly.
Phase 3 (35-60 minutes): Using the timestamps in the logs, manually locate the corresponding traces in Dynatrace. Because of sampling, only a few relevant traces are found; they show that calls to the downstream inventory service timed out. Yet the latency metrics of the inventory service itself look normal.
Phase 4 (60-90 minutes): Expand the scope of the investigation and check the inventory service's logs. It turns out the inventory service was calling another price calculation service that suffered GC pauses during the same period. Why didn't the inventory service's latency metrics reflect this? Because the inventory service uses asynchronous calls, its own latency is not directly affected by the downstream service.
Phase 5 (90-120 minutes): A configuration error in the price calculation service was finally located, which caused frequent GC to be triggered in certain scenarios. After repairing the configuration, the system returned to normal.
This case exposed the systemic problems caused by data silos. Each individual system (Dynatrace, the logging platform, the metrics system) was working fine, but together the troubleshooting process was extremely inefficient. The deeper problem is that the true impact path of the failure (transaction service → inventory service → price calculation service) could not be fully presented in any single system. Dynatrace could see the call relationships but not the delayed impact propagated through asynchronous calls; the log platform had the error details but lacked trace context; the metrics system showed overall health but could not be tied to specific business scenarios.
Observability-driven governance (ODG) concept
From “govern first and then observe” to “observe first and then govern”
The traditional governance construction path is usually:
- Choose a service mesh (like Istio) or a governance framework (like Hystrix)
- Configure rules such as circuit breaking, rate limiting, and grayscale release
- Deploy and go online, monitor again when problems arise
The fundamental problem with this model is: the formulation of governance rules lacks data support. Should the circuit breaker threshold be 50% or 80%? Should the rate limit be 1000 or 2000 QPS? On what basis should the grayscale ratio be raised from 5% to 20%? These decisions usually rely on empirical values or stress-test results, but the actual load pattern in production differs greatly from the stress test.
The ODG philosophy advocates reverse operations:
- Build observability first: Deploy unified Metrics/Logs/Traces collection to ensure data is complete, relevant, and queryable
- Making decisions based on observation data: Analyze historical error rates, delay distributions, and capacity baselines to form a data-driven governance strategy
- Continuous Feedback Optimization: Dynamically adjust rules based on governance effects (verified by observability)
This is not a simple order swap, but a change in the way of thinking: governance is not a stack of static configurations, but a dynamic decision-making system based on real-time data.
Practical case: making governance decisions based on data
Case 1: Dynamic calculation of circuit breaker threshold
In an e-commerce project, the team initially set the circuit breaker threshold to a 50% error rate based on "industry experience". Frequent false trips occurred after going live: some dependencies were circuit-broken during normal fluctuations, hurting user experience.
After adopting the ODG method, one month of historical data was first collected to analyze the true error rate distribution of downstream dependencies. It was found that the error rate of most services has an obvious bimodal distribution: less than 1% in normal conditions, higher than 30% in abnormal conditions, and the middle 5%-20% interval rarely occurs.
This discovery is valuable. It shows that for most services, the error rate is either very low (normal) or very high (obvious failure), with few “sub-healthy” intermediate states. This means that the circuit breaker threshold does not need to be finely tuned - as long as it is set somewhere in the middle area (such as 20% or 30%), it will provide a good distinction between normal and abnormal.
Based on this data feature, the circuit breaker threshold is changed from a fixed value to a dynamic calculation:
threshold = max(historical error-rate P95 × 3, 30%)
Specifically, the system calculates the error-rate P95 over the past 7 days on a rolling basis and multiplies it by 3 to obtain the threshold, with a lower bound of 30%. This avoids false trips caused by normal fluctuations while still triggering quickly in genuinely abnormal situations.
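A minimal sketch of this rolling calculation, assuming error-rate samples are available as plain floats (function and variable names here are illustrative, not taken from any specific governance framework):

```python
import statistics

def dynamic_breaker_threshold(error_rate_samples, multiplier=3.0, floor=0.30):
    """Circuit-breaker threshold = max(P95 of recent error rates * multiplier, floor).

    error_rate_samples: per-interval error rates (0.0-1.0) from a rolling 7-day window.
    """
    if not error_rate_samples:
        return floor
    # statistics.quantiles with n=20 returns 19 cut points at 5% steps;
    # index 18 is the 95th percentile.
    p95 = statistics.quantiles(error_rate_samples, n=20)[18]
    return min(max(p95 * multiplier, floor), 1.0)

# A service whose error rate normally stays below 1%:
recent = [0.002, 0.004, 0.003, 0.008, 0.005, 0.001, 0.006]
print(dynamic_breaker_threshold(recent))  # 0.30 -- the floor dominates for a quiet service
```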
Result: false trips dropped from 5-6 per month to zero, and the circuit-breaker response time for real faults stayed within 10 seconds.
Case 2: Rate-limiting thresholds and capacity planning
Another project needed a rate limit on the payment interface. The traditional approach is to find the maximum QPS through stress testing and then apply a discount (such as 80%) as the rate-limiting value. But the stress-test environment differs greatly from production, so the right discount is hard to determine.
The ODG method is to first launch a version without rate limiting (or with a very high temporary threshold) and collect the real QPS pattern and resource usage through observability.
By analyzing a week’s monitoring data, we found:
- Daily peak QPS: 1200
- Peak QPS during the big promotion: 3500
- CPU usage reaches 70% at QPS 2800
- Memory usage starts to increase rapidly at QPS 3000
- Latency starts to increase significantly after QPS 2500
Based on this data, a tiered rate-limiting strategy was set up:
- Daily limit: 2500 QPS (the inflection point where latency starts to rise)
- Temporary increase during the big promotion: 3500 QPS (combined with capacity expansion)
- Emergency protection threshold: 4000 QPS (hard limit to prevent avalanches)
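As a rough illustration of how such a tiered check could be wired in, here is a minimal sliding-window sketch; the thresholds mirror the list above, while the class name and the mode-switching mechanism are hypothetical:

```python
import time
from collections import deque

class TieredRateLimiter:
    """Illustrative tiered limiter: daily, promotion, and emergency thresholds (QPS)."""

    LIMITS = {"daily": 2500, "promotion": 3500}
    EMERGENCY = 4000  # hard limit, applies regardless of the active mode

    def __init__(self, mode="daily"):
        self.mode = mode          # switched to "promotion" (with scaling) for big events
        self.window = deque()     # timestamps of accepted requests in the last second

    def allow(self):
        now = time.monotonic()
        while self.window and now - self.window[0] > 1.0:
            self.window.popleft()          # drop requests older than one second
        qps = len(self.window)
        if qps >= self.EMERGENCY or qps >= self.LIMITS[self.mode]:
            return False                   # reject: over the limit for the current mode
        self.window.append(now)
        return True

limiter = TieredRateLimiter(mode="daily")
print(limiter.allow())  # True until the 1-second window holds 2500 requests
```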
This strategy was verified in the subsequent big promotion: the system maintained low latency under daily load, and during the promotion it supported a peak of 3400 QPS through automatic scaling without a single rate-limit rejection.
Looking back at this case, the advantage of the ODG method over traditional methods is that it does not set thresholds based on theoretical models (such as pressure test results), but on real production data. Stress testing is often difficult to simulate real user behavior patterns (such as burst traffic, call hotspots of specific interfaces), and production data naturally contains these complexities. Of course, the premise of this approach is that the system has sufficient protection mechanisms (such as emergency hard limits) to withstand extreme traffic during the data collection stage.
Case 3: Automatic decision-making for grayscale releases
A common practice in grayscale releases is manual decision-making: the release operator watches for a while and expands the traffic proportion if nothing looks wrong. The problems with this approach are:
- Insufficient observation time may miss long-tail anomalies
- Inconsistent subjective judgment standards
- Rollback after problem discovery relies on manual operations
The ODG method implements automatic grayscale decision-making based on SLO (Service Level Objective):
Promotion conditions:
- Error rate < 0.1% (compared with the baseline version)
- P95 latency increase < 20%
- All conditions sustained for 30 minutes
Rollback conditions:
- Error rate > 0.5%
- P99 latency increase > 50%
- Any SLO metric abnormal for 5 consecutive minutes
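A minimal sketch of how these conditions might be evaluated by the release controller; the thresholds are taken from the lists above, while the CanaryMetrics structure and field names are purely illustrative:

```python
from dataclasses import dataclass

@dataclass
class CanaryMetrics:
    error_rate: float            # error rate of the grayscale version
    p95_growth: float            # P95 latency growth vs. the baseline (0.15 = +15%)
    p99_growth: float            # P99 latency growth vs. the baseline
    healthy_minutes: int         # consecutive minutes all promotion conditions have held
    slo_violation_minutes: int   # consecutive minutes any SLO metric has been abnormal

def canary_decision(m: CanaryMetrics) -> str:
    # Rollback conditions are checked first: fail fast on clear regressions.
    if m.error_rate > 0.005 or m.p99_growth > 0.50 or m.slo_violation_minutes >= 5:
        return "rollback"
    # Promotion requires every condition to hold for 30 consecutive minutes.
    if m.error_rate < 0.001 and m.p95_growth < 0.20 and m.healthy_minutes >= 30:
        return "promote"
    return "hold"  # keep observing at the current traffic ratio

print(canary_decision(CanaryMetrics(0.0004, 0.08, 0.12, 35, 0)))  # promote
```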
These thresholds were not picked arbitrarily; they are based on historical release data: the past 20 successful releases and 3 rollback cases were analyzed to find the statistical boundaries that separate normal fluctuations from true anomalies.
After implementing this mechanism, the average time for grayscale release is shortened from 4 hours to 45 minutes (automatic advancement), while the rollback response time is shortened from manual 15-30 minutes to automatically triggered within 5 minutes.
ODG Core Principles
Through the above practices, several core principles of ODG can be summarized:
1. Data comes before rules
The formulation of any governance rules must be supported by data. Don’t set thresholds without knowing “what normal looks like”. Don’t design a circuit breaker strategy without knowing “how exceptions behave”.
2. Relevance is better than completeness
Rather than pursuing 100% collection of a single data type (such as full Traces), it is better to ensure correlation between Metrics, Logs, and Traces. A 1% sampled trace that can be linked to a business order number is more valuable than isolated 100% sampling.
3. Dynamic is better than static
Load patterns in production environments change over time (business growth, seasonal fluctuations, architecture evolution). Governance rules should be dynamically adjusted based on recent data rather than configured once and for all.
4. Only if it can be verified can it be trusted
The effectiveness of governance measures must be verifiable through observability. If you can’t prove that the circuit breaker strategy actually reduces the scope of the failure, it may just be masking the problem.
OpenTelemetry: The value of unified standards
Not just “another SDK”
When OpenTelemetry (OTel) first appeared, many people simply understood it as “a collection SDK” - used to replace scattered components such as Prometheus Client, Logback Appender, and Jaeger Client. This understanding underestimates the value of OTel.
The true meaning of OTel lies in standardization: unified data model, unified collection protocol, and unified context propagation mechanism. This standardization solves a core problem in the era of enterprise-grade CF platforms—data silos.
Figure 1: Observability context bridging (from scattered signals to governance decisions)
Unified Data Model (OTLP)
Before OTel, Metrics had Prometheus format, StatsD format, and InfluxDB format; Logs had Syslog, JSON, and various custom formats; Traces had Zipkin format, Jaeger format, and vendors’ private formats. Each format has its own SDK, collector, and storage solution.
OTLP (OpenTelemetry Protocol) defines a unified protocol and data model, covering three types of data: Metrics, Logs, and Traces. Key design features include:
Unified attribute system: All data types support Key-Value Attributes, making cross-type association possible. For example, a Metric and a Log can be associated through the same service.name and deployment.environment attributes.
Resource concept: Data points are associated with a Resource, describing the entity (service instance, host, container) that generates the data. This provides a unified basis for cross-service and cross-level data association.
Timing Consistency: All data contains accurate timestamps and is based on the same clock base, making time-aligned queries more reliable.
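A minimal sketch of the shared-Resource idea using the OpenTelemetry Python SDK; the attribute values are invented for illustration, and the exact module paths should be checked against the SDK version in use:

```python
from opentelemetry import trace, metrics
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.metrics import MeterProvider

# One Resource describes the emitting entity; every signal carries it,
# so traces, metrics, and logs from this process can be joined on these keys.
resource = Resource.create({
    "service.name": "payment-service",
    "service.version": "1.4.2",
    "deployment.environment": "production",
})

trace.set_tracer_provider(TracerProvider(resource=resource))
metrics.set_meter_provider(MeterProvider(resource=resource))

tracer = trace.get_tracer("payment")
meter = metrics.get_meter("payment")
request_counter = meter.create_counter("payment.requests")

with tracer.start_as_current_span("charge") as span:
    span.set_attribute("order.id", "20240101-0001")  # high-cardinality data on the span
    request_counter.add(1, {"result": "success"})    # low-cardinality data on the metric
```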
Unified collection pipeline: Collector architecture
OTel Collector is the hub of the entire system. It is not a simple data forwarder, but a programmable data processing Pipeline:
Receivers → Processors → Exporters
Figure 2: OTel Collector unified collection pipeline (receiving, processing, exporting and back-end consumption)
Receivers: Supports multiple input formats such as OTLP, Prometheus, Jaeger, Zipkin, etc., and can be used as a compatibility layer for old systems.
Processors: Provides filtering, sampling, conversion, batch processing and other capabilities. For example, Tail-based Sampling (sampling based on the entire link result) can be implemented here, solving the problem of “abnormal requests being missed by sampling” in the era of enterprise-level CF platforms.
Exporters: Send processed data to various backends (Prometheus, Elasticsearch, Jaeger, cloud vendor’s APM, etc.).
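The tail-based sampling mentioned under Processors deserves a concrete illustration. In the real Collector this is configured declaratively, but the decision logic amounts to something like the following conceptual sketch (span fields and thresholds are invented for illustration):

```python
import random

def keep_trace(spans, latency_threshold_ms=1000, baseline_rate=0.01):
    """Tail-based sampling: decide after the whole trace is complete.

    Unlike head-based sampling, the decision can consider the outcome of the
    trace, so error traces and slow traces are never dropped.
    """
    has_error = any(s.get("status") == "ERROR" for s in spans)
    duration_ms = max(s["end_ms"] for s in spans) - min(s["start_ms"] for s in spans)
    if has_error or duration_ms > latency_threshold_ms:
        return True                               # always keep errors and slow traces
    return random.random() < baseline_rate        # plus a small baseline of normal traffic

trace_spans = [
    {"status": "OK", "start_ms": 0, "end_ms": 180},
    {"status": "ERROR", "start_ms": 20, "end_ms": 75},
]
print(keep_trace(trace_spans))  # True: the trace contains an error span
```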
The value of Collector lies in decoupling: the application only needs to implement the OTel SDK once and send it to the Collector through OTLP; the back-end storage can be flexibly switched or used in parallel without affecting the application code.
In industry practice, Collector configuration changes with the environment:
- Development environment: simplified configuration, output directly to the console and local Jaeger
- Test environment: Enable full collection and store in temporary Elasticsearch
- Production environment: Enable intelligent sampling and output to an enterprise-level observable platform
The application code does not need to be changed at all.
Another benefit of this decoupled architecture is ease of A/B testing and migration. When we want to evaluate a new storage backend (such as migrating from self-built Elasticsearch to a managed observability platform), we only need to add a new Exporter in Collector to allow data to be written to both the old and new backends at the same time. After a period of parallel operation and comparison verification, decide whether to switch completely. The entire process is transparent to the application and risks are controllable.
Unified context propagation: automatic Trace Context propagation
This is the key mechanism for OTel to solve the “data association” problem.
In the era of enterprise-level CF platforms, cross-system Trace ID transfer relies on manual implementation of each service. Some services put the Trace ID in the HTTP Header (X-Trace-ID or X-B3-TraceId), some put it in the message attribute of the message queue, and some do not send it at all. The formats are also inconsistent: some use UUIDs, some use 64-bit integers, and some use hexadecimal strings.
OTel implements the standard W3C Trace Context specification:
- traceparent header: contains the Trace ID, Span ID, and sampling flags
- tracestate header: carries vendor-specific extended information
More importantly, the OTel SDK handles context propagation automatically:
- HTTP Client automatically injects Trace Context Header
- HTTP Server automatically extracts and continues Trace
- Message queue Producer/Consumer automatically handles message properties
- Asynchronous tasks (thread pool, coroutine) automatically transfer context
This means that as long as all services are connected to OTel, end-to-end link tracing can work automatically without the need for business code to manually pass the Trace ID.
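For transports the SDK does not auto-instrument (an in-house queue, a custom RPC layer), the same standard propagation can be applied by hand through the SDK's inject/extract entry points. A minimal sketch, using a plain dict as a stand-in for the message headers and assuming a TracerProvider has already been configured:

```python
from opentelemetry import trace
from opentelemetry.propagate import inject, extract

tracer = trace.get_tracer("demo")

# Producer side: write the current trace context into the outgoing message headers.
message = {"body": "create-order", "headers": {}}
with tracer.start_as_current_span("publish-order"):
    inject(message["headers"])   # adds the W3C traceparent (and tracestate) keys

# Consumer side: restore the context so the consumer span joins the same trace.
ctx = extract(message["headers"])
with tracer.start_as_current_span("consume-order", context=ctx):
    pass  # downstream calls made here carry the original Trace ID
```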
Concept change from “complete data” to “strong correlation”
OTel drives an important shift in thinking: the goal of observability is not to collect all data, but to establish strong correlations between data.
In the era of enterprise-level CF platforms, the industry pursues “completeness” - full logs, full indicators, and the highest possible trace sampling. The result is an explosion in data volume, and correlation remains difficult.
OTel’s practice shows that relevance is more important than completeness:
- A 1% sampled trace with full business context (user ID, order number, action type) is more valuable than 100% sampling with no correlation
- One log associated with a Trace ID is more valuable than ten isolated logs
- A Metrics peak that can drill down to a specific Trace is more valuable than a beautiful aggregate chart.
This shift is reflected in OTel’s design:
- Logs support carrying Trace Context (trace_id and span_id fields), enabling Log-Trace association
- Metrics support Exemplars (representative sample points) that can be linked to specific Traces
- All data types share the same Resource and Attributes to achieve cross-type association
In our project, after implementing Log-Trace correlation, troubleshooting efficiency improved by about 60%. Previously you had to "find the error in Logs → extract the Trace ID → switch to the trace system and query". Now you can jump straight to the trace view by clicking the Trace ID on the log platform.
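One common way to get this correlation in Python is a logging filter that stamps the current trace context onto every record; a minimal sketch, assuming a TracerProvider is configured and using field names in the spirit of the OTel log model:

```python
import logging
from opentelemetry import trace

class TraceContextFilter(logging.Filter):
    """Attach the current trace_id/span_id to every log record."""
    def filter(self, record):
        ctx = trace.get_current_span().get_span_context()
        record.trace_id = format(ctx.trace_id, "032x") if ctx.is_valid else ""
        record.span_id = format(ctx.span_id, "016x") if ctx.is_valid else ""
        return True

handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter(
    '{"ts":"%(asctime)s","level":"%(levelname)s","msg":"%(message)s",'
    '"trace_id":"%(trace_id)s","span_id":"%(span_id)s"}'))
logger = logging.getLogger("payment")
logger.addHandler(handler)
logger.addFilter(TraceContextFilter())
logger.setLevel(logging.INFO)

logger.info("charge failed, retrying")  # emits JSON that carries trace_id/span_id
```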
2024-2026 New Trends in Observability
eBPF-based non-invasive observation
eBPF (Extended Berkeley Packet Filter) technology is changing the way observability is implemented. The traditional model requires integrating the SDK into the application, while eBPF can collect data at the kernel level to achieve truly non-intrusive observation.
Cilium (eBPF-based Kubernetes network solution) and Tetragon (eBPF-based security observation tool) are representatives in this field. They can observe:
- Network-level request flow (HTTP, gRPC, database protocols)
- Process-level system calls and file operations
- Security-related events (privilege escalation, access to sensitive files)
Advantages:
- No modifications to application code or configuration are required
- Covering third-party components that cannot be instrumented
- Relatively low performance overhead (kernel-level processing)
Limitations:
- The application layer has limited semantic information (for example, the business transaction ID cannot be automatically identified)
- Requires newer kernel version support
- Complex security review (kernel-level code)
In industry practice, eBPF observations supplement the OTel SDK: the SDK is responsible for application layer semantics (business context, custom attributes), and eBPF is responsible for network layer and system layer observations. The two are combined to achieve full-stack observability from the kernel to the business.
Continuous Profiling
Traditional performance profiling (Profiling) is usually event-driven: after a performance problem is discovered, sampling is manually triggered to analyze hot functions. Continuous Profiling is continuous low-frequency collection (such as 1% CPU sampling), which can provide performance insights at any time.
Google's "Google-Wide Profiling" paper (2010) describes this practice: continuously collecting profiling data from the production environment for capacity planning and performance optimization. Since 2024, open source solutions such as Parca and Pyroscope have made this capability widely accessible; by 2026, OpenTelemetry Profiles enters public Alpha, which means profiling is being incorporated into the unified observability semantics rather than remaining a performance-analysis capability implemented independently by each tool.
Core Value:
- You can query “performance characteristics at a certain point in time in the past” at any time
- Compare the performance differences before and after version release
- Discover progressive performance degradation (instead of waiting for an alarm to trigger)
In industry practice, Continuous Profiling helps discover several types of problems that are difficult to catch with traditional monitoring:
- A memory allocation hotspot introduced by a certain release caused the GC frequency to slowly increase.
- Regular expression performance traps in specific business scenarios
- Unexpected overhead caused by third-party library upgrades
AI-driven root cause analysis
Root cause analysis based on large language models (LLM) and machine learning (ML) is a hot topic in 2024-2026. Directions include:
Log anomaly detection: Use ML models to identify changes in log patterns and automatically discover abnormal patterns (not just matching known ERROR keywords).
Trace anomaly localization: train models on historical trace data to identify "suspect components" in abnormal traces.
Multimodal correlation analysis: Combine Metrics, Logs, Traces and external events (release records, configuration changes) to use LLM to generate root cause hypotheses.
Situation Assessment:
- Auxiliary analysis tools are already practical (such as automatically classifying similar error logs)
- Fully automatic root cause location is still immature and has a high false positive rate.
- LLMs perform better when processing structured observability data, but complex scenarios still require human judgment.
The typical approach is to use AI analysis as a “preliminary screening” tool to help engineers narrow down the scope of troubleshooting, rather than completely replacing manual work. For example, AI can prompt “This failure may be related to a configuration change 2 hours ago”, and engineers can then conduct in-depth analysis based on this clue.
Fusion observation with Service Mesh
Service Mesh (such as Istio, Linkerd) itself provides rich traffic observation data. The trend from 2024 to 2026 is the deep integration of Service Mesh and OTel ecology:
Envoy natively supports OTel: Envoy can directly output Access Log and Metrics in OTel format without the need for additional conversion layers.
eBPF + Sidecar mixed mode: Sidecar is responsible for application layer semantics, eBPF is responsible for network layer observation, and the data is uniformly aggregated to OTel Collector.
Traffic governance and observation closed loop: Based on the observable data of Service Mesh, automatically adjust traffic routing (such as intelligent routing based on delay data).
In a typical project, such a closed loop is implemented: delay data collected by Service Mesh → OTel Collector → real-time analysis → dynamic adjustment of load balancing weights. This “observation-driven governance” architecture allows the system to automatically avoid slow nodes without manual intervention.
The key to this practice is the freshness of the latency data. In traditional pipelines there is often a lag of several minutes, sometimes more than ten, between data collection and the analysis that drives a decision. For rapidly evolving faults (such as network jitter or node overload), that lag makes automatic decision-making meaningless. The optimization here was to shorten the data path: the Service Mesh exposes metric endpoints directly, and the real-time analysis component pulls the data, bypassing the Collector's batching and queueing delays. This increases coupling, but the trade-off is necessary for scenarios that require millisecond-level response (such as automatic route adjustment).
Observability as a foundation for governance
Linkage with other governance capabilities
Observability does not exist independently. It forms a close linkage with other capabilities of microservice governance.
Figure 3: Observability-driven governance closed loop (signaling, decision-making, execution and feedback)
Detailed explanation of linkage scenarios
Observability ↔ Traffic Governance
Scenario: Intelligent routing decisions
Traditional load balancing is based on simple round robin or random strategies and does not take into account the actual status of the backend service. Intelligent routing based on observability can adjust weights based on real-time latency data.
Implementation method:
- OTel SDK collects the delay of each request and associates it to the specific backend instance through Exemplar
- Aggregation analysis identifies “slow nodes” (instances with latency consistently above P95)
- The control plane of Service Mesh reduces the weight of slow nodes based on the analysis results.
- Continuously monitor the effects of adjustments to form a closed loop
In typical projects, this mechanism helps avoid multiple cascading delays caused by network jitter. When a network problem occurs in a certain availability zone, intelligent routing can automatically reduce the traffic in the zone within 30 seconds, avoiding manual intervention.
An important detail of this mechanism is the definition of “slow nodes”. Initially a simple threshold was used (e.g. latency > 500ms), but this was found to give false positives - it was normal for some nodes to have high latency occasionally (e.g. during GC pauses). Later, it was changed to statistical judgment based on duration: only nodes with a delay higher than P95 for 5 consecutive sampling periods are considered “slow nodes”. This adjustment significantly reduced the false positive rate, from 3-4 false adjustments per month to almost zero.
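A minimal sketch of that duration-based judgment; the five-period window and the P95 comparison come from the description above, while the data structures are illustrative:

```python
from collections import defaultdict, deque

WINDOW = 5  # consecutive sampling periods required before a node is flagged

# For each backend instance, remember whether it was above the fleet P95 in
# each of the last WINDOW sampling periods.
history = defaultdict(lambda: deque(maxlen=WINDOW))

def flag_slow_nodes(latency_by_node, fleet_p95_ms):
    """Record one sampling period; return nodes slow for WINDOW periods in a row."""
    slow = []
    for node, latency_ms in latency_by_node.items():
        history[node].append(latency_ms > fleet_p95_ms)
        if len(history[node]) == WINDOW and all(history[node]):
            slow.append(node)  # candidate for weight reduction in the mesh
    return slow

# Five sampling periods in which node-c stays well above the fleet P95:
for _ in range(WINDOW):
    flagged = flag_slow_nodes({"node-a": 120, "node-b": 140, "node-c": 900},
                              fleet_p95_ms=450)
print(flagged)  # ['node-c']
```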
Observability ↔ Resilient Fault Tolerance
Scenario: Adaptive circuit breaker
Traditional circuit breakers are based on fixed thresholds (e.g. error rate > 50%). Adaptive circuit breaker dynamically adjusts the threshold based on historical data.
Implementation method:
- Collect historical error rate distribution of downstream dependencies (normal baseline vs abnormal state)
- Calculate dynamic thresholds based on statistical models such as the 3-sigma principle
- Real-time Metrics trigger circuit breaker judgment
- Circuit breaker events are recorded in Logs to facilitate subsequent analysis.
One detail: The calculation of the circuit breaker threshold needs to consider the business cycle. For example, the error rate of payment services may fluctuate normally during promotion periods. The adaptive algorithm needs to identify this periodic pattern to avoid false circuit breakers.
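A minimal sketch of the 3-sigma calculation itself; it deliberately ignores the business-cycle caveat just mentioned, which a production version would address, for instance, by keeping separate baselines for promotion and non-promotion periods:

```python
import statistics

def adaptive_error_threshold(baseline_error_rates, floor=0.05, cap=0.95):
    """Threshold = mean + 3 * stddev of recent 'normal period' error-rate samples."""
    mean = statistics.fmean(baseline_error_rates)
    stdev = statistics.pstdev(baseline_error_rates)
    return min(max(mean + 3 * stdev, floor), cap)

# Normal-period samples with the error rate hovering around 0.5%:
samples = [0.004, 0.006, 0.005, 0.007, 0.005, 0.004, 0.006]
print(adaptive_error_threshold(samples))  # 0.05 -- the floor dominates for a quiet service
```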
Observability ↔ Release Governance
Scenario: SLO driven automatic grayscale
Grayscale release decision-making is upgraded from "human observation" to "data-driven":
- Define healthy SLOs for new releases (e.g. error rate < 0.1%, P95 latency growth < 20%)
- Compare the metrics of the baseline version and the grayscale version in real time through the observability platform
- Automatically expand the grayscale ratio when the conditions are met, and automatically roll back when the conditions are not met.
- All decisions and metrics of the release process are logged to Traces (via span events)
In a typical project, the key to this mechanism is baseline selection: which version's metrics should the grayscale version be compared against? Comparing directly against the stable production version can be distorted by other factors (such as traffic fluctuations). The approach taken was to deploy, alongside the grayscale version, a "shadow baseline" at the same traffic proportion (receiving the same traffic without its results being used), so that the comparison is fair.
Data quality determines governance effectiveness
All these linkages have a prerequisite: The quality of the observability data is high enough. Missing data, inaccurate delays, and incorrect labels can all lead to incorrect governance decisions.
The author summarized several data quality guidelines:
Accuracy of latency measurement: Client-side and server-side latency definitions can differ (is network transport time included?). In linkage scenarios, a unified measurement method must be agreed on.
Refinement of error classification: HTTP 5xx and 4xx have different processing strategies, and the circuit breaker thresholds for network timeouts and business errors may be different. The fine classification of error codes directly affects the governance effect.
Tag consistency: The naming conventions of labels such as service.name, version, and environment must be strictly implemented, otherwise cross-service correlation queries will fail.
Coordination of Sampling Strategies: If some services use 100% Trace sampling and others use 1%, cross-service links may be broken. Sampling strategies need to be coordinated globally.
Architect Insights: Common Misunderstandings in Observability Construction
After these years of practice, this article summarizes several experiences in observability construction and has also witnessed many misunderstandings.
Misunderstanding 1: Pursuing a "big and comprehensive" monitoring screen
Many companies set out to build an “integrated monitoring platform” from the beginning, pursuing beautiful dashboards and full data display. The result is that a lot of effort is invested in UI and presentation, but data quality and relevance are neglected.
The author's suggestion: first solve "can a fault be located quickly", and only then consider "does the display look good". A simple log query interface that can link to a Trace ID is more valuable than an isolated, flashy dashboard.
One industry case: a company spent half a year building an "enterprise-level observability platform" with beautiful dashboards, drag-and-drop charts, and all kinds of visual components. After it went live, engineers still had to jump between multiple systems to troubleshoot, because the platform merely aggregated different data sources for display and did not solve data correlation: Traces and Logs were not automatically linked, and Metrics could not drill down to specific request traces. In the end the platform became decoration, and engineers went back to the native tools they knew.
Misunderstanding 2: Trace ID propagation is not standardized
Trace ID propagation is the foundation of distributed tracing, but problems are common in practice:
- Some services fail to pass the Trace ID on, breaking the trace
- Asynchronous tasks (scheduled jobs, message consumption) do not continue the Trace Context
- Third-party callbacks (such as payment gateway callbacks) do not carry Trace IDs
The author's experience: Trace context propagation needs to be part of the code review checklist, especially for asynchronous and callback scenarios. In addition, establish trace-integrity monitoring: regularly check the rate of broken traces so propagation problems are discovered in time.
Misunderstanding 3: The log format is confusing
The log formats of different teams and different services vary greatly: some use plain text, some use JSON; timestamp formats are not uniform (ISO8601 vs Unix milliseconds vs custom formats); log level definitions are inconsistent (does ERROR contain business exceptions?).
This confusion makes cross-service log correlation queries very difficult. The author’s suggestion is to add log format verification to the CI/CD process and make structured logs (JSON) and unified field specifications mandatory.
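A minimal sketch of such a CI check, assuming each service contributes a sample file with one JSON log record per line; the required field names follow the contract idea described later in this article and are otherwise arbitrary:

```python
import json
import sys

REQUIRED_FIELDS = {"timestamp", "level", "service", "message", "trace_id"}
ALLOWED_LEVELS = {"DEBUG", "INFO", "WARN", "ERROR", "FATAL"}

def validate_log_sample(path):
    problems = []
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, 1):
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                problems.append(f"line {lineno}: not valid JSON")
                continue
            missing = REQUIRED_FIELDS - record.keys()
            if missing:
                problems.append(f"line {lineno}: missing fields {sorted(missing)}")
            if record.get("level") not in ALLOWED_LEVELS:
                problems.append(f"line {lineno}: unknown level {record.get('level')!r}")
    return problems

if __name__ == "__main__":
    issues = validate_log_sample(sys.argv[1])
    for issue in issues:
        print(issue)
    sys.exit(1 if issues else 0)  # fail the CI step when the contract is violated
```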
Misunderstanding 4: Metric labels are designed haphazardly
The design of metric labels (Prometheus labels) directly affects query capability and storage cost. Common problems:
- Label cardinality is too high (such as using user ID as a label), causing the number of time series to explode.
- Inconsistent label naming (service_name vs service.name) prevents queries from aggregating
- Key dimensions are missing (for example, no version label to distinguish metrics from different versions)
The author's experience: Write a metric label specification that defines the mandatory labels (service, version, environment, instance, etc.) and naming conventions. High-cardinality dimensions (such as user ID or order number) belong in Logs or Traces, not in Metrics labels.
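A minimal sketch of the recommended split using the Prometheus Python client; the metric name, label set, and values are illustrative, not a prescribed standard:

```python
import logging
from prometheus_client import Counter

log = logging.getLogger("payment")

# Only low-cardinality dimensions become labels: service, version, environment,
# and status code are all bounded sets.
REQUESTS = Counter(
    "http_requests_total",
    "Total HTTP requests",
    ["service", "version", "environment", "status_code"],
)

def record_request(status_code: int, order_id: str) -> None:
    REQUESTS.labels(
        service="payment-service",
        version="1.4.2",
        environment="production",
        status_code=str(status_code),
    ).inc()
    # High-cardinality identifiers (order ID, user ID) go to logs or traces instead.
    log.info("request completed, order_id=%s", order_id)
```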
Misunderstanding 5: Ignoring the observability of observability itself
This is an easily overlooked issue: the observability systems themselves (collectors, storage, query services) also need to be monitored. If missing observation data goes unnoticed, the result is "believing the system is healthy while actually flying blind".
A typical approach is to establish "meta-monitoring": monitor the collector's collection rate, the storage layer's write latency, and the availability of the query service. When meta-monitoring detects an anomaly, a high-priority alert is triggered.
Key Insight: The Importance of Contracts
Among all these practices, the author believes that the most important one is Contract:
Trace Context propagation contract: defines how the Trace ID is carried in HTTP headers, message queues, and asynchronous tasks; all services must comply.
Log format contract: Defines the field specifications, timestamp format, and log level semantics of structured logs.
Metric label contract: defines the mandatory labels, naming conventions, and cardinality control principles.
Alarm classification contract: Define response time requirements and upgrade paths for different severity levels.
These contracts need to be documented, tooled, and automatically verified. Observability without contracts is chaotic and cannot support governance decisions.
Conclusion
From enterprise-level CF platforms to cloud native, observability cognition has evolved from “tool stacking” to “data correlation” to “governance-driven”.
Observability-driven governance (ODG) is not a specific tool or architecture, but a way of thinking: Basing governance decisions on high-quality, strongly correlated observation data. OpenTelemetry provides a standard path to achieve this goal, but tools are only means. The key lies in whether the organization is willing to invest resources in building data quality, developing contract specifications, and cultivating engineers’ data-driven thinking.
From 2024 to 2026, new technologies such as eBPF, Continuous Profiling, and AI root cause analysis are expanding the boundaries of observability. But the same principle remains true: the value of observability lies not in “seeing more”, but in “understanding faster”. Only when the system is understood can governance be based on evidence.
Looking back at the evolution from 2015 to 2026 (to date), from data silos in the era of enterprise-level CF platforms, to the unified standard of OpenTelemetry in the cloud-native era, to AI-driven intelligent analysis, the core goal of observability has never changed: to enable engineers to quickly and accurately understand system behavior. The evolution of technology is only making this goal more achievable.
For companies that are building observability, my suggestion is: first establish unified data collection standards to ensure that data can be correlated; then gradually optimize governance decisions based on these data; and finally introduce advanced analysis tools and AI capabilities. Without a high-quality data foundation, no matter how advanced analysis tools are, it will be difficult to function.
In the next article, I will discuss traffic governance—how to achieve refined traffic control and intelligent routing based on observability data.
About the author
Milome has more than ten years of experience in enterprise-level architecture design. He has served as a senior architect for enterprise-level CF platforms and has led the architecture design and implementation of multiple large-scale microservice platforms. Currently focusing on the research and practice of cloud native technology architecture and governance system.
Series context
You are reading: From enterprise-level CF platform to cloud native: more than ten years of evolution of enterprise-level microservice governance
This is article 2 of 6.
- From enterprise-level CF platform to cloud native (1): Architect's review - the gains and losses of microservice governance in the era of enterprise-level CF platform Based on the front-line architecture practice of enterprise-level CF platforms from 2015 to 2020 and industry observations from 2015 to 2026 (to date), we review the microservice governance design decisions in the Cloud Foundry era and analyze which ones have withstood the test of time and which ones have been reconstructed by the cloud native wave.
- From enterprise-level CF platform to cloud native (2): Observability-driven governance—from monitoring large screens to precise decision-making systems With 6 years of practical experience as an enterprise-level platform architect, we analyze the core position of observability in microservice governance, from data islands to OpenTelemetry unified standards, and build a governance system for accurate decision-making.
- From enterprise-level CF platform to cloud native (3): The evolution of traffic management - from Spring Cloud Gateway to Gateway API and Ambient Mesh Review the practice of Spring Cloud Gateway in the enterprise-level CF platform, analyze the standardization value of Kubernetes Gateway API, explore the evolution logic from Service Mesh to Ambient Mesh, and provide a decision-making framework for enterprise traffic management selection.
- From enterprise-level CF platform to cloud native (4): Redefining elastic fault tolerance—from Hystrix to adaptive governance Review Hystrix's historical position in microservice elastic governance, analyze Resilience4j's lightweight design philosophy, explore new paradigms of adaptive fault tolerance and chaos engineering, and provide practical guidance for enterprises to build resilient systems.
- From enterprise-level CF platform to cloud native (5): The evolution of release governance—from manual approval to progressive delivery Review the manual approval model of traditional release governance, analyze the evolution of blue-green deployment and canary release, explore the new paradigm of GitOps and progressive delivery, and provide practical guidance for enterprises to build an efficient and secure release system.
- From enterprise-level CF platform to cloud native (6): Summary—an architect’s perspective on enterprise-level microservice governance Review the evolution of microservice governance over the past ten years from 2015 to 2026 (to date), refine the first principles of architects, summarize the implementation paths and common pitfalls of enterprise-level governance, look forward to future trends, and provide a systematic thinking framework for technical decision-makers.