
From enterprise-level CF platform to cloud native (5): The evolution of release governance—from manual approval to progressive delivery

Review the manual approval model of traditional release governance, analyze the evolution of blue-green deployment and canary release, explore the new paradigm of GitOps and progressive delivery, and provide practical guidance for enterprises to build an efficient and secure release system.

Published: 3/5/2026 · Category: guide · Reading time: 45 min read


More than ten years ago, the production launch of many companies was still a highly ritualized collective action.

One night in the late autumn of 2015, the conference room of an enterprise-level R&D center was brightly lit. An important module was scheduled to go live at 11 p.m. Development, testing, operations, and project managers were all on site, and architects from the overseas headquarters had joined through a remote meeting. Going live was not a simple code deployment but a complete change process: submit a CAB (Change Advisory Board) change application two weeks in advance, complete the impact assessment one week in advance, freeze the code three days before, and execute a 64-item checklist one by one on the day of release.

This was not the peculiar practice of a few teams, but a typical microcosm of how many large enterprises handled production changes at the time. Stability came first, process came first, manual confirmation came first. Every launch was like a small battle.

As with many launches of that kind, this one did not go smoothly. After switching to the new version, the monitoring system reported a strange performance anomaly. The atmosphere in the conference room froze instantly. After four hours of investigation, the team found that a configuration issue in the new version could not be reproduced in the test environment and was only triggered under the data pressure of production. In the end the team could only roll back, and the entire release window was pushed back by two weeks.

From the perspective of technological evolution, although the traditional release governance model pursues ultimate risk control in theory, its essence is to regard release as a “dangerous event” rather than a “daily activity”. This way of thinking is naturally inconsistent with the rapid iteration required by microservice architecture.

Looking back at the evolution of release management by 2026, we can see a profound paradigm shift: from manual approval to automated judgment, from all or nothing to progressive delivery, from post-remediation to pre-emptive prevention. This article will review the key technologies and practices in this evolution process, and share industry observations and insights accumulated from the practices in the era of enterprise-level CF platforms to subsequent cloud native projects.

1. Dilemmas of the traditional release governance model

1.1 The birth and evolution of the CAB model

The concept of Change Advisory Board (CAB) comes from the ITIL framework, and its original intention is to use a centralized approval mechanism to evaluate, authorize, and control all changes that may affect the production environment. Judging from the practice in the era of enterprise-level CF platforms, this system has been in operation for many years and has formed an extremely strict process system.

A typical CAB approval process includes the following steps:

The first is the change application stage. The development team needs to fill in a detailed change application form, including change scope, impact analysis, rollback plan, test report, risk assessment, etc. In the early days of practice in the industry, it would take half a day just to fill out this form, because each item needed to be accurate to the specific modules and interfaces.

Then comes the Impact Assessment. CAB members include representatives from different departments: operations, security, compliance, business, etc. Everyone needs to review the change request from their own professional perspective. This kind of cross-departmental collaboration can theoretically comprehensively identify risks, but in practice it often evolves into a formalism of “review for the sake of review.”

Next is the approval decision. CAB meetings are usually held weekly and change requests need to be discussed on a case-by-case basis at the meeting. Judging from the practice in the era of enterprise-level CF platforms, the most extreme situation is: a functional change of medium complexity was delayed for three weeks before approval due to schedule conflicts among CAB members.

Finally there is the release window. Once approved, changes can only be executed within a predetermined time window. These windows are usually scheduled during low-traffic hours (weekends or late at night), which means the release team has to sacrifice personal time to get the work done.

1.2 The paradox between risk control and delivery efficiency

The core assumption of the CAB model is that more stringent approvals mean lower risks. However, industry observations suggest a different story.

In 2016, a team was responsible for the reconstruction of a core financial module. Following the CAB process, the team began preparing release materials a month in advance. However, as the release date approached, the business side suddenly raised a new compliance requirement that required adjustments to some of the logic. The adjustment was not technically complex, but it had missed the deadline for the CAB meeting. The choice the team faced was: either release as planned and risk a compliance violation, or go through the CAB process again and delay for at least two weeks.

This “last-minute change” scenario is common in traditional release management. The fundamental reason is: the approval cycle does not match the development cycle. When the approval process takes weeks, both business requirements and technical implementations continue to change, and misalignment between the two is almost inevitable.

The deeper problem is that the CAB model equates “risk control” with “the number of approval links.” But in fact, the effectiveness of risk control depends on three factors: completeness of information, timeliness of decision-making and controllability of execution. The CAB model invests too much in the first factor to the detriment of the second, and pays insufficient attention to the third factor (controllability of the execution phase).

1.3 The Curse of the Release Window

The release window system is a supporting measure of the CAB model. The logic is: limiting release frequency reduces risk exposure. However, this design comes with serious side effects.

The first is batch release. Since release opportunities are scarce, teams tend to package multiple changes into the same release window. This increases the complexity and risk surface of each release. Judging from the practice in the era of enterprise-level CF platforms, one release window once contained changes to more than twenty modules, and it took two days just to merge code conflicts.

Second is release fatigue. Releasing late at night or on weekends seriously erodes a team’s work-life balance. Under a traditional release window, some teams would release as often as twice a month, working from 10 p.m. to 3 a.m. each time. This unsustainable model eventually led to the loss of key personnel.

Finally there is Feedback Delay. When the release cycle is measured in weeks or months, the development team’s cycle from code submission to getting feedback from the production environment is seriously stretched. This makes finding and fixing problems exponentially more expensive.

In early 2017, the industry began to think: Is there a better way to control risks while improving delivery efficiency? The answer to this question has led the industry to a world of blue-green deployment and canary releases.

2. Blue-green deployment: the starting point for zero-downtime release

2.1 Principle and implementation mechanism

The concept of Blue-Green Deployment is not new and can be traced back to Martin Fowler’s writing around 2010. But it wasn’t until cloud computing became popular that this model truly became a viable option.

The core idea of blue-green deployment is simple: Maintain two sets of identical production environments (blue and green), and only one set of environments serves the outside world at any time. When a new version needs to be released, first deploy the new version on an inactive environment (such as green). After verification, the traffic is switched to the green environment and the original blue environment becomes inactive. If an abnormality is found, you can quickly switch back to the blue environment.


Figure 1: Four-step closed loop of blue-green deployment (current, deployment, switch and rollback)

From an implementation perspective, blue-green deployment needs to solve three technical problems:

Environmental Consistency. The blue and green environments must be identical in hardware configuration, software versions, network topology, and so on. Any discrepancy may cause a change that was validated in the green environment to fail once it receives production traffic. In early enterprise-grade CF platform practice, infrastructure as code (IaC) was used to ensure this consistency: the configuration of each environment is stored in a version control system and is created and destroyed through automated tools such as Terraform.

Data layer processing. This is the most complicated part. When the two environments share the same database, you need to ensure that both the old and the new version of the code can process the data correctly. The problem becomes even thornier if the new version involves database schema changes. This issue is discussed in detail in section 2.3.

Traffic switching mechanism. The load balancer or reverse proxy needs to support fast traffic switching. In practice, we typically do this using DNS switching, load balancer reconfiguration, or routing rules at the API gateway layer. Switching can be granular down to individual users, making progressive rollout possible.

2.2 Blue-green practices of early enterprise-level CF platforms

In 2017, early enterprise-level CF platforms began to support blue-green deployment, marking the industry’s early adoption and promotion of this model.

The platform is implemented based on Cloud Foundry’s multi-space mechanism. Each application can be deployed to two spaces: one labeled “blue” and one labeled “green”. The platform provides a set of command line tools to manage traffic switching.

A typical release process is as follows:

# Step 1: deploy the new version to the green environment (blue keeps serving traffic)
cf push my-app-green -f manifest-green.yml

# Step 2: health check and smoke tests against the green environment
curl https://my-app-green.apps.example.com/health
./run-smoke-tests.sh

# Step 3: route part of the traffic (10%) to the green environment
cf map-route my-app-green apps.example.com --hostname my-app --weight 10

# Step 4: watch indicators and error logs
cf logs my-app-green --recent | grep ERROR

# Step 5: complete the switch and take the blue environment out of the route
cf unmap-route my-app-blue apps.example.com --hostname my-app
cf map-route my-app-green apps.example.com --hostname my-app --weight 100

# Step 6: keep the blue environment for a while as rollback insurance
# (clean it up once the new version is confirmed stable)

This implementation has several significant advantages. The first is atomic switching. Through the weight adjustment of the load balancer, almost instantaneous traffic switching can be achieved, and users will hardly notice the interruption. Compared with traditional rolling updates, blue-green deployment provides a truly zero-downtime experience.

Second is Quick Rollback. If there is a problem with the new version, you only need to execute the command once to switch back to the old version. In a certain business-critical project, the team discovered an exception five minutes after a release, and it took less than two minutes from the discovery of the problem to the completion of the rollback. This kind of speed is unimaginable under traditional publishing models.

However, blue-green deployment does not come without costs. The largest cost is infrastructure overhead: maintaining two sets of production environments means double the consumption of computing resources. On early enterprise-level CF platforms, this cost was usually optimized through automatic scaling: resources were temporarily added during the release window, and the blue environment was retained for a period of time after release and then automatically destroyed to free resources.

2.3 Challenges of database schema changes

Blue-green deployments work well when handling application layer changes, but the complexity increases significantly when database schema changes are involved.

The fundamental problem is this: when two environments share the same database, the new and old versions of the code may have different expectations for data structures. For example, the new version adds a non-null field, but the code in the old version does not know about the existence of this field.

Judging from the practice in the era of enterprise-level CF platforms, an “expand-contract” (Expand-Contract) model for handling database changes gradually took shape:

Expansion Phase (1st Release):

  • Add new database objects (tables, columns, indexes) without deleting old objects
  • Newly added columns must be optional (nullable) or have a default value
  • The code supports both old and new schemas

Transitional phase (stable period):

  • Old and new code run in parallel
  • Data is synchronized between the old schema and the new schema
  • Verify the integrity and performance of the new schema

Contraction Phase (2nd Release):

  • Remove support for the old schema from the code
  • Clean up database objects that are no longer needed
  • Keep only the new version of the code

This strategy decomposes a one-time destructive change into two non-destructive changes, which increases the complexity but ensures the safety of the release.
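To make the expand step concrete, here is a minimal sketch of what it might look like as a Liquibase-style YAML changelog; Liquibase is used purely as an illustration, and the table, column and changeSet names are hypothetical:

databaseChangeLog:
  - changeSet:
      id: expand-add-settlement-currency
      author: release-team
      changes:
        - addColumn:
            tableName: orders
            columns:
              - column:
                  name: settlement_currency
                  type: varchar(3)
                  constraints:
                    nullable: true   # nullable, so old code that never writes it keeps working
  # The matching "contract" changeSet (dropping the legacy column it replaces)
  # ships in a later release, after every code path has switched over.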

For more complex data migration scenarios, the enterprise-level CF platform also introduces the Shadow Database (Shadow Database) mode. Before releasing a new version, create a database copy of the new schema in advance and maintain real-time synchronization with the main database through the data synchronization tool. When released, the new version of the code connects directly to the shadow database, enabling true “zero downtime” data migration. Of course, this solution is more expensive and is usually only used in key business scenarios.

The ability to quickly roll back blue-green deployments is one of its most attractive features, but this capability also has its limits.

2.4 Costs and limitations of fast rollback

The first is data consistency during rollback. If the new version of the code has already written to the database, rolling back to the old version may cause data inconsistency. For example, if the new version writes a record containing new fields, what happens when the old version of the code reads this record? Judging from the practical experience of large enterprises, this problem is usually alleviated through the following strategies:

  • Set an observation period after switching to the new version (usually 15-30 minutes) during which key metrics are closely monitored
  • For critical business operations, enable Double Write mode during the observation period: write the old and new schemas simultaneously to ensure data integrity during rollback
  • If the change involves data migration, delay the destruction of the old environment until it is confirmed that no rollback is required

The second is the rollback complexity of stateful services. For stateful components such as databases, message queues, and caches, switching between blue and green deployments is not simple. On early enterprise CF platforms, these components were often treated as “shared resources” and changes were managed through schema versioning or instance isolation.

The last is the adaptation of organizational processes. The ability to roll back quickly changes the way operations teams think. Under the traditional model, a release is “irreversible”, so the approval process is crucial. Under the blue-green deployment model, a release becomes reversible, and the focus of approval shifts from “preventing the release” to “ensuring rollback capability.” This transformation requires matching adjustments in organizational culture.

3. Canary Release: From All or Nothing to Progressive Delivery

3.1 From simple weight to intelligent traffic segmentation

If blue-green deployment solves the problem of “zero downtime”, Canary Release further solves the problem of “risk control”. The core idea is: Expose the new version to a small number of users first, observe their behavior and system indicators, and then expand the scope after confirming safety.

The name “canary” comes from a coal-mining tradition: miners took canaries down into the mine. If there was poisonous gas, the canary would die before the humans did, serving as an early warning. In software releases, “canary users” play a similar role: they test new versions within a controlled risk range.


Figure 2: Traffic control and observation closed loop of canary release (weight, header, user grouping and indicator analysis)

Canary release traffic segmentation strategies have evolved from simple to complex:

Weight-based segmentation is the most basic form. For example, the initial phase routes 5% of traffic to the new version and 95% remains on the old version. This strategy is simple to implement, but there is a problem: users may “jump” between different versions. If the request is stateless (such as REST API), this will not cause a problem; but if session state is involved, the user experience will be affected.

User identity-based segmentation solves the session consistency problem. Once a user is assigned to a canary version, subsequent requests are routed to the same version. This can be accomplished via a hash of the user ID, a cookie, or a session. Judging from the practical experience of large enterprises, the hash value modulo of the user ID is usually used to determine routing: if hash(user_id) % 100 < 5, route to the canary version. This approach ensures user consistency while achieving uniform sample distribution.

Attribute-based segmentation further increases flexibility. The routing strategy can be determined based on the user’s region, device type, subscription level and other attributes. For example, a new version can be released to internal employees first, then expanded to users in specific regions, and finally made available to everyone. This strategy is particularly useful when you need to control the scope of influence.

Content-based smart segmentation is the most complex but also the most sophisticated form. For example, you can route only specific types of requests to the new version (such as routing only read requests and not write requests), or offload requests based on complexity (simple requests to the new version, complex requests to the old version). This strategy is especially valuable when testing the performance of new releases.
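As a concrete illustration of the weight-based and attribute-based strategies, here is a minimal Istio VirtualService sketch; the host, subset names and the x-canary-group header are hypothetical, and the stable/canary subsets would be defined in a companion DestinationRule:

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: checkout
spec:
  hosts:
    - checkout.example.internal
  http:
    # attribute-based rule: internal testers always hit the canary
    - match:
        - headers:
            x-canary-group:
              exact: internal
      route:
        - destination:
            host: checkout
            subset: canary
    # weight-based rule: everyone else is split 95/5
    - route:
        - destination:
            host: checkout
            subset: stable
          weight: 95
        - destination:
            host: checkout
            subset: canary
          weight: 5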

3.2 Indicator-based canary analysis

The value of canary release lies not only in “progressiveness” but also in “observability”. If there is no way to tell whether a canary release is “healthy”, progressive releases lose their meaning.

The core of canary analysis is to compare the differences in indicators between two versions. This sounds simple, but in practice there are many technical details involved.

The first is Indicator Selection. What indicators reflect the health of a version? Judging from the practical experience of large enterprises, indicators are usually divided into three categories:

System-level indicators: including error rate, latency (P50/P95/P99), throughput, CPU/memory usage, etc. These are the most basic monitoring indicators, and any abnormality will trigger an alarm immediately.

Business-level indicators: such as order conversion rate, payment success rate, user retention rate, etc. These metrics are directly related to business value, but it often takes longer to collect statistically significant data.

Custom Metrics: Metrics for specific changes. For example, if a new version improves the recommendation algorithm, we will monitor the recommendation click-through rate and user satisfaction.

Second is statistical significance. The canary version only serves a small number of users, and its indicators may fluctuate greatly. How can one tell whether an observed difference is a “real difference” or “random noise”? This requires the use of statistical methods, such as confidence intervals, hypothesis testing, etc.

For example, assume the error rate of the old version is 0.1% and the error rate of the canary version is 0.15%. Is this 0.05% difference significant? The answer depends on the sample size: if the canary version only processed 1,000 requests, the one or two errors observed are probably random fluctuation; but if it processed 1 million requests, the roughly 500 additional errors are very likely a real problem.

Judging from practical experience in the era of enterprise-grade CF platforms, a set of in-house tools was developed to automatically calculate the statistical significance of canary indicators. The tool uses Bayesian methods to estimate the confidence intervals of indicators and automatically triggers alerts when a canary indicator exceeds a preset threshold.

Finally, there is Multidimensional Attribution. Problems in production environments are often related to specific conditions. For example, an error may only occur in a specific browser or in a specific region. Canary analysis needs to support multi-dimensional slicing and drilling to help quickly locate the root cause of problems.

3.3 Automated canary determination

As the frequency of canary releases increases, manual judgment of canary health becomes unsustainable. This has led to the development of automated canary determination tools.

Netflix’s open source Kayenta is a representative in this field. Kayenta breaks down canary analysis into several steps:

Data collection: Obtain the indicator data of the canary version and the baseline version from the monitoring system (such as Atlas, Prometheus).

Indicator comparison: Compare each indicator to determine whether there is a significant difference. Kayenta supports several comparison algorithms, including the Mann-Whitney U test (a nonparametric statistical test) and simple threshold comparisons.

Comprehensive score: Calculate a comprehensive “health score” based on the comparison results of all indicators. If the score is lower than the threshold, the canary is judged to have failed.

Automatic decision-making: Automatically perform subsequent actions based on the judgment results—continue to expand the traffic proportion, or roll back to the old version.

Within large enterprises, a canary analysis platform was developed based on the concepts behind Kayenta. Compared to the open source version, this implementation adds the following features:

Business context awareness: Different business scenarios have different sensitivities to indicators. Payment services have a much lower tolerance for error rates than in-house admin backends. The platform allows defining different decision strategies for each service.

Automatic Baseline Selection: Canary analysis requires a “baseline” version to compare against. The traditional approach is to use the last stable version as the baseline, but what if the previous version itself has problems? The platform supports automatically selecting the most historically stable version as a baseline.

Progressive Scaling Strategy: Canary releases usually go through multiple stages: 1% -> 5% -> 25% -> 50% -> 100%. Each phase has specific observations and duration. The platform supports defining such progressive policies and automatically performs health checks at each stage.
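The exact configuration format of such an internal platform is not public, but as a purely illustrative sketch, a progressive policy of this kind could be expressed roughly as follows (all field names, stages and thresholds are hypothetical):

# hypothetical canary policy for a payment-class service
service: payment-gateway
baseline: most-stable-recent   # auto-select the most stable historical version as baseline
stages:
  - weight: 1
    hold: 10m        # observation window at this traffic share
  - weight: 5
    hold: 30m
  - weight: 25
    hold: 1h
  - weight: 50
    hold: 2h
  - weight: 100
metrics:
  - name: error_rate
    max: "0.1%"      # payment services tolerate far less than internal admin tools
  - name: latency_p99
    max: 300ms
on_failure: rollback   # any failed check aborts the rollout and reverts traffic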

3.4 Philosophical changes in production environment testing

Canary releases represent a fundamental shift in thinking: from “verify everything in test” to “cautiously experiment in production.”

Traditional software engineering philosophy emphasizes the consistency between the test environment and the production environment, believing that as long as the test passes, the production release is safe. However, this ideal state is almost impossible to achieve in practice:

Data Difference: There are huge differences between the data volume, data distribution, and data characteristics of the test environment and the production environment. Certain performance issues are only exposed with production-scale data.

Load difference: It is difficult for the test environment to simulate the real load pattern of the production environment, including peak traffic, burst requests, long-term running, etc.

Dependency Difference: External dependencies (third-party APIs, partner systems) usually use mock or sandbox versions in the test environment, and their behavior is different from the real system.

Canary releases acknowledge this reality: the only thing that truly resembles a production environment is the production environment itself. Therefore, instead of pursuing a perfect testing environment, it is better to design a mechanism that can be verified in the production environment with controllable risks.

This philosophical shift brought about many practical adjustments:

Observability first: Since everything cannot be verified before production, problems must be discovered quickly after production. This requires the system to have complete monitoring, logging and tracking capabilities.

Fast rollback capability: When a problem occurs, the rollback must be completed in minutes or even seconds. This promotes the development of technologies such as blue-green deployment and rapid configuration updates.

Graceful Downgrade Design: Problems with new versions should not cause the entire system to crash. Control problems to a local scale through mechanisms such as circuit breaker, current limiting, and downgrade.

Data Protection Mechanism: Production environment testing cannot be at the expense of user data. Through technologies such as shadow traffic (copying production traffic but not affecting users) and synthetic transactions (simulating user operations), real feedback can be obtained while protecting users.

4. Feature switch: decoupling release and deployment

4.1 LaunchDarkly and the rise of SaaS platforms

In 2018, LaunchDarkly, a SaaS platform specializing in feature flags/toggles as a service, entered the industry’s field of vision, and it completely changed how teams understood the idea of a release.

Traditionally, “releasing” means deploying new code to production and making it visible to users. Feature switches decouple these two actions: The code can be deployed to the production environment first, but the switch controls whether it is visible to users. This brings revolutionary flexibility.

LaunchDarkly’s core capabilities include:

Dynamic On/Off Control: Turn features on/off in real time via Web UI or API, without having to redeploy code. Switch state changes usually take effect in milliseconds.

Fine-grained targeting: Feature switches can be applied to specific users, user groups, regions, device types, etc. For example, open new features to internal employees only, or enable premium features only to paying users.

Progressive Release: By gradually increasing the proportion of users covered by the switch, an effect similar to canary release is achieved, but without complex infrastructure support.

A/B testing support: Quantify the business value of new features by comparing the performance of switch-on groups and switch-off groups in random groups.

LaunchDarkly is used heavily in large enterprise digital transformation projects. A typical application scenario is: a large functional module (such as a new user interface) involves the modification of hundreds of files on the front and back ends. The traditional method requires waiting for all components to be completed before release, which may take several months. After using feature switches, the team can gradually merge the code into the main branch, hide the unfinished parts through the switch, and achieve continuous integration without affecting users.

However, SaaS platforms also have their limitations. Data privacy is a major concern: decision logic for feature switches and user grouping information needs to be sent to a third-party service. For sensitive industries such as finance and healthcare, this may violate compliance requirements. Additionally, external dependencies mean additional points of failure—how will the app behave if the LaunchDarkly service is unavailable?

4.2 Design of self-developed characteristic switching system

After gaining experience with SaaS platforms, many large enterprises soon run into the same question: feature switches can indeed separate “deployment” from “release”, but can switch rules, user groupings and experiment data be entrusted to an external platform over the long term? In scenarios such as finance, healthcare, government and large enterprises, this is often not merely a question of technology selection, but a boundary question of compliance, security and controllability.

Therefore, around 2019, some companies began to build internal feature switching systems. The goal is not to simply copy a SaaS product, but to find a long-term balance between functional integrity, real-time performance, security boundaries, and operation and maintenance costs.

This type of system is usually split into three levels.

The Configuration Layer is responsible for defining “what the switch is”. The switch configuration is stored in the Git repository and is managed using the “configuration as code” approach. Each switch corresponds to a YAML file, which clearly states the name, description, default state, target rules and responsible person. The advantage of this is that switch changes can be reviewed, rolled back, and tracked, instead of being scattered in a background page.

The distribution layer is responsible for delivering configurations to running services safely and quickly. The configuration is synchronized from Git to a distributed configuration center, such as etcd, and then pushed to the application node through a long connection. In this way, when the switch status changes, the application does not need to be redeployed or polled frequently, and the change delay can usually be controlled within 1 second.

The client layer is responsible for allowing business code to read the switch status at low cost. The application queries the switch through the lightweight SDK, and the SDK maintains a cache locally. Even if the application is temporarily disconnected from the configuration center, it can continue to work based on the cached value of the last successful synchronization, avoiding direct impact on business requests due to control plane failures.

Core Data Structure:

# Switch configuration example
flag_name: new_checkout_flow
description: "New checkout flow rollout"
default_value: false
targets:
  - name: internal_testers
    rules:
      - attribute: email
        operator: ends_with
        value: "@example.com"
    value: true
  - name: beta_users
    rules:
      - attribute: user_segment
        operator: in
        value: ["beta", "early_adopter"]
    value: true
  - name: gradual_rollout
    rules:
      - attribute: user_id
        operator: percentage
        value: 10  # 10% of users
    value: true

Key Design Decisions:

Local priority strategy: The SDK first checks the local cache and only queries the configuration center if the cache misses or expires. This ensures high performance of switch checking (local operation < 1ms) and high availability (network failures do not affect existing switches).

Consistent Hashing: For percentage-based progressive publishing, using consistent hashing of user IDs ensures that the same user always gets the same switch value in different requests, avoiding jumps in user experience.

Audit and Rollback: All switch changes are recorded in the audit log, supporting one-click rollback to the previous state. This is particularly valuable when misoperation leads to production accidents.

Performance Optimization: Feature switch checking is a high-frequency operation (maybe millions of times per second) and must be extremely optimized. The following strategies were adopted:

  • Local cache uses lock-free data structures
  • Switch rules are pre-compiled into efficient decision trees
  • Support batch switch query (one request to obtain multiple switch status)

4.3 Continuous deployment decoupled from release

The most direct value of feature switches is to achieve true continuous deployment (Continuous Deployment).

The traditional continuous delivery (Continuous Delivery) pipeline stops after deployment to the production environment, requiring manual decision-making whether to “release” (make it visible to users). Feature switches eliminate this manual link: the code can be automatically deployed to the production environment, but is hidden from users by default; when certain conditions are met (such as testing passing, business approval is completed), it is automatically “released” by turning on the switch.

This decoupling brings multiple benefits:

Release frequency has been greatly increased: In an enterprise-level platform team, the release frequency has been increased from once a month to multiple times a day, because each merge into the main branch can be automatically deployed and is no longer restricted by the release window and CAB approval.

Significantly reduced risk: The scope of each change becomes smaller, making it easier to locate and rollback problems. If a feature causes problems, you can simply turn off the switch without having to roll back your code deployment.

Parallel development is smoother: Multiple teams can work on the same code base, each controlling their own feature switches, reducing conflicts and coordination costs of branch merging.

Experiment-driven development: Product teams can conduct A/B testing and user experiments more flexibly, quickly verify hypotheses, and data-driven decisions replace lengthy requirements reviews.

Of course, the price of this model is the accumulation of technical debt.

4.4 Technical Debt Management: Switch Cleanup Strategy

Feature switches are powerful tools, but misuse can lead to serious technical debt.

When reviewing a project’s code in 2020, it was discovered that there were more than 200 feature switches in the code base, about 60% of which had been turned on for a long time (more than 6 months) but had never been cleaned up. These “zombie switches” pose the following problems:

Increased code complexity: Each switch introduces conditional branches, making the code logic difficult to understand and test. Some critical paths are nested with multiple switch judgments, forming a complex decision matrix.

Test Matrix Explosion: To verify the behavior of all switch combinations, the number of test cases grows exponentially. In practice, teams often only test default configurations, leaving code paths uncovered.

Performance Overhead: While the overhead of a single switch check is small, the cumulative effect of hundreds of switches cannot be ignored. What’s more, switch checks often occur on hot paths (such as the request processing chain), amplifying the impact.

Maintenance Burden: Each switch requires ongoing monitoring and documentation. When the product manager responsible for the switch leaves, the new team often doesn’t know the function of a certain switch and dare not turn it off easily.

In order to solve these problems, a strict switch life cycle management process was established:

Switch Classification: switches are divided into short-term switches (experimental features) and long-term switches (operational configurations). Short-term switches must be removed after the feature stabilizes, and long-term switches require an architectural review.

Expiration Reminder: The system automatically tracks the creation time and last change time of each switch. Switches that have not been changed beyond the preset threshold (such as 30 days) will trigger an alarm.

Cleaning Sprint: A special switch cleanup event is organized every quarter. The team reviews all switches, removes those that are no longer needed, and converts temporary switches into formal configurations.

Code Review: During the code review, it is mandatory to check whether the switch usage is reasonable and whether there is a corresponding cleanup plan.

Through these measures, the number of switches in the project was controlled within 50, and technical debt was effectively managed.

5. GitOps: the new paradigm of declarative release

5.1 Governance model of ArgoCD and Flux

In 2020, the concept of GitOps started to gain popularity, and the industry realized that this could be the next big evolution in release governance.

The core idea of GitOps is simple: store the desired state of the system declaratively in a Git repository, and the automation tool is responsible for synchronizing the actual state to the desired state. For a Kubernetes environment, this means that all resource configurations (Deployments, Services, ConfigMap, etc.) are stored in Git as YAML files, and GitOps tools (such as ArgoCD, Flux) continuously monitor the Git repository for changes and automatically apply the changes to the cluster.


Figure 3: GitOps release governance closed loop (Git trusted source, reconciliation, synchronization and feedback)

In 2021, a medium-sized project migrated to ArgoCD and deeply realized the advantages of GitOps:

Version control as the only source of trust: All configuration changes are submitted through Git, with natural audit trail capabilities. Who made changes, when, and why are all clearly visible in Git history. This solves the common problem of “configuration drift” in traditional operations - that is, the actual configuration of the production environment is inconsistent with the documentation.

Declarative vs. Imperative: A traditional release script is usually a series of commands: “First create this, then update that, and roll back if it fails.” GitOps takes a declarative approach: “The desired state of the system is this, make sure the actual state matches it.” This approach is more robust because even if an intermediate step fails, the system will automatically retry until the desired state is reached without staying in a half-completed state.

Automated synchronization and self-healing: ArgoCD will continuously monitor the cluster status. If “drift” is detected (such as someone manually modifying resources through kubectl), you can choose to alert or automatically repair it. This “immutability” guarantees the consistency of the environment.

Multiple environment management: Through Git branch or directory structure, the configuration of multiple environments (development, testing, production) can be managed elegantly. The differences between environments are clearly visible, avoiding the confusion of “why there is no problem in the test environment but errors in the production environment”.
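For reference, a minimal ArgoCD Application sketch that wires a service to such a configuration repository; the repository URL, path and service name are placeholders:

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: order-service
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.example.com/platform/config-repo.git   # configuration repository
    targetRevision: main
    path: apps/order-service/overlays/dev
  destination:
    server: https://kubernetes.default.svc
    namespace: order-service
  syncPolicy:
    automated: {}    # keep the cluster continuously in sync with what Git declares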

5.2 Version Control as the Only Source of Trust

GitOps emphasizes that “Git is the only source of trust”. This is not only a technical choice, but also a change in governance philosophy.

In the traditional model, there are multiple “sources of trust”: Git stores code, the configuration management database (CMDB) stores asset information, the document system stores architecture design, and the release system stores release history. Synchronization problems between these systems often occur, resulting in “data source conflicts”.

GitOps converges all configuration into Git, enabling a single source of trust. This brings multiple benefits:

Centralized permission management: Git’s permission control mechanism is mature and complete, and can accurately control who can modify which configurations. Compared with Kubernetes RBAC, Git’s permission auditing is more transparent.

Code review mandatory: All configuration changes must go through the Pull Request process and undergo code review before they can be merged. This avoids the dangerous operation of “modifying production configurations casually”.

Disaster Recovery Simplified: If the entire Kubernetes cluster is destroyed, it can be quickly rebuilt from a Git repository. Because Git stores the “desired state” of the system rather than the “change log”, the rebuild process is idempotent.

Compliance audit friendly: Git’s submission history naturally meets audit requirements and can generate detailed change reports to meet compliance requirements such as SOX and GDPR.

In practice, a multi-repository strategy is adopted to manage complexity:

Application code repository: stores business code, including CI configuration. After a successful build, a container image is produced and the image tag is published as output.

Configuration repository: stores Kubernetes manifests and references the images produced by the application repository. Differences between environments are managed through Kustomize or Helm.

Infrastructure repository: stores Terraform configuration and manages cluster-level resources (VPC, databases, network policies, etc.).

This layered architecture ensures separation of concerns: developers focus on application code, operations staff focus on configuration, and platform teams focus on infrastructure.
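Within the configuration repository, per-environment differences can be expressed with a Kustomize overlay; a minimal sketch with placeholder paths and image tag:

# apps/order-service/overlays/production/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base                 # shared manifests (Deployment, Service, ConfigMap)
images:
  - name: order-service
    newTag: "1.8.3"            # image built and tagged by the application repository's CI
patches:
  - path: replica-count.yaml   # production-only override, e.g. a higher replica count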

5.3 Automated synchronization and drift detection

The core work cycle of ArgoCD consists of three steps:

Detection: Regularly check whether there are new commits in the Git repository, and whether the actual state of the cluster is consistent with the expected state declared in Git.

Difference Analysis (Diff): If inconsistency is found, calculate the difference. The difference may come from two aspects: Git has changes that need to be applied to the cluster, or the cluster state deviates from Git’s declaration (drift).

Sync: Apply changes to the cluster automatically or manually according to configured policies.

Drift detection is an important feature of GitOps. This situation often occurs in production environments: someone directly modifies resources through kubectl in order to urgently fix a problem; or a controller automatically adjusts resources (such as HPA adjusting the number of replicas). If these changes are not synchronized back to Git, they form “drift”.

ArgoCD provides a visual interface to display drift conditions and supports the following processing strategies:

Alarm only: Send an alarm when drift is found, but do not automatically repair it. Suitable for scenarios requiring manual review.

Auto-repair: Automatically restore the cluster state to the state declared by Git. This is recommended because it enforces the “Git is the only source of truth” principle.

Ignore specific fields: Some fields are automatically managed by the system (such as the status field of the resource, the number of replicas set by HPA), and the drift of these fields can be configured to be ignored.

In the practice of large enterprises, a progressive strategy is adopted: automatic repair is enabled in the development environment, alerts plus manual review in the test environment, and alerts plus change approval in the production environment. This layered strategy balances efficiency with risk.
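As a sketch of how these policies map onto ArgoCD configuration, the Application below enables self-healing (as one would in the development environment) and ignores replica-count drift owned by the HPA; names and paths are placeholders:

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: order-service-dev
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.example.com/platform/config-repo.git
    targetRevision: main
    path: apps/order-service/overlays/dev
  destination:
    server: https://kubernetes.default.svc
    namespace: order-service
  syncPolicy:
    automated:
      selfHeal: true        # auto-repair drift; test/production would rely on alerts and approval instead
  ignoreDifferences:
    - group: apps
      kind: Deployment
      jsonPointers:
        - /spec/replicas    # replica count is managed by the HPA, so its drift is ignored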

5.4 Multi-environment promotion strategy

Another important scenario of GitOps is multi-environment promotion: after the code is verified in the development environment, how to safely promote it to the testing, pre-release and production environments.

The following promotion process has been designed:

Development Environment (Dev): After the developer submits the code, CI automatically builds the image and updates the development environment directory of the configuration warehouse. ArgoCD automatically syncs to the development cluster.

Test environment (Test): When the tests of the development environment pass, create a Pull Request to merge the changes from the dev branch to the test branch. Integration testing is triggered and merged after approval by QA.

Pre-release environment (Staging): Changes to the staging branch trigger more comprehensive verification, including performance testing, security scanning, and compliance checks. This is the final hurdle before production release.

Production environment (Production): Changes in the production environment need to be approved by the CAB (yes, GitOps does not completely eliminate the CAB, but changes its responsibility from “approving each change” to “approving the promotion policy”). After approval, cherry-pick the staging changes to the production branch.

This process ensures progressive validation of changes while clearly tracking the status of each environment through Git branches.

Optimization of promotion strategy:

With the deepening of practice, some problems with traditional promotion strategies have been discovered:

Delayed feedback: If performance testing is only performed in the staging environment, the problem will be discovered close to release, and the repair cost will be very high.

Accumulation of environment differences: The configuration of the development environment deviates from the production environment for a long time, resulting in the problem of “it can run on my machine”.

Merge Conflict: When multiple features are developed in parallel, conflicts often occur when merging between branches.

To address these issues, the following improvements were introduced:

Early Quality Level: Run some performance tests and security scans in the development environment to detect problems as early as possible.

Configuration synchronization automation: Regularly synchronize the configuration of the production environment back to other environments to maintain environmental consistency.

Trunk-based Development: Reduce long-standing feature branches, frequently merge into the trunk, and reduce merge conflicts.

6. Progressive delivery: from continuous deployment to intelligent release

6.1 Automatic rollback and self-healing

Progressive Delivery is the culmination of canary release, feature switching, and GitOps. It not only focuses on “how to publish”, but also “how to publish safely”.

Automatic rollback is a core capability of progressive delivery. The basic logic is: continuously monitor key indicators during the release process, and automatically perform rollback operations if abnormalities are found.

Flagger is a popular tool for progressive delivery that integrates with Kubernetes and service mesh (Istio, Linkerd) or Ingress controllers to provide automated canary releases and rollback capabilities.

Flagger’s workflow is as follows:

Analysis Phase: After deploying the new version, Flagger starts collecting metrics. It will compare the indicators of the new version and the old version, and use statistical methods to determine whether the new version is “healthy”.

Progressive expansion: If the analysis passes, Flagger gradually increases the traffic proportion of the new version: 1% -> 10% -> 50% -> 100%. Each stage has a preset observation time.

Automatic rollback: If the analysis fails at any stage (such as the error rate exceeds the threshold), Flagger automatically switches the traffic back to the old version and notifies relevant personnel.

Automatic promotion: If all stages are passed, the new version is promoted to a “stable version” and the old version is cleaned up.
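A minimal Flagger Canary sketch along these lines, using Flagger’s built-in request-success-rate and request-duration checks; the service name, port and thresholds are illustrative:

apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: payment-service
  namespace: payments
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payment-service
  service:
    port: 8080
  analysis:
    interval: 1m        # how often metrics are evaluated
    threshold: 5        # failed checks before automatic rollback
    maxWeight: 50       # canary traffic never exceeds 50% before promotion
    stepWeight: 10      # traffic increase per successful interval
    metrics:
      - name: request-success-rate
        thresholdRange:
          min: 99       # success rate must stay above 99%
        interval: 1m
      - name: request-duration
        thresholdRange:
          max: 500      # P99 latency in milliseconds
        interval: 1m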

In 2022, I used Flagger to deploy a key payment service and experienced the value of automatic rollback firsthand:

One Friday night, a new version was deployed automatically. After Flagger switched 1% of the traffic to the new version, the monitoring system reported a latency anomaly. Before anyone could intervene manually, Flagger completed the automatic rollback within 30 seconds. It was later found that a dependent library in the new version was misconfigured, exhausting the database connection pool. Without automatic rollback, the issue could have affected many more users and resulted in significantly longer response times.

6.2 Release determination based on observability

Another key feature of progressive delivery is release decisions based on observability.

Traditional release verification relies on predefined test cases. If the test passes, the release is considered successful. However, test cases cannot cover all scenarios, especially in complex distributed systems.

Determination based on observability takes a different approach: not trying to predict all possible problems, but quickly discovering problems through comprehensive monitoring. This requires:

Complete indicator system: includes not only system-level indicators (CPU, memory, network), but also application-level indicators (request delay, error rate, throughput), and business-level indicators (order volume, conversion rate, revenue).

Distributed Tracing: Able to trace the complete links of requests between multiple services and identify performance bottlenecks and error propagation paths.

Log aggregation: Centrally collect and analyze logs of all services, supporting fast retrieval and pattern recognition.

Anomaly Detection: Use machine learning algorithms to automatically identify abnormal fluctuations in indicators and reduce false positives and negatives.

In large-scale enterprise cloud-native projects, a unified observation platform was built that integrated Prometheus (indicators), Jaeger (tracing), Elasticsearch (logs) and customized anomaly detection services. Flagger obtains data from this platform and uses it to issue decisions.
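As an example of wiring release decisions to such a platform, Flagger allows custom metrics backed by Prometheus through a MetricTemplate; the query below is a simplified, hypothetical error-rate expression (metric and label names are placeholders):

apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: business-error-rate
  namespace: payments
spec:
  provider:
    type: prometheus
    address: http://prometheus.monitoring:9090
  query: |
    sum(rate(http_requests_total{app="{{ target }}", status=~"5.."}[{{ interval }}]))
    /
    sum(rate(http_requests_total{app="{{ target }}"}[{{ interval }}]))
    * 100

A template like this can then be referenced from a Canary’s analysis section, so the rollout decision is driven by the same observability data the rest of the platform uses.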

6.3 Argo Rollouts in practice

Argo Rollouts is a sub-project of the Argo project, specifically designed to replace Kubernetes’ native Deployment controller and provide a richer release strategy.

Compared to Flagger, Argo Rollouts feature:

Deep integration with Kubernetes: Run as a custom controller, use CRD (Custom Resource Definition) to define release strategies, and seamlessly integrate with the Kubernetes ecosystem.

Multiple release strategies: Supports multiple strategies such as blue-green deployment, canary release, A/B testing, etc., and can be chosen flexibly.

Manual and automatic pause: Supports pausing release at specific stages and waiting for manual confirmation or external signals, suitable for scenarios that require manual review.

Integrated Analysis: The built-in analysis engine can automatically determine the health of the release based on data sources such as Prometheus, Datadog, and CloudWatch.

A complex release scenario was implemented using Argo Rollouts:

A certain database upgrade involved schema changes and therefore carried a high risk. The team used Argo Rollouts’ “canary + manual pause” strategy:

  1. First deploy the new version to 5% of the Pods
  2. Pause publishing and wait for the DBA to manually confirm that the database status is normal
  3. Upon DBA approval, continue to expand to 25%
  4. Observe key indicators for one hour
  5. If the indicator is normal, 100% release will be completed automatically; if abnormal, it will be rolled back automatically.

This “human-machine integration” approach not only retains the efficiency of automation, but also retains the flexibility of manual intervention for key decisions.
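A minimal Argo Rollouts sketch of this “canary + manual pause” flow; the resource names, image and analysis template are hypothetical:

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: order-db-consumer
spec:
  replicas: 20
  selector:
    matchLabels:
      app: order-db-consumer
  template:
    metadata:
      labels:
        app: order-db-consumer
    spec:
      containers:
        - name: app
          image: registry.example.com/order-db-consumer:2.0.0
  strategy:
    canary:
      steps:
        - setWeight: 5
        - pause: {}                 # indefinite pause: wait for the DBA to resume manually
        - setWeight: 25
        - pause: { duration: 1h }   # observe key indicators for one hour
        - setWeight: 100
      analysis:
        templates:
          - templateName: success-rate   # Prometheus-backed check; failure triggers rollback
        startingStep: 2                  # automated analysis begins after DBA approval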

6.4 Paradigm shift from continuous deployment to progressive delivery

Looking back at the evolution of release governance, we can see the paradigm shift from “continuous deployment” to “incremental delivery”:

Continuous deployment emphasizes speed: the code is automatically deployed to the production environment after passing the test. It solves the “pain of publishing” problem and turns publishing from a “big deal” to a daily operation.

Progressive Delivery adds security to speed: not just release quickly, but release securely. It uses technical means (canaries, feature switches, automatic rollback) to control risks within an acceptable range.

This paradigm shift changes the core issues of release governance:

Shift from “Should publishing be allowed?” to “How to publish safely?” - Approval is no longer the only means of control, and technical means provide more granular control capabilities.

Shift from “verify everything before release” to “quickly identify issues and recover after release” - acknowledge that testing can’t cover every situation and compensate with observability and fast rollbacks.

Transform from “manual decision-making” to “data-driven decision-making” - release decisions are based on real-time indicators rather than manual judgment, reducing subjective factors and delays.

This shift does not deny the value of manual review, but focuses it on links that truly require human judgment: architecture design, risk assessment, and exception handling. Routine technical releases are handled by automated systems.

7. Architect decision-making framework

7.1 Release governance technology selection matrix

Based on practice from 2015 to 2026 (to date), this article summarizes a decision-making framework for release governance technology selection.

| Technical solution | Applicable scenarios | Complexity | Cost | Risk reduction capability | Main limitations |
| --- | --- | --- | --- | --- | --- |
| Blue-green deployment | Requires zero downtime, fast rollback | Medium | High (double resources) | High | Database changes are complex |
| Canary release | Progressive risk control | Medium to high | Medium | Very high | Requires complete monitoring |
| Feature switch | Frequent releases, A/B testing | Low | Low | Medium | Technical debt risk |
| GitOps | Declarative management, multiple environments | Medium | Low | Medium to high | Steep learning curve |
| Progressive delivery | Critical business, high reliability | High | Medium to high | Extremely high | Requires full observability |

Consider the following factors when choosing:

Business criticality: Core paths such as payment and transactions require the highest level of risk control and are suitable for progressive delivery; internal tools can use lighter solutions.

Release Frequency: High-frequency release teams should invest in feature switches and GitOps; low-frequency releases may not require sophisticated automation.

Team Maturity: Technology selection must match the team’s capabilities. Implementing canary release without a complete monitoring system is tantamount to a blind man riding a blind horse.

Compliance requirements: Industries such as finance and healthcare may have specific auditing and approval requirements, and technical solutions need to be compatible with them.

7.2 Team maturity and tool selection

Adoption of release governance tools should match team maturity.

Level 1: Basic level

  • Features: Manual deployment, irregular release times, lack of automated testing
  • Recommended tools: basic CI/CD, simple blue-green deployment
  • Focus: Establish an automation foundation and standardize the release process

Level 2: Growth Level

  • Features: Automated deployment, basic monitoring, start paying attention to release quality
  • Recommended tools: feature switch, basic canary release
  • Focus: Increase release frequency and reduce release risks

Level 3: Mature

  • Features: Continuous deployment, complete monitoring and alarming, data-driven decision-making
  • Recommended tools: GitOps, automated canary analysis
  • Focus: Optimize the release process and improve release confidence

Level 4: Leading level

  • Features: Fully automated progressive delivery, advanced observability, rapid recovery capabilities
  • Recommended tools: Flagger/Argo Rollouts, machine learning-assisted anomaly detection
  • Focus: Continuous optimization and exploration of innovative practices

Teams should not skip levels in pursuit of the “most advanced” tools. Too many teams in the industry introduce canary releases without adequate monitoring; as a result, no one can tell whether the canary is healthy, and the extra machinery only adds risk.

7.3 Evolution path from manual approval to automation

For organizations still using the traditional CAB model, the evolution to automated release governance requires a gradual process:

Phase 1: Automation preparation

  • Establish a CI/CD pipeline to automate build and deployment
  • Improve the monitoring and alerting system so that problems can be detected
  • Standardize environment configuration to reduce differences between environments

Phase 2: Low-risk pilot

  • Select non-critical services as pilots and introduce blue-green deployment (see the switch sketch after this list)
  • Maintain CAB approval, but shift its focus from “code review” to “deployment policy review”
  • Accumulate experience and confidence in automated releases
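
As a concrete illustration of what the pilot can look like, here is a minimal sketch of the blue-green switch itself, assuming a Kubernetes Service that fronts two Deployments distinguished by a color label. The service name, namespace, and label key are hypothetical, and the official kubernetes Python client is assumed to be installed.

```python
# A minimal blue-green traffic switch: repoint the Service selector to the target color.
# Service name, namespace, and the "color" label are illustrative assumptions.
from kubernetes import client, config


def switch_traffic(to_color: str, service: str = "payment-svc", namespace: str = "pilot") -> None:
    """Switch all traffic at once by patching the Service selector."""
    config.load_kube_config()  # use config.load_incluster_config() when running inside the cluster
    core = client.CoreV1Api()
    patch = {"spec": {"selector": {"color": to_color}}}
    core.patch_namespaced_service(service, namespace, patch)
    print(f"Service {namespace}/{service} now routes to the {to_color} Deployment")


if __name__ == "__main__":
    # Typical flow: deploy green alongside blue, verify it, then flip.
    switch_traffic("green")
```

Because the old Deployment keeps running, rollback is the same call back to the previous color, which is exactly the property that makes blue-green a good first step away from big-bang releases.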

Phase 3: Expanding the scope of automation

  • Introduce feature switches into the development process to decouple deployment from release (a minimal flag sketch follows this list)
  • Exempt low-risk changes from CAB approval by establishing a whitelist of “standard changes”
  • Introduce canary releases, while still retaining a manual confirmation step
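
The sketch below shows the kind of feature switch meant here: the new code path ships with the deployment but stays dark until the flag opens it, optionally for a stable percentage of users. The flag name, the environment-variable source, and the percentage rollout are illustrative assumptions rather than any specific feature-flag product.

```python
# A minimal feature switch: deployment puts the code in production, the flag decides
# who actually sees it. Flag source and names are illustrative assumptions.
import hashlib
import os


def flag_enabled(flag: str, user_id: str, default_percent: int = 0) -> bool:
    """Expose the flag to a stable percentage of users, adjustable without redeploying."""
    percent = int(os.getenv(f"FLAG_{flag.upper()}_PERCENT", default_percent))
    bucket = int(hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest(), 16) % 100
    return bucket < percent


def checkout(user_id: str) -> str:
    if flag_enabled("new_pricing", user_id):
        return "new pricing engine"    # new path, dark-launched until the flag opens
    return "legacy pricing engine"     # old path remains the safe default
```

Turning the flag up from 0 to 100 percent (or back down) is a configuration change rather than a deployment, which is precisely what decouples “deploy” from “release”.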

Phase 4: Intelligent evolution

  • Introduce GitOps to implement declarative configuration management
  • Establish automated release decisions based on metrics (see the sketch after this list)
  • Build out complete progressive delivery capabilities
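
As a sketch of what metric-based automated release judgment looks like in its simplest form (Flagger and Argo Rollouts automate a more complete version of this loop), the snippet below queries the canary’s error rate from Prometheus and decides whether to promote or roll back. The Prometheus address, the PromQL query against Istio request metrics, and the 1% threshold are all illustrative assumptions.

```python
# A minimal metric-based promote/rollback decision. The Prometheus URL, the PromQL
# query, and the threshold are illustrative assumptions for a hypothetical canary.
import requests

PROM_URL = "http://prometheus.monitoring:9090/api/v1/query"
ERROR_RATE_QUERY = (
    'sum(rate(istio_requests_total{destination_workload="payment-canary",response_code=~"5.."}[5m]))'
    ' / sum(rate(istio_requests_total{destination_workload="payment-canary"}[5m]))'
)


def canary_error_rate() -> float:
    resp = requests.get(PROM_URL, params={"query": ERROR_RATE_QUERY}, timeout=10)
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0


def judge(threshold: float = 0.01) -> str:
    """Promote only while the observed error rate stays under the threshold."""
    return "promote" if canary_error_rate() < threshold else "rollback"


if __name__ == "__main__":
    print(judge())
```

In a real pipeline this judgment runs repeatedly as traffic is shifted in steps, and a single “rollback” verdict aborts the rollout automatically.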

Phase 5: Continuous Optimization

  • Optimize release strategies based on historical data
  • Introduce machine learning to assist risk prediction
  • Establish measurement and feedback loops for release governance

This evolution path may take several years, with each stage requiring matching adjustments in organizational culture, technical capabilities, and process specifications.

Conclusion

From 2015 to 2026 (to date), from the CAB conference room of the enterprise-level CF platform to the cloud-native automatic release pipeline, the industry has witnessed a profound change in release governance from “manual approval” to “progressive delivery”.

The core of this transformation is the update of the way of thinking: from “release is a dangerous event” to “release is a daily activity”, from “perfect verification beforehand” to “quick recovery afterward”, from “manual centralized control” to “automated decentralized decision-making”.

Technical tools provide the means to achieve this transformation: blue-green deployment eliminates downtime, canary release implements progressive risk control, feature switches decouple deployment and release, GitOps provides a declarative governance framework, and progressive delivery integrates all this into an intelligent release system.

But technology is only the foundation. The real challenge lies in transforming the organizational culture: How do you get teams to trust automated systems? How do you balance efficiency and risk? How do you shift the CAB’s focus from “approving every change” to “designing a secure release strategy”?

Judging from practical experience in the era of enterprise-level CF platforms, successful transformation requires:

Patience: Don’t expect instant success. Every organization has a different starting point and evolves at a different pace.

Evidence: Use data and real cases to prove the value of the new approach. Every successful automatic rollback is a powerful rebuttal to conventional thinking.

Fault tolerance: Allow trial and error and learn from failures. Perfect release governance does not exist; continuous improvement is the goal.

People-oriented: Technology serves people. The ultimate goal of release governance is to allow the team to create value for users faster and more securely, not to pursue the technology itself.

From that tense late night in 2015 to today, the industry has both witnessed and taken part in this change. The evolution of release governance continues, with new tools and models still emerging. An architect’s responsibility is to maintain clear judgment amid these changes, choose appropriate technologies for the team, and create lasting value for the business.

The ultimate goal of release governance is not to eliminate all risks—this is impossible in complex distributed systems—but to establish the ability to deliver value quickly and confidently while keeping risks under control. From this perspective, we are still on the road.


About the author

Milome has more than ten years of experience in enterprise-level architecture design. He has served as a senior architect for enterprise-level CF platforms and has led the architecture design and implementation of multiple large-scale microservice platforms. He currently focuses on research and practice in cloud-native technology architecture and governance systems.

*This article is the fifth in the series “From Enterprise-level CF Platform to Cloud Native: More than Ten Years of Evolution of Enterprise-level Microservice Governance”. Other articles in the series cover the inner logic of architecture evolution, observability-driven governance, the evolution of traffic governance, and the redefinition of elastic fault tolerance.*
