Article
From enterprise-level CF platform to cloud native (3): The evolution of traffic management - from Spring Cloud Gateway to Gateway API and Ambient Mesh
Review the practice of Spring Cloud Gateway in the enterprise-level CF platform, analyze the standardization value of Kubernetes Gateway API, explore the evolution logic from Service Mesh to Ambient Mesh, and provide a decision-making framework for enterprise traffic management selection.
From enterprise-level CF platform to cloud native (3): The evolution of traffic management - from Spring Cloud Gateway to Gateway API and Ambient Mesh
In the first article, I reviewed the microservice governance architecture in the era of enterprise-level CF platforms; the second article discussed how observability becomes the basis for governance decisions. This article will focus on the core area of traffic management, reviewing the technological evolution from Spring Cloud Gateway to Kubernetes Gateway API, to Service Mesh and Ambient Mesh.
Traffic governance is one of the most challenging areas of microservices architecture. It involves ingress management of north-south traffic, inter-service communication of east-west traffic, traffic scheduling across clusters, and the unification of security and observability. From 2015 to 2026 (to date), this field has experienced three paradigm shifts: from application-layer gateway to platform-layer gateway, from centralized proxy to distributed Sidecar, and from Sidecar to Sidecarless.
Judging from practice in the enterprise-level CF platform era, between 2016 and 2020 I experienced the evolution from the layered routing architecture of AppRouter + Gorouter toward more refined traffic management requirements, and participated in the performance tuning and large-scale deployment of the internal gateway platform. I subsequently led traffic management transformations in multiple Kubernetes environments, witnessing the fragmentation dilemma of Ingress, the standardization process of the Gateway API, the implementation challenges of Istio Service Mesh, and the rise of the new Ambient Mesh paradigm.
Based on these industry practical experiences, this article will systematically analyze the evolutionary logic of traffic management technology and help enterprise architects make rational decisions in complex technology selections.
Opening: Thoughts on the Essence of Traffic Management
Before getting into the technical details, it is necessary to think about the nature of traffic governance. When many teams choose technical solutions, they are often confused by various concepts and marketing terms, and ignore the most fundamental question: What problem does traffic management want to solve?
Industry observations show that the reason why teams choose Service Mesh is often “this is a cloud native best practice” rather than “it can solve our current specific problems”. This way of thinking has led to many failed implementation cases: some teams invested a lot of effort in deploying Istio, but found that the maintenance costs far exceeded the benefits; some teams overturned the existing stable architecture in order to use the Gateway API, and ended up falling into a long migration quagmire.
The core goals of traffic governance can be summarized into three levels:
Connectivity: Ensures reliable communication between services. This includes basic capabilities such as service discovery, load balancing, failover, and network isolation. No matter how the architecture evolves, the connection layer will always be the most basic requirement. Without reliable connections, other governance capabilities are empty talk.
Control: Implement fine-grained control over communication between services. This includes routing rules, traffic splitting, circuit breaking and rate limiting, security policies, and more. The value of the control layer lies in reducing system risk and improving operational flexibility. Note, however, that stronger control capability means higher complexity and cost; a balance must be found between the two.
Observability: Gain complete visibility into inter-service communication. This includes metrics collection, distributed tracing, log correlation, and traffic visualization. Without observability, control strategies cannot be verified for effectiveness or optimized based on data.
These three levels do not exist independently, but are interdependent and mutually reinforcing. An excellent traffic management solution should provide clear, efficient, and maintainable implementation at these three levels.
Looking back at the evolution from 2015 to 2026 (to date), industry practical experience shows a pattern: every shift in technology paradigm is essentially rebalancing the relationship between these three levels. Spring Cloud Gateway concentrates control logic in the application layer, providing maximum flexibility but also bringing maximum coupling; Service Mesh sinks control to the infrastructure layer, achieving application independence but also introducing the complexity of Sidecar; Ambient Mesh tries to find a new balance between the two.
Understanding this essence will help us make rational decisions when selecting technology. It’s not “which technology is the most advanced”, but “which technology is best suited to our current needs and constraints”.
Three paradigm shifts in technological evolution:

- Application layer governance era (2015-2018): Represented by Netflix OSS and Spring Cloud; governance logic is coupled with applications. The advantage is flexibility, the disadvantage is maintenance difficulty.
- Infrastructure layer governance era (2017-2022): Represented by Istio and Service Mesh; governance logic sinks into the Sidecar. The advantage is application independence, the disadvantage is resource overhead.
- Kernel layer governance era (2022-present): Represented by eBPF and Ambient Mesh; governance logic sinks further into the kernel. The advantage is ultimate performance, the disadvantage is platform dependence.
These three paradigm shifts reflect the industry’s continued pursuit of “decoupling” and “efficiency.” Each generation of technology attempts to achieve the same governance capabilities with less coupling and greater efficiency.
But this does not mean that new technology is necessarily better than old technology. As the practice of the enterprise-level CF platform era has proven, Spring Cloud Gateway is still a very effective solution in appropriate scenarios. The key is to understand the applicable boundaries of each technology and make rational choices that meet your own needs.
Figure 1: Traffic management technology evolution timeline (2015-2026, to present)
2. Spring Cloud Gateway Era (2018-2020)
2.0 Industry gateway dilemma and enterprise-level CF platform selection
Before delving into the technical details of Spring Cloud Gateway, it is necessary to first restore the gateway dilemma faced by the industry around 2018 and the actual choices of enterprise-level CF platforms at that time.
Industry Background: Zuul’s Limitations
Around 2018, many enterprises adopting the Spring Cloud Netflix stack began to face performance bottlenecks in their ingress gateways. As the mainstream choice at the time, Zuul 1.x processed requests on a Servlet container (Tomcat), with each request occupying one thread. When backend services responded slowly, the thread pool quickly filled up and new requests could only queue. More seriously, once the thread pool was exhausted, Zuul could not even return a friendly error response; connections simply timed out.
Common optimization attempts in the industry include increasing the thread pool, adjusting timeout parameters, and adding cache layers, but these measures can only alleviate symptoms and cannot fundamentally solve the problem. The limitations of the threading model determine Zuul’s ceiling in high concurrency scenarios.
The actual architecture of an enterprise-grade CF platform
It is worth noting that Zuul was not used as the entry gateway in the era of enterprise-level CF platforms. Ingress traffic on the CF platform was handled jointly by AppRouter (application layer) and Gorouter (platform layer). This layered architecture supported the traffic management needs of hundreds of microservices from 2016 to 2020.
However, as business scale expands and governance requirements become more refined, the industry gradually realizes the limitations of this architecture:
- Insufficient flexibility of platform routing: Gorouter provides only basic round-robin load balancing and does not support routing based on request content (such as by header, or weighted canary releases)
- The need for cross-platform migration: As enterprises begin to explore Kubernetes, there is a need for platform-agnostic gateway solutions
- Ecosystem evolution: Spring Cloud Netflix components (including Zuul) gradually entered maintenance mode, and Spring officially recommended Spring Cloud Gateway as the next-generation gateway
It is against this background that Spring Cloud Gateway (SCG) became the alternative the industry focused on. As the new-generation gateway officially launched by Spring, it is built on Spring Framework 5 and Project Reactor and was positioned from the start as a non-blocking, reactive architecture. More importantly, it integrates deeply with the Spring ecosystem, providing a technical path for migrating from CF to cloud native.
Industry practice of SCG migration: The migration from a traditional gateway to SCG is not smooth sailing; industry practice shows that it usually brings its own set of challenges.
Another key factor in the migration is the shift in maintenance strategy for Spring Cloud Netflix. Netflix gradually reduced its investment in components such as Hystrix, Ribbon, and Zuul after 2018, and the Spring team clearly positioned SCG as the next-generation gateway standard. For enterprise-level platforms that rely on the Spring ecosystem, following the official technical route is a rational choice to reduce long-term maintenance risks.
2.2 Evolution of routing configuration
Judging from the practice in the era of enterprise-level CF platforms, SCG’s routing configuration has evolved from YAML to Java DSL to dynamic configuration.
The YAML configuration phase is the initial attempt. The configuration file is as follows:
spring:
  cloud:
    gateway:
      routes:
        - id: order-service
          uri: lb://order-service
          predicates:
            - Path=/api/orders/**
          filters:
            - StripPrefix=1
            - name: CircuitBreaker
              args:
                name: orderServiceCb
                fallbackUri: forward:/fallback
The advantage of YAML configuration is that it is intuitive and easy to read and suitable for simple scenarios. However, as routing rules increase, YAML files expand rapidly and version management becomes difficult. A more serious problem is that the YAML configuration requires restarting the gateway to take effect, which is unacceptable in a production environment.
Java DSL stage provides stronger type safety and programming capabilities. The industry has developed a routing configuration framework based on Java DSL, which supports complex scenarios such as conditional routing and dynamic weight calculation:
@Bean
public RouteLocator customRouteLocator(RouteLocatorBuilder builder) {
    return builder.routes()
        .route("order-service", r -> r
            .path("/api/orders/**")
            .filters(f -> f
                .stripPrefix(1)
                .circuitBreaker(config -> config
                    .setName("orderServiceCb")
                    .setFallbackUri("forward:/fallback")))
            .uri("lb://order-service"))
        .build();
}
Java DSL allows complex routing logic to be implemented in code, such as dynamic routing decisions based on request characteristics, integration with external configuration centers, etc. However, the disadvantage of this method is that the routing logic is coupled with the code, and modifying the routing requires rebuilding the deployment.
Dynamic Configuration Phase is the solution that is ultimately adopted on a large scale in production environments. SCG supports dynamically refreshing routes through the Actuator endpoint. Combined with Spring Cloud Config and Bus, it can realize centralized management and real-time push of routing configuration.
The industry has built a complete set of routing configuration management processes:
- Routing configuration is stored in Git repository and managed by environment branch.
- Configuration changes are reviewed through the PR process
- After passing the review, it will be automatically pushed to the Config Server.
- The gateway cluster receives change notifications through /bus/refresh
- Routes are hot updated at runtime without restarting
This process has become a standard practice for enterprise-level microservice platforms after 2019, supporting the routing management of hundreds of microservices.
2.3 Performance Tuning Practice
SCG performance tuning is a systematic project, involving multiple levels such as Netty thread model, reactive programming, and connection pool management. Judging from the practice in the era of enterprise-level CF platforms, through a series of optimizations, SCG’s throughput has been increased by about 40%, and latency has been reduced by 30%.
Netty thread model optimization is the foundation. Netty uses EventLoopGroup to handle I/O events, and the default configuration is not always optimal. In high-concurrency scenarios in the era of enterprise-level CF platforms, the technical team found that the default thread number configuration had bottlenecks on specific hardware. After testing and adjustment, the number of worker threads is set to a fixed multiple of the number of CPU cores instead of using the default dynamic calculation:
spring:
  cloud:
    gateway:
      httpclient:
        pool:
          type: elastic
          max-size: 1000
          max-life-time: 30s
        connect-timeout: 5000
        response-timeout: 30s

server:
  netty:
    connection-timeout: 2s
    worker:
      max-threads: 32
The configuration of the connection pool is particularly critical. SCG uses a fixed-size connection pool by default, but in a microservice architecture, the number of backend instances changes dynamically, and a fixed pool size may lead to insufficient connections or waste of resources. The industry eventually adopted elastic pools, which automatically adjust the number of connections based on actual load and connection usage. This adaptive mechanism performs well in scenarios with severe traffic fluctuations.
Reactive Programming Optimization involves fine-grained control over Project Reactor flow. SCG’s filter chain is built on Reactor, and improper stream operations may cause thread blocking or backpressure issues.
A typical performance pitfall is logging. The original implementation used doOnNext to record each request and response synchronously, which caused severe log-write contention under high concurrency. The optimized solution uses an asynchronous log queue: entries are first written to an in-memory queue and then flushed to disk in batches by a dedicated thread:
@Slf4j
@Component
public class AsyncLoggingFilter implements GlobalFilter, Ordered {

    private final BlockingQueue<LogEntry> logQueue = new LinkedBlockingQueue<>(10000);

    @PostConstruct
    public void init() {
        // A single background thread flushes queued entries in batches
        Executors.newSingleThreadScheduledExecutor()
            .scheduleAtFixedRate(this::flushLogs, 0, 100, TimeUnit.MILLISECONDS);
    }

    @Override
    public Mono<Void> filter(ServerWebExchange exchange, GatewayFilterChain chain) {
        long startTime = System.currentTimeMillis();
        return chain.filter(exchange)
            .doFinally(signal -> {
                LogEntry entry = new LogEntry(
                    exchange.getRequest().getPath().value(),
                    exchange.getResponse().getStatusCode().value(),
                    System.currentTimeMillis() - startTime
                );
                logQueue.offer(entry); // non-blocking enqueue; drops entries when the queue is full
            });
    }

    private void flushLogs() {
        List<LogEntry> batch = new ArrayList<>();
        logQueue.drainTo(batch);
        batch.forEach(entry -> log.info("access: {}", entry));
    }

    @Override
    public int getOrder() {
        return Ordered.LOWEST_PRECEDENCE;
    }
}
Memory management optimization is another key area. Improper use of Reactor streams can lead to memory leaks or OOM. Industry practical experience summarizes several principles:
- Avoid caching the entire response body: Do not buffer the entire response body into memory in a filter unless necessary. Scenarios such as large file downloads should be streamed.
- Release DataBuffer promptly: SCG uses Netty's DataBuffer to process the request body; it must be explicitly released after use, otherwise memory leaks result.
- Control concurrency: When using Reactor operators such as flatMap and concatMap, limit the concurrency to avoid memory overflow caused by processing too many requests at the same time.
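The last principle corresponds to the optional concurrency argument of Reactor's `flatMap(mapper, concurrency)`. As a rough, dependency-free illustration of the same idea (hypothetical names, no Reactor involved), the following plain-JDK sketch caps in-flight work with a semaphore and verifies that the bound holds:

```java
import java.util.List;
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative sketch: bounding in-flight work the way Reactor's
// flatMap(mapper, concurrency) does, using a plain-JDK semaphore.
public class BoundedConcurrency {

    public static int processAll(List<Integer> requests, int concurrency)
            throws InterruptedException {
        Semaphore permits = new Semaphore(concurrency);
        AtomicInteger inFlight = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger(); // highest observed in-flight count
        ExecutorService pool = Executors.newCachedThreadPool();
        CountDownLatch done = new CountDownLatch(requests.size());
        for (int r : requests) {
            permits.acquire(); // blocks once `concurrency` tasks are in flight
            pool.submit(() -> {
                try {
                    int now = inFlight.incrementAndGet();
                    peak.accumulateAndGet(now, Math::max);
                    Thread.sleep(5); // simulated downstream call
                } catch (InterruptedException ignored) {
                } finally {
                    inFlight.decrementAndGet();
                    permits.release();
                    done.countDown();
                }
            });
        }
        done.await();
        pool.shutdown();
        return peak.get(); // never exceeds `concurrency`
    }
}
```

Because a permit must be acquired before each task is submitted, the peak number of simultaneously executing tasks can never exceed the configured bound, which is exactly the guarantee the `concurrency` parameter gives a gateway filter chain.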
SSL/TLS performance optimization: HTTPS traffic accounts for the majority of gateway traffic, and the performance of the SSL/TLS handshake is crucial. Common practices in the industry include:
- Session Reuse: Enable SSL session reuse to reduce repeated handshake overhead
- Hardware Acceleration: Accelerates encryption using the AES-NI instruction set on supported hardware
- Certificate Cache: Cache the parsed certificate to avoid repeated parsing
- OCSP Stapling: Enable OCSP Stapling to reduce client certificate verification time
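As a rough illustration of the first point, the JDK exposes session-cache knobs directly on SSLSessionContext; Netty/SCG deployments configure the equivalent through their own SSL context builders. The cache size and timeout below are illustrative values, not recommendations:

```java
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLSessionContext;

// Sketch: enabling server-side TLS session reuse with the plain JDK API.
// A cached session lets a returning client resume without a full handshake.
public class SslSessionReuse {

    public static SSLSessionContext configure() throws Exception {
        SSLContext ctx = SSLContext.getInstance("TLS");
        ctx.init(null, null, null); // default key/trust material for illustration

        SSLSessionContext serverSessions = ctx.getServerSessionContext();
        serverSessions.setSessionCacheSize(20_000); // sessions kept available for resumption
        serverSessions.setSessionTimeout(300);      // seconds a session stays resumable
        return serverSessions;
    }
}
```

Sizing the cache is a trade-off: too small and busy clients fall back to full handshakes, too large and memory is wasted on sessions that never resume.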
2.4 Gateway practices and pitfalls within the enterprise-level CF platform
The enterprise-level CF platform has accumulated a lot of practical experience during the implementation of SCG and has also encountered many pitfalls. These experiences and lessons have important reference value for other enterprises to select gateways.
Gateway high-availability architecture design is the area where the most energy is invested in enterprise-level implementation. SCG serves as a traffic entrance, and any failure will directly affect business availability. A common practice in the industry is to adopt a multi-layer protection strategy:
- L4 load balancing layer: Use F5/AWS NLB to distribute traffic in front of the gateway cluster to achieve high availability across availability zones
- Gateway cluster layer: Deploy multiple SCG instances and maintain instance availability through platform routing, internal DNS and health check mechanisms
- Health check mechanism: Customize the health check endpoint, not only check the gateway itself, but also check key dependencies (such as configuration center, registration center)
- Circuit breaker downgrade strategy: Configure circuit breaker rules at the gateway layer to quickly fail when the backend service fails to prevent cascading failures.
This architecture withstood a data center network failure in 2019. At that time, the network in one availability zone was interrupted, and L4 load balancing automatically switched traffic to other availability zones. The entire switching process was completely transparent to the business.
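The "deep" health check idea from the list above can be sketched without Spring as a composite of dependency probes; in a real SCG deployment this would be a custom Spring Boot HealthIndicator, and the probe names here are hypothetical:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.BooleanSupplier;

// Dependency-free sketch of a "deep" health check: the endpoint reports
// DOWN if the gateway itself or any critical dependency probe fails.
public class DeepHealthCheck {

    private final Map<String, BooleanSupplier> probes = new LinkedHashMap<>();

    public DeepHealthCheck register(String name, BooleanSupplier probe) {
        probes.put(name, probe);
        return this;
    }

    /** Returns "UP" only when every registered dependency probe passes. */
    public String status() {
        return probes.values().stream()
            .allMatch(BooleanSupplier::getAsBoolean) ? "UP" : "DOWN";
    }
}
```

Wiring it up might look like `new DeepHealthCheck().register("configServer", () -> pingConfigServer()).register("registry", () -> pingRegistry())`, so that the L4 load balancer pulls an instance out of rotation as soon as a critical dependency becomes unreachable.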
Long connection handling is another area that requires special attention. Some enterprise applications on the enterprise-level CF platform use WebSocket for real-time notification push. SCG's WebSocket support had problems in early versions, especially in combination with the load balancer's health check mechanism, where connections could be disconnected unexpectedly.
After investigation, the technical team found that the root cause of the problem was that the default timeout configuration of SCG did not match the idle connection detection mechanism of the load balancer. When the connection idle time exceeds the load balancer’s threshold, the load balancer will disconnect, but SCG does not sense this, causing subsequent message sending to fail.
The solution is to customize the WebSocket proxy filter and add heartbeat detection and graceful shutdown logic:
@Component
public class WebSocketStabilityFilter implements GlobalFilter {

    @Override
    public Mono<Void> filter(ServerWebExchange exchange, GatewayFilterChain chain) {
        if (isWebSocketRequest(exchange)) {
            // Mark the exchange so downstream handling can apply
            // heartbeat detection and graceful-close logic
            exchange.getAttributes().put("websocket.connection", true);
        }
        return chain.filter(exchange);
    }

    private boolean isWebSocketRequest(ServerWebExchange exchange) {
        return "websocket".equalsIgnoreCase(
            exchange.getRequest().getHeaders().getUpgrade());
    }
}
Configuring the atomicity of hot updates is a hidden problem. SCG’s route refresh is non-atomic. During the configuration change process, there may be an intermediate state in which some routes have been updated and some have not. This may lead to inconsistent traffic distribution in canary release scenarios.
A common solution in the industry is to use versioned routing configuration to ensure that each refresh is a complete configuration replacement:
@Component
public class AtomicRouteRefresher {

    private volatile RouteDefinitionLocator currentLocator;

    public void refreshRoutes(List<RouteDefinition> newRoutes) {
        // Build a complete new locator from the full route set
        RouteDefinitionLocator newLocator = createLocator(newRoutes);
        // Atomic reference swap: readers see either the old or the new
        // route set in full, never a partially updated mixture
        this.currentLocator = newLocator;
    }
}
Rate limiting and anti-abuse protection are core capabilities of enterprise-level gateways. Such gateways handle traffic from the Internet, so abuse protection and rate limiting are mandatory. SCG provides the RequestRateLimiter filter to implement distributed rate limiting based on Redis:
spring:
  cloud:
    gateway:
      routes:
        - id: rate_limited_route
          uri: http://backend
          predicates:
            - Path=/api/**
          filters:
            - name: RequestRateLimiter
              args:
                redis-rate-limiter.replenishRate: 10
                redis-rate-limiter.burstCapacity: 20
                key-resolver: "#{@userKeyResolver}"
In actual use, different rate limiting thresholds are set according to the user's subscription tier: 100 QPS for free users, 1,000 QPS for paid users, and unlimited for enterprise users. This tiered rate limiting strategy both protects the system and meets the needs of different user groups.
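A minimal sketch of this tiered policy, assuming the tier is resolved per user: in a real SCG deployment the key comes from a KeyResolver and the counters live in Redis, so this in-memory fixed-window version only illustrates the policy, and the tier names and limits are illustrative:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of tier-aware rate limiting with a fixed window. Tiers absent
// from the table (e.g. "enterprise") are treated as unlimited.
public class TieredRateLimiter {

    private static final Map<String, Integer> CAPACITY = Map.of(
        "free", 100,
        "paid", 1000);

    private final Map<String, Integer> used = new ConcurrentHashMap<>();

    /** Returns true if the request fits this user's per-window quota. */
    public synchronized boolean tryAcquire(String userId, String tier) {
        Integer cap = CAPACITY.get(tier);
        if (cap == null) return true; // unlimited tier
        int n = used.merge(userId, 1, Integer::sum);
        return n <= cap;
    }

    /** Called at each window boundary (e.g. every second) to refill quotas. */
    public synchronized void resetWindow() {
        used.clear();
    }
}
```

The Redis-backed RequestRateLimiter uses a token bucket (replenishRate/burstCapacity) rather than a fixed window, but the tier-to-threshold mapping works the same way.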
Request body caching and modification is another common requirement. The default behavior of SCG is to stream the request body, which can cause problems in some scenarios - for example, the request body needs to be signed for verification or modified before forwarding.
The industry has implemented a request body cache filter, which reads the request body completely into memory before performing subsequent processing:
@Component
public class RequestBodyCacheFilter implements GlobalFilter, Ordered {

    @Override
    public Mono<Void> filter(ServerWebExchange exchange, GatewayFilterChain chain) {
        return DataBufferUtils.join(exchange.getRequest().getBody())
            .flatMap(dataBuffer -> {
                byte[] bytes = new byte[dataBuffer.readableByteCount()];
                dataBuffer.read(bytes);
                DataBufferUtils.release(dataBuffer);
                // Cache the request body on the exchange for later filters
                exchange.getAttributes().put("cachedRequestBody", bytes);
                // Decorate the request so the body can be read again downstream
                ServerHttpRequest mutatedRequest = new ServerHttpRequestDecorator(
                        exchange.getRequest()) {
                    @Override
                    public Flux<DataBuffer> getBody() {
                        return Flux.just(exchange.getResponse()
                            .bufferFactory().wrap(bytes));
                    }
                };
                return chain.filter(exchange.mutate().request(mutatedRequest).build());
            });
    }

    @Override
    public int getOrder() {
        return Ordered.HIGHEST_PRECEDENCE;
    }
}
Although this filter solves the problem, it also introduces memory overhead: a large request body may cause OOM. In production environments, industry practice is therefore to enforce a request body size limit and directly reject requests that exceed the threshold.
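A sketch of that guard, checking the declared Content-Length before any buffering happens. In SCG the same effect can be achieved by rejecting early in the cache filter; the status codes follow standard HTTP semantics and the threshold is illustrative:

```java
// Sketch: reject oversized request bodies before buffering them.
public class BodySizeGuard {

    private final long maxBytes;

    public BodySizeGuard(long maxBytes) {
        this.maxBytes = maxBytes;
    }

    /**
     * Maps a declared Content-Length to an HTTP status:
     * 411 when no length is declared (we refuse to buffer blindly),
     * 413 when it exceeds the cap, 200 (pass-through) otherwise.
     */
    public int check(long contentLength) {
        if (contentLength < 0) return 411; // Length Required
        return contentLength > maxBytes ? 413 : 200; // Payload Too Large
    }
}
```

Requests without a declared length (chunked uploads) can alternatively be buffered with a hard cap and aborted mid-stream, but rejecting them up front is the simpler policy.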
3. Limitations and breakthroughs of Kubernetes Ingress
3.1 Fragmentation of Ingress resources and vendor lock-in
After 2020, I participated in multiple microservice projects in Kubernetes environments, where Ingress became the traffic management component I dealt with most. However, the actual experience of using Ingress fell far short of expectations.
The problem with the Ingress API is that it is too simple. It defines only the most basic Host-Path to Service mapping, with no unified standard for advanced functions (such as TLS configuration, traffic splitting, rewrite rules, and rate limiting). The result is that each Ingress Controller has its own annotation extensions, and implementations vary greatly between vendors.
Taking canary release as an example, the NGINX Ingress Controller uses the nginx.ingress.kubernetes.io/canary series of annotations:
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  annotations:
    nginx.ingress.kubernetes.io/canary: "true"
    nginx.ingress.kubernetes.io/canary-weight: "10"
Traefik uses a completely different mechanism:
apiVersion: traefik.io/v1alpha1
kind: TraefikService
metadata:
  name: canary-service
spec:
  weighted:
    services:
      - name: service-v1
        weight: 90
      - name: service-v2
        weight: 10
The consequence of this difference is severe vendor lock-in. Once an Ingress Controller is selected, the routing configuration is deeply bound to the implementation, and the migration cost is extremely high.
Industry observations show that in a project in 2021, the customer initially used NGINX Ingress Controller, and later hoped to migrate to Istio for more refined traffic management. It turned out that all Ingress resources needed to be rewritten, and the testing workload was huge, so the migration plan had to be abandoned.
3.2 Complexity of multi-cluster routing
Another limitation of Ingress is the single-cluster perspective. Enterprise-level microservices are often deployed in multiple clusters (multi-region, multi-AZ, hybrid cloud), and traffic routing across clusters requires additional solutions.
The design of Ingress resources does not consider cross-cluster scenarios. Each Ingress only manages the routing of its cluster. When cross-cluster traffic scheduling needs to be implemented, additional components and mechanisms must be introduced.
A common approach is to introduce global load balancing (GSLB) or multi-cluster Ingress Controller, but these solutions increase the complexity of the architecture and the burden of operation and maintenance.
From the perspective of technological evolution, in a financial project in 2022, the following architecture was used to implement multi-cluster routing:
- Global layer: F5 GSLB distributes traffic to different regions based on geographical location and health status
- Cluster layer: Each cluster deploys NGINX Ingress Controller to handle north-south traffic.
- Service Layer: Istio Service Mesh handles east-west traffic and cross-cluster service discovery
The problem with this architecture is that there are too many layers and scattered configurations. A routing change may involve three-layer configuration, and you need to jump between multiple systems when troubleshooting. To make matters more troubling, each layer has its own configuration format and tools, requiring teams to master multiple skills.
Multi-tenant isolation is another pain point of Ingress. In large enterprise environments, different teams share the same Kubernetes cluster and need to manage their own routing rules independently. But the permission model of the Ingress resource is too simple: it is a namespace-level resource, granting either full control or none.
This results in teams having to coordinate over shared Ingress resources, where a misconfiguration by any one party may affect the others. A case from 2021 illustrates this: Team A mistakenly configured a wildcard route and accidentally intercepted traffic that should have been routed to Team B's service, causing a production failure.
Limitations of Ingress-NGINX: Even the most popular Ingress-NGINX Controller has some problems that are hard to solve.

- Configuration hot-update delay: NGINX requires a reload to pick up new configuration. In large clusters this can take several seconds, during which some requests may fail.
- Limited dynamic upstream support: Although NGINX supports dynamic DNS resolution, backend changes to a Kubernetes Service require an additional controller to synchronize.
- Lack of advanced functions: For example, complex content-based routing and dynamic weight adjustment must be implemented with Lua scripts or third-party modules.
Industry observations show that in a project in 2021, a customer used Ingress-NGINX to process approximately 5,000 QPS of traffic. Whenever a configuration update occurred, there was a brief service interruption. A variety of optimizations were tried, including reducing the number of worker processes and using worker_shutdown_timeout, but none completely eliminated the interruptions.
In the end, the enterprise-level implementation had to migrate key routes to Envoy, which supports dynamic configuration, and the problem was solved. This case shows that although Ingress is simple, it may not meet high availability requirements in large-scale production environments.
The design of the Gateway API fully considers multi-tenant needs and achieves true isolation through its hierarchical resource model.
3.3 Paradigm shift from Ingress to Gateway API
The original design intention of the Gateway API is to solve these problems of Ingress. It draws experience from the traffic management model of Service Mesh and introduces richer resource abstraction.
Design principles of Gateway API:
- Role-oriented: Divide the configuration into resources (GatewayClass, Gateway, HTTPRoute) for different roles. Each role only needs to pay attention to its own configuration.
- Portability: Core functions are standardized, and specific functions are supported through extension points
- Extensibility: Support new protocols and new functions through extension points
- Declarative: Adopt Kubernetes’ native declarative configuration style
The core innovation of Gateway API is the layered resource model:
- GatewayClass: Defines the gateway implementation type, similar to StorageClass
- Gateway: Defines the gateway instance, corresponding to the actual load balancer or proxy
- HTTPRoute/TCPRoute/UDPRoute/GRPCRoute/TLSRoute: Define specific routing rules
This layering has several significant advantages:
Separation of roles: The infrastructure team manages GatewayClass and Gateway, while application teams manage Route resources. This separation is particularly important in multi-tenant environments.
Implementation-independent: Route resources are not bound to a specific implementation and can be migrated between different Gateway implementations.
Expression ability: HTTPRoute supports advanced functions such as complex route matching, traffic segmentation, rewriting, redirection, timeout, and retry.
Taking canary release as an example, the standard way of writing Gateway API is:
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: order-service-canary
spec:
  parentRefs:
    - name: production-gateway
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /api/orders
      backendRefs:
        - name: order-service-v1
          port: 8080
          weight: 90
        - name: order-service-v2
          port: 8080
          weight: 10
This configuration does not rely on any implementation-specific annotations and can work on any Controller that supports the Gateway API.
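To make the weight semantics concrete, here is a sketch of how a controller might turn backendRefs weights like the 90/10 split above into per-request backend choices. The class and method names are illustrative, not taken from any real controller implementation:

```java
import java.util.Random;

// Illustrative weighted backend selection: each request rolls a number
// in [0, totalWeight) and picks the backend whose cumulative-weight
// bucket it falls into.
public class WeightedBackendPicker {

    private final String[] backends;
    private final int[] cumulative; // cumulative weights for a linear scan
    private final int total;
    private final Random rng;

    public WeightedBackendPicker(String[] backends, int[] weights, long seed) {
        this.backends = backends;
        this.cumulative = new int[weights.length];
        int sum = 0;
        for (int i = 0; i < weights.length; i++) {
            sum += weights[i];
            cumulative[i] = sum;
        }
        this.total = sum;
        this.rng = new Random(seed); // seeded for reproducible illustration
    }

    public String pick() {
        int roll = rng.nextInt(total); // uniform in [0, total)
        for (int i = 0; i < cumulative.length; i++) {
            if (roll < cumulative[i]) return backends[i];
        }
        throw new IllegalStateException("unreachable");
    }
}
```

Over many requests the observed split converges to the declared weights, which is the behavioral contract a conformant Gateway API implementation must honor regardless of the algorithm it uses internally.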
Protocols supported by Gateway API: Gateway API not only supports HTTP, but also supports multiple protocols:
- HTTPRoute: routing for the HTTP/1.1 and HTTP/2 protocols
- GRPCRoute: routing for the gRPC protocol
- TCPRoute: routing for TCP, used for non-HTTP traffic
- UDPRoute: routing for UDP
- TLSRoute: SNI-based routing for TLS traffic passed through without decryption
This multi-protocol support allows Gateway API to replace multiple dedicated routing solutions and achieve a unified traffic management entrance.
4. Gateway API standardization (2022-present)
4.0 Background and motivation of the birth of Gateway API
Before diving into the technical details of the Gateway API, you need to understand the background of its birth. From the release of Kubernetes Ingress in 2015 to the release of the Gateway API beta version in 2022, what were the industry dilemmas in the past seven years?
The first Kubernetes project I participated in after 2020 ran into the problems of Ingress. The customer used the NGINX Ingress Controller, and we needed to implement a canary release scenario: route 10% of the traffic to the new version. In NGINX Ingress this requirement has to be configured like this:
```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  annotations:
    nginx.ingress.kubernetes.io/canary: "true"
    nginx.ingress.kubernetes.io/canary-weight: "10"
```
The problem is that this configuration is completely bound to the NGINX implementation. If customers want to migrate to Traefik or other Controllers in the future, this configuration needs to be completely rewritten. What’s more troublesome is that the annotation syntax of different Controllers varies greatly, and some functions do not exist in some Controllers at all.
In another project in 2021, the customer had multiple teams sharing the same Kubernetes cluster. Team A wants to configure its own routing rules, and Team B also wants to have the same permissions. But the permission model of Ingress is too simple - either you have permission to operate the entire Ingress resource, or you don’t. This results in teams having to coordinate and share Ingress resources, and misconfiguration by any party may affect other teams.
Gateway API was designed to solve these problems. Its core design principle is: standardize the core capabilities, and provide extension points for implementation-specific ones.
Gateway API is developed by Kubernetes SIG-Network, with participants including Google, IBM, Red Hat, Kong, NGINX and other companies. This broad industry participation ensures that the Gateway API can meet the needs of different scenarios while maintaining implementation-agnostic commonality.
The development history of Gateway API reflects the cloud native community’s emphasis on standardization. In July 2022, Gateway API released version v0.5.0, marking the beginning of the Beta phase; in October 2023, version v1.0 was officially released, and the core API reached GA status; in 2024, the ecosystem of Gateway API expanded rapidly, and mainstream Ingress Controllers added support for Gateway API. By 2026, the retirement signal of Ingress-NGINX further strengthens the migration direction: the long-term focus of ingress management is shifting from implementing private annotations to the standard resource model of the Gateway API.
Main implementation of Gateway API:
| Implementation | Vendor | Features | Maturity |
|---|---|---|---|
| Envoy Gateway | CNCF | Official reference implementation, complete functions | GA |
| NGINX Gateway Fabric | F5/NGINX | Mature business support | GA |
| Istio Gateway | Istio | Integrate with Service Mesh | GA |
| Cilium Gateway | Isovalent | Based on eBPF, high performance | Beta |
| Kong Gateway | Kong | Rich API management functions | GA |
When enterprises choose Gateway API implementation, they should consider factors such as functional requirements, performance requirements, vendor support, and community activity.
4.1 Detailed explanation of layered architecture
The three-layer architecture of Gateway API (GatewayClass → Gateway → HTTPRoute) is the core of its design.
GatewayClass defines the types of gateways available in the cluster. A cluster can deploy multiple GatewayClasses, corresponding to different implementations (NGINX, Envoy, Istio, etc.) or different configuration templates (production level, test level):
```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: GatewayClass
metadata:
  name: production-gateway
spec:
  controllerName: gateway.envoyproxy.io/gateway-controller
  parametersRef:
    group: gateway.envoyproxy.io
    kind: EnvoyProxy
    name: production-config
```
Gateway is an instantiation of GatewayClass, corresponding to the actual load balancer or proxy. It defines the listening port, protocol, TLS configuration, etc.:
```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: production-gateway
  namespace: infrastructure
spec:
  gatewayClassName: production-gateway
  listeners:
    - name: https
      protocol: HTTPS
      port: 443
      tls:
        mode: Terminate
        certificateRefs:
          - kind: Secret
            name: production-cert
    - name: http
      protocol: HTTP
      port: 80
```
Note that Gateway is usually managed by the infrastructure team and placed in a dedicated namespace (such as infrastructure).
HTTPRoute defines specific routing rules and is managed by the application team:
```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: order-service
  namespace: order-team
spec:
  parentRefs:
    - name: production-gateway
      namespace: infrastructure
  hostnames:
    - "orders.example.com"
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /api/orders
      filters:
        - type: URLRewrite
          urlRewrite:
            path:
              type: ReplacePrefixMatch
              replacePrefixMatch: /orders
      backendRefs:
        - name: order-service
          port: 8080
```
Advanced features of HTTPRoute: Gateway API’s HTTPRoute supports rich routing functions, beyond the capabilities of Ingress.
Multiple matching conditions: HTTPRoute supports combined matching based on multiple conditions such as Header, Query parameters, Method, Path, etc.:
```yaml
rules:
  - matches:
      - path:
          type: PathPrefix
          value: /api/orders
        headers:
          - name: x-api-version
            value: v2
        method: GET
```
Traffic mirroring: HTTPRoute can send a copy of requests to an additional backend for shadow testing or data collection:
```yaml
rules:
  - filters:
      - type: RequestMirror
        requestMirror:
          backendRef:
            name: order-service-v2
            port: 8080
    backendRefs:
      - name: order-service-v1
        port: 8080
```
URL rewriting and redirection: HTTPRoute supports a variety of URL operations, including path rewriting, host rewriting, redirection, etc.:
```yaml
filters:
  - type: URLRewrite
    urlRewrite:
      hostname: new-api.example.com
      path:
        type: ReplaceFullPath
        replaceFullPath: /v2/orders
```
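The paragraph above also mentions redirection; a RequestRedirect filter might look like the following sketch (the hostname is illustrative):

```yaml
filters:
  - type: RequestRedirect
    requestRedirect:
      scheme: https               # e.g. force HTTP to HTTPS
      hostname: api.example.com   # illustrative hostname
      statusCode: 301
```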
Cross-origin headers: HTTPRoute can set CORS response headers through its standard header-modification filter, without relying on gateway-specific annotations:

```yaml
filters:
  - type: ResponseHeaderModifier
    responseHeaderModifier:
      add:
        - name: Access-Control-Allow-Origin
          value: "*"
```
4.2 Multi-tenant routing design
Gateway API supports multi-tenancy through namespace isolation and ReferenceGrant.
Cross-namespace routing: HTTPRoute can reference Gateways in other namespaces as parent resources to achieve the separation of application team management routing and infrastructure team management gateways:
```yaml
# infrastructure namespace: the Gateway lives here
gateway:
  namespace: infrastructure
# application team namespace: the HTTPRoute lives here
httpRoute:
  namespace: order-team
  parentRefs:
    - name: production-gateway
      namespace: infrastructure   # cross-namespace reference
```
Cross-namespace references are permission-controlled. On the Gateway side, the listener's `allowedRoutes` field declares which namespaces' routes may attach; in addition, ReferenceGrant lets the administrator of a namespace explicitly allow incoming cross-namespace references, so teams cannot reach into resources they do not own:
```yaml
apiVersion: gateway.networking.k8s.io/v1beta1
kind: ReferenceGrant
metadata:
  name: allow-order-team
  namespace: infrastructure
spec:
  from:
    - group: gateway.networking.k8s.io
      kind: HTTPRoute
      namespace: order-team
  to:
    - group: gateway.networking.k8s.io
      kind: Gateway
```
This design achieves true multi-tenant isolation: different teams manage their own routes independently, while the infrastructure team controls gateway resources and access permissions.
Considerations in actual deployment: When using ReferenceGrant in a production environment, you need to pay attention to the permission management strategy. The industry recommends adopting a “whitelist” model - all cross-namespace bindings are denied by default, and only explicitly authorized teams can bind to Gateway. Although this mode increases the configuration workload, it is more secure.
In a project in 2023, the technical team established an automated ReferenceGrant management process: when a new team is onboarding, a ReferenceGrant is automatically created through the GitOps process; it is automatically cleaned up when the team goes offline. This automation reduces human error and increases efficiency.
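ReferenceGrant also covers cross-namespace backend references, such as a route in one namespace pointing at a Service in another. A sketch, in which the namespace names are illustrative:

```yaml
# Allow HTTPRoutes in order-team to reference Services in shared-backends
apiVersion: gateway.networking.k8s.io/v1beta1
kind: ReferenceGrant
metadata:
  name: allow-order-team-backends
  namespace: shared-backends      # illustrative namespace
spec:
  from:
    - group: gateway.networking.k8s.io
      kind: HTTPRoute
      namespace: order-team
  to:
    - group: ""                   # core API group
      kind: Service
```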
Multi-cluster Gateway API: Gateway API also supports multi-cluster scenarios. Through the multi-cluster extension of Gateway API, cross-cluster routing management can be achieved.
```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: multi-cluster-gateway
spec:
  gatewayClassName: multi-cluster-gateway-class
  listeners:
    - name: https
      protocol: HTTPS
      port: 443
  addresses:
    - type: NamedAddress
      value: cluster-a-gateway
    - type: NamedAddress
      value: cluster-b-gateway
```
This multi-cluster capability allows the Gateway API to serve as a unified entry layer to distribute traffic to different clusters. This provides a standardized solution for hybrid cloud, multi-region deployments.
Gateway API extension mechanism: Gateway API supports extension functions through the Policy Attachment mechanism. Implementation vendors can provide specific functionality through custom resources (CRDs) while maintaining the commonality of the core API.
```yaml
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: RateLimitFilter
metadata:
  name: ratelimit-policy
spec:
  type: Global
  global:
    rules:
      - limit:
          requests: 1000
          unit: Second
```
This extension mechanism balances standardization and flexibility, allowing vendors to provide differentiated features on top of the standard.
4.3 Comparison with SCG: When to use Gateway API and when to keep SCG
Gateway API and Spring Cloud Gateway are not mutually exclusive, they solve problems at different levels.
Spring Cloud Gateway is suitable for:
- Routing that requires complex business logic (such as dynamic routing decisions based on request content)
- Scenarios of deep integration of Spring ecosystem
- Scenarios that require Java code to implement custom filters
- Incremental transformation of legacy systems
Gateway API is suitable for:
- Standardized traffic management for cloud-native environments
- Route isolation in multi-tenant scenarios
- Scenarios that need to be integrated with Service Mesh
- Cross-cluster traffic management
From a technology evolution perspective, a hybrid architecture project in 2023 adopted a layered strategy:
- L7 Business Gateway: Spring Cloud Gateway handles routing that requires complex business logic (such as tenant-aware routing and personalized URL rewriting)
- L4-L7 Platform Gateway: Gateway API handles standard north-south traffic management
- Service Mesh: Istio handles east-west traffic management
This layered architecture allows different technology stacks to perform their own duties, avoiding the limitations of “one size fits all”.
In actual deployment, this layered architecture needs to solve several key issues:
Context propagation: when a request flows from SCG through the Gateway API layer into the Service Mesh, contextual information such as the Trace ID must be propagated correctly. The Trace ID is injected at the SCG layer, carried to the Gateway API layer in an HTTP header, and then forwarded into the mesh, preserving observability across the whole call chain.
Coordination of timeout configuration: Timeout configuration at different levels needs to be coordinated. For example, the timeout of the SCG layer should be greater than the Gateway API layer, and the timeout of the Gateway API layer should be greater than the service call timeout inside the Mesh. This “funnel-style” timeout configuration can prevent outer layer timeouts from causing waste of inner layer resources.
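The funnel can be sketched with illustrative values (10s at SCG, 7s at the Gateway API layer, 5s inside the mesh); the resource names are hypothetical, and the HTTPRoute `timeouts` field requires a recent Gateway API version:

```yaml
# Layer 1: Spring Cloud Gateway (application.yml), outermost timeout
spring:
  cloud:
    gateway:
      httpclient:
        response-timeout: 10s
---
# Layer 2: Gateway API HTTPRoute, middle timeout
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: order-service
spec:
  rules:
    - timeouts:
        request: 7s
      backendRefs:
        - name: order-service
          port: 8080
---
# Layer 3: Istio VirtualService, innermost service-call timeout
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: order-service
spec:
  hosts:
    - order-service
  http:
    - timeout: 5s
      route:
        - destination:
            host: order-service
```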
Unification of error handling: The error response format of each layer needs to be unified. The industry usually defines a standard error response format at the SCG layer. Errors from each lower-layer agent are converted into this format and then returned to the client to ensure that the client can handle errors consistently.
5. Traffic management of Service Mesh
5.0 Background of the rise of Service Mesh
The rise of the service mesh concept is closely related to the growth in complexity of microservice architecture. Around 2017, many companies that adopted microservices began to face a common problem: the governance logic of communication between services was scattered in the code of each service, resulting in serious code duplication and operation and maintenance difficulties.
In industry practice, a typical project in 2018 had more than 40 microservices, and each service integrated Hystrix (circuit breaking), Ribbon (load balancing), and Zipkin (distributed tracing). The configurations of these components were scattered across dozens of repositories, and a version upgrade meant modifying more than 40 projects. Worse, configurations were inconsistent across services: some set the circuit breaker threshold to 50%, some to 80%, and some had no configuration at all. This inconsistency made behavior during failures unpredictable and troubleshooting extremely difficult.
The core value proposition of Service Mesh is: Separate the governance of communication between services from the application code and move it down to the infrastructure layer for unified processing. In this way, application developers only need to focus on business logic, and governance logic is managed by the platform team.
Istio released version 0.1 in 2017 and reached 1.0 GA in mid-2018, quickly becoming the de facto standard in the service mesh field. However, Istio adoption has not been smooth sailing: industry practice has seen both its strengths and its challenges across different projects.
Istio’s ecosystem: Istio is not only an independent project, but also an ecosystem. A rich tool chain and integration solutions have been formed around Istio:
- Flagger: Automated progressive delivery
- Kiali: Service mesh visualization
- Jaeger/Zipkin: Distributed Tracing
- Prometheus/Grafana: monitoring and alerting
- Argo Rollouts: Progressive deployment
The integration of these tools with Istio greatly enhances the governance capabilities of Service Mesh.
Comparison between Istio and other Service Mesh:
| Characteristic | Istio | Linkerd | Consul Connect | Cilium |
|---|---|---|---|---|
| Data plane | Envoy | linkerd2-proxy | Envoy | eBPF/Envoy |
| Control plane | Istiod | Linkerd control plane | Consul Server | Cilium Operator |
| Performance | Medium | High | Medium | Very high |
| Resource overhead | High | Low | Medium | Very low |
| Feature richness | Very high | Medium | Medium | Medium |
| Maturity | Very high | High | Medium | Medium |
When enterprises choose Service Mesh, they should comprehensively consider factors such as performance requirements, functional requirements, and resource constraints.
5.1 Istio’s traffic management model
Service Mesh sinks traffic management from the application layer to the infrastructure layer, and Istio is a representative implementation in this field. Istio’s traffic management is based on two core resources: VirtualService and DestinationRule.
VirtualService defines traffic routing rules:
```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: order-service
spec:
  hosts:
    - order-service
  http:
    - match:
        - headers:
            x-canary:
              exact: "true"
      route:
        - destination:
            host: order-service
            subset: v2
          weight: 100
    - route:
        - destination:
            host: order-service
            subset: v1
          weight: 90
        - destination:
            host: order-service
            subset: v2
          weight: 10
```
DestinationRule defines service subsets and traffic policies:
```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: order-service
spec:
  host: order-service
  subsets:
    - name: v1
      labels:
        version: v1.0
    - name: v2
      labels:
        version: v2.0
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100
      http:
        http1MaxPendingRequests: 50
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 30s
      baseEjectionTime: 30s
    loadBalancer:
      simple: LEAST_CONN
```
The combination of VirtualService and DestinationRule provides rich traffic management capabilities: route matching (based on Header, URI, weight), traffic segmentation, load balancing strategy, connection pool management, circuit breaker strategy, etc.
The division of labor between VirtualService and DestinationRule reflects Istio’s design philosophy: VirtualService focuses on “where the traffic should go” (routing decision), and DestinationRule focuses on “how to reach the destination” (transmission strategy). This separation makes configuration clearer and facilitates team collaboration—routing rules are typically managed by the application team, and transport policies are typically managed by the platform team.
In actual use, you need to pay attention to the namespace and scope of both. VirtualService is usually defined in the client namespace, and DestinationRule is usually defined in the server namespace. When communicating across namespaces, you need to ensure that both match correctly.
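When scoping matters, the `exportTo` field can make visibility explicit; for example, a VirtualService can be limited to its own namespace. A sketch with illustrative names:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: order-service
  namespace: order-team
spec:
  exportTo:
    - "."          # visible only within this namespace
  hosts:
    - order-service
  http:
    - route:
        - destination:
            host: order-service
```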
Best practices for Istio configuration:
- Naming convention: use a unified scheme such as `{service-name}-{purpose}` (e.g. `order-service-canary`) to simplify management and troubleshooting.
- Version control: keep Istio configuration in Git and manage it through a GitOps workflow.
- Validation: verify configuration with the `istioctl validate` command before applying it, to keep incorrect configuration from reaching production.
- Progressive rollout: use tools such as Flagger to roll out configuration changes progressively and reduce risk.
- Rollback: establish a configuration rollback mechanism so problematic changes can be reverted quickly.
5.2 Implementation differences between canary release, A/B testing, and blue-green deployment
Service Mesh’s fine-grained traffic control capabilities make canary release, A/B testing, and blue-green deployment more flexible.
Canary release distributes traffic based on weight:
```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: order-service-canary
spec:
  hosts:
    - order-service
  http:
    - route:
        - destination:
            host: order-service
            subset: stable
          weight: 95
        - destination:
            host: order-service
            subset: canary
          weight: 5
```
A/B Test Routing based on request characteristics:
```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: order-service-ab
spec:
  hosts:
    - order-service
  http:
    - match:
        - headers:
            x-experiment-group:
              exact: "treatment"
      route:
        - destination:
            host: order-service
            subset: experiment
    - route:
        - destination:
            host: order-service
            subset: control
```
Blue-green deployment based on subset switching:
```yaml
# Blue environment active: 100% of traffic goes to the blue subset
spec:
  http:
    - route:
        - destination:
            host: order-service
            subset: blue
---
# Cutover: 100% of traffic switched to the green subset
spec:
  http:
    - route:
        - destination:
            host: order-service
            subset: green
```
These three release strategies are often used together in practice. Canary release is suitable for progressive verification of new versions, A/B testing is suitable for comparing the business effects of different versions, and blue-green deployment is suitable for scenarios that require rapid rollback. Istio’s flexible configuration allows switching strategies at different stages, such as switching to blue-green deployment for final switching after canary verification is passed.
Automation of canary releases is a key practice. Manually adjusting weights is error-prone and lacks systematicity. In industry practice, a project in 2022 introduced Flagger, an automated progressive delivery tool based on Istio. Flagger can automate the canary release process: initial small traffic, automatic indicator verification, gradual increase in traffic, and final switch or rollback.
```yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: order-service
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: order-service
  service:
    port: 8080
  analysis:
    interval: 1m
    threshold: 5
    maxWeight: 50
    stepWeight: 10
    metrics:
      - name: request-success-rate
        thresholdRange:
          min: 99
      - name: request-duration
        thresholdRange:
          max: 500
```
Flagger's configuration defines the automation rules for the canary release: metrics are checked every minute, traffic is increased by 10% per step up to a maximum of 50%, and the release is rolled back automatically after 5 failed metric checks (for example, when the success rate drops below 99% or latency exceeds 500ms).
Traffic Mirroring in Practice: Istio also supports Traffic Mirroring, which is to send a copy of production traffic to another service for testing new versions without affecting actual users.
```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: order-service-mirror
spec:
  hosts:
    - order-service
  http:
    - route:
        - destination:
            host: order-service
            subset: v1
          weight: 100
      mirror:
        host: order-service
        subset: v2
      mirrorPercentage:
        value: 100.0
```
Traffic mirroring is very useful for testing the behavior of new versions under production load and can find problems that are difficult to reproduce in a test environment.
Guidelines for choosing a publishing strategy:
In actual projects, how to choose an appropriate release strategy? I recommend basing your decision on the following factors:
- Risk level: high-risk changes use canary release, low-risk changes use rolling updates, and critical business systems use blue-green deployment.
- Rollback requirements: use blue-green deployment when instant rollback is required; canary release is acceptable when a few minutes of rollback time can be tolerated.
- Verification cycle: use canary release for scenarios that require extended verification, and A/B testing for scenarios that can be evaluated quickly.
- Resource constraints: blue-green deployment requires double the resources, so use it when resources are ample; use canary release when resources are limited.
Industry practice usually recommends using a combination of multiple strategies: first use canary release to verify the new version, and then switch to blue-green deployment for the final switch after passing the verification.
5.3 Sidecar implementation mechanism of circuit breaker, timeout and retry
Istio’s Sidecar mode implements traffic management by injecting the Envoy proxy into the Pod. All traffic in and out of Pods passes through Envoy, and Envoy executes corresponding governance policies based on configuration.
Circuit breaker mechanism is implemented based on Outlier Detection:
```yaml
trafficPolicy:
  outlierDetection:
    consecutive5xxErrors: 5   # eject after 5 consecutive 5xx errors
    interval: 30s             # detection interval
    baseEjectionTime: 30s     # base ejection time
    maxEjectionPercent: 50    # maximum ejection percentage
```
Envoy maintains the health status of each upstream instance and removes it from the load balancing pool when its error rate exceeds a threshold. Removal time increases exponentially with the number of evictions.
Practical points of the circuit breaker strategy:
- Threshold setting: base thresholds on real production data. Too low, and normal fluctuations trip the breaker; too high, and faults spread. Start with a loose threshold (such as 10 consecutive errors) and tighten it gradually based on production data.
- Base ejection time: `baseEjectionTime` determines the minimum time an instance stays ejected. Too short, and instances churn in and out of the pool; too long, and recovery after a fault is delayed.
- Maximum ejection percentage: `maxEjectionPercent` prevents mass ejection from making the service entirely unavailable. It is usually kept at or below 50% so that some instances remain in the pool even when most are failing.
In a typical project in 2022, `maxEjectionPercent` had not been set, so when a downstream database failed every instance was ejected and the service became completely unavailable, a hard-won lesson in circuit-breaker parameter tuning.
Timeout configuration can be set in VirtualService and DestinationRule respectively:
```yaml
# VirtualService: per-request timeout
spec:
  http:
    - route:
        - destination:
            host: order-service
      timeout: 5s              # request timeout

# DestinationRule: connection-level timeout
trafficPolicy:
  connectionPool:
    tcp:
      connectTimeout: 500ms    # connection establishment timeout
```
Retry Strategy is configured in VirtualService:
```yaml
spec:
  http:
    - route:
        - destination:
            host: order-service
      retries:
        attempts: 3
        perTryTimeout: 2s
        retryOn: gateway-error,connect-failure,refused-stream
```
The retry strategy needs to be configured carefully, as inappropriate retries may lead to cascading failures. Istio supports deciding whether to retry based on the error type to avoid invalid retries for business errors (such as 400 Bad Request).
Notes on retry strategy:
- Idempotency check: only idempotent requests (such as GET, PUT, DELETE) should be retried. Retrying non-idempotent POST requests may cause side effects.
- Retry storm protection: set `perTryTimeout` so that retry attempts cannot hold resources indefinitely and starve new requests.
- Backoff policy: Istio (via Envoy) applies exponential backoff between attempts, which avoids bursts of retries arriving at the service simultaneously.
- Retry budget: use a retry budget to cap the proportion of retry traffic, preventing excessive retries from crowding out normal requests.
In a typical project in 2021, retries were unlimited; when a downstream service failed, retry requests quickly exhausted all connection pools and triggered a cascading failure, a hard-won lesson in retry strategy.
5.4 Impact of mTLS on traffic management
Istio enables mTLS (mutual TLS) by default, which changes the security model of inter-service communication and also affects traffic management.
The security benefits are obvious:
- Communication between services is automatically encrypted without application modifications
- Two-way certificate verification ensures service identity
- Fine-grained access control policies
Operation and maintenance challenges cannot be ignored:
- Certificate management complexity increases
- Debugging network issues is more difficult (traffic is encrypted)
- Communicating with non-Mesh services requires special handling
In a typical project in 2022, Istio’s mTLS caused unexpected inter-service call failures. The reason is that a legacy service uses HTTP to call other services, but the target service has mTLS enabled and plaintext connections are rejected. The solution is to configure the PeerAuthentication policy to allow the service to communicate using clear text:
```yaml
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: legacy-service
  namespace: legacy
spec:
  selector:
    matchLabels:
      app: legacy-service
  mtls:
    mode: PERMISSIVE   # accept both plaintext and mTLS
```
Progressive enablement strategy for mTLS: in a production environment, it is not recommended to switch all services to strict mTLS at once. The recommended sequence is:
- Evaluation phase: use `PERMISSIVE` mode, allowing plaintext and mTLS to coexist, and observe inter-service communication
- Pilot phase: enable `STRICT` mode for selected non-critical services and verify the correctness of the configuration
- Promotion phase: gradually enable `STRICT` mode for more services
- Full rollout: enable `STRICT` mode for all services and retire `PERMISSIVE`
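In the later phases, namespace-wide strict mode is typically declared with a policy like the following sketch (the namespace name is illustrative):

```yaml
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default          # namespace-wide policy by convention
  namespace: order-team  # illustrative namespace
spec:
  mtls:
    mode: STRICT         # reject plaintext for all workloads here
```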
Best Practices for Certificate Management:
- Automatic rotation: Configure Istio to automatically rotate certificates to avoid service interruptions caused by certificate expiration.
- Monitoring and alerting: monitor certificate expiration and alert before certificates expire.
- Root CA Security: Protect Istio’s root CA private key to avoid leakage that would destroy the trust of the entire grid.
- Intermediate CA: in large organizations, consider using intermediate CAs to distribute trust-management responsibility.
In a typical project in 2022, a root CA certificate was not renewed in time and inter-service communication across the entire cluster was interrupted, a stark reminder of how critical certificate management is.
6. Ambient Mesh Revolution (2023-present)
6.0 The birth background of Ambient Mesh
In 2022, the Istio community announced a major architectural change: Ambient Mesh. This is not a minor version update, but a rethinking of the entire Service Mesh architecture.
Ambient Mesh was born from the Istio community’s deep reflection on the pain points of the production environment. At the end of 2021, the Istio community launched a large-scale user survey and collected feedback from hundreds of enterprise users. The survey results clearly show that although the value of service mesh is recognized, the cost and complexity of the sidecar model are becoming obstacles to widespread adoption.
At an Istio community meeting I attended in early 2022, maintainers from Google presented data along these lines: in a typical medium-to-large enterprise environment, Sidecar resource overhead accounts for 15-25% of the entire cluster, and operational complexity (certificate rotation, version upgrades, troubleshooting) consumes more than 30% of the platform team's capacity.
These data corroborate industry observations. In a typical project in 2021, the team deployed Istio to manage 200 microservices. Everything went well in the initial stage, but as time went by, Sidecar-related problems began to appear frequently: service interruptions caused by certificate expiration, compatibility issues caused by Sidecar version upgrades, and Pod restarts caused by Envoy memory leaks. These problems consume a lot of operation and maintenance resources and also affect the stability of the business.
The core insight of Ambient Mesh is: Not all traffic requires complete L7 governance capabilities. Most inter-service communications require only secure transport (L4), and only a few require complex routing and observability (L7). If these two layers are decoupled, L4 functions can be implemented in a lightweight manner, and L7 functions can be enabled on demand, which can significantly reduce resource overhead while maintaining governance capabilities.
This insight led to the architectural design of Ambient Mesh: ztunnel is responsible for node-level L4 governance, and waypoint is responsible for on-demand L7 governance.
6.1 Performance and operation and maintenance pain points of Sidecar mode
Sidecar mode is the classic architecture of Service Mesh, but it exposes some problems that are difficult to ignore in practice.
Resource overhead is the most direct pain point. Every Pod requires a Sidecar, which means:
- Memory usage: each Envoy Sidecar typically needs 50-100MB of memory, so a cluster with a thousand Pods consumes an extra 50-100GB.
- CPU overhead: Traffic encryption, route calculation, and indicator collection all require CPU resources.
- Startup delay: Sidecar initialization increases Pod startup time
In a large-scale Istio deployment in 2022, Sidecar resource overhead becomes a real problem. For a cluster with 2,000 Pods, the memory footprint of Sidecar is close to 150GB, which cannot be ignored in cost-sensitive scenarios.
Lifecycle coupling is another issue. Sidecar and the application container are in the same Pod and share the life cycle. Application upgrade requires restarting Sidecar, and Sidecar upgrade will also affect the application. This coupling makes independent evolution difficult.
Configuration propagation delay is especially noticeable in large clusters. Distributing Istio's configuration and certificates takes time, and newly started Pods may begin receiving traffic before their Sidecar is fully configured, causing routing anomalies.
Debugging complexity is also an issue with the Sidecar pattern. When an inter-service call fails, troubleshooting means switching back and forth between application logs and Sidecar logs, and Envoy's log format usually differs from the application's, making correlation analysis harder. In a typical case in 2022, a call to a downstream service timed out: the application log showed the request was sent, but the downstream service never received it. The root cause turned out to be Envoy's connection pool configuration, and the investigation took nearly two hours.
6.2 Decoupling architecture of ztunnel + waypoint
Istio Ambient Mesh is a new architecture announced by the Istio project in 2022. Its core idea is to split the Sidecar's functions into two independent components: ztunnel and waypoint.
ztunnel (Zero Trust Tunnel) is a node-level proxy that intercepts L4 traffic in the kernel (via iptables redirection by default, with eBPF as an alternative mode) and applies mTLS encryption:
- Deploy one ztunnel (DaemonSet) per node
- Intercept outbound traffic for all Pods on the node
- Implement automatic mTLS and L4 authorization
- Resource consumption is much lower than Sidecar
waypoint is an optional L7 agent, deployed on demand:
- Deploy only when L7 functionality is required (routing, retry, circuit breaking, observability)
- Deploy at service or namespace granularity
- Configuration using Gateway API
The key insight of this architecture is: Not all traffic requires L7 processing. Most inter-service communications only require secure transmission (L4), and only some scenarios require complex routing and governance (L7).
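To make the "secure by default, L7 on demand" model concrete: enrolling a namespace in ambient mode is a single label on the Namespace resource. All Pods in it then get ztunnel's mTLS without sidecar injection or Pod restarts. A minimal sketch (the namespace name is illustrative):

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: order-team                    # illustrative namespace
  labels:
    istio.io/dataplane-mode: ambient  # ztunnel intercepts and mTLS-encrypts its L4 traffic
```

No application manifest changes are needed; this is what makes ambient adoption far less invasive than sidecar injection.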
Technical details of ztunnel: ztunnel intercepts network traffic at the node level. In its eBPF-based redirection mode, when Pod A tries to connect to Pod B, an eBPF program redirects the connection to the local ztunnel, which establishes the mTLS connection and performs authentication and authorization checks. The entire process is transparent to the application, which still uses an ordinary TCP connection.
The resource consumption of ztunnel is much lower than that of Sidecar. Under the same cluster size, the memory usage of ztunnel is about 1/10 to 1/5 of Sidecar. This is because ztunnel is stateless and does not need to maintain routing tables and policy caches for each connection.
Implementation principle of ztunnel: in its eBPF-based redirection mode, ztunnel relies on in-kernel traffic interception, which is key to its transparency and performance.
- Socket mapping: an eBPF program maintains a socket mapping table in the kernel, recording the socket information of each Pod.
- Transparent interception: when the application initiates a connection, the eBPF program checks the target address and, if it is a service within the mesh, redirects the connection to ztunnel.
- mTLS tunnel: ztunnel establishes an mTLS tunnel for each connection to secure the communication.
- Zero-copy optimization: where possible, ztunnel transfers data with zero-copy techniques to reduce CPU overhead.
This architecture lets ztunnel provide secure inter-service communication with very low resource consumption.
Waypoint deployment strategy: Waypoint is an optional component and is only deployed when L7 functionality is required. The waypoint itself is an instance of Envoy, but runs in a separate Pod instead of being injected as a sidecar.
The deployment granularity of waypoint can be by service, by namespace or by traffic type. For example, you can deploy a waypoint for the order service to handle all its L7 traffic, or you can deploy a shared waypoint for the entire namespace.
waypoint configuration example: a waypoint is declared in Gateway API style:

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: order-service-waypoint
  namespace: order-team
  annotations:
    istio.io/for-service-account: order-service
spec:
  gatewayClassName: istio-waypoint
  listeners:
  - name: mesh
    port: 15008
    protocol: HBONE
```

This Gateway resource creates a waypoint dedicated to handling order-service's L7 traffic; the istio.io/for-service-account annotation scopes the waypoint to that service account. The mesh listener speaks HBONE, Istio's HTTP/2-based mTLS tunneling protocol, on port 15008.
Separation of L4 and L7 governance is the core concept of Ambient Mesh. ztunnel handles all traffic, but for traffic that requires L7 processing (such as HTTP routing, circuit breaking, indicator collection), ztunnel will forward them to waypoint for processing:
Pod A -> ztunnel A --(mTLS)--> ztunnel B -> waypoint B -> Pod B
For traffic that does not require L7 processing (such as simple TCP connections), the path is more direct:
Pod A -> ztunnel A --(mTLS)--> ztunnel B -> Pod B
This layered processing ensures efficient use of resources - only traffic that truly requires L7 capabilities consumes waypoint resources.
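L7 rules for a waypoint are expressed with standard Gateway API resources. A hedged sketch of a mesh-internal canary split: in ambient mode the HTTPRoute attaches to the target Service via parentRefs (service and backend names are illustrative), and the waypoint serving that Service enforces the rule:

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: order-canary
  namespace: order-team
spec:
  parentRefs:
  - group: ""              # core API group: the parent is a Service, not a Gateway
    kind: Service
    name: order-service
  rules:
  - backendRefs:           # 90/10 traffic split applied by the waypoint
    - name: order-service-v1
      port: 8080
      weight: 90
    - name: order-service-v2
      port: 8080
      weight: 10
```

Traffic to Services without a waypoint simply skips this layer and stays on the L4 path.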
Comparative advantages of Ambient Mesh over Sidecar mode:
- Resource efficiency: ztunnel consumes far fewer resources than Sidecars, and waypoints are deployed on demand, saving further resources.
- Operational simplification: ztunnel is managed as a DaemonSet, making upgrades and configuration simpler; waypoints are independent of applications and do not affect the application life cycle.
- Fault isolation: a ztunnel failure affects only the traffic of its node, and a failed waypoint can be replaced by other waypoint instances.
- Flexibility: L4 or L7 governance can be chosen per need, and Pods that need neither get no proxy at all.
- Compatibility: Ambient Mesh can coexist with traditional workloads without modifying existing applications.
These advantages make Ambient Mesh a strong alternative to the Sidecar model, especially for resource-sensitive, large-scale deployments.
Figure 2: Comparison between Sidecar mode and Ambient Mesh architecture
6.3 Separation of four-layer and seven-layer governance
Ambient Mesh’s layered architecture allows the choice of governance levels based on needs:
L4 Governance (provided by ztunnel):
- Transparent mTLS encryption
- L4 authorization policy (based on IP, port, identity)
- Basic load balancing
- Low overhead, high performance
L7 Governance (provided by waypoint):
- HTTP/gRPC routing
- Traffic segmentation, canary release
- circuit breaker, retry, timeout
- Fine-grained observability
The value of this separation lies in pay-as-you-go: waypoints are deployed only when services require L7 capabilities, and most simple calls between services only require the L4 capabilities of ztunnel.
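The L4 tier above can be made concrete with an identity-based authorization policy. Because it uses only L4 fields (source principals), ztunnel can enforce it without any waypoint; a sketch with illustrative names:

```yaml
apiVersion: security.istio.io/v1
kind: AuthorizationPolicy
metadata:
  name: order-allow-payment
  namespace: order-team
spec:
  selector:
    matchLabels:
      app: order-service
  action: ALLOW
  rules:
  - from:
    - source:
        # SPIFFE-style identity established by ztunnel's mTLS handshake
        principals: ["cluster.local/ns/payment-team/sa/payment-service"]
```

Adding L7 conditions (HTTP methods, paths, headers) to such a policy would require a waypoint, since ztunnel never parses HTTP.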
In a typical project evaluated in 2023, resource consumption was reduced by approximately 60% after adopting Ambient Mesh. Most inter-service communications only require ztunnel’s L4 encryption, and only a few core services require waypoint’s L7 governance.
Performance comparison data: In order to quantify the advantages of Ambient Mesh, we conducted comparative tests in the test environment.
Test environment: 100 Pods, simulating HTTP requests of 1000 RPS
| index | Sidecar mode | Ambient mode | Improvement |
|---|---|---|---|
| Memory usage (total) | 8.5GB | 2.1GB | 75% |
| Pod startup time | 15s | 5s | 67% |
| P99 delay | 45ms | 38ms | 16% |
| CPU usage | 45% | 32% | 29% |
The data shows that Ambient Mesh delivers large improvements in resource consumption and startup time, along with more modest gains in latency and CPU usage.
6.4 Comparison between Cilium and Istio Ambient
Cilium is another project that adopts Sidecarless architecture, but it has some key differences from Istio Ambient.
Cilium Service Mesh is completely based on eBPF and does not rely on the Envoy proxy:
- L3/L4 functions are completely implemented in kernel mode
- L7 function can choose eBPF or Envoy
- Deep integration with Kubernetes network policies
Istio Ambient retains Envoy as a waypoint:
- L4 functionality is implemented through eBPF + ztunnel
- L7 functions are implemented through Envoy waypoint
- Fully compatible with Istio ecosystem
The choice between the two depends on specific needs:
- Cilium is suitable for projects that pursue ultimate performance and simple L7 scenarios
- Istio Ambient is suitable for projects that have Istio investment and require complex L7 governance.
Detailed comparison of Cilium and Istio Ambient:
| Feature | Cilium Service Mesh | Istio Ambient Mesh |
|---|---|---|
| Architecture | Pure eBPF | eBPF + Envoy |
| L4 implementation | eBPF | ztunnel |
| L7 implementation | eBPF or Envoy | Envoy waypoint |
| Performance | Very high | High |
| Feature richness | Medium | High |
| Ecosystem compatibility | Cilium ecosystem | Istio ecosystem |
| Learning curve | Steep | Medium |
| Maturity | Medium | Medium |
Enterprises need to weigh factors such as performance, functionality, and team capabilities when choosing.
Figure 3: Hierarchical traffic governance architecture (L4/L7 separation + eBPF optimization)
6.5 Enterprise-level migration path: from Sidecar to Sidecarless
There are several key factors to consider when migrating from the traditional Sidecar model to Ambient Mesh.
Progressive Migration Strategy is the most feasible path:
- Parallel Deployment: Run Sidecar and Ambient modes simultaneously
- Per-Service Migration: Switch to Ambient on a service-by-service basis
- Migration by function: migrate simple services that do not need L7 features first
- Final Unification: Disable Sidecar after all migrations are completed
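During the parallel-running step, a namespace being migrated can disable sidecar injection for new Pods while enrolling in ambient mode. A sketch (the namespace name is illustrative; note that existing Pods keep their sidecars until restarted, which is what enables gradual switching):

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: billing                        # hypothetical namespace under migration
  labels:
    istio-injection: disabled          # stop injecting sidecars into newly created Pods
    istio.io/dataplane-mode: ambient   # enroll the namespace in ambient (ztunnel L4)
```

Restarting a Deployment then moves it from Sidecar to ambient; rolling back is the reverse label change plus a restart.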
Configuration compatibility is an important issue to pay attention to. Istio Ambient supports most VirtualService and DestinationRule configurations, but there are some differences:
- waypoint uses Gateway API style configuration
- Some advanced functions are implemented differently in Ambient mode
- The collection path of observability data has changed
In a migration project in 2023, we adopted the following strategy:
- First deploy Ambient Mesh in the test environment to verify the core functions.
- Select a group of non-critical services for pilot migration
- Establish a rollback mechanism to ensure quick recovery when problems are discovered
- Gradually expand the scope of migration to eventually cover all services
The entire migration took about 3 months, and the main problems encountered included:
- waypoint's HPA configuration needed fine-tuning
- health checks of some legacy services conflicted with ztunnel's traffic interception
- custom EnvoyFilters could not be used directly in Ambient mode
Lessons learned in the migration:
- Adequate testing is key. We built a complete test environment before migrating and simulated production scenarios, including fault injection, performance stress testing, and chaos testing. These tests surfaced several potential issues before the official migration.
- Monitoring and observability cannot be ignored. Detailed monitoring data helped us locate problems quickly during the migration; we deployed a complete monitoring stack in both the test and production environments.
- The rollback mechanism must be reliable. We prepared a rollback plan for every stage to ensure quick recovery if problems arose. We actually used it twice during the migration, both times restoring service within minutes.
- Incremental migration beats big bang. It takes longer but carries less risk, and each successfully migrated service increases the team's confidence.
Future outlook for Ambient Mesh: Ambient Mesh represents an important evolution of Service Mesh architecture. As eBPF technology matures and spreads, we can expect:
- More functions moving into the kernel: more L7 functions may be implemented via eBPF, further reducing the dependence on Envoy.
- Cross-cluster Ambient Mesh: Ambient Mesh may expand to support cross-cluster scenarios, providing more unified traffic management.
- Deeper Gateway API integration: waypoint configuration already uses the Gateway API, and the integration may deepen further.
- Edge computing support: Ambient Mesh's lightweight characteristics suit edge scenarios, and edge-specific optimizations may follow.
The emergence of Ambient Mesh provides us with a new option to reduce costs while maintaining governance capabilities.
7. eBPF-driven traffic management
7.0 The inevitability of the rise of eBPF technology
The rise of eBPF technology is not accidental. It is the inevitable result of the evolution of the Linux kernel network stack over the years.
Traditional Linux network solutions (iptables, IPVS) have fundamental limitations: they run in the kernel space, but are configured and managed in user space. This frequent switching between user mode and kernel mode becomes a performance bottleneck. What’s more serious is that these solutions lack flexibility - if you want to implement a new function, you often need to modify the kernel code or load the kernel module, which is unacceptable in a production environment.
The birth of eBPF has completely changed this situation. It allows programs to be written in user space and then safely loaded into kernel space for execution. eBPF programs are checked by a validator to ensure that they will not cause kernel crashes or infinite loops, and can directly access kernel data structures to achieve efficient data processing.
I started paying attention to the Cilium project in 2021, when it was still a relatively niche Kubernetes CNI plug-in. After digging into its technical principles, I realized it might represent a paradigm shift in the networking field: Cilium uses eBPF to completely rewrite Kubernetes' network implementation, including Service load balancing, network policy, and observability.
In 2022, Cilium released its Service Mesh functionality and officially entered the Service Mesh field. Its strategy is clear: not to replace Istio, but to offer a different, eBPF-based, high-performance implementation.
Historical evolution of eBPF: eBPF originates from the BSD Packet Filter of 1992, which was used for network packet filtering. In 2014, Linux kernel 3.18 introduced eBPF, greatly expanding BPF's capabilities. In 2016, XDP enabled eBPF to process packets at the network card driver layer. Since 2020, eBPF has been widely adopted in Kubernetes and the cloud native field.
This evolution process reflects the continuous enhancement of eBPF capabilities, from the initial simple filtering, to complex kernel programming, to today’s Service Mesh and network governance.
7.1 Advantages of kernel-level traffic interception
eBPF (Extended Berkeley Packet Filter) is a revolutionary technology of the Linux kernel that allows user-defined programs to be safely run in the kernel space without modifying the kernel code or loading kernel modules.
For traffic governance, eBPF provides several key advantages:
Performance Advantages:
- Bypassing some layers of the kernel network stack to reduce data copying
- XDP (eXpress Data Path) can process data packets at the network card driver layer
- Socket-level redirection avoids kernel network stack overhead
Observability Advantages:
- Collect network data at the kernel layer without user-mode agents
- Complete context information (process, container, socket)
- Low-overhead metric collection
Security Advantages:
- Security policies are enforced at the kernel level and cannot be bypassed
- Fine-grained network policy (L3/L4/L7)
- Effective in real time, no need to restart the service
Development Flexibility:
- New features can be implemented without modifying the kernel code
- Hot loading and hot updating will not affect running services
- Rich BPF auxiliary function library
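The "fine-grained network policy (L3/L4/L7)" item can be illustrated with a CiliumNetworkPolicy that allows only one HTTP method and path between two workloads (labels and path are illustrative). Cilium enforces the L3/L4 part in eBPF and hands the HTTP rules to its L7 proxy:

```yaml
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: orders-read-only
spec:
  endpointSelector:
    matchLabels:
      app: order-service        # policy applies to these endpoints
  ingress:
  - fromEndpoints:
    - matchLabels:
        app: api-gateway        # only the gateway may connect
    toPorts:
    - ports:
      - port: "8080"
        protocol: TCP
      rules:
        http:                   # L7 rule: GET on /orders paths only
        - method: "GET"
          path: "/orders.*"
```

Everything else (other callers, other methods) is dropped in the kernel before it reaches the application.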
7.2 Cilium’s XDP/BPF implementation
Cilium is currently the most mature eBPF network solution. It completely rewrites Kubernetes’ network and traffic management implementation.
XDP Acceleration: Cilium uses XDP to process data packets at the network card driver layer. For traffic that does not require L7 processing, it can achieve forwarding performance close to line speed. XDP has three operating modes: Native (driver layer), Offloaded (network card hardware), and Generic (kernel network stack). Production environments typically use Native mode for optimal performance.
Socket-level load balancing: Cilium implements load balancing at the socket level through eBPF, avoiding the overhead of iptables or IPVS. When Pod A calls Service B, the eBPF program selects a backend Pod IP directly at connection time, without going through the Service's virtual IP. This not only removes per-packet address translation from the data path but also eliminates the need for kube-proxy.
Observability: Cilium's Hubble component provides traffic visualization based on the data collected by eBPF. Unlike traditional monitoring, Hubble can see every connection, every retry, and every timeout without any application-level instrumentation.
In a typical project using Cilium+Hubble in 2023, Hubble’s real-time traffic view helped us quickly locate an intermittent timeout problem. Through Hubble’s flow logs, we found that the TCP connection of a certain service experienced a large number of retransmissions within a specific time window, and finally located the problem with the MTU configuration of the network device.
Limitations of eBPF: Although eBPF has many advantages, it also has limitations that need to be understood.
- Kernel version requirements: many advanced eBPF features require newer kernels (5.x or above). On older kernels, some features may be unavailable or require fallback paths.
- Development complexity: writing eBPF programs requires understanding kernel data structures and the BPF instruction set; the learning curve is steep.
- Debugging difficulty: eBPF programs run in kernel space and are harder to debug than user-space programs. Tools such as bpftrace help, but remain less convenient than user-space debugging.
- Verifier restrictions: to guarantee safety, the eBPF verifier imposes many restrictions on programs (loop bounds, instruction count limits, and so on), which can constrain what is implementable.
In a typical project in 2023, some advanced Cilium features could not be used because the production kernel was old (4.19); having to upgrade the kernel first added time and risk to the project.
7.3 Integration of Observability and Governance
eBPF’s observability capabilities are naturally integrated with traffic management.
In traditional architecture, observability and governance are separated:
- Metrics are collected by Prometheus
- Log output by the application
- Link injected by Sidecar
- Governance rules are configured by Service Mesh
eBPF unifies these at the kernel layer:
- Network events are collected uniformly at the kernel layer
- Transparent to applications, with no instrumentation required
- Complete context (process, network, container)
- Real-time streaming analytics without storing raw data
The value this integration brings is huge. In the Cilium+Hubble environment, I can see in real time:
- Traffic distribution from one service to another
- Specific latency and retries for each connection
- DNS resolution failure rate and specific errors
- Network policy hit and deny details
This information requires the cooperation of multiple systems to obtain it in the traditional architecture, but it is available out of the box in the eBPF architecture.
Hubble’s specific capabilities: Cilium’s Hubble component provides rich observability functions:
- Service Dependency Graph: Automatically discover the calling relationships between services and generate a visual service topology
- Traffic Monitoring: Displays the incoming and outgoing traffic of each service in real time, supporting filtering by protocol, port, and status code
- Performance Analysis: Display the specific delay distribution of each connection and identify slow requests
- Policy Verification: Verify that network policies work as expected and discover policy vulnerabilities
- DNS Monitoring: Monitor DNS queries and responses to identify DNS configuration issues
In a typical project, a hidden performance issue was discovered using Hubble: the DNS query latency of a certain service was abnormally high, but the application layer indicators did not show any abnormalities. Through Hubble’s DNS monitoring, we found that the service was querying the same domain name repeatedly every second and that the DNS cache was not configured correctly. After fixing the cache configuration, latency dropped by 30%.
7.4 Impact on future architecture
eBPF is reshaping the technology stack of traffic management, and its impact will be far-reaching and lasting.
The end of Sidecar mode is a high-probability outcome. As eBPF capabilities grow, more and more L7 functions can be implemented in the kernel, and Envoy-style Sidecars will gradually become optional components.
The redefinition of cloud native networking is underway. Cilium has demonstrated that eBPF can completely replace traditional CNI plug-ins. In the future, network policy, load balancing, and traffic management may be completely based on eBPF.
A security model shift is also happening. Traditional security relies on boundary protection. eBPF allows security policies to be executed at each process and each socket level to achieve a true zero-trust architecture.
The evolution of microservice communication models deserves attention. With the enhancement of eBPF capabilities, communication between services may no longer require the traditional load balancer mode, but instead implement intelligent routing directly at the socket level. This change will fundamentally change the network topology of the microservices architecture.
As emerging solutions are evaluated in 2026, pure-eBPF and sidecarless solutions are no longer proofs-of-concept. They still need to cooperate with Envoy, the Gateway API, or a mesh control plane for complex L7 governance, but their performance and resource-efficiency advantages are already enough to influence production selection. One thing to watch is Cilium's L7 proxy capability, which is gradually improving its support for governing HTTP/gRPC traffic.
8. Architect decision-making framework
8.0 Design philosophy of decision-making framework
Before giving specific technical selection suggestions, I would like to talk about the design philosophy of the decision-making framework itself.
Technology selection is not about finding the “optimal solution”, but about finding the “feasible solution” under constraints. Every enterprise faces different constraints: some have ample budget but are pressed for time, some have experienced teams but have limited resources, some need to be compatible with legacy systems, and some can start from scratch.
Industry practice summarizes a principle: The success of technology selection = technology matching × team capabilities × organizational support. Even the most advanced technology will eventually fail if the team cannot understand it or the organization cannot support it.
The goal of this framework is to provide a systematic thinking dimension to help architects consider not only the characteristics of the technology itself, but also non-technical factors such as team capabilities, organizational culture, and cost constraints when faced with complex technology selections.
8.1 Traffic management technology selection matrix
Based on years of practical experience, I compiled a traffic management technology selection decision matrix:
| Scene characteristics | Recommended plan | Key considerations |
|---|---|---|
| Pure Spring ecosystem, no K8s | Spring Cloud Gateway | Deep integration, mature and stable |
| K8s single cluster, simple routing | Gateway API + Envoy | Standardized, lightweight |
| K8s multi-cluster, complex governance | Istio/Linkerd | Multi-cluster support, complete functions |
| Performance sensitive, resource limited | Cilium Service Mesh | eBPF optimization, low overhead |
| Already have Istio and hope to reduce costs | Istio Ambient Mesh | Gradual migration, compatible with existing |
| Progressive transformation of legacy systems | Layered architecture (SCG + Gateway API + Mesh) | Smooth transition and controllable risks |
Key dimensions to consider when selecting:
- Technology maturity: production environments need stability rather than the cutting edge. For core business systems, choose a GA version verified at scale in production.
- Team skills match: assess the team's familiarity with the target technology and the steepness of its learning curve. If the team has no Kubernetes experience at all, jumping straight to Istio can be disastrous.
- Ecosystem compatibility: consider compatibility with the existing stack. If you already use Spring Cloud heavily, the cost of adopting SCG is much lower than switching to Istio.
- Operational complexity: evaluate the labor cost of day-to-day operations. Sidecar mode is powerful, but its operational complexity is significantly higher than Ambient mode's.
- Cost constraints: computing resources, labor, learning, and migration costs. The eBPF solutions perform excellently but require kernel and networking expertise on the team.
Cost Estimation Method: In actual projects, I recommend using the TCO (Total Cost of Ownership) method to estimate the costs of different options. TCO includes:
- Infrastructure cost: direct costs such as computing resources, storage, and network bandwidth
- Operations cost: labor costs of daily maintenance, troubleshooting, and version upgrades
- Learning cost: team training, documentation, and knowledge transfer
- Migration cost: one-time costs such as system modification, data migration, and test verification
- Risk cost: potential failures, business interruptions, and security incidents
Taking Istio Sidecar vs Ambient Mesh as an example, TCO estimation:
| Cost item | Sidecar | Ambient Mesh | Remarks |
|---|---|---|---|
| Infrastructure cost | High | Medium | Sidecars consume extra memory and CPU |
| Operations cost | Medium | Low | Sidecars require more troubleshooting |
| Learning cost | Low | Medium | Ambient Mesh is newer, with fewer learning resources |
| Migration cost | Medium | Low | Ambient can be migrated gradually |
| Risk cost | Medium | Medium | New technology may carry unknown risks |
In actual projects, the weights need to be adjusted according to specific circumstances to obtain a more accurate TCO estimate.
8.2 Progressive evolution strategy
The evolution of enterprise-level traffic management should not pursue “one step”, but should adopt a progressive strategy. Radical wholesale replacements often have disastrous consequences; incremental evolution allows teams to learn and adapt along the way.
Phase 1: Unified Entry Gateway (1-3 months)
- Deploy a Gateway API-compatible Ingress Controller (such as Envoy Gateway, NGINX Gateway Fabric)
- Unify scattered portal configurations to Gateway API and establish a standardized routing management process
- The goal of this stage is to establish a unified entry layer and does not involve the transformation of communication between services.
- Key success factors: Choose a mature Gateway implementation and establish a clear routing review process
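A minimal sketch of the Phase 1 entry layer, assuming an installed Envoy Gateway whose GatewayClass is named `envoy-gateway` (class name, namespace, and listener are illustrative and must match the chosen controller):

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: shared-ingress
  namespace: infra-gateways
spec:
  gatewayClassName: envoy-gateway   # assumption: the installed controller's class name
  listeners:
  - name: http
    port: 80
    protocol: HTTP
    allowedRoutes:
      namespaces:
        from: All                   # app teams attach HTTPRoutes from their own namespaces
```

The platform team owns this Gateway; application teams attach their own HTTPRoutes to it, which is exactly the role separation the routing review process should formalize.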
Phase 2: East-West Encryption (3-6 months)
- Deploy Cilium or Istio Ambient’s ztunnel to implement mTLS encryption between services
- Establish service identity authentication system and configure L4 authorization policy
- The goal of this stage is to establish a secure inter-service communication foundation and does not involve L7 governance for the time being.
- Critical success factors: Ensure compatibility with non-Mesh services and establish a certificate rotation mechanism
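Once the mesh (ztunnel or sidecars) is issuing workload identities, mesh-wide mTLS can be made mandatory with a single Istio PeerAuthentication in the root namespace (shown for `istio-system`):

```yaml
apiVersion: security.istio.io/v1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system   # root namespace: the policy applies mesh-wide
spec:
  mtls:
    mode: STRICT            # reject plaintext connections from non-mesh clients
```

STRICT breaks traffic from not-yet-migrated clients, so PERMISSIVE is the usual intermediate step while the compatibility work above is completed.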
Phase 3: Refined Management (6-12 months)
- Deploy waypoint or sidecar on demand and implement L7 governance for core services
- Configure circuit breaker, retry, and timeout strategies, and establish a canary release process
- The goal of this stage is to achieve refined control of key traffic
- Critical success factor: Adjust strategy based on observability data to avoid over-provisioning
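The circuit-breaker step can be sketched with an Istio DestinationRule using outlier detection; all thresholds here are illustrative and should be tuned from observability data, per the success factor above:

```yaml
apiVersion: networking.istio.io/v1
kind: DestinationRule
metadata:
  name: order-service-circuit-breaker
  namespace: order-team
spec:
  host: order-service
  trafficPolicy:
    connectionPool:
      http:
        http1MaxPendingRequests: 100   # queue limit before requests are rejected
    outlierDetection:
      consecutive5xxErrors: 5          # eject a backend after 5 consecutive 5xx responses
      interval: 30s                    # how often backends are evaluated
      baseEjectionTime: 60s            # minimum time an ejected backend stays out
```

Over-aggressive ejection settings are a common source of the "over-provisioning" the text warns about, which is why they should follow measured error rates rather than defaults.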
Phase 4: Observability Driven (12 months+)
- Deploy eBPF-based observability solutions (Hubble, Pixie, etc.)
- Establish a data-driven governance decision-making process to achieve automated traffic optimization
- The goal of this stage is to base governance decisions on data rather than experience
- Key success factors: data quality assurance, establishing a closed feedback loop
The duration of each stage varies depending on the size of the enterprise, team capabilities, and business complexity. It is important to evaluate at the end of each phase to ensure that goals have been achieved before moving on to the next phase.
Relationships and dependencies between stages: the four stages are not completely independent; they have certain dependencies.
- Phase 2 (east-west encryption) relies on the infrastructure of Phase 1 (unified ingress gateway) and needs the ingress layer to correctly forward encrypted traffic.
- Phase 3 (refined governance) can be applied to some services independently, without waiting for all services to finish Phase 2. This independence allows granular governance of critical services to be prioritized.
- Phase 4 (observability-driven) can start at any time, but delivers maximum value only after the infrastructure of the earlier phases is in place.
Common pitfalls and how to avoid them:
- Rushing ahead: trying to skip intermediate stages and implement advanced features directly. This usually leads to a weak foundation and frequent problems later; follow the stage sequence strictly.
- Scope creep: new requirements and features keep being added mid-phase, so the phase never ends. Define the scope at the start of each phase and strictly control changes within it.
- Missing evaluation criteria: without clear criteria, it is impossible to know whether a stage succeeded. Define acceptance criteria at the start of each phase.
- Ignoring team capabilities: the technology exceeds what the team can handle. Assess team capabilities during planning and arrange training or external experts if necessary.
8.3 Team capability building path
The evolution of traffic management technology has put forward new requirements for team capabilities. No matter how advanced the technology is, if the team cannot control it, it will only be effective on paper.
Network and kernel knowledge: The rise of eBPF requires architects to understand the Linux network stack, kernel mechanism, and network protocol details. This does not require everyone to be a kernel expert, but the core team must have the ability to troubleshoot network problems.
My recommended learning path:
- Basic Stage: Understand core protocols such as TCP/IP protocol stack, HTTP/2, gRPC, and DNS
- Advanced stage: Learn iptables/nftables, Linux network namespace, and CNI principles
- Advanced stage: Understand the principles of eBPF, XDP, and kernel network stack implementation
Declarative configuration management: Gateway API, Istio and other resources are declarative, and the team needs to establish corresponding configuration management and GitOps processes.
Key Capacity Building:
- GitOps Practice: Use ArgoCD, Flux and other tools to achieve continuous delivery of configurations
- Configuration Verification: Establish an automated verification process for configuration changes to prevent incorrect configurations from going online
- Version Management: Learn how to manage multi-environment and multi-version configurations
Observability analysis capabilities: Traffic management requires decision-making based on data, and the team needs to master skills such as indicator analysis, link tracking, and log correlation.
Core skills:
- PromQL Query: Ability to write complex Prometheus queries to analyze traffic patterns
- Distributed Tracing: Understand the concepts of Trace and Span, and be able to analyze cross-service call chains
- Log correlation: Master the techniques of log structuring, field extraction, and cross-system correlation
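As an example of the kind of PromQL a traffic-management team should be able to write, the query below computes the per-service 5xx error ratio over the last five minutes. It assumes Istio's standard `istio_requests_total` metric is being scraped and uses Istio's default label names:

```promql
sum(rate(istio_requests_total{response_code=~"5.."}[5m])) by (destination_service)
  /
sum(rate(istio_requests_total[5m])) by (destination_service)
```

A query like this is a typical input to canary analysis: compare the ratio between the stable and canary versions before shifting more weight.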
Security and Compliance Awareness: Security models such as mTLS and zero trust require the team to have corresponding security knowledge.
Must-know concepts:
- PKI system: Understand the principles of certificate chain, CA, and certificate rotation
- Zero Trust Architecture: Understand the security concept of “never trust, always verify”
- Security compliance: Understand industry security standards and compliance requirements
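To make the mTLS concept above concrete, Istio expresses namespace-wide strict mutual TLS as a single declarative resource. A minimal sketch, assuming a hypothetical `payments` namespace:

```yaml
# Require mTLS for all workloads in the "payments" namespace (hypothetical name).
apiVersion: security.istio.io/v1
kind: PeerAuthentication
metadata:
  name: default
  namespace: payments
spec:
  mtls:
    mode: STRICT   # reject plaintext; only mutual-TLS connections are accepted
```

The certificates behind this policy are issued and rotated automatically by the mesh's CA, which is exactly why the PKI fundamentals listed above matter: when rotation fails, someone must be able to diagnose it.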
Industry observations show that the success or failure of technology selection often does not depend on the technology itself, but on whether the team has the corresponding capabilities. No matter how advanced the technology is, if the team cannot understand and maintain it, it will eventually become technical debt.
Specific practices for team training:
- Establish study groups: Organize groups by technical area (network, security, observability, etc.) and hold regular technical sharing and discussion sessions.
- Hands-on experimentation environment: Provide an environment where team members can safely try out new technologies. We built a "sandbox cluster" internally where anyone can deploy and test various traffic management solutions.
- Interact with the community: Encourage team members to participate in the open source community, read technical blogs, and attend technical conferences. This provides access to the latest technical information and also builds a professional network.
- Project practice: The best way to learn is to practice. When introducing new technologies, select non-critical projects as pilots so team members can learn in practice.
- Documentation and knowledge accumulation: Maintain an internal technical document library that captures experiences and lessons learned for reference in subsequent projects.
Team Training Suggestions:
- Invest 10-20% of working time every year in technical learning and experimentation
- Establish an internal technology sharing mechanism to encourage team members to share learning results
- Participate in open source communities and deepen your understanding of technology by contributing code
- Establish a technology radar to regularly evaluate and introduce new technologies
Failure cases of technology selection: In my career, I have also witnessed some cases of failure of technology selection, and these lessons are equally valuable.
A typical failure case is a financial company's decision to fully migrate to Service Mesh. The decision was based on "this is cloud native best practice" without fully considering the company's actual situation. The results:
- Insufficient team capabilities: The team was not familiar with Kubernetes or Istio and ran into a large number of problems it could not solve during the migration.
- Legacy system compatibility difficulties: The company had a large number of legacy systems, and integrating them with Istio proved far harder than expected.
- Performance issues: On some critical paths, Sidecar latency became a bottleneck.
- Increased operation and maintenance burden: Certificate management, version upgrades, and troubleshooting consumed a large share of operations resources.
In the end, the project was suspended after a year of investment, wasting enormous manpower and time.
The lesson from this case is: Technology selection must be based on actual needs and capability assessment, rather than following trends. Before deciding to adopt a new technology, you must ask yourself: What problem can this technology solve for us? Is our team capable of harnessing this technology? Do we have enough resources to support the implementation of this technology?
9. Conclusion: Return to the essence of governance
Looking back at the evolution of traffic governance from 2015 to 2026 (to date), the more clearly I realize that the specific technical solution matters less than understanding the essential goal of governance.
The essence of traffic governance is to control risks. Whether it is a gateway, Service Mesh or Ambient Mesh, their core value is to provide a control plane so that architects can implement control strategies in the most complex link of communication between services.
The premise of control is visibility. This is why observability is so important. Without observability, governance becomes a blind stack of rules; with observability, governance can be continuously optimized based on data.
The cost of control needs to be weighed. Sidecar mode provides the most fine-grained control, but at the expense of resource overhead; Ambient Mesh finds a new balance between control granularity and resource efficiency; eBPF demonstrates the possibility of achieving control at a lower level.
As architects, our responsibility is not to chase the latest technology, but to make sustainable decisions based on understanding business needs, team capabilities, and technology evolution trends.
Reflections on technology selection: Looking back on years of technology selection practice, a few key reflections stand out:
- There is no best technology, only the most suitable technology: Selection is not about picking the top of the technology rankings, but about finding the technology that fits the current scenario.
- The team is the key to technology implementation: No matter how advanced the technology is, if the team cannot control it, it will eventually fail. The team's capabilities and learning curve must be considered when selecting technology.
- Incremental evolution is better than radical replacement: Radical wholesale replacement often brings disastrous consequences; incremental evolution lets the team learn and adjust along the way.
- Observability is the foundation of everything: Without observability, governance becomes a blind stack of rules; with observability, governance can be continuously optimized based on data.
- Cost awareness cannot be ignored: The total cost of a technical solution (infrastructure, manpower, learning, migration, risk) is often underestimated and needs to be fully considered during selection.
These reflections come from successful experiences and lessons from failures. I hope they can provide readers with some reference when making technology selections.
Written at the end: Traffic management is a field that continues to evolve. There is no end point, only continuous progress. As architects, we need to remain sensitive to new technologies while also remaining rational and not confused by marketing jargon.
The most important thing is to always keep in mind the goal of our work: to create value for the enterprise and provide users with stable and reliable services. Technology is only a means, not an end.
Thank you for reading, I hope this article is helpful to you.
The Deep Logic of Technology Evolution: Looking back at the evolution process from 2015 to 2026 (to date), we can find a clear main line - the continued sinking of governance logic.
In the first stage, the governance logic is in the application layer (Spring Cloud Netflix). The advantage is flexibility and the disadvantage is coupling.
In the second stage, the governance logic sinks to Sidecar (Istio). The advantage is that it is application independent, and the disadvantage is resource overhead.
In the third stage, the governance logic further sinks to the node level (Ambient ztunnel). The advantage is resource efficiency, but the disadvantage is that L7 capabilities are limited.
In the fourth stage, the governance logic sinks to the kernel layer (eBPF/Cilium). The advantage is ultimate performance, but the disadvantage is platform dependence.
The driving force behind this evolutionary trend is the continued pursuit of “efficiency” and “decoupling.” Each generation of technology is trying to achieve the same governance capabilities with less resource consumption and lower coupling.
But it should be noted that sinking is not unconditional. The deeper the governance logic sinks, the higher the requirements for infrastructure and the higher the requirements for team capabilities. Although the eBPF solution has excellent performance, it requires the team to have kernel and network knowledge; although Ambient Mesh is resource efficient, it requires support from a newer kernel version.
Realistic considerations for enterprise-level implementation: In actual enterprise environments, technology selection is often subject to various practical constraints:
- Legacy System Compatibility: Enterprises cannot rewrite all systems overnight, incremental evolution is the norm
- Organizational change resistance: The introduction of new technology requires team learning, which may encounter resistance
- Compliance and Audit: Financial, medical and other industries have strict compliance requirements, and new technologies need to pass audits
- Vendor lock-in concerns: Enterprises are concerned about being overly reliant on a single vendor’s technology and want to maintain portability
These constraints mean that when enterprises choose a traffic management solution, they should not only look at the advantages and disadvantages of the technology itself, but also consider factors such as ecological maturity, community activity, and manufacturer support.
Technical selection suggestions: Based on more than ten years of practical experience, here are several selection suggestions:
- Don't blindly chase new technologies. New technologies are often unstable; unless you have good reasons and the resources to handle the resulting problems, choosing mature, stable technology is wiser.
- Prioritize team familiarity. Technology the team knows well, even if not the most advanced, often performs better than unfamiliar new technology. When introducing new technology, evaluate the team's learning ability and learning costs.
- Focus on portability. Avoid deep binding to a specific implementation and prefer standardized, portable solutions. Gateway API is a better choice than Ingress annotations.
- Incremental evolution is better than revolutionary replacement. Radical wholesale replacement is very risky; incremental evolution lets you learn and adjust during the process, keeping risk controllable.
- Establish an exit mechanism. Before introducing any new technology, consider how you would exit if it fails, including data migration, configuration conversion, team retraining, and more.
The successful implementation of Spring Cloud Gateway on the enterprise-level CF platform, the standardization process of Gateway API, the evolution of Service Mesh from Sidecar to Sidecarless, and the new possibilities brought by eBPF - behind these technological changes is the industry’s continuous exploration of the issue of “how to effectively implement control in distributed systems.”
Future-oriented architectural thinking: Looking ahead, I expect the following development trends in the field of traffic management:
- eBPF will become infrastructure: More and more traffic management functions will sink to the kernel layer, and eBPF-based implementations will become standard.
- Standardization will deepen: The success of Gateway API will drive standardization in more fields and reduce vendor lock-in.
- AI-assisted governance: Machine-learning-based anomaly detection and automatic tuning will become possible, making traffic management more intelligent.
- Edge computing integration: With the rise of edge computing, traffic management will extend to edge nodes, forming a unified governance plane.
No matter how technology evolves, the architect’s responsibility is always: Make sustainable technical decisions based on understanding business needs, technical constraints, and team capabilities.
What will traffic governance look like in the future? I believe that eBPF will play an increasingly important role, the capabilities of the kernel layer will continue to expand, and user-space agents may be further streamlined. But no matter how technology evolves, the three core principles of “controlling risks, ensuring visibility, and weighing costs” will always apply.
Message to Architects: Traffic management is one of the most complex areas in microservice architecture. There is no silver bullet, and there is no best practice that is universally applicable. As architects, our value lies not in how many technical details we master, but in our ability to make the right trade-offs and decisions in complex environments.
I hope this article can provide some reference for you who are facing the challenge of traffic management selection. Technology is constantly evolving, but the pursuit of reliability, controllability, and observability will never change. I wish you success in your practice of traffic management.
In the next article, I will discuss the evolution of elastic fault tolerance in microservice governance - from Hystrix to Resilience4j, Sentinel and adaptive governance, and continue to analyze how governance capabilities move from application frameworks to platforms.
Figure 4: Canary release evolution comparison - SCG weight routing vs Istio VirtualService vs Gateway API HTTPRoute
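To make the Gateway API side of this comparison concrete, the sketch below shows a weighted canary split in an HTTPRoute. The gateway, service names, and port are hypothetical; the weight semantics follow the standard `backendRefs` fields of the Gateway API:

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: reviews-canary          # hypothetical route name
spec:
  parentRefs:
    - name: prod-gateway        # hypothetical Gateway this route attaches to
  rules:
    - backendRefs:
        - name: reviews-v1      # stable version receives ~90% of requests
          port: 8080
          weight: 90
        - name: reviews-v2      # canary version receives ~10%
          port: 8080
          weight: 10
```

Shifting traffic is then a single declarative change to the two weights, which is what makes this form so amenable to the GitOps and progressive-delivery workflows discussed above.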
About the author
Milome has more than ten years of experience in enterprise-level architecture design. He has served as a senior architect for enterprise-level CF platforms and has led the architecture design and implementation of multiple large-scale microservice platforms. Currently focusing on the research and practice of cloud native technology architecture and governance system.
Series of articles
- From enterprise-level CF platform to cloud native (1): Architect’s review - the gains and losses of microservice governance in the era of enterprise-level CF platform
- From enterprise-level CF platform to cloud native (2): Observability-driven governance—from large monitoring screens to precise decision-making systems
- From enterprise-level CF platform to cloud native (3): The evolution of traffic management - from Spring Cloud Gateway to Gateway API and Ambient Mesh (this article)
- From enterprise-level CF platform to cloud native (4): Redefining elastic fault tolerance—from Hystrix to adaptive governance
- From enterprise-level CF platform to cloud native (5): The evolution of release governance—from manual approval to progressive delivery
- From enterprise-level CF platform to cloud native (6): Summary—an architect’s perspective on enterprise-level microservices governance