Better application networking and security with CAKES

By Christian Posta

Modern software applications are underpinned by a large and growing web of APIs, microservices, and cloud services that must be highly available, fault tolerant, and secure. The underlying networking technology must support all of these requirements, of course, but also explosive growth.

Unfortunately, the previous generation of technologies is too expensive, brittle, and poorly integrated to adequately solve this challenge. Combined with sub-optimal organizational practices, regulatory compliance requirements, and the pressure to deliver software faster, these shortcomings demand a new generation of technology to address API, networking, and security challenges.

CAKES is an open-source application networking stack built to integrate and better solve these challenges. This stack is intended to be coupled with modern practices like GitOps, declarative configuration, and platform engineering. CAKES is built on the following open-source technologies: a CNI implementation such as Cilium or Calico, Ambient mode (Istio), Kubernetes, Envoy proxy, and SPIFFE.

In this article, we explore why we need CAKES and how these technologies fit together in a modern cloud environment, with a focus on speeding up delivery, reducing costs, and improving compliance.

Why CAKES?

Existing technology and organizational structures are impediments to solving the problems that arise with the explosion in APIs, the need for iteration, and an increased speed of delivery. Best-of-breed technologies that integrate well with each other, that are based on modern cloud principles, and that have been proven at scale are better equipped to handle these challenges.

Conway’s law strikes again

A major challenge in enterprises today is keeping up with the networking needs of modern architectures while also keeping existing technology investments running smoothly. Large organizations have multiple IT teams responsible for these needs, but at times, information sharing and communication between these teams are less than ideal. Those responsible for connectivity, security, and compliance typically live across networking operations, information security, platform/cloud infrastructure, and/or API management. These teams often make decisions in silos, which causes duplication and integration friction with other parts of the organization. Oftentimes, “integration” between these teams happens through ticketing systems.

For example, a networking operations team generally oversees technology for connectivity, DNS, subnets, micro-segmentation, load balancing, firewall appliances, monitoring/alerting, and more. An information security team is usually involved in policy for compliance and audit, managing web application firewalls (WAFs), penetration testing, container scanning, deep packet inspection, and so on. An API management team takes care of onboarding, securing, cataloging, and publishing APIs.

If each of these teams independently picks the technology for its silo, then integration and automation will be slow, brittle, and expensive. Changes to policy, routing, and security will reveal cracks in compliance. Teams may become confused about which technology to use, as inevitably there will be overlap. Lead times for changes in support of app developer productivity will get longer and longer. In short, Conway’s law, which observes that a system’s design tends to mirror the communication structure of the organization that built it, rears its ugly head.

Sub-optimal organizational practices

Conway’s law isn’t the only issue here. Organizational practices in this area can be sub-optimal. Implementations on a use-case-by-use-case basis result in many isolated “network islands” within an organization because that’s how things “have always been done.”

For example, a new line of business spins up, which will provide services to other parts of the business and consume services from other parts. The modus operandi is to create a new VPC (virtual private cloud), install new F5 load balancers and new Palo Alto firewalls, create a new team to configure and manage it all, and so on. Doing this use case by use case causes a proliferation of these network islands, which are difficult to integrate and manage.

As time goes on, each team solves challenges in their environments independently. Little by little, these network islands start to move away from each other. For example, we at Solo.io have worked with large financial institutions where it’s common to find dozens if not hundreds of these drifting network islands. Organizational security and compliance requirements become very difficult to keep consistent and auditable in an environment like that.

Outdated networking assumptions and controls

Lastly, the assumptions we’ve made about perimeter network security and the controls we use to enforce security policy and network policy are no longer valid. We’ve traditionally assigned a lot of trust to the network perimeter and “where” services are deployed within network islands or network segments. The “perimeter” deteriorates as we punch more holes in the firewall, use more cloud services, and deploy more APIs and microservices on premises and in public clouds (or in multiple public clouds as demanded by regulations). Once a malicious actor makes it past the perimeter, they have lateral access to other systems and can get access to sensitive data. Security and compliance policies are typically based on IP addresses and network segments, which are ephemeral and can be reassigned. With rapid changes in the infrastructure, “policy bit rot” happens quickly and unpredictably.

Policy bit rot happens when we intend to enforce a policy, but because of a change in complex infrastructure and IP-based networking rules, the policy becomes skewed or invalid. Let’s take a simple example of service A running on VM 1 with IP address 10.0.1.1 and service B running on VM 2 with IP address 10.0.1.2. We can write a policy that says “service A should be able to talk to service B” and implement that as firewall rules allowing 10.0.1.1 to talk to 10.0.1.2.

Two simple things could happen here to rot our policy. First, a new service C could be deployed to VM 2. The result, which may not be intended, is that service A can now also call service C. Second, VM 2 could become unhealthy and be recycled with a new IP address, and the old IP address could be reassigned to a VM 3 running service D. Now service A can call service D but potentially not service B.
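To make this concrete, here is a minimal sketch of what such a rule might look like with iptables, using the hypothetical IP addresses from the example above and a hypothetical port:

# intent: “service A may call service B,” encoded as an IP rule
iptables -A FORWARD -s 10.0.1.1 -d 10.0.1.2 -p tcp --dport 8080 -j ACCEPT

# nothing in the rule knows about “service B”; if 10.0.1.2 is later
# reassigned to another VM, the rule silently grants that VM access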

The previous example covers a very simple use case, but if you extend it to hundreds of VMs with hundreds if not thousands of complex firewall rules, you can see how quickly policy in environments like this can become skewed. When policy bit rot happens, it’s very difficult to understand what the current policy actually is unless something breaks. And just because traffic isn’t breaking right now doesn’t mean the policy posture hasn’t become vulnerable.

Conway’s law, complex infrastructure, and outdated networking assumptions make for a costly quagmire that slows the speed of delivery. Making changes in these environments leads to unpredictable security and policy impacts, makes auditing difficult, and undermines modern cloud practices and automation. For these reasons, we need a modern, holistic approach to application networking.

A better approach to application networking

Technology alone won’t solve some of the organizational challenges discussed above. More recently, the practices that have formed around platform engineering appear to give us a path forward. Organizations that invest in platform engineering teams to automate and abstract away the complexity around networking, security, and compliance enable their application teams to go faster.

Platform engineering teams take on the heavy lifting around integration, homing in on the right user experience for the organization’s developers. By centralizing common practices, taking a holistic view of the organization’s networking, and using GitOps-based workflows to drive delivery, a platform engineering team gains the benefits of best practices, reuse, and economies of scale. This improves agility, reduces costs, and allows app teams to focus on delivering new value to the business.

For a platform engineering team to be successful, we need to give them tools that are better equipped to live in this modern, cloud-native world. When thinking about networking, security, and compliance, we should be thinking in terms of roles, responsibilities, and policy that can be mapped directly to the organization.

We should avoid relying on “where” things are deployed, what IP addresses are being used, and what micro-segmentation or firewall rules exist. We should be able to quickly look at our “intended” posture and easily compare it to the existing deployment or policy. This will make auditing simpler and compliance easier to ensure. How do we achieve this? We need three simple but powerful foundational concepts in our tools:

- Declarative configuration
- Workload identity
- Standard integration points

Declarative configuration

Intent and current state are often muddied by the complexities of an organization’s infrastructure. Trying to wade through thousands of lines of firewall rules based on IP addresses and network segmentation to understand intent can be nearly impossible. Declarative configuration formats help solve this.

Instead of thousands of imperative steps to achieve a desired posture, declarative configuration allows us to state very clearly what the intent or end state of the system should be. With declarative configuration, we can compare the live state of a system with its intended state far more easily than by trying to reverse engineer complex steps and rules. If the infrastructure changes, we can “recompile” the declarative policy against the new target, which allows for agility.

Writing network policy as declarative configuration is not enough, however. We’ve seen large organizations build nice declarative configuration models, but the complexity of their infrastructure still leads to complex rules and brittle automation. Declarative configuration should be written in terms of a strong workload identity that is tied to services and mapped to the organization’s structure. This workload identity is independent of the infrastructure, IP addresses, or micro-segmentation. Workload identity reduces policy bit rot and configuration drift, and it makes it easier to reason about the intended state of the system versus its actual state.
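As a minimal sketch, here is how the earlier “service A may call service B” intent could be declared as a standard Kubernetes NetworkPolicy that selects workloads by label rather than by IP address (the labels and port are hypothetical):

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-a-to-b
spec:
  # applies to service B's pods, wherever they are scheduled
  podSelector:
    matchLabels:
      app: service-b
  policyTypes:
  - Ingress
  ingress:
  - from:
    # only pods labeled as service A may connect
    - podSelector:
        matchLabels:
          app: service-a
    ports:
    - protocol: TCP
      port: 8080

Because the policy selects labels instead of addresses, recycling a VM or rescheduling a pod does not invalidate the intent.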

Workload identity

Previous methods of building policy based on “where” workloads are deployed are too susceptible to policy bit rot. Constructs like IP addresses and network segments are not durable; they are ephemeral, can be changed or reassigned, or may not even be relevant. Changes to these constructs can nullify intended policy. We need to identify workloads based on what they are and how they map to the organizational structure, independently of where they are deployed. This decoupling allows intended policy to resist drift when the infrastructure changes, spans hybrid environments, or experiences faults and failures.

With a more durable workload identity, we can write authentication and authorization policies with declarative configuration that are easier to audit and that map clearly to compliance requirements. A high-level compliance requirement such as “test and developer environments cannot interact with production environments or data” becomes easier to enforce. With workload identity, we know which workloads belong to which environments because it’s encoded in their workload identity.
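As a hedged sketch of that compliance requirement, an Istio AuthorizationPolicy can deny traffic from non-production namespaces, where the source identity comes from the peer’s mTLS certificate rather than its IP address (the namespace names are hypothetical):

apiVersion: security.istio.io/v1
kind: AuthorizationPolicy
metadata:
  name: deny-non-prod
  namespace: prod
spec:
  action: DENY
  rules:
  - from:
    - source:
        # source identity is derived from the workload's mTLS certificate,
        # not from where the workload happens to be running
        namespaces: ["dev", "test"]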

Most organizations already have existing investments in identity and access management systems, so the last piece of the puzzle here is the need for standard integration points.

Standard integration points

A big pain point in existing networking and security implementations is the expensive integration between systems that were not intended to work well together or that expose proprietary integration points. Some of these integrations are heavily UI-based, which makes them difficult to automate. Any system built on declarative configuration and strong workload identity will also need to integrate with other layers in the stack or with supporting technology.

Open, standard integration points make it easier to compose the necessary pieces and simplify integration. Adopting identity standards like OpenID Connect, OAuth, and SSO, or observability standards like OpenTelemetry, greatly simplifies integration. Using declarative configuration makes integrating with automation tools like Flux, Argo CD, and Backstage straightforward. Edge cases that require niche or proprietary technology should still be possible to support through standard integration points.
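As a minimal sketch, assuming an Argo CD installation and a hypothetical Git repository of declarative network policies, the GitOps integration can be as simple as pointing an Argo CD Application at that repository:

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: network-policies
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example-org/network-policies  # hypothetical repo
    targetRevision: main
    path: policies
  destination:
    server: https://kubernetes.default.svc
    namespace: default
  syncPolicy:
    automated:
      prune: true     # remove policies that are deleted from Git
      selfHeal: true  # revert any out-of-band changes

Git becomes the audit trail: the intended policy is whatever is on the main branch, and drift is corrected automatically.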

If we have these foundational elements in place—i.e., declarative configuration, workload identity, and a sensible path for integration—we can start identifying what we need from a modern API, networking, and security stack.

Modern networking and security requirements

All service and API communication in an organization needs to be secured, managed, and observed, regardless of whether the traffic is ingress, east/west, or egress. The main capabilities we need in a modern application networking stack are:

- Secure, zero-trust connectivity based on workload identity and mTLS
- Fine-grained authentication and authorization policy
- Traffic management, load balancing, and request resilience
- API gateway capabilities such as rate limiting and external authorization
- Observability through metrics, logs, and traces

These capabilities have to be implemented for highly dynamic and heterogeneous cloud workload environments. As workloads come and go, scale up and down, or become unhealthy, the networking infrastructure has to be able to reconcile and update. Additionally, we need these capabilities to be based on declarative configuration and workload identity and to work with standard integration points.

As networking is defined in terms of layers, we also see a modern networking solution in terms of layers:

- A container orchestration foundation for declarative deployment (Kubernetes)
- An L3/L4 connectivity and network policy layer (CNI)
- A workload identity layer (SPIFFE)
- A service mesh layer for secure service-to-service traffic (Istio ambient mode)
- An L7 ingress/egress gateway layer (Envoy proxy)

We want to move from inconsistent, expensive, incompatible tools and teams working in silos to a modern, holistic, application networking solution. This solution is built from the ground up on cloud-native principles, delivered through platform engineering, and maps better than traditional approaches to organizational needs and pressures.

The CAKES stack is that solution.

Introducing CAKES

The landscape of Cloud Native Computing Foundation (CNCF) projects is a large and expansive set of technologies targeting modern cloud-native use cases. When evaluating a suitable stack for modern application networking, not only do we need technologies built with the principles discussed above, but we also want ones with vibrant, active communities (both users and vendors).

Communities often form around the technologies that have good foundations, that can be extended, and that have real proof points such as being adopted by large organizations at scale. One of the most compelling proof points is adoption by the major public clouds. As we choose technologies for the various layers in a modern networking stack, we must consider community and adoption.

CAKES is an acronym for a handful of open-source technologies that integrate nicely to solve modern application networking challenges. CAKES is also a metaphor for layering networking technologies in supporting ways to provide a holistic solution. The following technologies make up the CAKES stack:

- C – CNI implementation (Cilium or Calico)
- A – Ambient mode (Istio service mesh)
- K – Kubernetes
- E – Envoy proxy
- S – SPIFFE (implemented by SPIRE)

The technologies in the CAKES stack adhere to the principles above: they use declarative configuration, are built with integration in mind, and, where appropriate, are based on workload identity. Additionally, they have vibrant open-source communities and proof-point deployments at large-scale organizations, with most having been adopted by the major public clouds. CAKES can be used to gradually replace existing networking technologies and can be delivered through platform engineering APIs and tools.

Let’s look at the constituent layers of the CAKES stack, starting with the foundational pieces at the bottom and moving up. We cannot talk about modern platforms without talking about Kubernetes as the foundation.

K layer – Kubernetes

The K in the CAKES stack stands for Kubernetes. Kubernetes is a powerful container orchestration and service abstraction tool used for deploying and managing services across a fleet of machines. The automation Kubernetes provides makes it possible to manage and scale services dynamically. Kubernetes is also cloud-agnostic and provides the foundation for a hybrid cloud or multi-cloud deployment. Platform engineering teams should start with Kubernetes and build from there.

Kubernetes is built from the ground up on the premise of declarative configuration. Kubernetes’ core declarative configuration API consists of deployments, pods, and services. There are other components, but using these core API objects, platform teams can specify the end-state intent of a deployed service in terms of number of replicas, configuration, and metadata. Kubernetes takes care of health checking, scaling up and down, restarts, and so on. The Kubernetes core API can be extended with custom declarative configuration, which provides the foundation for the next layers in the CAKES stack.
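As a minimal sketch of this declarative intent, the following hypothetical Deployment declares that three replicas of a service should always be running, and Kubernetes continuously reconciles the actual state toward that intent:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: payments
spec:
  replicas: 3          # intent: keep three copies running at all times
  selector:
    matchLabels:
      app: payments
  template:
    metadata:
      labels:
        app: payments
    spec:
      containers:
      - name: payments
        image: example.com/payments:1.0   # hypothetical image
        ports:
        - containerPort: 8080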

C layer – CNI implementation

The C in CAKES stands for CNI (Container Network Interface) or, more specifically, a CNI implementation like Cilium or Calico. This layer addresses L3 and L4 concerns of the networking stack on top of Kubernetes. It needs to provide a few basic capabilities that form the foundation of the rest of the networking layers: basic network connectivity, network policy, and SNAT/DNAT between the Kubernetes overlay and the VPC. A CNI implementation is a required component in any Kubernetes cluster.

Cilium and Calico are popular open-source CNI implementations (as is AWS VPC-CNI), and both can use an eBPF data plane. eBPF is a revolutionary technology for optimizing Kubernetes networking for large numbers of workloads. By altering the networking path in the Linux kernel, eBPF can deliver big performance and route optimization benefits.

Cilium and Calico use declarative configuration that extends the Kubernetes API and ties nicely into a GitOps approach to platform engineering. As the CNI, Cilium or Calico provides basic connectivity between workloads in Kubernetes and implements coarse-grained network policy. The CNI also integrates and works well with the other layers in the CAKES stack.
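As a hedged sketch, assuming Cilium as the CNI, a coarse-grained namespace-isolation policy might look like the following CiliumNetworkPolicy (the namespace name is hypothetical):

apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: isolate-team-a
  namespace: team-a
spec:
  # applies to every workload in the team-a namespace
  endpointSelector: {}
  ingress:
  # allow traffic only from workloads in the same namespace
  - fromEndpoints:
    - matchLabels:
        k8s:io.kubernetes.pod.namespace: team-a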

S layer – SPIFFE

The S in the CAKES stack stands for SPIFFE, which itself stands for Secure Production Identity Framework For Everyone. This layer is the foundation of workload identity in our networking solution. SPIFFE is an open-source specification for assigning cryptographic workload identity in a dynamic and heterogeneous environment, and it can be leveraged by other layers in the stack. For example, Istio’s ambient mode uses SPIFFE in its mTLS workload identity design, and those identities can be referenced in authorization policy configurations.

SPIRE is an implementation of the SPIFFE specification that can assign SPIFFE workload identities based on workload attestation. SPIRE can eliminate usernames and passwords for proving identity by leveraging details of the infrastructure on which a workload is deployed to “prove” its identity. SPIRE can also bring SPIFFE to workloads that run outside of Kubernetes.
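As a minimal sketch, assuming a SPIRE deployment that uses the Kubernetes workload attestor, a registration entry mapping a service account to a SPIFFE ID might look like this (the trust domain, namespace, and service account are hypothetical):

# workloads in namespace "prod" running as service account "payments"
# are attested and issued this SPIFFE identity
spire-server entry create \
  -parentID spiffe://example.org/ns/spire/sa/spire-agent \
  -spiffeID spiffe://example.org/ns/prod/sa/payments \
  -selector k8s:ns:prod \
  -selector k8s:sa:payments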

A layer – Ambient mode architecture (Istio)

The A in CAKES stands for “ambient mode” or, more generally, the Istio service mesh. This layer solves secure zero-trust connectivity using SPIFFE workload identity and mTLS, authorization policies, traffic management and load balancing, service discovery, and L7 observability in the form of metrics, traces, and logs. Istio’s ambient mode is also driven by declarative configuration, layers on top of any CNI, and can be automated by a platform engineering team.

Istio’s ambient mode, specifically, is an open-source, sidecar-less service mesh that transparently provides mTLS-based workload identity, which can then be used in declarative authorization policies. This approach forms the basis of a zero-trust networking architecture. It also provides powerful L7 capabilities to services, such as traffic splitting, traffic mirroring, locality-aware load balancing, request resilience (timeouts, retries, and circuit breaking), and L7 observability (metrics, logging, and tracing).
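For instance, here is a minimal sketch of a canary traffic split with retries and a timeout, expressed as an Istio VirtualService (the service names and weights are hypothetical):

apiVersion: networking.istio.io/v1
kind: VirtualService
metadata:
  name: payments
  namespace: prod
spec:
  hosts:
  - payments.prod.svc.cluster.local
  http:
  - timeout: 2s
    retries:
      attempts: 3
      perTryTimeout: 500ms
    route:
    # send 90% of requests to the stable service, 10% to the canary
    - destination:
        host: payments.prod.svc.cluster.local
      weight: 90
    - destination:
        host: payments-canary.prod.svc.cluster.local
      weight: 10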

Because Istio’s ambient mode doesn’t require a sidecar proxy, it’s operationally much simpler for platform teams to add applications to (or remove applications from) the mesh. The mesh can also be upgraded transparently to the applications running in it, which means it can be patched and maintained on a consistent schedule without coordination with application teams. Lastly, because there is no sidecar, the resource overhead of running the mesh is cut by an order of magnitude compared to sidecar-based service meshes.

E layer – Envoy proxy

The E in the CAKES stack stands for the Envoy proxy. This layer provides the L7 functionality needed in an ingress/egress API gateway. Envoy provides functionality like TLS termination/origination, rate limiting, external authorization integration, load balancing, traffic routing, observability metric collection, and more. Envoy also happens to be used for service-to-service traffic in the “A” or service mesh layer.

Envoy is the de facto standard L7 proxy in the open-source community. All of the major clouds use Envoy in their edge gateways or front-end load balancers. Envoy is highly extensible through its filter architecture, and because of this we see it integrated into all kinds of infrastructure: API gateways, service meshes, ingress/egress load balancers, and even CNI implementations. Envoy was built specifically to live in a highly dynamic, heterogeneous environment; it is configured through a dynamic API and control plane, which allows it to fit nicely into a Kubernetes platform. Envoy can use TLS certificates based on SPIFFE/SPIRE for workload identity as it proxies connections.
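As a minimal sketch (in practice a control plane pushes this configuration dynamically over Envoy’s xDS APIs), a static Envoy configuration that routes ingress traffic to a hypothetical backend service looks like this:

static_resources:
  listeners:
  - name: ingress
    address:
      socket_address: { address: 0.0.0.0, port_value: 8080 }
    filter_chains:
    - filters:
      - name: envoy.filters.network.http_connection_manager
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
          stat_prefix: ingress_http
          route_config:
            name: local_route
            virtual_hosts:
            - name: backend
              domains: ["*"]
              routes:
              # route all paths to the backend cluster below
              - match: { prefix: "/" }
                route: { cluster: backend_service }
          http_filters:
          - name: envoy.filters.http.router
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
  clusters:
  - name: backend_service
    type: STRICT_DNS
    load_assignment:
      cluster_name: backend_service
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              # hypothetical in-cluster backend service
              socket_address: { address: backend.default.svc.cluster.local, port_value: 8080 }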

A strategy for adopting CAKES

CAKES technologies can be incrementally adopted either individually or holistically through a platform engineering approach. Although platform engineering doesn’t require the use of Kubernetes, many organizations are leading the effort with Kubernetes to simplify, automate, and abstract the details of deploying their services and APIs. As the initial step to modernization, Kubernetes brings the benefits of a declarative configuration system, multi-cloud deployments, and cloud independence.

Incrementally building an organization’s platform based on user and developer feedback is the best approach we’ve seen. From a technology adoption perspective, meeting developer needs typically starts with the CNI layer and an API gateway layer—that is, how to get traffic into an API and how to control basic ingress/egress between teams within Kubernetes. That’s where the “C” and “E” get introduced into the stack. The CNI layer (Cilium, Calico, AWS VPC-CNI, etc.) can deliver coarse-grained networking policy between Kubernetes namespaces. An Envoy-based API gateway, such as Gloo Gateway, provides a powerful way to authenticate, rate limit, and expose APIs deployed in Kubernetes to API consumers.
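As a hedged sketch, assuming an Envoy-based gateway that supports the Kubernetes Gateway API (as Gloo Gateway does), exposing an API to consumers might look like this hypothetical HTTPRoute:

apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: orders-route
  namespace: orders
spec:
  parentRefs:
  - name: public-gateway       # hypothetical Gateway managed by the platform team
    namespace: gateway-system
  hostnames: ["api.example.com"]
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /orders
    backendRefs:
    - name: orders
      port: 8080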

Kubernetes adoption strategies such as “lift and shift” are fairly common. Getting applications into Kubernetes improves the efficiency and economies of scale of the platform. The more APIs and services that move onto the Kubernetes-based platform, the greater the need for fine-grained observability, security authorization, and networking policy for cross-service and eventually cross-cluster traffic. That’s where the “A” and “S” parts of the stack come into the picture. The “A,” Istio ambient mode, solves the challenges around service-to-service authentication using mTLS, without using a sidecar. With an mTLS approach in place, we can layer in the “S,” a SPIFFE implementation, which Istio ambient mode does automatically. If you’re looking to implement SPIFFE, Istio is the most straightforward way to do so.
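As a minimal sketch, turning on strict mutual TLS for every service in the mesh is a single declarative resource in Istio (assuming istio-system is the mesh’s root namespace):

apiVersion: security.istio.io/v1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system   # root namespace, so the policy applies mesh-wide
spec:
  mtls:
    mode: STRICT             # reject any plaintext, non-mTLS traffic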

Introducing the “A” and “S” layers of the CAKES stack into a platform then enables a “shift without lift” strategy. We can extend the networking mesh to services that run outside of the Kubernetes estate and control networking, authentication/authorization, and networking policy for those APIs as well. Istio’s ambient mode can be deployed directly to VMs outside a Kubernetes cluster, for example, and those workloads can be natively included in the mesh.

This approach of adopting Kubernetes, slowly bringing APIs and services onto the platform with “lift and shift,” and then extending with “shift without lift” is very similar to a “strangler pattern” for modern networking.

CAKES gives platform owners and engineers modern tools to improve developer experience, speed of delivery, and compliance posture. CAKES is well suited to platform engineering approaches and fits natively within GitOps workflows. Solo.io is at the forefront of driving the open-source projects that make up the CAKES stack and provides commercial support for the stack. To learn more about CAKES, visit https://www.solo.io/topics/cakes-stack/.

Christian Posta (@christianposta) is global field CTO at Solo.io supporting customers and end users in their adoption of cloud-native technologies. He is an author for Manning and O’Reilly publications, open-source contributor, blogger, and sought-after speaker on Envoy Proxy and Kubernetes technologies. Prior to Solo.io, Christian was a chief architect at Red Hat and at FuseSource and held engineering positions at Wells Fargo, Apollo Group, and Intel.

New Tech Forum provides a venue for technology leaders—including vendors and other outside contributors—to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to doug_dineley@foundryco.com.
