NFTables in Kube-Proxy
I have been working with Istio for the past three years while developing the Service Mesh platform at Tetrate.io. IPTables has been an integral part of traffic flow configuration in Istio and Kubernetes networking. Istio uses it to set up redirection rules for sidecars, while Kube-Proxy uses it to route traffic to pods for its services.
Although IPTables has been a de facto solution used by network administrators for firewall and traffic flow control, there are a few challenges when dealing with large IPTables rulesets. NFTables was introduced in the Linux kernel to address these challenges. As the successor to IPTables, it is actively being developed and is seeing growing adoption in the community.
The massive improvements in NFTables have attracted significant attention from the Kubernetes community, leading to the introduction of NFTables in Kube-Proxy in 2023. The community introduced the alternative Kube-Proxy backend as an Alpha feature in version 1.29 and promoted it to Beta in version 1.31, with plans to graduate it to GA in version 1.33.
In this blog post, I want to capture a few notes on what users should expect and how we can get started using this!
IPTables in Kube-Proxy
Kube-Proxy is a networking component installed on each node in a Kubernetes cluster, responsible for maintaining network rules for service-to-pod mapping. It watches for service and pod events from the Kube-API-Server and creates IPTables rules to route traffic addressed to a service’s virtual IP to the appropriate backend pod. I would recommend going through this amazing article to better understand the IPTable rules created by Kube-Proxy: Kubernetes NodePort and IPTable rules
Limitations of IPTables
IPTables has seen significant improvements over the years, but there are some fundamental issues that affect performance on both the Data Plane and Control Plane.
Due to the lack of built-in map support in IPTables, the number of IPTables rules in a Kubernetes cluster is directly proportional to the number of services and endpoints in the cluster. As a result, rulesets in large-scale Kubernetes deployments can become massive.
IPTables evaluates rules sequentially. When a packet enters the kernel for routing, it must be checked against all service rules before being routed to the service-specific chain. This results in an average latency of O(N) for the first packet, where N is the number of services in the cluster.
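To make this concrete, here is a heavily trimmed, purely illustrative sketch of the kind of rules IPTables-mode Kube-Proxy appends to its KUBE-SERVICES chain; the chain names are made up (the real ones carry hashed suffixes and comments), and the IPs are borrowed from the example later in this post:

-A KUBE-SERVICES -d 10.96.34.36/32 -p tcp -m tcp --dport 9080 -j KUBE-SVC-PRODUCTPAGE
-A KUBE-SERVICES -d 10.96.94.96/32 -p tcp -m tcp --dport 9080 -j KUBE-SVC-REVIEWS
# ... one rule like this per service, matched top to bottom for the first packet of every connection ...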
Additionally, IPTables rulesets do not support incremental updates. Any process attempting to update an IPTables rule must take a lock, load the entire ruleset from the kernel, modify the rule in the appropriate place, re-upload the ruleset, and then release the lock. The lack of granular control over rules makes incremental updates difficult, creating a bottleneck in large-scale clusters with thousands of services, endpoints, and a massive IPTables ruleset.
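Roughly, a full-table update from user space looks like the sketch below (the file name is illustrative); Kube-Proxy’s iptables mode has to go through a similar dump-modify-restore cycle on every sync:

# Dump the entire nat table, even when only one rule needs to change
iptables-save -t nat > nat-rules.txt
# ... edit the relevant KUBE-SVC-*/KUBE-SEP-* lines in the dump ...
# Re-upload the whole table atomically (holding the xtables lock while it runs)
iptables-restore -T nat < nat-rules.txt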
Here comes NFTables
NFTables was introduced to overcome the limitations of IPTables:
- Support for Maps and Concatenation: With the introduction of maps, NFTables enables approximately O(1) lookups for the action to be performed. Concatenation allows users to build tuples from packet fields for efficient lookups in those maps (see the short sketch after this list).
- Support for Dynamic Ruleset Updates: NFTables represents rules as a linked list internally, allowing updates to individual rules without affecting the rest of the ruleset.
- Simplified Dual-Stack IPv4/IPv6 Administration: The introduction of the inet family enables handling both IPv4 and IPv6 traffic, eliminating the need for duplicate rulesets.
- Fully Configurable Tables and Chains: IPTables has predefined tables and chains that are registered regardless of whether they are used. In contrast, NFTables allows users to define their own tables and base chains, leading to performance improvements and easier management. Multiple users can manage their own tables and configure appropriate hook priorities for their packet processing pipeline.
- Simpler Syntax: A more intuitive and self-documenting syntax is always welcome! :)
- Optional Counters: Unlike IPTables, rules do not carry packet and byte counters by default, which eliminates performance overhead when counters are not needed.
- Multiple Actions per Rule: NFTables supports multiple actions within a single rule, preventing the need for rule duplication when multiple actions must be taken.
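As a small, standalone illustration of the maps, concatenation, and incremental-update points above (all table, chain, and map names here are made up and unrelated to Kube-Proxy’s actual ruleset):

# A verdict map keyed on the concatenation {destination IP, destination port}
nft add table ip demo
nft 'add chain ip demo incoming { type filter hook prerouting priority 0; }'
nft 'add map ip demo svc_map { type ipv4_addr . inet_service : verdict; }'
nft add rule ip demo incoming ip daddr . tcp dport vmap @svc_map

# Incremental update: add a single element without touching the rest of the ruleset
nft add chain ip demo svc_a
nft 'add element ip demo svc_map { 10.0.0.1 . 80 : goto svc_a }'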
These improvements lead to significant performance gains!
In Kube-Proxy, NFTables allows O(1) lookups when mapping a packet to a Kubernetes service. Packet processing times remain fairly constant regardless of cluster size.
The size of the ruleset in both IPTables and NFTables is still O(N), where N represents the total number of services and their endpoints. However, with the introduction of dynamic ruleset updates, the update size in Kube-Proxy is O(C), where C represents only the services and endpoints that have changed since the last sync, regardless of the total number of services and endpoints.
Trying out the nftables backend
Now that we understand the magic of nftables, let’s see it in action.
Setting up a kind cluster with the nftables backend is very simple. You can just set the kube-proxy mode with a config like this:
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
name: nftables-cluster
nodes:
- role: control-plane
  image: kindest/node:v1.32.2
- role: worker
  image: kindest/node:v1.32.2
networking:
  kubeProxyMode: "nftables"
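With that saved to a file (assumed here to be nftables-cluster.yaml), creating the cluster is a one-liner:

kind create cluster --config nftables-cluster.yaml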
I deployed the bookinfo application in the cluster, which is a set of 4 services: Productpage, Details, Ratings, and Reviews. This application is managed as part of the Istio project, and I frequently use it for demo purposes.
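For reference, deploying it is nothing more than applying the standard bookinfo manifest from the Istio repository (path correct at the time of writing):

kubectl apply -f https://raw.githubusercontent.com/istio/istio/master/samples/bookinfo/platform/kube/bookinfo.yaml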
As mentioned earlier, nftables doesn’t come with pre-defined tables and chains. Let’s start by taking a look at the registered tables:
docker exec -it nftables-cluster-control-plane bash
root@nftables-cluster-control-plane:/# nft list tables
table ip kube-proxy
table ip6 kube-proxy
table inet kindnet-network-policies
We will see three tables registered, two of which were created by Kube-Proxy (the third belongs to kind’s default CNI, kindnet). Even though nftables allows a single inet table to handle both IPv4 and IPv6 rules, it would still require separate sets and maps for each family, so Kube-Proxy keeps the rules in separate tables for simplicity.
Okay, let’s focus on the IPv4 rules. We can dump all the chains, maps, and sets registered in the kube-proxy IPv4 table by running:
nft list ruleset ip
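If other IPv4 tables existed on the node, the dump could also be scoped to just the Kube-Proxy table:

nft list table ip kube-proxy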
The entrypoint chains are registered to different hooks in the Linux kernel. Let’s first take a look at the NAT chains:
chain nat-prerouting {
    type nat hook prerouting priority dstnat; policy accept;
    jump services
}

chain nat-output {
    type nat hook output priority -100; policy accept;
    jump services
}

chain nat-postrouting {
    type nat hook postrouting priority srcnat; policy accept;
    jump masquerading
}
Okay, so there are three chains registered to the PREROUTING, OUTPUT, and POSTROUTING hooks respectively, similar to the IPTables ruleset. The PREROUTING and OUTPUT chains handle DNAT, while the POSTROUTING chain handles SNAT for traffic from outside the Pod CIDR.
Let’s jump to the service chain:
chain services {
    ip daddr . meta l4proto . th dport vmap @service-ips
    ip daddr @nodeport-ips meta l4proto . th dport vmap @service-nodeports
}
The first rule constructs a key from {Destination Address, L4 Protocol, Destination Port} and looks it up in the verdict map service-ips. Verdict maps map the constructed key to a verdict, which is basically the next action to be taken. The service-ips map looks like this:
map service-ips {
    type ipv4_addr . inet_proto . inet_service : verdict
    comment "ClusterIP, ExternalIP and LoadBalancer IP traffic"
    elements = { 10.96.0.10 . tcp . 53 : goto service-NWBZK7IH-kube-system/kube-dns/tcp/dns-tcp,
                 10.96.0.10 . udp . 53 : goto service-FY5PMXPG-kube-system/kube-dns/udp/dns,
                 10.96.34.36 . tcp . 9080 : goto service-T7OJ2CRA-default/productpage/tcp/http,
                 10.96.94.96 . tcp . 9080 : goto service-NKB2MNRK-default/reviews/tcp/http,
                 10.96.50.103 . tcp . 9080 : goto service-TJFYVQT5-default/details/tcp/http,
                 10.96.87.163 . tcp . 9080 : goto service-JOANWKTY-default/ratings/tcp/http,
                 10.96.0.1 . tcp . 443 : goto service-2QRHZV4L-default/kubernetes/tcp/https,
                 10.96.0.10 . tcp . 9153 : goto service-AS2KJYAD-kube-system/kube-dns/tcp/metrics }
}
The map’s type is {IPv4 address, L4 protocol, destination port} mapped to a verdict. Each element is a service’s ClusterIP, together with its port and protocol, mapped to the chain for that specific service. The map is what allows the O(1) lookup here, whereas in IPTables we would have to match linearly against all service rules until we find the right chain.
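You can also see the keyed lookup from the command line by querying a single element directly, using the productpage ClusterIP from the dump above:

nft get element ip kube-proxy service-ips '{ 10.96.34.36 . tcp . 9080 }'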
Let’s now dig into the chain for one of our deployed applications, productpage:
chain service-T7OJ2CRA-default/productpage/tcp/http {
    ip daddr 10.96.34.36 tcp dport 9080 ip saddr != 10.244.0.0/16 jump mark-for-masquerade
    numgen random mod 1 vmap { 0 : goto endpoint-B5XEP3S7-default/productpage/tcp/http__10.244.1.7/9080 }
}
The first rule marks traffic for masquerading (SNAT) when it is destined for this service but originates from outside the Pod CIDR range; the actual SNAT is applied later by the nat-postrouting chain we saw above. When the traffic comes from a source inside the Pod CIDR, no SNAT is needed and the response can be routed directly back to it.
The next rule again uses an inline verdict map, whose values are chains for the individual endpoints. A random number modulo the number of endpoints picks which endpoint chain to jump to, load balancing traffic across the endpoints; with a single endpoint here, the modulus is 1 and index 0 is always chosen.
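With, say, three endpoints, the same rule would look roughly like this (the addresses and hashed chain names are illustrative):

numgen random mod 3 vmap { 0 : goto endpoint-XXXXXXXX-default/productpage/tcp/http__10.244.1.7/9080,
                           1 : goto endpoint-YYYYYYYY-default/productpage/tcp/http__10.244.1.8/9080,
                           2 : goto endpoint-ZZZZZZZZ-default/productpage/tcp/http__10.244.2.5/9080 }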
Let’s take a look at the node port routing rules which we skipped over earlier:
chain services {
    ip daddr . meta l4proto . th dport vmap @service-ips
    ip daddr @nodeport-ips meta l4proto . th dport vmap @service-nodeports
}

set nodeport-ips {
    type ipv4_addr
    comment "IPs that accept NodePort traffic"
    elements = { 172.18.0.4 }
}

map service-nodeports {
    type inet_proto . inet_service : verdict
    comment "NodePort traffic"
    elements = { tcp . 32459 : goto external-JOANWKTY-default/ratings/tcp/http }
}

chain external-JOANWKTY-default/ratings/tcp/http {
    jump mark-for-masquerade
    goto service-JOANWKTY-default/ratings/tcp/http
}
The second rule in the services chain checks whether the destination address is in the nodeport-ips set, constructs a key from {L4 Protocol, Destination Port}, looks it up in the service-nodeports verdict map, and routes to the external chain for the service. The external chain marks the packet for masquerade (SNAT) and then hands it off to the internal service chain we saw earlier, which load balances across the endpoints.
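To watch the incremental updates discussed earlier happen in real time, you can stream ruleset change events on the node while changing the number of endpoints (reviews-v1 is one of the bookinfo deployments):

# On the node: stream ruleset change events as Kube-Proxy applies them
nft monitor
# From another terminal: change the endpoint count for the reviews service
kubectl scale deployment reviews-v1 --replicas=2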
The maps and self-documenting new syntax really make it easy to read through the rules and understand the traffic flow.
I would love to evaluate the performance improvements in the next iteration of this article as well. Till then, please take a look at the numbers published by the Kubernetes project in their GA announcement blog.
To migrate or not to migrate
NFTables is amazing! Still, we need to consider a few factors before migrating our clusters over to the nftables backend:
- New code, new bugs: Although nftables is being used in production settings and has been fairly well-tested, it is still quite new and cannot be considered as stable as IPTables mode, which is still the default mode for kube-proxy.
- Dependency on newer Linux kernel: The NFTables backend mode requires kernel 5.13 or newer (a quick way to check your nodes is shown after this list).
- Ensure support in Networking and Observability tools: Users need to ensure that the rest of their networking and observability stack supports the NFTables backend.
- Incompatibilities with IPTables mode: Although there is feature parity between the two modes, IPTables mode came with some default behaviors that were less secure, performant, or intuitive. The new backend gave Kube-Proxy an opportunity to drop those defaults without breaking existing IPTables-mode users, which means a few behaviors intentionally differ between the two modes (for example, NodePort services no longer accept connections on 127.0.0.1).
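A quick way to verify the kernel requirement across all nodes is the KERNEL-VERSION column that kubectl already reports:

kubectl get nodes -o wide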
NFTables represents a significant advancement in Linux networking capabilities that’s now making its way into Kubernetes through the Kube-Proxy component. The transition from IPTables to NFTables brings substantial benefits in both performance and manageability that will become increasingly important as Kubernetes clusters continue to scale.
The key advantages we’ve explored:
- Near-constant O(1) lookup times regardless of cluster size
- Efficient incremental updates that only process changed services
- Intuitive, self-documenting syntax for easier troubleshooting
- More flexible architecture with user-defined tables and chains
While the performance improvements are compelling, especially for large clusters with thousands of services, the decision to migrate should be carefully considered. Organizations should evaluate their kernel versions, test compatibility with their existing networking and observability tools, and potentially run performance benchmarks in their specific environments.
This article reflects my understanding of the impact of the new Kube-Proxy mode on users and is based on multiple references listed below. Feel free to share any feedback or thoughts via email or reach out to me on Twitter (Yes, I refuse to call it X).