Dealing with pod density limitations on EKS worker nodes
So my team recently started running workloads on Kubernetes. We are using EKS, the managed Kubernetes offering on AWS. As we added more workloads to the cluster deployed in our Dev environment, we soon ran into a weird issue: we could not schedule pods on K8s because, get this, there were no IP addresses left for K8s to assign to a pod. So pod scheduling was failing. This blew my mind. If we wanted to deploy production-grade services on k8s with high availability, proper monitoring, tracing, canary releases, etc., we would need IP addresses in the 1000s, and we were nowhere near that number of workloads yet. So I wanted to understand what the capacity limits were and why we ran into them.
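If you want to see the symptom for yourself, the stuck pods and the reason they are stuck are visible with plain kubectl (pods that cannot get an IP typically sit in Pending or ContainerCreating, and the failure shows up in their events; the exact wording depends on your CNI and Kubernetes version):
# list pods that are stuck (ContainerCreating also reports phase Pending)
kubectl get pods --all-namespaces --field-selector=status.phase=Pending
# then look at the events of one of them for the actual failure reason
kubectl describe pod <pending-pod> -n <namespace>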
So I ran the following commands to understand if there were any subnet limitations —
aws eks describe-cluster --name "$CLUSTER_NAME" | \
jq '.cluster.resourcesVpcConfig.subnetIds'
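As a small convenience, the same call can drop the IDs straight into a shell variable via the CLI's --query flag (a sketch; it assumes CLUSTER_NAME is set to your cluster's name):
# capture the subnet IDs for the next step; the result can also be passed
# straight to `aws ec2 describe-subnets --subnet-ids`
SUBNET_IDS=$(aws eks describe-cluster --name "$CLUSTER_NAME" \
  --query 'cluster.resourcesVpcConfig.subnetIds' --output text)
echo "$SUBNET_IDS"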
Once I got the subnet-ids, I ran —
aws ec2 describe-subnets --filters \
Name=subnet-id,Values=subnet-0,subnet-1,subnet-2 | \
jq '.Subnets[].CidrBlock'
to get the CIDR block of each subnet. The subnets were /27 blocks in the 172.26.*.* range, so that's 32 IP addresses per subnet (Translate CIDR to IP Range). We had three subnets, so that would be a total of 96 IP addresses.
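Incidentally, describe-subnets also reports an AvailableIpAddressCount per subnet, which already accounts for the five addresses AWS reserves in every subnet and for anything already allocated, so it's a quicker way to see the real headroom:
aws ec2 describe-subnets --filters \
Name=subnet-id,Values=subnet-0,subnet-1,subnet-2 | \
jq '.Subnets[] | {SubnetId, CidrBlock, AvailableIpAddressCount}'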
Next, I wanted to understand if there was any constraint that was placed by EKS on the max number of pods that can be scheduled on each node —
kubectl get nodes -o json | jq '.items[].status.capacity.pods'
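The same information with the node names attached, if you want to see which node reports what:
kubectl get nodes -o custom-columns=NODE:.metadata.name,PODS:.status.capacity.pods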
Turns out there was. The capacity was 17 pods per node. That's quite a low number. It turns out that EKS relies on the underlying VPC infrastructure to assign IPs to pods. So if your VPC has three subnets with a total of 96 assignable IPs between them, then EKS has to pick every pod IP from within those 96 addresses.

But here is the thing: even if your subnets have a larger number of assignable IPs, the number of pods that can be scheduled on a worker node is still constrained by how many IPs the worker node's elastic network interfaces can carry. Check this out. We were using 3 t2.medium instances, which are limited to 3 ENIs and 6 IPs per ENI, i.e. 18 IPs per worker node. The primary IP of each ENI is reserved (one of those is the node's own IP), which leaves 15 IPs that can be handed out to pods; add the two pods that run on the host network (kube-proxy and the VPC CNI's own aws-node pod) and you land on the 17-pod cap. So with 3 t2.medium instances, our EKS worker node pool could only schedule 51 pods in total. Even if we were to increase the IP ranges of our subnets, it wouldn't matter, because with this setup the only way to get more pods scheduled would be to add more worker nodes.
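To make the arithmetic explicit: the per-instance-type max-pods values shipped in the amazon-eks-ami repo (eni-max-pods.txt) are derived from the instance's ENI limits, and for a t2.medium the usual formula works out like this:
# max pods per node with the default VPC CNI:
#   ENIs * (IPv4 addresses per ENI - 1) + 2
# the -1 is each ENI's reserved primary IP, the +2 covers host-network pods
ENIS=3          # t2.medium allows up to 3 ENIs
IPS_PER_ENI=6   # and 6 IPv4 addresses per ENI
echo $(( ENIS * (IPS_PER_ENI - 1) + 2 ))   # prints 17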
We really didn't want to add more worker nodes, as our instances were underutilized and we didn't want to increase the node count unless we were hitting high utilization numbers on the existing ones. So I did some digging around to find out how to overcome this limitation. Turns out, k8s relies on a networking subsystem to attach a pod to a network. That networking subsystem has to adhere to a standard called CNI (Container Network Interface). CNI plugins handle things like multi-host networking, IP address management, etc. See the full spec here. CNI plugins can also be swapped out or chained, which is pretty cool because I can pick and choose which aspect of the network stack is handled by which plugin. AWS EKS is configured to use the VPC CNI plugin by default, and it is this plugin that is responsible for assigning IP addresses to pods. So all I had to do was remove the VPC CNI plugin and configure my k8s cluster to use an alternate CNI plugin instead. There are plenty of CNI plugins available: weave-net, flannel, cilium, etc. I decided to use weave-net. It was pretty straightforward to set up, and I was already using one of their other products, Scope, to view the k8s cluster state.
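For reference, installing Weave Net is a single kubectl apply of their daemon set manifest. The command documented at the time looked like the one below; the cloud.weave.works endpoint has since been retired, so check the current Weave Net docs for where the manifest lives now:
kubectl apply -f "https://cloud.weave.works/k8s/net?k8s-version=$(kubectl version | base64 | tr -d '\n')"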
To get the k8s cluster to use the weave-net CNI for IPAM and multi-host networking, there were a couple of other steps I had to follow. Kubernetes looks in the /etc/cni/net.d directory on each node to decide which CNI plugin to load; it picks up the first ".conflist" file in that directory (in lexical order, which is why the files are prefixed with numbers) and loads the corresponding plugin. After I installed weave-net as a daemon set, there were two files in the directory: '10-aws.conflist' and '10-weave.conflist'. I deleted the aws-node daemon set from the kube-system namespace
kubectl delete daemonset -n kube-system aws-node
and then deleted the '10-aws.conflist' file from all the worker nodes as well. Then all I had to do was restart all the pods on the cluster, and they would come back with IP addresses assigned from the default weave-net CIDR block, 10.32.0.0/12. If you are following along, you can check by running —
kubectl get pods -A -o json | jq '.items[].status.podIP'
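As for the restart itself, the bluntest option is to delete the pods and let their controllers recreate them on the new network. A rough sketch (disruptive, and pods that aren't owned by a controller will not come back, so be deliberate about where you run this):
# recreate every pod, one namespace at a time
for ns in $(kubectl get ns -o jsonpath='{.items[*].metadata.name}'); do
  kubectl delete pods --all -n "$ns"
done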
Now that takes care of the IP problem, but the problem of max pod capacity per worker node remains. This can be solved by relaunching your worker nodes with the --use-max-pods flag set to false in the EKS bootstrap script (there's a sketch of what that looks like at the end of this post). If this is not an option for you (it wasn't for me: some temporary constraints meant I couldn't get at the worker node machines directly), you can create a daemon set of privileged pods, exec into each of those pods, chroot into the host worker node and make some config changes (a minimal sketch of such a daemon set follows the commands below) -
kubectl-host: kubectl exec -it -n kube-system <privileged-pod> sh
pod: chroot /host
node: vi /etc/kubernetes/kubelet/kubelet-config.json
// remove the maxPods entry at the end of the file, save & exit
node: systemctl restart kubelet
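Here is the minimal sketch of the privileged daemon set I mentioned above. None of this is from my actual manifest (the name and image are placeholders); the parts that matter are privileged: true and mounting the host's root filesystem at /host so that chroot /host works:
cat <<'EOF' | kubectl apply -f -
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-shell        # hypothetical name
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: node-shell
  template:
    metadata:
      labels:
        app: node-shell
    spec:
      hostPID: true             # handy if you also want to see host processes
      containers:
      - name: shell
        image: alpine:3.18      # any small image with a shell will do
        command: ["sleep", "999999999"]
        securityContext:
          privileged: true      # lets us touch host files and restart kubelet
        volumeMounts:
        - name: host-root
          mountPath: /host      # host filesystem, target of `chroot /host`
      volumes:
      - name: host-root
        hostPath:
          path: /
EOF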
And with the kubelet restarted, you are all set: your worker node should now have its max pods setting back at 110 (the default Kubernetes number).
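For the node-relaunch route mentioned earlier: on the EKS-optimized AMI, the bootstrap call in the worker nodes' user data would look roughly like this (a sketch; 'my-cluster' and the extra kubelet arg are illustrative):
#!/bin/bash
# worker node user data (sketch) for an EKS-optimized AMI
# --use-max-pods false stops bootstrap.sh from pinning max pods to the
# ENI-derived value; --kubelet-extra-args then sets whatever limit you want
/etc/eks/bootstrap.sh my-cluster \
  --use-max-pods false \
  --kubelet-extra-args '--max-pods=110'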