Don’t kill it with iron! How many pods can I start on an EKS node?

0/5 nodes are available: 5 Too many pods

The problem

Many of my clients have faced this issue: pods cannot be scheduled in an EKS cluster even though CPU and RAM utilization is low. Lacking the knowledge and time to research the topic, many decide to scale up the cluster by adding more nodes, incurring additional costs.

The origin of this issue is the number of IP addresses available on a Kubernetes data plane node in AWS. Due to the default behaviour of the VPC CNI plugin (a.k.a. the aws-node daemonset), every pod gets assigned an IPv4 address from the subnet’s IP address space. From that, two problems can arise:

  1. The subnet is running out of available IPv4 addresses. This can occur in poorly designed network setups or in very large enterprise environments. The solution is to assign secondary CIDR blocks to the VPC and configure the EKS cluster to use Custom Networking. This is the less likely of the two problems, and it is not covered in more detail in this article.

  2. The EC2 instance can’t claim more IPv4 addresses. Every EC2 instance type has a limited number of ENIs and IP addresses that it can use. For example, an m6i.large instance with 2 vCPUs and 8 GB RAM can use only 3 ENIs with 10 IPv4 addresses each. One address is claimed by the EC2 node itself, leaving 29 addresses to be used by pods running on that node (see the AWS CLI sketch after this list for how to look up these limits). From my experience, this is a frequent problem, and I’m going to summarize the solution in this blog post. Let’s observe the problem in practice:

    # Create an EKS Cluster
    eksctl create cluster --name my-cluster --without-nodegroup
    
    # Create a self-managed node group and set max-pods to a higher number
    # (here 1000)
    eksctl create nodegroup --cluster my-cluster --managed=false \
    --node-type m6i.large --nodes 1 --max-pods-per-node 1000 --name m6i-large
    
    # Test the default behaviour
    kubectl create deployment nginx --image=nginx --replicas=30
    kubectl get deployment -w
    

    As four pods are already running in the kube-system namespace and the node can accommodate at most 29 pods, only 25 of the 30 replicas will start and the remaining five will stay pending. Describing one of them, we can see the following error:

    EKS Pod FailedScheduling
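
To see where these limits come from, the ENI and per-ENI address limits of an instance type can be queried with the AWS CLI. A quick sketch for the m6i.large used above:

# Query how many ENIs and how many IPv4 addresses per ENI an m6i.large supports
aws ec2 describe-instance-types --instance-types m6i.large \
  --query 'InstanceTypes[].NetworkInfo.{MaxENIs:MaximumNetworkInterfaces,IPv4PerENI:Ipv4AddressesPerInterface}' \
  --output table

With 3 ENIs and 10 addresses each, the instance can hold 30 addresses in total, matching the 29 usable pod addresses once the node’s own address is subtracted.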

The solution

Luckily, there is a simple solution for that. Instead of attaching single IP addresses, it is possible to assign /28 IPv4 prefixes to the ENIs of EC2 instances (known as prefix delegation). Since each /28 prefix contains 16 addresses, this can be used to run (at least in theory) around 16x as many pods on each EC2 node. For that, a few configuration changes to the VPC CNI plugin are needed. For example:

# Delete the deployment to free addresses
kubectl delete deployment nginx  

# Configure VPC CNI to assign /28 address prefixes instead of single addresses
kubectl set env daemonset aws-node -n kube-system ENABLE_PREFIX_DELEGATION=true
# Keep one spare /28 prefix attached as a warm pool
kubectl set env ds aws-node -n kube-system WARM_PREFIX_TARGET=1
# Keep at least 5 unused IP addresses ready for new pods
kubectl set env ds aws-node -n kube-system WARM_IP_TARGET=5
# Never allocate fewer than 2 IP addresses to the node
kubectl set env ds aws-node -n kube-system MINIMUM_IP_TARGET=2

# Wait for a new aws-node pod to come up and test again
kubectl create deployment nginx --image=nginx --replicas=100
kubectl get deployment -w
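
To confirm that prefix delegation is actually in effect, you can look at the prefixes attached to the node’s ENIs. A small sketch, where <node-instance-id> is a placeholder for the EC2 instance ID of the node:

# Check whether /28 prefixes are now attached to the node's ENIs
aws ec2 describe-network-interfaces \
  --filters Name=attachment.instance-id,Values=<node-instance-id> \
  --query 'NetworkInterfaces[].Ipv4Prefixes'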

Cool, it works! We more than tripled the number of pods on our node! Can we do more?

kubectl scale deployment nginx --replicas=500
kubectl get deployment -w

EKS Node NotReady
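
To see what exactly is going wrong on the node, its status and conditions are a good starting point. Illustrative commands, where <node-name> is a placeholder for the node’s name:

# The node reports NotReady under this load
kubectl get nodes

# Inspect the node's conditions, e.g. memory or PID pressure
kubectl describe node <node-name> | grep -A 8 "Conditions:"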

IPv4 addresses are not the limiting factor any more, but the node is breaking down. This seems reasonable given the number of processes it has to run. This leads us to another question, though.

How many pods should I run on a node?

AWS offers a simple shell script, max-pods-calculator.sh, for calculating the maximum number of pods that should run on a specific instance type.
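
The script is published in the awslabs/amazon-eks-ami repository on GitHub; the download path below is the one documented at the time of writing and may have moved since:

# Download the max-pods calculator script and make it executable
# (the path within the amazon-eks-ami repository may have changed)
curl -O https://raw.githubusercontent.com/awslabs/amazon-eks-ami/master/files/max-pods-calculator.sh
chmod +x max-pods-calculator.sh

By default, the script calculates the number of pods with the default behaviour of the VPC CNI: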

./max-pods-calculator.sh --instance-type m6i.large --cni-version 1.9.0-eksbuild.1
29

Using a flag, we can calculate the number of pods with prefix delegation enabled:

./max-pods-calculator.sh --instance-type m6i.large --cni-version 1.9.0-eksbuild.1  --cni-prefix-delegation-enabled
110

The --max-pods-per-node parameter in the first code snippet above should be set to this value. 1000, as used in the demonstration, is not a good value.
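
For this example, recreating the node group with a realistic pod limit could look like the sketch below. It reuses the eksctl command from the beginning; the node group name m6i-large-v2 is just a placeholder for a fresh group:

# Create a node group with a pod limit matching the calculator output for m6i.large
eksctl create nodegroup --cluster my-cluster --managed=false \
  --node-type m6i.large --nodes 1 --max-pods-per-node 110 --name m6i-large-v2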

The alternative solution

An alternative solution is installing an alternate CNI plugin instead of the default VPC CNI plugin for cluster networking. This results in IP addresses not being provided from the VPC CIDR, but from a virtual (potentially unlimited) address space. However, having another virtual IP address space within the environment reduces transparency over your networking, and AWS does not provide support for these alternative CNI plugins.

While authoring this article, I experimented a bit with alternative CNI plugins and found that configuring them is fairly easy. However, using IP prefixes already remediated the issue, so there was no reason to pursue this path further. My strong recommendation is the VPC CNI plugin for the reasons mentioned above: transparency, ease of use, and support.
