This post is the last in a series discussing the Neon outages on 2025-05-16 and 2025-05-19 in our AWS us-east-1 region. It covers the IP allocation failures that persisted through the majority of the disruption. For further details, read our top-level Post-Mortem here.
Summary
Neon separates Storage and Compute to provide Serverless Postgres. Our Compute Instances run in lightweight Virtual Machines in Kubernetes, each Compute running in its own Pod.
On 2025-05-16, the Neon Control Plane’s periodic job responsible for terminating idle Computes started failing, eventually resulting in our VPC subnets running out of IP addresses in two of three availability zones. Configuration changes to AWS CNI to free up IP addresses, while beneficial in the immediate term, later prevented returning to a healthy state. A post-incident follow-up on 2025-05-19 to revert this temporary configuration resulted in similar issues.
During this investigation, we have learned a lot about the behaviour of the AWS CNI plugin, how it interacts with our highly-dynamic environment, and have filed an improvement PR.
This article covers how the incident happened and details about what we learned about AWS CNI through our post-mortem and root-cause investigation.
Glossary of terms
- AWS CNI: refers to the AWS VPC CNI plugin. A more in-depth description of AWS CNI is provided below.
- ipamd: part of AWS CNI, refers to the L-IPAM daemon
- AWS ENI: AWS Elastic Network Interface; ENIs are allocated to EC2 instances and are associated with a subnet
- AWS VPC: logically isolated virtual network provided by AWS
- AWS VPC subnet (or subnet): represents a range of IP addresses in a VPC
- Allocated IPs: AWS subnet IPs allocated to ENIs
- Assigned IPs: IP addresses assigned to Kubernetes Pods (most Pods in our clusters are Neon Computes)
- Total IPs: total IP addresses available for allocation in a subnet (or subnets)
2025-05-16: Running out of IP addresses
Neon operates Kubernetes clusters in 11 cloud regions. Our us-east-1 cluster in AWS typically operates at a daily peak of 6,000 running databases (which we call Computes), with an incoming rate of 500 new Pods started every minute and a similar rate of idle databases being terminated.
When the incident started, the job responsible for shutting down idle databases failed (we have described this in more detail in a separate post). As terminations were not processed, but creations continued, the number of running Computes quickly rose past our cluster’s typical operating conditions, reaching ~8k active Computes in the space of a few minutes.
At ~8k active computes, our AWS VPC subnets ran out of IPv4 addresses. This was unexpected, as we test our clusters for up to 10k Computes, and our subnets were sized to a total of 12k IP addresses!
A summary of the conditions that led to IP allocation unavailability
- With its default settings, AWS CNI reserves at least 1-2 extra ENIs worth of IP addresses on each node
- Our nodes can utilize up to 49 IPv4 addresses per ENI
- Our AWS us-east-1 region only had 12k total IP addresses instead of the 24k we have in other regions.
- During the incident, we had ~4k extra IPs allocated on nodes that didn’t have enough CPU or memory available for new compute Pods to be scheduled.
- As a result, we became unable to start new computes while only 8k of 12k IPs were assigned to compute Pods — at the time, this was confusing and unexpected.
Aside: Why only 12k IP addresses?
us-east-1 was one of our first regions, and we hadn’t originally planned to run the cluster at this scale. Our load testing had indicated that our clusters could scale vertically up to 10k Computes, but that beyond that point we would need to scale out horizontally.
Even though each of our three subnets was configured with a /20 CIDR block (half the size of our other clusters), we assumed we would always have sufficient available IPs due to the identified upper bound of 10k active Computes.
The rate of growth of our service in recent months has been faster than anticipated, so we’ve been working in parallel on deeper architectural changes to support horizontal scaling. We will post articles describing the new architecture after we launch it.
Background: What is AWS CNI? How does it work?
Explaining the behaviour we saw requires some understanding of how AWS CNI works.
The Kubernetes Container Networking Interface (CNI) is the standard interface used for configuring Pod networking in Kubernetes. CNI plugins are called by the container runtime to set up (“add”) and tear down (“del”) networking for each Pod. “AWS CNI” is how we refer to the AWS VPC CNI plugin. At the time of this incident we were using AWS CNI v1.18.6.
Each Pod needs an IP address for networking within the cluster, and AWS CNI’s job is mostly assigning IP addresses to Pods, pulling from the appropriate VPC subnet. Internally, the CNI plugin itself makes RPC calls to ipamd — the host daemon on each node, responsible for allocating IPs from the subnet onto the ENIs attached to the EC2 instance and handing those out to Pods.
To isolate Pod starts from AWS API calls (and vice versa), ipamd keeps a pool of IP addresses – more than is strictly necessary for the number of Pods on the node. The pool is resized every few seconds by a separate reconcile loop, outside the context of any individual CNI request.
AWS CNI has several configuration options to influence how it manages its pool of IP addresses. We include details about our choice of options below.
A quick recap
Our AWS us-east-1 cluster typically operates with 5-6k active Computes. We run our Compute Pods on m6id.metal AWS instances, with 49 IP addresses per ENI (plus one IP address assigned to the network interface itself). In theory, these instances can support up to 737 pods each (or more, with prefix delegation) — in practice, we tend to run 100-400 pods per node.
It’s worth mentioning that not all databases are equal — the number of running compute Pods on any Kubernetes node is dynamic and depends on the size of the scheduled workloads. For example, a 128 CPU node can run 128 pods with 1 CPU each, 4 pods with 32 CPUs each, or any combination in between.
During Friday’s incident, our Control Plane became unable to terminate idle databases. This resulted in the number of active Compute Pods quickly rising from ~5k to ~8.1k. As new Pods exhausted all schedulable CPU and memory across the cluster, our cluster-autoscaler added more nodes.
At this point, we had old nodes without CPU or memory capacity, but with many additional allocated IPs that could never be assigned to Pods due to these scheduling constraints. This issue was not clear to us at the time.
As more nodes were added, we started observing IP allocation errors when new Pods were scheduled but were unable to start.
Why did we run out of IP addresses?
Prior to the incident on Friday, we were using the default AWS CNI configuration (WARM_ENI_TARGET=1 and WARM_IP_TARGET unset; more on these later).
Prior to the incident, each of the cluster’s 3 subnets had 3.7-3.9k allocated IP addresses (stored in ipamd’s IP pools), with only 1.6-2.3k IP addresses assigned to Pods (~50% utilization). Each of our three subnets was configured with a /20 CIDR block. This meant we had up to (4096 – 5) × 3 = 12,273 total IP addresses (5 IPs in each VPC subnet are reserved).
During the incident, with the sudden increase in running Pods, the cluster had assigned ~8.8k total IP addresses (71%). However, across our three subnets, 99% of all IPs were allocated, totaling ~12,200 out of 12,273.
Because only 8.8k IP addresses were assigned, we expected the remaining allocated IP addresses to be assignable to Pods. The result was different and unexpected: those IPs were, in practice, unusable. They were held by old nodes already at CPU/memory capacity, and AWS CNI was not releasing them back to the subnets.
Because of this, the subnet metrics made it look as though there were plenty of unassigned IPs available for new Pods.
In practice, as new nodes were added, they were unable to obtain enough IPs to match their available CPU/memory capacity.
Why were there so many IPs allocated to nodes with no spare resources?
Overallocation of IPs has to do with AWS CNI’s behavior under the default settings, which have WARM_ENI_TARGET=1 and WARM_IP_TARGET/MINIMUM_IP_TARGET unset:
- Whenever ipamd sees that the number of “available” IP addresses (allocated minus assigned) is less than WARM_ENI_TARGET × (IPs per ENI), it will attempt to allocate more.
- Allocating more IPs – if none of the existing ENIs have room – will attempt to allocate an ENI’s worth of IP addresses [1, 2, 3]; a simplified sketch of this decision logic follows this list
- Releasing IPs is desired when the number of “available” IP addresses is more than (WARM_ENI_TARGET + 1) × (IPs per ENI), i.e. when an entire ENI’s worth of IPs could be removed without falling below the target.
- However, releasing IPs can only happen when there’s an ENI with no assigned IPs, because this configuration limits removal to happen only at the ENI level.
- Critically, IP assignment to Pods randomly picks among all ENIs (because ENIs are stored in a hashmap, and in Go, hashmap iteration order is random!)
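To make this concrete, here is a minimal sketch of the ENI-level logic described above, written in Go (the language AWS CNI is implemented in). It is our own paraphrase for illustration, not the actual ipamd source; the type names, the numbers, and the simplification of always attaching a new ENI are ours.

```go
// Illustrative sketch of ipamd's warm-pool logic under the default
// configuration (WARM_ENI_TARGET=1, WARM_IP_TARGET unset). A paraphrase of
// the behaviour described above, not the actual AWS CNI source.
package main

import "fmt"

type eni struct {
	ips      int // IPs allocated to this ENI from the subnet
	assigned int // IPs currently assigned to Pods
}

type node struct {
	enis          []eni
	warmENITarget int // default: 1
	ipsPerENI     int // 49 usable IPs per ENI on m6id.metal
}

func (n *node) available() int {
	total := 0
	for _, e := range n.enis {
		total += e.ips - e.assigned
	}
	return total
}

// reconcile mirrors ipamd's periodic loop under the default settings: grow by
// a whole ENI's worth of IPs when below the warm target, and shrink only when
// an entire ENI has no assigned IPs.
func (n *node) reconcile() {
	target := n.warmENITarget * n.ipsPerENI

	if n.available() < target {
		// Simplification: always attach a new ENI; the real plugin first tops
		// up existing ENIs that still have room.
		n.enis = append(n.enis, eni{ips: n.ipsPerENI})
		return
	}

	if n.available() > (n.warmENITarget+1)*n.ipsPerENI {
		for i, e := range n.enis {
			if e.assigned == 0 { // only a fully-unused ENI can be released
				n.enis = append(n.enis[:i], n.enis[i+1:]...)
				return
			}
		}
		// No fully-unused ENI exists: nothing is freed, even though many IPs
		// on this node are idle.
	}
}

func main() {
	// Two ENIs, 80 of 98 IPs assigned: available=18 < 49, so a third ENI is
	// attached even if the node has no CPU/memory left for new Pods.
	n := node{enis: []eni{{49, 40}, {49, 40}}, warmENITarget: 1, ipsPerENI: 49}
	n.reconcile()
	fmt.Println(len(n.enis), n.available()) // prints: 3 67
}
```

The key property is visible in the release path: unless an entire ENI has zero assigned IPs, nothing is returned to the subnet.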
The target for available IPs means that ipamd must allocate 50-100 IP addresses above what’s needed by Pods (50 IPs per ENI for the m6id.metal instance type). Because we have a steady rate of incoming Pods in our cluster, the random distribution of Pods onto ENIs keeps all ENIs in use; once an ENI has been added, its IP addresses are never freed back to the subnet.
We were surprised to find this, so we have opened a PR to AWS CNI to improve ipamd’s behavior under these circumstances.
As an example of just how severe this can be, consider a node that very occasionally peaks at 400 active Pods, but normally runs workloads large enough that its CPU/memory capacity only supports 200 active Pods. We might see a sequence of events such as the following (a rough simulation follows this list):
- A burst of Pod starts causes the node to have 400 active pods, with no available IPs left
- ipamd sees that the number of “available” IPs is low, and allocates more (with a new ENI) from the VPC subnet, aiming to always maintain at least 50 available (extra) IPs
- As older Pods on the node are removed, their IP addresses are kept in cooldown for 30s before they can be reused – so IP addresses on the new ENIs must be used for new Pod starts during this period
- After the load subsides, the random distribution of Pods onto ENIs results in a high probability of having at least one Pod per ENI, causing all 50 IPs per ENI to remain allocated to the node (remember: the entire ENI must be unused to remove any of the IPs.)
- As a result: This node is left with 200 running pods but 450 allocated IPs!
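To see how unlikely it is for an ENI to end up completely unused in this example, here is a rough Monte Carlo sketch. It is our own illustration: the ENI and Pod counts are the assumed numbers from the example above, and uniform random placement is a simplification of ipamd’s hashmap-iteration behavior.

```go
// Back-of-the-envelope simulation: 200 surviving Pods spread (roughly)
// uniformly over 9 ENIs. If every ENI ends up with at least one Pod, none of
// the 450 allocated IPs can be released. The numbers are assumptions for
// illustration, not measurements.
package main

import (
	"fmt"
	"math/rand"
)

func main() {
	const (
		enis   = 9
		pods   = 200
		trials = 10000
	)

	releasable := 0
	for t := 0; t < trials; t++ {
		podsPerENI := make([]int, enis)
		for p := 0; p < pods; p++ {
			podsPerENI[rand.Intn(enis)]++ // ipamd picks an ENI pseudo-randomly
		}
		for _, count := range podsPerENI {
			if count == 0 {
				releasable++ // at least one ENI could be detached
				break
			}
		}
	}

	fmt.Printf("trials with any releasable ENI: %d of %d\n", releasable, trials)
	// With these numbers this prints 0 (or very close to it): every ENI almost
	// always keeps at least one Pod, so the node holds on to all 450 IPs.
}
```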
This happened enough in practice that almost all the IP addresses in our VPC subnets were either assigned to Pods (~8.8k) or allocated to nodes with no remaining CPU/Memory capacity (~3.4k, including ENI device IPs). Once we ran out of IPs in the subnet, we were unable to start new compute Pods, which in turn meant that idle databases couldn’t be woken up.
That still leaves ~100 IP addresses in our subnets unaccounted for – we’re not certain why they weren’t allocated (and we are following up with AWS Support to understand why this was the case). However, an extra 100 IPs likely wouldn’t have helped much, since our cluster needed an additional ~3.4k IPs.
2025-05-16: WARM_IP_TARGET=1, Releasing IPs, IPs still not assignable
After our VPC subnets ran out of IP addresses, we looked to simultaneously unblock further compute Pod creation and fix our control plane’s periodic job so that old computes were terminated.
To unblock compute Pod creation, we set WARM_IP_TARGET=1. This had the immediate intended and expected effect – freeing allocated IP addresses from nodes that couldn’t use them, and allowing more Pods to start.
Once our control plane started successfully terminating idle Computes, we observed a significant drop in the rate of successful Pod starts. As we later found out, WARM_IP_TARGET=1 unexpectedly prevents new Pod starts for 30s after each Pod deletion.
Background: What does WARM_IP_TARGET do?
Above, we described how WARM_ENI_TARGET=<T> works: ipamd ensures that there are at least T ENIs’ worth of extra IP addresses allocated to the node, only freeing them when an entire ENI is unused.
In contrast, when WARM_IP_TARGET=<N> is set, ipamd attempts to maintain exactly N extra IP addresses on the node. More IP addresses are allocated when fewer than N are available; extra IP addresses are freed if there are more than N available.
If both WARM_IP_TARGET and WARM_ENI_TARGET are set, WARM_IP_TARGET takes precedence.
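The difference between the two modes can be summarized in a small sketch of the target calculation, again our own paraphrase of the documented behaviour rather than AWS CNI’s code:

```go
// Minimal sketch of how the warm-pool target changes once WARM_IP_TARGET is
// set. A paraphrase of the behaviour described above, not AWS CNI's code.
package main

import "fmt"

// surplus returns how many IPs ipamd wants to add (positive) or free
// (negative) for a node, given the configuration and the current pool state.
func surplus(warmIPTarget, warmENITarget, ipsPerENI, allocated, assigned int) int {
	available := allocated - assigned

	if warmIPTarget > 0 {
		// WARM_IP_TARGET takes precedence: aim for exactly N spare IPs,
		// adding or releasing individual IPs as needed.
		return warmIPTarget - available
	}

	// Otherwise, fall back to the ENI-level logic from the previous section:
	// grow when below one ENI's worth of spare IPs, shrink only in whole-ENI
	// steps (and only if an entire ENI is completely unused).
	if available < warmENITarget*ipsPerENI {
		return ipsPerENI
	}
	if available > (warmENITarget+1)*ipsPerENI {
		return -ipsPerENI
	}
	return 0
}

func main() {
	// Same pool state (300 allocated, 280 assigned), two configurations:
	fmt.Println(surplus(1, 0, 49, 300, 280)) // WARM_IP_TARGET=1  -> -19 (free 19 IPs)
	fmt.Println(surplus(0, 1, 49, 300, 280)) // WARM_ENI_TARGET=1 -> 49 (allocate a full ENI)
}
```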
Why set WARM_IP_TARGET=1?
During the incident, we observed that our VPC subnets were out of IP addresses with only 72% of those IPs actually assigned to Pods. We inferred that those unused IPs must have been allocated to nodes with no room to start new Pods and looked for a quick way to free them up.
Setting WARM_IP_TARGET appeared to be the most straightforward option.
At the time, we were operating AWS CNI with the default configuration options. During the incident, we misread the documentation, mistakenly believing that the default value for WARM_IP_TARGET was 5 when unset, leading us to decide to “reduce” it to one.
However, when this parameter is unset, AWS CNI actually bases its logic on WARM_ENI_TARGET, which has substantially different behaviour from using WARM_IP_TARGET. Unbeknownst to us, this had much larger implications than just reducing the value, many of which we didn’t understand until much later.
What happened with WARM_IP_TARGET=1?
The immediate effects were as we expected: Thousands of IP addresses were returned to the VPC subnets from nodes that couldn’t use them, and subsequently allocated by nodes with CPU / memory capacity to start Pods. New Pods started on these nodes, and when we eventually hit ~10k concurrent Pods, our rate of starts slowed again due to limits we’d previously identified in load testing (e.g. kube_proxy sync latency).
Soon after, we stabilized our Control Plane’s failing job and ~7k idle computes were terminated. However, the rate of successful Pod starts remained far below the pre-incident baseline.
Investigation at the time showed that most of our new Pods were failing to start due to IP assignment issues — even though VPC subnet metrics confirmed that there were thousands of unallocated IP addresses that should have been available.
At the time, we couldn’t figure out why IP allocation was failing. The AWS CNI documentation mentioned that WARM_IP_TARGET can trigger rate-limiting on EC2 API requests; however, this was not a problem we expected with only 1-2 Pod starts per node per minute. Shouldn’t ipamd be able to request more than that?
We spent much of the time following this incident digging through AWS CNI’s source code to understand its behaviour, cross-referencing with metrics and logs we’d captured from the time of the incident.
Why did WARM_IP_TARGET=1 prevent Pods from starting?
Broadly, AWS CNI has two methods of operation, depending on whether WARM_IP_TARGET and/or MINIMUM_IP_TARGET are specified (internally referred to as the “warm target” being “defined”).
We described the default above – if there isn’t a warm target, ipamd relies on WARM_ENI_TARGET’s value to determine how many IPs to allocate. But with WARM_IP_TARGET set, ipamd has the following behavior:
- When the number of unused IP addresses is less than the warm target, more IPs are allocated from the VPC subnet until there are WARM_IP_TARGET available IPs [1, 2, 3] (automatically adding ENIs as needed).
- If the number of unused IP addresses is more than the warm target, unassigned IPs are returned to the VPC subnet (subject to rate limiting).
- As with other configurations, IP addresses go into “cooldown” for 30 seconds after being unassigned, during which they cannot be reused.
- Crucially, and perhaps unexpectedly, IP addresses in cooldown count towards the warm target.
This combination of factors means that setting WARM_IP_TARGET=1 can prevent all Pod starts on a node for 30 seconds after each Pod removal, because ipamd will ensure that there’s exactly one “available” IP address (even if that IP address can’t be assigned due to the 30-second cooldown period).
That’s why we only saw this problem after our control plane started terminating idle computes. When no Pods are removed, ipamd can sustain a high rate of Pod starts, in spite of the small warm IP target. Deleting Pods, however, can simultaneously prevent assigning existing allocated IPs while also preventing ipamd from allocating more (because WARM_IP_TARGET=1 is guaranteed to be satisfied while any IP address is in cooldown).
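A short worked example makes the failure mode clearer. The numbers below are assumed for illustration, and the state transitions are our paraphrase of the behaviour we observed, not a trace from ipamd itself:

```go
// Worked example of the WARM_IP_TARGET=1 failure mode, with assumed numbers.
package main

import "fmt"

func main() {
	// Node state immediately after a Pod is deleted, with WARM_IP_TARGET=1:
	allocated := 120 // IPs held by the node's ENIs
	assigned := 119  // IPs bound to running Pods
	inCooldown := 1  // the IP just freed; unusable for 30 seconds

	available := allocated - assigned // = 1, and it is the cooldown IP

	warmIPTarget := 1
	needMore := available < warmIPTarget // false: the cooldown IP satisfies the target

	// An incoming Pod start now fails: the only "available" IP is in cooldown,
	// and ipamd sees no reason to allocate another one.
	assignable := available - inCooldown // = 0
	fmt.Println(needMore, assignable)    // prints: false 0
}
```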
To make matters worse, our control plane retries compute creation if it doesn’t succeed within the timeout window (currently 2 minutes). Combined with our preexisting long Pod startup times, these retries exacerbated the problem: many successfully started Pods were deleted shortly before the rest of their setup could complete, each deletion resetting the 30-second countdown before more Pods could start.
Back to the incident: What did we do at the time?
At the time, we didn’t understand why our rate of successful Pod starts was so low.
We thought that it was theoretically possible that there were still unused IP addresses somewhere, or maybe WARM_IP_TARGET=1 was misbehaving. So, in a last-ditch effort to free up any other allocated IP addresses, we reduced WARM_IP_TARGET even further, to zero, also setting MINIMUM_IP_TARGET=0 and WARM_ENI_TARGET=0. This helped!
Unbeknownst to us at the time, setting WARM_IP_TARGET to zero is equivalent to disabling it, resulting in ipamd using the WARM_ENI_TARGET logic.
There were two key side effects of this configuration:
- Similarly to before the incident, ipamd effectively stopped returning IP addresses to the subnet
- New IP allocations from the subnet attempted to reserve as many IPs as could fit on the ENI (instead of one IP at a time)
Together, these resulted in enough IP addresses being allocated to the nodes, allowing the cluster to stabilize. The average number of allocated IP addresses on each node increased from ~75 per node to ~250, and our rate of successful Pod starts returned to normal.
We continued to monitor the cluster over the weekend and observed a stable state until the following Monday.
2025-05-19: AWS CNI config change goes wrong, reverting doesn’t help
The following Monday, as an incident follow-up, we decided to revert the final change to our AWS CNI configuration. We believed that the state of our us-east-1 cluster after Friday’s incident was not stable, and thought that switching back to WARM_IP_TARGET=1 would help.
At the time, we were concerned that AWS CNI’s behavior with WARM_IP_TARGET=0 was unspecified, and believed that WARM_IP_TARGET=1 would be more stable.
Rolling out WARM_IP_TARGET=1 triggered the same behavior where deleting Pods interfered with our ability to start new ones. Upon observing the same conditions, we reverted that change, going back to WARM_IP_TARGET=0. However, IP assignment errors continued.
We compensated for the high error rates by increasing the size of our pre-created compute pools. The IP assignment errors continued for hours afterwards, until an 86-second window in which our control plane didn’t stop any Pods, allowing more IPs to be allocated and resolving the errors.
Why set WARM_IP_TARGET=1 again?
As is often the case, we knew much less then than we do now.
We were becoming less certain about the behavior of WARM_IP_TARGET=0. In one place, the documentation said that zero was equivalent to “not setting the variable”, but that the default was “None”, leaving us uncertain about the actual behavior with that configuration. If zero were the default, that would have been the same configuration that originally caused us to run out of IP addresses.
We also suspected that Friday’s IP assignment issues may have been resolved by coincidence and not by setting WARM_IP_TARGET=0. For example, we saw temporary improvements every time we restarted the aws-node DaemonSet (which reinitializes ipamd) — the symptoms could have been resolved by the final restart.
Remembering that setting WARM_IP_TARGET=1 had initially helped on Friday, we believed that it was likely to be more stable than the unknown situation we found ourselves in.
In hindsight, this was a mistake. Further, reverting back to WARM_IP_TARGET=0 was not sufficient to recover from the resulting degraded state.
What happened with WARM_IP_TARGET=1 this time?
IP assignment immediately started failing, and with it, our rate of successful Pod starts dropped to the same level as it was with WARM_IP_TARGET=1 on Friday.
This was unexpected. At the time, we thought Friday’s IP assignment errors were due to the cluster being left in a bad state after our VPC subnets ran out of IP addresses. Here, the errors started from a stable state — clearly inconsistent with our understanding.
Aiming to avoid further outages, we wanted to be sure of any additional configuration changes. We took some time to examine ipamd logs and eventually determined that there were likely specific issues with WARM_IP_TARGET=1.
We reverted back to WARM_IP_TARGET=0, but continued to see IP assignment errors.
Why didn’t reverting fix the issue?
It was very unexpected that issues persisted after reverting WARM_IP_TARGET back to 0. This was the healthy state through the weekend, so why didn’t it work now!?
The rate of errors had decreased enough for more Pods to get through, but the overall success rate remained far below expectations.
In our investigation over the following weeks, we attained a deeper understanding of the AWS CNI codebase.
When WARM_IP_TARGET=0 and MINIMUM_IP_TARGET=0, AWS CNI uses the behavior for WARM_ENI_TARGET — even though we had WARM_ENI_TARGET=0 as well.
Under these conditions, ipamd will only allocate more IP addresses to the node if there are no available IP addresses. Combined with our earlier finding that IP addresses in cooldown count towards the number of available addresses, this means that these settings only allow allocating more IP addresses when both of the following hold (a condensed sketch follows this list):
- All of the IP addresses on the node are assigned to Pods; and
- No Pods on the node were removed in the last 30 seconds
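In other words, the allocation condition collapses to a single check, sketched below. As before, this is our paraphrase of the behaviour described above, not AWS CNI’s own code:

```go
// Condensed sketch of the condition we were stuck behind with
// WARM_IP_TARGET=0, MINIMUM_IP_TARGET=0 and WARM_ENI_TARGET=0.
package main

import "fmt"

// canAllocateMore reports whether ipamd would allocate additional IPs to the
// node under this configuration. Because IPs in their 30-second cooldown
// still count as unassigned, "available == 0" can only be true when every
// allocated IP is bound to a Pod and no Pod was deleted in the last 30s.
func canAllocateMore(allocated, assigned int) bool {
	available := allocated - assigned
	return available == 0
}

func main() {
	fmt.Println(canAllocateMore(100, 100)) // true: every IP assigned, nothing in cooldown
	fmt.Println(canAllocateMore(100, 99))  // false: e.g. one IP freed by a recent Pod deletion
}
```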
Setting WARM_IP_TARGET=1 released many of the IP addresses on our nodes. Setting it back to zero while we continued to have high Pod churn meant that ipamd never saw the necessary conditions to reallocate those IP addresses. This only happened because we had also set WARM_ENI_TARGET=0.
What did we do to work around errors in the meantime?
Internally, our control plane maintains a pool of “warm” Compute pods, so that Pod starts are not on the hot path for waking an idle database.
We were only seeing ~10% of Pods failing to start, so we were able to compensate for failures by increasing the size of the pool.
This mitigated the user impact as errors continued behind the scenes.
What eventually caused the errors to stop?
Many hours after we’d mitigated user impact, we saw the IP assignment errors suddenly drop to near-zero.
We weren’t sure why at the time, but in the course of our deeper investigation, we found that this recovery was ironically due to the same issue that initially triggered Friday’s incident: Our control plane stopped shutting down idle computes for 86 seconds, due to an expensive query to the backing Postgres database. (We wrote more about this query and our changes to improve its execution plan in this related blog post.)
This brief gap with no Pod deletions meant there were no IP addresses in cooldown, so as new compute starts used up the last available IPs on their nodes, ipamd’s conditions for allocating more IP addresses were finally satisfied. And indeed, we saw simultaneous allocations across the cluster, which in turn reduced the Pod start failure rate.
Final thoughts
This incident resulted in significant downtime for our customers and we were determined to understand the conditions that led to it, so we can prevent it – and incidents like it – from happening again.
Throughout this investigation our team learned a lot about AWS CNI internals, and we’ve even submitted a pull request to help improve the behavior for others.
In keeping with our philosophy of learning from incidents, we decided to make the investigation public. We hope that other teams can benefit from what we’ve learned, helping us all move towards a more reliable future.