OOM Killer is Silent in Kubernetes

When it comes to resource management in Kubernetes, one of the most common pieces of advice you come across is that a container's memory limit should be equal to its memory request. If you're like me and like learning things the hard way, you've probably lost a node to a spike in memory usage by one of the pods. Setting requests and limits this way is safer, but it is a delicate balancing act between resource utilization and the stability of your pods. If you set the values too high, your actual resource utilization can be low, and if you set them too low, your application will keep getting killed as it tries to consume more memory than you specified.

If you're frugal when it comes to handing out resources to your workloads, or your workloads are poorly coded or handle highly unpredictable loads, you've probably seen containers being restarted with the reason OOMKilled. This means that one of the processes inside the container tried to allocate an amount of memory that would have put the container's total memory usage above its limit. In this case the process that got killed was the main process inside the container, which caused the restart.

However, many apps run multiple processes, and there is no guarantee that the main process is the one that gets killed. This means that if your app uses multiple processes and doesn't monitor them properly, one of them might get killed without you ever knowing.
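To make "monitoring them properly" a bit more concrete, here is a minimal Python sketch of a parent process that checks how its worker ended. The function names describe_exit and run_worker are made up for illustration. It relies on the fact that the OOM killer terminates a process with SIGKILL, which Python's subprocess module reports as a negative return code. Note that SIGKILL alone does not prove an OOM kill, since anything can send that signal, so treat it as a hint to go look at the metrics.

```python
import signal
import subprocess


def describe_exit(returncode):
    """Return a human-readable reason for a child's exit status."""
    if returncode < 0:
        # subprocess reports "killed by signal N" as -N on POSIX systems.
        signum = -returncode
        if signum == signal.SIGKILL:
            return "killed by SIGKILL (possibly the OOM killer)"
        try:
            return f"terminated by {signal.Signals(signum).name}"
        except ValueError:
            return f"terminated by signal {signum}"
    return f"exited with status {returncode}"


def run_worker(cmd):
    """Run a worker process, wait for it, and report how it ended.

    cmd is a hypothetical worker command given as a list of arguments.
    """
    proc = subprocess.Popen(cmd)
    returncode = proc.wait()
    return returncode, describe_exit(returncode)
```

A parent that logs the result of run_worker (or exits itself when a worker is killed) at least turns a silent kill into something visible in logs or in the container's restart count.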

If you're running Kubernetes 1.24 or higher and using Prometheus, you can collect the cAdvisor metric container_oom_events_total and create alerts on it. If you're on an older version of Kubernetes, you can use node exporter to collect node_vmstat_oom_kill and set up alerts to let you know when a process gets OOM killed. However, this will only tell you on which node it happened, and you'd still need to figure out in which container it happened (you can use the container_processes cAdvisor metric for that).
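If you want to sanity-check the node-level number by hand, node_vmstat_oom_kill mirrors the oom_kill counter in /proc/vmstat, which reasonably recent kernels expose. The sketch below reads that counter directly. The function names are hypothetical, and it assumes the usual "name value" layout of /proc/vmstat.

```python
def parse_oom_kill(vmstat_text):
    """Return the oom_kill counter from /proc/vmstat content, or None if absent."""
    for line in vmstat_text.splitlines():
        fields = line.split()
        # Each line of /proc/vmstat is expected to be "<name> <value>".
        if len(fields) == 2 and fields[0] == "oom_kill":
            return int(fields[1])
    return None


def read_oom_kill(path="/proc/vmstat"):
    """Read the node-wide OOM kill counter from the given vmstat file."""
    with open(path) as f:
        return parse_oom_kill(f.read())
```

Like the metric itself, this counter covers the whole node, so it tells you that something was killed but not which container it belonged to.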

In either case, a naive implementation of the alerts could add noise to your container restart alerts in situations where the process that gets killed is the main process of the container.
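One way to cut that noise is to only flag containers that saw OOM kills but did not restart in the same time window. Here is a rough Python sketch of that filtering logic. The function name and the input shape (dictionaries mapping a container name to the increase of a counter over the window) are assumptions for illustration; in practice the restart numbers could come from something like kube-state-metrics, and the same idea could be expressed in the alert rule itself.

```python
def silent_oom_containers(oom_increase, restart_increase):
    """Return containers with OOM kills but no restart in the same window.

    Both arguments are hypothetical dicts of container name -> counter
    increase over the window being evaluated.
    """
    return sorted(
        name
        for name, ooms in oom_increase.items()
        if ooms > 0 and restart_increase.get(name, 0) == 0
    )
```

For example, a container whose main process was OOM killed shows up in both inputs and is skipped here, because the restart alert already covers it.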
