Stop putting this into your Python Dockerfiles

October 16, 2022

ENV PYTHONDONTWRITEBYTECODE 1 is very common in python dockerfiles. However most of the time it is useless at best and harmful at worst.

Preventing python from caching compiled bytecode

CPython doesn’t execute the code directly. Instead it first compiles the code into bytecode which is cached as .pyc files in __pycache__ directory you know and hate. Having cached the bytecode, the interpreter can skip the compilation of modules that haven’t changed. If for some reason you want to prevent python from caching the bytecode, you can set PYTHONDONTWRITEBYTECODE environment variable to a non-empty string.

Why do it in dockerfiles

The first thing that comes to mind when thinking about potential problems with caching things is the possibility of getting a stale version of the resource we are caching. Since code doesn’t change during container runtime (unless you are doing some weird stuff) this is unlikely to cause any problems and can be ruled out as a reason to disable bytecode caching in containers.

Another potential issue is that pycs are not deterministic by default, which is explained in more detail and addressed by the PEP-552. This can certainly be fixed by not generating the pycs at all but this can also be fixed by using a different validation mode (e.g. by setting SOURCE_DATE_EPOCH environment variable). In addition to that most people don’t really care about perfect reproducibility in their image builds. They only build new images once when something changes, relying on cache from previous builds to avoid rebuilding layers whose dependencies didn’t change.

Finally the most likely motivation behind putting PYTHONDONTWRITEBYTECODE in dockerfiles is to reduce the size of the image. Since we want to make our images as small as possible it may make sense not to store the cache files. However in most cases PYTHONDONTWRITEBYTECODE saves us NOTHING AT ALL!

pip doesn’t care about this option

Since you usually aren’t running your app during the build process, no pyc files are generated for your code. Now you need to peek into a freshly created vritualenv to see if they are generated for your dependencies during installation. If you do that you’ll see that they indeed are and maybe your project has a lot of them, so eliminating them could surely shave off some MBs from your image. However you’ll be disappointed to find out that this doesn’t happen when you set PYTHONDONTWRITEBYTECODE=1:

export PYTHONDONTWRITEBYTECODE=1
virtualenv venv
. venv/bin/activate
pip install django
find . **/django/**/*.pyc

It should come as no surprise though, as pip doesn’t generate the bytecode by importing the packages but by explicitly compiling them. To prevent this you can pass --no-compile flag to pip install. poetry, on the other hand, does the opposite by default since version 1.4.0 (prior to that version you had no way of preventing it from generating the pycs) and instead exposes a --compile flag to perform the compilation.

How much does it save you

As explained above, PYTHONDONTWRITEBYTECODE doesn’t save you anything, but running pip install with --no-compile or a regular poetry install will net you decent savings on a decently-sized project. For example on an 18K LOC fastapi monolith I have at work, running poetry install without --compile saves 63MB, which is almost a 15% reduction in image size. This obviously doesn’t come for free, so let’s explore the downsides.

How it hurts you

Not generating means the modules need to be compiled during import time. Since most imports occur at startup this means that disabling bytecode caching mostly impacts the startup of the application. This is especially bad if you are using a pre-fork WSGI server like Gunicorn without preloading the app because each worker will need to go through the full compilation process, wasting CPU cycles and slowing down the worker startup. It will also impact worker restarts if you used something like gunicorn max-requests in your config. Another issue is that some dependencies may not be imported during the app startup, leading to them being imported when the first request hits a route that uses them leading to unexpected latency spikes. Heavy dependencies like pandas and numpy which take ages to import even with the bytecode cache compiled, can easily add a couple of seconds to the action that first triggered their import.

TL;DR

You probably want pip install --no-compile instead of PYTHONDONTWRITEBYTECODE=1, even though you probably shouldn’t use either one of them.

If you've come this far with the article you may want to know a thing or two about me if you don't already. You can also read other blog posts or about stuff I've learned recently.

This website is open source. If you've come across a mistake please let me know there. For other types of feedback you can reach out to me through email or social media.