Avoid memory leaks when caching instance methods in Python classes

While working on a customer's project last week I stumbled across the problem that caching the results of instance methods in Python classes can easily create memory leaks, especially in long-running processes.

Although Python comes with built-in caching functionality (see the functools module) and there are 3rd-party caching libraries available like cachetools, I couldn't find a proper solution that satisfied my needs. So I wrote my own, using a very nice function from the cachetools library and building a proper decorator around it.

Before using the code shown below, create a virtualenv and install the cachetools package up front so that you can try things out yourself:

$ virtualenv pyenv
$ . pyenv/bin/activate
$ pip install cachetools

Now let's start implementing. A naive approach to applying caching to a very simple class could be:

from cachetools import cached

mycache = {}

class MyClass:
    @cached(mycache)
    def myfunc(self, a):
        print('myfunc called with', a)
        return a * 2

It uses the cached() decorator from the cachetools library and initializes it with a dictionary instance that serves as the actual container for the cached values.

Now let's create an instance of MyClass and call myfunc() twice to see that the cache works:

>>> my = MyClass()
>>> my.myfunc(2)
myfunc called with 2
4
>>> my.myfunc(2)
4

Here we can see that the second call doesn't actually call into the myfunc() function but receives the result of the previous call from the cache. So far so good.

But: What happens if the my-instance gets deleted (and hence garbage collected)?
Does the cache lose its cached results from those calls above?

Well, let's see. In order to prove that an object has really been deleted from the Python interpreter by garbage collection, I create a weak reference to the object I want to delete. If calling the weak reference returns None, the previously referenced object has indeed been deleted.

Note: Garbage collection doesn't necessarily happen immediately after an object has been orphaned. Instead, the Python interpreter calls the garbage collector at certain trigger points, so it can take a while until objects are actually removed. However, the garbage collector can be called manually to enforce the cleanup process, which is what we do in the following example.

Let's demonstrate this:

>>> import weakref, gc
>>> myinst = MyClass()
>>> myinstref = weakref.ref(myinst)
>>> myinstref()
<__main__.MyClass object at 0x7f6565b11a30>
>>> del myinst
>>> gc.collect()   # enforce garbage collection
>>> myinstref() is None
True

So, this worked. Now let's go back to the my-instance created above, and also try to remove it:

>>> myref = weakref.ref(my)
>>> del my
>>> gc.collect()
>>> myref()
<__main__.MyClass object at 0x7f6565a01c40>

Hmm, this didn't work quite as expected: the my instance is still alive. The reason is the cache. When we called my.myfunc(2) earlier, not only the parameter 2 and the result 4 were stored in the cache, but also the reference to self, simply because self is part of the method signature. This can be made visible by inspecting the cache object, which is a plain dictionary:

>>> mycache
{(<__main__.MyClass object at 0x7f6565a01c40>, 2): 4}

As you can see, the cache itself keeps a reference to the my instance, which prevents it from being garbage collected. Now imagine you create thousands or millions of short-lived MyClass instances and only ever make the same call my.myfunc(2) on them. The result would be millions of cache entries for parameter 2 and result 4, each holding a reference to a different MyClass instance that can never be freed. Voilà, here is our memory leak.
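
Here is a minimal sketch of that scenario (assuming the MyClass definition and mycache from above): a thousand short-lived instances are created, and every single one of them stays pinned in memory by its cache entry:

import gc

for _ in range(1000):
    MyClass().myfunc(2)   # cache miss every time: each instance is a new key

gc.collect()
print(len(mycache))       # -> 1001: 1000 new entries plus the earlier one from my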

The solution is to clear the cache, and then our my instance will eventually also be garbage collected:

>>> mycache.clear()
>>> gc.collect()
>>> myref() is None
True

Constructing a cache that frees instances when they are garbage collected

As we have seen before, the weakref module provides a nice way to keep references to objects without holding a firm grip on them. This approach can be used inside the cache as well. Here is an implementation that I came up with on this nice Sunday afternoon.
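
The key building block is WeakKeyDictionary: a mapping whose entries disappear automatically once their key object is garbage collected. A quick interactive sketch (reusing MyClass from above) illustrates this behavior:

>>> import gc
>>> from weakref import WeakKeyDictionary
>>> wd = WeakKeyDictionary()
>>> key = MyClass()
>>> wd[key] = 'some value'
>>> len(wd)
1
>>> del key
>>> gc.collect()
>>> len(wd)
0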

In order to understand this implementation, please make sure you are familiar with how decorators work in general. I will only explain the details specific to this implementation, not the entire principle of decorators.

from cachetools import cachedmethod
from weakref import WeakKeyDictionary

def methodcache(cache_factory):
    def wrapped_methodcache(method):
        # One weak-keyed mapping per decorated method: instance -> its own cache.
        weak_cache = WeakKeyDictionary()

        def get_instance_cache(self):
            try:
                instance_cache = weak_cache[self]
            except KeyError:
                instance_cache = weak_cache[self] = cache_factory()
            return instance_cache

        cached_method = cachedmethod(get_instance_cache)(method)
        # Attach the weak dictionary to the decorator so we can access it from outside:
        cached_method.__cache__ = weak_cache
        return cached_method

    return wrapped_methodcache


# Application:
class MyNewClass:
    @methodcache(dict)
    def myfunc(self, a):
        return a * 2

The core of this decorator is the cachetools.cachedmethod() function (which is itself a decorator). It is initialized with a callback function (get_instance_cache() in this case) that is called each time the decorated method is invoked. The callback receives a reference to the class instance (a MyNewClass instance in the demo above) as the argument self.

The approach is that each decorated method gets its own WeakKeyDictionary instance attached to it. Since get_instance_cache() is defined in the same scope in which weak_cache is assigned the WeakKeyDictionary, it has access to this weak dictionary via its closure. When called, it looks up whether the weak dict already stores a cache for this instance. If not, such a cache is created via the cache_factory callable (a plain dict in the example above). The cache is then returned to the cachedmethod decorator, which is itself responsible for storing and looking up cached values. Because the instance is only referenced as a weak key, its per-instance cache vanishes as soon as the instance is garbage collected.
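
To verify that the new decorator no longer pins instances, a quick session like the following should show the instance being collected and its cache entry disappearing (the decorated function is reached via MyNewClass.myfunc, and __cache__ is the attribute we attached above):

>>> import weakref, gc
>>> obj = MyNewClass()
>>> obj.myfunc(3)
6
>>> len(MyNewClass.myfunc.__cache__)
1
>>> objref = weakref.ref(obj)
>>> del obj
>>> gc.collect()
>>> objref() is None
True
>>> len(MyNewClass.myfunc.__cache__)
0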

I have to admit that this solution is not super easy to understand. But with some experience with decorators in general and a little digging into the code, I hope everyone can profit from this solution.