Finding memory leaks in extension modules
After reading Ned Batchelder's post on his experience hunting a memory leak (which turned out to be a reference counting error), it occurred to me that even though I have a script to check memory usage, I should really also be checking reference counts with sys.gettotalrefcount(). And indeed, after adding this to my script I found one reference count leak. I still have faith in my script as it was before, since the reference leak in question was not actually losing me memory - subtle bugs, eh?
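For illustration, the check boils down to something like this; exercise_code() is a hypothetical stand-in for whatever drives the extension module under test:

import sys

# sys.gettotalrefcount() only exists in a --with-pydebug build.
before = sys.gettotalrefcount()
exercise_code()  # hypothetical: calls into the extension module under test
print('refcount delta: %d' % (sys.gettotalrefcount() - before))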
But how do you check an extension module for memory leaks? This seems pretty undocumented, so here is my approach:
- First you really need a debug build of python: it helps a lot, since you get to use sys.gettotalrefcount() and get more predictable memory behaviour. The most complete way to build it is something like this (the MAXFREELIST stuff adapted from this):
s="_MAXFREELIST=0" ./configure --with-pydebug --without-pymalloc --prefix=/opt/pydebug \ CPPFLAGS="-DPyDict$s -DPyTuple$s -DPyUnicode$s -DPySet$s -DPyCFunction$s -DPyList$s -DPyFrame$s -DPyMethod$s" make make install
- Now run the test suite under valgrind. This is troublesome but very useful: the valgrind memory checker will help you identify problems pretty quickly. It can get confused by python itself, however, and since you only care about your extension module you will need to filter most of that out. Luckily the python distribution ships with a valgrind suppression file in Misc/valgrind-python.supp that you can use; it's not perfect, but it helps. This is how I invoke valgrind:
$ /opt/pydebug/bin/python setup.py build
$ valgrind --tool=memcheck \
    --suppressions=~/python-trunk/Misc/valgrind-python.supp \
    --leak-check=full /opt/pydebug/bin/python -E -tt setup.py test
==8599== Memcheck, a memory error detector
==8599== Copyright (C) 2002-2009, and GNU GPL'd, by Julian Seward et al.
==8599== Using Valgrind-3.5.0-Debian and LibVEX; rerun with -h for copyright info
==8599== Command: /opt/pydebug/bin/python -E -tt setup.py test
==8599==
==8599== Conditional jump or move depends on uninitialised value(s)
==8599==    at 0x400A66E: _dl_relocate_object (do-rel.h:65)
==8599==    by 0x4012492: dl_open_worker (dl-open.c:402)
==8599==    by 0x400E155: _dl_catch_error (dl-error.c:178)
==8599==    by 0x4011D0D: _dl_open (dl-open.c:616)
==8599==    by 0x405AC0E: dlopen_doit (dlopen.c:67)
==8599==    by 0x400E155: _dl_catch_error (dl-error.c:178)
==8599==    by 0x405B0DB: _dlerror_run (dlerror.c:164)
==8599==    by 0x405AB40: dlopen@@GLIBC_2.1 (dlopen.c:88)
==8599==    by 0x8132727: _PyImport_GetDynLoadFunc (dynload_shlib.c:130)
==8599==    by 0x81199D9: _PyImport_LoadDynamicModule (importdl.c:42)
==8599==    by 0x81161FE: load_module (import.c:1828)
==8599==    by 0x8117FAF: import_submodule (import.c:2589)
...
running test
...
FAILED (failures=4, errors=2)
==8599==
==8599== HEAP SUMMARY:
==8599==     in use at exit: 1,228,588 bytes in 13,293 blocks
==8599==   total heap usage: 280,726 allocs, 267,433 frees, 70,473,201 bytes allocated
==8599==
==8599== LEAK SUMMARY:
==8599==    definitely lost: 0 bytes in 0 blocks
==8599==    indirectly lost: 0 bytes in 0 blocks
==8599==      possibly lost: 1,201,420 bytes in 13,014 blocks
==8599==    still reachable: 27,168 bytes in 279 blocks
==8599==         suppressed: 0 bytes in 0 blocks
==8599== Rerun with --leak-check=full to see details of leaked memory
==8599==
==8599== For counts of detected and suppressed errors, rerun with: -v
==8599== Use --track-origins=yes to see where uninitialised values come from
==8599== ERROR SUMMARY: 75 errors from 5 contexts (suppressed: 19 from 6)
Note that the output is very verbose; usually I actually start with --leak-check=summary. First, notice that valgrind emits a lot of warnings before your extension module even gets loaded: those are python's problems, not yours, so skip over them. The output produced after (and during) the test suite run is what interests you. Most importantly, look at the "definitely lost" line: if that is not zero then you have a leak. The "possibly lost" figure is just python's problem (which sadly might hide problems you created too). When you do have lost blocks valgrind will give you a stack trace to pinpoint them, but you'll have to swim through lots of "possibly lost" stack traces from python to find it. Best is probably to grep the output for your source files.
- Next, create a function that exercises the code you want to test for leaks, so you can execute it in a loop. If you're really thorough, the entire test suite wrapped up in a function call would be a good choice.
Wrap it all up in a script that checks the memory usage and reference counts on each iteration of the loop and compares the start and end values. Getting memory usage can be tricky from within python (or you can use PSI of course), so depending on your situation you might prefer to do this with a script from your operating system.
For PSI this is the script I currently use. I clearly have it easy, since I can be sure PSI will be available :-). The reason I don't automate this script further (you could turn it into a unittest) is that I prefer to look at the output manually. Both memory usage and reference counts are funny and will most likely grow a little anyway. By looking at the output I can easily spot whether it keeps growing or stabilises; there is only a problem if it keeps growing with every iteration (don't be afraid to run with many, many iterations from time to time). When automating this you would probably end up allowing some margin, and might miss small leaks.
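To illustrate, here is a minimal sketch of such a checking loop (not the actual script linked above). It reads the current RSS straight from /proc, which is Linux-only - PSI would give you this portably - and exercise_code() again stands in for whatever drives your extension module:

import sys

def exercise_code():
    # Hypothetical stand-in: exercise the extension module here,
    # e.g. by running the whole test suite.
    import mymodule
    mymodule.do_something()

def rss_kb():
    # Current resident set size in KiB, read from /proc (Linux-only).
    with open('/proc/self/status') as f:
        for line in f:
            if line.startswith('VmRSS:'):
                return int(line.split()[1])

for i in range(50):
    exercise_code()
    # sys.gettotalrefcount() only exists in a --with-pydebug build.
    print('%3d: rss=%d KiB totalrefcount=%d'
          % (i, rss_kb(), sys.gettotalrefcount()))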
Hopefully some of this was useful for someone.
4 comments:
Unknown said...
I'm the guy who wrote that script, including details on how to compile a debug Python.
You do not need to supply any suppressions (and should not) if you are using the debug Python. The suppressions are needed for a regular Python build as the memory management is highly "optimised" and almost every allocation would result in a valgrind message.
The way I test for memory leaks is to have a wrapper around my test suite. If a particular environment variable is set then the tests are run as many times as the environment variable says. (Basically run unittest.main(), catch SystemExit and rerun unittest.main().)
Then I run 5, 10 and 15 iterations of the tests under valgrind and compare the summary totals. If you do have any differences, turn on dumping of the details at valgrind exit and compare the output of two runs. (You'll have to massage the valgrind output to exclude the process id.)
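A rough sketch of the wrapper described above (the environment variable name and test module are assumptions, not the commenter's actual code):

import os
import unittest

import mytests  # hypothetical module containing the test suite

# Rerun the suite as many times as the environment variable says.
for _ in range(int(os.environ.get('TEST_ITERATIONS', '1'))):
    try:
        unittest.main(module=mytests)
    except SystemExit:
        pass  # unittest.main() calls sys.exit(); swallow it and loop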
Unknown said...
You are right about not needing the suppression file; I think that was an artifact from before I found out about the _MAXFREELIST macros (I only learned of those from your comment on Ned's post). Or maybe I just got confused when I experimented to see the difference and was using the wrong python build.
Comparing valgrind output from different number of iterations is a good idea. And that should be more easily automated, I should add a script for that too.
Anonymous said...
I'm glad my experience inspired you. I went and read your changeset, and it looks like the fix you made in Process_children should also be made in Process_kill. If your leak-finding didn't find the one in Process_kill, maybe you need to increase your test coverage! :)
It's an interesting case, having to properly manage the refcount on global singletons!
Unknown said...
Thanks for spotting that other leak! My memory checking script did not run the full test suite; now it does!
And it spotted the leak too, so test coverage seems fine ;-)