devork

E pur si muove

Pain

Tuesday, August 04, 2009

I've spent the last 2 days trying to get a stack trace from a crashing python extension module in windows. And I still haven't figured it out. That's sooo very motivating.

Give me GNU/Linux any day.


8 comments:

Floris Bruynooghe said...

The test script has crashed by the time I can do anything. Normally I'd expect to be able to enable a core dump and read a stack trace from it. But all I get is a window with a "debug" button, which sounds very promising. But it only says "Unhandled exception at 0x00000 in python_d.exe: 0xC000006: Access violation reading location 0x0000." (wrong number of zeroes in this quote)

Sure! I made a pointer error, I do that all the time - no big deal. But give me a stack trace so I know where to look! So I hit the "break" button in the hope of being taken to a useful bit, but then I get "There is no source code available for the current location" and a choice of whether to show the disassembly. Fair enough, I might be in a system lib - but just show the stack trace, that would be very useful.

This is the case when I use "python setup.py build --devel" and also when I go through the pain of setting up an msvc solution containing the projects and creating debug builds that way, then invoking the debugger from inside msvc with a script crafted to use the compiled extensions from the msvc project. But none of that helps, it still only wants to show some random disassembly with no stack to explore. It also seems to ignore breakpoints I set from inside msvc after setting up the projects and invoking a debug run.

(Apologies, I know so little about msvc and windows that I don't even know how to meaningfully ask for help.)

Doug Napoleone said...

No problem. MSDev is hands down the best debugger for C/C++, BUT it has one massive, massive bug: the lack of a core dump facility. I consider that a bug.


Now, it is very difficult to make python get into a state where it will crash from an extension library before you can attach to it. A brute-force approach is to put this as the first line of your script:

import sys; sys.stdin.readline();

then invoke python with:

python -S myscript.py


This will stop site.py from loading and prevent some other minor startup-time module loading (which might load your extension module).

From there you can attach to process, then hit enter in the window.
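A slightly more explicit version of that pause, as a minimal sketch: it prints the process ID so the right python.exe can be picked in the debugger's attach dialog, then blocks until Enter is pressed. The helper name wait_for_debugger is made up for illustration, and the module name in the closing comment is a placeholder.

```python
import os
import sys


def wait_for_debugger(out=sys.stderr, inp=sys.stdin):
    """Print this process's PID and block until a line is read from `inp`."""
    pid = os.getpid()
    out.write("PID %d: attach the debugger, then press Enter\n" % pid)
    out.flush()
    inp.readline()  # blocks here until Enter is pressed
    return pid


# Intended use at the very top of the test script:
#     wait_for_debugger()
#     import my_extension  # placeholder name; import only after attaching
```

Calling it before the extension is imported means the debugger is already attached when the module's init function runs.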

This might still not get you a usable stack trace as I fear the problem is one of two things:

1. You have a ref count issue in your extension and the PyObject is being garbage collected as you return it. This means that the python system does not have a chance to realize that it's gone before it MUST access it as part of the handoff.

2. You are returning a NULL pointer without setting the python error/exception state. Most likely this is a NULL pointer PyObject* that is being stuffed into some other larger structure being returned, as the debug python build should have protections against a bare NULL return.

Both of these will result in access violation exceptions like you are seeing, but well AFTER your code has returned, and often well after other python code has executed.

This is a royal PITA to debug on any operating system. You would not get a really usable stack on linux either (but you would have a better chance to figure out which PyObject* was corrupted from the dump, if you are an expert in the PyObject* structure).

Depending on how the access violation is occurring, you really want to be attached to the process and to enable all exception catches (will have to get back to you on where that dialogue is) to reliably get a stack trace with all information in MSDev.

My 'gut' is telling me that the exception is occurring either at the end of python runtime in the process shutdown or it is occurring at the time of the module_init.

Remember that DLLs are NOT shared libraries, and behave very differently. A key issue is that the libraries are loaded and unloaded in completely different orders. I have found that extension modules which keep state need to use the atexit module on windows because the DLL is unloaded AFTER the python process has terminated and been unloaded from memory. This adds major complications as all file handles are also invalid by then (including STDIN, STDOUT and STDERR).

Don't get me started on signals ;-)

Welcome to my hell.

Floris Bruynooghe said...

Yay! Finally managed to run it under debugging control. I wasn't quite as badly off as you were suspecting. My "can't attach to process" was not quite true: I knew which function I had to call to make it crash, so I could easily halt execution beforehand to attach the debugger. What made me say that is that when starting the debug run from inside Visual Studio (VS) I got the exception too, even when I had a breakpoint at the function I was expecting to fail, so I could never nose around in the code and didn't know if VS was managing to use the object files I wanted and could debug with them.

I will attempt to write up how I managed to debug extensions on windows once I've figured out the simplest way to get to that point.

What actually intrigues me most is you saying that VS/MSDev has no core dump facility. Does this mean that any post-mortem debugging is out? Is it impossible to find out the state of variables and the stack after an exception has happened? I tried setting "break at all exceptions" (Debug->Exceptions menu item) but still got absolutely nothing after the fact. I had to step through the code until I found where it broke, then set a breakpoint just before so I could look at the state. But that's a rather laborious way of finding the issue. Is there really nothing better? And what do you ask users to send you if there's no core dump?

As for the actual problem, it occurred earlier than I thought (part of why I didn't get to the breakpoint before). I was expecting it to happen in the system-independent code, but it was when calling type->tp_alloc on a new windows-only type where I forgot to add the PyType_Ready() call in the module initialisation. So tp_alloc was still pointing to 0x0 and calling that is no good.

But still, a core dump or stack trace would have pointed me directly to that.
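For readers on newer Pythons: the standard library's faulthandler module (Python 3.3 and later, so not available when this thread was written) can write the Python-level traceback when the process dies from a fault such as an access violation. It shows which Python call entered the extension, not the C stack. A minimal sketch, with the log file name chosen arbitrarily and call_into_extension standing in for the real crashing call:

```python
import faulthandler
import os
import tempfile

# Arbitrary log location for this sketch.
log_path = os.path.join(tempfile.gettempdir(), "crash_traceback.log")
log_file = open(log_path, "w")

# From here on, a fatal fault writes the Python traceback to log_file.
# The file must stay open for as long as the handler is enabled.
faulthandler.enable(file=log_file, all_threads=True)


def call_into_extension():
    # Stand-in for the call that crashes. The traceback is dumped on
    # demand here so the output can be inspected without a real crash.
    faulthandler.dump_traceback(file=log_file)


call_into_extension()
log_file.flush()
```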

(BTW, this is on public code, or will be when I check it in. I'm working on the Windows port of PSI - http://bitbucket.org/chirsmiles/psi)

And oh, so far I respectfully disagree with MSDev being the best debugger. I haven't seen it do anything yet that I can't do with the Emacs/gdb/GUD combo. It does less actually, given the core-dump issue.

Thanks for the help though!

Doug Napoleone said...

You can add special instrumentation to get a stack after the fact, but no, you cannot load a core and inspect the stack. There is no core dump binary unless you go through extra steps and instrument your system in some way (à la .NET madness).

This is more a limitation of the operating system than the compiler, but that is another discussion. There are gcc's for windows, but they run into barriers with the core dumping when it comes to any non-gcc compiled part of the stack, and even then there are other issues.

You should have gotten a crash right on the null dereference. I do not know why the debugger was not dumping you right in at the exact point. My guess is that you were not building a debug binary with a full .map file and were missing some compile options. Did you turn off the lookahead compile optimizations? Did you disable the non-inline function inlining optimizations? Those will munge your stack as the stack has been optimized away in those areas of the code.

A NULL dereference always dumps me in exactly where I need to be in my uses of MSDev. Very odd.


As for it being the best debugger, sorry it is. You just have no clue how to use it. I see my wife have similar arguments with GIMP users who say that they can do everything in GIMP that can be done in photoshop and that photoshop is crap. She is an expert in both, and photoshop wins hands down every time.

The problem is there is a learning curve and you have to buy into the windows world. If you don't want to put in that effort, then you will not get the benefits; that simple.

NOTE: I ONLY use MSDev for debugging, not as a general IDE. I refuse to put in the effort to get the added benefits I could get out of that part of it.

While our system primarily runs on our linux grid, we always make reprodumps and debug on windows because the debugger is so much better. The reprodump facility is part of our engine for being able to reproduce runs, and allows us to reproduce with different builds as well, something a core dump could never do.

Again, I have no clue what your problems with getting a stack trace are. I have never had those issues with a simple NULL dereference. The only time I have had them are with static runtime initialization and destruction where the OS/process bindings are not fully available and with the pre-SSE2 floating point exception signals. Even then it is easy enough to get MSDev to give you the proper information by rebuilding with some extra instrumentation options and disabling other optimizations.

Doug Napoleone said...

Here are the compile flags we use for a full-debug, non-optimized build:
(well the relevant parts)

-DDEBUG=1 /Gy /GF /c /Zl /Zm500 /Zi /Zc:forScope /WX /W4 /Wall /MTd /fp:precise /fp:except /D__STDC__ /MP2 /Od /RTCc /RTCs /RTCu /GS /Ob2 /EHsc-

Floris Bruynooghe said...

Thanks for those compiler flags, I tried modifying the normal distutils compiler flags to include these as well. Also added /MAP to the linker and re-linked python to include /MAP, but all with no success: still don't get the stack at the NULL dereference.

Is there anything you do to the python compilation? Or do you just build the normal Debug build as shipped in the pcbuild solution file?
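One way to pass such flags through distutils (or setuptools today) is via the extra_compile_args and extra_link_args keywords of Extension. A sketch, assuming an MSVC toolchain; the helper name and the module and source names in the usage comment are hypothetical, and only a small subset of the flags above is shown:

```python
def msvc_debug_flags():
    """Keyword arguments for Extension(...) requesting an unoptimized debug build."""
    return {
        # /Zi: emit debug info (PDB); /Od: disable optimization.
        "extra_compile_args": ["/Zi", "/Od"],
        # /DEBUG: have the linker write debug info; /MAP: write a map file.
        "extra_link_args": ["/DEBUG", "/MAP"],
    }


# Usage in setup.py (all names are placeholders):
# from setuptools import setup, Extension
# setup(name="example",
#       ext_modules=[Extension("example._native", ["src/native.c"],
#                              **msvc_debug_flags())])
```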

Floris Bruynooghe said...

For completeness' sake I'm adding a note on my problem of MSDev not being able to show me where the NULL dereference was.

I strongly suspect that my user was not in the correct group (debugger users).
