E pur si muove

Pylint and dynamically populated packages

Thursday, December 04, 2014

Python links the module namespace directly to the layout of the source locations on the filesystem. And this is mostly fine, certainly for applications. For libraries one sometimes wants to control the toplevel namespace or API more tightly. This also is mostly fine, as one can just use private modules inside a package and import the relevant objects into the package's __init__.py, optionally even setting __all__. As I said, this is mostly fine, if sometimes a bit ugly.

However, sometimes you have a library which may be loading a particular backend or platform support at runtime. An example of this is the Python zmq package. The apipkg module is also a very nice way of controlling your toplevel namespace more flexibly. The problem is that once you start using one of these things, Pylint no longer knows which objects your package provides in its namespace and will issue warnings about using non-existing things.

It turns out it is not too hard to write a plugin for Pylint which takes care of this. One just has to build the right AST nodes in the places where they would appear at runtime. Luckily the tools to do this easily are provided:

import importlib
import types

import astroid


def transform(mod):
    if mod.name == 'zmq':
        module = importlib.import_module(mod.name)
        for name, obj in vars(module).copy().items():
            if (name in mod.locals or
                    not hasattr(obj, '__module__') or
                    not hasattr(obj, '__name__')):
                continue
            if isinstance(obj, types.ModuleType):
                ast_node = [astroid.MANAGER.ast_from_module(obj)]
            else:
                real_mod = astroid.MANAGER.ast_from_module_name(obj.__module__)
                ast_node = real_mod.getattr(obj.__name__)
                for node in ast_node:
                    fix_linenos(node)
            mod.locals[name] = ast_node

As you can see, the hard work of knowing which AST nodes to generate is all done in the astroid.MANAGER.ast_from_module() and astroid.MANAGER.ast_from_module_name() calls. All that is left to do is add these new AST nodes to the module's globals/locals (they are the same thing for a module).

You may also notice the fix_linenos() call. This is a small helper needed when running on Python 3 and importing C modules (like for zmq). The reason is that Pylint tries to sort nodes by line number, but for C code the line numbers are None; in Python 2 None and an integer can happily be compared, but in Python 3 that is no longer the case. So this small helper simply sets all unknown line numbers to 0:

def fix_linenos(node):
    if node.fromlineno is None:
        node.fromlineno = 0
    for child in node.get_children():
        fix_linenos(child)

Lastly when writing this into a plugin for Pylint you'll want to register the transformation you just wrote:

def register(linter):
    astroid.MANAGER.register_transform(astroid.Module, transform)

And that's all that's needed to make Pylint work fine with dynamically populated package namespaces. I've tried this on zmq as well as on a package using apipkg, and it seems to work fine on both Python 2 and Python 3. Writing Pylint plugins seems not too hard!
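To enable such a plugin you also need to tell Pylint to load it. A hypothetical configuration, assuming the code above is saved as a module named pylint_zmq.py somewhere on the Python path (that module name is made up here):

```ini
# .pylintrc
[MASTER]
load-plugins=pylint_zmq
```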

New pytest-timeout release

Thursday, August 07, 2014

At long last I have updated my pytest-timeout plugin. pytest-timeout is a plugin for py.test which will interrupt tests which are taking longer than a set time and dump the stack traces of all threads. It was initially developed in order to debug some tests which would occasionally hang on a CI server, and can be used in a variety of similar situations where getting some output is more useful than getting a clean testrun.

The main new feature of this release is that the plugin now finally works nicely with the --pdb option from py.test. When using this option the timeout plugin will now no longer interrupt the interactive pdb session after the given timeout.

Secondly, this release fixes an important bug which meant that a timeout in the finaliser of a fixture at the end of the session would not be caught by the plugin. This was mainly because pytest-timeout had not been updated since py.test changed the way fixtures are cached according to their scope, with the introduction of @pytest.fixture(scope='...'), even though that change happened a long time ago.

So if you use py.test and a CI server, I suggest now is as good a time as any to configure it to use pytest-timeout, using a fairly large timeout of, say, 300 seconds, and then forget about it forever. Until maybe one day it suddenly saves you a lot of head scratching and time.
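For example, such a default could go in the project's pytest configuration (the value 300 is just the suggestion above; the timeout option is provided by pytest-timeout):

```ini
# pytest.ini
[pytest]
timeout = 300
```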

Designing binary/text APIs in a polyglot py2/py3 world

Sunday, April 27, 2014

The general advice for handling text in an application is to use a so-called unicode sandwich: decode bytes to unicode (text) as soon as you receive them, have everything internally handle unicode, and then right at the boundary encode it back to bytes. Typically the boundaries where the decoding and encoding happen are when reading from or writing to files, when sending data across the network, etc. So far so good.
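A minimal sketch of the sandwich; the UTF-8 encoding and the data here are of course just an example:

```python
raw = b'caf\xc3\xa9'          # bytes as they arrive from a file or socket
text = raw.decode('utf-8')    # decode at the input boundary: now unicode
text = text.upper()           # all internal processing works on text
out = text.encode('utf-8')    # encode only at the output boundary
```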

All this is fine in an environment where it is possible to know the encoding to be used and where an encoding failure can simply be treated as a hard failure. However, POSIX is notoriously bad at this: for many things the kernel just doesn't care, and any bytes which go in will come back out. This means that for e.g. a filename or command line arguments the kernel does not care about them being valid in the current locale/encoding, or even in any encoding. When Python 3.0 was initially released this was a problem, and by Python 3.1 the solution was to introduce the surrogateescape error handler for decoders and encoders. This allows Python 3 to smuggle un-decodable bytes in unicode strings, and the encoder will put them back when round-tripping. The classical example of why this is useful is listing files using e.g. os.listdir() to then later pass them back to the kernel via e.g. open().
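A small demonstration of that round-trip (Python 3 only, of course):

```python
raw = b'ab\xff'                                    # \xff is not valid UTF-8
s = raw.decode('utf-8', errors='surrogateescape')  # undecodable byte smuggled
assert s == 'ab\udcff'                             # ...as a lone surrogate
back = s.encode('utf-8', errors='surrogateescape')
assert back == raw                                 # round-trips losslessly
```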

The downside of surrogate escapes is that such unicode strings are no longer valid for many other normal string manipulations. If you try to write the result of os.listdir() to a file which you want to encode using UTF-8, the encoding step will blow up, so this kind of brings back the old Python 2 bytes situation. So any user of the API needs to be aware that strings may contain surrogate escapes and handle them appropriately. For a detailed description of these cases refer to Armin Ronacher's Unicode guide, which introduces is_surrogate_escaped(s) and remove_surrogate_escaping(s, method='ignore') functions which are pretty self-explanatory.
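Sketches of such helpers could look roughly like this; this is my own guess at the behaviour, assuming UTF-8 as the target encoding, not the guide's actual code:

```python
def is_surrogate_escaped(text):
    # True if the string contains smuggled bytes that would make a
    # strict UTF-8 encode blow up.
    try:
        text.encode('utf-8')
    except UnicodeEncodeError:
        return True
    return False

def remove_surrogate_escaping(s, method='ignore'):
    # Drop ('ignore') or substitute ('replace') the smuggled bytes so the
    # result can be encoded cleanly again.
    assert method in ('ignore', 'replace'), 'invalid removal method'
    return s.encode('utf-8', method).decode('utf-8')
```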

But let's for now accept the surrogate escape solution Python 3 introduces; as long as the API documents this, a user can handle it with the earlier mentioned helper functions. However, when designing a polyglot library API it is impossible to use the surrogateescape error handler, since it does not exist in Python 2.7. And since the required groundwork was not backported either, it is impossible to write a surrogateescape handler for Python 2.7, which I consider a glaring omission, certainly given the timeline. So this pretty much makes surrogateescape not viable as a 2.7/3.x API.

So what options are left for an API designer? One suggestion is to use native strings: bytes on Python 2.7 and unicode with surrogate escapes on Python 3.x. This means in either case there is no loss of data. But it also means the user of the API now has a harder time writing polyglot code if they want to use the unicode sandwich. Given the difficulties for the user I'm not sure I'm a fan of this API.
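A sketch of what such a native-string boundary might look like from the library's side (the helper name native_str is made up for illustration):

```python
import sys

def native_str(raw, encoding='utf-8'):
    # Return the "native" string type: the raw bytes unchanged on Python 2,
    # text with surrogate escapes on Python 3, so no data is ever lost.
    if sys.version_info[0] >= 3:
        return raw.decode(encoding, errors='surrogateescape')
    return raw
```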

Another correct, but rather unfriendly, option is to just consider the API to expose bytes and provide the encoding which should be used to decode them. In this case the user can choose the appropriate error handler themselves, be it ignore, replace or, on Python 3, surrogateescape. The advantage is that this would behave exactly the same on Python 2 and Python 3; however, it leaves a casual user of the API a bit lost, certainly on Python 3 where receiving bytes from the API is not very friendly and feels like pushing the Python 2 problems back onto them.

Yet another option I've been considering is to provide both APIs: one exposing the bytes, with the attributes possibly prefixed with a b, and one convenience API which decodes the bytes to unicode using the ignore error handler. This really seems to pollute the API but might still be the most pragmatic solution: it behaves the same on both Python 2 and Python 3, does not lose any information, allows easy use of the all-unicode-inside text model, yet still allows explicit handling of decoding.
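As a sketch of the shape such a dual API could take (the class and attribute names here are invented for illustration):

```python
class DirEntry:
    # Hypothetical result object exposing both views of the same name:
    # the exact bytes and a decoded convenience attribute.
    def __init__(self, raw, encoding='utf-8'):
        self.b_name = raw                                  # lossless bytes
        self.name = raw.decode(encoding, errors='ignore')  # convenience text
```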

So what is the best way to design a polyglot API? I would really like to hear people's opinions on which API would be the nicest to use, or hear whether there are any other tricks to employ for polyglot APIs.

Don't be scared of copyright

Saturday, February 08, 2014

It appears there are some arguments against putting copyright statements at the top of files in free software or open source projects. Rich Bowen argues that it is counterproductive and not in the community spirit (in this case talking about OpenStack). It seems to me the main arguments are the following:

  • It is intimidating to new people
  • It gets too verbose
  • Encourages contribution for the wrong reasons
  • It is hard to decide when to add a name
  • It is even harder to decide when to remove a name
  • The VCS keeps track of contributions anyway

Lastly, and perhaps most importantly, he asks:

[...] why do you care? What are you trying to protect against? If you're trying to protect against your contribution being taken by the community and used for other purposes, perhaps contributing to an Apache-licensed code base isn't the smartest thing to do.

Now I think the last question is the most important to answer: you want to assert your copyright on a file to prevent your work from being re-licensed against your will.

That to me is really the crux of the issue, and it certainly does not go against the spirit of free or open source software. In fact, every additional author asserting their copyright under the license chosen for the project makes the commitment of the project to this free or open source license even deeper. It is an attempt to protect against some hypothetical future lawyer who might one day try to claim someone was allowed to do something which was not in the spirit of the free or open source project. For every additional person or organisation listed as holding copyright it becomes harder to ever re-license the work. And this is a good thing.

Now I choose the words "assert your copyright" carefully. Not being a lawyer, I am not sure what the best way to do this is. Personally I remember the FSF recommending to put a copyright line with the license in every file, so I trust them that this is a fairly legally sound approach. But likewise I'm fine with a LICENSE.txt and AUTHORS.txt file; that is far nicer to work with as a developer, however I can imagine lawyers being able to attack such a system more easily.

As to addressing the other minor points: on the social issues I can't really counter much. Yes it would be a shame if people would be scared away for no reason.

As for deciding when to add or remove a person to the copyright: this might always remain tricky, but if in doubt add the person. And simply never ever remove a person. However, the claim that the VCS would be capable of tracking who is the author of which fragment is a bit frivolous: it is fairly common for patches to be applied by a committer instead of the author. And even if a change originates from a pull request, there might easily be reasons to amend the commit to fix some typos, squash commits, etc. I've seen the original authors of a commit disappear before, so I really wouldn't just leave this bookkeeping up to the VCS.

So in short, the more people are listed as owning copyright on a project the healthier it is and the more I trust it. Please do not be scared away by other people or organisations being listed as copyright holders.
