E pur si muove

Designing binary/text APIs in a polygot py2/py3 world

Sunday, April 27, 2014

The general advice for handling text in an application is to use a so called unicode sandwich: that is decode bytes to unicode (text) as soon as receiving it, have everything internally handle unicode and then right at the boundary encode it back to bytes. Typically the boundaries where the decoding and encoding happens is when reading from or writing to files, when sending data across the network etc. So far so good.

All this is fine in an environment where it is possible to know the encoding to be used and where an encoding failure can simply be treated as a hard failure. However POSIX is notoriously bad at this, for many things the kernel just doesn't care and any bytes which go in will come back out. This means that for e.g. a filename or command line arguments the kernel does not care about it being valid in the current locale/encoding or even any encoding. When Python 3.0 was initially released this was a problem and by Python 3.1 the solution used was to introduce the surrogateescape error handler for decoders and encoders. This allows Python 3 to smuggle un-decodable bytes in unicode strings and the encoder will put them back when round-tripping. The classical example of why this is useful is when listing files using e.g. os.listdir() to then later pass them back to the kernel via e.g. open().

The downside of surrogate escapes is that the unicode strings now are no longer valid for many other normal string manipulations. If you try to write the result of os.listdir() to a file which you want to encode using UTF8 the encoding step will blow up, so this kind of brings the old Python 2 situation with bytes back. So any user of the API needs to be aware that strings may contain surrogate escapes and handle them appropriately. For a detailed description of these cases refer to Armin Ronacher's Unicode guide which introduces is_surrogate_escaped(s) and remove_surrogate_escaping(s, method='ignore') functions which are pretty self-explanatory.

But let's for now accept the surrogate escape solution Python 3 introduces, as long as the API documents this a user can handle it with the earlier mentioned helper functions. However when designing a polygot library API it is impossible to use the surrogateescape error handler since it does not exist for Python 2.7. And since the required groundwork was not backported either it is impossible to write a surrogateescape handler for Python 2.7, which I consider a glaring omission certainly given the timeline. So this pretty much makes the surrogateescape option not viable as a 2.7/3.x API.

So what options are there left for an API designer? One suggestion is to use native strings: bytes on Python 2.7 and unicode with surrogateescapes on Python 3.x. This means in either case there is no loss of data. But it also means the user of the API now has a harder time writing polygot code if they want to use the unicode sandwich. Given the difficulties to the user I'm not sure I'm a fan of this API.

Another correct, but rather unfriendly, option is to just consider the API to expose bytes and provide the encoding which should be used to decode it. In this case the user can choose the appropriate error handler themselves, be it =ignore=, =replace= or, on Python 3, surrogateescape. The advantage is that this would behave exactly the same on Python 2 and Python 3, however it leaves a casual user of the API a bit lost, certainly on Python 3 where receiving bytes of the API is not very friendly and feels like pushing the Python 2 problems back onto them.

Yet another option I've been considering is provide both APIs: one exposing the bytes, with the attributes possibly prefixed with a b, and one convenience API which decoded the bytes to unicode using the =ignore= error handler. This really seems to pollute the API but might still be the most pragmatic solution: it behaves the same on both Python 2 and Python 3, does not lose any information, allows easy use of the all-unicode inside text model yet still allows explicit handling of decoding.

So what is the best way to design a polygot API? I would really like to hear peoples opinions on which API would be the nicest to use. Or hear if there are any other tricks to employ for polygot APIs.


New comments are not allowed.

Subscribe to: Post Comments (Atom)