E pur si muove

Judging performance of python code

Saturday, April 03, 2010

Recently I've been messing around with python bytecode, and as a result I now know approximately how the python virtual machine works at the bytecode level. I've found this quite interesting, but other than satisfying my own curiosity I'd had no benefit from this knowledge. Until today, when I found myself wondering which code would be more efficient:

# sample 1
a += 1
if b is not None:
    a += 2

# sample 2
if b is None:
    a += 1
else:
    a += 3

From a style point of view I'd prefer to write the first: I find it slightly more readable since it has fewer indented blocks and uses one line less on my screen. But my gut feeling tells me the second is faster, particularly if b is not None (because in the first sample we'd do two add operations instead of one when b is not None). But now I can verify my gut feeling! All I need to do is compile both fragments and use the disassembler to investigate:
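The transcript below assumes the two fragments are available as source strings named sample1 and sample2, and that dis has been imported; something like this (my reconstruction, the original setup code isn't shown):

```python
import dis

# the two fragments from above, as compilable source strings
sample1 = (
    "a += 1\n"
    "if b is not None:\n"
    "    a += 2\n"
)
sample2 = (
    "if b is None:\n"
    "    a += 1\n"
    "else:\n"
    "    a += 3\n"
)
```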

>>> co1 = compile(sample1, '', 'exec')
>>> dis.dis(co1)
  2           0 LOAD_NAME                0 (a) 
              3 LOAD_CONST               0 (1) 
              6 INPLACE_ADD          
              7 STORE_NAME               0 (a) 

  3          10 LOAD_NAME                1 (b) 
             13 LOAD_CONST               2 (None) 
             16 COMPARE_OP               9 (is not) 
             19 POP_JUMP_IF_FALSE       35 

  4          22 LOAD_NAME                0 (a) 
             25 LOAD_CONST               1 (2) 
             28 INPLACE_ADD          
             29 STORE_NAME               0 (a) 
             32 JUMP_FORWARD             0 (to 35) 
        >>   35 LOAD_CONST               2 (None) 
             38 RETURN_VALUE
>>> co2 = compile(sample2, '', 'exec')
>>> dis.dis(co2)
  2           0 LOAD_NAME                0 (b) 
              3 LOAD_CONST               2 (None) 
              6 COMPARE_OP               8 (is) 
              9 POP_JUMP_IF_FALSE       25 

  3          12 LOAD_NAME                2 (a) 
             15 LOAD_CONST               0 (1) 
             18 INPLACE_ADD          
             19 STORE_NAME               2 (a) 
             22 JUMP_FORWARD            10 (to 35) 

  5     >>   25 LOAD_NAME                2 (a) 
             28 LOAD_CONST               1 (3) 
             31 INPLACE_ADD          
             32 STORE_NAME               2 (a) 
        >>   35 LOAD_CONST               2 (None) 
             38 RETURN_VALUE

My analysis of this is pretty simple: count the number of instructions executed when b is None and when b is not None (not counting the JUMP_FORWARD 0 in sample 1, since it jumps nowhere):

              b is None    b is not None
  sample 1       10             14
  sample 2       11             10

So ultimately the best performance depends on whether b will be None or not. However the difference in the best case is only one instruction, while the difference in the worst case is a whole mighty 4 instructions! This would seem to confirm my gut feeling: the ugly code is better. It also makes me wonder whether this isn't the sort of optimisation a compiler should be doing: generate the bytecode of sample 2 regardless of how the source is written (I'm not a compiler guy and do realise it might not be as simple as that, since python is a dynamic language which allows you to change stuff, including executed code, at runtime, yada yada).

There's one more catch though: I seriously doubt every python instruction executes in the same amount of time! So let's use timeit to actually verify this. I'm omitting the trivial code, but this is the result:

$ python3 
sample1, b is None: 0.128307104111
sample2, b is None: 0.128338098526
sample1, b is not None: 0.244062900543
sample2, b is not None: 0.12109208107
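The omitted driver is easy enough to reconstruct; it would have been something along these lines (the variable names and repeat count are my guesses, not the original code):

```python
import timeit

# the two fragments under test, as source strings
sample1 = "a += 1\nif b is not None:\n    a += 2"
sample2 = "if b is None:\n    a += 1\nelse:\n    a += 3"

for b in ("None", "1"):
    for name, stmt in (("sample1", sample1), ("sample2", sample2)):
        # setup binds a and b before each timing run
        t = timeit.timeit(stmt, setup="a = 0; b = " + b, number=1000000)
        label = "is None" if b == "None" else "is not None"
        print("%s, b %s: %s" % (name, label, t))
```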

To be honest, the result is exactly as speculated: the first sample is slower when b is not None, and all other timings are pretty much the same. The one odd thing is that the time pretty much doubles in the bad case; this suggests the python virtual machine spends most of its time in the INPLACE_ADD instruction, while all the others are probably very quick.

Anyway, in conclusion I guess you can speculate about performance and get an idea by knowing what bytecode will be generated. But at the end of the day you'll get a simpler and better answer by just using timeit. So knowing something about python bytecode still hasn't gained me any benefit. It was still an interesting exercise though.



Matthew Webber said...

Determining the comparative performance by counting the number of instructions only works if the time required to perform each opcode is the same, which I suspect is not the case.

Tinned_Tuna said...

In response to Matthew Webber: Python instructions do not necessarily execute in the same time as one another.

However, since the instructions seen in the two examples are vastly similar, I think that it can be assumed that each instruction takes a similar length of time, making the comparison valid in this case. I doubt it would hold for many real-world applications.

writeonly said...

How about ``a += 2*(b is not None)``? If you're going to go ugly, go all the way :)

>>> c = compile( "a += 2*(b is not None)",'','exec')
>>> dis.dis(c) # 10 instructions
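For what it's worth, the instruction count can also be obtained programmatically with dis.get_instructions (available since Python 3.4; the exact count varies between CPython versions):

```python
import dis

# the one-liner from writeonly's comment
code = compile("a += 2*(b is not None)", "", "exec")
instructions = list(dis.get_instructions(code))
print(len(instructions))  # count varies by CPython version
```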

Pingveno said...

Function calls in Python (including INPLACE_ADD) are relatively expensive. Jumping around or 'is' comparison is practically free, relatively speaking. Even a C function call like int.__add__ takes time to set up. Hooks get you a lot more flexibility (e.g. Decimal), but it's not free.

If I understand correctly, a JIT loves numerical computation like this because simple in place integer addition can be compiled down to a few instructions instead of a full function call.

Anonymous said...

You ask whether Python could optimize the first into the second... the problem is that it can't in general, because __add__ may have side effects (or may be implemented in a completely different manner, e.g. turning the rhs into a string and concatenating). Thus

a += 1
a += 2

and

a += 3

can only be assumed to do the "same thing" if one knows something about the type of a.
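As an illustration of that point (my example, not the commenter's): a class whose __iadd__ has an observable side effect makes the two spellings distinguishable even though they produce the same total:

```python
class Recorder:
    """Numeric-ish type whose += has an observable side effect."""
    def __init__(self):
        self.total = 0
        self.calls = []

    def __iadd__(self, other):
        self.calls.append(other)   # side effect: remember every addition
        self.total += other
        return self

a = Recorder()
a += 1
a += 2

b = Recorder()
b += 3

assert a.total == b.total        # same numeric result...
assert a.calls != b.calls        # ...but different observable behaviour
```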

Floris Bruynooghe said...

@anonymous: Ah yes, that would be a good reason. I only just realised that it is actually the INPLACE_ADD instruction which looks up __add__ etc. That makes a lot of sense though.

Nadav said...

You may be interested in this article about python bytecode optimizations:

It talks about optimizing python at the bytecode level.
