IronPython hammers CPython when not mutating class attributes
Earlier today I posted the second article in what is turning out to be a short series investigating why IronPython is around 100× slower than CPython when running the front-end of my OWL BASIC compiler.
In response, Curt suggested: "Try storing the counter as a global variable instead of a class-level member of Node. I think you’ll notice a dramatic improvement."
The modified benchmark program looks like this:
    import sys

    # Module-level counter, replacing the class-level Node.counter
    counter = 0

    class Node(object):

        def __init__(self, children):
            global counter
            counter += 1
            self._children = children

    def make_tree(depth):
        # Build a full binary tree; leaves get an empty list of children
        if depth > 1:
            return Node([make_tree(depth - 1), make_tree(depth - 1)])
        else:
            return Node([])

    def main(argv=None):
        global counter
        if argv is None:
            argv = sys.argv
        # Tree depth comes from the first command-line argument, default 10
        depth = int(argv[1]) if len(argv) > 1 else 10
        root = make_tree(depth)
        print counter
        return 0

    if __name__ == '__main__':
        sys.exit(main())
A dramatic improvement!
Well, Curt wasn’t wrong. This made a phenomenal difference: with a binary tree depth of 20, IronPython completes in only 12% of the time taken by CPython, which makes it over 8× faster.
Let’s look in detail at the results. All results are from a dual quad-core 1.86 GHz Xeon with 4 GB RAM, and as before each benchmark was run five times, and the shortest time of the five taken.
The three test environments are:
- Python 2.5.2 x86 32-bit
- Jython 2.5rc2 on Java Hotspot 1.6 32-bit
- IronPython 2.0 on .NET 2.0 x64
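For anyone wanting to repeat the measurements, the best-of-five timing can be sketched roughly as follows. This is a minimal sketch rather than the harness I actually used: the `best_of` helper is a hypothetical name, and the command in the example is only a stand-in for launching the benchmark script under CPython, Jython or IronPython. Because it times a whole child process, interpreter start-up is included in the measurement.

```python
import subprocess
import sys
import time

def best_of(command, repeats=5):
    """Run `command` as a child process `repeats` times and return
    the shortest wall-clock time in seconds (hypothetical helper)."""
    timings = []
    for _ in range(repeats):
        start = time.time()
        subprocess.check_call(command)  # raises if the command fails
        timings.append(time.time() - start)
    return min(timings)

if __name__ == '__main__':
    # Stand-in command: replace with the interpreter and benchmark
    # script under test, plus the tree depth as an argument.
    shortest = best_of([sys.executable, '-c', 'pass'])
    print("shortest of five runs: %.3f s" % shortest)
```

Taking the minimum rather than the mean discards runs that were disturbed by other activity on the machine.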
Here we can see how hugely IronPython’s performance has been improved by this simple change. Although start-up time dominates for the smaller problem sizes, both Jython and IronPython now surpass CPython at around half-a-million nodes.
Removing start-up time, which may be irrelevant for long-running processes, gives us the following chart:
Again there is a lot of noise in the data below 1000 nodes, but it is clear that Jython scales better than IronPython, which in turn is scaling better than CPython.
Up until now I’ve been using a log-log scale in the charts because of the wide variation in performance between the different implementations. Now that the performance gap is much narrower, it’s difficult to get a sense of just how much faster IronPython is on the modified benchmark. Let’s throw in a log-linear plot to help us appreciate what’s going on:
It’s perhaps easier to see now that IronPython is doing in 14 seconds what takes CPython 114 seconds to achieve!
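The 12% and 8× figures quoted earlier follow directly from these two timings. A quick check of the arithmetic, using only the 14 and 114 second values above:

```python
cpython_seconds = 114.0
ironpython_seconds = 14.0

# How many times faster IronPython is than CPython
speedup = cpython_seconds / ironpython_seconds

# IronPython's run time as a fraction of CPython's
fraction = ironpython_seconds / cpython_seconds

print("speedup: %.1fx" % speedup)                  # speedup: 8.1x
print("fraction of CPython time: %.0f%%" % (fraction * 100))  # 12%
```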
Finally, let’s plot those results as we did before, as multiples of CPython performance:
It is easy to see in this chart that once we pass half-a-million tree nodes (a tree depth of 19), both Jython and IronPython are significantly beating CPython.
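The link between tree depth and node count comes straight from the shape of make_tree: each call creates one node and, above depth 1, recurses twice. The short sketch below (the `node_count` helper is my own name, not part of the benchmark) follows that recurrence and checks it against the closed form 2**depth - 1.

```python
def node_count(depth):
    """Nodes created by make_tree(depth): one node for this call,
    plus two subtrees of depth - 1 when depth > 1."""
    if depth > 1:
        return 1 + 2 * node_count(depth - 1)
    return 1

for depth in (10, 19, 20):
    # The recurrence has the closed form 2**depth - 1
    assert node_count(depth) == 2 ** depth - 1
    print("depth %d: %d nodes" % (depth, node_count(depth)))

# depth 10: 1023 nodes
# depth 19: 524287 nodes
# depth 20: 1048575 nodes
```

So a depth of 19 gives just over half a million nodes, and the depth-20 runs build just over a million.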
As for why the original version was so slow, the explanation was: "In this particular case, IronPython is slow because of the update to Node.counter. Currently, any update to a class will increment the version number for the class, which will have the effect of invalidating any rules compiled for that class. Effectively, the same rules are getting compiled over and over again. Moving the counter to a global should result in performance on par with that of CPython."
This is absolutely correct, except that he’s underselling the relative gain. IronPython is not only on a par with CPython, it can outperform it by a factor of eight!
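To make the difference concrete, here is a minimal side-by-side sketch of the two counting styles. The class names `ClassCounted` and `GlobalCounted` and the `time_construction` helper are hypothetical, and the class-attribute variant is only my assumption of the pattern being described (an update to a class-level counter in the constructor), not the original benchmark code. On CPython I would expect the two to run in broadly similar times; the large gap described above is specific to how IronPython 2.0 handles updates to a class.

```python
import time

class ClassCounted(object):
    counter = 0  # class-level attribute, rebound on every construction

    def __init__(self):
        # Each increment is an update to the class itself
        ClassCounted.counter += 1

global_counter = 0  # module-level counter

class GlobalCounted(object):
    def __init__(self):
        # Updates a module global; the class object is left untouched
        global global_counter
        global_counter += 1

def time_construction(cls, n):
    """Return the wall-clock seconds taken to construct n instances."""
    start = time.time()
    for _ in range(n):
        cls()
    return time.time() - start

if __name__ == '__main__':
    n = 100000
    print("class attribute: %.3f s" % time_construction(ClassCounted, n))
    print("module global:   %.3f s" % time_construction(GlobalCounted, n))
```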
With this knowledge in hand, I can now approach optimization of my OWL BASIC compiler, which lies back at the start of this illuminating tale.
- Avoiding mutation of Python class attributes can have significant benefits for IronPython performance.
- Both IronPython and Jython scale better than CPython by this benchmark, and have superior performance for large trees of nodes.