5 comments

Hi everyone,

it took some time to catch up with the JIT refacrtorings merged in this summer. But, (drums) we are happy to announce that:

The next release of PyPy, "PyPy 4.0.0", will ship the new auto vectorizer

The goal of this project was to increase the speed of numerical applications in both the NumPyPy library and for arbitrary Python programs. In PyPy we have focused a lot on improvements in the 'typical python workload', which usually involves object and string manipulations, mostly for web development. We're hoping with this work that we'll continue improving the other very important Python use case - numerics.

What it can do!

It targets numerics only. It will not execute object manipulations faster, but it is capable of enhancing common vector and matrix operations.
Good news is that it is not specifically targeted for the NumPy library and the PyPy virtual machine. Any interpreter (written in RPython) is able make use of the vectorization. For more information about that take a look here, or consult the documentation. For the time being it is not turn on by default, so be sure to enable it by specifying --jit vec=1 before running your program.

If your language (written in RPython) contains many array/matrix operations, you can easily integrate the optimization by adding the parameter 'vec=1' to the JitDriver.

NumPyPy Improvements

Let's take a look at the core functions of the NumPyPy library (*).
The following tests tests show the speedup of the core functions commonly used in Python code interfacing with NumPy, on CPython with NumPy, on the PyPy 2.6.1 relased several weeks ago, and on PyPy 15.11 to be released soon. Timeit was used to test the time needed to run the operation in the plot title on various vector (lower case) and square matrix (upper case) sizes displayed on the X axis. The Y axis shows the speedup compared to CPython 2.7.10. This means that higher is better.

In comparison to PyPy 2.6.1, the speedup greatly improved. The hardware support really strips down the runtime of the vector and matrix operations. There is another operation we would like to highlight: the dot product.
It is a very common operation in numerics and PyPy now (given a moderate sized matrix and vector) decreases the time spent in that operation. See for yourself:

These are nice improvements in the NumPyPy library and we got to a competitive level only making use of SSE4.1.

Future work

This is not the end of the road. The GSoC project showed that it is possible to implement this optimization in PyPy. There might be other improvements we can make to carry this further:

Check alignment at runtime to increase the memory throughput of the CPU
Support the AVX vector extension which (at least) doubles the size of the vector register
Handle each and every corner case in Python traces to enable it globally
Do not rely only on loading operations to trigger the analysis, there might be cases where combination of floating point values could be done in parallel

Cheers,
The PyPy Team

Automatic SIMD vectorization support in PyPy

The next release of PyPy, "PyPy 4.0.0", will ship the new auto vectorizer

What it can do!

NumPyPy Improvements

Future work

(*) The benchmark code can be found here it was run using this configuration: i7-2600 CPU @ 3.40GHz (4 cores).

The PyPy blogposts

Recent Posts

Archives

The next release of PyPy, "PyPy 4.0.0", will ship the new auto vectorizer

What it can do!

NumPyPy Improvements

Future work

(*) The benchmark code can be found here it was run using this configuration: i7-2600 CPU @ 3.40GHz (4 cores).

Comments

The PyPy blogposts

Recent Posts

Archives

Tags