Prototyping mathematical code in Python with the Scipy/Numpy libraries and then switching to Cython for speed often works well, but there are limitations. The main problem that has been bugging me recently is the speed of matrix function calls. What happens is that your Cython code needs to compute, say, the outer product of two vectors, and so makes a call to Numpy. At this point everything switches to slow interpreted Python code, which does a few checks and conversions before calling into the fast underlying BLAS library that does the actual work. For large matrices the cost of this wrapping code isn’t a big deal, but for small matrices it can be a huge performance hit.
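To get a rough feel for how big this overhead is, you can time a tiny outer product against a large one and compare the per-call cost. A quick, illustrative measurement (pure Numpy, nothing Tokyo-specific; the numbers will vary by machine):

import timeit

# 4-element outer product: the per-call time is almost entirely Python wrapper overhead.
small = timeit.timeit( "np.outer( x, y )",
    setup = "import numpy as np; x = np.ones(4); y = np.ones(4)", number = 100000 )

# 1000-element outer product: the per-call time is dominated by the actual arithmetic.
large = timeit.timeit( "np.outer( x, y )",
    setup = "import numpy as np; x = np.ones(1000); y = np.ones(1000)", number = 100 )

print "4-vector outer product:   ", small / 100000, "seconds per call"
print "1000-vector outer product:", large / 100, "seconds per call"

The tiny case does vastly less arithmetic, but its per-call time typically doesn’t shrink anywhere near proportionally; the gap is the wrapping cost described above.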
To solve this problem I’ve created a Cython wrapper for many of the more common BLAS functions. It’s called Tokyo: I often name code after cities, and both Tokyo and BLAS/LAPACK were big, fast and very foreign to start with! At the moment Tokyo only wraps the BLAS routines for vectors and general matrices, in single and double precision. If you want to add other things such as banded matrices, complex numbers or LAPACK calls, just look at what I’ve already done and add the functions you need. I’ve also added a few extra functions that I find useful when doing matrix calculations. The idea is that Tokyo will eventually encompass all of BLAS and LAPACK.
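If you want a feel for what adding a wrapper involves: each one is essentially a thin Cython function that declares the CBLAS prototype and hands the array’s data pointer straight through. Here is a rough sketch of the idea for a dot product (illustrative only, not Tokyo’s actual source, and it assumes a CBLAS header is available):

cimport numpy as np

# Declare the CBLAS prototype we want to call directly.
cdef extern from "cblas.h":
    double cblas_ddot( int N, double *X, int incX, double *Y, int incY )

cpdef double ddot( np.ndarray x, np.ndarray y ):
    # Hand the raw data pointers straight to CBLAS, with none of the
    # Python-level checking that a Numpy call would go through.
    return cblas_ddot( x.shape[0], <double*>x.data, 1, <double*>y.data, 1 )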
The speedup you can expect varies a lot with the size of the vectors and matrices involved and the operation being computed. An outer product on length-4 vectors can be over 150 times faster, while a 20×20 matrix multiplication runs at about the same speed. After installing Tokyo, run the single and double precision speed tests to see how much faster the Tokyo routines are than Numpy on your machine. For my own reinforcement learning research I’ve managed to get speedups of between 5 and 20 times on various algorithms.
Here’s a simple example of a vector outer product, first in pure Python with Numpy:
import numpy as np
x = np.array( [1.0, 2.0, 3.0, 4.0] )
y = np.array( [7.0, 8.0, 9.0, 0.0] )
print np.outer( x, y )
And now a fast version in Cython using Tokyo:
import numpy as np
cimport numpy as np      # C-level access to Numpy's array type
import tokyo
cimport tokyo            # C-level access to Tokyo's BLAS wrappers

cdef np.ndarray x = np.array( [ 1.0, 2.0, 3.0, 4.0 ] )
cdef np.ndarray y = np.array( [ 7.0, 8.0, 9.0, 0.0 ] )
print tokyo.dger( x, y )        # same result as np.outer( x, y )
As you can see, I use the BLAS names for the functions: here “dger” is short for “double precision, general matrix, rank-1 update”. BLAS naming seems confusing to start with, but after a little study it’s actually quite simple. As usual in Cython, giving x and y a cdef type isn’t necessary, but it lets Cython produce faster C code.
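If the arrays live in a tight inner loop you can go a step further and declare their element type and number of dimensions too, so that indexing them compiles down to plain C array access. A minimal sketch (the function name is made up for illustration; in Cython these typed buffers have to be local to a function):

import numpy as np
cimport numpy as np
import tokyo
cimport tokyo

def outer_example():
    # Fully typed buffers: Cython knows both the element type and the
    # dimensionality, so it can generate straight C code around them.
    cdef np.ndarray[np.float64_t, ndim=1] x = np.array( [ 1.0, 2.0, 3.0, 4.0 ] )
    cdef np.ndarray[np.float64_t, ndim=1] y = np.array( [ 7.0, 8.0, 9.0, 0.0 ] )
    return tokyo.dger( x, y )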
As I suspect quite a few other people have run into this matrix speed problem, I figure the code is worth sharing. Finally, thanks to Dag Sverre Seljebotn’s Scipy 2009 tutorial, which got me started on writing this.
Get Tokyo here: www.vetta.org/software/tokyo_v0.3.tgz
Thanks for posting this. I’ve used daxpy on a 64×64 double matrix deep inside a nested loop for a speedup of over 40×.
Very useful, thank you for doing this! Now if only BLAS had a function for elementwise multiplication…