SciPy – some more thoughts

With over 20,000 visitors in 2 days I discovered what my poor web server’s maximum capacity is: about 70 visitors a minute.  It doesn’t sound like that much to me, but a number of times I had to ask my ISP to resuscitate the thing.  I guess all the interpreted PHP code and database access required to build each page is quite computationally expensive.    Anyway, I thought I’d write a short addendum for anybody who is still interested.

First of all, things like SciPy are obviously good tools for certain types of problems. If you’re writing a front end system for a corporate database system or a text editor — forget about it.  That said, plain Python is a pretty good way to code most things in terms of being a language that is somehow both concise and simple.  In my opinion the only real downside is speed: it’s about 100 times slower than C, or around 50 times slower than java.  Anyway, that’s another topic, here I’m more focused on using Python in combination with SciPy.  If you’re dealing with complex transformations on data that has many dimensions, for example simulating networks or machine learning, SciPy is likely to be good tool for the job.

One of the most popular comments on reddit basically said that while writing a program to use matrices can produce very short code, it’s far too difficult to debug. I don’t buy this. Firstly, in my experience, if you get the matrix computations wrong the output is normally complete junk. That’s good news, because the really hard bugs to fix are ones where the output is almost right, or is right most of the time. You also can be pretty confident that all the fancy algorithms doing efficient matrix multiplications etc. are correct. Furthermore, because the code is so short there are only a few places where you have to look to work out what you’re doing wrong. Finally, if you’re working in areas such as machine learning or finance then the normal way to describe things is already matrix notation. If you want to, say, construct a non-parametric density estimate by using a multivariate Gaussian kernel, then your starting point is going to be a set of matrix equations.

What are the problems with SciPy?  As pointed out by one of the commenters to my last post, one major problem is documentation.  This is common with open source projects it seems: people like to write code but they don’t like to write documentation.  The basics seem to be covered ok, but as you get further into the libraries it’s not so good.  And it’s not just that the documentation is missing, but rather the structure of the whole SciPy system is confusing.  For example, let’s say you want to interpolate in multiple dimensions.  You go to the interpolate module and what you find is that it only does up to 2 dimensions.  So you’re out of luck, right?  No, the trick is to look in the image processing library ndarray which has interpolation is as many dimensions as you want.  For a new user this is all really confusing.  Hopefully, as the whole system matures these things will improve.  As the documentation is wiki style, once I’m a bit more knowledgable I’m going to do my part and help out in a few places.  It’s really the least I can do.

Then there is the question of performance.  I’ve been told that the standard Ubuntu SciPy package doesn’t use matrix libraries that are all that efficient.  If you want to have fast SciPy then you need to install BLAS, LAPACK and ATLAS yourself.  Using ATLAS you can then compile the libraries so that they will be automatically optimised for your particular CPU, memory cache sizes etc.  You can then get SciPy to use these optimised libraries.  Just look under the installing SciPy for all the instructions.  I haven’t done this yet but people who have report significant speed increases.

If your code makes relatively few calls (say tens of thousands or less) on relatively big matrices (say with thousands of elements) then (depending on what calls you’re making) your code will probably spend most of its time busy in the matrix library.  With an optimised library you’re then doing pretty well in terms of performance.  You might want to look at things like Parallel Python to take the performance to the next level.

Where performance becomes a problem is when you have the reverse situation: many calls (millions) on relatively small matrices (hundreds of elements or less).  When what tends to happen then is that all the computation time is spent going from Python into the library and back out again. The time spent actually inside the library doing computations is negligible. In this case your code would be much faster if it was written in C++ and directly called the BLAS libraries. Other than trying to rethink your algorithm so that each matrix operation does a bigger chunk of the total work, I’m not sure what to do about this. Using something like Pyrex doesn’t really help all that much (Pyrex takes Python with a few extras and compiles it into C that can then be compiled into a binary library that Python can use). The problem is that even in Pyrex code you still have the overhead of going into and back out of the SciPy library. You can directly access matrices (ndarrays) from within Pyrex in a way which is very efficient, however you’re then left writing the matrix code yourself — ok if you want to do something trivial, but not a good idea if you need to efficiently multiply two matrices.  So in short, if you need to make a large number of calls on relatively small matrices then I’m not sure how you can do this in a way that can compare to C++ used with BLAS.  If I ever find a solution, I’ll let you know.

This entry was posted in Uncategorized and tagged , . Bookmark the permalink.

7 Responses to SciPy – some more thoughts

  1. Tobu says:

    You could use weave to avoid crossing the language boundary inside a loop. There is a tutorial and here are some alternatives.

  2. Stephen Donnelly says:

    The eve-online cluster uses twisted-python?

    Also if your server is bogging down, you could offload visitors to the LJ feed,

  3. Shane Legg says:

    Tobu: I’m not sure how weave would help much?

    If I have some nice ndarrays that I’m working with and want to do something like an efficient dot product then I’m going to hove to somehow cross from the ndarray world into the BLAS world. I guess that Numpy has to do a bit of magic to make this happen and that’s what slowing down the calls. I guess that I would have to essentially replicate this… which might end up being just as slow?

    Maybe I should look into the guts of Numpy to see what is actually going on… 🙂

  4. hmm..
    Realy strange.
    Pyrex(Cython) did not help you?
    >>If you want to have fast SciPy then you need to install BLAS, LAPACK and ATLAS yourself.

    Why not?
    If you work for a customer, you can create separate installation for customer-platform, don’t you?

    It’s not a criticizm – I suppose you’re more experienced in this area (and I’m not even novice yet)

    1. Thanks. I will track your blog
    2. I apologize for my English =)

  5. Shane Legg says:

    @ Alexander:

    The problem is that the top level of scipy is in python. Thus when you call some scipy matrix operations from pyrex code it has to go from compiled C into the python interpreter and then through a few levels of python wrapper code before going into the compiled BLAS libraries to perform the computation. Same coming back out from a matrix operation. As interpreted python is around 100 times slower than C, this wrapper code kills the performance when making a large number of calls to small matrices. Ideally somebody should make a pyrex wrapper for BLAS to avoid this overhead…

  6. Shane Legg says:

    @ Alexander:

    Regarding fast SciPy. The problem is that you need to use ATLAS to tune the BLAS and LAPACK code to your system and then compile it yourself for your specific hardware. The standard libraries you get aren’t optimised for specific hardware. At least that is what I understand… I haven’t bothered doing this myself yet.

  7. Pingback: Recent Links Tagged With "lapack" - JabberTags

Comments are closed.