With over 20,000 visitors in 2 days I discovered what my poor web server’s maximum capacity is: about 70 visitors a minute. It doesn’t sound like that much to me, but a number of times I had to ask my ISP to resuscitate the thing. I guess all the interpreted PHP code and database access required to build each page is quite computationally expensive. Anyway, I thought I’d write a short addendum for anybody who is still interested.
First of all, things like SciPy are obviously good tools only for certain types of problems. If you’re writing a front-end system for a corporate database or a text editor, forget about it. That said, plain Python is a pretty good way to code most things, being a language that is somehow both concise and simple. In my opinion the only real downside is speed: it’s about 100 times slower than C, or around 50 times slower than Java. Anyway, that’s another topic; here I’m more focused on using Python in combination with SciPy. If you’re dealing with complex transformations on data that has many dimensions, for example simulating networks or machine learning, SciPy is likely to be a good tool for the job.
One of the most popular comments on reddit basically said that while writing a program to use matrices can produce very short code, it’s far too difficult to debug. I don’t buy this. Firstly, in my experience, if you get the matrix computations wrong the output is normally complete junk. That’s good news, because the really hard bugs to fix are the ones where the output is almost right, or is right most of the time. You can also be pretty confident that all the fancy algorithms doing efficient matrix multiplications etc. are correct. Furthermore, because the code is so short, there are only a few places you have to look to work out what you’re doing wrong. Finally, if you’re working in areas such as machine learning or finance, the normal way to describe things is already matrix notation. If you want to, say, construct a non-parametric density estimate using a multivariate Gaussian kernel, then your starting point is going to be a set of matrix equations.
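To make that concrete, here is a minimal sketch of such an estimator written directly as matrix operations in NumPy. The function name, the bandwidth value and the toy data are my own illustrative choices, not from any particular library:

```python
import numpy as np

def kde_at(x, data, h=0.5):
    """Multivariate Gaussian kernel density estimate, evaluated at point x.

    data is an (n, d) array of samples; h is a scalar bandwidth
    (0.5 is just an illustrative default, not a recommended choice).
    """
    n, d = data.shape
    diff = data - x                        # (n, d) differences via broadcasting
    sq_dist = np.sum(diff ** 2, axis=1)    # squared Euclidean distance to each sample
    norm = (2 * np.pi * h ** 2) ** (-d / 2.0)
    return norm * np.mean(np.exp(-sq_dist / (2 * h ** 2)))
```

The whole estimator is three lines of actual maths, and each line maps directly onto a term in the textbook formula, which is exactly why mistakes tend to show up as obvious junk rather than subtle drift.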
What are the problems with SciPy? As one of the commenters on my last post pointed out, one major problem is documentation. This seems to be common with open source projects: people like to write code but they don’t like to write documentation. The basics are covered ok, but as you get further into the libraries it’s not so good. And it’s not just that documentation is missing; the structure of the whole SciPy system is confusing. For example, let’s say you want to interpolate in multiple dimensions. You go to the interpolate module and find that it only handles up to 2 dimensions. So you’re out of luck, right? No, the trick is to look in the image processing library ndimage, which has interpolation in as many dimensions as you want. For a new user this is all really confusing. Hopefully these things will improve as the whole system matures. As the documentation is wiki style, once I’m a bit more knowledgeable I’m going to do my part and help out in a few places. It’s really the least I can do.
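For the record, this is the sort of thing I mean: `scipy.ndimage.map_coordinates` doing linear interpolation on a 3-dimensional grid. The grid function here is a made-up example chosen so the right answer is obvious:

```python
import numpy as np
from scipy import ndimage

# A 3-D grid sampling the (made-up) linear function f(i, j, k) = i + 2j + 3k
# at integer indices.
grid = np.fromfunction(lambda i, j, k: i + 2 * j + 3 * k, (5, 5, 5))

# Interpolate at the fractional coordinate (1.5, 2.25, 0.5).
# map_coordinates takes one row per dimension, one column per point.
coords = np.array([[1.5], [2.25], [0.5]])
value = ndimage.map_coordinates(grid, coords, order=1)  # order=1: linear
```

Because the test function is linear, order-1 interpolation recovers it exactly (1.5 + 4.5 + 1.5 = 7.5), and the same call works unchanged in any number of dimensions. Nothing about this API says “interpolation”, which is why it’s so hard to find.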
Then there is the question of performance. I’ve been told that the standard Ubuntu SciPy package doesn’t use particularly efficient matrix libraries. If you want fast SciPy you need to install BLAS, LAPACK and ATLAS yourself. Using ATLAS you can then compile the libraries so that they are automatically optimised for your particular CPU, memory cache sizes, etc. You can then get SciPy to use these optimised libraries. Just look under the SciPy installation instructions for the details. I haven’t done this yet, but people who have report significant speed increases.
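Before going to the trouble of compiling ATLAS, it’s worth checking what your NumPy build is already linked against. NumPy can print its own build configuration (the exact output varies from one installation to the next):

```python
import numpy as np

# Print the BLAS/LAPACK libraries this NumPy build was compiled against.
# A stock distribution package may show only the reference libraries;
# a build that found ATLAS should list atlas entries here.
np.__config__.show()
```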
If your code makes relatively few calls (say tens of thousands or less) on relatively big matrices (say with thousands of elements), then (depending on what calls you’re making) your code will probably spend most of its time busy inside the matrix library. With an optimised library you’re then doing pretty well in terms of performance. You might want to look at things like Parallel Python to take the performance to the next level.
Where performance becomes a problem is when you have the reverse situation: many calls (millions) on relatively small matrices (hundreds of elements or less). What tends to happen then is that all the computation time is spent going from Python into the library and back out again; the time spent actually inside the library doing computations is negligible. In this case your code would be much faster if it were written in C++ and called the BLAS libraries directly. Other than trying to rethink your algorithm so that each matrix operation does a bigger chunk of the total work, I’m not sure what to do about this. Using something like Pyrex doesn’t really help all that much (Pyrex takes Python with a few extras and compiles it into C, which can then be compiled into a binary library that Python can use). The problem is that even in Pyrex code you still have the overhead of going into and back out of the SciPy library. You can directly access matrices (ndarrays) from within Pyrex in a way that is very efficient, but then you’re left writing the matrix code yourself: fine if you want to do something trivial, but not a good idea if you need to efficiently multiply two matrices. So in short, if you need to make a large number of calls on relatively small matrices, I’m not sure how you can do this in a way that compares to C++ with BLAS. If I ever find a solution, I’ll let you know.
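A rough (and deliberately unscientific) way to see the overhead for yourself is to time many tiny multiplications against one big one. The absolute numbers depend entirely on your machine and libraries, but the tiny-matrix loop spends most of its wall-clock time on per-call overhead rather than arithmetic:

```python
import time
import numpy as np

# 10,000 products of tiny 8x8 matrices: the Python-to-library round trip
# dominates; the arithmetic itself is almost free.
small = np.random.rand(8, 8)
t0 = time.perf_counter()
for _ in range(10000):
    np.dot(small, small)
t_many_small = time.perf_counter() - t0

# One product of 1000x1000 matrices: roughly 10^9 floating point
# operations, with nearly all of the time spent inside the BLAS routine.
big = np.random.rand(1000, 1000)
t0 = time.perf_counter()
np.dot(big, big)
t_one_big = time.perf_counter() - t0

print("10,000 small calls: %.4fs, one big call: %.4fs" % (t_many_small, t_one_big))
```

Dividing the first timing by 10,000 gives a ballpark figure for the fixed cost of a single call, which is the cost you pay millions of times over in the bad case described above.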