SciPy – the embarrassing way to code

I’ve programmed in many languages before; indeed, I’ve spent at least a year working in Basic, C, C++, C#, Java, assembler, Modula-2, PowerHouse and Prolog.  One thing I’ve never done before is Matlab, well, except for a few basic exercises for some course I did way back.  A couple of years ago I started using Python, and more recently I’ve started to use the SciPy libraries, which essentially provide something similar to Matlab.  The experience has been unlike anything I’ve coded in before.  The development cycle has gone like this:

1) Write the code in python like I would write it in, say, java.  I have data stored in some places, then I have algorithms that iterate over these data structures computing stuff, calling methods, changing values and doing various complex things in order to implement the desired algorithm.  10 pages of code, somewhat general.

2) Then I realise that in a few places I don’t need to iterate over something, I can just use some vectors and work with those directly.  7 pages of code, a little more general.

3) Then I realise that part of my code is really just running an optimisation algorithm, so I can replace it with a call to an optimiser in one of the scipy libraries.  5 pages of code, and a bit faster now.

4) Then I try to further generalise my system and in the process I realise that really what I’m doing is taking a Cartesian space, building a multi-dimensional matrix and then applying some kind of optimiser to the space.  3 pages of code, very general.

5) Finally I’m like, hey, how far can I push this?  With some more thought and spending a few days trying to get my head around all the powerful scipy libraries, I finally figure out that the core of my entire algorithm can be implemented in an extremely general and yet fast way in just a few lines.  It’s really just a matrix with some flexible number of dimensions to which I am applying some kind of n-dimensional filter, followed by an n-dimensional non-linear optimiser on top of an n-dimensional interpolation and finally coordinate mapping back out of the space to produce the end results.  2 pages of code, of which half is comments, over a quarter is trivial supporting stuff like creating the necessary matrices, and just a few lines make the necessary calls to implement the algorithm.  And it’s all super general.
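The pipeline in step 5 might look roughly like the sketch below. This is not the original code: the synthetic grid, the choice of `gaussian_filter`, `map_coordinates` and Nelder-Mead, and the helper name `value_at` are all assumptions picked to illustrate the shape of such a program (filter, interpolate, optimise, map back out).

```python
import numpy as np
from scipy import ndimage, optimize

# Hypothetical data: a noisy 2-D grid sampled from a smooth bowl whose
# minimum sits at (x, y) = (0.3, -0.2). More dimensions work the same way.
rng = np.random.default_rng(0)
axes = [np.linspace(-1.0, 1.0, 41), np.linspace(-1.0, 1.0, 41)]
xx, yy = np.meshgrid(*axes, indexing="ij")
grid = (xx - 0.3) ** 2 + (yy + 0.2) ** 2 + 0.01 * rng.standard_normal(xx.shape)

# 1) n-dimensional filter to smooth out the noise.
smooth = ndimage.gaussian_filter(grid, sigma=2.0)

# 2) n-dimensional interpolation: evaluate the grid at fractional indices.
def value_at(index):
    coords = np.asarray(index, dtype=float).reshape(-1, 1)
    return ndimage.map_coordinates(smooth, coords, order=3, mode="nearest")[0]

# 3) Non-linear optimiser on top of the interpolation, started from the
#    lowest grid cell.
start = np.unravel_index(np.argmin(smooth), smooth.shape)
result = optimize.minimize(value_at, x0=np.array(start, dtype=float),
                           method="Nelder-Mead")

# 4) Map the fractional index back out to real coordinates.
best = [np.interp(i, np.arange(len(ax)), ax) for i, ax in zip(result.x, axes)]
print(best)
```

None of the calls mention the number of dimensions, which is where the generality comes from: a 3-D or 4-D `grid` with matching `axes` would go through the same lines.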

Now this is great in a sense. You end up throwing away most of your code now that all the real computation work is being done by sophisticated mathematical functions which are using optimised matrix computation libraries. The bottleneck in writing code isn’t in the writing of the code, it’s in understanding and conceptualising what needs to be done. Once you’ve done that, i.e. come up with mathematical objects and equations that describe your algorithm, you simply express these in a few lines of scipy and hit go.
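As a small illustration of code disappearing into library calls, the sketch below shows a loop collapsing into one vector expression and a hand-rolled search collapsing into one optimiser call. The `prices` data and the `loss` function are hypothetical, chosen only to keep the example short.

```python
import numpy as np
from scipy import optimize

prices = np.array([10.0, 10.5, 10.2, 10.8, 11.0])  # hypothetical data

# Java-style version: iterate and compute one return at a time.
loop_returns = []
for i in range(1, len(prices)):
    loop_returns.append(prices[i] / prices[i - 1] - 1.0)

# Vectorised version: the same computation as a single expression.
vec_returns = prices[1:] / prices[:-1] - 1.0

# Instead of writing a search loop, hand the objective to an optimiser.
# Here we look for the constant that minimises squared error to the returns.
def loss(c):
    return np.sum((vec_returns - c) ** 2)

fit = optimize.minimize_scalar(loss)
print(vec_returns, fit.x)
```

For this particular objective the answer is simply the mean of the returns, which makes it easy to check that the optimiser call is doing what the loop would have done.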

It’s not just with my financial software either. I recently implemented a certain kind of neural network using nothing but scipy and found that the core of the algorithm was just one line of code — a few matrix transformations and calls to scipy functions.  I hear that one of the IDSIA guys working on playing Go recently collapsed the code he’s been working on for six months down to two pages.
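To give a feel for how the core of a network can shrink to a single expression, here is a minimal sketch of a two-layer forward pass. It is not the network mentioned above; the layer sizes, the tanh activation and the names `X`, `W1`, `b1`, `W2`, `b2` are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes: 5 samples, 3 inputs, 4 hidden units, 2 outputs.
X = rng.standard_normal((5, 3))
W1, b1 = rng.standard_normal((3, 4)), np.zeros(4)
W2, b2 = rng.standard_normal((4, 2)), np.zeros(2)

# The whole forward pass: two matrix products and one element-wise function.
output = np.tanh(X @ W1 + b1) @ W2 + b2
print(output.shape)
```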

The downside to all this is that you spend months developing your complex algorithms and when you’re done you show somebody the result of all your efforts — a page or two of code.  It looks like something that somebody could have written in an afternoon.  Even worse, you start to suspect that if you had really known scipy and spent a few days carefully thinking about the problem to start with, then you probably could have coded it in an afternoon.  It’s a little embarrassing.


53 Responses to SciPy – the embarrassing way to code

  1. Арт Зарн says:

    Can you provide examples of the code from different cycles?

  2. Shane Legg says:

    It might be too embarrassing :-)

    In many cases whole chunks of my code went due to using things like multi-dimensional optimisation and interpolation libraries, for example scikits and ndimage. In other cases it was from very simple things such as realising that iterating over two arrays to work out which combinations were simultaneously 1 is the same as taking the outer product of these vectors.
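The outer-product remark can be sketched as follows, using two small made-up 0/1 vectors `a` and `b` (assumed here purely for illustration):

```python
import numpy as np

a = np.array([1, 0, 1])      # hypothetical indicator vectors
b = np.array([0, 1, 1, 1])

# Loop version: mark every (i, j) pair where both entries are 1.
both_loop = np.zeros((len(a), len(b)), dtype=int)
for i in range(len(a)):
    for j in range(len(b)):
        if a[i] == 1 and b[j] == 1:
            both_loop[i, j] = 1

# Vector version: for 0/1 data the outer product gives the same table.
both_outer = np.outer(a, b)
print(both_outer)
```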

    At some point we might release the source code for the simulator, but probably not until we have worked on it some more and written a couple of papers…

  3. Jonathan Feinberg says:

    What you’ve described, “The bottleneck in writing code isn’t in the writing of the code, it’s in understanding and conceptualising what needs to be done,” is common to all highly abstract programming languages. Writing Haskell, for example, involves an hour of meditation followed by the emission of a fold expression.

  4. I got used to coding like this on a summer course (way beyond my actual understanding level) on numeric methods for partial differential equations. We used Matlab back then.

    Then, in the next trimester, I took a neural networks course. The actual course assignments didn’t require much coding, but I spent at least a week writing a somewhat generalized backpropagation learning algorithm in terms of matrices.

    The code is unreadable now unless accompanied by its (now long lost) paper documentation, which consisted of some scribbling and hollow matrix shapes.

    Now, I find matrix-oriented programming fascinating and (I was told in my summer course) it’s great for automatic parallelization but it’s somewhat of an anti-pattern for most purposes (like financial code, for example).

    It also may lead you (Sapir-Whorf hypothesis-wise) to phrasing problems in inappropriate terms. The whole of economics spent the 20th century mis-describing the functioning of a modern economy in matrix form, and I’m not just referring to input-output analysis, but to the greater neo-Walrasian/Debreuvian program of general equilibrium.

    • chris b says:


      I hope this is a factor in bringing array languages to be more widely recognized and used. Good to hear that it garnered a mention by your course instructor.

      • chris b says:

        Whoops- I was trying to quote this: “it’s great for automatic parallelization”

  5. Shane Legg says:

    Dayvan: I agree that matrix code can get hard to understand, especially if you’re not sharp on the intuitions of all the different kinds of matrix operations. It’s probably for this reason that half my code is now comments.

    As for matrix analysis in general. Linear (matrix) algebra is often used in theory because it’s easy to algebraically solve. However, in my code I’m applying all sorts of non-linear transformations as well as the linear matrix stuff…

  6. pozorvlak says:

    Welcome to the wonderful world of array programming languages :-)

  7. I code PHP and have no idea of most of what you’re talking about. But my way of looking at it is that coding is best when it is a learning experience that enriches you. It is not the destination but the journey.

  8. Just as embarrassing as starting off with a huge slab of marble and ending up with a 17 foot statue of David.

    Minimalism is elegance. Arriving at minimalism takes a lot of work.

  9. artist says:

    use Perl;
    get Amazed;

  10. Craig DeForest says:

    Yup, that’s pretty common with the good vectorizing languages. PDL is the same way — folks start out using set/at (perl access routines for individual array elements) and at the end of the day have code that is 10x shorter and runs 1,000x faster than when they started.

    I think your insight about conceptualization goes back to the early 1970s, when a very few people managed to properly grok APL (everyone else kept hacking away with FORTRAN).

  11. Penis says:

    Perfection is achieved, not when there is nothing left to add, but when there is nothing left to take away… and it reduces the labor required to document the solution, as well.

    The flip side to this is that the best is the enemy of the good — perfection can take a very long time, especially during the initial implementation.

    (BTW, this happens to me all the time in R.)

  12. al says:

    It sounds a lot like the process in mathematics of refining a proof until it is short and elegant. Not embarrassing at all if you’re the one doing it. You have to start from a version of the problem and solution you can understand and then optimize the steps until something more clear emerges.

  13. I know exactly what you mean. I am not sure how vast the scipy documentation is, but I always come up against hurdles of too much, and too little information for modules.
    This is also why I post on forums to the developers of the library how they would go about and do it, because it usually takes them a few seconds to think of the best way to do it, and then you go “arrr, I see that” and don’t waste as much time.
    What is your proposed solution to this horrible problem that afflicts us all?

  14. Shane Legg says:

    Graham: Yes, besides getting my head around what some of these abstract mathematical libraries do, the biggest problem for me is usually a lack of documentation.

  15. Hugh Myers says:

    Switch to APL and drop the page count to .5 exclusive of comments. Of course opacity goes to 100%, but hey…

    –hsm

  16. Bruce Harris says:

    You would really like APL.

    You might try J, a modernization of APL, that makes it even more illegible.

    (I think APL was legible, just you needed to know it, to read it, like algebra proofs.)

  17. Pingback: Writing code is easy; designing software is hard at Pensieri di un lunatico minore

  18. Speaking of modernized APL, there is also the K language. The No Stinking Loops wrote a ray tracer in 7 lines of code.

  19. Cyde Weys says:

    Aww c’mon, no way can you write this blog entry and then not post your code! You know we all want to see it. Release it under the GNU GPL or something if you’re worried about how it might be used. The readers’ collective potential to benefit from seeing good scipy code outweighs any potential for embarrassment on your part (which is an unfounded fear, by the way).

  20. Tony says:

    Hey, at least you really understand what you are doing and have very clean and efficient code (that’s also easy to understand!).

    A lot of people would stop at 10 page long buggy code. Now _that’s_ embarrassing.

  21. Rob Steele says:

    You are now ready for R (http://www.r-project.org/). Go forth and multiply.

  22. andrew` says:

    Could you please link me to the guy who was doing the go programming?

  23. kd says:

    Hehe,

    Sounds like programming in R. I was explaining to my wife the other day what it was like to spend several days writing three lines of code. You’ve done a much more eloquent job of that here.

  24. Wayne says:

    These are the basics of writing code that many don’t appreciate.

    Version 1 – long bloated and buggy
    Version 2 – better than Version 1
    Version 3 – shorter than Version 2
    Version n – Perfection

  25. Anonymous says:

    You should not be embarrassed, you should be proud. As Feynman says: “If you can’t explain it to a six year old, you don’t really understand it.” Adapted to programming, this reads: “If you can’t code it in 2 lines, you don’t really understand it.”

  26. Randy MacDonald says:

    As an APL/J developer, my mantra is:
    . if you can say it, it’s done;

    or, if you will: if you need 2 lines to code it, you don’t understand it. Six-year-olds have short attention spans.

  27. Damian Eads says:

    As a seasoned developer of C/C++ for scientific code, once I introduced Python/numpy/scipy into the picture, it became much easier to prototype new ideas. Hard to vectorize, low-level code can be written in C/C++ and easily interfaced with Python/numpy. With some polishing, the prototype code makes its way into the final product. That’s hard to achieve with MATLAB–you often find yourself getting locked into slow, hard-to-maintain code in a language with inflexible semantics. One of the biggest headaches with MATLAB has got to be read-only parameters–when your data sets get larger, spanning into the gigabytes, you’re hosed due to copying.

    I also found the C/C++/Numpy/Scipy combination made it much easier to farm across a cluster because there is no need to worry about licenses, which means science gets done quicker! Not to mention, there are a number of cunning distributed tools that have been developed.

    Python can be troublesome for everyone when new programmers start using it. It gives you a lot of rope so you have to learn how to behave. I think other languages (e.g. OCaml) do a better job of getting new programmers to learn good discipline and behavior. Java, while statically typed, has been responsible for producing a generation of programmers who unnecessarily drown themselves in large class hierarchies.

    I believe one of the reasons behind numpy/scipy’s success is the avoidance of object-orientation where it’s unneeded. One finds themselves just passing matrices to functions very quickly and seamlessly. Other libraries require you to create a hierarchy of objects with complicated factory objects before a simple numerical task can be done. In this way, numpy/scipy makes it easy to be as productive as one would be with MATLAB. Now, Python still has object orientation (MathWorks claims MATLAB does but it is unusable), which is useful for high-level manipulative code, commonplace in large science projects.

    One reason why Numpy/Scipy has taken off more than other projects like Scilab and Octave is separation of concerns. The Python folks focus solely on the development and maintenance of the language, its interpreter, and its tools while the Scipy folks focus on the problem of numerical and scientific computation. The Octave and Scilab folks have to solve both problems, which spreads their work more thinly.

    My two cents,

    Damian

  28. abhinav says:

    I can totally relate to your experience. I just submitted my graduate thesis in which I implemented a code for computing the critical Rayleigh number. The entire code is just two pages and I spent 4 months developing, or rather shortening, it using Scipy libraries. Now the professors ask if that’s all I have done in four months. What can I say :)

  29. Russell Wallace says:

    Four months to produce two pages is embarrassing if you’re a writer, and some kinds of programming – generating reports in COBOL, say – can reasonably be considered a form of writing. I’d be embarrassed if I spent four months writing two pages of COBOL.

    But this type of programming isn’t like writing; what you’re doing is science.

    How long did it take Newton to produce three lines describing the laws of motion and one more line describing the law of gravity? Think of it in those terms and you might feel better :)

  30. Ivo says:

    Just when I started to like Scipy and its ability to do the Matlab work, I tried some trivial signal processing stuff, and realized that I cannot trust it. I wanted to find the variance of a signal …

    To my surprise, it produced negative or complex results, depending on the run. It is worth remembering that variance is a non-negative number.

    Here is a simple example, that involves just normal random variables and I did ‘from scipy import *’ :

    >>> n=random.randn(100)+random.randn(100)*sqrt(-1)
    >>> var(n)
    (-0.44836999382963283+0.12879192396840586j)

    >>> stats.var(n)
    (-0.45289898366629577+0.13009285249333927j)

    I was expecting to see a number close to 2:
    >>> mean(abs(n-mean(n))**2)
    1.9590482576219312

    Does anybody know what the ‘var’ function does?

  31. Ivo says:

    I contacted SciPy and I was told that they calculate variance as E[(x-mean(x))(x-mean(x))] instead of E[(x-mean(x))(x-mean(x))*], and that was an oversight on their part. It will be corrected for the next release.
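The difference between the two definitions can be checked directly. The sketch below uses present-day NumPy (an assumption; the session above ran an older SciPy), whose `np.var` is documented to take the absolute value before squaring for complex input. The variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
n = rng.standard_normal(100_000) + 1j * rng.standard_normal(100_000)
centered = n - n.mean()

# E[(x - mean(x))(x - mean(x))]: no conjugate, so the result is complex
# and, for this signal, close to zero rather than close to 2.
without_conjugate = np.mean(centered * centered)

# E[(x - mean(x))(x - mean(x))*]: with the conjugate the result is real
# and non-negative, close to 2 for this signal.
with_conjugate = np.mean(centered * np.conj(centered)).real

print(without_conjugate, with_conjugate, np.var(n))
```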

  32. Shane Legg says:

    Ivo: Yeah, it sounded like a bug in the SciPy code. Anyway, good to know it’s getting fixed.

  33. Damian Eads says:

    Ivo,

    Scipy has yet to package a 1.0+ release, and there is still much to do before it gets there. You should give back to the community by writing regression tests for Scipy. Scipy will only continue to improve if people volunteer their time and hard work. Or you can pay MATLAB, give nothing back to other scientists, and lock yourself into a proprietary work environment.

    Damian

  34. Ivo says:

    Damian,
    I did not use SciPy much at the time when I noticed that bug, so my first reaction was to ask around first. Following that, I contacted Travis Oliphant, and he suggested that I write a patch for the var() function, which I did and sent it directly to him. I do not know if that will find its way into the release or not, but I hope it will as that ‘bug’ is significant enough to make SciPy an unreliable tool for anyone who wants to do any processing.

    After that, I tried to write some more complex programs using SciPy, and it seems to work fine. I’ve noticed a few things that one has to worry about:

    1. There is an overlap between some matplotlib and scipy functions, so if one imports everything with a ‘*’, it is a good practice to import pylab first, and scipy second to override some functions that are not related to plotting.

    2. One has to be careful with data types used for arrays. Following Matlab logic, I did not care about the data types when I preinitialized some vectors (arrays) before their usage in loops, only to find that they can never become complex if I did not initialize them as complex.

    3. There might be a memory management issue with large matrices, as the program failed when I was using no more than 200 MB worth of data.

    Having said that, I like SciPy. If I encounter any problem that is worth fixing, I will be happy to write a patch (fix), but I am not good with procedures, so I will probably have to send it directly to whoever is on the scipy mailing list and hope that it will be sufficient.
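Point 2 in the list above, about preinitialized arrays, can be sketched like this with present-day NumPy (an assumption; the buffer names are made up). An array created without a dtype is real-valued and stays that way, so a buffer meant to hold complex results has to be declared complex up front.

```python
import numpy as np

# Default preallocation gives a real (float64) array; it will not silently
# turn complex later the way a Matlab vector would.
real_buf = np.zeros(4)
print(real_buf.dtype)

# Declaring the dtype at creation time lets the loop store complex values.
complex_buf = np.zeros(4, dtype=complex)
for k in range(4):
    complex_buf[k] = np.exp(1j * np.pi * k / 2)
print(complex_buf)
```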

  35. Damian Eads says:

    Hi Ivo,

    Thanks for submitting the patch and helping make Scipy better!

    “I do not know if that will find its way into the release or not, but I hope it will as that ‘bug’ is significant enough to make SciPy an unreliable tool for anyone who wants to do any processing”

    By that same logic, if one finds a bug in MATLAB, they could consider it unreliable to do “any processing”. It is not often I compute variances on complex data. I have found the var() function to work on the data I have, tested by comparing its output with values I know are correct. But really, all software has bugs, even MATLAB.

    Comparing MATLAB’s stability with Scipy is not a fair comparison since MATLAB has had 20 more years to fix its bugs. Many years ago when I was an avid MATLAB user, I contributed some bug reports to MathWorks that caused me some headaches. But most of my headaches with MATLAB have nothing to do with bugs. First, it’s a horribly designed language created by electrical engineers who should have taken a course in language design prior to writing the language. Second, it promotes bad programming style; some users write clean code but most just learn enough to quickly hack together a filter that no one else can read. Third, it’s hard to spread computation across machines due to licensing limitations, causing needless tension among colleagues who are all fighting for licenses. People end up cutting experiments and moving on so less science gets done. Fourth, it is harder to collaborate with others because your collaborators must purchase MATLAB licenses to use your code. Fifth, it is primarily an interactive environment rather than a batch environment, encouraging experimentation and statistical analysis which is not reproducible. Sixth, it is difficult to work with any data structure other than a matrix (e.g. a graph, hash map, tree, lists, heaps). This is frustrating when needing to write algorithms requiring richer data structures.

    Have you ever tried writing a large application in MATLAB? Are you familiar with MATLAB’s object oriented interface? It is unusable. Modifying a data member causes the object to be copied prior to modification. At some point, prototype code needs to make its way into a larger application. While most people intend to rewrite their prototype code in C or C++, it is often too costly to do so they end up writing their application in the same language as their prototype code, i.e. MATLAB.

    You should not be surprised if there is symbol overlap in packages. It would be very difficult for two projects to maintain mutual exclusion in their namespaces. In general, it is bad programming practice to import everything. It should only be done at the python prompt, not in programs.

    With regard to memory management, I was unable to handle large data sets with MATLAB. When Scipy came into the picture, I could handle much larger data sets. Also, I can pass very large arrays into functions and modify their contents without any copying. This is particularly helpful when dealing with a memory footprint that approaches the 2GB limit.
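A minimal sketch of the no-copy behaviour described above, assuming NumPy arrays and a made-up helper `scale_in_place`: the function receives a reference to the caller's buffer, and an in-place operator writes into it without allocating a second array.

```python
import numpy as np

def scale_in_place(arr, factor):
    """Multiply the caller's array by factor without allocating a new one."""
    arr *= factor  # in-place operator writes into the existing buffer

data = np.ones(1_000_000)
view = data[::2]            # slicing returns a view, not a copy

scale_in_place(data, 3.0)
print(data[0], view[0], np.shares_memory(data, view))
```

Writing `arr = arr * factor` inside the function would instead build a new array and leave the caller's data untouched, so the in-place form matters when memory is tight.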

    You raise a good point about complex data. However, MATLAB has its data type headaches as well. For example, if I have an int8 array (say an image), not all arithmetic operators are supported in MATLAB. Images often need to be converted to a double prior to doing arithmetic on them. That’s annoying, especially when you’re working with large images. I can do most arithmetic operations on int8 arrays in Scipy.

    Often, the semantics of Python/Scipy can be quite different from MATLAB. This causes some frustration when migrating from MATLAB but this is temporary. Eventually, I believe the committed learner should find the Python/Scipy framework to be much more flexible.

    Sorry for your bug–bugs are very frustrating. It is good that you submitted a patch. Thank you for doing this.

    Damian

  36. Ivo says:

    Damin, thank you for your long reply. I am not a computer scientist, but I have a PhD in ECE, and often I need to write simulations in Matlab. I did write some larger programs (e.g., a WiMAX simulation), but I did not use the OO framework. The only useful thing I found was writing MEX files to speed up the calculation.

    >> By that same logic, if one finds a bug in MATLAB, they could consider it unreliable to do “any processing”. It is not often I compute variances on complex data.

    The point was not that you do not trust some s/w if there is a bug, but the question of trust comes into play if something as basic as a variance calculation is not done properly. And the variance calculation was not considered a bug but just a different interpretation of what the variance definition is.

    Just out of curiosity, how is it possible that when you compare the output of var() function with matlab output you get the same thing? I am running Ubuntu 8.04 and the scipy that comes with it. Do you have a different version of SciPy?

    Could you try the lines I wrote on my message submitted on May 13?

  37. Ivo says:

    And, Damien, sorry for the typo when I wrote my name.

    Ivo Maljevic

  38. Ivo says:

    ARGH, correction again: sorry for the typo when I wrote your name (Damin instead of Damien).

    Also, I realize that you said that you tested var() with your own data, which does not necessarily mean that you tested it with complex data.

  39. Pingback: Miscellany II « QED

  40. My python motto:

    “Perfection is attained not when there is nothing left to add but when there is nothing left to remove.”

    - Antoine de St. Exupery

  41. Herzog says:

    You ought to try Mathematica some day. To me, Matlab feels verbose compared to Mathematica. ;-) In Mma it’s a lot of thinking about how to join certain list (or tree) transformation operations, then writing those three to ten lines of code that’ll solve your problem.

  42. Yeah, it’s quite a bit embarrassing. This is a great lesson and will not to try. Thanks a lot, Bos Shane.

  43. Pingback: Coding well is embarrassing « Sho Fukamachi Online

  44. kbob says:

    “I’m sorry I wrote such a long letter. I did not have time to write a short one.” — Abraham Lincoln(?)

    Seems apt.

  45. df says:

    kbob,

    It was Blaise Pascal that said “The present letter is a very long one, simply because I had no leisure to make it shorter. ”

    See: http://books.google.com/books?id=hUMRAAAAYAAJ&pg=PA339#v=onepage&q=&f=false

    in the penultimate paragraph.

  46. Barney says:

    Anyone can create complicated code, very few can write simple code.

  47. Pingback: Habit 3: Complexity Demonstrates Intelligence | Lessons of Failure

  48. Pingback: Miscellany II | A Dense Subset
