Over the past two years, my scientific computing toolbox has been steadily homogenizing. Around 2010 or 2011, my toolbox looked something like this:
- Ruby for text processing and miscellaneous scripting;
- Ruby on Rails/JavaScript for web development;
- Python/Numpy (mostly) and MATLAB (occasionally) for numerical computing;
- MATLAB for neuroimaging data analysis;
- R for statistical analysis;
- R for plotting and visualization;
- Occasional excursions into other languages/environments for other stuff.
In 2013, my toolbox looks like this:
- Python for text processing and miscellaneous scripting;
- Ruby on Rails/JavaScript for web development, except for an occasional date with Django or Flask (Python frameworks);
- Python (NumPy/SciPy) for numerical computing;
- Python (Neurosynth, NiPy etc.) for neuroimaging data analysis;
- Python (NumPy/SciPy/pandas/statsmodels) for statistical analysis;
- Python (MatPlotLib) for plotting and visualization, except for web-based visualizations (JavaScript/d3.js);
- Python (scikit-learn) for machine learning;
- Excursions into other languages have dropped markedly.
You may notice a theme here.
The increasing homogenization (Pythonification?) of the tools I use on a regular basis primarily reflects the spectacular recent growth of the Python ecosystem. A few years ago, you couldn’t really do statistics in Python unless you wanted to spend most of your time pulling your hair out and wishing Python were more like R (which is a pretty remarkable confession, considering what R is like). Neuroimaging data could be analyzed in SPM (MATLAB-based), FSL, or a variety of other packages, but there was no viable full-featured, free, open-source Python alternative. Packages for machine learning, natural language processing, and web application development were only just starting to emerge.
These days, tools for almost every aspect of scientific computing are readily available in Python. And in a growing number of cases, they’re eating the competition’s lunch.
Take R, for example. R’s out-of-the-box performance with out-of-memory datasets has long been recognized as its Achilles heel (yes, I’m aware you can get around that if you’re willing to invest the time–but not many scientists have the time). But even people who hated the way R chokes on large datasets, and its general clunkiness as a language, often couldn’t help running back to R as soon as any kind of serious data manipulation was required. You could always laboriously write code in Python or some other high-level language to pivot, aggregate, reshape, and otherwise pulverize your data, but why would you want to? The beauty of packages like plyr in R was that you could, in a matter of 2 – 3 lines of code, perform enormously powerful operations that could take hours to duplicate in other languages. The downside was the steep learning curve associated with each package’s often quite complicated API (e.g., ggplot2 is incredibly expressive, but every time I stop using it for 3 months, I have to completely re-learn it), and having to contend with R’s general awkwardness. But still, on the whole, it was clearly worth it.
Flash forward to The Now. Last week, someone asked me for some simulation code I’d written in R a couple of years ago. As I was firing up RStudio to dig around for it, I realized that I hadn’t actually fired up RStudio for a very long time prior to that moment–probably not in about 6 months. The combination of NumPy/SciPy, MatPlotLib, pandas and statsmodels had effectively replaced R for me, and I hadn’t even noticed. At some point I just stopped dropping out of Python and into R whenever I had to do the “real” data analysis. Instead, I just started importing pandas and statsmodels into my code. The same goes for machine learning (scikit-learn), natural language processing (nltk), document parsing (BeautifulSoup), and many other things I used to do outside Python.
It turns out that the benefits of doing all of your development and analysis in one language are quite substantial. For one thing, when you can do everything in the same language, you don’t have to suffer the constant cognitive switch costs of reminding yourself say, that Ruby uses blocks instead of comprehensions, or that you need to call len(array) instead of array.length to get the size of an array in Python; you can just keep solving the problem you’re trying to solve with as little cognitive overhead as possible. Also, you no longer need to worry about interfacing between different languages used for different parts of a project. Nothing is more annoying than parsing some text data in Python, finally getting it into the format you want internally, and then realizing you have to write it out to disk in a different format so that you can hand it off to R or MATLAB for some other set of analyses*. In isolation, this kind of thing is not a big deal. It doesn’t take very long to write out a CSV or JSON file from Python and then read it into R. But it does add up. It makes integrated development more complicated, because you end up with more code scattered around your drive in more locations (well, at least if you have my organizational skills). It means you spend a non-negligible portion of your “analysis” time writing trivial little wrappers for all that interface stuff, instead of thinking deeply about how to actually transform and manipulate your data. And it means that your beautiful analytics code is marred by all sorts of ugly open() and read() I/O calls. All of this overhead vanishes as soon as you move to a single language.
Convenience aside, another thing that’s impressive about the Python scientific computing ecosystem is that a surprising number of Python-based tools are now best-in-class (or close to it) in terms of scope and ease of use–and, by virtue of C bindings, often even in terms of performance. It’s hard to imagine an easier-to-use machine learning package than scikit-learn, even before you factor in the breadth of implemented algorithms, excellent documentation, and outstanding performance. Similarly, I haven’t missed any of the data manipulation functionality in R since I switched to pandas. Actually, I’ve discovered many new tricks in pandas I didn’t know in R (some of which I’ll describe in an upcoming post). Considering that pandas considerably outperforms R for many common operations, the reasons for me to switch back to R or other tools–even occasionally–have dwindled.
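As a rough illustration of that ease of use (written against scikit-learn’s current module layout, which differs slightly from the original API), cross-validating a reasonably powerful classifier takes only a few lines:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Fit and score a random forest with 5-fold cross-validation
X, y = load_iris(return_X_y=True)
clf = RandomForestClassifier(n_estimators=100, random_state=0)
scores = cross_val_score(clf, X, y, cv=5)
print(round(scores.mean(), 3))
```

Swapping in a different algorithm means changing one import and one constructor call; everything else in the fit/predict/score interface stays the same.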
Mind you, I don’t mean to imply that Python can now do everything anyone could ever do in other languages. That’s obviously not true. For instance, there are currently no viable replacements for many of the thousands of statistical packages users have contributed to R (if there’s a good analog for lme4 in Python, I’d love to know about it). In signal processing, I gather that many people are wedded to various MATLAB toolboxes and packages that don’t have good analogs within the Python ecosystem. And for people who need serious performance and work with very, very large datasets, there’s often still no substitute for writing highly optimized code in a low-level compiled language. So, clearly, what I’m saying here won’t apply to everyone. But I suspect it applies to the majority of scientists.
Speaking only for myself, I’ve now arrived at the point where around 90 – 95% of what I do can be done comfortably in Python. So the major consideration for me, when determining what language to use for a new project, has shifted from “what’s the best tool for the job that I’m willing to learn and/or tolerate using?” to “is there really no way to do this in Python?” By and large, this mentality is a good thing, though I won’t deny that it occasionally has its downsides. For example, back when I did most of my data analysis in R, I would frequently play around with random statistics packages just to see what they did. I don’t do that much any more, because the pain of having to refresh my R knowledge and deal with that thing again usually outweighs the perceived benefits of aimless statistical exploration. Conversely, sometimes I end up using Python packages that I don’t like quite as much as comparable packages in other languages, simply for the sake of preserving language purity. For example, I prefer Rails’ ActiveRecord ORM to the much more explicit SQLAlchemy ORM for Python–but I don’t prefer it enough to justify mixing Ruby and Python objects in the same application. So, clearly, there are costs. But they’re pretty small costs, and for me personally, the scales have now clearly tipped in favor of using Python for almost everything. I know many other researchers who’ve had the same experience, and I don’t think it’s entirely unfair to suggest that, at this point, Python has become the de facto language of scientific computing in many domains. If you’re reading this and haven’t had much prior exposure to Python, now’s a great time to come on board!
Postscript: In the period of time between starting this post and finishing it (two sessions spread about two weeks apart), I discovered not one but two new Python-based packages for data visualization: Michael Waskom’s seaborn package–which provides very high-level wrappers for complex plots, with a beautiful ggplot2-like aesthetic–and Continuum Analytics’ bokeh, which looks like a potential game-changer for web-based visualization**. At the rate the Python ecosystem is moving, there’s a non-zero chance that by the time you read this, I’ll be using some new Python package that directly transliterates my thoughts into analytics code.
* I’m aware that there are various interfaces between Python, R, etc. that allow you to internally pass objects between these languages. My experience with these has not been overwhelmingly positive, and in any case they still introduce all the overhead of writing extra lines of code and having to deal with multiple languages.
** Yes, you heard right: web-based visualization in Python. Bokeh generates static JavaScript and JSON for you from Python code, so your users are magically able to interact with your plots on a webpage without you having to write a single line of native JS code.
There is also now a ggplot clone for python: http://blog.yhathq.com/posts/ggplot-for-python.html and https://github.com/yhat/ggplot
What is the state of installing that set of tools on different platforms? Any easier nowadays?
Jan, thanks, forgot about that!
Domen, I think it’s pretty smooth sailing… the core NumPy/SciPy can occasionally cause a few compilation problems on some platforms, but I’ve never had issues with any of the other packages I mentioned when compiling from source or installing via pip… but others’ mileage may vary of course.
And if you need anything for windows: http://www.lfd.uci.edu/~gohlke/pythonlibs/
Just FYI it is on the Bokeh roadmap to integrate with yhat’s python ggplot library so that folks who want a ggplot interface that targets the browser can have the best of both worlds.
Python may be eating other languages’ lunch now, but really, the growth is in JavaScript. I’ve managed to avoid having to learn Java, C++, PHP, Python, Haskell and all the obscure JVM languages – I can get stuff done in Bash/gawk/sed, Perl and R. But I can’t avoid JavaScript, and I don’t think Python will escape its relentless march.
Oh, yeah, what about Ruby? I learned enough to get by, but really, I’ve pretty much given up on it.
For the installation things, I definitely recommend Anaconda 🙂
https://store.continuum.io/cshop/anaconda/
The best thing since sliced bread!
Thanks for the article!
Does Python (or Python libraries) support vectorized operations and indexing as done in R? For example: x[232:1733, 7] <- NA, etc.?
I agree 100% with the argument that it’s good to minimize switch costs. But it’s worth pointing out that IPython makes interfacing with R very easy (through the cell magic system in the notebook), to the point where literally every line of R code I write these days starts with “lmer(“.
Maybe it’s just me, but I feel as if my toolbox gets narrower naturally over time, as my efficiency within one language (for me, R) outweighs my perceived benefit of learning the basics of another language. Could be a function of age / years out of college too, an environment in which we were forced to learn multiple languages…
Django is the killer app for me, so now all my python analyses just get integrated into web application (for internal lab use mostly) but sometimes for data sharing as well. The Django ORM has become omnipresent in my analysis itself.
Does Python have a package to write backtests for trading systems that are written and maintained by professional traders and portfolio managers? Let me know when Python gets something remotely comparable to quantstrat, which is also embedded with C++ in the bottlenecks (and will get even faster also).
Yes Allan, you can do that with NumPy.
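For instance, here’s a rough NumPy translation of the R snippet above–bearing in mind that NumPy uses 0-based, half-open indexing, and that NaN plays the role of NA in float arrays:

```python
import numpy as np

x = np.ones((2000, 10))
# R's x[232:1733, 7] <- NA covers rows 232..1733 (inclusive) of column 7.
# Shifted to 0-based indices with a half-open end, that becomes:
x[231:1733, 6] = np.nan
print(np.isnan(x[:, 6]).sum())
```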
Nice post. Actually, having used R for many years, the roadblock for switching to python isn’t in the data ingestion and munging (pandas is awesome), it’s in the statistical functionality. For example, there is no good implementation of Cox regression. The interface is also not consistent yet (patsy isn’t part of scikit-learn yet, so a single formula mechanism to automatically create dummy variables from categorical variables isn’t there yet). The breadth of R is still a deal-breaker. Also, until very recently, trellis graphics weren’t easy (though bokeh and yhat’s ggplot is making it easier).
However I feel in 3-5 years, Python can be rich enough if (a) IDEs akin to RStudio appear (unifying script, graphics, version control, build tools, file management), (b) people invest time in porting or creating statistical functionality in Python, (c) packaging and distribution through github is made easy by software (akin to devtools and the package.skeleton available in R). We’re very close in Python, but need someone (Hadley Wickham’s Python equivalent) to invest time and effort to making things easier for users. Wes McKinney has helped us a lot, but more needs to be done.
I’m very high on the IPython notebook as the central data analysis platform. It allows both python development and ease of use through the magic functions, and makes disseminating reports trivial through nbviewer. I really think it can be the de facto analytics platform in Python, allowing for reproducible research and dissemination.
Thanks a lot, Tal, for this informative summary on scientific computing with python. I am new to python and pandas, love it! I am glad to find a new tool here named statsmodels.
Edward, yeah, I guess right now there’s no escaping JS for anything client-side. That said, you can avoid writing native JS by using one of the alternate syntaxes. E.g., I write almost all my JS in CoffeeScript, which preserves most of the nice Python and Ruby idioms and is generally very Pythonic.
As for Ruby–personally I prefer its syntax to Python’s, but it really can’t compete with Python’s scientific computing ecosystem. So I almost never use Ruby for anything any more–except to maintain legacy code I wrote in Rails. And really, the two languages are so similar in most respects that it doesn’t make much of a difference anyway.
Michael–thanks, I didn’t realize the extent of the IPython magic; will give that a shot! lmer() is pretty much the only thing I’m missing in Python at this point.
DMac, I think there’s always that fundamental exploitation/exploration tradeoff, and it really does depend on how proficient you are in a given language and what you need to do. If I didn’t do a lot of web development and general-purpose data munging, I might never have had a reason to learn Ruby or Python, and R would have been a perfectly sensible language to stick with.
RickG, That’s basically how I feel about Rails, and is the main reason I still cling to Ruby for most of my web development, even though I’d be much better off switching to Python so I can have a unified back-end (I’ve tried getting into Django, but I don’t really like it–I’ll probably use Flask more going forward).
Ilya, I can’t speak to that, but as I explicitly said, I don’t doubt that there are many domains where Python just won’t cut it. My claim is just that for most (though again, not all) scientists, Python is the environment of choice.
Abhijit, thanks, that’s a great summary of what it would take to convert more R users. I hope some Python devs are reading! And thanks for mentioning the IPython notebook–I forgot to mention that as one of the biggest selling points!
In my experience, watching the care and feeding of larval-stage AI researchers, research groups rarely manage to move past whatever environment their PI decided on as a freshly-hired junior faculty member. It doesn’t take very long to develop a substantial investment in a particular set of tools (both in terms of code developed and also in terms of graduate-student culture) to the point that it is rarely cost-effective to switch to something else. In my lab we have groups who do everything in Scheme, a group that does old-school NLP in Lisp (on a Lisp Machine, even), vision groups that do everything in MATLAB, statistical NLP groups that do everything in Python, theoreticians who can’t do anything without Mathematica, and one of our faculty is one of the people behind Julia, so that’s starting to get some traction now.
Ilya, have a look at Pandas. It was written by a quant finance guy. It’s incredibly fast too. http://pandas.pydata.org/
I second Anaconda. I recently stumbled onto it and I think it is amazing… many libraries by default, including ipython/notebook, scipy, numpy.
I don’t know your exact need for web scraping, but I thought you might want to take a look at Scrapy (http://scrapy.org). I find it much more intuitive, explicit and readable than BeautifulSoup.
Allan Miller : pandas also has slicing (both integer position based and by index labels): http://pandas.pydata.org/pandas-docs/dev/indexing.html
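A small sketch of both styles, mirroring the R assignment from the earlier comment (note that `.iloc` slices are half-open, while `.loc` label slices are inclusive; the frame here is invented for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(50, dtype=float).reshape(10, 5),
                  columns=list("abcde"))

# Integer-position based, like R's x[2:5, 3] <- NA (0-based, half-open)
df.iloc[1:5, 2] = np.nan

# Label-based: address the column by name; label slices are inclusive
df.loc[7:9, "e"] = np.nan
print(df.isna().sum().to_dict())
```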
I think Python is fine for hobbyists, amateurs and those who need some scripting ability but who are not programmers. I still think that you need Java for real software development. Its steeper learning curve is more than compensated for by its power and reach. All this JavaScript hype is another distraction; it’s a truly atrocious language.
@Ilya Kipnis yes, I have a former colleague who now works for a company developing high-frequency trading software, and although most of their code is in C# (the speed is REALLY important for them), for most of their ad-hoc deep analysis they use NumPy. Obviously, their code is completely proprietary and they will never ever let anybody see it (also, most of the Python is as I understand, ad-hoc, so it doesn’t even make much sense to publish it).
R has OpenCPU, thanks to Jeroen Ooms. Is there a Python equivalent?
I’m thinking about all my colleagues who would like to switch to R, but delay and delay, not least because the ugly syntax and bad error handling imply a steep learning curve. I think for many, ggplots are the main temptation. Who knows, maybe they will switch to Python later and I’ll be stuck with R because I switched earlier.
Also, nobody mentioned Julia, how does it compare here?
There’s also Sage, which is mathematical software (kinda similar to Mathematica) built on Python, with a scripting language that is mostly Python.
It uses NumPy and matplotlib, and some of its functionality is implemented in C (using Cython), R, and Fortran (iirc), and can also interface with Matlab, Octave, Mathematica, Maple, etc if they’re installed.
Also, the interface is browser-based: it creates a web server and opens a browser to display it, so if you’re not at home and want to show Sage and your programs to a friend (and your home router is configured correctly), you just have to point a web browser to your home’s IP and the Sage interface will show up.
@Ilya re: quantstrat – have you seen zipline (used by Quantopian)?
https://github.com/quantopian/zipline
https://www.quantopian.com/
Do check out http://julialang.org an upcoming language specifically targeted at the scientific/math community.
Great article. As a long time R user, it tempts me to find out what I’m missing now that pandas, matplotlib and those other graphics packages are maturing. But the thing I miss most whenever I switch to Python is Emacs/ESS and org mode. Does anyone have a good, up-to-date reference for configuring Python to work with Emacs?
@landis http://www.emacswiki.org/emacs/PythonProgrammingInEmacs (and I am a vim user :))
Some very interesting points there and many I agree with. When I started working on text processing some 4 years ago, Perl was still in vogue, but as of today Python seems to have caught up. As I paused my research in that domain and moved on to social network analysis a year later, I started to realise that all the statistical analysis my older and more experienced colleagues were still doing in Matlab or R could actually be done in Python.
I’m aware of more and more people in neuroscience now privileging python and I think it’s good for science and reproducibility, even though I see many of them being terrified of leaving the calm and known waters of Matlab.
As for me, I’m still new to sna and when I began searching for open source sna tools I was immediately referred to R. I am now using python for data pre-processing, data analysis and plotting. And for machine learning there’s obviously scikit-learn! As you say, what’s astonishing is the pace at which new tools for data processing, analysis and plotting are made available for python for different research domains. Will it be the ultimate scientific programming ecosystem of the future? Who knows 🙂
Isn’t the shift from R to Python a bit like the mountain coming to Mohammed? Wouldn’t it be much easier to fix R’s memory problems and homogenize on R than to create IDEs and recreate the myriad data and statistical tools for/in Python?
@Armchair
I don’t know if it would be easier. Have you read this?
https://github.com/tdsmith/aRrgh
Or type “Brian Ripley” into Google…
The author says:
R for statistical analysis;
R for plotting and visualisation;
Python/Numpy (mostly) and MATLAB (occasionally) for numerical computing.
That’s fine with me; I do stat analysis, plotting and visualisation.
Nice summary,
I’ve been using R/S-Plus for ~12 years, and for work using the Python ecosystem for about a year. There are moments where Python has me shouting “this is awesome” and times where it just has me shouting. PyTables is a great feature, as is PyMC. Pandas is great most of the time, but I sometimes get some screwy output and the syntax isn’t as straightforward as I would like (plyr/dplyr in R still win this one, for me).
The three biggest obstacles for me with Python are OOP, performance, and parallel processing. OOP: just too formal for me when I am essentially using a REPL. Functional programming practices are, to me, much more natural with on-demand computation, and better handled in R. Performance: Python is better than R, but still pretty bad in production compared to the more standard production languages like C++ and Java. This has me starting to lean towards Clojure, since the same language I use for munging and exploring can be directly implemented into the production environment, no translation needed. My most recent project in Python has been dealing with parallel processing/asynchronous computing. Simply put, not fun at all. The necessity for OOP here makes things unnecessarily complex; there just isn’t much baked in. Clojure wins again here, big time.
Python has grown in leaps and bounds, but I still think there is room to go before it’s all aboard. To me, considering the direction of more and more cores in computers, the poor parallel processing in Python is a deal-breaker for the near future, outside of work requirements.
@Matthew did you try Cython? Examples in http://conference.scipy.org/proceedings/SciPy2009/paper_2/full_text.pdf sounds quite impressive to me. However, I have no experience with any scientific computation, so I may be very wrong.
@Matej
I have used Cython. It is very impressive compared to standard Python. A few caveats: it doesn’t play nicely with all modules, and it requires some experience to optimize well. There is a very convenient framework for using it with Sage. That said, for the same problem, using naive Cython and naive Rcpp in R, I found Rcpp to be faster; plus it is much easier to break up the data in chunks and process in parallel with R than Python. But yes, Cython is fantastic for a huge class of problems, though it will take a little time to master.
“Eating other languages’ lunch?”
The popularity of Python has decreased YoY in 2013. Additionally, Python was the number 5 programming language in the world back in 2007 – now it’s number 8.
I think that Python has reached a point where it’s shrinking rather than growing.
Note: I use mainly PHP and I think that PHP has many flaws but is much more practical than Python. Yes – the latter is more powerful – but it’s more rigid and more complicated.
I think some of the commenters are missing the context here. I’m not arguing that Python is the best language for any single task, or for general-purpose software development; I’m saying that, to my mind, the Python ecosystem offers a currently unparalleled combination of flexibility, accessibility, and performance specifically in the realm of scientific computing.
Fadi, if you prefer PHP for web development, that’s defensible, but I don’t think anyone would consider PHP a remotely viable option for scientific computing. Rob Endover, the same goes (to a lesser degree) for Java–it may or may not be a better language for “real” software development (whatever that means), and I don’t doubt that in specific scientific applications, writing code in Java will make much more sense than writing code in Python. But I think the proportion of scientists who use Java for day-to-day data munging, statistical analysis, and visualization is vanishingly small.
Armchair Guy, if all you ever do is statistical analysis and visualization, then sure, your time may be better spent figuring out how to patch R than switching to Python. But the benefit of Python is that it’s a general-purpose language with far better support for almost anything else you might need to do outside of statistics and plotting. As I wrote above, personally I’d much rather do my statistical analysis in the same environment as my web development, document parsing, and neuroimaging data analysis than use a different tool for each job. Your mileage may vary, of course.
I’d agree with the trend the author has pointed out. We are working in the geo-science domain and Python has pretty much got us covered. From wrappers to interact with very large datasets (python-netcdf4)*, to analysis tools (pandas, numpy), to desktop/documentation visuals (matplotlib), to chaining processes (vistrails), to web-based visualisation (geonode, which is Django-based), there is very little Python cannot do. For intensive desktop GIS, there is QGIS (written in C++) which allows extensibility through Python scripting.
(*and for the posters who think you cannot run parallel processes in Python for dealing with multi-gig sized datasets; well, you certainly can!)
I think for people who come from real programming languages to R, the language is completely insufferable. You can’t get around the fact that the language is designed in a hackish, procedural style. Objects tacked on in packages? Come on. Even basic things like array slices are broken (e.g. A[i:j] will never return an empty list for any values of i and j. A[i] will never complain if i is negative, no matter how negative).
Also, people always talk about R’s visualization tools. This really confuses me, as I don’t in general see anything stronger about R’s visualization versus Python or Matlab. Actually, can someone tell me if there is a way to zoom in on an R plot? In Matlab, you type plot(1:10) and you get a figure that you can zoom in and out of to your heart’s content. Constantly replotting to investigate your data is moronic. I can’t speak to ggplot2 but the default plot in R is in general horribly primitive compared to Matlab’s plot.
Yes, there are many packages for R, and I’m sure some are excellent and save you tons of time. But in general it’s hard to be confident of them. Many of the packages are just written by one random person, and in some cases the person hasn’t updated it in a while.
The part that bothers me is that sometimes these packages are fundamental things. For instance, multiprocessing in R, despite the comment above, is absolutely horrible compared to Python. Python has one unified, very clean way of doing both multithreading and multiprocessing. In R there are at least half a dozen packages purporting to do variants of different things, and they are not great. Multiprocessing is not something you will write from scratch.
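A minimal sketch of the kind of unified interface this comment alludes to: Python’s standard-library concurrent.futures exposes the same executor API for threads and processes, so switching between them is a one-line change (the toy `square` function is purely for illustration):

```python
from concurrent.futures import ThreadPoolExecutor  # swap in ProcessPoolExecutor for CPU-bound work

def square(n):
    return n * n

# The same executor/map interface covers both multithreading and multiprocessing
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(square, range(10)))
print(results)
```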
Which really brings me to the core of my beef with R. Statistics is a horrible central motivation for a programming language. Statistics is just a collection of methods, nothing more. If you break down the individual tasks in statistics, they always boil down to various other branches of (sometimes applied) math: probability, numerical differentiation, optimization, linear algebra, etc. And R is not really a standout in terms of optimization or linear algebra, or any of these.
Most stats methods are really not that complicated at the end of the day. If you have all the tools that actually make up stats, they aren’t that hard to re-implement. If you are just a user, I understand that spending a few hours re-implementing a method is time wasted, and by all means keep using R. But if you actually spend a lot of time comprehending, investigating, and modifying methods, then frankly the time to rewrite an algorithm is insignificant. You can even just copy the R code into python and then change the syntax line by line. Hopefully if more people who actually create and modify algorithms switch to python, users will eventually follow.
Hi! I really liked this post; I’m struggling with exactly this question. I like the Ruby language more than Python, or anything else. I’ve searched the internet and found that there are two languages with a mature ecosystem in scientific research: Java and Python. So I would give JRuby a try. What do you think about that? The only thing I think I’d miss, compared to Python, is the speed you get via Cython. Am I wrong? Could JRuby be faster, or just fast enough? Or can I mix JRuby with C? Anyway, if speed were the first priority, why aren’t there good enough C or C++ libs? Thanks for your answer.