Summary: In the past, R seemed like the obvious choice for Data Science projects. This article highlights some of the issues with R, such as performance and licensing, and then illustrates why Python, with its eco-system of dedicated modules like scikit-learn, Pandas and others, has quickly become the rising star amongst Data Scientists.
Data Scientists starting a new project have a cornucopia of options to choose from to develop and implement their models. A popular choice is the R Project for Statistical Computing, an open source implementation of the S environment originally developed at Bell Laboratories. R (like S) is designed as a software environment for working with data, including statistical computations and graphics. R is used in a wide range of use cases and is supported by a large community that both develops the R system and contributes algorithms, tools and extensions to the CRAN repository.
The gap towards production use
Although such an environment seems to contain everything a Data Scientist needs, once past the first steps and evaluations of the data one quickly faces the challenge of operating such a statistical model in a production environment, delivering a stable service 24/7 with minimal downtime. One option is to interface the computer system on which the model is to run with an R environment. This adds another layer of complexity to the software infrastructure and requires first extracting the data in a form that R can digest and then transforming the results back once they have been computed. Alternatively, one could re-implement the model in another programming language that is more suitable for the (existing) infrastructure, or indeed faster.
Both approaches are error-prone and cumbersome. Re-implementing an existing model in a different environment, in particular, not only requires duplicate testing effort but also takes a significant amount of time, so that a gap of several months may open between developing and testing an improved or new model and the time it can be deployed in production. Commercial adaptations exist to ease this, such as SAP HANA or Oracle R Enterprise, which integrate R with their respective proprietary technology. However, this limits the technology choices in your own company and commits you to a long-term choice of infrastructure.
Furthermore, R is licensed under the GNU General Public License (GPL), which requires that anything directly packaged with R also be placed under the GPL. Hence any commercial offering has to solve, one way or another, the conundrum of integrating R with its proprietary intellectual property closely enough to provide a seamless transition, while keeping R and its own core technology far enough apart to avoid being “hit” by the license requirements. For example, SAP doesn’t ship R with HANA but instead provides an installation manual describing how to download R and configure the HANA database to work with it, which pushes all support and maintenance to the client instead of the vendor.
Even then, more fundamental issues with R remain, as a recent paper by S. Sridharan and J. M. Patel (University of Wisconsin, USA) suggests. The authors examined how processor time is spent during the execution of an R program and found that around 85% of the execution time is spent in processor stalls, i.e. the program actually “does” something during only around 15% of the total execution time. Further problematic areas identified by the authors concern the garbage collector and the memory footprint. The latter is of particular importance, as a factor of 10 or more memory is needed compared with the size of the original dataset: a relatively small dataset of just 2GB can lead to a memory footprint between 20 and 60GB, depending on the algorithm used. While such research opens avenues that can be pursued to improve the current situation, it also means that many of the problems found can only be alleviated by a fundamental re-design of the R system.
Enter the world of Python, which has quickly become a rising star among Data Scientists. Using Python for Data Science combines a number of advantages. The general-purpose programming language is easy to learn and, in particular with dedicated modules such as Pandas and the IPython Notebook, can be used to quickly dive into the data, explore it and start developing models interactively. Because Python is a general-purpose language rather than a specialised environment, many modules exist to optimise specific aspects: NumPy and SciPy provide efficient numeric data types and scientific computing, while packages such as Cython and Numba make it possible to execute code at (almost) the speed of a C implementation while keeping the “look & feel” of Python programs.
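To make the point about NumPy concrete, here is a minimal sketch that computes the same quantity twice: once with a plain Python loop and once by handing the whole array to NumPy's compiled routines. The function names and the toy data are made up for illustration; no timings are claimed here, since they depend on the machine.

```python
import numpy as np


def sum_of_squares_loop(values):
    """Pure-Python reference: the interpreter handles one element at a time."""
    total = 0.0
    for v in values:
        total += v * v
    return total


def sum_of_squares_vectorised(values):
    """Same computation, delegated to NumPy's compiled array routines."""
    arr = np.asarray(values, dtype=np.float64)
    return float(np.dot(arr, arr))


# Illustrative toy data: the numbers 0, 1, ..., 999 as 64-bit floats.
data = np.arange(1000, dtype=np.float64)

loop_result = sum_of_squares_loop(data)
vec_result = sum_of_squares_vectorised(data)

# Both variants should agree on the result.
print(loop_result, vec_result)
```

The vectorised version still reads like ordinary Python, which is the “look & feel” argument made above: the heavy lifting moves into compiled code without changing the language the model is written in.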
The scikit-learn machine learning package offers both a well-designed interface for machine learning and a broad choice of implemented algorithms. Combining all of the above allows Data Scientists to deep-dive into the data, develop models and explore new angles in a consistent software framework that can then also be used to deploy the models in a production environment, aided by dedicated tools for testing the code, continuous integration and automated deployment to ensure high quality.
The business-friendly licenses of the various packages also allow the best of both worlds, where business meets the open source community: Developers and Data Scientists can be part of the open source community and easily contribute both bug fixes and new algorithms for use by the global community, while protecting the core IP of the company they work for.
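As a rough illustration of this develop-then-deploy workflow, the sketch below trains a small scikit-learn pipeline on synthetic data, serialises it, and restores it as a production service might. The dataset, the choice of logistic regression and the use of `pickle` are assumptions made for this example only; they are not a description of any particular production setup.

```python
import pickle

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for real project data (assumption for illustration).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Development side: preprocessing and model live in one object,
# so the same transformations are applied at training and prediction time.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)

# "Deployment" side: the fitted pipeline is serialised and restored unchanged.
serialised = pickle.dumps(model)
restored = pickle.loads(serialised)

# The restored model should reproduce the original predictions exactly.
predictions_match = np.array_equal(
    model.predict(X_test), restored.predict(X_test)
)
print(accuracy, predictions_match)
```

Because the exploratory code and the deployed artefact are the same Python object, there is no re-implementation step in between, which is exactly the gap described for R above.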
Proven to work
It has been a while since we at Blue Yonder switched to using Python, scikit-learn, Pandas and other packages, and we have seen a significant increase in productivity since. As we have built our own solutions on top of the open source machine learning eco-system provided by Pandas, scikit-learn and the other packages, we are also able to teach the basics as part of the Data Science Academy, including how to build models in a hands-on approach. This allows aspiring Data Scientists to get to know every aspect of diving into the data and building a predictive model, and to get a good feel for what is needed to run it in a production environment.
Even a quick glance at the web shows that many other projects are moving in the same direction. The Hadoop core technology is developed in Java, so it was quite natural that the first client libraries were also in Java. Increasingly, these projects have added and beefed up their Python support (e.g. Hadoopy, PySpark). Newer projects in the Hadoop space (e.g. Spark) have had Python support right away, while others (like Flink) have added it to their roadmap. Our observation is that when it comes to statistical packages and machine learning, the Java ecosystem has a lot to offer but is not on a par with Python. This is because battle-proven statistics and machine learning packages written in C or Fortran can be easily integrated into Python, which allows programmers to quickly combine these packages into more complex applications. Re-implementing all these packages and bringing them to the same quality and performance would be a decade-long task. The option to move to a programming ecosystem that has the batteries included allows for quick results and happy programmers.
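One small example of this kind of integration: NumPy's `linalg.solve` hands the actual work to a LAPACK routine, a long-established Fortran library, while the calling code stays plain Python. The 2x2 system below is made up purely for illustration.

```python
import numpy as np

# Illustrative linear system A @ x = b:
#   3*x0 + 1*x1 = 9
#   1*x0 + 2*x1 = 8
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([9.0, 8.0])

# np.linalg.solve delegates to a compiled LAPACK solver under the hood.
x = np.linalg.solve(A, b)

# Check the solution by substituting it back into the system.
residual = A @ x - b
print(x, residual)
```

The caller never sees the Fortran code, yet benefits from decades of work on it, which is the “batteries included” effect described above.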
Ulrich Kerzel earned his PhD under Professor Dr Feindt at the US Fermi National Laboratory and at that time made a considerable contribution to the core technology of NeuroBayes. After his PhD, he went to the University of Cambridge, where he was a Senior Research Fellow at Magdalene College. His research work focused on complex statistical analyses to understand the origin of matter and antimatter using data from the LHCb experiment at the Large Hadron Collider at CERN, the world’s biggest research institute for particle physics. He continued this work as a Research Fellow at CERN before he came to Blue Yonder as a senior data scientist.
(Image source: O’Reilly)