Lean, mean data science machine

Published on 07 December 2013

Update (1-4-2014) Be sure to check out the brand new Data Science Toolbox!

Update (9-12-2013) I have compared my Vagrant environment with three other virtual environments for data science (see below).

Data scientists love to create interesting models and exciting data visualizations. However, before they get to that point, usually much effort goes into obtaining, scrubbing, and exploring the required data. I argue that the *nix command-line, although invented decades ago, remains a powerful environment for processing data. It provides a read-eval-print loop (REPL) that is often much more convenient for exploratory data analysis than the edit-compile-run-debug cycle associated with large programs and even scripts.

Unfortunately, setting up a workable environment and installing the latest command-line tools can be quite a pain. This post describes how to alleviate that pain and how to get you started doing data science on the command-line in a matter minutes.

Data Science at the Command Line

I am currently authoring a book titled "Data Science at the Command Line", which will be published by O'Reilly in summer 2014. The main goal of the book is to teach why, how, and when the command-line could be employed for data science. The tentative outline is as follows:

  1. Introduction
  2. Getting Started
  3. Step 1: Obtaining Data
  4. Creating Reusable Command-line Tools
  5. Step 2: Scrubbing Data
  6. Managing Your Data Workflow
  7. Step 3: Exploring Data
  8. Speeding Up Data-Intensive Commands
  9. Step 4: Modeling Data
  10. Poor Man's MapReduce
  11. Step 5: Interpreting Data
  12. Conclusion

Naturally, the book will be drenched with commands and source code. It is important that the text, the code, and the output of the code are consistent with each other. Manually running the code and copy-pasting the output is a cumbersome and error-prone process. To automate this process, I have created a script (a dexy filter to be precise) that will (1) extract all the source code from the text, (2) run these in an isolated environment, and (3) paste the output back into the text. From here the O'Reilly toolchain takes over and converts the text to a variety of digital formats. Very smooth.

Your own Data Science Toolbox environment with Vagrant

The environment is created and configured using Vagrant, which is basically a wrapper around VirtualBox and other virtualization software such AWS EC2. With a few commands, a fresh virtual machine is spun up and configured according to a simple script. It was Matthew Russell's Ignite talk that inspired me to use Vagrant; he provides one for his book Mining the Social Web that is focused more on Python. If my Vagrant environment would be provided with Data Science at the Command Line, then the reader would be able to follow along with the commands and source code. But since my mission is to enable everybody to do data science at the command-line as soon as possible, I have decided to make it available right now.

Currently, the environment includes the seven command-line tools I discussed a while ago and GNU parallel, which will be discussed in Chapter 8. Just like the book itself, the environment is a work in progress. In order to be able to run Rio (one of the seven tools), I had to include the latest version of R, together with the packages ggplot2, sqldf, and plyr. I am aware that many of you would prefer the Python scientific stack to be included as well. The Python scientific stack (ipython, numpy, scipy, matplotlib, pandas, and scikit-learn) is also included. However, because of disk-space and provision-time constraints, I doubt whether it is desirable (or even possible) to create an environment that includes everything. Perhaps that we can devise a solution where you select which tools, packages, and languages you would like to have installed. As mentioned, it is a work in progress and my main goal is to get you up and running on the command-line.

Installing the Data Science Toolbox environment

The environment is currently configured to run on top of VirtualBox. (I am looking into the option to deploy it on an AWS EC2 instance.) So, first you will need to install VirtualBox. Second you need to install Vagrant. Third, you need to download the environment by cloning the data science toolbox. (If you do not want to use git you can also download the zip file.)

git clone https://github.com/jeroenjanssens/data-science-toolbox.git
cd data-science-toolbox/box

Running vagrant up in the box directory will download the base box (Ubuntu 12.04 LTS 64-bit), spin up a virtual machine, and provision it. (Now would be the perfect time to think about any command-line scripts you may have lying around and donate them to the data science toolbox.) Once the provisioning is complete, you will be able to log into your own lean, mean data science machine:

vagrant ssh

Run the following command to test whether everything has been installed correctly:

curl -s 'http://en.wikipedia.org/wiki/List_of_countries_and_territories_by_border/area_ratio' | scrape -be 'table.wikitable > tr:not(:first-child)' | xml2json | jq -c '.html.body.tr[] | {country: .td[1][], border: .td[2][], surface: .td[3][], ratio: .td[4][]}' | json2csv -p -k=country,ratio | Rio -se'sqldf("select * from df where ratio > 0.3 order by ratio desc")' | csvlook
|----------------+------------|
|  country       | ratio      |
|----------------+------------|
|  Vatican City  | 7.2727273  |
|  Monaco        | 2.2        |
|  San Marino    | 0.6393443  |
|  Liechtenstein | 0.475      |
|----------------+------------|

The virtual machine is not entirely isolated. Files that you put in the box directory will be accessible from the /vagrant directory in the virtual machine. This allows you to use both the tools you already have installed and the command-line tools provided by the environment. If you want to install any of these tools on your own machine, then you can run the relevant commands from the provisioning script.

Comparison of virtual environments for data science

Of course the Data Science Toolbox environment is not the only one available for doing data science! So far, I have been able to perform a rudimentary comparison with three other solutions. (Please let me know if you know any others.)

1. Data Science Toolbox (DST)
Created by: Jeroen Janssens
Github: jeroenjanssens/data-science-toolbox
Installs R, the Python scientific stack, and of course many command-line tools for processing data. Uses Vagrant and for now it can be deployed on VirtualBox, only.

2. Mining the Social Web (MTSW)
Created by: Matthew Russel
Website: miningthesocialweb.com/
Github: ptwobrussell/Mining-the-Social-Web-2nd-Edition
Uses Vagrant (with Chef as the provisioner, which is really nice) and can be deployed on both VirtualBox and AWS. Installs IPython Notebook, numpy, mongo, and NLTK, which allows you to follow along with the examples provided in the book. An AWS AMI is available as well.

3. Data Science Toolkit (DSTK)
Created by: Pete Warden
Website: www.datasciencetoolkit.org
Github: petewarden/dstk
The website provides a sandbox from which you can try out many interesting APIs. These APIs can also be accessed from the command line. An AWS AMI is available.

4. Data Science Box (DSB)
Created by: Drew Conway
Github: drewconway/data_science_box
This is a bash script for which you need have an AWS EC2 instance running. It installs R, Shiny, IPython Notebook, and the Python scientific stack.

For your convenience I have summarized this information in the following table.

Configuration VirtualBox AWS AMI Python R Shiny Comments
1. DST Vagrant Yes No No Yes Yes No Includes the Data Science Toolbox
2. MTSW Vagrant Yes Yes Yes Yes No No
3. DSTK Vagrant Yes Yes Yes No No No Includes various command-line tools
4. DSB Bash No Yes No Yes Yes Yes

In short, I think that they all have some strong aspects. Some of these may be improved over time (I am currently looking into using Chef as the provisioner), new environments may arise; that is the way open source works. In the end, it is up to you to decide which one works best for you. And if you want to make some tweaks, you can always fork the appropriate Github repository.

It is in general just amazing to be able to spin up a new virtual machine with your own or somebody else's environment, whether by running vagrant up or by clicking a few buttons on AWS.

I realize that three out of four names look really alike, which can be confusing, but it could also indicate that there is a need for having an automated (and isolated) setup to start doing data science without any additional hassle.

Conclusion

While the command-line is a very powerful environment to process data, manually installing the latest command-line tools is not straightforward. Vagrant allows you to spin up a virtual machine and to install all the tools automatically. In this post I have shared with you the exact same Vagrant environment as that I am using for my upcoming book, in the hope that it will be useful to get you started with doing data science at the command line. I have also compared my environment with three other virtual environments for data science. Please let me know if you have any questions, suggestions, or contributions.

Best wishes,

Jeroen
@jeroenhjanssens