Workshop Data Science at the Command Line | Data Science Workshops with Jeroen Janssens

You’ll learn better together in my workshop Data Science at the Command Line. Do you want to know more about this workshop? Curious how I can adapt it to your needs? Something else? Don’t hesitate to contact me.

Data Science at the Command Line

The Unix command line, although invented decades ago, is an amazing environment for efficiently performing tedious but essential data science tasks. By combining small, powerful command-line tools (like parallel, jq, and csvkit), you can quickly scrub and explore your data and hack together prototypes.
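To give a flavour of what combining small tools looks like, here is a minimal sketch. It uses only standard Unix tools (cut, sort, uniq, awk) rather than the tools named above, and both the sample data and the function name top_languages are made up for illustration.

```shell
#!/usr/bin/env bash

# Hypothetical sample data: one "name,language" record per line.
data='alice,R
bob,Python
carol,R
dave,Python
erin,R'

# Each tool does one small job; the pipe (|) passes the output
# of one tool to the input of the next.
top_languages() {
  cut -d, -f2 |                 # keep only the second column
    sort |                      # put identical values next to each other
    uniq -c |                   # count each run of identical values
    sort -rn |                  # largest count first
    awk '{print $2 "\t" $1}'    # print "language<TAB>count"
}

printf '%s\n' "$data" | top_languages
```

With the sample data above, this prints R with a count of 3, followed by Python with a count of 2. Because the function reads from standard input, the same pipeline would work on a file or on the output of another tool.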

This hands-on workshop is based on the O’Reilly book Data Science at the Command Line, written by instructor Jeroen Janssens. You’ll learn how to build fast data pipelines, how to leverage R and Python at the command line, and how to quickly visualise data. No prior knowledge of the Unix command line is required.

By the end of this workshop you will have a solid understanding of how to integrate the command line into your data science workflow. Even if you’re already comfortable processing data with, for example, R or Python, also being able to leverage the power of the command line can make you a more effective and efficient data scientist.

What you’ll learn

Automate tedious tasks
Parallelise and distribute your tasks to multiple cores and machines
Convert your existing code to reusable command-line tools
Easily inspect, transform, and visualise data
Apply a variety of supervised and unsupervised machine learning algorithms

Schedule

Day 1:

Introduction
- What is the command line?
- Why learn the command line for doing data science?
- A real-world data science use case
- Getting up and running with the Docker image
Essential concepts of the Unix command line
- Running command-line tools
- Combining command-line tools
- Redirecting input and output
- Working with files
- Getting help
Obtaining data from logs, spreadsheets, and databases
Downloading data from the Internet and accessing APIs using curl
Transforming data with filters such as cut, paste, grep, and sed
Processing other data formats efficiently
- JSON with jq
- CSV with csvkit
- HTML with pup
- XML with xmlstarlet

Day 2:

Running R from the command line
Visualising data from the command line
- Scatter plot
- Histogram
- Bar chart
- Geographic visualisation
Parallelising and distributing data-intensive pipelines
Creating reusable command-line tools
- Automating tasks in a Bash script
- Converting your existing code to a command-line tool
- Processing arguments
- Working with streaming data
Applying machine learning
- Dimensionality reduction
- Classification
- Regression
Conclusion