In this final chapter, the book comes to a close. I’ll first recap what I’ve discussed in the previous ten chapters, and will then offer you three pieces of advice and provide some resources to further explore the related topics we touched upon. Finally, in case you have any questions, comments, or new command-line tools to share, I provide a few ways to get in touch with me.
This book explored the power of using the command line to do data science. I find it an interesting observation that the challenges posed by this relatively young field can be tackled by such a time-tested technology. I hope that you now see what the command line is capable of. The many command-line tools offer all sorts of possibilities that are well suited to the variety of tasks encompassing data science.
There are many definitions for data science available. In Chapter 1, I introduced the OSEMN model as defined by Mason and Wiggins, because it is a very practical one that translates to very specific tasks. The acronym OSEMN stands for obtaining, scrubbing, exploring, modeling, and interpreting data. Chapter 1 also explained why the command line is very suitable for doing these data science tasks.
The four OSEMN model chapters focused on performing those practical tasks using the command line. I haven’t devoted a chapter to the fifth step, interpreting data, because, quite frankly, the computer, let alone the command line, is of very little use here. I have, however, provided some pointers for further reading on this topic.
In the four intermezzo chapters, we looked at some broader topics of doing data science at the command line, topics which are not really specific to one particular step.
In Chapter 4, I explained how you can turn one-liners and existing code into reusable command-line tools.
In Chapter 6, I described how you can manage your data workflow using a tool called
In Chapter 8, I demonstrated how ordinary command-line tools and pipelines can be run in parallel using GNU Parallel.
In Chapter 10, I showed that the command line doesn’t exist in a vacuum but that it can be leveraged from other programming languages and environments.
The topics discussed in these intermezzo chapters can be applied at any point in your data workflow.
It’s impossible to demonstrate all command-line tools that are available and relevant for doing data science. New tools are created on a daily basis. As you may have come to understand by now, this book is more about the idea of using the command line, rather than giving you an exhaustive list of tools.
You probably spent quite some time reading these chapters and perhaps also following along with the code examples. In the hope that it maximizes the return on this investment and increases the probability that you’ll continue to incorporate the command line into your data science workflow, I would like to offer you three pieces of advice: (1) be patient, (2) be creative, and (3) be practical. In the next three subsections I elaborate on each piece of advice.
The first piece of advice that I can give is to be patient. Working with data on the command line is different from using a programming language, and therefore it requires a different mindset.
Moreover, the command-line tools themselves are not without their quirks and inconsistencies.
This is partly because they have been developed by many different people, over the course of multiple decades.
If you ever find yourself at a loss regarding their mind-dazzling options, don’t forget to use
tldr, or your favorite search engine to learn more.
Still, especially in the beginning, it can be a frustrating experience. Trust me, you’ll become more proficient as you practice using the command line and its tools. The command line has been around for many decades, and will be around for many more to come. It’s a worthwhile investment.
The second, related piece of advice is to be creative. The command line is very flexible. By combining the command-line tools, you can accomplish more than you might think.
I encourage you to not immediately fall back onto your programming language. And when you do have to use a programming language, think about whether the code can be generalized or reused in some way. If so, consider creating your own command-line tool with that code using the steps I discussed in Chapter 4. If you believe your tool may be beneficial for others, you could even go one step further by making it open source. Maybe there’s a step you know how to perform at the command line, but you would rather not leave the comfort of the main programming language or environment you’re working in. Perhaps you can use one of the approaches listed in Chapter 10.
The third piece of advice is to be practical. Being practical is related to being creative, but deserves a separate explanation. In the previous subsection, I mentioned that you should not immediately fall back to a programming language. Of course, the command line has its limits. Throughout the book, I have emphasized that the command line should be regarded as a companion approach to doing data science.
I’ve discussed four steps for doing data science at the command line. In practice, the applicability of the command line is higher for step 1 than it is for step 4. You should use whatever approach works best for the task at hand. And it’s perfectly fine to mix and match approaches at any point in your workflow. As I’ve shown in Chapter 10, the command line is wonderful at being integrated with other approaches, programming languages, and statistical environments. There’s a certain trade-off with each approach, and part of becoming proficient at the command line is to learn when to use which.
In conclusion, when you’re patient, creative, and practical, the command line will make you a more efficient and productive data scientist.
As this book is on the intersection of the command line and data science, many related topics have only been touched upon. Now, it’s up to you to further explore these topics. The following subsections provide a list of topics and suggested resources to consult.
- The Linux Command Line: A Complete Introduction, 2nd Edition By William E. Shotts, Jr. (No Starch Press, 2019)
- Unix Power Tools, 3rd Edition by Jerry Peek, Shelley Powers, Tim O’Reilly, and Mike Loukides (O’Reilly Media, 2002)
- Learning the Vi and Vim Editors, 7th Edition by Arnold Robbins, Elbert Hannah, and Linda Lamb (O’Reilly Media, 2008)
- Classic Shell Scripting by Arnold Robbins and Nelson H.F. Beebe (O’Reilly Media, 2005)
- Wicked Cool Shell Scripts, 2nd Edition by Dave Taylor and Brandon Perry (No Starch Press, 2017)
- Bash Cookbook by Carl Albing JP Vossen (O’Reilly Media, 2018)
- Learn Python 3 the Hard Way by Zed A. Shaw (Addison-Wesley Professional, 2017)
- Python for Data Analysis, 2nd Edition by Wes McKinney (O’Reilly Media, 2017)
- Data Science from Scratch, 2nd Edition by Joel Grus (O’Reilly Media, 2019)
- R for Data Science by Garrett Grolemund and Hadley Wickham (O’Reilly Media, 2016)
- R for Everyone, 2nd edition by Jared Lander (Addison-Wesley Professional, 2017)
- Sams Teach Yourself SQL in 10 Minutes a Day, 5th Edition by Ben Forta (Sams, 2020)
- Mining the Social Web, 3rd Edition by Matthew A. Russell and Mikhail Klassen (O’Reilly Media, 2019)
- Data Source Handbook by Pete Warden (O’Reilly Media, 2011)
- Python Machine Learning, 3rd Edition by Sebastian Raschka and Vahid Mirjalili (Packt Publishing, 2019)
- Pattern Recognition and Machine Learning by Christopher M. Bishop (Springer, 2006)
- Information Theory, Inference, and Learning Algorithms by David MacKay (Cambridge University Press, 2003)
This book would not have been possible without the many people who created the command line and the numerous tools. It’s safe to say that the current ecosystem of command-line tools for data science is a community effort. I have only been able to give you a glimpse of the many command-line tools available. New ones are created everyday, and perhaps some day you will create one yourself. In that case, I would love to hear from you. I’d also appreciate it if you would drop me a line whenever you have a question, comment, or suggestion. There are a couple of ways to get in touch:
- Email: email@example.com
- Twitter: @jeroenhjanssens
- Book website: https://datascienceatthecommandline.com/
- Book GitHub repository: https://github.com/jeroenjanssens/data-science-at-the-command-line