Python Polars: The Definitive Guide

With Thijs Nieuwdorp. Expected to be published by O’Reilly Media in August 2024. Foreword by Ritchie Vink.

Get ready to speed up your data analysis and start working with larger-than-memory datasets. Polars offers a blazingly fast, multi-threaded, elegant API for data loading, manipulation, and processing. Authors Jeroen Janssens and Thijs Nieuwdorp walk you through every aspect of Python Polars as they tackle practical use cases using real-world datasets. You’ll not only learn the syntax, but also understand the underlying concepts. You don’t need to have any experience with Pandas or Spark, but if you do, this book will help you make a smooth transition.

With this definitive guide at your side, you’ll be able to: process larger-than-memory datasets at record speed; apply the eager, lazy, and streaming APIs of Polars and decide when to use which; transition smoothly from Pandas or Spark to Polars; integrate Polars into your existing codebase; work with Arrow and Parquet to efficiently read and write data; and translate complex ETL tasks into efficient and elegant queries.

Did you know? Thijs and I are colleagues at Xomnia, the very birthplace of Polars!

Data Science at the Command Line

Second edition. Published by O’Reilly Media in October 2021. Foreword by Tim O’Reilly.

This thoroughly revised guide demonstrates how the flexibility of the command line can help you become a more efficient and productive data scientist. You’ll learn how to combine small yet powerful command-line tools to quickly obtain, scrub, explore, and model your data. To get you started, author Jeroen Janssens provides a Docker image packed with over 100 Unix power tools—useful whether you work with Windows, macOS, or Linux. You’ll quickly discover why the command line is an agile, scalable, and extensible technology. Even if you’re comfortable processing data with Python or R, you’ll learn how to greatly improve your data science workflow by leveraging the command line’s power. This book is ideal for data scientists, analysts, engineers, system administrators, and researchers.

Did you know? The first edition, which came out in 2014, eventually led me to start Data Science Workshops. You can read more about this in the Acknowledgments.

Outlier Selection and One-Class Classification

My PhD thesis. Defended on June 11, 2013 at Tilburg University, the Netherlands.

What do a terrorist attack, a forged painting, and a rotten apple have in common? All three are anomalies: real-world observations that deviate from what is considered normal. Detecting anomalies is of utmost importance because an undetected anomaly can be dangerous or expensive. A human domain expert who looks for them may suffer from three cognitive limitations: fatigue, information overload, and emotional bias. These limitations hamper the detection of anomalies. Outlier-selection and one-class classification algorithms can automatically classify data points as outliers in large amounts of data. In this thesis we study to what extent outlier-selection and one-class classification algorithms can support domain experts with real-world anomaly detection.

Did you know? The Stochastic Outlier Selection algorithm, which is covered in Chapter 4, is available in the PyOD package for Python.

© 2013–2023 Jeroen Janssens