Materials for Robust Pandas Tutorial, EuroPython, Prague, 2023.
We will explore ways to make our data analyses and transformations in Pandas robust and production-ready. We will see how advanced group-by, resample, and rolling aggregations work on large time series weather data. (As a bonus, you will learn about the Prague climate.) We will use type annotations and schema validation with the Pandera library to make our code more readable and robust. We will also show the potential of property-based testing with the Hypothesis package, using strategies generated from Pandera schemas. We will show how to avoid issues with time zones when working with time series data. By the end of the tutorial, you will have a deeper understanding of advanced Pandas aggregations and be able to write robust, production-ready Pandas code.
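To give a flavour of the aggregations and time zone pitfalls mentioned above, here is a minimal sketch. It uses made-up hourly values rather than the workshop data, and the variable names are chosen only for this illustration. It shows that the same series yields different daily means depending on the time zone in which the days are defined.

```python
import pandas as pd

# Synthetic hourly "temperatures" covering two UTC days (made-up values,
# not the workshop data): 0.0, 1.0, ..., 47.0.
index = pd.date_range(
    "2023-07-01", periods=48, freq=pd.Timedelta(hours=1), tz="UTC"
)
temperature = pd.Series(range(48), index=index, dtype="float64", name="temperature")

# Daily means with day boundaries at UTC midnight: two full days.
daily_mean_utc = temperature.resample("D").mean()

# The same data with day boundaries at Prague local midnight (UTC+2 in July):
# the 48 hours now span three local calendar days.
daily_mean_prague = temperature.tz_convert("Europe/Prague").resample("D").mean()

# A 3-hour rolling mean; the first two values are NaN because the window
# is not yet full.
rolling_mean = temperature.rolling(window=3).mean()

print(daily_mean_utc)
print(daily_mean_prague)
print(rolling_mean.head())
```

Keeping the index time zone aware makes the choice of day boundary explicit instead of leaving it to an implicit assumption.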
Two data sources are used in this workshop:
Please prepare a Python environment that you can use during the workshop. We will work in Jupyter Notebook as well as in an editor or an IDE of your choice. We recommend Visual Studio Code or PyCharm.
Note: All the instructions below are for Unix-like systems (Linux, macOS, WSL on Windows). If you want or need to work in the native Windows cmd or PowerShell, you will need to adapt the commands accordingly. We cannot provide support for native Windows environments.
git clone https://github.com/coobas/robust-pandas-workshop.git
or, using the gh client:
gh repo clone coobas/robust-pandas-workshop
We have included both requirements.txt and environment.yml files so that you can create a Python environment using either pip or conda, respectively.
Python version 3.10+ is required.
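The version requirement can be checked from within the interpreter you plan to use. This is a minimal sketch using only the standard library; the helper name `meets_minimum_version` is hypothetical and not part of the workshop repository.

```python
import sys

MINIMUM_VERSION = (3, 10)  # the requirement stated above


def meets_minimum_version(version_info=None, minimum=MINIMUM_VERSION):
    """Return True if the given (or the running) interpreter is at least `minimum`."""
    # Hypothetical helper for illustration; not part of the workshop code.
    if version_info is None:
        version_info = sys.version_info
    # Compare only (major, minor), e.g. (3, 11) >= (3, 10).
    return tuple(version_info[:2]) >= tuple(minimum)


if __name__ == "__main__":
    status = "OK" if meets_minimum_version() else "too old"
    print(f"Python {sys.version_info.major}.{sys.version_info.minor}: {status}")
```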
First, cd into the repository directory:
cd robust-pandas-workshop
python -m venv .venv
source .venv/bin/activate
python -m pip install -r requirements.txt
conda env create -f environment.yml -n "robust-pandas-workshop"
conda activate robust-pandas-workshop
Code for this workshop is in the weatherlyser package in this repository.
Before you work with it, whether in Jupyter notebooks, in your IDE, or when running tests, Python needs to know where to find it.
Either set the PYTHONPATH environment variable to the repository directory:
export PYTHONPATH=$PWD
(this assumes that your current directory is the repository root)
or, more robustly, install the package in editable mode:
pip install -e .
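Whichever of the two options you choose, you can confirm that Python is able to locate the package before opening the notebooks. The following is a minimal sketch using only the standard library; `is_importable` is a hypothetical helper, and it only checks that the package can be found, not that its dependencies are installed.

```python
import importlib.util


def is_importable(package_name):
    """Return True if Python can locate a top-level package or module."""
    # find_spec returns None when the name cannot be found on sys.path.
    return importlib.util.find_spec(package_name) is not None


if __name__ == "__main__":
    if is_importable("weatherlyser"):
        print("weatherlyser is importable")
    else:
        print("weatherlyser not found; set PYTHONPATH or run `pip install -e .`")
```

Run it with the same interpreter (and the same activated environment) that you will use for the notebooks and tests.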
Follow the instructions therein and, if you do not have one, create a free Deepnote account.
All materials that we will use during the workshop are in Jupyter notebooks.
- Introduction
- First data exploration
- Type annotations and dataframe models
- Data loading
- Time zones
- Hypothesis testing
- Grouping, resampling and aggregations
- Windowing and differences
Visual Studio Code or PyCharm Professional users can work with notebooks directly in their IDE; this is the recommended way. You can also use Jupyter Lab, which will be installed in your environment and also provides an IDE-like environment with an editor and a command line.
The tests directory contains tests for the weatherlyser package.
We will use the tests throughout the workshop to test our code.
It is also a good idea to run the tests to check whether your installation is working correctly.
To run the tests, use pytest:
pytest
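For orientation, pytest by default collects functions whose names start with `test_` from files named `test_*.py` and treats a failing `assert` as a test failure. The sketch below shows that general shape; `daily_temperature_range` is a hypothetical function invented for this illustration and is not taken from the weatherlyser package.

```python
def daily_temperature_range(temperatures):
    """Return the difference between the highest and lowest reading."""
    # Hypothetical function under test, used only for illustration.
    if not temperatures:
        raise ValueError("at least one temperature reading is required")
    return max(temperatures) - min(temperatures)


def test_range_of_known_values():
    # Plain assert statements are all pytest needs.
    assert daily_temperature_range([-2.5, 4.0, 1.5]) == 6.5


def test_single_reading_has_zero_range():
    assert daily_temperature_range([3.0]) == 0.0
```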
The mypy static type checker is configured to check the weatherlyser and tests folders.
You can run it with:
mypy