A comprehensive tutorial on the Python Pandas library, updated to be consistent with best practices and features available in 2024.
The tutorial can be watched here
The code that is walked through in the tutorial is in tutorial.ipynb
To get started with Pandas locally, you can follow these steps to set up your environment and clone the recommended repository.
First, ensure you have Python installed on your system. You can download Python from the official website.
Fork the repository to your own GitHub account by visiting complete-pandas-tutorial and clicking the "Fork" button in the top-right corner.
Clone your forked repository to your local machine. Open a terminal or command prompt and run:
git clone https://github.com/yourusername/complete-pandas-tutorial.git
cd complete-pandas-tutorial
Replace yourusername
with your actual GitHub username.
Creating a virtual environment is a good practice to manage dependencies for your projects. Run the following command:
python -m venv myenv
Activate the virtual environment:
- On Windows:
myenv\Scripts\activate
- On macOS/Linux:
source myenv/bin/activate
To deactivate the virtual environment, run:
- On Windows:
myenv\Scripts\deactivate.bat
- On macOS/Linux:
deactivate
With the virtual environment activated, install the necessary libraries from the requirements.txt
file:
pip install -r requirements.txt
You can use your favorite code editor like Visual Studio Code or PyCharm. Open the cloned repository folder in your code editor.
Create a new Jupyter Notebook file in your code editor:
- In Visual Studio Code, click on the "New File" icon or press
Ctrl+N
, then save the file with a.ipynb
extension. - In PyCharm, right-click on the project folder, select "New", and then "Jupyter Notebook".
- Else, if these options don't work or you are using an editor that doesn't support Jupyter Notebooks, run the following command in your terminal:
This will open Jupyter Notebook in your web browser.
jupyter notebook
If you prefer not to set up things locally, you can use Google Colab, which allows you to run Python code in your browser without any setup.
Go to Google Colab and start a new notebook. You can upload your dataset and start coding with Pandas immediately.
This cheat sheet is a companion to the "Complete Python Pandas Data Science Tutorial."
Creating DataFrames is the foundation of using Pandas. Here’s how to create a simple DataFrame and display its content.
import pandas as pd
import numpy as np
# Create a simple DataFrame
df = pd.DataFrame(
[[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]],
columns=["A", "B", "C"],
index=["x", "y", "z", "zz"]
)
# Display the first few rows
df.head()
# Display the last two rows
df.tail(2)
Loading data into DataFrames from various file formats is crucial for real-world data analysis.
# Load data from CSV
coffee = pd.read_csv('./warmup-data/coffee.csv')
# Load data from Parquet
results = pd.read_parquet('./data/results.parquet')
# Load data from Excel
olympics_data = pd.read_excel('./data/olympics-data.xlsx', sheet_name="results")
Accessing different parts of the DataFrame allows for flexible data manipulation and inspection.
# Access columns
df.columns
# Access index
df.index.tolist()
# General info about the DataFrame
df.info()
# Statistical summary
df.describe()
# Number of unique values in each column
df.nunique()
# Access unique values in a column
df['A'].unique()
# Shape and size of DataFrame
df.shape
df.size
Filtering data is essential for extracting relevant subsets based on conditions.
# Filter rows based on conditions
bios.loc[bios["height_cm"] > 215]
# Multiple conditions
bios[(bios['height_cm'] > 215) & (bios['born_country']=='USA')]
# Filter by string conditions
bios[bios['name'].str.contains("keith", case=False)]
# Regex filters
bios[bios['name'].str.contains(r'^[AEIOUaeiou]', na=False)]
Adding and removing columns is important for maintaining and analyzing relevant data.
# Add a new column
coffee['price'] = 4.99
# Conditional column
coffee['new_price'] = np.where(coffee['Coffee Type']=='Espresso', 3.99, 5.99)
# Remove a column
coffee.drop(columns=['price'], inplace=True)
# Rename columns
coffee.rename(columns={'new_price': 'price'}, inplace=True)
# Create new columns from existing ones
coffee['revenue'] = coffee['Units Sold'] * coffee['price']
Merging and concatenating DataFrames is useful for combining different datasets for comprehensive analysis.
# Merge DataFrames
nocs = pd.read_csv('./data/noc_regions.csv')
bios_new = pd.merge(bios, nocs, left_on='born_country', right_on='NOC', how='left')
# Concatenate DataFrames
usa = bios[bios['born_country']=='USA'].copy()
gbr = bios[bios['born_country']=='GBR'].copy()
new_df = pd.concat([usa, gbr])
Handling null values is essential to ensure the integrity of data analysis.
# Fill NaNs with a specific value
coffee['Units Sold'].fillna(0, inplace=True)
# Interpolate missing values
coffee['Units Sold'].interpolate(inplace=True)
# Drop rows with NaNs
coffee.dropna(subset=['Units Sold'], inplace=True)
Aggregation functions like value counts and group by help in summarizing data efficiently.
# Value counts
bios['born_city'].value_counts()
# Group by and aggregation
coffee.groupby(['Coffee Type'])['Units Sold'].sum()
coffee.groupby(['Coffee Type'])['Units Sold'].mean()
# Pivot table
pivot = coffee.pivot(columns='Coffee Type', index='Day', values='revenue')
Advanced functionalities such as rolling calculations, rankings, and shifts can provide deeper insights.
# Cumulative sum
coffee['cumsum'] = coffee['Units Sold'].cumsum()
# Rolling window
latte = coffee[coffee['Coffee Type']=="Latte"].copy()
latte['3day'] = latte['Units Sold'].rolling(3).sum()
# Rank
bios['height_rank'] = bios['height_cm'].rank(ascending=False)
# Shift
coffee['yesterday_revenue'] = coffee['revenue'].shift(1)
The PyArrow backend offers optimized performance for certain operations, particularly string operations.
# PyArrow backend
results_arrow = pd.read_csv('./data/results.csv', engine='pyarrow', dtype_backend='pyarrow')
results_arrow.info()