Attempt to shrink Pandas dtypes without losing data so you have more RAM (and maybe more speed)
pip install dtype_diet
https://noklam.github.io/dtype_diet/
This is a fork of https://github.com/ianozsvald/dtype_diet that continues to support and develop the library, with approval from the original author @ianozsvald.
This tool checks each column to see if larger dtypes (e.g. 8-byte float64 and int64) could be shrunk to smaller dtypes without causing any data loss.
Each step down from an 8-byte type to a 4-byte (or 2-byte or 1-byte) type halves the RAM requirement for that column. Categoricals are proposed for object columns, which can bring significant speed and RAM benefits.
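To make the idea concrete, here is a minimal sketch of the kind of check described above, written with plain pandas and NumPy. It is not taken from dtype_diet's source: the helper name fits_in_int and the toy data are assumptions made for illustration, and the library's own checks may differ.

```python
import numpy as np
import pandas as pd


def fits_in_int(series: pd.Series, dtype: str) -> bool:
    """Hypothetical helper: True if every value lies within the range of `dtype`."""
    info = np.iinfo(dtype)
    return bool(series.min() >= info.min and series.max() <= info.max)


# Toy data (an assumption, not the Kaggle dataset): 1000 rows.
df = pd.DataFrame({
    "week": np.arange(11101, 12101, dtype="int64"),
    "store": pd.Series(["CA_1", "CA_2", "TX_1", "TX_2"] * 250, dtype="object"),
})

# int16 holds -32768..32767, so these week numbers fit; int8 does not.
can_use_int16 = fits_in_int(df["week"], "int16")
can_use_int8 = fits_in_int(df["week"], "int8")

week_small = df["week"].astype("int16")
# Casting back to int64 reproduces the original values, so nothing was lost.
lossless = week_small.astype("int64").equals(df["week"])

# A column with few distinct strings is a candidate for the category dtype.
store_cat = df["store"].astype("category")

int64_bytes = df["week"].memory_usage(index=False)    # 1000 values * 8 bytes
int16_bytes = week_small.memory_usage(index=False)    # 1000 values * 2 bytes
object_bytes = df["store"].memory_usage(index=False, deep=True)
category_bytes = store_cat.memory_usage(index=False, deep=True)

print(f"week:  {int64_bytes} -> {int16_bytes} bytes")
print(f"store: {object_bytes} -> {category_bytes} bytes")
```

The round trip back to int64 is one simple way to confirm a cast is lossless for the data at hand. It says nothing about values that might be appended later.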
Here's a minimal example with 3 lines of code running on a Kaggle dataset, showing a reduction from 957 MB to 85 MB. You can find the notebook in the repository:
#slow
# sell_prices.csv.zip
# Source data: https://www.kaggle.com/c/m5-forecasting-uncertainty/
import pandas as pd
from dtype_diet import report_on_dataframe, optimize_dtypes
df = pd.read_csv('data/sell_prices.csv')
proposed_df = report_on_dataframe(df, unit="MB")
new_df = optimize_dtypes(df, proposed_df)
print(f'Original df memory: {df.memory_usage(deep=True).sum()/1024/1024} MB')
print(f'Proposed df memory: {new_df.memory_usage(deep=True).sum()/1024/1024} MB')
Original df memory: 957.5197134017944 MB
Proposed df memory: 85.09655094146729 MB
#slow
proposed_df
Column | Current dtype | Proposed dtype | Current Memory (MB) | Proposed Memory (MB) | Ram Usage Improvement (MB) | Ram Usage Improvement (%)
---|---|---|---|---|---|---
store_id | object | category | 203763.920410 | 3340.907715 | 200423.012695 | 98.360403
item_id | object | category | 233039.977539 | 6824.677734 | 226215.299805 | 97.071456
wm_yr_wk | int64 | int16 | 26723.191406 | 6680.844727 | 20042.346680 | 74.999825
sell_price | float64 | None | 26723.191406 | NaN | NaN | NaN
Recommendations:
- Run report_on_dataframe(your_df) to get recommendations
- Run optimize_dtypes(df, proposed_df) to convert to the recommended dtypes
- Consider if Categoricals will save you RAM (see Caveats below)
- Consider if f32 or f16 will be useful (see Caveats - f32 is probably a reasonable choice unless you have huge ranges of floats)
- Consider if int32, int16, int8 will be useful (see Caveats - overflow may be an issue)
- Look at https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.convert_dtypes.html which recommends Pandas nullable dtype alternatives (e.g. to avoid promoting an int64 with NaN items to float64, instead you get Int64 with NaNs and no data loss)
- Look at Extension arrays like https://github.com/JDASoftwareGroup/rle-array (thanks @repererum for the tweet)
Look at report_on_dataframe(your_df) to get a printed report - no changes are made to your dataframe.
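The nullable dtype recommendation can be illustrated with a short pandas-only sketch. It uses DataFrame/Series convert_dtypes from pandas itself, not any dtype_diet function, and the three-element series is made-up data.

```python
import pandas as pd

# A missing value forces the default integer data up to float64.
s = pd.Series([1, 2, None])
print(s.dtype)

# convert_dtypes() chooses the nullable Int64 extension dtype instead,
# keeping the integers as integers and the gap as a missing value.
s_nullable = s.convert_dtypes()
print(s_nullable.dtype)
print(s_nullable.isna().sum())
```

Nullable integer dtypes also come in smaller widths (Int32, Int16, Int8), so the same range considerations apply to them as to the plain NumPy integer dtypes.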
Caveats:
- reduced numeric ranges might lead to overflow (TODO document)
- category dtype can have unexpected effects e.g. need for observed=True in groupby (TODO document)
- f16 is likely to be simulated on modern hardware so calculations will be 2-3x slower than on f32 or f64
- we could do with a link that explains binary representation of float & int for those wanting to learn more
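The overflow and category points above are marked as not yet documented, so the following is only a small illustrative sketch using plain pandas with made-up values. It shows one way each issue can surface and is not an exhaustive description.

```python
import pandas as pd

# Overflow: int8 holds -128..127. Element-wise arithmetic keeps the int8
# dtype, so results outside that range wrap around without an error.
small = pd.Series([100, 120], dtype="int8")
doubled = small + small
print(doubled.tolist())  # [-56, -16] rather than [200, 240]

# Category dtype in groupby: with observed=False, categories that never
# appear in the data still get a row in the result.
df = pd.DataFrame({
    "store": pd.Categorical(["CA_1", "CA_1"], categories=["CA_1", "TX_1"]),
    "sales": [1, 2],
})
all_groups = df.groupby("store", observed=False)["sales"].sum()
seen_groups = df.groupby("store", observed=True)["sales"].sum()
print(len(all_groups), len(seen_groups))  # 2 1
```

With many categorical keys, the unobserved combinations can make a groupby result far larger than the data, which is why observed=True is worth considering after converting columns to category.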
- Antony Milbourne https://github.com/amilbourne
- Mani https://github.com/neomatrix369
$ conda create -n dtype_diet python=3.8 pandas jupyter pyarrow pytest
$ conda activate dtype_diet
make release
The repository is developed with nbdev, a system for developing libraries in notebooks.
If you want to contribute to the library, make sure you run the command below first. For details, please refer to the nbdev documentation (https://github.com/fastai/nbdev)
nbdev_install_git_hooks
Some other useful commands
nbdev_build_docs
nbdev_build_lib
nbdev_test_nbs