Simple Python script to split up CSV files >2GB in size into smaller files.
This script uses the dask and pandas libraries to load and process large data sets into memory via chunks.
The intention is to ingest a large CSV payload and split it into smaller files without losing data formatting. The size of each file output by the script is defined by the user.
This was made as a learning exercise and is not meant to be an elegant script. Use at your own risk for production data!
Below is a how-to on using the script.
The following steps must be taken to prepare your local host to run the script. Use the block that matches your package manager (Homebrew, apt-get, yum, dnf, or pacman).
CSV_DIR="$HOME/tmp"
CSV_APP="csv-splitter"
CSV_PATH="$CSV_DIR/$CSV_APP"
brew install git python3
mkdir -pv "$CSV_DIR"
git clone "git@github.com:ctalaveraw/$CSV_APP.git" "$CSV_PATH"
CSV_DIR="$HOME/tmp"
CSV_APP="csv-splitter"
CSV_PATH="$CSV_DIR/$CSV_APP"
sudo apt-get update -y
sudo apt-get install -y git python3 python3-pip
mkdir -pv "$CSV_DIR"
git clone "git@github.com:ctalaveraw/$CSV_APP.git" "$CSV_PATH"
CSV_DIR="$HOME/tmp"
CSV_APP="csv-splitter"
CSV_PATH="$CSV_DIR/$CSV_APP"
sudo yum install -y git python3 python3-pip
mkdir -pv "$CSV_DIR"
git clone "git@github.com:ctalaveraw/$CSV_APP.git" "$CSV_PATH"
CSV_DIR="$HOME/tmp"
CSV_APP="csv-splitter"
CSV_PATH="$CSV_DIR/$CSV_APP"
sudo dnf install -y git python3 python3-pip
mkdir -pv "$CSV_DIR"
git clone "git@github.com:ctalaveraw/$CSV_APP.git" "$CSV_PATH"
CSV_DIR="$HOME/tmp"
CSV_APP="csv-splitter"
CSV_PATH="$CSV_DIR/$CSV_APP"
sudo pacman -S git python python-pip
mkdir -pv "$CSV_DIR"
git clone "git@github.com:ctalaveraw/$CSV_APP.git" "$CSV_PATH"
The following global values should be set in csv-splitter.py before execution.
This can be done using a GUI editor or using one of the many terminal-based ones:
vim "$CSV_PATH/$CSV_APP.py"
OR
nano "$CSV_PATH/$CSV_APP.py"
# Insert your own CSV schema here
CSV_SCHEMA = {
    'FIELD_1': str,
    'FIELD_2': str,
    'FIELD_3': str,
    'FIELD_4': str,
    'FIELD_5': str,
    'FIELD_6': str,
    'FIELD_7': str,
    'FIELD_8': str
}
OUTPUT_PATH = 'result/output-*.csv'
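For wide files, typing the schema by hand can be tedious. The sketch below builds a dictionary of the same shape from the header row of a CSV, mapping every column to `str`. The helper `schema_from_header` is hypothetical and is not part of csv-splitter.py, which expects CSV_SCHEMA to be edited manually.

```python
import csv


def schema_from_header(csv_file):
    """Map every column named in the header row to str.

    Hypothetical helper for illustration; not part of csv-splitter.py.
    """
    # Only the first record is read, so this stays fast on very large files.
    with open(csv_file, newline="") as handle:
        header = next(csv.reader(handle))
    return {column: str for column in header}
```

The result can be printed and pasted into CSV_SCHEMA. Typing every column as `str`, as the placeholder schema does, keeps values such as leading zeros from being reinterpreted as numbers.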
cd "$CSV_PATH" && python3 -m pip install dask pandas
python3 "$CSV_PATH/$CSV_APP.py"
- The user provides the filename (without extension) of the CSV to split. The file must be in the current working directory.
- The user provides the desired size, in MB, of each split file the script outputs.
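The two inputs above could be turned into a file path and a byte count along the following lines. This is a sketch under assumptions: the helper name `resolve_inputs` is hypothetical, and treating 1 MB as 1024 * 1024 bytes is a guess at the convention, not something confirmed by csv-splitter.py.

```python
import os


def resolve_inputs(name, size_mb):
    """Turn the two user answers into a CSV path and a size in bytes.

    Hypothetical helper for illustration; not part of csv-splitter.py.
    """
    # The name is given without extension and is looked up in the
    # current working directory.
    csv_file = os.path.join(os.getcwd(), name + ".csv")

    # Assumption: 1 MB = 1024 * 1024 bytes.
    size_bytes = int(float(size_mb) * 1024 * 1024)
    if size_bytes <= 0:
        raise ValueError("split size must be greater than 0 MB")

    return csv_file, size_bytes
```

Both arguments accept the raw strings returned by `input()`, so the helper can be called directly on what the user typed.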
If successful, the split files are written to the following location:
/home/username/tmp/csv-splitter/result/
The output can be verified using the following command:
ls -la "$CSV_PATH/result"
By default, the output files are named as shown below; this can be changed via the OUTPUT_PATH global variable in csv-splitter.py:
output-1.csv
output-2.csv
output-3.csv
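Beyond listing the files, a row count comparison can help confirm that no data was lost in the split. The sketch below assumes each split file carries its own header row; the helpers `count_rows` and `rows_match` are hypothetical and are not part of csv-splitter.py.

```python
import csv
import glob


def count_rows(csv_file):
    """Count data rows in a CSV file, excluding the header row."""
    # The file is streamed record by record, so memory use stays small.
    with open(csv_file, newline="") as handle:
        return max(sum(1 for _ in csv.reader(handle)) - 1, 0)


def rows_match(original_csv, output_pattern):
    """Return True if the split files hold as many data rows as the original.

    Assumes every split file starts with its own header row.
    """
    parts = glob.glob(output_pattern)
    return count_rows(original_csv) == sum(count_rows(p) for p in parts)
```

The `output_pattern` argument takes the same form as OUTPUT_PATH, for example 'result/output-*.csv'.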