ctalaveraw/csv-splitter

CSVSplitter

Simple Python script to split up CSV files >2GB in size into smaller files.

About

This script uses the dask and pandas libraries to load and process large datasets in chunks, rather than reading the whole file into memory at once.

The intention is to ingest a large CSV file and split it into smaller files without losing data formatting. The size of each file output by the script is defined by the user.
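The idea of size-bounded splitting can be sketched without any third-party dependency. The following is a minimal illustration using only the Python standard library; it is not the dask/pandas implementation in csv-splitter.py. The function names `split_csv_by_size` and `_format_row` are hypothetical, and the sketch assumes a UTF-8 file whose first row is a header.

```python
import csv
import io
import os


def _format_row(row):
    """Serialize one row to CSV text so its encoded size can be measured."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerow(row)
    return buf.getvalue()


def split_csv_by_size(src_path, out_dir, max_bytes):
    """Write src_path as several files of about max_bytes each; return their paths."""
    os.makedirs(out_dir, exist_ok=True)
    paths, out, size = [], None, 0
    with open(src_path, newline="", encoding="utf-8") as src:
        reader = csv.reader(src)
        header = _format_row(next(reader))  # assumption: the first row is the header
        for row in reader:
            line = _format_row(row)
            # Open a new part when none is open or this row would overflow the current one.
            if out is None or size + len(line.encode("utf-8")) > max_bytes:
                if out is not None:
                    out.close()
                paths.append(os.path.join(out_dir, f"output-{len(paths) + 1}.csv"))
                out = open(paths[-1], "w", newline="", encoding="utf-8")
                out.write(header)  # every part repeats the header row
                size = len(header.encode("utf-8"))
            out.write(line)
            size += len(line.encode("utf-8"))
    if out is not None:
        out.close()
    return paths
```

Each part always receives at least one data row, so a single row larger than `max_bytes` produces an oversized part instead of an endless loop.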

Disclaimer:

This was made as a learning exercise and is not meant to be an elegant script. Use at your own risk on production data!

Docs

Below is a how-to on using the script.

Before running

The following steps must be taken to prepare your local host to run the script.

Set up the local environment for your OS

MacOS (using Homebrew)

CSV_DIR="$HOME/tmp"
CSV_APP="csv-splitter"
CSV_PATH="$CSV_DIR/$CSV_APP"
brew install git python3
mkdir -pv "$CSV_DIR"
git clone "git@github.com:ctalaveraw/$CSV_APP.git" "$CSV_PATH"

Linux

Ubuntu/Debian-based

CSV_DIR="$HOME/tmp"
CSV_APP="csv-splitter"
CSV_PATH="$CSV_DIR/$CSV_APP"
sudo apt-get update -y
sudo apt-get install -y git python3 python3-pip
mkdir -pv "$CSV_DIR"
git clone "git@github.com:ctalaveraw/$CSV_APP.git" "$CSV_PATH"

RHEL/CentOS-based

CSV_DIR="$HOME/tmp"
CSV_APP="csv-splitter"
CSV_PATH="$CSV_DIR/$CSV_APP"
sudo yum install -y git python3 python3-pip
mkdir -pv "$CSV_DIR"
git clone "git@github.com:ctalaveraw/$CSV_APP.git" "$CSV_PATH"

Fedora-based

CSV_DIR="$HOME/tmp"
CSV_APP="csv-splitter"
CSV_PATH="$CSV_DIR/$CSV_APP"
sudo dnf install -y git python3 python3-pip
mkdir -pv "$CSV_DIR"
git clone "git@github.com:ctalaveraw/$CSV_APP.git" "$CSV_PATH"

Arch-based

CSV_DIR="$HOME/tmp"
CSV_APP="csv-splitter"
CSV_PATH="$CSV_DIR/$CSV_APP"
sudo pacman -S git python python-pip
mkdir -pv "$CSV_DIR"
git clone "git@github.com:ctalaveraw/$CSV_APP.git" "$CSV_PATH"

Setting global variables

The following global variables should be set in csv-splitter.py before running the script. Any editor works, whether a GUI editor or one of the many terminal-based ones:

vim "$CSV_PATH/$CSV_APP.py"

OR

nano "$CSV_PATH/$CSV_APP.py"
Define the schema using the CSV_SCHEMA global variable
# Insert your own CSV schema here
CSV_SCHEMA = {
    'FIELD_1': str,
    'FIELD_2': str,
    'FIELD_3': str,
    'FIELD_4': str,
    'FIELD_5': str,
    'FIELD_6': str,
    'FIELD_7': str,
    'FIELD_8': str
}
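Typing every field name by hand is error-prone for wide files. As a rough sketch, a dictionary of the same shape can be generated from the CSV header and pasted into CSV_SCHEMA. The helper `schema_from_header` is hypothetical and not part of csv-splitter.py; it assumes the keys of CSV_SCHEMA are meant to match the column names in the first row of a UTF-8 file.

```python
import csv


def schema_from_header(csv_path):
    """Map every column name found in the header row to str."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        header = next(csv.reader(f))  # assumption: the first row holds the column names
    return {name: str for name in header}
```

Mapping every column to `str` mirrors the template above and keeps values such as leading zeros exactly as they appear in the file.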
Define the output path using the OUTPUT_PATH global variable
OUTPUT_PATH = 'result/output-*.csv'
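Judging by the output file names listed later in this README (output-1.csv, output-2.csv, ...), the `*` acts as a placeholder for the file number. The README does not say whether the script fills it in itself or leaves that to dask, so the following is only an illustration of the expansion; `output_name` is a hypothetical helper.

```python
OUTPUT_PATH = 'result/output-*.csv'


def output_name(index, pattern=OUTPUT_PATH):
    """Return the path of the index-th output file (1-based, as an assumption)."""
    # Only the first '*' is substituted, so the rest of the pattern is kept as-is.
    return pattern.replace('*', str(index), 1)
```

Changing the pattern to, say, 'result/part-*.csv' would rename every output file while keeping the numbering.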

Running the script

Install the dask and pandas libraries

cd "$CSV_PATH" && python3 -m pip install dask pandas
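To confirm the installation worked before starting a long run, a small check along these lines may help. It is a sketch only: `missing_packages` and `REQUIRED` are hypothetical names, and the check verifies that the packages can be located by the interpreter, not that their versions suit the script.

```python
import importlib.util

# Packages this README says the script depends on.
REQUIRED = ("dask", "pandas")


def missing_packages(names=REQUIRED):
    """Return the names from `names` that the current interpreter cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    print("Missing: " + ", ".join(missing) if missing else "All dependencies found.")
```

Run it with the same python3 interpreter that will run the script, since packages installed for another interpreter will not be found.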

Start the script

python3 "$CSV_PATH/$CSV_APP.py"

User Input
  • The filename of the CSV to split, without its extension. The file must be in the current working directory.
  • The desired size of each output file, in MB.
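How the two inputs above might be turned into a file path and a byte count can be sketched as follows. This does not reproduce the script's own prompt handling: `resolve_inputs` is a hypothetical helper, and treating 1 MB as 1024 * 1024 bytes is an assumption.

```python
import os


def resolve_inputs(name, size_mb, cwd="."):
    """Convert the two answers into (path to the CSV, target size in bytes)."""
    path = os.path.join(cwd, name + ".csv")  # the extension is added for the user
    size = float(size_mb)
    if size <= 0:
        raise ValueError("split size must be a positive number of MB")
    return path, int(size * 1024 * 1024)  # assumption: 1 MB = 1024 * 1024 bytes
```

For example, `resolve_inputs("data", "100")` points at `data.csv` in the current directory with a target of 104857600 bytes per file.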

Output

If successful, the split files are written to the result directory under $CSV_PATH, for example:

/home/username/tmp/csv-splitter/result/

List output files

This can be done using the following command:

ls -la "$CSV_PATH/result"

By default, the output files are named as shown below. The naming pattern can be changed through the OUTPUT_PATH global variable in csv-splitter.py:

output-1.csv
output-2.csv
output-3.csv
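Since the goal is to split without losing data, it can be worth checking that the output files together hold as many rows as the source. The sketch below assumes that the source and every output file start with a header row; `count_data_rows` and `split_is_complete` are hypothetical helpers, not part of csv-splitter.py.

```python
import csv
import glob


def count_data_rows(path):
    """Count the rows in a CSV file, excluding the header row."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        next(reader, None)  # assumption: the first row is a header
        return sum(1 for _ in reader)


def split_is_complete(source_csv, output_glob):
    """Return True if the files matching output_glob hold as many rows as the source."""
    parts = sorted(glob.glob(output_glob))
    if not parts:
        return False  # nothing was written
    return count_data_rows(source_csv) == sum(count_data_rows(p) for p in parts)
```

A call such as `split_is_complete("data.csv", "result/output-*.csv")` compares row counts only; it does not compare the contents of the rows.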
