CSVSplitter

A simple Python script that splits CSV files larger than 2 GB into smaller files.

About

This script uses the dask and pandas libraries to load and process large datasets in chunks, so the whole file never has to fit in memory at once.

The intention is to ingest a large CSV file and split it into smaller files without losing data formatting. The size of each file output by the script is defined by the user.
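
To make the idea of a size-based split concrete, here is a minimal sketch that uses only the standard library. It is not the script's dask-based implementation; the names `split_csv`, `out_pattern`, and `max_bytes` are hypothetical and chosen for illustration. It streams the input row by row, repeats the header in every part, and starts a new part once the current one reaches the size limit.

```python
import csv
import io


def _encode_row(row):
    """Serialize one row to CSV text so its size in bytes can be measured."""
    buf = io.StringIO()
    csv.writer(buf).writerow(row)
    return buf.getvalue()


def split_csv(src_path, out_pattern, max_bytes):
    """Split src_path into parts of roughly max_bytes, repeating the header.

    The '*' in out_pattern is replaced by the part number (1, 2, ...).
    A part can exceed max_bytes by at most one row. Returns the paths written.
    """
    written = []
    out = None
    size = 0
    with open(src_path, newline="", encoding="utf-8") as src:
        reader = csv.reader(src)
        header = next(reader, None)
        if header is None:
            return written  # empty input: nothing to split
        header_text = _encode_row(header)
        try:
            for row in reader:
                # Open a new part when none is open yet or the current one is full.
                if out is None or size >= max_bytes:
                    if out is not None:
                        out.close()
                    part = out_pattern.replace("*", str(len(written) + 1))
                    out = open(part, "w", newline="", encoding="utf-8")
                    out.write(header_text)
                    size = len(header_text.encode("utf-8"))
                    written.append(part)
                text = _encode_row(row)
                out.write(text)
                size += len(text.encode("utf-8"))
        finally:
            if out is not None:
                out.close()
    return written
```

Because rows pass through `csv.reader` and `csv.writer`, field values are preserved but quoting may be normalized, which is one reason a real run should be checked against the original data.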

Disclaimer:

This was made as a learning exercise, not meant to be an elegant script. Use at your own risk for production data!

Docs

Below is a how-to on using the script.

Before running

The following steps must be taken to prepare your local host to run the script.

Set up local env depending on OS version

MacOS (using Homebrew)

CSV_DIR="$HOME/tmp"
CSV_APP="csv-splitter"
CSV_PATH="$CSV_DIR/$CSV_APP"
brew install git python3
mkdir -pv "$CSV_DIR"
git clone "git@github.com:ctalaveraw/$CSV_APP.git" "$CSV_PATH"

Linux

Ubuntu/Debian-based

CSV_DIR="$HOME/tmp"
CSV_APP="csv-splitter"
CSV_PATH="$CSV_DIR/$CSV_APP"
sudo apt-get update -y
sudo apt-get install -y git python3 python3-pip
mkdir -pv "$CSV_DIR"
git clone "git@github.com:ctalaveraw/$CSV_APP.git" "$CSV_PATH"

RHEL/CentOS-based

CSV_DIR="$HOME/tmp"
CSV_APP="csv-splitter"
CSV_PATH="$CSV_DIR/$CSV_APP"
sudo yum install -y git python3 python3-pip
mkdir -pv "$CSV_DIR"
git clone "git@github.com:ctalaveraw/$CSV_APP.git" "$CSV_PATH"

Fedora-based

CSV_DIR="$HOME/tmp"
CSV_APP="csv-splitter"
CSV_PATH="$CSV_DIR/$CSV_APP"
sudo dnf install -y git python3 python3-pip
mkdir -pv "$CSV_DIR"
git clone "git@github.com:ctalaveraw/$CSV_APP.git" "$CSV_PATH"

Arch-based

CSV_DIR="$HOME/tmp"
CSV_APP="csv-splitter"
CSV_PATH="$CSV_DIR/$CSV_APP"
sudo pacman -S git python python-pip
mkdir -pv "$CSV_DIR"
git clone "git@github.com:ctalaveraw/$CSV_APP.git" "$CSV_PATH"

Setting global variables

The following global values should be set in csv-splitter.py before execution. This can be done using a GUI editor or using one of the many terminal-based ones:

vim "$CSV_PATH/$CSV_APP.py"

OR

nano "$CSV_PATH/$CSV_APP.py"

Define the schema using the `CSV_SCHEMA` global variable

# Insert your own CSV schema here
CSV_SCHEMA = {
    'FIELD_1': str,
    'FIELD_2': str,
    'FIELD_3': str,
    'FIELD_4': str,
    'FIELD_5': str,
    'FIELD_6': str,
    'FIELD_7': str,
    'FIELD_8': str
}
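
A dictionary of this shape (column name mapped to a Python type) is the form that the `dtype` argument of `pandas.read_csv` accepts; whether the script passes `CSV_SCHEMA` that way is an assumption. Before running, it can help to confirm that the schema keys match the CSV header. The sketch below does that with the standard library only; `check_header` and the shortened three-field schema are hypothetical.

```python
import csv
import io

# Shortened placeholder schema, for illustration only
CSV_SCHEMA = {"FIELD_1": str, "FIELD_2": str, "FIELD_3": str}


def check_header(csv_file, schema):
    """Compare the CSV header row to the schema keys.

    Returns (missing, extra): schema columns absent from the header,
    and header columns absent from the schema.
    """
    header = next(csv.reader(csv_file), [])
    missing = [name for name in schema if name not in header]
    extra = [name for name in header if name not in schema]
    return missing, extra


# In-memory sample standing in for a real file opened with open(path, newline="")
sample = io.StringIO("FIELD_1,FIELD_2,FIELD_9\nx,y,z\n")
missing, extra = check_header(sample, CSV_SCHEMA)
print("missing:", missing)  # missing: ['FIELD_3']
print("extra:", extra)      # extra: ['FIELD_9']
```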

Define the output path using the `OUTPUT_PATH` global variable

OUTPUT_PATH = 'result/output-*.csv'
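
The `*` in this pattern is a placeholder for the part number. dask's `to_csv` accepts a path of this form and substitutes a number for the `*` in each output file. The sketch below shows the expansion with plain string handling; `part_path` is a hypothetical helper, and numbering from 1 is an assumption taken from the file listing in the Output section.

```python
import os

OUTPUT_PATH = 'result/output-*.csv'


def part_path(pattern, index):
    """Replace the '*' placeholder in pattern with the part number."""
    return pattern.replace('*', str(index))


# Directory portion of the pattern, where the split files end up
output_dir = os.path.dirname(OUTPUT_PATH)
print(output_dir)  # result
print([part_path(OUTPUT_PATH, i) for i in range(1, 4)])
# ['result/output-1.csv', 'result/output-2.csv', 'result/output-3.csv']
```

Changing only the text around the `*` (for example `'result/part-*.csv'`) renames the files while keeping the numbering.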

Running the script

Install the `dask` and `pandas` libraries

cd "$CSV_PATH" && python3 -m pip install dask pandas
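
To confirm that the install worked before starting a long run, a quick check like the following can be used. It is a small sketch using the standard library's `importlib`; `missing_packages` is a hypothetical helper, not part of the script.

```python
import importlib.util


def missing_packages(names):
    """Return the package names that cannot be found by the current interpreter."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(["dask", "pandas"])
if missing:
    print("Not installed:", ", ".join(missing))
else:
    print("dask and pandas are available")
```

Run it with the same `python3` that will run the script, since packages are installed per interpreter.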

Start the script

python3 "$CSV_PATH/$CSV_APP.py"

User Input

The user provides the filename (without extension) of the CSV to split; the file must be in the current working directory.
The user provides the desired size, in MB, of each split file the script outputs.
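
The two inputs above can be turned into a file path and a byte count along these lines. This is a sketch, not the script's own code: `resolve_inputs` is hypothetical, and treating 1 MB as 1,000,000 bytes is an assumption, since the README does not say whether the script uses 10^6 or 2^20.

```python
import os

BYTES_PER_MB = 1_000_000  # assumption: decimal megabytes


def resolve_inputs(name, size_mb, directory="."):
    """Convert the filename and size prompts into (csv_path, size_in_bytes)."""
    if name.lower().endswith(".csv"):
        raise ValueError("enter the filename without the .csv extension")
    size = float(size_mb)  # raises ValueError for non-numeric input
    if size <= 0:
        raise ValueError("size must be a positive number of MB")
    path = os.path.join(directory, name + ".csv")
    return path, int(size * BYTES_PER_MB)


if __name__ == "__main__":
    # Sample values standing in for what the user would type at the prompts
    print(resolve_inputs("data", "50"))
```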

Output

If the run succeeds, the split files are written to the result directory under $CSV_PATH, for example:

/home/username/tmp/csv-splitter/result/

List output files

This can be done using the following command:

ls -la "$CSV_PATH/result"

By default, the output files are named as shown below; this can be changed via the OUTPUT_PATH global variable in csv-splitter.py.

output-1.csv
output-2.csv
output-3.csv
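
Once there are ten or more parts, a plain `ls` typically lists output-10.csv before output-2.csv, because names are compared as text. A small sketch that lists the parts in numeric order instead, assuming the default `output-N.csv` naming (`list_parts` is a hypothetical helper):

```python
import glob
import os
import re


def list_parts(pattern):
    """Return files matching pattern, ordered by the number before '.csv'."""
    def part_number(path):
        match = re.search(r"(\d+)\.csv$", os.path.basename(path))
        return int(match.group(1)) if match else -1

    return sorted(glob.glob(pattern), key=part_number)


if __name__ == "__main__":
    for path in list_parts("result/output-*.csv"):
        print(path, os.path.getsize(path), "bytes")
```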
