New tool: tsv-split #270

Merged
merged 23 commits into from
Mar 18, 2020

jondegenhardt
Contributor

This PR adds a new tool, tsv-split, which splits a large data set into multiple smaller files. It is similar to the Unix split utility but adds several capabilities: random assignment of lines to files, random assignment based on key fields, and header line support.

From the help:

Synopsis: tsv-split [options] [file...]

Split input lines into multiple output files. There are three modes of
operation:

* Fixed number of lines per file (--l|lines-per-file NUM): Each input
  block of NUM lines is written to a new file. Similar to Unix 'split'.

* Random assignment (--n|num-files NUM): Each input line is written to a
  randomly selected output file. Random selection is from NUM files.

* Random assignment by key (--n|num-files NUM, --k|key-fields FIELDS):
  Input lines are written to output files using fields as a key. Each
  unique key is randomly assigned to one of NUM output files. All lines
  with the same key are written to the same file.

By default, files are written to the current directory and have names of the
form 'part_NNN.tsv', with 'NNN' being a number. The output directory and
file names are customizable.

Options:

     --help-verbose      Print more detailed help.
-H         --header      Input files have a header line. Write the header to each output file.
-I --header-in-only      Input files have a header line. Do not write the header to output files.
-l --lines-per-file NUM  Number of lines to write to each output file (excluding the header line).
-n      --num-files NUM  Number of output files to generate.
-k     --key-fields <field-list>  Fields to use as key. Lines with the same key are written to the same output file. Use '--k|key-fields 0' to use the entire line as the key.
              --dir STR  Directory to write to. Default: Current working directory.
           --prefix STR  Filename prefix. Default: 'part_'
           --suffix STR  Filename suffix. Default: '.tsv'
-a         --append      Append to existing files.
-s    --static-seed      Use the same random seed every run.
-v     --seed-value NUM  Sets the random seed. Use a non-zero, 32 bit positive integer. Zero is a no-op.
-d      --delimiter CHR  Field delimiter.
   --max-open-files NUM  Maximum open file handles to use. Min of 5 required.
-V        --version      Print version information and exit.
-h           --help      This help information.
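
To make the three modes concrete, here is a minimal Python sketch of the assignment logic described in the help text. It is not the tool's implementation: the function names, the in-memory dictionaries standing in for output files, the seeded SHA-256 hash used to map keys to files, the three-digit zero padding, and the 1-based field numbering are all assumptions made for illustration.

```python
import hashlib
import random
from collections import defaultdict


def part_name(index, prefix="part_", suffix=".tsv", width=3):
    """Build a name such as 'part_000.tsv'. The zero-padded width is an assumption."""
    return f"{prefix}{index:0{width}d}{suffix}"


def split_by_lines(lines, lines_per_file):
    """Mode 1: each block of `lines_per_file` input lines goes to a new file."""
    parts = defaultdict(list)
    for i, line in enumerate(lines):
        parts[part_name(i // lines_per_file)].append(line)
    return dict(parts)


def split_random(lines, num_files, seed=None):
    """Mode 2: each line goes to one of `num_files` files chosen at random."""
    rng = random.Random(seed)  # a fixed seed gives a repeatable split
    parts = defaultdict(list)
    for line in lines:
        parts[part_name(rng.randrange(num_files))].append(line)
    return dict(parts)


def split_by_key(lines, num_files, key_fields, delimiter="\t", seed=0):
    """Mode 3: lines sharing a key always land in the same file.

    Assumptions: the key is mapped to a file with a seeded hash, field
    numbers are 1-based, and [0] means the whole line is the key.
    """
    parts = defaultdict(list)
    for line in lines:
        if key_fields == [0]:
            key = line
        else:
            fields = line.split(delimiter)
            key = delimiter.join(fields[f - 1] for f in key_fields)
        digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:8], "big") % num_files
        parts[part_name(bucket)].append(line)
    return dict(parts)


if __name__ == "__main__":
    rows = ["a\t1", "b\t2", "a\t3", "c\t4", "b\t5"]
    print(split_by_lines(rows, 2))
    print(split_random(rows, 3, seed=42))
    print(split_by_key(rows, 3, key_fields=[1]))
```

Hashing the key, rather than drawing a random number per line, is one simple way to guarantee that every line with the same key reaches the same output file without keeping a table of keys already seen.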

@jondegenhardt jondegenhardt merged commit 3e214d8 into eBay:master Mar 18, 2020
@jondegenhardt jondegenhardt deleted the tsv-split branch March 18, 2020 07:05