diff --git a/README.md b/README.md index d09dccf2..da32c92a 100644 --- a/README.md +++ b/README.md @@ -7,12 +7,13 @@ These tools are especially useful when working with large data sets. They run fa File an [issue](https://github.com/eBay/tsv-utils/issues) if you have problems, questions or suggestions. **In this README:** -* [Tools overview](#tools-overview) - Descriptions of each tool. +* [Tools overview](#tools-overview) - Toolkit introduction and descriptions of each tool. * [Obtaining and installation](#obtaining-and-installation) **Additional documents:** -* [Tools reference](docs/ToolReference.md) - Detailed documentation. +* [Tools Reference](docs/ToolReference.md) - Detailed documentation. * [Releases](https://github.com/eBay/tsv-utils/releases) - Prebuilt binaries and release notes. + * New in version 2.0: Named field support! * [Tips and tricks](docs/TipsAndTricks.md) - Simpler and faster command line tool use. * [Performance Studies](docs/Performance.md) - Benchmarks against similar tools and other performance studies. * [Comparing TSV and CSV formats](docs/comparing-tsv-and-csv.md) @@ -36,12 +37,12 @@ File an [issue](https://github.com/eBay/tsv-utils/issues) if you have problems, These tools perform data manipulation and statistical calculations on tab delimited data. They are intended for large files. Larger than ideal for loading entirely in memory in an application like R, but not so big as to necessitate moving to Hadoop or similar distributed compute environments. The features supported are useful both for standalone analysis and for preparing data for use in R, Pandas, and similar toolkits. -The tools work like traditional Unix command line utilities such as `cut`, `sort`, `grep` and `awk`, and are intended to complement these tools. Each tool is a standalone executable. They follow common Unix conventions for pipeline programs. Data is read from files or standard input, results are written to standard output. The field separator defaults to TAB, but any character can be used. Input and output is UTF-8, and all operations are Unicode ready, including regular expression match (`tsv-filter`). Documentation is available for each tool by invoking it with the `--help` option. TSV format is similar to CSV, see [Comparing TSV and CSV formats](docs/comparing-tsv-and-csv.md) for the differences. +The tools work like traditional Unix command line utilities such as `cut`, `sort`, `grep` and `awk`, and are intended to complement these tools. Each tool is a standalone executable. They follow common Unix conventions for pipeline programs. Data is read from files or standard input, results are written to standard output. Fields are identified either by field name or field number. The field separator defaults to TAB, but any character can be used. Input and output is UTF-8, and all operations are Unicode ready, including regular expression match (`tsv-filter`). Documentation is available for each tool by invoking it with the `--help` option. TSV format is similar to CSV, see [Comparing TSV and CSV formats](docs/comparing-tsv-and-csv.md) for the differences. -The rest of this section contains descriptions of each tool. Click on the links below to jump directly to one of the tools. Full documentation is available in the [tool reference](docs/ToolReference.md). +The rest of this section contains descriptions of each tool. Click on the links below to jump directly to one of the tools. Full documentation is available in the [Tools Reference](docs/ToolReference.md). 
The first tool listed, [tsv-filter](#tsv-filter), provides a tutorial introduction to features found throughout the toolkit. -* [tsv-filter](#tsv-filter) - Filter lines using numeric, string and regular expression comparisons against individual fields. This description also provides an introduction to features found throughout the toolkit. -* [tsv-select](#tsv-select) - Keep a subset of columns (fields). Like `cut`, but with field reordering. +* [tsv-filter](#tsv-filter) - Filter lines using numeric, string and regular expression comparisons against individual fields. +* [tsv-select](#tsv-select) - Keep a subset of columns (fields). Like `cut`, but supporting named fields, field reordering, and field exclusions. * [tsv-uniq](#tsv-uniq) - Filter out duplicate lines using either the full line or individual fields as a key. * [tsv-summarize](#tsv-summarize) - Summary statistics on selected fields, against the full data set or grouped by key. * [tsv-sample](#tsv-sample) - Sample input lines or randomize their order. A number of sampling methods are available. @@ -69,24 +70,28 @@ $ tsv-pretty data.tsv | head -n 5 The following command finds all entries where 'year' (field 3) is 2008: ``` -$ tsv-filter -H --eq 3:2008 data.tsv +$ tsv-filter -H --eq year:2008 data.tsv ``` -The `--eq` operator performs a numeric equality test. String comparisons are also available. The following command finds entries where 'color' (field 2) is "red": +The `-H` option indicates the first input line is a header. The `--eq` operator performs a numeric equality test. String comparisons are also available. The following command finds entries where 'color' (field 2) is "red": ``` -$ tsv-filter -H --str-eq 2:red data.tsv +$ tsv-filter -H --str-eq color:red data.tsv ``` -Fields are identified by a 1-up field number, same as traditional Unix tools. The `-H` option preserves the header line. +Fields can also be identified by field number, same as traditional Unix tools. This works for files with and without header lines. The following commands are equivalent to the previous two: +``` +$ tsv-filter -H --eq 3:2008 data.tsv +$ tsv-filter -H --str-eq 2:red data.tsv +``` -Multiple tests can be specified. The following command finds `red` entries with years between 1850 and 1950: +Multiple tests can be specified. The following command finds `red` entries with `year` between 1850 and 1950: ``` -$ tsv-filter -H --str-eq 2:red --ge 3:1850 --lt 3:1950 data.tsv +$ tsv-filter -H --str-eq color:red --ge year:1850 --lt year:1950 data.tsv ``` Viewing the first few results produced by this command: ``` -$ tsv-filter -H --str-eq 2:red --ge 3:1850 --lt 3:1950 data.tsv | tsv-pretty | head -n 5 +$ tsv-filter -H --str-eq color:red --ge year:1850 --lt year:1950 data.tsv | tsv-pretty | head -n 5 id color year count 101 red 1935 756 106 red 1883 1156 @@ -96,9 +101,9 @@ $ tsv-filter -H --str-eq 2:red --ge 3:1850 --lt 3:1950 data.tsv | tsv-pretty | h Files can be placed anywhere on the command line. Data will be read from standard input if a file is not specified. 
The following commands are equivalent: ``` -$ tsv-filter -H --str-eq 2:red --ge 3:1850 --lt 3:1950 data.tsv -$ tsv-filter data.tsv -H --str-eq 2:red --ge 3:1850 --lt 3:1950 -$ cat data.tsv | tsv-filter -H --str-eq 2:red --ge 3:1850 --lt 3:1950 +$ tsv-filter -H --str-eq color:red --ge year:1850 --lt year:1950 data.tsv +$ tsv-filter data.tsv -H --str-eq color:red --ge year:1850 --lt year:1950 +$ cat data.tsv | tsv-filter -H --str-eq color:red --ge year:1850 --lt year:1950 ``` Multiple files can be provided. Only the header line from the first file will be kept when the `-H` option is used: @@ -111,12 +116,12 @@ Numeric comparisons are among the most useful tests. Numeric operators include: * Equality: `--eq`, `--ne` (equal, not-equal). * Relational: `--lt`, `--le`, `--gt`, `--ge` (less-than, less-equal, greater-than, greater-equal). -Several filters are available to help with invalid entries. Assume there is a messier version of the 4-field file where some fields are not filled in. The following command will filter out all lines with an empty value in any of the four fields: +Several filters are available to help with invalid data. Assume there is a messier version of the 4-field file where some fields are not filled in. The following command will filter out all lines with an empty value in any of the four fields: ``` $ tsv-filter -H messy.tsv --not-empty 1-4 ``` -The above command uses a "field list" to specify running the test on each of fields 1-4. The test can be "inverted" to see the lines that were filtered out: +The above command uses a "field list" to run the test on each of fields 1-4. The test can be "inverted" to see the lines that were filtered out: ``` $ tsv-filter -H messy.tsv --invert --not-empty 1-4 | head -n 5 | tsv-pretty id color year count @@ -153,7 +158,19 @@ The earlier `--not-empty` example uses a "field list". Fields lists specify a se $ tsv-filter -H --lt 1-3,7:100 file.tsv ``` -Most of the TSV Utilities tools support field lists. See [Field numbers and field-lists](docs/ToolReference.md#field-numbers-and-field-lists) in the [Tools reference](docs/ToolReference.md) document for details. +Field names can be used in field lists as well. The following command selects lines where both 'color' and 'count' fields are not empty: +``` +$ tsv-filter -H messy.tsv --not-empty color,count +``` + +Field names can be matched using wildcards. The previous command could also be written as: +``` +$ tsv-filter -H messy.tsv --not-empty 'co*' +``` + +The `co*` matches both the 'color' and 'count' fields. (Note: Single quotes are used to prevent the shell from interpreting the asterisk character.) + +All TSV Utilities tools use the same syntax for specifying fields. See [Field syntax](docs/tool_reference/common-options-and-behavior.md#field-syntax) in the [Tools Reference](docs/ToolReference.md) document for details. Bash completion is especially helpful with `tsv-filter`. It allows quickly seeing and selecting from the different operators available. See [bash completion](docs/TipsAndTricks.md#enable-bash-completion) on the [Tips and tricks](docs/TipsAndTricks.md) page for setup information. @@ -164,27 +181,39 @@ This makes `tsv-filter` ideal for preparing data for applications like R and Pan $ tsv-filter --ne 4:0 file.tsv | wc -l ``` -See the [tsv-filter reference](docs/ToolReference.md#tsv-filter-reference) for more details and the full list of operators. +See the [tsv-filter reference](docs/tool_reference/tsv-filter.md) for more details and the full list of operators. 
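+
+If `file.tsv` has a header line, the same test can be written with a field name rather than a field number (assuming the header names this field 'count'):
+```
+$ tsv-filter -H --ne count:0 file.tsv | wc -l
+```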
### tsv-select
 
-A version of the Unix `cut` utility with the ability to re-order fields. The following command writes fields [4, 2, 9, 10, 11] from a pair of files to stdout:
+A version of the Unix `cut` utility with the ability to select fields by name, drop fields, and reorder fields. The following command writes the `date` and `time` fields from a pair of files to standard output:
+```
+$ tsv-select -H -f date,time file1.tsv file2.tsv
+```
+Fields can also be selected by field number:
```
$ tsv-select -f 4,2,9-11 file1.tsv file2.tsv
```
 
-Fields can be listed more than once, and fields not specified can be selected as a group using `--r|rest`. Fields can be dropped using `--e|exclude`. When working with multiple files, the `--H|header` option can be used to retain the header from just the first file.
+Fields can be listed more than once, and fields not specified can be selected as a group using `--r|rest`. Fields can be dropped using `--e|exclude`.
+
+The `--H|header` option turns on header processing. This enables specifying fields by name. Only the header from the first file is retained when multiple input files are provided.
 
Examples:
```
$ # Output fields 2 and 1, in that order.
$ tsv-select -f 2,1 data.tsv
 
+$ # Output the 'Name' and 'RecordNum' fields.
+$ tsv-select -H -f Name,RecordNum data.tsv
+
$ # Drop the first field, keep everything else.
$ tsv-select --exclude 1 file.tsv
 
-$ # Move field 7 to the start of the line.
-$ tsv-select -f 7 --rest last data.tsv
+$ # Drop the 'Color' field, keep everything else.
+$ tsv-select -H --exclude Color file.tsv
+
+$ # Move the 'RecordNum' field to the start of the line.
+$ tsv-select -H -f RecordNum --rest last data.tsv
 
$ # Move field 1 to the end of the line.
$ tsv-select -f 1 --rest first data.tsv
@@ -192,11 +221,14 @@ $ tsv-select -f 1 --rest first data.tsv
 
$ # Output a range of fields in reverse order.
$ tsv-select -f 30-3 data.tsv
 
+$ # Drop all the fields ending in '_time'.
+$ tsv-select -H -e '*_time' data.tsv
+
$ # Multiple files with header lines. Keep only one header.
$ tsv-select data*.tsv -H --fields 1,2,4-7,14
```
 
-See the [tsv-select reference](docs/ToolReference.md#tsv-select-reference) for details.
+See the [tsv-select reference](docs/tool_reference/tsv-select.md) for details on `tsv-select`. See [Field syntax](docs/tool_reference/common-options-and-behavior.md#field-syntax) for more information on selecting fields by name.
 
### tsv-uniq
 
@@ -204,16 +236,23 @@ Similar in spirit to the Unix `uniq` tool, `tsv-uniq` filters a dataset so there
 
`tsv-uniq` can also be run in 'equivalence class identification' mode, where lines with equivalent keys are marked with a unique id rather than filtered out. Another variant is 'number' mode, which generates line numbers grouped by the key.
 
-An example uniq'ing a file on fields 2 and 3:
+`tsv-uniq` operates on the entire line when no fields are specified. This is a useful alternative to the traditional `sort -u` or `sort | uniq` paradigms for identifying unique lines in unsorted files, as it is quite a bit faster, especially when there are many duplicate lines. As a bonus, order of the input lines is retained.
+
+Examples:
```
+$ # Unique a file based on the full line.
+$ tsv-uniq data.tsv
+
+$ # Unique a file with fields 2 and 3 as the key.
$ tsv-uniq -f 2,3 data.tsv
-```
-`tsv-uniq` operates on the entire line when no fields are specified. This is a useful alternative to the traditional `sort -u` or `sort | uniq` paradigms for identifying unique lines in unsorted files, as it is quite a bit faster, especially when there are many duplicate lines. As a bonus, order of the input lines is retained.
+$ # Unique a file using the 'RecordID' field as the key.
+$ tsv-uniq -H -f RecordID data.tsv
+```
 
An in-memory lookup table is used to record unique entries. This ultimately limits the data sizes that can be processed. The author has found that datasets with up to about 10 million unique entries work fine, but performance starts to degrade after that. Even then it remains faster than the alternatives.
 
-See the [tsv-uniq reference](docs/ToolReference.md#tsv-uniq-reference) for details.
+See the [tsv-uniq reference](docs/tool_reference/tsv-uniq.md) for details.
 
### tsv-summarize
 
@@ -228,19 +267,19 @@ blue 10
```
Calculations of the sum and mean of the `weight` column are shown below. The first command runs calculations on all values. The second groups them by color.
```
-$ tsv-summarize --header --sum 2 --mean 2 data.tsv
+$ tsv-summarize --header --sum weight --mean weight data.tsv
weight_sum weight_mean
40 8
 
-$ tsv-summarize --header --group-by 1 --sum 2 --mean 2 data.tsv
+$ tsv-summarize --header --group-by color --sum weight --mean weight data.tsv
color weight_sum weight_mean
red 15 5
blue 25 12.5
```
 
-Multiple fields can be used as the `--group-by` key. The file's sort order does not matter, there is no need to sort in the `--group-by` order first.
+Multiple fields can be used as the `--group-by` key. The file's sort order does not matter; there is no need to sort in the `--group-by` order first. Fields can be specified either by name or field number, like other tsv-utils tools.
 
-See the [tsv-summarize reference](docs/ToolReference.md#tsv-summarize-reference) for the list of statistical and other aggregation operations available.
+See the [tsv-summarize reference](docs/tool_reference/tsv-summarize.md) for the list of statistical and other aggregation operations available.
 
### tsv-sample
 
@@ -255,20 +294,22 @@ See the [tsv-summarize reference](docs/ToolReference.md#tsv-summarize-reference)
 
`tsv-sample` is designed for large data sets. Streaming algorithms make immediate decisions on each line. They do not accumulate memory and can run on infinite length input streams. Both shuffling and sampling with replacement read in the entire dataset and are limited by available memory. Simple and weighted random sampling use reservoir sampling and only need to hold the specified sample size (`--n|num`) in memory. By default, a new random order is generated every run, but options are available for using the same randomization order over multiple runs. The random values assigned to each line can be printed, either to observe the behavior or to run custom algorithms on the results.
 
-See the [tsv-sample reference](docs/ToolReference.md#tsv-sample-reference) for further details.
+See the [tsv-sample reference](docs/tool_reference/tsv-sample.md) for further details.
 
### tsv-join
 
-Joins lines from multiple files based on a common key. One file, the 'filter' file, contains the records (lines) being matched. The other input files are scanned for matching records. Matching records are written to standard output, along with any designated fields from the filter file. In database parlance this is a hash semi-join. Example:
+Joins lines from multiple files based on a common key. One file, the 'filter' file, contains the records (lines) being matched.
The other input files are scanned for matching records. Matching records are written to standard output, along with any designated fields from the filter file. In database parlance this is a hash semi-join. This is similar to the "stream-static" joins available in Spark Structured Streaming and "KStream-KTable" joins in Kafka. (The filter file plays the same role as the Spark static dataset or Kafka KTable.)
+
+Example:
```
-$ tsv-join --filter-file filter.tsv --key-fields 1,3 --append-fields 5,6 data.tsv
+$ tsv-join -H --filter-file filter.tsv --key-fields Country,City --append-fields Population,Elevation data.tsv
```
 
-This reads `filter.tsv`, creating a lookup table keyed on fields 1 and 3. `data.tsv` is read, lines with a matching key are written to standard output with fields 5 and 6 from `filter.tsv` appended. This is a form of inner-join. Outer-joins and anti-joins can also be done.
+This reads `filter.tsv`, creating a lookup table keyed on the `Country` and `City` fields. `data.tsv` is read, lines with a matching key are written to standard output with the `Population` and `Elevation` fields from `filter.tsv` appended. This is an inner join. Left outer joins and anti-joins are also supported.
 
Common uses for `tsv-join` are to join related datasets or to filter one dataset based on another. Filter file entries are kept in memory, this limits the ultimate size that can be handled effectively. The author has found that filter files up to about 10 million lines are processed effectively, but performance starts to degrade after that.
 
-See the [tsv-join reference](docs/ToolReference.md#tsv-join-reference) for details.
+See the [tsv-join reference](docs/tool_reference/tsv-join.md) for details.
 
### tsv-pretty
 
@@ -293,7 +334,7 @@ Chartreuse 1139 77.02 6.220
Fluorescent Orange 422 1141.70 7.921
Grey 19 140.30 1.030
```
-See the [tsv-pretty reference](docs/ToolReference.md#tsv-pretty-reference) for details.
+See the [tsv-pretty reference](docs/tool_reference/tsv-pretty.md) for details.
 
### csv2tsv
 
@@ -314,7 +355,7 @@ The `csv2tsv` converter often has a second benefit: regularizing newlines. CSV f
 
See [Comparing TSV and CSV formats](docs/comparing-tsv-and-csv.md) for more information on CSV escapes and other differences between CSV and TSV formats.
 
-There are many variations of CSV file format. See the [csv2tsv reference](docs/ToolReference.md#csv2tsv-reference) for details of the format variations supported by this tool.
+There are many variations of CSV file format. See the [csv2tsv reference](docs/tool_reference/csv2tsv.md) for details of the format variations supported by this tool.
 
### tsv-split
 
@@ -340,7 +381,7 @@ $ tsv-split data.txt --num-files 1000 --dir split_files
$ tsv-split data.tsv --num-files 1000 --key-fields 3 --dir split_files
```
 
-See the [tsv-split reference](docs/ToolReference.md#tsv-split-reference) for more information.
+See the [tsv-split reference](docs/tool_reference/tsv-split.md) for more information.
 
### tsv-append
 
@@ -352,7 +393,7 @@ Source tracking is useful when creating long/narrow form tabular data. This form
 
In this scenario, files have been used to capture related data sets, the difference between data sets being a condition represented by the file. For example, results from different variants of an experiment might each be recorded in their own files. Retaining the source file as an output column preserves the condition represented by the file. The source values default to the file names, but this can be customized.
-See the [tsv-append reference](docs/ToolReference.md#tsv-append-reference) for the complete list of options available.
+See the [tsv-append reference](docs/tool_reference/tsv-append.md) for the complete list of options available.
 
### number-lines
 
@@ -363,7 +404,7 @@ $ number-lines myfile.txt
 
Despite its original purpose as a code sample, `number-lines` turns out to be quite convenient. It is often useful to add a unique row ID to a file, and this tool does this in a manner that maintains proper TSV formatting.
 
-See the [number-lines reference](docs/ToolReference.md#number-lines-reference) for details.
+See the [number-lines reference](docs/tool_reference/number-lines.md) for details.
 
### keep-header
 
@@ -407,7 +448,7 @@ $ # script described on the "Tips and Tricks" page.
$ keep-header *.tsv -- tsv-sort-fast -k2,2n
```
 
-See the [keep-header reference](docs/ToolReference.md#keep-header-reference) for more information.
+See the [keep-header reference](docs/tool_reference/keep-header.md) for more information.
 
---
 
@@ -417,10 +458,10 @@ There are several ways to obtain the tools: [prebuilt binaries](#prebuilt-binari
 
### Prebuilt binaries
 
-Prebuilt binaries are available for Linux and Mac, these can be found on the [Github releases](https://github.com/eBay/tsv-utils/releases) page. Download and unpack the tar.gz file. Executables are in the `bin` directory. Add the `bin` directory or individual tools to the `PATH` environment variable. As an example, the 1.6.1 releases for Linux and MacOS can be downloaded and unpacked with these commands:
+Prebuilt binaries are available for Linux and Mac; these can be found on the [Github releases](https://github.com/eBay/tsv-utils/releases) page. Download and unpack the tar.gz file. Executables are in the `bin` directory. Add the `bin` directory or individual tools to the `PATH` environment variable. As an example, the 2.0.0 releases for Linux and MacOS can be downloaded and unpacked with these commands:
```
-$ curl -L https://github.com/eBay/tsv-utils/releases/download/v1.6.1/tsv-utils-v1.6.1_linux-x86_64_ldc2.tar.gz | tar xz
-$ curl -L https://github.com/eBay/tsv-utils/releases/download/v1.6.1/tsv-utils-v1.6.1_osx-x86_64_ldc2.tar.gz | tar xz
+$ curl -L https://github.com/eBay/tsv-utils/releases/download/v2.0.0/tsv-utils-v2.0.0_linux-x86_64_ldc2.tar.gz | tar xz
+$ curl -L https://github.com/eBay/tsv-utils/releases/download/v2.0.0/tsv-utils-v2.0.0_osx-x86_64_ldc2.tar.gz | tar xz
```
 
See the [Github releases](https://github.com/eBay/tsv-utils/releases) page for the latest release.
 
@@ -461,10 +502,10 @@ The above requires LDC 1.9.0 or later. See [Building with Link Time Optimization
 
### Install using DUB
 
-If you are a D user you likely use DUB, the D package manager. DUB comes packaged with DMD starting with DMD 2.072. You can install and build using DUB as follows (replace `1.6.1` with the current version):
+If you are a D user you likely use DUB, the D package manager. DUB comes packaged with DMD starting with DMD 2.072. You can install and build using DUB as follows (replace `2.0.0` with the current version):
```
$ dub fetch tsv-utils --cache=local
-$ cd tsv-utils-1.6.1/tsv-utils
+$ cd tsv-utils-2.0.0/tsv-utils
$ dub run # For LDC: dub run -- --compiler=ldc2
```
 
@@ -477,11 +518,3 @@ The applications can be built with LTO and PGO when source code is fetched by DU
 
### Setup customization
 
There are a number of simple ways to improve the utility of these tools; these are listed on the [Tips and tricks](docs/TipsAndTricks.md) page.
[Bash aliases](docs/TipsAndTricks.md#useful-bash-aliases), [Unix sort command customization](docs/TipsAndTricks.md#customize-the-unix-sort-command), and [bash completion](docs/TipsAndTricks.md#enable-bash-completion) are especially useful. - ---- - -## Upcoming feature: Named Fields - -Named field support is being added to the toolkit in an upcoming release. This can be previewed by building the master branch. Documentation is being written, the current versions can be found here: -* [README](README_v2.0.md) -* [Tools reference](docs/ToolReference_v2.0.md) diff --git a/README_v2.0.md b/README_v2.0.md deleted file mode 100644 index c6d67ad3..00000000 --- a/README_v2.0.md +++ /dev/null @@ -1,520 +0,0 @@ -# Command line utilities for tabular data files - -This is a set of command line utilities for manipulating large tabular data files. Files of numeric and text data commonly found in machine learning, data mining, and similar environments. Filtering, sampling, statistics, joins, and more. - -These tools are especially useful when working with large data sets. They run faster than other tools providing similar functionality, often by significant margins. See [Performance Studies](docs/Performance.md) for comparisons with other tools. - -File an [issue](https://github.com/eBay/tsv-utils/issues) if you have problems, questions or suggestions. - -**In this README:** -* [Tools overview](#tools-overview) - Toolkit introduction and descriptions of each tool. -* [Obtaining and installation](#obtaining-and-installation) - -**Additional documents:** -* [Tools Reference](docs/ToolReference.md) - Detailed documentation. -* [Releases](https://github.com/eBay/tsv-utils/releases) - Prebuilt binaries and release notes. - * New in version 2.0: Named field support! -* [Tips and tricks](docs/TipsAndTricks.md) - Simpler and faster command line tool use. -* [Performance Studies](docs/Performance.md) - Benchmarks against similar tools and other performance studies. -* [Comparing TSV and CSV formats](docs/comparing-tsv-and-csv.md) -* [Building with Link Time Optimization (LTO) and Profile Guided Optimization (PGO)](docs/BuildingWithLTO.md) -* [About the code](docs/AboutTheCode.md) (see also: [tsv-utils code documentation](https://tsv-utils.dpldocs.info/)) -* [Other toolkits](docs/OtherToolkits.md) - -**Talks and blog posts:** -* [Faster Command Line Tools in D](https://dlang.org/blog/2017/05/24/faster-command-line-tools-in-d/). May 24, 2017. A blog post showing a few ways to optimize performance in command line tools. Many of the ideas in the post were identified while developing the TSV Utilities. -* [Experimenting with Link Time Optimization](docs/dlang-meetup-14dec2017.pdf). Dec 14, 2017. A presentation at the [Silicon Valley D Meetup](https://www.meetup.com/D-Lang-Silicon-Valley/) describing experiments using LTO based on eBay's TSV Utilities. -* [Exploring D via Benchmarking of eBay's TSV Utilities](http://dconf.org/2018/talks/degenhardt.html). May 2, 2018. A presentation at [DConf 2018](http://dconf.org/2018/) describing performance benchmark studies conducted using eBay's TSV Utilities (slides [here](docs/dconf2018.pdf)). 
- -[![Travis](https://img.shields.io/travis/eBay/tsv-utils.svg)](https://travis-ci.org/eBay/tsv-utils) -[![Codecov](https://img.shields.io/codecov/c/github/eBay/tsv-utils.svg)](https://codecov.io/gh/eBay/tsv-utils) -[![GitHub release](https://img.shields.io/github/release/eBay/tsv-utils.svg)](https://github.com/eBay/tsv-utils/releases) -[![Github commits (since latest release)](https://img.shields.io/github/commits-since/eBay/tsv-utils/latest.svg)](https://github.com/eBay/tsv-utils/commits/master) -[![GitHub last commit](https://img.shields.io/github/last-commit/eBay/tsv-utils.svg)](https://github.com/eBay/tsv-utils/commits/master) -[![license](https://img.shields.io/github/license/eBay/tsv-utils.svg)](https://github.com/eBay/tsv-utils/blob/master/LICENSE.txt) - -## Tools overview - -These tools perform data manipulation and statistical calculations on tab delimited data. They are intended for large files. Larger than ideal for loading entirely in memory in an application like R, but not so big as to necessitate moving to Hadoop or similar distributed compute environments. The features supported are useful both for standalone analysis and for preparing data for use in R, Pandas, and similar toolkits. - -The tools work like traditional Unix command line utilities such as `cut`, `sort`, `grep` and `awk`, and are intended to complement these tools. Each tool is a standalone executable. They follow common Unix conventions for pipeline programs. Data is read from files or standard input, results are written to standard output. Fields are identified either by field name or field number. The field separator defaults to TAB, but any character can be used. Input and output is UTF-8, and all operations are Unicode ready, including regular expression match (`tsv-filter`). Documentation is available for each tool by invoking it with the `--help` option. TSV format is similar to CSV, see [Comparing TSV and CSV formats](docs/comparing-tsv-and-csv.md) for the differences. - -The rest of this section contains descriptions of each tool. Click on the links below to jump directly to one of the tools. Full documentation is available in the [Tools Reference](docs/ToolReference.md). The first tool listed, [tsv-filter](#tsv-filter), provides a tutorial introduction to features found throughout the toolkit. - -* [tsv-filter](#tsv-filter) - Filter lines using numeric, string and regular expression comparisons against individual fields. -* [tsv-select](#tsv-select) - Keep a subset of columns (fields). Like `cut`, but supporting named fields, field reordering, and field exclusions. -* [tsv-uniq](#tsv-uniq) - Filter out duplicate lines using either the full line or individual fields as a key. -* [tsv-summarize](#tsv-summarize) - Summary statistics on selected fields, against the full data set or grouped by key. -* [tsv-sample](#tsv-sample) - Sample input lines or randomize their order. A number of sampling methods are available. -* [tsv-join](#tsv-join) - Join lines from multiple files using fields as a key. -* [tsv-pretty](#tsv-pretty) - Print TSV data aligned for easier reading on the command-line. -* [csv2tsv](#csv2tsv) - Convert CSV files to TSV. -* [tsv-split](#tsv-split) - Split data into multiple files. Random splits, random splits by key, and splits by blocks of lines. -* [tsv-append](#tsv-append) - Concatenate TSV files. Header-aware; supports source file tracking. -* [number-lines](#number-lines) - Number the input lines. -* [keep-header](#keep-header) - Run a shell command in a header-aware fashion. 
- -### tsv-filter - -Filter lines by running tests against individual fields. Multiple tests can be specified in a single call. A variety of numeric and string comparison tests are available, including regular expressions. - -Consider a file having 4 fields: `id`, `color`, `year`, `count`. Using [tsv-pretty](#tsv-pretty) to view the first few lines: -``` -$ tsv-pretty data.tsv | head -n 5 - id color year count -100 green 1982 173 -101 red 1935 756 -102 red 2008 1303 -103 yellow 1873 180 -``` - -The following command finds all entries where 'year' (field 3) is 2008: -``` -$ tsv-filter -H --eq year:2008 data.tsv -``` - -The `-H` option indicates the first input line is a header. The `--eq` operator performs a numeric equality test. String comparisons are also available. The following command finds entries where 'color' (field 2) is "red": -``` -$ tsv-filter -H --str-eq color:red data.tsv -``` - -Fields can also be identified by field number, same as traditional Unix tools. This works for files with and without header lines. The following commands are equivalent to the previous two: -``` -$ tsv-filter -H --eq 3:2008 data.tsv -$ tsv-filter -H --str-eq 2:red data.tsv -``` - -Multiple tests can be specified. The following command finds `red` entries with `year` between 1850 and 1950: -``` -$ tsv-filter -H --str-eq color:red --ge year:1850 --lt year:1950 data.tsv -``` - -Viewing the first few results produced by this command: -``` -$ tsv-filter -H --str-eq color:red --ge year:1850 --lt year:1950 data.tsv | tsv-pretty | head -n 5 - id color year count -101 red 1935 756 -106 red 1883 1156 -111 red 1907 1792 -114 red 1931 1412 -``` - -Files can be placed anywhere on the command line. Data will be read from standard input if a file is not specified. The following commands are equivalent: -``` -$ tsv-filter -H --str-eq color:red --ge year:1850 --lt year:1950 data.tsv -$ tsv-filter data.tsv -H --str-eq color:red --ge year:1850 --lt year:1950 -$ cat data.tsv | tsv-filter -H --str-eq color:red --ge year:1850 --lt year:1950 -``` - -Multiple files can be provided. Only the header line from the first file will be kept when the `-H` option is used: -``` -$ tsv-filter -H data1.tsv data2.tsv data3.tsv --str-eq 2:red --ge 3:1850 --lt 3:1950 -$ tsv-filter -H *.tsv --str-eq 2:red --ge 3:1850 --lt 3:1950 -``` - -Numeric comparisons are among the most useful tests. Numeric operators include: -* Equality: `--eq`, `--ne` (equal, not-equal). -* Relational: `--lt`, `--le`, `--gt`, `--ge` (less-than, less-equal, greater-than, greater-equal). - -Several filters are available to help with invalid data. Assume there is a messier version of the 4-field file where some fields are not filled in. The following command will filter out all lines with an empty value in any of the four fields: -``` -$ tsv-filter -H messy.tsv --not-empty 1-4 -``` - -The above command uses a "field list" to run the test on each of fields 1-4. The test can be "inverted" to see the lines that were filtered out: -``` -$ tsv-filter -H messy.tsv --invert --not-empty 1-4 | head -n 5 | tsv-pretty - id color year count -116 1982 11 -118 yellow 143 -123 red 65 -126 79 -``` - -There are several filters for testing characteristics of numeric data. The most useful are: -* `--is-numeric` - Test if the data in a field can be interpreted as a number. -* `--is-finite` - Test if the data in a field can be interpreted as a number, but not NaN (not-a-number) or infinity. 
This is useful when working with data where floating point calculations may have produced NaN or infinity values. - -By default, all tests specified must be satisfied for a line to pass a filter. This can be changed using the `--or` option. For example, the following command finds records where 'count' (field 4) is less than 100 or greater than 1000: -``` -$ tsv-filter -H --or --lt 4:100 --gt 4:1000 data.tsv | head -n 5 | tsv-pretty - id color year count -102 red 2008 1303 -105 green 1982 16 -106 red 1883 1156 -107 white 1982 0 -``` - -A number of string and regular expression tests are available. These include: -* Equality: `--str-eq`, `--str-ne` -* Partial match: `--str-in-fld`, `--str-not-in-fld` -* Relational operators: `--str-lt`, `--str-gt`, etc. -* Case insensitive tests: `--istr-eq`, `--istr-in-fld`, etc. -* Regular expressions: `--regex`, `--not-regex`, etc. -* Field length: `--char-len-lt`, `--byte-len-gt`, etc. - -The earlier `--not-empty` example uses a "field list". Fields lists specify a set of fields and can be used with most operators. For example, the following command ensures that fields 1-3 and 7 are less-than 100: -``` -$ tsv-filter -H --lt 1-3,7:100 file.tsv -``` - -Field names can be used in field lists as well. The following command selects lines where both 'color' and 'count' fields are not empty: -``` -$ tsv-filter -H messy.tsv --not-empty color,count -``` - -Field names can be matched using wildcards. The previous command could also be written as: -``` -$ tsv-filter -H messy.tsv --not-empty 'co*' -``` - -The `co*` matches both the 'color' and 'count' fields. (Note: Single quotes are used to prevent the shell from interpreting the asterisk character.) - -All TSV Utilities tools use the same syntax for specifying fields. See [Field syntax](docs/tool_reference/common-options-and-behavior.md#field-syntax) in the [Tools Reference](docs/ToolReference.md) document for details. - -Bash completion is especially helpful with `tsv-filter`. It allows quickly seeing and selecting from the different operators available. See [bash completion](docs/TipsAndTricks.md#enable-bash-completion) on the [Tips and tricks](docs/TipsAndTricks.md) page for setup information. - -`tsv-filter` is perhaps the most broadly applicable of the TSV Utilities tools, as dataset pruning is such a common task. It is stream oriented, so it can handle arbitrarily large files. It is fast, quite a bit faster than other tools the author has tried. (See the "Numeric row filter" and "Regular expression row filter" tests in the [2018 Benchmark Summary](docs/Performance.md#2018-benchmark-summary).) - -This makes `tsv-filter` ideal for preparing data for applications like R and Pandas. It is also convenient for quickly answering simple questions about a dataset. For example, to count the number of records with a non-zero value in field 4, use the command: -``` -$ tsv-filter --ne 4:0 file.tsv | wc -l -``` - -See the [tsv-filter reference](docs/tool_reference/tsv-filter.md) for more details and the full list of operators. - -### tsv-select - -A version of the Unix `cut` utility with the ability to select fields by name, drop fields, and reorder fields. The following command writes the `date` and `time` fields from a pair of files to standard output: -``` -$ tsv-select -H -f date,time file1.tsv file2.tsv -``` -Fields can also be selected by field number: -``` -$ tsv-select -f 4,2,9-11 file1.tsv file2.tsv -``` - -Fields can be listed more than once, and fields not specified can be selected as a group using `--r|rest`. 
Fields can be dropped using `--e|exclude`. - -The `--H|header` option turns on header processing. This enables specifying fields by name. Only the header from the first file is retained when multiple input files are provided. - -Examples: -``` -$ # Output fields 2 and 1, in that order. -$ tsv-select -f 2,1 data.tsv - -$ # Output the 'Name' and 'RecordNum' fields. -$ tsv-select -H -f Name,RecordNum data.tsv. - -$ # Drop the first field, keep everything else. -$ tsv-select --exclude 1 file.tsv - -$ # Drop the 'Color' field, keep everything else. -$ tsv-select -H --exclude Color file.tsv - -$ # Move the 'RecordNum' field to the start of the line. -$ tsv-select -H -f RecordNum --rest last data.tsv - -$ # Move field 1 to the end of the line. -$ tsv-select -f 1 --rest first data.tsv - -$ # Output a range of fields in reverse order. -$ tsv-select -f 30-3 data.tsv - -$ # Drop all the fields ending in '_time' -$ tsv-select -H -e '*_time' data.tsv - -$ # Multiple files with header lines. Keep only one header. -$ tsv-select data*.tsv -H --fields 1,2,4-7,14 -``` - -See the [tsv-select reference](docs/tool_reference/tsv-select.md) for details on `tsv-select`. See [Field syntax](docs/tool_reference/common-options-and-behavior.md#field-syntax) for more information on selecting fields by name. - -### tsv-uniq - -Similar in spirit to the Unix `uniq` tool, `tsv-uniq` filters a dataset so there is only one copy of each unique line. `tsv-uniq` goes beyond Unix `uniq` in a couple ways. First, data does not need to be sorted. Second, equivalence can be based on a subset of fields rather than the full line. - -`tsv-uniq` can also be run in 'equivalence class identification' mode, where lines with equivalent keys are marked with a unique id rather than filtered out. Another variant is 'number' mode, which generates lines numbers grouped by the key. - -`tsv-uniq` operates on the entire line when no fields are specified. This is a useful alternative to the traditional `sort -u` or `sort | uniq` paradigms for identifying unique lines in unsorted files, as it is quite a bit faster, especially when there are many duplicate lines. As a bonus, order of the input lines is retained. - -Examples: -``` -$ # Unique a file based on the full line. -$ tsv-uniq data.tsv - -$ # Unique a file with fields 2 and 3 as the key. -$ tsv-uniq -f 2,3 data.tsv - -$ # Unique a file using the 'RecordID' field as the key. -$ tsv-uniq -H -f RecordID data.tsv -``` - -An in-memory lookup table is used to record unique entries. This ultimately limits the data sizes that can be processed. The author has found that datasets with up to about 10 million unique entries work fine, but performance starts to degrade after that. Even then it remains faster than the alternatives. - -See the [tsv-uniq reference](docs/tool_reference/tsv-uniq.md) for details. - -### tsv-summarize - -`tsv-summarize` performs statistical calculations on fields. For example, generating the sum or median of a field's values. Calculations can be run across the entire input or can be grouped by key fields. Consider the file `data.tsv`: -``` -color weight -red 6 -red 5 -blue 15 -red 4 -blue 10 -``` -Calculations of the sum and mean of the `weight` column is shown below. The first command runs calculations on all values. The second groups them by color. 
-``` -$ tsv-summarize --header --sum weight --mean weight data.tsv -weight_sum weight_mean -40 8 - -$ tsv-summarize --header --group-by color --sum weight --mean color data.tsv -color weight_sum weight_mean -red 15 5 -blue 25 12.5 -``` - -Multiple fields can be used as the `--group-by` key. The file's sort order does not matter, there is no need to sort in the `--group-by` order first. Fields can be specified either by name or field number, like other tsv-utils tools. - -See the [tsv-summarize reference](docs/tool_reference/tsv-summarize.md) for the list of statistical and other aggregation operations available. - -### tsv-sample - -`tsv-sample` randomizes line order (shuffling) or selects random subsets of lines (sampling) from input data. Several methods are available, including shuffling, simple random sampling, weighted random sampling, Bernoulli sampling, and distinct sampling. Data can be read from files or standard input. These sampling methods are made available through several modes of operation: - -* Shuffling - The default mode of operation. All lines are read in and written out in random order. All orderings are equally likely. -* Simple random sampling (`--n|num N`) - A random sample of `N` lines are selected and written out in random order. The `--i|inorder` option preserves the original input order. -* Weighted random sampling (`--n|num N`, `--w|weight-field F`) - A weighted random sample of N lines are selected using weights from a field on each line. Output is in weighted selection order unless the `--i|inorder` option is used. Omitting `--n|num` outputs all lines in weighted selection order (weighted shuffling). -* Sampling with replacement (`--r|replace`, `--n|num N`) - All lines are read in, then lines are randomly selected one at a time and written out. Lines can be selected multiple times. Output continues until `N` samples have been output. -* Bernoulli sampling (`--p|prob P`) - A streaming form of sampling. Lines are read one at a time and selected for output using probability `P`. e.g. `-p 0.1` specifies that 10% of lines should be included in the sample. -* Distinct sampling (`--k|key-fields F`, `--p|prob P`) - Another streaming form of sampling. However, instead of each line being subject to an independent selection choice, lines are selected based on a key contained in each line. A portion of keys are randomly selected for output, with probability P. Every line containing a selected key is included in the output. Consider a query log with records consisting of triples. It may be desirable to sample records for one percent of the users, but include all records for the selected users. - -`tsv-sample` is designed for large data sets. Streaming algorithms make immediate decisions on each line. They do not accumulate memory and can run on infinite length input streams. Both shuffling and sampling with replacement read in the entire dataset and are limited by available memory. Simple and weighted random sampling use reservoir sampling and only need to hold the specified sample size (`--n|num`) in memory. By default, a new random order is generated every run, but options are available for using the same randomization order over multiple runs. The random values assigned to each line can be printed, either to observe the behavior or to run custom algorithms on the results. - -See the [tsv-sample reference](docs/tool_reference/tsv-sample.md) for further details. - -### tsv-join - -Joins lines from multiple files based on a common key. 
One file, the 'filter' file, contains the records (lines) being matched. The other input files are scanned for matching records. Matching records are written to standard output, along with any designated fields from the filter file. In database parlance this is a hash semi-join. This is similar to the "stream-static" joins available in Spark Structured Streaming and "KStream-KTable" joins in Kafka. (The filter file plays the same role as the Spark static dataset or Kafka KTable.) - -Example: -``` -$ tsv-join -H --filter-file filter.tsv --key-fields Country,City --append-fields Population,Elevation data.tsv -``` - -This reads `filter.tsv`, creating a lookup table keyed on the fields `Country` and `City` fields. `data.tsv` is read, lines with a matching key are written to standard output with the `Population` and `Elevation` fields from `filter.tsv` appended. This is an inner join. Left outer joins and anti-joins are also supported. - -Common uses for `tsv-join` are to join related datasets or to filter one dataset based on another. Filter file entries are kept in memory, this limits the ultimate size that can be handled effectively. The author has found that filter files up to about 10 million lines are processed effectively, but performance starts to degrade after that. - -See the [tsv-join reference](docs/tool_reference/tsv-join.md) for details. - -### tsv-pretty - -tsv-pretty prints TSV data in an aligned format for better readability when working on the command-line. Text columns are left aligned, numeric columns are right aligned. Floats are aligned on the decimal point and precision can be specified. Header lines are detected automatically. If desired, the header line can be repeated at regular intervals. An example, first printed without formatting: -``` -$ cat sample.tsv -Color Count Ht Wt -Brown 106 202.2 1.5 -Canary Yellow 7 106 0.761 -Chartreuse 1139 77.02 6.22 -Fluorescent Orange 422 1141.7 7.921 -Grey 19 140.3 1.03 -``` -Now with `tsv-pretty`, using header underlining and float formatting: -``` -$ tsv-pretty -u -f sample.tsv -Color Count Ht Wt ------ ----- -- -- -Brown 106 202.20 1.500 -Canary Yellow 7 106.00 0.761 -Chartreuse 1139 77.02 6.220 -Fluorescent Orange 422 1141.70 7.921 -Grey 19 140.30 1.030 -``` -See the [tsv-pretty reference](docs/tool_reference/tsv-pretty.md) for details. - -### csv2tsv - -`csv2tsv` does what you expect: convert CSV data to TSV. Example: -``` -$ csv2tsv data.csv > data.tsv -``` - -A strict delimited format like TSV has many advantages for data processing over an escape oriented format like CSV. However, CSV is a very popular data interchange format and the default export format for many database and spreadsheet programs. Converting CSV files to TSV allows them to be processed reliably by both this toolkit and standard Unix utilities like `awk` and `sort`. - -Note that many CSV files do not use escapes, and in-fact follow a strict delimited format using comma as the delimiter. Such files can be processed reliably by this toolkit and Unix tools by specifying the delimiter character. However, when there is doubt, using a `csv2tsv` converter adds reliability. - -`csv2tsv` differs from many csv-to-tsv conversion tools in that it produces output free of CSV escapes. Many conversion tools produce data with CSV style escapes, but switching the field delimiter from comma to TAB. Such data cannot be reliably processed by Unix tools like `cut`, `awk`, `sort`, etc. 
- -`csv2tsv` avoids escapes by replacing TAB and newline characters in the data with a single space. These characters are rare in data mining scenarios, and space is usually a good substitute in cases where they do occur. The replacement string is customizable to enable alternate handling when needed. - -The `csv2tsv` converter often has a second benefit: regularizing newlines. CSV files are often exported using Windows newline conventions. `csv2tsv` converts all newlines to Unix format. - -See [Comparing TSV and CSV formats](docs/comparing-tsv-and-csv.md) for more information on CSV escapes and other differences between CSV and TSV formats. - -There are many variations of CSV file format. See the [csv2tsv reference](docs/tool_reference/csv2tsv.md) for details of the format variations supported by this tool. - -### tsv-split - -`tsv-split` is used to split one or more input files into multiple output files. There are three modes of operation: -* Fixed number of lines per file (`--l|lines-per-file NUM`): Each input block of NUM lines is written to a new file. This is similar to the Unix `split` utility. - -* Random assignment (`--n|num-files NUM`): Each input line is written to a randomly selected output file. Random selection is from NUM files. - -* Random assignment by key (`--n|num-files NUM, --k|key-fields FIELDS`): Input lines are written to output files using fields as a key. Each unique key is randomly assigned to one of NUM output files. All lines with the same key are written to the same file. - -By default, files are written to the current directory and have names of the form `part_NNN`, with `NNN` being a number and `` being the extension of the first input file. If the input file is `file.txt`, the names will take the form `part_NNN.txt`. The output directory and file names are customizable. - -Examples: -``` -$ # Split a file into files of 10,000 lines each. Output files -$ # are written to the 'split_files/' directory. -$ tsv-split data.txt --lines-per-file 10000 --dir split_files - -$ # Split a file into 1000 files with lines randomly assigned. -$ tsv-split data.txt --num-files 1000 --dir split_files - -# Randomly assign lines to 1000 files using field 3 as a key. -$ tsv-split data.tsv --num-files 1000 -key-fields 3 --dir split_files -``` - -See the [tsv-split reference](docs/tool_reference/tsv-split.md) for more information. - -### tsv-append - -`tsv-append` concatenates multiple TSV files, similar to the Unix `cat` utility. It is header-aware, writing the header from only the first file. It also supports source tracking, adding a column indicating the original file to each row. - -Concatenation with header support is useful when preparing data for traditional Unix utilities like `sort` and `sed` or applications that read a single file. - -Source tracking is useful when creating long/narrow form tabular data. This format is used by many statistics and data mining packages. (See [Wide & Long Data - Stanford University](https://stanford.edu/~ejdemyr/r-tutorials/wide-and-long/) or Hadley Wickham's [Tidy data](http://vita.had.co.nz/papers/tidy-data.html) for more info.) - -In this scenario, files have been used to capture related data sets, the difference between data sets being a condition represented by the file. For example, results from different variants of an experiment might each be recorded in their own files. Retaining the source file as an output column preserves the condition represented by the file. 
The source values default to the file names, but this can be customized. - -See the [tsv-append reference](docs/tool_reference/tsv-append.md) for the complete list of options available. - -### number-lines - -A simpler version of the Unix `nl` program. It prepends a line number to each line read from files or standard input. This tool was written primarily as an example of a simple command line tool. The code structure it uses is the same as followed by all the other tools. Example: -``` -$ number-lines myfile.txt -``` - -Despite it's original purpose as a code sample, `number-lines` turns out to be quite convenient. It is often useful to add a unique row ID to a file, and this tool does this in a manner that maintains proper TSV formatting. - -See the [number-lines reference](docs/tool_reference/number-lines.md) for details. - -### keep-header - -A convenience utility that runs Unix commands in a header-aware fashion. It is especially useful with `sort`. `sort` does not know about headers, so the header line ends up wherever it falls in the sort order. Using `keep-header`, the header line is output first and the rest of the sorted file follows. For example: -``` -$ # Sort a file, keeping the header line at the top. -$ keep-header myfile.txt -- sort -``` - -The command to run is placed after the double dash (`--`). Everything after the initial double dash is part of the command. For example, `sort --ignore-case` is run as follows: -``` -$ # Case-insensitive sort, keeping the header line at the top. -$ keep-header myfile.txt -- sort --ignore-case -``` - -Multiple files can be provided, only the header from the first is retained. For example: - -``` -$ # Sort a set of files in reverse order, keeping only one header line. -$ keep-header *.txt -- sort -r -``` - -`keep-header` is especially useful for commands like `sort` and `shuf` that reorder input lines. It is also useful with filtering commands like `grep`, many `awk` uses, and even `tail`, where the header should be retained without filtering or evaluation. - -Examples: -``` -$ # 'grep' a file, keeping the header line without needing to match it. -$ keep-header file.txt -- grep 'some text' - -$ # Print the last 10 lines of a file, but keep the header line -$ keep-header file.txt -- tail - -$ # Print lines 100-149 of a file, plus the header -$ keep-header file.txt -- tail -n +100 | head -n 51 - -$ # Sort a set of TSV files numerically on field 2, keeping one header. -$ keep-header *.tsv -- sort -t $'\t' -k2,2n - -$ # Same as the previous example, but using the 'tsv-sort-fast' bash -$ # script described on the "Tips and Tricks" page. -$ keep-header *.tsv -- tsv-sort-fast -k2,2n -``` - -See the [keep-header reference](docs/tool_reference/keep-header.md) for more information. - ---- - -## Obtaining and installation - -There are several ways to obtain the tools: [prebuilt binaries](#prebuilt-binaries); [building from source code](#build-from-source-files); and [installing using the DUB package manager](#install-using-dub). The tools have been tested on Linux and Mac OS X. They have not been tested on Windows, but there are no obvious impediments to running on Windows as well. - -### Prebuilt binaries - -Prebuilt binaries are available for Linux and Mac, these can be found on the [Github releases](https://github.com/eBay/tsv-utils/releases) page. Download and unpack the tar.gz file. Executables are in the `bin` directory. Add the `bin` directory or individual tools to the `PATH` environment variable. 
As an example, the 1.6.1 releases for Linux and MacOS can be downloaded and unpacked with these commands: -``` -$ curl -L https://github.com/eBay/tsv-utils/releases/download/v1.6.1/tsv-utils-v1.6.1_linux-x86_64_ldc2.tar.gz | tar xz -$ curl -L https://github.com/eBay/tsv-utils/releases/download/v1.6.1/tsv-utils-v1.6.1_osx-x86_64_ldc2.tar.gz | tar xz -``` - -See the [Github releases](https://github.com/eBay/tsv-utils/releases) page for the latest release. - -For some distributions a package can directly be installed: - -| Distribution | Command | -| ------------ | --------------------- | -| Arch Linux | `pacaur -S tsv-utils` (see [`tsv-utils`](https://aur.archlinux.org/packages/tsv-utils/)) - -*Note: The distributions above are not updated as frequently as the [Github releases](https://github.com/eBay/tsv-utils/releases) page.* - -### Build from source files - -[Download a D compiler](https://dlang.org/download.html). These tools have been tested with the DMD and LDC compilers, on Mac OSX and Linux. Use DMD version 2.088.1 or later, LDC version 1.18.0 or later. - -Clone this repository, select a compiler, and run `make` from the top level directory: -``` -$ git clone https://github.com/eBay/tsv-utils.git -$ cd tsv-utils -$ make # For LDC: make DCOMPILER=ldc2 -``` - -Executables are written to `tsv-utils/bin`, place this directory or the executables in the PATH. The compiler defaults to DMD, this can be changed on the make command line (e.g. `make DCOMPILER=ldc2`). DMD is the reference compiler, but LDC produces faster executables. (For some tools LDC is quite a bit faster than DMD.) - -The makefile supports other typical development tasks such as unit tests and code coverage reports. See [Building and makefile](docs/AboutTheCode.md#building-and-makefile) for more details. - -For fastest performance, use LDC with Link Time Optimization (LTO) and Profile Guided Optimization (PGO) enabled: -``` -$ git clone https://github.com/eBay/tsv-utils.git -$ cd tsv-utils -$ make DCOMPILER=ldc2 LDC_LTO_RUNTIME=1 LDC_PGO=2 -$ # Run the test suite -$ make test-nobuild DCOMPILER=ldc2 -``` - -The above requires LDC 1.9.0 or later. See [Building with Link Time Optimization](docs/BuildingWithLTO.md) for more information. The prebuilt binaries are built using LTO and PGO, but these must be explicitly enabled when building from source. LTO and PGO are still early stage technologies, issues may surface in some system configurations. Running the test suite (shown above) is a good way to detect issues that may arise. - -### Install using DUB - -If you are a D user you likely use DUB, the D package manager. DUB comes packaged with DMD starting with DMD 2.072. You can install and build using DUB as follows (replace `1.6.1` with the current version): -``` -$ dub fetch tsv-utils --cache=local -$ cd tsv-utils-1.6.1/tsv-utils -$ dub run # For LDC: dub run -- --compiler=ldc2 -``` - -The `dub run` command compiles all the tools. The executables are written to `tsv-utils/bin`. Add this directory or individual executables to the PATH. - -See [Building and makefile](docs/AboutTheCode.md#building-and-makefile) for more information about the DUB setup. - -The applications can be built with LTO and PGO when source code is fetched by DUB. However, the DUB build system does not support this. `make` must be used instead. See [Building with Link Time Optimization](docs/BuildingWithLTO.md). 
- -### Setup customization - -There are a number of simple ways to improve the utility of these tools; these are listed on the [Tips and tricks](docs/TipsAndTricks.md) page. [Bash aliases](docs/TipsAndTricks.md#useful-bash-aliases), [Unix sort command customization](docs/TipsAndTricks.md#customize-the-unix-sort-command), and [bash completion](docs/TipsAndTricks.md#enable-bash-completion) are especially useful. diff --git a/docs/ToolReference.md b/docs/ToolReference.md index 99c2514c..f18ba1e8 100644 --- a/docs/ToolReference.md +++ b/docs/ToolReference.md @@ -1,1003 +1,23 @@ -_Visit the [main page](../README.md)_ +_Visit the [TSV Utilities main page](../README.md)_ -# Tool reference +# Tools Reference -This page provides detailed documentation about the different tools as well as examples. Material for the individual tools is also available via the `--help` option. +The TSV Utilities Tools Reference provides detailed documentation about each tool. Each tool has its own page, available through the links below. The [Common options and behavior](tool_reference/common-options-and-behavior.md) page provides information about features and options common to all the tools. -* [Common options and behavior](#common-options-and-behavior) -* [csv2tsv](#csv2tsv-reference) -* [keep-header](#keep-header-reference) -* [number-lines](#number-lines-reference) -* [tsv-append](#tsv-append-reference) -* [tsv-filter](#tsv-filter-reference) -* [tsv-join](#tsv-join-reference) -* [tsv-pretty](#tsv-pretty-reference) -* [tsv-sample](#tsv-sample-reference) -* [tsv-select](#tsv-select-reference) -* [tsv-split](#tsv-split-reference) -* [tsv-summarize](#tsv-summarize-reference) -* [tsv-uniq](#tsv-uniq-reference) +Documentation for individual tools is also available via the `--help` option available on every tool. -___ +* [Common options and behavior](tool_reference/common-options-and-behavior.md) +* [csv2tsv](tool_reference/csv2tsv.md) +* [keep-header](tool_reference/keep-header.md) +* [number-lines](tool_reference/number-lines.md) +* [tsv-append](tool_reference/tsv-append.md) +* [tsv-filter](tool_reference/tsv-filter.md) +* [tsv-join](tool_reference/tsv-join.md) +* [tsv-pretty](tool_reference/tsv-pretty.md) +* [tsv-sample](tool_reference/tsv-sample.md) +* [tsv-select](tool_reference/tsv-select.md) +* [tsv-split](tool_reference/tsv-split.md) +* [tsv-summarize](tool_reference/tsv-summarize.md) +* [tsv-uniq](tool_reference/tsv-uniq.md) -## Common options and behavior - -Information in this section applies to all the tools. - -### Specifying options - -Multi-letter options are specified with a double dash. Single letter options can be specified with a single dash or double dash. For example: -``` -$ tsv-select -f 1,2 # Valid -$ tsv-select --f 1,2 # Valid -$ tsv-select --fields 1,2 # Valid -$ tsv-select -fields 1,2 # Invalid. -``` - -### Help (`-h`, `--help`, `--help-verbose`) - -All tools print help if given the `-h` or `--help` option. Many provide more detail via the `--help-verbose` option. - -### Field numbers and field-lists. - -Field numbers are one-upped integers, following Unix conventions. Some tools use zero to represent the entire line (`tsv-join`, `tsv-uniq`). - -In many cases multiple fields can be entered as a "field-list". A field-list is a comma separated list of field numbers or field ranges.
For example: - -``` -$ tsv-select -f 3 # Field 3 -$ tsv-select -f 3,5 # Fields 3 and 5 -$ tsv-select -f 3-5 # Fields 3, 4, 5 -$ tsv-select -f 1,3-5 # Fields 1, 3, 4, 5 -``` - -Most tools process or output fields in the order listed, and repeated use is usually fine: -``` -$ tsv-select -f 5-1 # Fields 5, 4, 3, 2, 1 -$ tsv-select -f 1-3,2,1 # Fields 1, 2, 3, 2, 1 -``` - -### UTF-8 input - -These tools assume data is utf-8 encoded. - -### Line endings - -These tools have been tested on Unix platforms, including macOS, but not Windows. On Unix platforms, Unix line endings (`\n`) are expected, with the notable exception of `tsv2csv`. Not all the tools are affected by DOS and Windows line endings (`\r\n`), those that are check the first line and flag an error. `csv2tsv` explicitly handles DOS and Windows line endings, converting to Unix line endings as part of the conversion. - -The `dos2unix` tool can be used to convert Windows line endings to Unix format. See [Convert newline format and character encoding with dos2unix and iconv](TipsAndTricks.md#convert-newline-format-and-character-encoding-with-dos2unix-and-iconv) - -The tools were written to respect platform line endings. If built on Windows, then Windows line endings. However, given the lack of testing, a Windows build should be expected to have some issues with line endings. - -### File format and alternate delimiters (`--delimiter`) - -Any character can be used as a delimiter, TAB is the default. However, there is no escaping for including the delimiter character or newlines within a field. This differs from CSV file format which provides an escaping mechanism. In practice the lack of an escaping mechanism is not a meaningful limitation for data oriented files. - -Aside from a header line, all lines are expected to have data. There is no comment mechanism and no special handling for blank lines. Tools taking field indices as arguments expect the specified fields to be available on every line. - -### Headers (`-H`, `--header`) - -Most tools handle the first line of files as a header when given the `-H` or `--header` option. For example, `tsv-filter` passes the header through without filtering it. When `--header` is used, all files and stdin are assumed to have header lines. Only one header line is written to stdout. If multiple files are being processed, header lines from subsequent files are discarded. - -### Multiple files and standard input - -Tools can read from any number of files and from standard input. As per typical Unix behavior, a single dash represents standard input when included in a list of files. Terminate non-file arguments with a double dash (`--`) when using a single dash in this fashion. Example: -``` -$ head -n 1000 file-c.tsv | tsv-filter --eq 2:1000 -- file-a.tsv file-b.tsv - > out.tsv -``` - -The above passes `file-a.tsv`, `file-b.tsv`, and the first 1000 lines of `file-c.tsv` to `tsv-filter` and write the results to `out.tsv`. - ---- - -## csv2tsv reference - -**Synopsis:** csv2tsv [options] [file...] - -csv2tsv converts CSV (comma-separated) text to TSV (tab-separated) format. Records are read from files or standard input, converted records are written to standard output. - -Both formats represent tabular data, each record on its own line, fields separated by a delimiter character. The key difference is that CSV uses escape sequences to represent newlines and field separators in the data, whereas TSV disallows these characters in the data. 
The most common field delimiters are comma for CSV and tab for TSV, but any character can be used. See [Comparing TSV and CSV formats](comparing-tsv-and-csv.md) for additional discussion of the formats. - -Conversion to TSV is done by removing CSV escape syntax, changing field delimiters, and replacing newlines and tabs in the data. By default, newlines and tabs in the data are replaced by spaces. Most details are customizable. - -There is no single spec for CSV; any number of variants can be found. The escape syntax is common enough: fields containing newlines or field delimiters are placed in double quotes. Inside a quoted field, a double quote is represented by a pair of double quotes. As with field separators, the quoting character is customizable. - -Behaviors of this program that often vary between CSV implementations: -* Newlines are supported in quoted fields. -* Double quotes are permitted in a non-quoted field. However, a field starting with a quote must follow quoting rules. -* Each record can have a different number of fields. -* The three common forms of newlines are supported: CR, CRLF, LF. -* A newline will be added if the file does not end with one. -* No whitespace trimming is done. - -This program does not validate CSV correctness, but will terminate with an error upon reaching an inconsistent state. Improperly terminated quoted fields are the primary cause. - -UTF-8 input is assumed. Convert other encodings prior to invoking this tool. - -**Options:** -* `--h|help` - Print help. -* `--help-verbose` - Print detailed help. -* `--V|version` - Print version information and exit. -* `--H|header` - Treat the first line of each file as a header. Only the header of the first file is output. -* `--q|quote CHR` - Quoting character in CSV data. Default: double-quote (") -* `--c|csv-delim CHR` - Field delimiter in CSV data. Default: comma (,). -* `--t|tsv-delim CHR` - Field delimiter in TSV data. Default: TAB -* `--r|replacement STR` - Replacement for newline and TSV field delimiters found in CSV input. Default: Space. - ---- - -## keep-header reference - -**Synopsis:** keep-header [file...] \-- program [args] - -Execute a command against one or more files in a header-aware fashion. The first line of each file is assumed to be a header. The first header is output unchanged. Remaining lines are sent to the given command via standard input, excluding the header lines of subsequent files. Output from the command is appended to the initial header line. A double dash (\--) delimits the command, similar to how the pipe operator (\|) delimits commands. - -The following commands sort files in the usual way, except for retaining a single header line: -``` -$ keep-header file1.txt -- sort -$ keep-header file1.txt file2.txt -- sort -k1,1nr -``` - -Data can also be read from standard input. For example: -``` -$ cat file1.txt | keep-header -- sort -$ keep-header file1.txt -- sort -r | keep-header -- grep red -``` - -The last example can be simplified using a shell command: -``` -$ keep-header file1.txt -- /bin/sh -c '(sort -r | grep red)' -``` - -`keep-header` is especially useful for commands like `sort` and `shuf` that reorder input lines. It is also useful with filtering commands like `grep`, many `awk` uses, and even `tail`, where the header should be retained without filtering or evaluation. - -`keep-header` works on any file where the first line is delimited by a newline character. This includes all TSV files and the majority of CSV files.
It won't work on CSV files having embedded newlines in the header. - -**Options:** -* `--h|help` - Print help. -* `--V|version` - Print version information and exit. - ---- - -## number-lines reference - -**Synopsis:** number-lines [options] [file...] - -number-lines reads from files or standard input and writes each line to standard output preceded by a line number. It is a simplified version of the Unix `nl` program. It supports one feature `nl` does not: the ability to treat the first line of files as a header. This is useful when working with tab-separated-value files. If header processing is used, a header line is written for the first file, and the header lines are dropped from any subsequent files. - -**Options:** -* `--h|help` - Print help. -* `--V|version` - Print version information and exit. -* `--H|header` - Treat the first line of each file as a header. The first input file's header is output; subsequent file headers are discarded. -* `--s|header-string STR` - String to use as the header for the line number field. Implies `--header`. Default: 'line'. -* `--n|start-number NUM` - Number to use for the first line. Default: 1. -* `--d|delimiter CHR` - Character appended to line number, preceding the rest of the line. Default: TAB (Single byte UTF-8 characters only.) - -**Examples:** -``` -$ # Number lines in a file -$ number-lines file.tsv - -$ # Number lines from multiple files. Treat the first line of each file -$ # as a header. -$ number-lines --header data*.tsv -``` - -**See Also:** - -* [tsv-uniq](#tsv-uniq-reference) supports numbering lines grouped by key. - ---- - -## tsv-append reference - -**Synopsis:** tsv-append [options] [file...] - -tsv-append concatenates multiple TSV files, similar to the Unix `cat` utility. Unlike `cat`, it is header-aware (`--H|header`), writing the header from only the first file. It also supports source tracking, adding a column indicating the original file to each row. Results are written to standard output. - -Concatenation with header support is useful when preparing data for traditional Unix utilities like `sort` and `sed` or applications that read a single file. - -Source tracking is useful when creating long/narrow form tabular data, a format used by many statistics and data mining packages. In this scenario, files have been used to capture related data sets, the difference between data sets being a condition represented by the file. For example, results from different variants of an experiment might each be recorded in their own files. Retaining the source file as an output column preserves the condition represented by the file. - -The file-name (without extension) is used as the source value. This can be customized using the `--f|file` option. - -Example: Header processing: -``` -$ tsv-append -H file1.tsv file2.tsv file3.tsv -``` - -Example: Header processing and source tracking: -``` -$ tsv-append -H -t file1.tsv file2.tsv file3.tsv -``` - -Example: Source tracking with custom source values: -``` -$ tsv-append -H -s test_id -f test1=file1.tsv -f test2=file2.tsv - ``` - -**Options:** -* `--h|help` - Print help. -* `--help-verbose` - Print detailed help. -* `--V|version` - Print version information and exit. -* `--H|header` - Treat the first line of each file as a header. -* `--t|track-source` - Track the source file. Adds a column with the source name. -* `--s|source-header STR` - Use STR as the header for the source column. Implies `--H|header` and `--t|track-source`.
Default: 'file' -* `--f|file STR=FILE` - Read file FILE, using STR as the 'source' value. Implies `--t|track-source`. -* `--d|delimiter CHR` - Field delimiter. Default: TAB. (Single byte UTF-8 characters only.) - ---- - -## tsv-filter reference - -_Note: See the [tsv-filter](../README.md#tsv-filter) description in the project [README](../README.md) for a tutorial style introduction._ - -**Synopsis:** tsv-filter [options] [file...] - -Filter lines of tab-delimited files via comparison tests against fields. Multiple tests can be specified, by default they are evaluated as AND clause. Lines satisfying the tests are written to standard output. - -**General options:** -* `--help` - Print help. -* `--help-verbose` - Print detailed help. -* `--help-options` - Print the options list by itself. -* `--V|version` - Print version information and exit. -* `--H|header` - Treat the first line of each file as a header. -* `--d|delimiter CHR` - Field delimiter. Default: TAB. (Single byte UTF-8 characters only.) -* `--or` - Evaluate tests as an OR rather than an AND. This applies globally. -* `--v|invert` - Invert the filter, printing lines that do not match. This applies globally. - -**Tests:** - -Empty and blank field tests: -* `--empty ` - True if field is empty (no characters) -* `--not-empty ` - True if field is not empty. -* `--blank ` - True if field is empty or all whitespace. -* `--not-blank ` - True if field contains a non-whitespace character. - -Numeric type tests: -* `--is-numeric ` - True if the field can be interpreted as a number. -* `--is-finite ` - True if the field can be interpreted as a number, and it is not NaN or infinity. -* `--is-nan ` - True if the field is NaN (including: "nan", "NaN", "NAN"). -* `--is-infinity ` - True if the field is infinity (including: "inf", "INF", "-inf", "-INF") - -Numeric comparisons: -* `--le :NUM` - FIELD <= NUM (numeric). -* `--lt :NUM` - FIELD < NUM (numeric). -* `--ge :NUM` - FIELD >= NUM (numeric). -* `--gt :NUM` - FIELD > NUM (numeric). -* `--eq :NUM` - FIELD == NUM (numeric). -* `--ne :NUM` - FIELD != NUM (numeric). - -String comparisons: -* `--str-le :STR` - FIELD <= STR (string). -* `--str-lt :STR` - FIELD < STR (string). -* `--str-ge :STR` - FIELD >= STR (string). -* `--str-gt :STR` - FIELD > STR (string). -* `--str-eq :STR` - FIELD == STR (string). -* `--istr-eq :STR` - FIELD == STR (string, case-insensitive). -* `--str-ne :STR` - FIELD != STR (string). -* `--istr-ne :STR` - FIELD != STR (string, case-insensitive). -* `--str-in-fld :STR` - FIELD contains STR (substring search). -* `--istr-in-fld :STR` - FIELD contains STR (substring search, case-insensitive). -* `--str-not-in-fld :STR` - FIELD does not contain STR (substring search). -* `--istr-not-in-fld :STR` - FIELD does not contain STR (substring search, case-insensitive). - -Regular expression tests: -* `--regex :REGEX` - FIELD matches regular expression. -* `--iregex :REGEX` - FIELD matches regular expression, case-insensitive. -* `--not-regex :REGEX` - FIELD does not match regular expression. -* `--not-iregex :REGEX` - FIELD does not match regular expression, case-insensitive. - -Field length tests -* `--char-len-le :NUM` - FIELD character length <= NUM. -* `--char-len-lt :NUM` - FIELD character length < NUM. -* `--char-len-ge :NUM` - FIELD character length >= NUM. -* `--char-len-gt :NUM` - FIELD character length > NUM. -* `--char-len-eq :NUM` - FIELD character length == NUM. -* `--char-len-ne :NUM` - FIELD character length != NUM. -* `--byte-len-le :NUM` - FIELD byte length <= NUM. 
-* `--byte-len-lt :NUM` - FIELD byte length < NUM. -* `--byte-len-ge :NUM` - FIELD byte length >= NUM. -* `--byte-len-gt :NUM` - FIELD byte length > NUM. -* `--byte-len-eq :NUM` - FIELD byte length == NUM. -* `--byte-len-ne :NUM` - FIELD byte length != NUM. - -Field to field comparisons: -* `--ff-le FIELD1:FIELD2` - FIELD1 <= FIELD2 (numeric). -* `--ff-lt FIELD1:FIELD2` - FIELD1 < FIELD2 (numeric). -* `--ff-ge FIELD1:FIELD2` - FIELD1 >= FIELD2 (numeric). -* `--ff-gt FIELD1:FIELD2` - FIELD1 > FIELD2 (numeric). -* `--ff-eq FIELD1:FIELD2` - FIELD1 == FIELD2 (numeric). -* `--ff-ne FIELD1:FIELD2` - FIELD1 != FIELD2 (numeric). -* `--ff-str-eq FIELD1:FIELD2` - FIELD1 == FIELD2 (string). -* `--ff-istr-eq FIELD1:FIELD2` - FIELD1 == FIELD2 (string, case-insensitive). -* `--ff-str-ne FIELD1:FIELD2` - FIELD1 != FIELD2 (string). -* `--ff-istr-ne FIELD1:FIELD2` - FIELD1 != FIELD2 (string, case-insensitive). -* `--ff-absdiff-le FIELD1:FIELD2:NUM` - abs(FIELD1 - FIELD2) <= NUM -* `--ff-absdiff-gt FIELD1:FIELD2:NUM` - abs(FIELD1 - FIELD2) > NUM -* `--ff-reldiff-le FIELD1:FIELD2:NUM` - abs(FIELD1 - FIELD2) / min(abs(FIELD1), abs(FIELD2)) <= NUM -* `--ff-reldiff-gt FIELD1:FIELD2:NUM` - abs(FIELD1 - FIELD2) / min(abs(FIELD1), abs(FIELD2)) > NUM - -**Examples:** - -Basic comparisons: -``` -$ # Field 2 non-zero -$ tsv-filter --ne 2:0 data.tsv - -$ # Field 1 == 0 and Field 2 >= 100, first line is a header. -$ tsv-filter --header --eq 1:0 --ge 2:100 data.tsv - -$ # Field 1 == -1 or Field 1 > 100 -$ tsv-filter --or --eq 1:-1 --gt 1:100 - -$ # Field 3 is foo, Field 4 contains bar -$ tsv-filter --header --str-eq 3:foo --str-in-fld 4:bar data.tsv - -$ # Field 3 == field 4 (numeric test) -$ tsv-filter --header --ff-eq 3:4 data.tsv -``` - -Field lists: - -Field lists can be used to run the same test on multiple fields. For example: -``` -$ # Test that fields 1-10 are not blank -$ tsv-filter --not-blank 1-10 data.tsv - -$ # Test that fields 1-5 are not zero -$ tsv-filter --ne 1-5:0 - -$ # Test that fields 1-5, 7, and 10-20 are less than 100 -$ tsv-filter --lt 1-5,7,10-20:100 -``` - -Regular expressions: - -The regular expression syntax supported is that defined by the [D regex library](). The basic syntax has become quite standard and is used by many tools. It will rarely be necessary to consult the D language documentation. A general reference such as the guide available at [Regular-Expressions.info](http://www.regular-expressions.info/) will suffice in nearly all cases. (Note: Unicode properties are supported.) - -``` -$ # Field 2 has a sequence with two a's, one or more digits, then 2 a's. -$ tsv-filter --regex '2:aa[0-9]+aa' data.tsv - -$ # Same thing, except the field starts and ends with the two a's. -$ tsv-filter --regex '2:^aa[0-9]+aa$' data.tsv - -$ # Field 2 is a sequence of "word" characters with two or more embedded -$ # whitespace sequences (match against entire field) -$ tsv-filter --regex '2:^\w+\s+(\w+\s+)+\w+$' data.tsv - -$ # Field 2 containing at least one cyrillic character. -$ tsv-filter --regex '2:\p{Cyrillic}' data.tsv -``` - -Short-circuiting expressions: - -Numeric tests like `--gt` (greater-than) assume field values can be interpreted as numbers. An error occurs if the field cannot be parsed as a number, halting the program. This can be avoided by including a test ensuring the field is recognizable as a number. For example: - -``` -$ # Ensure field 2 is a number before testing for greater-than 10.
-$ tsv-filter --is-numeric 2 --gt 2:10 data.tsv - -$ # Ensure field 2 is a number, not NaN or infinity before greater-than test. -$ tsv-filter --is-finite 2 --gt 2:10 data.tsv -``` - -The above tests work because `tsv-filter` short-circuits evaluation, only running as many tests as necessary to filter each line. Tests are run in the order listed on the command line. In the first example, if `--is-numeric 2` is false, the remaining tests do not get run. - -_**Tip:**_ Bash completion is very helpful when using commands like `tsv-filter` that have many options. See [Enable bash-completion](TipsAndTricks.md#enable-bash-completion) for details. - ---- - -## tsv-join reference - -**Synopsis:** tsv-join --filter-file file [options] file [file...] - -tsv-join matches input lines against lines from a 'filter' file. The match is based on exact match comparison of one or more 'key' fields. Fields are TAB delimited by default. Matching lines are written to standard output, along with any additional fields from the key file that have been specified. - -**Options:** -* `--h|help` - Print help. -* `--h|help-verbose` - Print detailed help. -* `--V|version` - Print version information and exit. -* `--f|filter-file FILE` - (Required) File with records to use as a filter. -* `--k|key-fields ` - Fields to use as join key. Default: 0 (entire line). -* `--d|data-fields ` - Data record fields to use as join key, if different than `--key-fields`. -* `--a|append-fields ` - Filter fields to append to matched records. -* `--H|header` - Treat the first line of each file as a header. -* `--p|prefix STR` - String to use as a prefix for `--append-fields` when writing a header line. -* `--w|write-all STR` - Output all data records. STR is the `--append-fields` value when writing unmatched records. This is an outer join. -* `--e|exclude` - Exclude matching records. This is an anti-join. -* `--delimiter CHR` - Field delimiter. Default: TAB. (Single byte UTF-8 characters only.) -* `--z|allow-duplicate-keys` - Allow duplicate keys with different append values (last entry wins). Default behavior is that this is an error. - -**Examples:** - -Filter one file based on another, using the full line as the key. -``` -$ # Output lines in data.txt that appear in filter.txt -$ tsv-join -f filter.txt data.txt - -$ # Output lines in data.txt that do not appear in filter.txt -$ tsv-join -f filter.txt --exclude data.txt -``` - -Filter multiple files, using fields 2 & 3 as the filter key. -``` -$ tsv-join -f filter.tsv --key-fields 2,3 data1.tsv data2.tsv data3.tsv -``` - -Same as previous, except use field 4 & 5 from the data files. -``` -$ tsv-join -f filter.tsv --key-fields 2,3 --data-fields 4,5 data1.tsv data2.tsv data3.tsv -``` - -Append fields from the filter file to matched records. -``` -$ tsv-join -f filter.tsv --key-fields 1 --append-fields 2-5 data.tsv -``` - -Write out all records from the data file, but when there is no match, write the 'append fields' as NULL. This is an outer join. -``` -$ tsv-join -f filter.tsv --key-fields 1 --append-fields 2 --write-all NULL data.tsv -``` - -Managing headers: Often it's useful to join a field from one data file to anther, where the data fields are related and the headers have the same name in both files. They can be kept distinct by adding a prefix to the filter file header. Example: -``` -$ tsv-join -f run1.tsv --header --key-fields 1 --append-fields 2 --prefix run1_ run2.tsv -``` - ---- - -## tsv-pretty reference - -**Synopsis:** tsv-pretty [options] [file...] 
- -`tsv-pretty` outputs TSV data in a format intended to be more human readable when working on the command line. This is done primarily by lining up data into fixed-width columns. Text is left aligned, numbers are right aligned. Floating point numbers are aligned on the decimal point when feasible. - -Processing begins by reading the initial set of lines into memory to determine the field widths and data types of each column. This look-ahead buffer is used for header detection as well. Output begins after this processing is complete. - -By default, only the alignment is changed; the actual values are not modified. Several of the formatting options do modify the values. - -Features: - -* Floating point numbers: Floats can be printed in fixed-width precision, using the same precision for all floats in a column. This makes them line up nicely. Precision is determined by values seen during look-ahead processing. The max precision defaults to 9; this can be changed when smaller or larger values are desired. See the `--f|format-floats` and `--p|precision` options. - -* Header lines: Headers are detected automatically when possible. This can be overridden when automatic detection doesn't work as desired. Headers can be underlined and repeated at regular intervals. - -* Missing values: A substitute value can be used for empty fields. This is often less confusing than spaces. See `--e|replace-empty` and `--E|empty-replacement`. - -* Exponential notation: As part of float formatting, `--f|format-floats` re-formats columns where exponential notation is found so all the values in the column are displayed using exponential notation and the same precision. - -* Preamble: A number of initial lines can be designated as a preamble and output unchanged. The preamble is before the header, if a header is present. Preamble lines can be auto-detected via the heuristic that they lack field delimiters. This works well when the field delimiter is a TAB. - -* Fonts: Fixed-width fonts are assumed. CJK characters are assumed to be double width. This is not always correct, but works well in most cases. - -**Options:** - -* `--help-verbose` - Print full help. -* `--H|header` - Treat the first line of each file as a header. -* `--x|no-header` - Assume no header. Turns off automatic header detection. -* `--l|lookahead NUM` - Lines to read to interpret data before generating output. Default: 1000 -* `--r|repeat-header NUM` - Lines to print before repeating the header. Default: No repeating header -* `--u|underline-header` - Underline the header. -* `--f|format-floats` - Format floats for better readability. Default: No -* `--p|precision NUM` - Max floating point precision. Implies --format-floats. Default: 9 -* `--e|replace-empty` - Replace empty fields with `--`. -* `--E|empty-replacement STR` - Replace empty fields with a string. -* `--d|delimiter CHR` - Field delimiter. Default: TAB. (Single byte UTF-8 characters only.) -* `--s|space-between-fields NUM` - Spaces between each field (Default: 2) -* `--m|max-text-width NUM` - Max reserved field width for variable width text fields. Default: 40 -* `--a|auto-preamble` - Treat initial lines in a file as a preamble if the line contains no field delimiters. The preamble is output unchanged. -* `--b|preamble NUM` - Treat the first NUM lines as a preamble and output them unchanged. -* `--V|version` - Print version information and exit. -* `--h|help` - This help information.
- -**Examples:** - -A tab-delimited file printed without any formatting: -``` -$ cat sample.tsv -Color Count Ht Wt -Brown 106 202.2 1.5 -Canary Yellow 7 106 0.761 -Chartreuse 1139 77.02 6.22 -Fluorescent Orange 422 1141.7 7.921 -Grey 19 140.3 1.03 -``` -The same file printed with `tsv-pretty`: -``` -$ tsv-pretty sample.tsv -Color Count Ht Wt -Brown 106 202.2 1.5 -Canary Yellow 7 106 0.761 -Chartreuse 1139 77.02 6.22 -Fluorescent Orange 422 1141.7 7.921 -Grey 19 140.3 1.03 -``` -Printed with float formatting and header underlining: -``` -$ tsv-pretty -f -u sample.tsv -Color Count Ht Wt ------ ----- -- -- -Brown 106 202.20 1.500 -Canary Yellow 7 106.00 0.761 -Chartreuse 1139 77.02 6.220 -Fluorescent Orange 422 1141.70 7.921 -Grey 19 140.30 1.030 -``` -Printed with setting the precision to one: -``` -$ tsv-pretty -u -p 1 sample.tsv -Color Count Ht Wt ------ ----- -- -- -Brown 106 202.2 1.5 -Canary Yellow 7 106.0 0.8 -Chartreuse 1139 77.0 6.2 -Fluorescent Orange 422 1141.7 7.9 -Grey 19 140.3 1.0 -``` - ---- - -## tsv-sample reference - -**Synopsis:** tsv-sample [options] [file...] - -`tsv-sample` subsamples input lines or randomizes their order. Several techniques are available: shuffling, simple random sampling, weighted random sampling, Bernoulli sampling, and distinct sampling. These are provided via several different modes operation: - -* **Shuffling** (_default_): All lines are read into memory and output in random order. All orderings are equally likely. -* **Simple random sampling** (`--n|num N`): A random sample of `N` lines is selected and written to standard output. Selected lines are written in random order, similar to shuffling. All sample sets and orderings are equally likely. Use `--i|inorder` to preserve the original input order. -* **Weighted random sampling** (`--n|num N`, `--w|weight-field F`): A weighted sample of N lines is selected using weights from a field on each line. Selected lines are written in weighted selection order. Use `--i|inorder` to preserve the original input order. Omit `--n|num` to shuffle all input lines (weighted shuffling). -* **Sampling with replacement** (`--r|replace`, `--n|num N`): All lines are read into memory, then lines are selected one at a time at random and written out. Lines can be selected multiple times. Output continues until `N` samples have been written. Output continues forever if `--n|num` is zero or not specified. -* **Bernoulli sampling** (`--p|prob P`): Lines are read one-at-a-time in a streaming fashion and a random subset is output based on the inclusion probability. For example, `--prob 0.2` gives each line a 20% chance of being selected. All lines have an equal likelihood of being selected. The order of the lines is unchanged. -* **Distinct sampling** (`--k|key-fields F`, `--p|prob P`): Input lines are sampled based on a key from each line. A key is made up of one or more fields. A subset of the keys are chosen based on the inclusion probability (a "distinct" set of keys). All lines with one of the selected keys are output. This is a streaming operation: a decision is made on each line as it is read. The order of the lines is not changed. - -**Sample size**: The `--n|num` option controls the sample size for all sampling methods. In the case of simple and weighted random sampling it also limits the amount of memory required. - -**Performance and memory use**: `tsv-sample` is designed for large data sets. Algorithms make one pass over the data, using reservoir sampling and hashing when possible to limit the memory required. 
Bernoulli sampling and distinct sampling make immediate decisions on each line, with no memory accumulation. They can operate on arbitrary length data streams. Sampling with replacement reads all lines into memory and is limited by available memory. Shuffling also reads all lines into memory and is similarly limited. Simple and weighted random sampling use reservoir sampling algorithms and only need to hold the sample size (`--n|num`) in memory. See [Shuffling large files](TipsAndTricks.md#shuffling-large-files) for ways to use disk when available memory is not sufficient. - -**Controlling randomization**: Each run produces a different randomization. Using `--s|static-seed` changes this so multiple runs produce the same randomization. This works by using the same random seed each run. The random seed can be specified using `--v|seed-value`. This takes a non-zero, 32-bit positive integer. A zero value is a no-op and ignored. - -**Weighted sampling**: Weighted line order randomization is done using an algorithm for weighted reservoir sampling described by Pavlos Efraimidis and Paul Spirakis. Weights should be positive values representing the relative weight of the entry in the collection. Counts and similar can be used as weights, it is *not* necessary to normalize to a [0,1] interval. Negative values are not meaningful and given the value zero. Input order is not retained, instead lines are output ordered by the randomized weight that was assigned. This means that a smaller valid sample can be produced by taking the first N lines of output. For more information see: -* Wikipedia: https://en.wikipedia.org/wiki/Reservoir_sampling -* "Weighted Random Sampling over Data Streams", Pavlos S. Efraimidis (https://arxiv.org/abs/1012.0256) - -**Distinct sampling**: Distinct sampling selects a subset based on a key in data. Consider a query log with records consisting of triples. Distinct sampling selects all records matching a subset of values from one of the fields. For example, all events for ten percent of the users. This is important for certain types of analysis. Distinct sampling works by converting the specified probability (`--p|prob`) into a set of buckets and mapping every key into one of the buckets. One bucket is used to select records in the sample. Buckets are equal size and therefore may be a bit larger than the inclusion probability. Since every key is assigned a bucket, this method can also be used to fully divide a set of records into distinct groups. (See *Printing random values* below.) The term "distinct sampling" originates from algorithms estimating the number of distinct elements in extremely large data sets. - -**Printing random values**: Most of these algorithms work by generating a random value for each line. (See also "Compatibility mode" below.) The nature of these values depends on the sampling algorithm. They are used for both line selection and output ordering. The `--print-random` option can be used to print these values. The random value is prepended to the line separated by the `--d|delimiter` char (TAB by default). The `--gen-random-inorder` option takes this one step further, generating random values for all input lines without changing the input order. The types of values currently used are specific to the sampling algorithm: -* Shuffling, simple random sampling, Bernoulli sampling: Uniform random value in the interval [0,1]. -* Weighted random sampling: Value in the interval [0,1]. Distribution depends on the values in the weight field. 
-* Distinct sampling: An integer, zero and up, representing a selection group (aka. "bucket"). The inclusion probability determines the number of selection groups. -* Sampling with replacement: Random value printing is not supported. - -The specifics behind these random values are subject to change in future releases. - -**Compatibility mode**: As described above, many of the sampling algorithms assign a random value to each line. This is useful when printing random values. It has another occasionally useful property: repeated runs with the same static seed but different selection parameters are more compatible with each other, as each line gets assigned the same random value on every run. This property comes at a cost: in some cases there are faster algorithms that don't assign random values to each line. By default, `tsv-sample` will use the fastest algorithm available. The `--compatibility-mode` option changes this, switching to algorithms that assign a random value per line. Printing random values also engages compatibility mode. Compatibility mode is beneficial primarily when using Bernoulli sampling or random sampling: -* Bernoulli sampling - A run with a larger probability will be a superset of a smaller probability. In the example below, all lines selected in the first run are also selected in the second. - ``` - $ tsv-sample --static-seed --compatibility-mode --prob 0.2 data.tsv - $ tsv-sample --static-seed --compatibility-mode --prob 0.3 data.tsv - ``` -* Random sampling - A run with a larger sample size will be a superset of a smaller sample size. In the example below, all lines selected in the first run are also selected in the second. - ``` - $ tsv-sample --static-seed --compatibility-mode -n 1000 data.tsv - $ tsv-sample --static-seed --compatibility-mode -n 1500 data.tsv - ``` - This works for weighted sampling as well. - -**Options:** - -* `--h|help` - This help information. -* `--help-verbose` - Print more detailed help. -* `--V|version` - Print version information and exit. -* `--H|header` - Treat the first line of each file as a header. -* `--n|num NUM` - Maximum number of lines to output. All selected lines are output if not provided or zero. -* `--p|prob NUM` - Inclusion probability (0.0 < NUM <= 1.0). For Bernoulli sampling, the probability each line is selected output. For distinct sampling, the probability each unique key is selected for output. -* `--k|key-fields ` - Fields to use as key for distinct sampling. Use with `--p|prob`. Specify `--k|key-fields 0` to use the entire line as the key. -* `--w|weight-field NUM` - Field containing weights. All lines get equal weight if not provided or zero. -* `--r|replace` - Simple random sampling with replacement. Use `--n|num` to specify the sample size. -* `--s|static-seed` - Use the same random seed every run. -* `--v|seed-value NUM` - Sets the random seed. Use a non-zero, 32 bit positive integer. Zero is a no-op. -* `--print-random` - Output the random values that were assigned. -* `--gen-random-inorder` - Output all lines with assigned random values prepended, no changes to the order of input. -* `--random-value-header` - Header to use with `--print-random` and `--gen-random-inorder`. Default: `random_value`. -* `--compatibility-mode` - Turns on "compatibility mode". -* `--d|delimiter CHR` - Field delimiter. -* `--prefer-skip-sampling` - (Internal) Prefer the skip-sampling algorithm for Bernoulli sampling. Used for testing and diagnostics. 
-* `--prefer-algorithm-r` - (Internal) Prefer Algorithm R for unweighted line order randomization. Used for testing and diagnostics. - ---- - -## tsv-select reference - -**Synopsis:** tsv-select [options] [file...] - -tsv-select reads files or standard input and writes specified fields to standard output in the order listed. Similar to Unix `cut` with the ability to reorder fields. - -Fields numbers start with one. They are comma separated, and ranges can be used. Fields can be listed more than once, and fields not listed can be selected as a group using the `--rest` option. When working with multiple files, the `--header` option can be used to retain the header from the just the first file. - -Fields can be excluded using `--e|exclude`. All fields not excluded are output. `--f|fields` and `--r|rest` can be used with `--e|exclude` to change the order of non-excluded fields. - -**Options:** -* `--h|help` - Print help. -* `--help-verbose` - Print more detailed help. -* `--V|version` - Print version information and exit. -* `--H|header` - Treat the first line of each file as a header. -* `--f|fields ` - Fields to retain. Fields are output in the order listed. -* `--e|--exclude ` - Fields to exclude. -* `--r|rest first|last` - Output location for fields not included in the `--f|fields` field-list. -* `--d|delimiter CHR` - Character to use as field delimiter. Default: TAB. (Single byte UTF-8 characters only.) - -**Notes:** -* One of `--f|fields` or `--e|exclude` is required. -* Fields specified by `--f|fields` and `--e|exclude` cannot overlap. -* When `--f|fields` and `--e|exclude` are used together, the effect is to specify `--rest last`. This can be overridden by specifying `--rest first`. -* Each input line must be long enough to contain all fields specified with `--f|fields`. This is not necessary for `--e|exclude` fields. - -**Examples:** -``` -$ # Keep the first field from two files -$ tsv-select -f 1 file1.tsv file2.tsv - -$ # Keep fields 1 and 2, retain the header from the first file -$ tsv-select -H -f 1,2 file1.tsv file2.tsv - -$ # Output fields 2 and 1, in that order -$ tsv-select -f 2,1 file.tsv - -$ # Output a range of fields -$ tsv-select -f 3-30 file.tsv - -$ # Output a range of fields in reverse order -$ tsv-select -f 30-3 file.tsv - -$ # Drop the first field, keep everything else -$ # Equivalent to 'cut -f 2- file.tsv' -$ tsv-select --exclude 1 file.tsv -$ tsv-select -e 1 file.tsv - -$ # Move field 1 to the end of the line -$ tsv-select -f 1 --rest first file.tsv - -$ # Move fields 7 and 3 to the start of the line -$ tsv-select -f 7,3 --rest last file.tsv - -# Output with repeating fields -$ tsv-select -f 1,2,1 file.tsv -$ tsv-select -f 1-3,3-1 file.tsv - -$ # Read from standard input -$ cat file*.tsv | tsv-select -f 1,4-7,11 - -$ # Read from a file and standard input. The '--' terminates command -$ # option processing, '-' represents standard input. -$ cat file1.tsv | tsv-select -f 1-3 -- - file2.tsv - -$ # Files using comma as the separator ('simple csv') -$ # (Note: Does not handle CSV escapes.) -$ tsv-select -d , --fields 5,1,2 file.csv - -$ # Move field 2 to the front and drop fields 10-15 -$ tsv-select -f 2 -e 10-15 file.tsv - -$ # Move field 2 to the end, dropping fields 10-15 -$ tsv-select -f 2 -rest first -e 10-15 file.tsv -``` - ---- - -## tsv-split reference - -Synopsis: tsv-split [options] [file...] - -Split input lines into multiple output files. 
There are three modes of operation: - -* **Fixed number of lines per file** (`--l|lines-per-file NUM`): Each input block of NUM lines is written to a new file. Similar to Unix `split`. - -* **Random assignment** (`--n|num-files NUM`): Each input line is written to a randomly selected output file. Random selection is from NUM files. - -* **Random assignment by key** (`--n|num-files NUM`, `--k|key-fields FIELDS`): Input lines are written to output files using fields as a key. Each unique key is randomly assigned to one of NUM output files. All lines with the same key are written to the same file. - -**Output files**: By default, files are written to the current directory and have names of the form `part_NNN`, with `NNN` being a number and `` being the extension of the first input file. If the input file is `file.txt`, the names will take the form `part_NNN.txt`. The suffix is empty when reading from standard input. The numeric part defaults to 3 digits for `--l|lines-per-files`. For `--n|num-files` enough digits are used so all filenames are the same length. The output directory and file names are customizable. - -**Header lines**: There are two ways to handle input with headers: write a header to all output files (`--H|header`), or exclude headers from all output files (`--I|header-in-only`). The best choice depends on the follow-up processing. All tsv-utils tools support header lines in multiple input files, but many other tools do not. For example, [GNU parallel](https://www.gnu.org/software/parallel/) works best on files without header lines. (See [Faster processing using GNU parallel](TipsAndTricks.md#faster-processing-using-gnu-parallel) for some info on using GNU parallel and tsv-utils together.) - -**About Random assignment** (`--n|num-files`): Random distribution of records to a set of files is a common task. When data fits in memory the preferred approach is usually to shuffle the data and split it into fixed sized blocks. Both of the following command lines accomplish this: -``` -$ shuf data.tsv | split -l NUM -$ tsv-sample data.tsv | tsv-split -l NUM -``` - -However, alternate approaches are needed when data is too large for convenient shuffling. tsv-split's random assignment feature can be useful in these cases. Each input line is written to a randomly selected output file. Note that output files will have similar but not identical numbers of records. - -**About Random assignment by key** (`--n|num-files NUM`, `--k|key-fields FIELDS`): This splits a data set into multiple files sharded by key. All lines with the same key are written to the same file. This partitioning enables parallel computation based on the key. For example, statistical calculation (`tsv-summarize --group-by`) or duplicate removal (`tsv-uniq --fields`). These operations can be parallelized using tools like GNU parallel, which simplifies concurrent operations on multiple files. - -**Random seed**: By default, each tsv-split invocation using random assignment or random assignment by key produces different assignments to the output files. Using `--s|static-seed` changes this so multiple runs produce the same assignments. This works by using the same random seed each run. The seed can be specified using `--v|seed-value`. - -**Appending to existing files**: By default, an error is triggered if an output file already exists. `--a|append` changes this so that lines are appended to existing files. (Header lines are not appended to files with data.) 
This is useful when adding new data to files created by a previous `tsv-split` run. Random assignment should use the same `--n|num-files` value each run, but different random seeds (avoid `--s|static-seed`). Random assignment by key should use the same `--n|num-files`, `--k|key-fields`, and seed (`--s|static-seed` or `--v|seed-value`) each run. - -**Max number of open files**: Random assignment and random assignment by key are dramatically faster when all output files are kept open. However, keeping a large numbers of open files can bump into system limits or limit resources available to other processes. By default, `tsv-split` uses up to 4096 open files or the system per-process limit, whichever is smaller. This can be changed using `--max-open-files`, though it cannot be set larger than the system limit. The system limit varies considerably between systems. On many systems it is unlimited. On MacOS it is often set to 256. Use Unix `ulimit` to display and modify the limits: -``` -$ ulimit -n # Show the "soft limit". The per-process maximum. -$ ulimit -Hn # Show the "hard limit". The max allowed soft limit. -$ ulimit -Sn NUM # Change the "soft limit" to NUM. -``` - -**Examples**: -``` -$ # Split a 10 million line file into 1000 files, 10,000 lines each. -$ # Output files are part_000.txt, part_001.txt, ... part_999.txt. -$ tsv-split data.txt --lines-per-file 10000 - -$ # Same as the previous example, but write files to a subdirectory. -$ tsv-split data.txt --dir split_files --lines-per-file 10000 - -$ # Split a file into 10,000 line files, writing a header line to each -$ tsv-split data.txt -H --lines-per-file 10000 - -$ # Same as the previous example, but dropping the header line. -$ tsv-split data.txt -I --lines-per-file 10000 - -$ # Randomly assign lines to 1000 files -$ tsv-split data.txt --num-files 1000 - -$ # Randomly assign lines to 1000 files while keeping unique keys from -$ # field 3 together. -$ tsv-split data.tsv --num-files 1000 -k 3 - -$ # Randomly assign lines to 1000 files. Later, randomly assign lines -$ # from a second data file to the same output files. -$ tsv-split data1.tsv -n 1000 -$ tsv-split data2.tsv -n 1000 --append - -$ # Randomly assign lines to 1000 files using field 3 as a key. -$ # Later, add a second file to the same output files. -$ tsv-split data1.tsv -n 1000 -k 3 --static-seed -$ tsv-split data2.tsv -n 1000 -k 3 --static-seed --append - -$ # Change the system per-process open file limit for one command. -$ # The parens create a sub-shell. The current shell is not changed. -$ ( ulimit -Sn 1000 && tsv-split --num-files 1000 data.txt ) -``` - -**Options**: -* `--h|--help` - Print help. -* `--help-verbose` - Print more detailed help. -* `--V|--version` - Print version information and exit. -* `--H|header` - Input files have a header line. Write the header to each output file. -* `--I|header-in-only` - Input files have a header line. Do not write the header to output files. -* `--l|lines-per-file NUM` - Number of lines to write to each output file (excluding the header line). -* `--n|num-files NUM` - Number of output files to generate. -* `--k|key-fields ` - Fields to use as key. Lines with the same key are written to the same output file. Use `--k|key-fields 0` to use the entire line as the key. -* `--dir STR` - Directory to write to. Default: Current working directory. -* `--prefix STR` - Filename prefix. Default: `part_` -* `--suffix STR` - Filename suffix. Default: First input file extension. None for standard input. 
-* `--w|digit-width NUM` - Number of digits in filename numeric portion. Default: `--l|lines-per-file`: 3. `--n|num-files`: Chosen so filenames have the same length. `--w|digit-width 0` uses the default. -* `--a|append` - Append to existing files. -* `--s|static-seed` - Use the same random seed every run. -* `--v|seed-value NUM` - Sets the random seed. Use a non-zero, 32 bit positive integer. Zero is a no-op. -* `--d|delimiter CHR` - Field delimiter. -* `--max-open-files NUM` - Maximum open file handles to use. Min of 5 required. - ---- - -## tsv-summarize reference - -Synopsis: tsv-summarize [options] file [file...] - -`tsv-summarize` generates summary statistics on fields of a TSV file. A variety of statistics are supported. Calculations can run against the entire data stream or grouped by key. Consider the file data.tsv: -``` -make color time -ford blue 131 -chevy green 124 -ford red 128 -bmw black 118 -bmw black 126 -ford blue 122 -``` - -The min and average 'time' values for the 'make' field is generated by the command: -``` -$ tsv-summarize --header --group-by 1 --min 3 --mean 3 data.tsv -``` - -This produces: -``` -make time_min time_mean -ford 122 127 -chevy 124 124 -bmw 118 122 -``` - -Using `--group-by 1,2` will group by both 'make' and 'color'. Omitting the `--group-by` entirely summarizes fields for full file. - -The program tries to generate useful headers, but custom headers can be specified. Example: -``` -$ tsv-summarize --header --group-by 1 --min 3:fastest --mean 3:average data.tsv -make fastest average -ford 122 127 -chevy 124 124 -bmw 118 122 -``` - -Most operators take custom headers in a manner shown above, following the syntax: -``` --- FIELD[:header] -``` - -Operators can be specified multiple times. They can also take multiple fields (though not when a custom header is specified). Examples: -``` ---median 2,3,4 ---median 1,5-8 -``` - -The quantile operator requires one or more probabilities after the fields: -``` ---quantile 2:0.25 # Quantile 1 of field 2 ---quantile 2-4:0.25,0.5,0.75 # Q1, Median, Q3 of fields 2, 3, 4 -``` - -Summarization operators available are: -``` - count range mad values - retain sum var unique-values - first mean stddev unique-count - last median mode missing-count - min quantile mode-count not-missing-count - max -``` - -Calculated numeric values are printed to 12 significant digits by default. This can be changed using the `--p|float-precision` option. If six or less it sets the number of significant digits after the decimal point. If greater than six it sets the total number of significant digits. - -Calculations hold onto the minimum data needed while reading data. A few operations like median keep all data values in memory. These operations will start to encounter performance issues as available memory becomes scarce. The size that can be handled effectively is machine dependent, but often quite large files can be handled. - -Operations requiring numeric entries will signal an error and terminate processing if a non-numeric entry is found. - -Missing values are not treated specially by default, this can be changed using the `--x|exclude-missing` or `--r|replace-missing` option. The former turns off processing for missing values, the latter uses a replacement value. - -**Options:** -* `--h|help` - Print help. -* `--help-verbose` - Print detailed help. -* `--V|version` - Print version information and exit. -* `--g|group-by ` - Fields to use as key. -* `--H|header` - Treat the first line of each file as a header. 
-* `--w|write-header` - Write an output header even if there is no input header. -* `--d|delimiter CHR` - Field delimiter. Default: TAB. (Single byte UTF-8 characters only.) -* `--v|values-delimiter CHR` - Values delimiter. Default: vertical bar (\|). (Single byte UTF-8 characters only.) -* `--p|float-precision NUM` - 'Precision' to use printing floating point numbers. Affects the number of digits printed and exponent use. Default: 12 -* `--x|exclude-missing` - Exclude missing (empty) fields from calculations. -* `--r|replace-missing STR` - Replace missing (empty) fields with STR in calculations. - -**Operators:** -* `--count` - Count occurrences of each unique key (`--g|group-by`), or the total number of records if no key field is specified. -* `--count-header STR` - Count occurrences of each unique key, like `--count`, but use STR as the header. -* `--retain ` - Retain one copy of the field. The field header is unchanged. -* `--first [:STR]` - First value seen. -* `--last [:STR]`- Last value seen. -* `--min [:STR]` - Min value. (Numeric fields only.) -* `--max [:STR]` - Max value. (Numeric fields only.) -* `--range [:STR]` - Difference between min and max values. (Numeric fields only.) -* `--sum [:STR]` - Sum of the values. (Numeric fields only.) -* `--mean [:STR]` - Mean (average). (Numeric fields only.) -* `--median [:STR]` - Median value. (Numeric fields only. Reads all values into memory.) -* `--quantile :p[,p...][:STR]` - Quantiles. One or more fields, then one or more 0.0-1.0 probabilities. (Numeric fields only. Reads all values into memory.) -* `--mad [:STR]` - Median absolute deviation from the median. Raw value, not scaled. (Numeric fields only. Reads all values into memory.) -* `--var [:STR]` - Variance. (Sample variance, numeric fields only). -* `--stdev [:STR]` - Standard deviation. (Sample st.dev, numeric fields only). -* `--mode [:STR]` - Mode. The most frequent value. (Reads all unique values into memory.) -* `--mode-count [:STR]` - Count of the most frequent value. (Reads all unique values into memory.) -* `--unique-count [:STR]` - Number of unique values. (Reads all unique values into memory). -* `--missing-count [:STR]` - Number of missing (empty) fields. Not affected by the `--x|exclude-missing` or `--r|replace-missing` options. -* `--not-missing-count [:STR]` - Number of filled (non-empty) fields. Not affected by `--r|replace-missing`. -* `--values [:STR]` - All the values, separated by `--v|values-delimiter`. (Reads all values into memory.) -* `--unique-values [:STR]` - All the unique values, separated by `--v|values-delimiter`. (Reads all unique values into memory.) - -_**Tip:**_ Bash completion is very helpful when using commands like `tsv-summarize` that have many options. See [Enable bash-completion](TipsAndTricks.md#enable-bash-completion) for details. - ---- - -## tsv-uniq reference - -`tsv-uniq` identifies equivalent lines in files or standard input. Input is read line by line, recording a key based on one or more of the fields. Two lines are equivalent if they have the same key. When operating in the default 'uniq' mode, the first time a key is seen the line is written to standard output. Subsequent lines having the same key are discarded. This is similar to the Unix `uniq` program, but based on individual fields and without requiring sorted data. - -`tsv-uniq` can be run without specifying a key field. In this case the whole line is used as a key, same as the Unix `uniq` program. As with `uniq`, this works on any line-oriented text file, not just TSV files. 
There is no need to sort the data and the original input order is preserved. - -The alternates to the default 'uniq' mode are 'number' mode and 'equiv-class' mode. In 'equiv-class' mode (`--e|equiv`), all lines are written to standard output, but with a field appended marking equivalent entries with an ID. The ID is a one-upped counter. - -'Number' mode (`--z|number`) also writes all lines to standard output, but with a field appended numbering the occurrence count for the line's key. The first line with a specific key is assigned the number '1', the second with the key is assigned number '2', etc. 'Number' and 'equiv-class' modes can be used together. - -The `--r|repeated` option can be used to print only lines occurring more than once. Specifically, the second occurrence of a key is printed. The `--a|at-least N` option is similar, printing lines occurring at least N times. (Like repeated, the Nth line with the key is printed.) - -The `--m|max MAX` option changes the behavior to output the first MAX lines for each key, rather than just the first line for each key. - -If both `--a|at-least` and `--m|max` are specified, the occurrences starting with 'at-least' and ending with 'max' are output. - -**Synopsis:** tsv-uniq [options] [file...] - -**Options:** -* `-h|help` - Print help. -* `--help-verbose` - Print detailed help. -* `--V|version` - Print version information and exit. -* `--H|header` - Treat the first line of each file as a header. -* `--f|fields ` - Fields to use as the key. Default: 0 (entire line). -* `--i|ignore-case` - Ignore case when comparing keys. -* `--e|equiv` - Output equiv class IDs rather than uniq'ing entries. -* `--equiv-header STR` - Use STR as the equiv-id field header. Applies when using `--header --equiv`. Default: `equiv_id`. -* `--equiv-start INT` - Use INT as the first equiv-id. Default: 1. -* `--z|number` - Output equivalence class occurrence counts rather than uniq'ing entries. -* `--number-header STR` - Use STR as the `--number` field header (when using `-H --number`). Default: `equiv_line`. -* `--r|repeated` - Output only lines that are repeated (based on the key). -* `--a|at-least INT` - Output only lines that are repeated INT times (based on the key). Zero and one are ignored. -* `--m|max INT` - Max number of each unique key to output (zero is ignored). -* `--d|delimiter CHR` - Field delimiter. Default: TAB. (Single byte UTF-8 characters only.) 
- -**Examples:** -``` -$ # Uniq a file, using the full line as the key -$ tsv-uniq data.txt - -$ # Same as above, but case-insensitive -$ tsv-uniq --ignore-case data.txt - -$ # Unique a file based on one field -$ tsv-unique -f 1 data.tsv - -$ # Unique a file based on two fields -$ tsv-uniq -f 1,2 data.tsv - -$ # Unique a file based on the first four fields -$ tsv-uniq -f 1-4 data.tsv - -$ # Output all the lines, generating an ID for each unique entry -$ tsv-uniq -f 1,2 --equiv data.tsv - -$ # Generate uniq IDs, but account for headers -$ tsv-uniq -f 1,2 --equiv --header data.tsv - -$ # Generate line numbers specific to each key -$ tsv-uniq -f 1,2 --number --header data.tsv - -$ # --Examples showing the data-- - -$ cat data.tsv -field1 field2 field2 -ABCD 1234 PQR -efgh 5678 stu -ABCD 1234 PQR -wxyz 1234 stu -efgh 5678 stu -ABCD 1234 PQR - -$ # Uniq using the full line as key -$ tsv-uniq -H data.tsv -field1 field2 field2 -ABCD 1234 PQR -efgh 5678 stu -wxyz 1234 stu - -$ # Uniq using field 2 as key -$ tsv-uniq -H -f 2 data.tsv -field1 field2 field2 -ABCD 1234 PQR -efgh 5678 stu - -$ # Generate equivalence class IDs -$ tsv-uniq -H --equiv data.tsv -field1 field2 field2 equiv_id -ABCD 1234 PQR 1 -efgh 5678 stu 2 -ABCD 1234 PQR 1 -wxyz 1234 stu 3 -efgh 5678 stu 2 -ABCD 1234 PQR 1 - -$ # Generate equivalence class IDs and line numbers -$ tsv-uniq -H --equiv --number data.tsv -field1 field2 field2 equiv_id equiv_line -ABCD 1234 PQR 1 1 -efgh 5678 stu 2 1 -ABCD 1234 PQR 1 2 -wxyz 1234 stu 3 1 -efgh 5678 stu 2 2 -ABCD 1234 PQR 1 3 -``` +Documentation in the above files is for the current toolkit version. There were significant changes to the documents in release 2.0.0 due to the addition of named fields. Documentation for earlier versions is available in [Tools Reference v1.6](ToolReference_v1.6.md). diff --git a/docs/ToolReference_v1.6.md b/docs/ToolReference_v1.6.md new file mode 100644 index 00000000..50cf3453 --- /dev/null +++ b/docs/ToolReference_v1.6.md @@ -0,0 +1,1005 @@ +_Visit the [main page](../README.md)_ + +# Tools Reference v1.6 + +*This is the documentation for the version 1.6 of tsv-utils toolkit, prior to the introduction of named fields. For the curent documentation go to the primary [Tools Reference](ToolReference.md) page.* + +This page provides detailed documentation about the different tools as well as examples. Material for the individual tools is also available via the `--help` option. + +* [Common options and behavior](#common-options-and-behavior) +* [csv2tsv](#csv2tsv-reference) +* [keep-header](#keep-header-reference) +* [number-lines](#number-lines-reference) +* [tsv-append](#tsv-append-reference) +* [tsv-filter](#tsv-filter-reference) +* [tsv-join](#tsv-join-reference) +* [tsv-pretty](#tsv-pretty-reference) +* [tsv-sample](#tsv-sample-reference) +* [tsv-select](#tsv-select-reference) +* [tsv-split](#tsv-split-reference) +* [tsv-summarize](#tsv-summarize-reference) +* [tsv-uniq](#tsv-uniq-reference) + +___ + +## Common options and behavior + +Information in this section applies to all the tools. + +### Specifying options + +Multi-letter options are specified with a double dash. Single letter options can be specified with a single dash or double dash. For example: +``` +$ tsv-select -f 1,2 # Valid +$ tsv-select --f 1,2 # Valid +$ tsv-select --fields 1,2 # Valid +$ tsv-select -fields 1,2 # Invalid. +``` + +### Help (`-h`, `--help`, `--help-verbose`) + +All tools print help if given the `-h` or `--help` option. Many provide more detail via the `--help-verbose` option. 
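+
+For example, either of the following prints help for `tsv-filter` (any of the tools can be substituted):
+```
+$ tsv-filter --help
+$ tsv-filter --help-verbose
+```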
+
+### Field numbers and field-lists
+
+Field numbers are one-upped integers, following Unix conventions. Some tools use zero to represent the entire line (`tsv-join`, `tsv-uniq`).
+
+In many cases multiple fields can be entered as a "field-list". A field-list is a comma separated list of field numbers or field ranges. For example:
+
+```
+$ tsv-select -f 3 # Field 3
+$ tsv-select -f 3,5 # Fields 3 and 5
+$ tsv-select -f 3-5 # Fields 3, 4, 5
+$ tsv-select -f 1,3-5 # Fields 1, 3, 4, 5
+```
+
+Most tools process or output fields in the order listed, and repeated use is usually fine:
+```
+$ tsv-select -f 5-1 # Fields 5, 4, 3, 2, 1
+$ tsv-select -f 1-3,2,1 # Fields 1, 2, 3, 2, 1
+```
+
+### UTF-8 input
+
+These tools assume data is UTF-8 encoded.
+
+### Line endings
+
+These tools have been tested on Unix platforms, including macOS, but not Windows. On Unix platforms, Unix line endings (`\n`) are expected, with the notable exception of `csv2tsv`. Not all the tools are affected by DOS and Windows line endings (`\r\n`); those that are check the first line and flag an error. `csv2tsv` explicitly handles DOS and Windows line endings, converting to Unix line endings as part of the conversion.
+
+The `dos2unix` tool can be used to convert Windows line endings to Unix format. See [Convert newline format and character encoding with dos2unix and iconv](TipsAndTricks.md#convert-newline-format-and-character-encoding-with-dos2unix-and-iconv).
+
+The tools were written to respect platform line endings, so a build on Windows would use Windows line endings. However, given the lack of testing, a Windows build should be expected to have some issues with line endings.
+
+### File format and alternate delimiters (`--delimiter`)
+
+Any character can be used as a delimiter; TAB is the default. However, there is no escaping for including the delimiter character or newlines within a field. This differs from the CSV format, which provides an escaping mechanism. In practice the lack of an escaping mechanism is not a meaningful limitation for data oriented files.
+
+Aside from a header line, all lines are expected to have data. There is no comment mechanism and no special handling for blank lines. Tools taking field indices as arguments expect the specified fields to be available on every line.
+
+### Headers (`-H`, `--header`)
+
+Most tools handle the first line of files as a header when given the `-H` or `--header` option. For example, `tsv-filter` passes the header through without filtering it. When `--header` is used, all files and stdin are assumed to have header lines. Only one header line is written to stdout. If multiple files are being processed, header lines from subsequent files are discarded.
+
+### Multiple files and standard input
+
+Tools can read from any number of files and from standard input. As per typical Unix behavior, a single dash represents standard input when included in a list of files. Terminate non-file arguments with a double dash (`--`) when using a single dash in this fashion. Example:
+```
+$ head -n 1000 file-c.tsv | tsv-filter --eq 2:1000 -- file-a.tsv file-b.tsv - > out.tsv
+```
+
+The above passes `file-a.tsv`, `file-b.tsv`, and the first 1000 lines of `file-c.tsv` to `tsv-filter` and writes the results to `out.tsv`.
+
+---
+
+## csv2tsv reference
+
+**Synopsis:** csv2tsv [options] [file...]
+
+csv2tsv converts CSV (comma-separated) text to TSV (tab-separated) format. Records are read from files or standard input, converted records are written to standard output.
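+
+For example, a minimal conversion might look like the following (the file names are placeholders; the options used are described below):
+```
+$ # Convert a CSV file to TSV
+$ csv2tsv data.csv > data.tsv
+
+$ # Convert and concatenate multiple CSV files, writing only the first file's header
+$ csv2tsv -H file1.csv file2.csv > combined.tsv
+```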
+
+Both formats represent tabular data, each record on its own line, fields separated by a delimiter character. The key difference is that CSV uses escape sequences to represent newlines and field separators in the data, whereas TSV disallows these characters in the data. The most common field delimiters are comma for CSV and tab for TSV, but any character can be used. See [Comparing TSV and CSV formats](comparing-tsv-and-csv.md) for additional discussion of the formats.
+
+Conversion to TSV is done by removing CSV escape syntax, changing field delimiters, and replacing newlines and tabs in the data. By default, newlines and tabs in the data are replaced by spaces. Most details are customizable.
+
+There is no single spec for CSV; any number of variants can be found. The escape syntax is common enough: fields containing newlines or field delimiters are placed in double quotes. Inside a quoted field, a double quote is represented by a pair of double quotes. As with field separators, the quoting character is customizable.
+
+Behaviors of this program that often vary between CSV implementations:
+* Newlines are supported in quoted fields.
+* Double quotes are permitted in a non-quoted field. However, a field starting with a quote must follow quoting rules.
+* Each record can have a different number of fields.
+* The three common forms of newlines are supported: CR, CRLF, LF.
+* A newline will be added if the file does not end with one.
+* No whitespace trimming is done.
+
+This program does not validate CSV correctness, but will terminate with an error upon reaching an inconsistent state. Improperly terminated quoted fields are the primary cause.
+
+UTF-8 input is assumed. Convert other encodings prior to invoking this tool.
+
+**Options:**
+* `--h|help` - Print help.
+* `--help-verbose` - Print detailed help.
+* `--V|version` - Print version information and exit.
+* `--H|header` - Treat the first line of each file as a header. Only the header of the first file is output.
+* `--q|quote CHR` - Quoting character in CSV data. Default: double-quote (").
+* `--c|csv-delim CHR` - Field delimiter in CSV data. Default: comma (,).
+* `--t|tsv-delim CHR` - Field delimiter in TSV data. Default: TAB.
+* `--r|replacement STR` - Replacement for newline and TSV field delimiters found in CSV input. Default: Space.
+
+---
+
+## keep-header reference
+
+**Synopsis:** keep-header [file...] \-- program [args]
+
+Execute a command against one or more files in a header-aware fashion. The first line of each file is assumed to be a header. The first header is output unchanged. Remaining lines are sent to the given command via standard input, excluding the header lines of subsequent files. Output from the command is appended to the initial header line. A double dash (\--) delimits the command, similar to how the pipe operator (\|) delimits commands.
+
+The following commands sort files in the usual way, except for retaining a single header line:
+```
+$ keep-header file1.txt -- sort
+$ keep-header file1.txt file2.txt -- sort -k1,1nr
+```
+
+Data can also be read from standard input. For example:
+```
+$ cat file1.txt | keep-header -- sort
+$ keep-header file1.txt -- sort -r | keep-header -- grep red
+```
+
+The last example can be simplified using a shell command:
+```
+$ keep-header file1.txt -- /bin/sh -c '(sort -r | grep red)'
+```
+
+`keep-header` is especially useful for commands like `sort` and `shuf` that reorder input lines.
It is also useful with filtering commands like `grep`, many `awk` uses, and even `tail`, where the header should be retained without filtering or evaluation.
+
+`keep-header` works on any file where the first line is delimited by a newline character. This includes all TSV files and the majority of CSV files. It won't work on CSV files having embedded newlines in the header.
+
+**Options:**
+* `--h|help` - Print help.
+* `--V|version` - Print version information and exit.
+
+---
+
+## number-lines reference
+
+**Synopsis:** number-lines [options] [file...]
+
+number-lines reads from files or standard input and writes each line to standard output preceded by a line number. It is a simplified version of the Unix `nl` program. It supports one feature `nl` does not: the ability to treat the first line of files as a header. This is useful when working with tab-separated-value files. If header processing is used, a header line is written for the first file, and the header lines are dropped from any subsequent files.
+
+**Options:**
+* `--h|help` - Print help.
+* `--V|version` - Print version information and exit.
+* `--H|header` - Treat the first line of each file as a header. The first input file's header is output, subsequent file headers are discarded.
+* `--s|header-string STR` - String to use as the header for the line number field. Implies `--header`. Default: 'line'.
+* `--n|start-number NUM` - Number to use for the first line. Default: 1.
+* `--d|delimiter CHR` - Character appended to line number, preceding the rest of the line. Default: TAB. (Single byte UTF-8 characters only.)
+
+**Examples:**
+```
+$ # Number lines in a file
+$ number-lines file.tsv
+
+$ # Number lines from multiple files. Treat the first line of each file
+$ # as a header.
+$ number-lines --header data*.tsv
+```
+
+**See Also:**
+
+* [tsv-uniq](#tsv-uniq-reference) supports numbering lines grouped by key.
+
+---
+
+## tsv-append reference
+
+**Synopsis:** tsv-append [options] [file...]
+
+tsv-append concatenates multiple TSV files, similar to the Unix `cat` utility. Unlike `cat`, it is header-aware (`--H|header`), writing the header from only the first file. It also supports source tracking, adding a column indicating the original file to each row. Results are written to standard output.
+
+Concatenation with header support is useful when preparing data for traditional Unix utilities like `sort` and `sed` or applications that read a single file.
+
+Source tracking is useful when creating long/narrow form tabular data, a format used by many statistics and data mining packages. In this scenario, files have been used to capture related data sets, the difference between data sets being a condition represented by the file. For example, results from different variants of an experiment might each be recorded in their own files. Retaining the source file as an output column preserves the condition represented by the file.
+
+The file name (without extension) is used as the source value. This can be customized using the `--f|file` option.
+
+Example: Header processing:
+```
+$ tsv-append -H file1.tsv file2.tsv file3.tsv
+```
+
+Example: Header processing and source tracking:
+```
+$ tsv-append -H -t file1.tsv file2.tsv file3.tsv
+```
+
+Example: Source tracking with custom source values:
+```
+$ tsv-append -H -s test_id -f test1=file1.tsv -f test2=file2.tsv
+```
+
+**Options:**
+* `--h|help` - Print help.
+* `--help-verbose` - Print detailed help.
+* `--V|version` - Print version information and exit.
+* `--H|header` - Treat the first line of each file as a header. +* `--t|track-source` - Track the source file. Adds an column with the source name. +* `--s|source-header STR` - Use STR as the header for the source column. Implies `--H|header` and `--t|track-source`. Default: 'file' +* `--f|file STR=FILE` - Read file FILE, using STR as the 'source' value. Implies `--t|track-source`. +* `--d|delimiter CHR` - Field delimiter. Default: TAB. (Single byte UTF-8 characters only.) + +--- + +## tsv-filter reference + +_Note: See the [tsv-filter](../README.md#tsv-filter) description in the project [README](../README.md) for a tutorial style introduction._ + +**Synopsis:** tsv-filter [options] [file...] + +Filter lines of tab-delimited files via comparison tests against fields. Multiple tests can be specified, by default they are evaluated as AND clause. Lines satisfying the tests are written to standard output. + +**General options:** +* `--help` - Print help. +* `--help-verbose` - Print detailed help. +* `--help-options` - Print the options list by itself. +* `--V|version` - Print version information and exit. +* `--H|header` - Treat the first line of each file as a header. +* `--d|delimiter CHR` - Field delimiter. Default: TAB. (Single byte UTF-8 characters only.) +* `--or` - Evaluate tests as an OR rather than an AND. This applies globally. +* `--v|invert` - Invert the filter, printing lines that do not match. This applies globally. + +**Tests:** + +Empty and blank field tests: +* `--empty ` - True if field is empty (no characters) +* `--not-empty ` - True if field is not empty. +* `--blank ` - True if field is empty or all whitespace. +* `--not-blank ` - True if field contains a non-whitespace character. + +Numeric type tests: +* `--is-numeric ` - True if the field can be interpreted as a number. +* `--is-finite ` - True if the field can be interpreted as a number, and it is not NaN or infinity. +* `--is-nan ` - True if the field is NaN (including: "nan", "NaN", "NAN"). +* `--is-infinity ` - True if the field is infinity (including: "inf", "INF", "-inf", "-INF") + +Numeric comparisons: +* `--le :NUM` - FIELD <= NUM (numeric). +* `--lt :NUM` - FIELD < NUM (numeric). +* `--ge :NUM` - FIELD >= NUM (numeric). +* `--gt :NUM` - FIELD > NUM (numeric). +* `--eq :NUM` - FIELD == NUM (numeric). +* `--ne :NUM` - FIELD != NUM (numeric). + +String comparisons: +* `--str-le :STR` - FIELD <= STR (string). +* `--str-lt :STR` - FIELD < STR (string). +* `--str-ge :STR` - FIELD >= STR (string). +* `--str-gt :STR` - FIELD > STR (string). +* `--str-eq :STR` - FIELD == STR (string). +* `--istr-eq :STR` - FIELD == STR (string, case-insensitive). +* `--str-ne :STR` - FIELD != STR (string). +* `--istr-ne :STR` - FIELD != STR (string, case-insensitive). +* `--str-in-fld :STR` - FIELD contains STR (substring search). +* `--istr-in-fld :STR` - FIELD contains STR (substring search, case-insensitive). +* `--str-not-in-fld :STR` - FIELD does not contain STR (substring search). +* `--istr-not-in-fld :STR` - FIELD does not contain STR (substring search, case-insensitive). + +Regular expression tests: +* `--regex :REGEX` - FIELD matches regular expression. +* `--iregex :REGEX` - FIELD matches regular expression, case-insensitive. +* `--not-regex :REGEX` - FIELD does not match regular expression. +* `--not-iregex :REGEX` - FIELD does not match regular expression, case-insensitive. + +Field length tests +* `--char-len-le :NUM` - FIELD character length <= NUM. +* `--char-len-lt :NUM` - FIELD character length < NUM. 
+* `--char-len-ge :NUM` - FIELD character length >= NUM. +* `--char-len-gt :NUM` - FIELD character length > NUM. +* `--char-len-eq :NUM` - FIELD character length == NUM. +* `--char-len-ne :NUM` - FIELD character length != NUM. +* `--byte-len-le :NUM` - FIELD byte length <= NUM. +* `--byte-len-lt :NUM` - FIELD byte length < NUM. +* `--byte-len-ge :NUM` - FIELD byte length >= NUM. +* `--byte-len-gt :NUM` - FIELD byte length > NUM. +* `--byte-len-eq :NUM` - FIELD byte length == NUM. +* `--byte-len-ne :NUM` - FIELD byte length != NUM. + +Field to field comparisons: +* `--ff-le FIELD1:FIELD2` - FIELD1 <= FIELD2 (numeric). +* `--ff-lt FIELD1:FIELD2` - FIELD1 < FIELD2 (numeric). +* `--ff-ge FIELD1:FIELD2` - FIELD1 >= FIELD2 (numeric). +* `--ff-gt FIELD1:FIELD2` - FIELD1 > FIELD2 (numeric). +* `--ff-eq FIELD1:FIELD2` - FIELD1 == FIELD2 (numeric). +* `--ff-ne FIELD1:FIELD2` - FIELD1 != FIELD2 (numeric). +* `--ff-str-eq FIELD1:FIELD2` - FIELD1 == FIELD2 (string). +* `--ff-istr-eq FIELD1:FIELD2` - FIELD1 == FIELD2 (string, case-insensitive). +* `--ff-str-ne FIELD1:FIELD2` - FIELD1 != FIELD2 (string). +* `--ff-istr-ne FIELD1:FIELD2` - FIELD1 != FIELD2 (string, case-insensitive). +* `--ff-absdiff-le FIELD1:FIELD2:NUM` - abs(FIELD1 - FIELD2) <= NUM +* `--ff-absdiff-gt FIELD1:FIELD2:NUM` - abs(FIELD1 - FIELD2) > NUM +* `--ff-reldiff-le FIELD1:FIELD2:NUM` - abs(FIELD1 - FIELD2) / min(abs(FIELD1), abs(FIELD2)) <= NUM +* `--ff-reldiff-gt FIELD1:FIELD2:NUM` - abs(FIELD1 - FIELD2) / min(abs(FIELD1), abs(FIELD2)) > NUM + +**Examples:** + +Basic comparisons: +``` +$ # Field 2 non-zero +$ tsv-filter --ne 2:0 data.tsv + +$ # Field 1 == 0 and Field 2 >= 100, first line is a header. +$ tsv-filter --header --eq 1:0 --ge 2:100 data.tsv + +$ # Field 1 == -1 or Field 1 > 100 +$ tsv-filter --or --eq 1:-1 --gt 1:100 + +$ # Field 3 is foo, Field 4 contains bar +$ tsv-filter --header --str-eq 3:foo --str-in-fld 4:bar data.tsv + +$ # Field 3 == field 4 (numeric test) +$ tsv-filter --header --ff-eq 3:4 data.tsv +``` + +Field lists: + +Field lists can be used to run the same test on multiple fields. For example: +``` +$ # Test that fields 1-10 are not blank +$ tsv-filter --not-blank 1-10 data.tsv + +$ # Test that fields 1-5 are not zero +$ tsv-filter --ne 1-5:0 + +$ # Test that fields 1-5, 7, and 10-20 are less than 100 +$ tsv-filter --lt 1-5,7,10-20:100 +``` + +Regular expressions: + +The regular expression syntax supported is that defined by the [D regex library](). The basic syntax has become quite standard and is used by many tools. It will rarely be necessary to consult the D language documentation. A general reference such as the guide available at [Regular-Expressions.info](http://www.regular-expressions.info/) will suffice in nearly all cases. (Note: Unicode properties are supported.) + +``` +$ # Field 2 has a sequence with two a's, one or more digits, then 2 a's. +$ tsv-filter --regex '2:aa[0-9]+aa' data.tsv + +$ # Same thing, except the field starts and ends with the two a's. +$ tsv-filter --regex '2:^aa[0-9]+aa$' data.tsv + +$ # Field 2 is a sequence of "word" characters with two or more embedded +$ # whitespace sequences (match against entire field) +$ tsv-filter --regex '2:^\w+\s+(\w+\s+)+\w+$' data.tsv + +$ # Field 2 containing at least one cyrillic character. +$ tsv-filter --regex '2:\p{Cyrillic}' data.tsv +``` + +Short-circuiting expressions: + +Numeric tests like `--gt` (greater-than) assume field values can be interpreted as numbers. 
An error occurs if the field cannot be parsed as a number, halting the program. This can be avoided by including a test that ensures the field is recognizable as a number. For example:
+
+```
+$ # Ensure field 2 is a number before testing for greater-than 10.
+$ tsv-filter --is-numeric 2 --gt 2:10 data.tsv
+
+$ # Ensure field 2 is a number, not NaN or infinity before greater-than test.
+$ tsv-filter --is-finite 2 --gt 2:10 data.tsv
+```
+
+The above tests work because `tsv-filter` short-circuits evaluation, only running as many tests as necessary to filter each line. Tests are run in the order listed on the command line. In the first example, if `--is-numeric 2` is false, the remaining tests do not get run.
+
+_**Tip:**_ Bash completion is very helpful when using commands like `tsv-filter` that have many options. See [Enable bash-completion](TipsAndTricks.md#enable-bash-completion) for details.
+
+---
+
+## tsv-join reference
+
+**Synopsis:** tsv-join --filter-file file [options] file [file...]
+
+tsv-join matches input lines against lines from a 'filter' file. The match is based on exact match comparison of one or more 'key' fields. Fields are TAB delimited by default. Matching lines are written to standard output, along with any additional fields from the filter file that have been specified.
+
+**Options:**
+* `--h|help` - Print help.
+* `--help-verbose` - Print detailed help.
+* `--V|version` - Print version information and exit.
+* `--f|filter-file FILE` - (Required) File with records to use as a filter.
+* `--k|key-fields <field-list>` - Fields to use as join key. Default: 0 (entire line).
+* `--d|data-fields <field-list>` - Data record fields to use as join key, if different than `--key-fields`.
+* `--a|append-fields <field-list>` - Filter fields to append to matched records.
+* `--H|header` - Treat the first line of each file as a header.
+* `--p|prefix STR` - String to use as a prefix for `--append-fields` when writing a header line.
+* `--w|write-all STR` - Output all data records. STR is the `--append-fields` value when writing unmatched records. This is an outer join.
+* `--e|exclude` - Exclude matching records. This is an anti-join.
+* `--delimiter CHR` - Field delimiter. Default: TAB. (Single byte UTF-8 characters only.)
+* `--z|allow-duplicate-keys` - Allow duplicate keys with different append values (last entry wins). Default behavior is that this is an error.
+
+**Examples:**
+
+Filter one file based on another, using the full line as the key.
+```
+$ # Output lines in data.txt that appear in filter.txt
+$ tsv-join -f filter.txt data.txt
+
+$ # Output lines in data.txt that do not appear in filter.txt
+$ tsv-join -f filter.txt --exclude data.txt
+```
+
+Filter multiple files, using fields 2 & 3 as the filter key.
+```
+$ tsv-join -f filter.tsv --key-fields 2,3 data1.tsv data2.tsv data3.tsv
+```
+
+Same as previous, except use fields 4 & 5 from the data files.
+```
+$ tsv-join -f filter.tsv --key-fields 2,3 --data-fields 4,5 data1.tsv data2.tsv data3.tsv
+```
+
+Append fields from the filter file to matched records.
+```
+$ tsv-join -f filter.tsv --key-fields 1 --append-fields 2-5 data.tsv
+```
+
+Write out all records from the data file, but when there is no match, write the 'append fields' as NULL. This is an outer join.
+```
+$ tsv-join -f filter.tsv --key-fields 1 --append-fields 2 --write-all NULL data.tsv
+```
+
+Managing headers: Often it's useful to join a field from one data file to another, where the data fields are related and the headers have the same name in both files.
They can be kept distinct by adding a prefix to the filter file header. Example:
+```
+$ tsv-join -f run1.tsv --header --key-fields 1 --append-fields 2 --prefix run1_ run2.tsv
+```
+
+---
+
+## tsv-pretty reference
+
+**Synopsis:** tsv-pretty [options] [file...]
+
+`tsv-pretty` outputs TSV data in a format intended to be more human readable when working on the command line. This is done primarily by lining up data into fixed-width columns. Text is left aligned, numbers are right aligned. Floating point numbers are aligned on the decimal point when feasible.
+
+Processing begins by reading the initial set of lines into memory to determine the field widths and data types of each column. This look-ahead buffer is used for header detection as well. Output begins after this processing is complete.
+
+By default, only the alignment is changed; the actual values are not modified. Several of the formatting options do modify the values.
+
+Features:
+
+* Floating point numbers: Floats can be printed in fixed-width precision, using the same precision for all floats in a column. This makes them line up nicely. Precision is determined by values seen during look-ahead processing. The max precision defaults to 9; this can be changed when smaller or larger values are desired. See the `--f|format-floats` and `--p|precision` options.
+
+* Header lines: Headers are detected automatically when possible. This can be overridden when automatic detection doesn't work as desired. Headers can be underlined and repeated at regular intervals.
+
+* Missing values: A substitute value can be used for empty fields. This is often less confusing than spaces. See `--e|replace-empty` and `--E|empty-replacement`.
+
+* Exponential notation: As part of float formatting, `--f|format-floats` re-formats columns where exponential notation is found so all the values in the column are displayed using exponential notation and the same precision.
+
+* Preamble: A number of initial lines can be designated as a preamble and output unchanged. The preamble is before the header, if a header is present. Preamble lines can be auto-detected via the heuristic that they lack field delimiters. This works well when the field delimiter is a TAB.
+
+* Fonts: Fixed-width fonts are assumed. CJK characters are assumed to be double width. This is not always correct, but works well in most cases.
+
+**Options:**
+
+* `--help-verbose` - Print full help.
+* `--H|header` - Treat the first line of each file as a header.
+* `--x|no-header` - Assume no header. Turns off automatic header detection.
+* `--l|lookahead NUM` - Lines to read to interpret data before generating output. Default: 1000
+* `--r|repeat-header NUM` - Lines to print before repeating the header. Default: No repeating header
+* `--u|underline-header` - Underline the header.
+* `--f|format-floats` - Format floats for better readability. Default: No
+* `--p|precision NUM` - Max floating point precision. Implies `--format-floats`. Default: 9
+* `--e|replace-empty` - Replace empty fields with `--`.
+* `--E|empty-replacement STR` - Replace empty fields with a string.
+* `--d|delimiter CHR` - Field delimiter. Default: TAB. (Single byte UTF-8 characters only.)
+* `--s|space-between-fields NUM` - Spaces between each field (Default: 2)
+* `--m|max-text-width NUM` - Max reserved field width for variable width text fields. Default: 40
+* `--a|auto-preamble` - Treat initial lines in a file as a preamble if the line contains no field delimiters. The preamble is output unchanged.
+* `--b|preamble NUM` - Treat the first NUM lines as a preamble and output them unchanged. +* `--V|version` - Print version information and exit. +* `--h|help` - This help information. + +**Examples:** + +A tab-delimited file printed without any formatting: +``` +$ cat sample.tsv +Color Count Ht Wt +Brown 106 202.2 1.5 +Canary Yellow 7 106 0.761 +Chartreuse 1139 77.02 6.22 +Fluorescent Orange 422 1141.7 7.921 +Grey 19 140.3 1.03 +``` +The same file printed with `tsv-pretty`: +``` +$ tsv-pretty sample.tsv +Color Count Ht Wt +Brown 106 202.2 1.5 +Canary Yellow 7 106 0.761 +Chartreuse 1139 77.02 6.22 +Fluorescent Orange 422 1141.7 7.921 +Grey 19 140.3 1.03 +``` +Printed with float formatting and header underlining: +``` +$ tsv-pretty -f -u sample.tsv +Color Count Ht Wt +----- ----- -- -- +Brown 106 202.20 1.500 +Canary Yellow 7 106.00 0.761 +Chartreuse 1139 77.02 6.220 +Fluorescent Orange 422 1141.70 7.921 +Grey 19 140.30 1.030 +``` +Printed with setting the precision to one: +``` +$ tsv-pretty -u -p 1 sample.tsv +Color Count Ht Wt +----- ----- -- -- +Brown 106 202.2 1.5 +Canary Yellow 7 106.0 0.8 +Chartreuse 1139 77.0 6.2 +Fluorescent Orange 422 1141.7 7.9 +Grey 19 140.3 1.0 +``` + +--- + +## tsv-sample reference + +**Synopsis:** tsv-sample [options] [file...] + +`tsv-sample` subsamples input lines or randomizes their order. Several techniques are available: shuffling, simple random sampling, weighted random sampling, Bernoulli sampling, and distinct sampling. These are provided via several different modes operation: + +* **Shuffling** (_default_): All lines are read into memory and output in random order. All orderings are equally likely. +* **Simple random sampling** (`--n|num N`): A random sample of `N` lines is selected and written to standard output. Selected lines are written in random order, similar to shuffling. All sample sets and orderings are equally likely. Use `--i|inorder` to preserve the original input order. +* **Weighted random sampling** (`--n|num N`, `--w|weight-field F`): A weighted sample of N lines is selected using weights from a field on each line. Selected lines are written in weighted selection order. Use `--i|inorder` to preserve the original input order. Omit `--n|num` to shuffle all input lines (weighted shuffling). +* **Sampling with replacement** (`--r|replace`, `--n|num N`): All lines are read into memory, then lines are selected one at a time at random and written out. Lines can be selected multiple times. Output continues until `N` samples have been written. Output continues forever if `--n|num` is zero or not specified. +* **Bernoulli sampling** (`--p|prob P`): Lines are read one-at-a-time in a streaming fashion and a random subset is output based on the inclusion probability. For example, `--prob 0.2` gives each line a 20% chance of being selected. All lines have an equal likelihood of being selected. The order of the lines is unchanged. +* **Distinct sampling** (`--k|key-fields F`, `--p|prob P`): Input lines are sampled based on a key from each line. A key is made up of one or more fields. A subset of the keys are chosen based on the inclusion probability (a "distinct" set of keys). All lines with one of the selected keys are output. This is a streaming operation: a decision is made on each line as it is read. The order of the lines is not changed. + +**Sample size**: The `--n|num` option controls the sample size for all sampling methods. In the case of simple and weighted random sampling it also limits the amount of memory required. 
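+
+The commands below sketch one invocation for each of the modes described above. The file name and field numbers are illustrative only:
+```
+$ tsv-sample data.tsv                # Shuffle all input lines
+$ tsv-sample -n 1000 data.tsv        # Simple random sample of 1000 lines
+$ tsv-sample -n 1000 -w 4 data.tsv   # Weighted sample, weights from field 4
+$ tsv-sample -r -n 1000 data.tsv     # Sampling with replacement, 1000 lines output
+$ tsv-sample -p 0.1 data.tsv         # Bernoulli sampling, 10% inclusion probability
+$ tsv-sample -p 0.1 -k 2 data.tsv    # Distinct sampling, keys from field 2
+```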
+
+**Performance and memory use**: `tsv-sample` is designed for large data sets. Algorithms make one pass over the data, using reservoir sampling and hashing when possible to limit the memory required. Bernoulli sampling and distinct sampling make immediate decisions on each line, with no memory accumulation. They can operate on arbitrary length data streams. Sampling with replacement reads all lines into memory and is limited by available memory. Shuffling also reads all lines into memory and is similarly limited. Simple and weighted random sampling use reservoir sampling algorithms and only need to hold the sample size (`--n|num`) in memory. See [Shuffling large files](TipsAndTricks.md#shuffling-large-files) for ways to use disk when available memory is not sufficient.
+
+**Controlling randomization**: Each run produces a different randomization. Using `--s|static-seed` changes this so multiple runs produce the same randomization. This works by using the same random seed each run. The random seed can be specified using `--v|seed-value`. This takes a non-zero, 32-bit positive integer. A zero value is a no-op and ignored.
+
+**Weighted sampling**: Weighted line order randomization is done using an algorithm for weighted reservoir sampling described by Pavlos Efraimidis and Paul Spirakis. Weights should be positive values representing the relative weight of the entry in the collection. Counts and similar can be used as weights; it is *not* necessary to normalize to a [0,1] interval. Negative values are not meaningful and are given the value zero. Input order is not retained; instead, lines are output ordered by the randomized weight that was assigned. This means that a smaller valid sample can be produced by taking the first N lines of output. For more information see:
+* Wikipedia: https://en.wikipedia.org/wiki/Reservoir_sampling
+* "Weighted Random Sampling over Data Streams", Pavlos S. Efraimidis (https://arxiv.org/abs/1012.0256)
+
+**Distinct sampling**: Distinct sampling selects a subset based on a key in the data. Consider a query log with records consisting of (user, query, clicked-url) triples. Distinct sampling selects all records matching a subset of values from one of the fields. For example, all events for ten percent of the users. This is important for certain types of analysis. Distinct sampling works by converting the specified probability (`--p|prob`) into a set of buckets and mapping every key into one of the buckets. One bucket is used to select records in the sample. Buckets are equal size and therefore may be a bit larger than the inclusion probability. Since every key is assigned a bucket, this method can also be used to fully divide a set of records into distinct groups. (See *Printing random values* below.) The term "distinct sampling" originates from algorithms estimating the number of distinct elements in extremely large data sets.
+
+**Printing random values**: Most of these algorithms work by generating a random value for each line. (See also "Compatibility mode" below.) The nature of these values depends on the sampling algorithm. They are used for both line selection and output ordering. The `--print-random` option can be used to print these values. The random value is prepended to the line separated by the `--d|delimiter` char (TAB by default). The `--gen-random-inorder` option takes this one step further, generating random values for all input lines without changing the input order.
The types of values currently used are specific to the sampling algorithm: +* Shuffling, simple random sampling, Bernoulli sampling: Uniform random value in the interval [0,1]. +* Weighted random sampling: Value in the interval [0,1]. Distribution depends on the values in the weight field. +* Distinct sampling: An integer, zero and up, representing a selection group (aka. "bucket"). The inclusion probability determines the number of selection groups. +* Sampling with replacement: Random value printing is not supported. + +The specifics behind these random values are subject to change in future releases. + +**Compatibility mode**: As described above, many of the sampling algorithms assign a random value to each line. This is useful when printing random values. It has another occasionally useful property: repeated runs with the same static seed but different selection parameters are more compatible with each other, as each line gets assigned the same random value on every run. This property comes at a cost: in some cases there are faster algorithms that don't assign random values to each line. By default, `tsv-sample` will use the fastest algorithm available. The `--compatibility-mode` option changes this, switching to algorithms that assign a random value per line. Printing random values also engages compatibility mode. Compatibility mode is beneficial primarily when using Bernoulli sampling or random sampling: +* Bernoulli sampling - A run with a larger probability will be a superset of a smaller probability. In the example below, all lines selected in the first run are also selected in the second. + ``` + $ tsv-sample --static-seed --compatibility-mode --prob 0.2 data.tsv + $ tsv-sample --static-seed --compatibility-mode --prob 0.3 data.tsv + ``` +* Random sampling - A run with a larger sample size will be a superset of a smaller sample size. In the example below, all lines selected in the first run are also selected in the second. + ``` + $ tsv-sample --static-seed --compatibility-mode -n 1000 data.tsv + $ tsv-sample --static-seed --compatibility-mode -n 1500 data.tsv + ``` + This works for weighted sampling as well. + +**Options:** + +* `--h|help` - This help information. +* `--help-verbose` - Print more detailed help. +* `--V|version` - Print version information and exit. +* `--H|header` - Treat the first line of each file as a header. +* `--n|num NUM` - Maximum number of lines to output. All selected lines are output if not provided or zero. +* `--p|prob NUM` - Inclusion probability (0.0 < NUM <= 1.0). For Bernoulli sampling, the probability each line is selected output. For distinct sampling, the probability each unique key is selected for output. +* `--k|key-fields ` - Fields to use as key for distinct sampling. Use with `--p|prob`. Specify `--k|key-fields 0` to use the entire line as the key. +* `--w|weight-field NUM` - Field containing weights. All lines get equal weight if not provided or zero. +* `--r|replace` - Simple random sampling with replacement. Use `--n|num` to specify the sample size. +* `--s|static-seed` - Use the same random seed every run. +* `--v|seed-value NUM` - Sets the random seed. Use a non-zero, 32 bit positive integer. Zero is a no-op. +* `--print-random` - Output the random values that were assigned. +* `--gen-random-inorder` - Output all lines with assigned random values prepended, no changes to the order of input. +* `--random-value-header` - Header to use with `--print-random` and `--gen-random-inorder`. Default: `random_value`. 
+* `--compatibility-mode` - Turns on "compatibility mode". +* `--d|delimiter CHR` - Field delimiter. +* `--prefer-skip-sampling` - (Internal) Prefer the skip-sampling algorithm for Bernoulli sampling. Used for testing and diagnostics. +* `--prefer-algorithm-r` - (Internal) Prefer Algorithm R for unweighted line order randomization. Used for testing and diagnostics. + +--- + +## tsv-select reference + +**Synopsis:** tsv-select [options] [file...] + +tsv-select reads files or standard input and writes specified fields to standard output in the order listed. Similar to Unix `cut` with the ability to reorder fields. + +Fields numbers start with one. They are comma separated, and ranges can be used. Fields can be listed more than once, and fields not listed can be selected as a group using the `--rest` option. When working with multiple files, the `--header` option can be used to retain the header from the just the first file. + +Fields can be excluded using `--e|exclude`. All fields not excluded are output. `--f|fields` and `--r|rest` can be used with `--e|exclude` to change the order of non-excluded fields. + +**Options:** +* `--h|help` - Print help. +* `--help-verbose` - Print more detailed help. +* `--V|version` - Print version information and exit. +* `--H|header` - Treat the first line of each file as a header. +* `--f|fields ` - Fields to retain. Fields are output in the order listed. +* `--e|--exclude ` - Fields to exclude. +* `--r|rest first|last` - Output location for fields not included in the `--f|fields` field-list. +* `--d|delimiter CHR` - Character to use as field delimiter. Default: TAB. (Single byte UTF-8 characters only.) + +**Notes:** +* One of `--f|fields` or `--e|exclude` is required. +* Fields specified by `--f|fields` and `--e|exclude` cannot overlap. +* When `--f|fields` and `--e|exclude` are used together, the effect is to specify `--rest last`. This can be overridden by specifying `--rest first`. +* Each input line must be long enough to contain all fields specified with `--f|fields`. This is not necessary for `--e|exclude` fields. + +**Examples:** +``` +$ # Keep the first field from two files +$ tsv-select -f 1 file1.tsv file2.tsv + +$ # Keep fields 1 and 2, retain the header from the first file +$ tsv-select -H -f 1,2 file1.tsv file2.tsv + +$ # Output fields 2 and 1, in that order +$ tsv-select -f 2,1 file.tsv + +$ # Output a range of fields +$ tsv-select -f 3-30 file.tsv + +$ # Output a range of fields in reverse order +$ tsv-select -f 30-3 file.tsv + +$ # Drop the first field, keep everything else +$ # Equivalent to 'cut -f 2- file.tsv' +$ tsv-select --exclude 1 file.tsv +$ tsv-select -e 1 file.tsv + +$ # Move field 1 to the end of the line +$ tsv-select -f 1 --rest first file.tsv + +$ # Move fields 7 and 3 to the start of the line +$ tsv-select -f 7,3 --rest last file.tsv + +# Output with repeating fields +$ tsv-select -f 1,2,1 file.tsv +$ tsv-select -f 1-3,3-1 file.tsv + +$ # Read from standard input +$ cat file*.tsv | tsv-select -f 1,4-7,11 + +$ # Read from a file and standard input. The '--' terminates command +$ # option processing, '-' represents standard input. +$ cat file1.tsv | tsv-select -f 1-3 -- - file2.tsv + +$ # Files using comma as the separator ('simple csv') +$ # (Note: Does not handle CSV escapes.) 
+$ tsv-select -d , --fields 5,1,2 file.csv
+
+$ # Move field 2 to the front and drop fields 10-15
+$ tsv-select -f 2 -e 10-15 file.tsv
+
+$ # Move field 2 to the end, dropping fields 10-15
+$ tsv-select -f 2 --rest first -e 10-15 file.tsv
+```
+
+---
+
+## tsv-split reference
+
+Synopsis: tsv-split [options] [file...]
+
+Split input lines into multiple output files. There are three modes of operation:
+
+* **Fixed number of lines per file** (`--l|lines-per-file NUM`): Each input block of NUM lines is written to a new file. Similar to Unix `split`.
+
+* **Random assignment** (`--n|num-files NUM`): Each input line is written to a randomly selected output file. Random selection is from NUM files.
+
+* **Random assignment by key** (`--n|num-files NUM`, `--k|key-fields FIELDS`): Input lines are written to output files using fields as a key. Each unique key is randomly assigned to one of NUM output files. All lines with the same key are written to the same file.
+
+**Output files**: By default, files are written to the current directory and have names of the form `part_NNN<suffix>`, with `NNN` being a number and `<suffix>` being the extension of the first input file. If the input file is `file.txt`, the names will take the form `part_NNN.txt`. The suffix is empty when reading from standard input. The numeric part defaults to 3 digits for `--l|lines-per-file`. For `--n|num-files` enough digits are used so all filenames are the same length. The output directory and file names are customizable.
+
+**Header lines**: There are two ways to handle input with headers: write a header to all output files (`--H|header`), or exclude headers from all output files (`--I|header-in-only`). The best choice depends on the follow-up processing. All tsv-utils tools support header lines in multiple input files, but many other tools do not. For example, [GNU parallel](https://www.gnu.org/software/parallel/) works best on files without header lines. (See [Faster processing using GNU parallel](TipsAndTricks.md#faster-processing-using-gnu-parallel) for some info on using GNU parallel and tsv-utils together.)
+
+**About Random assignment** (`--n|num-files`): Random distribution of records to a set of files is a common task. When data fits in memory, the preferred approach is usually to shuffle the data and split it into fixed sized blocks. Both of the following command lines accomplish this:
+```
+$ shuf data.tsv | split -l NUM
+$ tsv-sample data.tsv | tsv-split -l NUM
+```
+
+However, alternate approaches are needed when data is too large for convenient shuffling. tsv-split's random assignment feature can be useful in these cases. Each input line is written to a randomly selected output file. Note that output files will have similar but not identical numbers of records.
+
+**About Random assignment by key** (`--n|num-files NUM`, `--k|key-fields FIELDS`): This splits a data set into multiple files sharded by key. All lines with the same key are written to the same file. This partitioning enables parallel computation based on the key. For example, statistical calculation (`tsv-summarize --group-by`) or duplicate removal (`tsv-uniq --fields`). These operations can be parallelized using tools like GNU parallel, which simplifies concurrent operations on multiple files.
+
+**Random seed**: By default, each tsv-split invocation using random assignment or random assignment by key produces different assignments to the output files. Using `--s|static-seed` changes this so multiple runs produce the same assignments.
This works by using the same random seed each run. The seed can be specified using `--v|seed-value`. + +**Appending to existing files**: By default, an error is triggered if an output file already exists. `--a|append` changes this so that lines are appended to existing files. (Header lines are not appended to files with data.) This is useful when adding new data to files created by a previous `tsv-split` run. Random assignment should use the same `--n|num-files` value each run, but different random seeds (avoid `--s|static-seed`). Random assignment by key should use the same `--n|num-files`, `--k|key-fields`, and seed (`--s|static-seed` or `--v|seed-value`) each run. + +**Max number of open files**: Random assignment and random assignment by key are dramatically faster when all output files are kept open. However, keeping a large numbers of open files can bump into system limits or limit resources available to other processes. By default, `tsv-split` uses up to 4096 open files or the system per-process limit, whichever is smaller. This can be changed using `--max-open-files`, though it cannot be set larger than the system limit. The system limit varies considerably between systems. On many systems it is unlimited. On MacOS it is often set to 256. Use Unix `ulimit` to display and modify the limits: +``` +$ ulimit -n # Show the "soft limit". The per-process maximum. +$ ulimit -Hn # Show the "hard limit". The max allowed soft limit. +$ ulimit -Sn NUM # Change the "soft limit" to NUM. +``` + +**Examples**: +``` +$ # Split a 10 million line file into 1000 files, 10,000 lines each. +$ # Output files are part_000.txt, part_001.txt, ... part_999.txt. +$ tsv-split data.txt --lines-per-file 10000 + +$ # Same as the previous example, but write files to a subdirectory. +$ tsv-split data.txt --dir split_files --lines-per-file 10000 + +$ # Split a file into 10,000 line files, writing a header line to each +$ tsv-split data.txt -H --lines-per-file 10000 + +$ # Same as the previous example, but dropping the header line. +$ tsv-split data.txt -I --lines-per-file 10000 + +$ # Randomly assign lines to 1000 files +$ tsv-split data.txt --num-files 1000 + +$ # Randomly assign lines to 1000 files while keeping unique keys from +$ # field 3 together. +$ tsv-split data.tsv --num-files 1000 -k 3 + +$ # Randomly assign lines to 1000 files. Later, randomly assign lines +$ # from a second data file to the same output files. +$ tsv-split data1.tsv -n 1000 +$ tsv-split data2.tsv -n 1000 --append + +$ # Randomly assign lines to 1000 files using field 3 as a key. +$ # Later, add a second file to the same output files. +$ tsv-split data1.tsv -n 1000 -k 3 --static-seed +$ tsv-split data2.tsv -n 1000 -k 3 --static-seed --append + +$ # Change the system per-process open file limit for one command. +$ # The parens create a sub-shell. The current shell is not changed. +$ ( ulimit -Sn 1000 && tsv-split --num-files 1000 data.txt ) +``` + +**Options**: +* `--h|--help` - Print help. +* `--help-verbose` - Print more detailed help. +* `--V|--version` - Print version information and exit. +* `--H|header` - Input files have a header line. Write the header to each output file. +* `--I|header-in-only` - Input files have a header line. Do not write the header to output files. +* `--l|lines-per-file NUM` - Number of lines to write to each output file (excluding the header line). +* `--n|num-files NUM` - Number of output files to generate. +* `--k|key-fields ` - Fields to use as key. 
Lines with the same key are written to the same output file. Use `--k|key-fields 0` to use the entire line as the key. +* `--dir STR` - Directory to write to. Default: Current working directory. +* `--prefix STR` - Filename prefix. Default: `part_` +* `--suffix STR` - Filename suffix. Default: First input file extension. None for standard input. +* `--w|digit-width NUM` - Number of digits in filename numeric portion. Default: `--l|lines-per-file`: 3. `--n|num-files`: Chosen so filenames have the same length. `--w|digit-width 0` uses the default. +* `--a|append` - Append to existing files. +* `--s|static-seed` - Use the same random seed every run. +* `--v|seed-value NUM` - Sets the random seed. Use a non-zero, 32 bit positive integer. Zero is a no-op. +* `--d|delimiter CHR` - Field delimiter. +* `--max-open-files NUM` - Maximum open file handles to use. Min of 5 required. + +--- + +## tsv-summarize reference + +Synopsis: tsv-summarize [options] file [file...] + +`tsv-summarize` generates summary statistics on fields of a TSV file. A variety of statistics are supported. Calculations can run against the entire data stream or grouped by key. Consider the file data.tsv: +``` +make color time +ford blue 131 +chevy green 124 +ford red 128 +bmw black 118 +bmw black 126 +ford blue 122 +``` + +The min and average 'time' values for the 'make' field is generated by the command: +``` +$ tsv-summarize --header --group-by 1 --min 3 --mean 3 data.tsv +``` + +This produces: +``` +make time_min time_mean +ford 122 127 +chevy 124 124 +bmw 118 122 +``` + +Using `--group-by 1,2` will group by both 'make' and 'color'. Omitting the `--group-by` entirely summarizes fields for full file. + +The program tries to generate useful headers, but custom headers can be specified. Example: +``` +$ tsv-summarize --header --group-by 1 --min 3:fastest --mean 3:average data.tsv +make fastest average +ford 122 127 +chevy 124 124 +bmw 118 122 +``` + +Most operators take custom headers in a manner shown above, following the syntax: +``` +-- FIELD[:header] +``` + +Operators can be specified multiple times. They can also take multiple fields (though not when a custom header is specified). Examples: +``` +--median 2,3,4 +--median 1,5-8 +``` + +The quantile operator requires one or more probabilities after the fields: +``` +--quantile 2:0.25 # Quantile 1 of field 2 +--quantile 2-4:0.25,0.5,0.75 # Q1, Median, Q3 of fields 2, 3, 4 +``` + +Summarization operators available are: +``` + count range mad values + retain sum var unique-values + first mean stddev unique-count + last median mode missing-count + min quantile mode-count not-missing-count + max +``` + +Calculated numeric values are printed to 12 significant digits by default. This can be changed using the `--p|float-precision` option. If six or less it sets the number of significant digits after the decimal point. If greater than six it sets the total number of significant digits. + +Calculations hold onto the minimum data needed while reading data. A few operations like median keep all data values in memory. These operations will start to encounter performance issues as available memory becomes scarce. The size that can be handled effectively is machine dependent, but often quite large files can be handled. + +Operations requiring numeric entries will signal an error and terminate processing if a non-numeric entry is found. + +Missing values are not treated specially by default, this can be changed using the `--x|exclude-missing` or `--r|replace-missing` option. 
The former turns off processing for missing values, the latter uses a replacement value. + +**Options:** +* `--h|help` - Print help. +* `--help-verbose` - Print detailed help. +* `--V|version` - Print version information and exit. +* `--g|group-by ` - Fields to use as key. +* `--H|header` - Treat the first line of each file as a header. +* `--w|write-header` - Write an output header even if there is no input header. +* `--d|delimiter CHR` - Field delimiter. Default: TAB. (Single byte UTF-8 characters only.) +* `--v|values-delimiter CHR` - Values delimiter. Default: vertical bar (\|). (Single byte UTF-8 characters only.) +* `--p|float-precision NUM` - 'Precision' to use printing floating point numbers. Affects the number of digits printed and exponent use. Default: 12 +* `--x|exclude-missing` - Exclude missing (empty) fields from calculations. +* `--r|replace-missing STR` - Replace missing (empty) fields with STR in calculations. + +**Operators:** +* `--count` - Count occurrences of each unique key (`--g|group-by`), or the total number of records if no key field is specified. +* `--count-header STR` - Count occurrences of each unique key, like `--count`, but use STR as the header. +* `--retain ` - Retain one copy of the field. The field header is unchanged. +* `--first [:STR]` - First value seen. +* `--last [:STR]`- Last value seen. +* `--min [:STR]` - Min value. (Numeric fields only.) +* `--max [:STR]` - Max value. (Numeric fields only.) +* `--range [:STR]` - Difference between min and max values. (Numeric fields only.) +* `--sum [:STR]` - Sum of the values. (Numeric fields only.) +* `--mean [:STR]` - Mean (average). (Numeric fields only.) +* `--median [:STR]` - Median value. (Numeric fields only. Reads all values into memory.) +* `--quantile :p[,p...][:STR]` - Quantiles. One or more fields, then one or more 0.0-1.0 probabilities. (Numeric fields only. Reads all values into memory.) +* `--mad [:STR]` - Median absolute deviation from the median. Raw value, not scaled. (Numeric fields only. Reads all values into memory.) +* `--var [:STR]` - Variance. (Sample variance, numeric fields only). +* `--stdev [:STR]` - Standard deviation. (Sample st.dev, numeric fields only). +* `--mode [:STR]` - Mode. The most frequent value. (Reads all unique values into memory.) +* `--mode-count [:STR]` - Count of the most frequent value. (Reads all unique values into memory.) +* `--unique-count [:STR]` - Number of unique values. (Reads all unique values into memory). +* `--missing-count [:STR]` - Number of missing (empty) fields. Not affected by the `--x|exclude-missing` or `--r|replace-missing` options. +* `--not-missing-count [:STR]` - Number of filled (non-empty) fields. Not affected by `--r|replace-missing`. +* `--values [:STR]` - All the values, separated by `--v|values-delimiter`. (Reads all values into memory.) +* `--unique-values [:STR]` - All the unique values, separated by `--v|values-delimiter`. (Reads all unique values into memory.) + +_**Tip:**_ Bash completion is very helpful when using commands like `tsv-summarize` that have many options. See [Enable bash-completion](TipsAndTricks.md#enable-bash-completion) for details. + +--- + +## tsv-uniq reference + +`tsv-uniq` identifies equivalent lines in files or standard input. Input is read line by line, recording a key based on one or more of the fields. Two lines are equivalent if they have the same key. When operating in the default 'uniq' mode, the first time a key is seen the line is written to standard output. 
Subsequent lines having the same key are discarded. This is similar to the Unix `uniq` program, but based on individual fields and without requiring sorted data. + +`tsv-uniq` can be run without specifying a key field. In this case the whole line is used as a key, same as the Unix `uniq` program. As with `uniq`, this works on any line-oriented text file, not just TSV files. There is no need to sort the data and the original input order is preserved. + +The alternates to the default 'uniq' mode are 'number' mode and 'equiv-class' mode. In 'equiv-class' mode (`--e|equiv`), all lines are written to standard output, but with a field appended marking equivalent entries with an ID. The ID is a one-upped counter. + +'Number' mode (`--z|number`) also writes all lines to standard output, but with a field appended numbering the occurrence count for the line's key. The first line with a specific key is assigned the number '1', the second with the key is assigned number '2', etc. 'Number' and 'equiv-class' modes can be used together. + +The `--r|repeated` option can be used to print only lines occurring more than once. Specifically, the second occurrence of a key is printed. The `--a|at-least N` option is similar, printing lines occurring at least N times. (Like repeated, the Nth line with the key is printed.) + +The `--m|max MAX` option changes the behavior to output the first MAX lines for each key, rather than just the first line for each key. + +If both `--a|at-least` and `--m|max` are specified, the occurrences starting with 'at-least' and ending with 'max' are output. + +**Synopsis:** tsv-uniq [options] [file...] + +**Options:** +* `-h|help` - Print help. +* `--help-verbose` - Print detailed help. +* `--V|version` - Print version information and exit. +* `--H|header` - Treat the first line of each file as a header. +* `--f|fields ` - Fields to use as the key. Default: 0 (entire line). +* `--i|ignore-case` - Ignore case when comparing keys. +* `--e|equiv` - Output equiv class IDs rather than uniq'ing entries. +* `--equiv-header STR` - Use STR as the equiv-id field header. Applies when using `--header --equiv`. Default: `equiv_id`. +* `--equiv-start INT` - Use INT as the first equiv-id. Default: 1. +* `--z|number` - Output equivalence class occurrence counts rather than uniq'ing entries. +* `--number-header STR` - Use STR as the `--number` field header (when using `-H --number`). Default: `equiv_line`. +* `--r|repeated` - Output only lines that are repeated (based on the key). +* `--a|at-least INT` - Output only lines that are repeated INT times (based on the key). Zero and one are ignored. +* `--m|max INT` - Max number of each unique key to output (zero is ignored). +* `--d|delimiter CHR` - Field delimiter. Default: TAB. (Single byte UTF-8 characters only.) 
+ +**Examples:** +``` +$ # Uniq a file, using the full line as the key +$ tsv-uniq data.txt + +$ # Same as above, but case-insensitive +$ tsv-uniq --ignore-case data.txt + +$ # Unique a file based on one field +$ tsv-unique -f 1 data.tsv + +$ # Unique a file based on two fields +$ tsv-uniq -f 1,2 data.tsv + +$ # Unique a file based on the first four fields +$ tsv-uniq -f 1-4 data.tsv + +$ # Output all the lines, generating an ID for each unique entry +$ tsv-uniq -f 1,2 --equiv data.tsv + +$ # Generate uniq IDs, but account for headers +$ tsv-uniq -f 1,2 --equiv --header data.tsv + +$ # Generate line numbers specific to each key +$ tsv-uniq -f 1,2 --number --header data.tsv + +$ # --Examples showing the data-- + +$ cat data.tsv +field1 field2 field2 +ABCD 1234 PQR +efgh 5678 stu +ABCD 1234 PQR +wxyz 1234 stu +efgh 5678 stu +ABCD 1234 PQR + +$ # Uniq using the full line as key +$ tsv-uniq -H data.tsv +field1 field2 field2 +ABCD 1234 PQR +efgh 5678 stu +wxyz 1234 stu + +$ # Uniq using field 2 as key +$ tsv-uniq -H -f 2 data.tsv +field1 field2 field2 +ABCD 1234 PQR +efgh 5678 stu + +$ # Generate equivalence class IDs +$ tsv-uniq -H --equiv data.tsv +field1 field2 field2 equiv_id +ABCD 1234 PQR 1 +efgh 5678 stu 2 +ABCD 1234 PQR 1 +wxyz 1234 stu 3 +efgh 5678 stu 2 +ABCD 1234 PQR 1 + +$ # Generate equivalence class IDs and line numbers +$ tsv-uniq -H --equiv --number data.tsv +field1 field2 field2 equiv_id equiv_line +ABCD 1234 PQR 1 1 +efgh 5678 stu 2 1 +ABCD 1234 PQR 1 2 +wxyz 1234 stu 3 1 +efgh 5678 stu 2 2 +ABCD 1234 PQR 1 3 +``` diff --git a/docs/ToolReference_v2.0.md b/docs/ToolReference_v2.0.md deleted file mode 100644 index 75c7680a..00000000 --- a/docs/ToolReference_v2.0.md +++ /dev/null @@ -1,21 +0,0 @@ -_Visit the [TSV Utilities main page](../README.md)_ - -# Tools Reference - -The TSV Utilities Tools Reference provides detailed documentation about each tool. Each tool has it's own page, available through the links below. The [Common options and behavior](tool_reference/common-options-and-behavior.md) page provides information about features and options common to all the tools. - -Documentation for individual tools is also available via the `--help` option available on every tool. - -* [Common options and behavior](tool_reference/common-options-and-behavior.md) -* [csv2tsv](tool_reference/csv2tsv.md) -* [keep-header](tool_reference/keep-header.md) -* [number-lines](tool_reference/number-lines.md) -* [tsv-append](tool_reference/tsv-append.md) -* [tsv-filter](tool_reference/tsv-filter.md) -* [tsv-join](tool_reference/tsv-join.md) -* [tsv-pretty](tool_reference/tsv-pretty.md) -* [tsv-sample](tool_reference/tsv-sample.md) -* [tsv-select](tool_reference/tsv-select.md) -* [tsv-split](tool_reference/tsv-split.md) -* [tsv-summarize](tool_reference/tsv-summarize.md) -* [tsv-uniq](tool_reference/tsv-uniq.md)