diff --git a/.travis.yml b/.travis.yml
index 4c1be84b..94e7db67 100644
--- a/.travis.yml
+++ b/.travis.yml
@@ -41,14 +41,14 @@ matrix:
       dist: xenial
       group: travis_latest
       language: d
-      d: ldc-1.22.0
+      d: ldc-1.23.0
       env: LINUX_SPECIAL=1 DEPLOY=1
       # if: type IN (pull_request, cron)
     - os: osx
       osx_image: xcode11.5
       group: travis_latest
       language: d
-      d: ldc-1.22.0
+      d: ldc-1.23.0
       env: LTOPGO_V2=default DEPLOY=1
       # if: type IN (pull_request, cron)
     - os: osx
diff --git a/README.md b/README.md
index da32c92a..d8d67320 100644
--- a/README.md
+++ b/README.md
@@ -37,7 +37,7 @@ File an [issue](https://github.com/eBay/tsv-utils/issues) if you have problems,
 
 These tools perform data manipulation and statistical calculations on tab delimited data. They are intended for large files. Larger than ideal for loading entirely in memory in an application like R, but not so big as to necessitate moving to Hadoop or similar distributed compute environments. The features supported are useful both for standalone analysis and for preparing data for use in R, Pandas, and similar toolkits.
 
-The tools work like traditional Unix command line utilities such as `cut`, `sort`, `grep` and `awk`, and are intended to complement these tools. Each tool is a standalone executable. They follow common Unix conventions for pipeline programs. Data is read from files or standard input, results are written to standard output. Fields are identified either by field name or field number. The field separator defaults to TAB, but any character can be used. Input and output is UTF-8, and all operations are Unicode ready, including regular expression match (`tsv-filter`). Documentation is available for each tool by invoking it with the `--help` option. TSV format is similar to CSV, see [Comparing TSV and CSV formats](docs/comparing-tsv-and-csv.md) for the differences.
+The tools work like traditional Unix command line utilities such as `cut`, `sort`, `grep` and `awk`, and are intended to complement these tools. Each tool is a standalone executable. They follow common Unix conventions for pipeline programs. Data is read from files or standard input, results are written to standard output. Fields are identified either by field name or field number. The field separator defaults to TAB, but any character can be used. Input and output is UTF-8, and all operations are Unicode ready, including regular expression match (`tsv-filter`). Documentation is available for each tool by invoking it with the `--help` option. Most tools provide a `--help-verbose` option offering more extensive, reference style documentation. TSV format is similar to CSV, see [Comparing TSV and CSV formats](docs/comparing-tsv-and-csv.md) for the differences.
 
 The rest of this section contains descriptions of each tool. Click on the links below to jump directly to one of the tools. Full documentation is available in the [Tools Reference](docs/ToolReference.md). The first tool listed, [tsv-filter](#tsv-filter), provides a tutorial introduction to features found throughout the toolkit.
 
@@ -349,7 +349,7 @@ Note that many CSV files do not use escapes, and in-fact follow a strict delimit
 
 `csv2tsv` differs from many csv-to-tsv conversion tools in that it produces output free of CSV escapes. Many conversion tools produce data with CSV style escapes, but switching the field delimiter from comma to TAB. Such data cannot be reliably processed by Unix tools like `cut`, `awk`, `sort`, etc.
 
-`csv2tsv` avoids escapes by replacing TAB and newline characters in the data with a single space. These characters are rare in data mining scenarios, and space is usually a good substitute in cases where they do occur. The replacement string is customizable to enable alternate handling when needed.
+`csv2tsv` avoids escapes by replacing TAB and newline characters in the data with a single space. These characters are rare in data mining scenarios, and space is usually a good substitute in cases where they do occur. The replacement strings are customizable to enable alternate handling when needed.
 
 The `csv2tsv` converter often has a second benefit: regularizing newlines. CSV files are often exported using Windows newline conventions. `csv2tsv` converts all newlines to Unix format.
 
@@ -458,10 +458,10 @@ There are several ways to obtain the tools: [prebuilt binaries](#prebuilt-binari
 
 ### Prebuilt binaries
 
-Prebuilt binaries are available for Linux and Mac, these can be found on the [Github releases](https://github.com/eBay/tsv-utils/releases) page. Download and unpack the tar.gz file. Executables are in the `bin` directory. Add the `bin` directory or individual tools to the `PATH` environment variable. As an example, the 2.0.0 releases for Linux and MacOS can be downloaded and unpacked with these commands:
+Prebuilt binaries are available for Linux and Mac, these can be found on the [Github releases](https://github.com/eBay/tsv-utils/releases) page. Download and unpack the tar.gz file. Executables are in the `bin` directory. Add the `bin` directory or individual tools to the `PATH` environment variable. As an example, the 2.1.0 releases for Linux and MacOS can be downloaded and unpacked with these commands:
 ```
-$ curl -L https://github.com/eBay/tsv-utils/releases/download/v2.0.0/tsv-utils-v2.0.0_linux-x86_64_ldc2.tar.gz | tar xz
-$ curl -L https://github.com/eBay/tsv-utils/releases/download/v2.0.0/tsv-utils-v2.0.0_osx-x86_64_ldc2.tar.gz | tar xz
+$ curl -L https://github.com/eBay/tsv-utils/releases/download/v2.1.0/tsv-utils-v2.1.0_linux-x86_64_ldc2.tar.gz | tar xz
+$ curl -L https://github.com/eBay/tsv-utils/releases/download/v2.1.0/tsv-utils-v2.1.0_osx-x86_64_ldc2.tar.gz | tar xz
 ```
 
 See the [Github releases](https://github.com/eBay/tsv-utils/releases) page for the latest release.
@@ -502,10 +502,10 @@ The above requires LDC 1.9.0 or later. See [Building with Link Time Optimization
 
 ### Install using DUB
 
-If you are a D user you likely use DUB, the D package manager. DUB comes packaged with DMD starting with DMD 2.072. You can install and build using DUB as follows (replace `2.0.0` with the current version):
+If you are a D user you likely use DUB, the D package manager. DUB comes packaged with DMD starting with DMD 2.072. You can install and build using DUB as follows (replace `2.1.0` with the current version):
 ```
 $ dub fetch tsv-utils --cache=local
-$ cd tsv-utils-2.0.0/tsv-utils
+$ cd tsv-utils-2.1.0/tsv-utils
 $ dub run # For LDC: dub run -- --compiler=ldc2
 ```
 
diff --git a/docs/ToolReference.md b/docs/ToolReference.md
index f18ba1e8..d0e12bb8 100644
--- a/docs/ToolReference.md
+++ b/docs/ToolReference.md
@@ -4,7 +4,7 @@ _Visit the [TSV Utilities main page](../README.md)_
 
 The TSV Utilities Tools Reference provides detailed documentation about each tool. Each tool has it's own page, available through the links below. The [Common options and behavior](tool_reference/common-options-and-behavior.md) page provides information about features and options common to all the tools.
 
-Documentation for individual tools is also available via the `--help` option available on every tool.
+Documentation for individual tools is also available via the `--help` option available on every tool. Most tools provide a `--help-verbose` option offering more extensive documentation similar to what is available in the Tool Reference.
 
 * [Common options and behavior](tool_reference/common-options-and-behavior.md)
 * [csv2tsv](tool_reference/csv2tsv.md)
diff --git a/docs/tool_reference/csv2tsv.md b/docs/tool_reference/csv2tsv.md
index 750beb1b..5f76bd8a 100644
--- a/docs/tool_reference/csv2tsv.md
+++ b/docs/tool_reference/csv2tsv.md
@@ -7,9 +7,9 @@ _Visit the [TSV Utilities main page](../../README.md)_
 
 csv2tsv converts CSV (comma-separated) text to TSV (tab-separated) format. Records are read from files or standard input, converted records are written to standard output.
 
-Both formats represent tabular data, each record on its own line, fields separated by a delimiter character. The key difference is that CSV uses escape sequences to represent newlines and field separators in the data, whereas TSV disallows these characters in the data. The most common field delimiters are comma for CSV and tab for TSV, but any character can be used. See [Comparing TSV and CSV formats](../comparing-tsv-and-csv.md) for addition discussion of the formats.
+Both formats represent tabular data, each record on its own line, fields separated by a delimiter character. The key difference is that CSV uses escape sequences to represent newlines and field separators in the data, whereas TSV disallows these characters in the data. The most common field delimiters are comma for CSV and TAB for TSV, but any character can be used. See [Comparing TSV and CSV formats](../comparing-tsv-and-csv.md) for addition discussion of the formats.
 
-Conversion to TSV is done by removing CSV escape syntax, changing field delimiters, and replacing newlines and tabs in the data. By default, newlines and tabs in the data are replaced by spaces. Most details are customizable.
+Conversion to TSV is done by removing CSV escape syntax, changing field delimiters, and replacing newlines and TABs in the data. By default, newlines and TABs in the data are replaced by spaces. Most details are customizable.
 
 There is no single spec for CSV, any number of variants can be found. The escape syntax is common enough: fields containing newlines or field delimiters are placed in double quotes. Inside a quoted field, a double quote is represented by a pair of double quotes. As with field separators, the quoting character is customizable.
 
@@ -17,8 +17,9 @@ Behaviors of this program that often vary between CSV implementations:
 * Newlines are supported in quoted fields.
 * Double quotes are permitted in a non-quoted field. However, a field starting with a quote must follow quoting rules.
 * Each record can have a different numbers of fields.
-* The three common forms of newlines are supported: CR, CRLF, LF.
+* The three common forms of newlines are supported: CR, CRLF, LF. Output is written using Unix newlines (LF).
 * A newline will be added if the file does not end with one.
+* A UTF-8 Byte Order Mark (BOM) at the start of an input file will be removed.
 * No whitespace trimming is done.
 
 This program does not validate CSV correctness, but will terminate with an error upon reaching an inconsistent state. Improperly terminated quoted fields are the primary cause.
@@ -33,4 +34,5 @@ UTF-8 input is assumed. Convert other encodings prior to invoking this tool.
 * `--q|quote CHR` - Quoting character in CSV data. Default: double-quote (")
 * `--c|csv-delim CHR` - Field delimiter in CSV data. Default: comma (,).
 * `--t|tsv-delim CHR` - Field delimiter in TSV data. Default: TAB
-* `--r|replacement STR` - Replacement for newline and TSV field delimiters found in CSV input. Default: Space.
+* `--r|tab-replacement STR` - Replacement for TSV field delimiters (typically TABs) found in CSV input. Default: Space.
+* `--n|newline-replacement STR` - Replacement for newlines found in CSV input. Default: Space.
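
Note on the `csv2tsv` documentation changes above: the behavior they describe (BOM removal, escape removal, separate TAB and newline replacements, LF output) can be illustrated with a short Python sketch. This is not the tool's implementation, which is written in D. The function name `csv_to_tsv` and its parameters are hypothetical, chosen to mirror the `--tab-replacement` and `--newline-replacement` options, and Python's `csv` module may differ from `csv2tsv` in edge cases such as malformed quoting.

```python
import csv
import io
import re


def csv_to_tsv(text, tab_replacement=" ", newline_replacement=" "):
    """Convert CSV text to escape-free TSV text (illustrative sketch only).

    The two replacement parameters are hypothetical stand-ins for the
    --tab-replacement and --newline-replacement options described above.
    """
    # A decoded UTF-8 BOM appears as U+FEFF; drop it if it leads the input.
    if text.startswith("\ufeff"):
        text = text[1:]

    # newline="" keeps CR, CRLF, and LF untranslated so the csv module
    # sees the line endings exactly as they were written.
    reader = csv.reader(io.StringIO(text, newline=""))

    def replace_special(match):
        # TABs and newlines inside a field get their own replacement strings.
        return tab_replacement if match.group() == "\t" else newline_replacement

    out = []
    for record in reader:
        # Single pass per field, so one replacement string is never
        # rewritten by the other; CRLF counts as one newline.
        fields = [re.sub(r"\r\n|\r|\n|\t", replace_special, f) for f in record]
        # Every record ends with a Unix newline, including the last one.
        out.append("\t".join(fields) + "\n")
    return "".join(out)


if __name__ == "__main__":
    sample = '\ufeffid,comment\r\n1,"first line\nsecond line"\r\n2,"say ""hi"""'
    print(csv_to_tsv(sample, newline_replacement=" | "), end="")
```

Splitting the single replacement option into two parameters, as the diff does, lets a caller keep embedded TABs as spaces while marking embedded newlines with a distinct string.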