Make TSV finally true TSV (#923)
* Spec-TSV

* doc mods; more test cases
johnkerl authored Feb 6, 2022
1 parent ac47c70 commit 66c4a07
Showing 30 changed files with 705 additions and 139 deletions.
3 changes: 2 additions & 1 deletion .vimrc
@@ -1,4 +1,5 @@
map \d :w<C-m>:!clear;echo Building ...; echo; make mlr<C-m>
map \f :w<C-m>:!clear;echo Building ...; echo; make ut<C-m>
map \r :w<C-m>:!clear;echo Building ...; echo; make ut-scan ut-mlv<C-m>
"map \r :w<C-m>:!clear;echo Building ...; echo; make ut-scan ut-mlv<C-m>
map \r :w<C-m>:!clear;echo Building ...; echo; make ut-lib<C-m>
map \t :w<C-m>:!clear;go test github.com/johnkerl/miller/internal/pkg/transformers/...<C-m>
28 changes: 13 additions & 15 deletions docs/src/file-formats.md
@@ -104,36 +104,34 @@ NIDX: implicitly numerically indexed (Unix-toolkit style)

When `mlr` is invoked with the `--csv` or `--csvlite` option, key names are found on the first record and values are taken from subsequent records. This includes the case of CSV-formatted files. See [Record Heterogeneity](record-heterogeneity.md) for how Miller handles changes of field names within a single data stream.

Miller has record separator `RS` and field separator `FS`, just as `awk` does. For TSV, use `--fs tab`; to convert TSV to CSV, use `--ifs tab --ofs comma`, etc. (See also the [separators page](reference-main-separators.md).)
Miller has record separator `RS` and field separator `FS`, just as `awk` does. (See also the [separators page](reference-main-separators.md).)

**TSV (tab-separated values):** the following are synonymous pairs:
**TSV (tab-separated values):** `FS` is tab and `RS` is newline (or carriage return + linefeed for
Windows). On input, if fields have `\r`, `\n`, `\t`, or `\\`, those are decoded as carriage return,
newline, tab, and backslash, respectively. On output, the reverse is done -- for example, if a field
has an embedded newline, that newline is replaced by `\n`.
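The escaping rule described above can be sketched in Go as follows. This is a minimal illustration under stated assumptions, not Miller's actual implementation: the names `decodeTSVField` and `encodeTSVField` are hypothetical, and leaving unrecognized backslash sequences untouched is a choice made here for the sketch.

```go
package main

import (
	"fmt"
	"strings"
)

// decodeTSVField (hypothetical name) turns the two-character sequences
// \t, \n, \r, and \\ into tab, newline, carriage return, and backslash.
// Any other backslash sequence is copied through unchanged; that is an
// assumption of this sketch.
func decodeTSVField(s string) string {
	var b strings.Builder
	for i := 0; i < len(s); i++ {
		if s[i] == '\\' && i+1 < len(s) {
			switch s[i+1] {
			case 't':
				b.WriteByte('\t')
				i++
				continue
			case 'n':
				b.WriteByte('\n')
				i++
				continue
			case 'r':
				b.WriteByte('\r')
				i++
				continue
			case '\\':
				b.WriteByte('\\')
				i++
				continue
			}
		}
		b.WriteByte(s[i])
	}
	return b.String()
}

// tsvEncoder performs the reverse mapping in a single pass over the input.
var tsvEncoder = strings.NewReplacer(
	"\\", "\\\\",
	"\t", "\\t",
	"\n", "\\n",
	"\r", "\\r",
)

// encodeTSVField (hypothetical name) escapes the characters that would
// otherwise be mistaken for field or record separators.
func encodeTSVField(s string) string {
	return tsvEncoder.Replace(s)
}

func main() {
	raw := "line one\nline two\twith tab"
	enc := encodeTSVField(raw)
	fmt.Println(enc)                        // line one\nline two\twith tab
	fmt.Println(decodeTSVField(enc) == raw) // true
}
```

Because the encoded field contains no literal tab or newline, a reader can split records on newline and fields on tab before decoding each field.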

* `--tsv` and `--csv --fs tab`
* `--itsv` and `--icsv --ifs tab`
* `--otsv` and `--ocsv --ofs tab`
* `--tsvlite` and `--csvlite --fs tab`
* `--itsvlite` and `--icsvlite --ifs tab`
* `--otsvlite` and `--ocsvlite --ofs tab`
**ASV (ASCII-separated values):** the flags `--asv`, `--iasv`, `--oasv`, `--asvlite`, `--iasvlite`, and `--oasvlite` are analogous except they use ASCII FS and RS `0x1f` and `0x1e`, respectively.

**ASV (ASCII-separated values):** the flags `--asv`, `--iasv`, `--oasv`, `--asvlite`, `--iasvlite`, and `--oasvlite` are analogous except they use ASCII FS and RS 0x1f and 0x1e, respectively.

**USV (Unicode-separated values):** likewise, the flags `--usv`, `--iusv`, `--ousv`, `--usvlite`, `--iusvlite`, and `--ousvlite` use Unicode FS and RS U+241F (UTF-8 0x0xe2909f) and U+241E (UTF-8 0xe2909e), respectively.
**USV (Unicode-separated values):** likewise, the flags `--usv`, `--iusv`, `--ousv`, `--usvlite`, `--iusvlite`, and `--ousvlite` use Unicode FS and RS `U+241F` (UTF-8 `0x0xe2909f`) and `U+241E` (UTF-8 `0xe2909e`), respectively.
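As a quick check of the UTF-8 byte sequences quoted for the USV separators, the following Go sketch prints the encoding of each code point. The helper name `utf8Hex` is hypothetical and exists only for this illustration.

```go
package main

import "fmt"

// utf8Hex (hypothetical helper) returns the UTF-8 encoding of a code point
// as lowercase hex digits with no separators.
func utf8Hex(r rune) string {
	return fmt.Sprintf("%x", []byte(string(r)))
}

func main() {
	// U+241F is SYMBOL FOR UNIT SEPARATOR (the USV field separator);
	// U+241E is SYMBOL FOR RECORD SEPARATOR (the USV record separator).
	for _, r := range []rune{0x241F, 0x241E} {
		fmt.Printf("U+%04X -> 0x%s\n", r, utf8Hex(r))
	}
	// U+241F -> 0xe2909f
	// U+241E -> 0xe2909e
}
```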

Miller's `--csv` flag supports [RFC-4180 CSV](https://tools.ietf.org/html/rfc4180). This includes CRLF line-terminators by default, regardless of platform.

Here are the differences between CSV and CSV-lite:

* CSV-lite naively splits lines on newline, and fields on comma -- embedded commas and newlines are not escaped in any way.

* CSV supports [RFC-4180](https://tools.ietf.org/html/rfc4180)-style double-quoting, including the ability to have commas and/or LF/CRLF line-endings contained within an input field; CSV-lite does not.

* CSV does not allow heterogeneous data; CSV-lite does (see also [Record Heterogeneity](record-heterogeneity.md)).

* The CSV-lite input-reading code is fractionally more efficient than the CSV input-reader.
* TSV-lite is simply CSV-lite with field separator set to tab instead of comma.

Here are things they have in common:
* CSV-lite allows changing FS and/or RS to any values, perhaps multi-character.

* The ability to specify record/field separators other than the default, e.g. CR-LF vs. LF, or tab instead of comma for TSV, and so on.
* In short, use-cases for CSV-lite and TSV-lite are often found when dealing with CSV/TSV files which are formatted in some non-standard way -- you have a little more flexibility available to you. (As an example of this flexibility: ASV and USV are nothing more than CSV-lite with different values for FS and RS.)

* The `--implicit-csv-header` flag for input and the `--headerless-csv-output` flag for output.
CSV, TSV, CSV-lite, and TSV-lite have in common the `--implicit-csv-header` flag for input and the `--headerless-csv-output` flag for output.
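The difference between naive splitting and RFC-4180-style quoting can be seen in a small Go sketch. It uses Go's standard `encoding/csv` package as a stand-in for a quote-aware reader and `strings.Split` as a stand-in for the naive one; neither is claimed to be Miller's own reader code, and the function names are hypothetical.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strings"
)

// naiveSplit splits on every comma, the way a "lite" reader would:
// quotes are not interpreted, so embedded commas break fields apart.
func naiveSplit(line string) []string {
	return strings.Split(line, ",")
}

// quotedSplit parses one record with double-quote handling, so a comma
// inside a quoted field stays part of that field.
func quotedSplit(line string) ([]string, error) {
	return csv.NewReader(strings.NewReader(line)).Read()
}

func main() {
	line := `a,"b,c",d`

	fmt.Println(len(naiveSplit(line))) // 4

	fields, err := quotedSplit(line)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println(len(fields), fields[1]) // 3 b,c
}
```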

## JSON

28 changes: 13 additions & 15 deletions docs/src/file-formats.md.in
@@ -16,36 +16,34 @@ GENMD-EOF

When `mlr` is invoked with the `--csv` or `--csvlite` option, key names are found on the first record and values are taken from subsequent records. This includes the case of CSV-formatted files. See [Record Heterogeneity](record-heterogeneity.md) for how Miller handles changes of field names within a single data stream.

Miller has record separator `RS` and field separator `FS`, just as `awk` does. For TSV, use `--fs tab`; to convert TSV to CSV, use `--ifs tab --ofs comma`, etc. (See also the [separators page](reference-main-separators.md).)
Miller has record separator `RS` and field separator `FS`, just as `awk` does. (See also the [separators page](reference-main-separators.md).)

**TSV (tab-separated values):** the following are synonymous pairs:
**TSV (tab-separated values):** `FS` is tab and `RS` is newline (or carriage return + linefeed for
Windows). On input, if fields have `\r`, `\n`, `\t`, or `\\`, those are decoded as carriage return,
newline, tab, and backslash, respectively. On output, the reverse is done -- for example, if a field
has an embedded newline, that newline is replaced by `\n`.

* `--tsv` and `--csv --fs tab`
* `--itsv` and `--icsv --ifs tab`
* `--otsv` and `--ocsv --ofs tab`
* `--tsvlite` and `--csvlite --fs tab`
* `--itsvlite` and `--icsvlite --ifs tab`
* `--otsvlite` and `--ocsvlite --ofs tab`
**ASV (ASCII-separated values):** the flags `--asv`, `--iasv`, `--oasv`, `--asvlite`, `--iasvlite`, and `--oasvlite` are analogous except they use ASCII FS and RS `0x1f` and `0x1e`, respectively.

**ASV (ASCII-separated values):** the flags `--asv`, `--iasv`, `--oasv`, `--asvlite`, `--iasvlite`, and `--oasvlite` are analogous except they use ASCII FS and RS 0x1f and 0x1e, respectively.

**USV (Unicode-separated values):** likewise, the flags `--usv`, `--iusv`, `--ousv`, `--usvlite`, `--iusvlite`, and `--ousvlite` use Unicode FS and RS U+241F (UTF-8 0x0xe2909f) and U+241E (UTF-8 0xe2909e), respectively.
**USV (Unicode-separated values):** likewise, the flags `--usv`, `--iusv`, `--ousv`, `--usvlite`, `--iusvlite`, and `--ousvlite` use Unicode FS and RS `U+241F` (UTF-8 `0x0xe2909f`) and `U+241E` (UTF-8 `0xe2909e`), respectively.

Miller's `--csv` flag supports [RFC-4180 CSV](https://tools.ietf.org/html/rfc4180). This includes CRLF line-terminators by default, regardless of platform.

Here are the differences between CSV and CSV-lite:

* CSV-lite naively splits lines on newline, and fields on comma -- embedded commas and newlines are not escaped in any way.

* CSV supports [RFC-4180](https://tools.ietf.org/html/rfc4180)-style double-quoting, including the ability to have commas and/or LF/CRLF line-endings contained within an input field; CSV-lite does not.

* CSV does not allow heterogeneous data; CSV-lite does (see also [Record Heterogeneity](record-heterogeneity.md)).

* The CSV-lite input-reading code is fractionally more efficient than the CSV input-reader.
* TSV-lite is simply CSV-lite with field separator set to tab instead of comma.

Here are things they have in common:
* CSV-lite allows changing FS and/or RS to any values, perhaps multi-character.

* The ability to specify record/field separators other than the default, e.g. CR-LF vs. LF, or tab instead of comma for TSV, and so on.
* In short, use-cases for CSV-lite and TSV-lite are often found when dealing with CSV/TSV files which are formatted in some non-standard way -- you have a little more flexibility available to you. (As an example of this flexibility: ASV and USV are nothing more than CSV-lite with different values for FS and RS.)

* The `--implicit-csv-header` flag for input and the `--headerless-csv-output` flag for output.
CSV, TSV, CSV-lite, and TSV-lite have in common the `--implicit-csv-header` flag for input and the `--headerless-csv-output` flag for output.

## JSON

4 changes: 2 additions & 2 deletions docs/src/keystroke-savers.md
@@ -92,11 +92,11 @@ If there's more than one input file, you can use `--mfrom`, then however many fi
The following have even shorter versions:

* `-c` is the same as `--csv`
* `-t` is the same as `--tsvlite`
* `-t` is the same as `--tsv`
* `-j` is the same as `--json`

I don't use these within these documents, since I want the docs to be self-explanatory on every page, and
I think `mlr --csv ...` explains itself better than `mlr -c ...`. Nonetheless, they're there for you to use.
I think `mlr --csv ...` explains itself better than `mlr -c ...`. Nonetheless, they're always there for you to use.

## .mlrrc file

4 changes: 2 additions & 2 deletions docs/src/keystroke-savers.md.in
@@ -37,11 +37,11 @@ GENMD-EOF
The following have even shorter versions:

* `-c` is the same as `--csv`
* `-t` is the same as `--tsvlite`
* `-t` is the same as `--tsv`
* `-j` is the same as `--json`

I don't use these within these documents, since I want the docs to be self-explanatory on every page, and
I think `mlr --csv ...` explains itself better than `mlr -c ...`. Nonetheless, they're there for you to use.
I think `mlr --csv ...` explains itself better than `mlr -c ...`. Nonetheless, they're always there for you to use.

## .mlrrc file

6 changes: 3 additions & 3 deletions docs/src/manpage.md
@@ -386,7 +386,7 @@ FILE-FORMAT FLAGS
--oxtab Use XTAB format for output data.
--pprint Use PPRINT format for input and output data.
--tsv Use TSV format for input and output data.
--tsvlite or -t Use TSV-lite format for input and output data.
--tsv or -t Use TSV-lite format for input and output data.
--usv or --usvlite Use USV format for input and output data.
--xtab Use XTAB format for input and output data.
-i {format name} Use format name for input data. For example: `-i csv`
@@ -708,7 +708,6 @@ SEPARATOR FLAGS
alignment impossible.
* OPS may be multi-character for XTAB format, in which case alignment is
disabled.
* TSV is simply CSV using tab as field separator (`--fs tab`).
* FS/PS are ignored for markdown format; RS is used.
* All FS and PS options are ignored for JSON format, since they are not relevant
to the JSON format.
@@ -763,6 +762,7 @@ SEPARATOR FLAGS
markdown " " N/A "\n"
nidx " " N/A "\n"
pprint " " N/A "\n"
tsv " " N/A "\n"
xtab "\n" " " "\n\n"

--fs {string} Specify FS for input and output.
@@ -3157,5 +3157,5 @@ SEE ALSO



2022-02-05 MILLER(1)
2022-02-06 MILLER(1)
</pre>
6 changes: 3 additions & 3 deletions docs/src/manpage.txt
@@ -365,7 +365,7 @@ FILE-FORMAT FLAGS
--oxtab Use XTAB format for output data.
--pprint Use PPRINT format for input and output data.
--tsv Use TSV format for input and output data.
--tsvlite or -t Use TSV-lite format for input and output data.
--tsv or -t Use TSV-lite format for input and output data.
--usv or --usvlite Use USV format for input and output data.
--xtab Use XTAB format for input and output data.
-i {format name} Use format name for input data. For example: `-i csv`
@@ -687,7 +687,6 @@ SEPARATOR FLAGS
alignment impossible.
* OPS may be multi-character for XTAB format, in which case alignment is
disabled.
* TSV is simply CSV using tab as field separator (`--fs tab`).
* FS/PS are ignored for markdown format; RS is used.
* All FS and PS options are ignored for JSON format, since they are not relevant
to the JSON format.
@@ -742,6 +741,7 @@ SEPARATOR FLAGS
markdown " " N/A "\n"
nidx " " N/A "\n"
pprint " " N/A "\n"
tsv " " N/A "\n"
xtab "\n" " " "\n\n"

--fs {string} Specify FS for input and output.
@@ -3136,4 +3136,4 @@ SEE ALSO



2022-02-05 MILLER(1)
2022-02-06 MILLER(1)
4 changes: 2 additions & 2 deletions docs/src/reference-main-flag-list.md
@@ -177,7 +177,7 @@ are overridden in all cases by setting output format to `format2`.
* `--oxtab`: Use XTAB format for output data.
* `--pprint`: Use PPRINT format for input and output data.
* `--tsv`: Use TSV format for input and output data.
* `--tsvlite or -t`: Use TSV-lite format for input and output data.
* `--tsv`: Use TSV format for input and output data.
* `--usv or --usvlite`: Use USV format for input and output data.
* `--xtab`: Use XTAB format for input and output data.
* `-i {format name}`: Use format name for input data. For example: `-i csv` is the same as `--icsv`.
@@ -405,7 +405,6 @@ Notes about all other separators:
alignment impossible.
* OPS may be multi-character for XTAB format, in which case alignment is
disabled.
* TSV is simply CSV using tab as field separator (`--fs tab`).
* FS/PS are ignored for markdown format; RS is used.
* All FS and PS options are ignored for JSON format, since they are not relevant
to the JSON format.
@@ -460,6 +459,7 @@ Notes about all other separators:
markdown " " N/A "\n"
nidx " " N/A "\n"
pprint " " N/A "\n"
tsv " " N/A "\n"
xtab "\n" " " "\n\n"


3 changes: 2 additions & 1 deletion docs/src/reference-main-separators.md
@@ -261,8 +261,9 @@ a:4;b:5;c:6;d:>>>,|||;<<<

Notes:

* If CSV field separator is tab, we have TSV; see more examples (ASV, USV, etc.) at in the [CSV section](file-formats.md#csvtsvasvusvetc).
* CSV IRS and ORS must be newline, and CSV IFS must be a single character. (CSV-lite does not have these restrictions.)
* TSV IRS and ORS must be newline, and TSV IFS must be a tab. (TSV-lite does not have these restrictions.)
* See the [CSV section](file-formats.md#csvtsvasvusvetc) for information about ASV and USV.
* JSON: ignores all separator flags from the command line.
* Headerless CSV overlaps quite a bit with NIDX format using comma for IFS. See also the page on [CSV with and without headers](csv-with-and-without-headers.md).
* For XTAB, the record separator is a repetition of the field separator. For example, if one record has `x=1,y=2` and the next has `x=3,y=4`, and OFS is newline, then output lines are `x 1`, then `y 2`, then an extra newline, then `x 3`, then `y 4`. This means: to customize XTAB, set `OFS` rather than `ORS`.
3 changes: 2 additions & 1 deletion docs/src/reference-main-separators.md.in
@@ -151,8 +151,9 @@ GENMD-EOF

Notes:

* If CSV field separator is tab, we have TSV; see more examples (ASV, USV, etc.) at in the [CSV section](file-formats.md#csvtsvasvusvetc).
* CSV IRS and ORS must be newline, and CSV IFS must be a single character. (CSV-lite does not have these restrictions.)
* TSV IRS and ORS must be newline, and TSV IFS must be a tab. (TSV-lite does not have these restrictions.)
* See the [CSV section](file-formats.md#csvtsvasvusvetc) for information about ASV and USV.
* JSON: ignores all separator flags from the command line.
* Headerless CSV overlaps quite a bit with NIDX format using comma for IFS. See also the page on [CSV with and without headers](csv-with-and-without-headers.md).
* For XTAB, the record separator is a repetition of the field separator. For example, if one record has `x=1,y=2` and the next has `x=3,y=4`, and OFS is newline, then output lines are `x 1`, then `y 2`, then an extra newline, then `x 3`, then `y 4`. This means: to customize XTAB, set `OFS` rather than `ORS`.