Make TSV finally true TSV #923

Merged 2 commits on Feb 6, 2022
3 changes: 2 additions & 1 deletion .vimrc
@@ -1,4 +1,5 @@
map \d :w<C-m>:!clear;echo Building ...; echo; make mlr<C-m>
map \f :w<C-m>:!clear;echo Building ...; echo; make ut<C-m>
map \r :w<C-m>:!clear;echo Building ...; echo; make ut-scan ut-mlv<C-m>
"map \r :w<C-m>:!clear;echo Building ...; echo; make ut-scan ut-mlv<C-m>
map \r :w<C-m>:!clear;echo Building ...; echo; make ut-lib<C-m>
map \t :w<C-m>:!clear;go test github.com/johnkerl/miller/internal/pkg/transformers/...<C-m>
28 changes: 13 additions & 15 deletions docs/src/file-formats.md
@@ -104,36 +104,34 @@ NIDX: implicitly numerically indexed (Unix-toolkit style)

When `mlr` is invoked with the `--csv` or `--csvlite` option, key names are found on the first record and values are taken from subsequent records. This includes the case of CSV-formatted files. See [Record Heterogeneity](record-heterogeneity.md) for how Miller handles changes of field names within a single data stream.

Miller has record separator `RS` and field separator `FS`, just as `awk` does. For TSV, use `--fs tab`; to convert TSV to CSV, use `--ifs tab --ofs comma`, etc. (See also the [separators page](reference-main-separators.md).)
Miller has record separator `RS` and field separator `FS`, just as `awk` does. (See also the [separators page](reference-main-separators.md).)

**TSV (tab-separated values):** the following are synonymous pairs:
**TSV (tab-separated values):** `FS` is tab and `RS` is newline (or carriage return + linefeed for
Windows). On input, if fields have `\r`, `\n`, `\t`, or `\\`, those are decoded as carriage return,
newline, tab, and backslash, respectively. On output, the reverse is done -- for example, if a field
has an embedded newline, that newline is replaced by `\n`.
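
The escaping rule in the new TSV paragraph can be sketched as a pair of helper functions. This is a minimal illustration and not Miller's actual code: the names `tsv_decode_field` and `tsv_encode_field` are hypothetical, and passing unrecognized backslash sequences through unchanged is an assumption that the text does not address.

```python
# Sketch of the TSV field escaping described above; not Miller's implementation.
# Handled sequences: \r, \n, \t, and \\ (the four named in the paragraph).

_DECODE = {"r": "\r", "n": "\n", "t": "\t", "\\": "\\"}
_ENCODE = {"\\": "\\\\", "\t": "\\t", "\n": "\\n", "\r": "\\r"}


def tsv_decode_field(field: str) -> str:
    """Turn backslash sequences read from TSV input into the characters they stand for."""
    out = []
    i = 0
    while i < len(field):
        ch = field[i]
        if ch == "\\" and i + 1 < len(field) and field[i + 1] in _DECODE:
            out.append(_DECODE[field[i + 1]])
            i += 2
        else:
            # Assumption: anything else, including an unknown escape, is kept as-is.
            out.append(ch)
            i += 1
    return "".join(out)


def tsv_encode_field(field: str) -> str:
    """Replace backslash, tab, newline, and carriage return with escape sequences."""
    return "".join(_ENCODE.get(ch, ch) for ch in field)
```

With these helpers, a field containing an embedded newline is written as the two characters `\n`, so each record stays on one line, and decoding the encoded field gives back the original value.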

* `--tsv` and `--csv --fs tab`
* `--itsv` and `--icsv --ifs tab`
* `--otsv` and `--ocsv --ofs tab`
* `--tsvlite` and `--csvlite --fs tab`
* `--itsvlite` and `--icsvlite --ifs tab`
* `--otsvlite` and `--ocsvlite --ofs tab`
**ASV (ASCII-separated values):** the flags `--asv`, `--iasv`, `--oasv`, `--asvlite`, `--iasvlite`, and `--oasvlite` are analogous except they use ASCII FS and RS `0x1f` and `0x1e`, respectively.

**ASV (ASCII-separated values):** the flags `--asv`, `--iasv`, `--oasv`, `--asvlite`, `--iasvlite`, and `--oasvlite` are analogous except they use ASCII FS and RS 0x1f and 0x1e, respectively.

**USV (Unicode-separated values):** likewise, the flags `--usv`, `--iusv`, `--ousv`, `--usvlite`, `--iusvlite`, and `--ousvlite` use Unicode FS and RS U+241F (UTF-8 0x0xe2909f) and U+241E (UTF-8 0xe2909e), respectively.
**USV (Unicode-separated values):** likewise, the flags `--usv`, `--iusv`, `--ousv`, `--usvlite`, `--iusvlite`, and `--ousvlite` use Unicode FS and RS `U+241F` (UTF-8 `0xe2909f`) and `U+241E` (UTF-8 `0xe2909e`), respectively.
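
As a quick check of the separator values quoted for ASV and USV, the following sketch prints the UTF-8 bytes of the two Unicode separators and splits a small USV-style string. The helpers `utf8_hex` and `split_records` are hypothetical names, and the naive splitting is only an illustration of separator-based parsing, not Miller's reader.

```python
# Separator values quoted from the text above.
ASV_FS, ASV_RS = "\x1f", "\x1e"      # ASCII unit separator and record separator
USV_FS, USV_RS = "\u241f", "\u241e"  # Unicode symbols for those two control characters


def utf8_hex(ch: str) -> str:
    """Hex string of the UTF-8 encoding of ch."""
    return ch.encode("utf-8").hex()


def split_records(data: str, fs: str, rs: str) -> list:
    """Naive split on RS, then FS, with no quoting or escaping."""
    return [rec.split(fs) for rec in data.split(rs) if rec]


sample = "a" + USV_FS + "b" + USV_RS + "1" + USV_FS + "2" + USV_RS
print(utf8_hex(USV_FS), utf8_hex(USV_RS))     # e2909f e2909e
print(split_records(sample, USV_FS, USV_RS))  # [['a', 'b'], ['1', '2']]
```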

Miller's `--csv` flag supports [RFC-4180 CSV](https://tools.ietf.org/html/rfc4180). This includes CRLF line-terminators by default, regardless of platform.

Here are the differences between CSV and CSV-lite:

* CSV-lite naively splits lines on newline, and fields on comma -- embedded commas and newlines are not escaped in any way.

* CSV supports [RFC-4180](https://tools.ietf.org/html/rfc4180)-style double-quoting, including the ability to have commas and/or LF/CRLF line-endings contained within an input field; CSV-lite does not.

* CSV does not allow heterogeneous data; CSV-lite does (see also [Record Heterogeneity](record-heterogeneity.md)).

* The CSV-lite input-reading code is fractionally more efficient than the CSV input-reader.
* TSV-lite is simply CSV-lite with field separator set to tab instead of comma.

Here are things they have in common:
* CSV-lite allows changing FS and/or RS to any values, perhaps multi-character.

* The ability to specify record/field separators other than the default, e.g. CR-LF vs. LF, or tab instead of comma for TSV, and so on.
* In short, use-cases for CSV-lite and TSV-lite are often found when dealing with CSV/TSV files which are formatted in some non-standard way -- you have a little more flexibility available to you. (As an example of this flexibility: ASV and USV are nothing more than CSV-lite with different values for FS and RS.)

* The `--implicit-csv-header` flag for input and the `--headerless-csv-output` flag for output.
CSV, TSV, CSV-lite, and TSV-lite have in common the `--implicit-csv-header` flag for input and the `--headerless-csv-output` flag for output.
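
The difference between naive splitting and RFC-4180 quoting can be seen in a short sketch. Python's standard `csv` module stands in here for a quote-aware reader, and plain `str.split` stands in for the CSV-lite approach; neither is Miller's code, and the sample data is made up for illustration.

```python
import csv
import io

# One header line and one data line whose second field contains a quoted comma.
data = 'id,note\r\n1,"hello, world"\r\n'

# Naive approach: split lines on newline and fields on comma, ignoring quotes.
naive = [line.split(",") for line in data.splitlines()]

# Quote-aware approach: the csv module honors RFC-4180-style double quotes.
quoted = list(csv.reader(io.StringIO(data, newline="")))

print(naive)   # [['id', 'note'], ['1', '"hello', ' world"']]
print(quoted)  # [['id', 'note'], ['1', 'hello, world']]
```

The naive split yields three fields on the data line, which is why files with embedded commas or newlines need the full CSV reader.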

## JSON

28 changes: 13 additions & 15 deletions docs/src/file-formats.md.in
@@ -16,36 +16,34 @@ GENMD-EOF

When `mlr` is invoked with the `--csv` or `--csvlite` option, key names are found on the first record and values are taken from subsequent records. This includes the case of CSV-formatted files. See [Record Heterogeneity](record-heterogeneity.md) for how Miller handles changes of field names within a single data stream.

Miller has record separator `RS` and field separator `FS`, just as `awk` does. For TSV, use `--fs tab`; to convert TSV to CSV, use `--ifs tab --ofs comma`, etc. (See also the [separators page](reference-main-separators.md).)
Miller has record separator `RS` and field separator `FS`, just as `awk` does. (See also the [separators page](reference-main-separators.md).)

**TSV (tab-separated values):** the following are synonymous pairs:
**TSV (tab-separated values):** `FS` is tab and `RS` is newline (or carriage return + linefeed for
Windows). On input, if fields have `\r`, `\n`, `\t`, or `\\`, those are decoded as carriage return,
newline, tab, and backslash, respectively. On output, the reverse is done -- for example, if a field
has an embedded newline, that newline is replaced by `\n`.

* `--tsv` and `--csv --fs tab`
* `--itsv` and `--icsv --ifs tab`
* `--otsv` and `--ocsv --ofs tab`
* `--tsvlite` and `--csvlite --fs tab`
* `--itsvlite` and `--icsvlite --ifs tab`
* `--otsvlite` and `--ocsvlite --ofs tab`
**ASV (ASCII-separated values):** the flags `--asv`, `--iasv`, `--oasv`, `--asvlite`, `--iasvlite`, and `--oasvlite` are analogous except they use ASCII FS and RS `0x1f` and `0x1e`, respectively.

**ASV (ASCII-separated values):** the flags `--asv`, `--iasv`, `--oasv`, `--asvlite`, `--iasvlite`, and `--oasvlite` are analogous except they use ASCII FS and RS 0x1f and 0x1e, respectively.

**USV (Unicode-separated values):** likewise, the flags `--usv`, `--iusv`, `--ousv`, `--usvlite`, `--iusvlite`, and `--ousvlite` use Unicode FS and RS U+241F (UTF-8 0x0xe2909f) and U+241E (UTF-8 0xe2909e), respectively.
**USV (Unicode-separated values):** likewise, the flags `--usv`, `--iusv`, `--ousv`, `--usvlite`, `--iusvlite`, and `--ousvlite` use Unicode FS and RS `U+241F` (UTF-8 `0xe2909f`) and `U+241E` (UTF-8 `0xe2909e`), respectively.

Miller's `--csv` flag supports [RFC-4180 CSV](https://tools.ietf.org/html/rfc4180). This includes CRLF line-terminators by default, regardless of platform.

Here are the differences between CSV and CSV-lite:

* CSV-lite naively splits lines on newline, and fields on comma -- embedded commas and newlines are not escaped in any way.

* CSV supports [RFC-4180](https://tools.ietf.org/html/rfc4180)-style double-quoting, including the ability to have commas and/or LF/CRLF line-endings contained within an input field; CSV-lite does not.

* CSV does not allow heterogeneous data; CSV-lite does (see also [Record Heterogeneity](record-heterogeneity.md)).

* The CSV-lite input-reading code is fractionally more efficient than the CSV input-reader.
* TSV-lite is simply CSV-lite with field separator set to tab instead of comma.

Here are things they have in common:
* CSV-lite allows changing FS and/or RS to any values, perhaps multi-character.

* The ability to specify record/field separators other than the default, e.g. CR-LF vs. LF, or tab instead of comma for TSV, and so on.
* In short, use-cases for CSV-lite and TSV-lite are often found when dealing with CSV/TSV files which are formatted in some non-standard way -- you have a little more flexibility available to you. (As an example of this flexibility: ASV and USV are nothing more than CSV-lite with different values for FS and RS.)

* The `--implicit-csv-header` flag for input and the `--headerless-csv-output` flag for output.
CSV, TSV, CSV-lite, and TSV-lite have in common the `--implicit-csv-header` flag for input and the `--headerless-csv-output` flag for output.

## JSON

4 changes: 2 additions & 2 deletions docs/src/keystroke-savers.md
@@ -92,11 +92,11 @@ If there's more than one input file, you can use `--mfrom`, then however many fi
The following have even shorter versions:

* `-c` is the same as `--csv`
* `-t` is the same as `--tsvlite`
* `-t` is the same as `--tsv`
* `-j` is the same as `--json`

I don't use these within these documents, since I want the docs to be self-explanatory on every page, and
I think `mlr --csv ...` explains itself better than `mlr -c ...`. Nonetheless, they're there for you to use.
I think `mlr --csv ...` explains itself better than `mlr -c ...`. Nonetheless, they're always there for you to use.

## .mlrrc file

4 changes: 2 additions & 2 deletions docs/src/keystroke-savers.md.in
@@ -37,11 +37,11 @@ GENMD-EOF
The following have even shorter versions:

* `-c` is the same as `--csv`
* `-t` is the same as `--tsvlite`
* `-t` is the same as `--tsv`
* `-j` is the same as `--json`

I don't use these within these documents, since I want the docs to be self-explanatory on every page, and
I think `mlr --csv ...` explains itself better than `mlr -c ...`. Nonetheless, they're there for you to use.
I think `mlr --csv ...` explains itself better than `mlr -c ...`. Nonetheless, they're always there for you to use.

## .mlrrc file

6 changes: 3 additions & 3 deletions docs/src/manpage.md
@@ -386,7 +386,7 @@ FILE-FORMAT FLAGS
--oxtab Use XTAB format for output data.
--pprint Use PPRINT format for input and output data.
--tsv Use TSV format for input and output data.
--tsvlite or -t Use TSV-lite format for input and output data.
--tsv or -t Use TSV-lite format for input and output data.
--usv or --usvlite Use USV format for input and output data.
--xtab Use XTAB format for input and output data.
-i {format name} Use format name for input data. For example: `-i csv`
@@ -708,7 +708,6 @@ SEPARATOR FLAGS
alignment impossible.
* OPS may be multi-character for XTAB format, in which case alignment is
disabled.
* TSV is simply CSV using tab as field separator (`--fs tab`).
* FS/PS are ignored for markdown format; RS is used.
* All FS and PS options are ignored for JSON format, since they are not relevant
to the JSON format.
@@ -763,6 +762,7 @@ SEPARATOR FLAGS
markdown " " N/A "\n"
nidx " " N/A "\n"
pprint " " N/A "\n"
tsv " " N/A "\n"
xtab "\n" " " "\n\n"

--fs {string} Specify FS for input and output.
@@ -3157,5 +3157,5 @@ SEE ALSO



2022-02-05 MILLER(1)
2022-02-06 MILLER(1)
</pre>
6 changes: 3 additions & 3 deletions docs/src/manpage.txt
@@ -365,7 +365,7 @@ FILE-FORMAT FLAGS
--oxtab Use XTAB format for output data.
--pprint Use PPRINT format for input and output data.
--tsv Use TSV format for input and output data.
--tsvlite or -t Use TSV-lite format for input and output data.
--tsv or -t Use TSV-lite format for input and output data.
--usv or --usvlite Use USV format for input and output data.
--xtab Use XTAB format for input and output data.
-i {format name} Use format name for input data. For example: `-i csv`
@@ -687,7 +687,6 @@ SEPARATOR FLAGS
alignment impossible.
* OPS may be multi-character for XTAB format, in which case alignment is
disabled.
* TSV is simply CSV using tab as field separator (`--fs tab`).
* FS/PS are ignored for markdown format; RS is used.
* All FS and PS options are ignored for JSON format, since they are not relevant
to the JSON format.
@@ -742,6 +741,7 @@ SEPARATOR FLAGS
markdown " " N/A "\n"
nidx " " N/A "\n"
pprint " " N/A "\n"
tsv " " N/A "\n"
xtab "\n" " " "\n\n"

--fs {string} Specify FS for input and output.
@@ -3136,4 +3136,4 @@ SEE ALSO



2022-02-05 MILLER(1)
2022-02-06 MILLER(1)
4 changes: 2 additions & 2 deletions docs/src/reference-main-flag-list.md
@@ -177,7 +177,7 @@ are overridden in all cases by setting output format to `format2`.
* `--oxtab`: Use XTAB format for output data.
* `--pprint`: Use PPRINT format for input and output data.
* `--tsv`: Use TSV format for input and output data.
* `--tsvlite or -t`: Use TSV-lite format for input and output data.
* `--tsv`: Use TSV format for input and output data.
* `--usv or --usvlite`: Use USV format for input and output data.
* `--xtab`: Use XTAB format for input and output data.
* `-i {format name}`: Use format name for input data. For example: `-i csv` is the same as `--icsv`.
@@ -405,7 +405,6 @@ Notes about all other separators:
alignment impossible.
* OPS may be multi-character for XTAB format, in which case alignment is
disabled.
* TSV is simply CSV using tab as field separator (`--fs tab`).
* FS/PS are ignored for markdown format; RS is used.
* All FS and PS options are ignored for JSON format, since they are not relevant
to the JSON format.
@@ -460,6 +459,7 @@ Notes about all other separators:
markdown " " N/A "\n"
nidx " " N/A "\n"
pprint " " N/A "\n"
tsv " " N/A "\n"
xtab "\n" " " "\n\n"


3 changes: 2 additions & 1 deletion docs/src/reference-main-separators.md
@@ -261,8 +261,9 @@ a:4;b:5;c:6;d:>>>,|||;<<<

Notes:

* If CSV field separator is tab, we have TSV; see more examples (ASV, USV, etc.) at in the [CSV section](file-formats.md#csvtsvasvusvetc).
* CSV IRS and ORS must be newline, and CSV IFS must be a single character. (CSV-lite does not have these restrictions.)
* TSV IRS and ORS must be newline, and TSV IFS must be a tab. (TSV-lite does not have these restrictions.)
* See the [CSV section](file-formats.md#csvtsvasvusvetc) for information about ASV and USV.
* JSON: ignores all separator flags from the command line.
* Headerless CSV overlaps quite a bit with NIDX format using comma for IFS. See also the page on [CSV with and without headers](csv-with-and-without-headers.md).
* For XTAB, the record separator is a repetition of the field separator. For example, if one record has `x=1,y=2` and the next has `x=3,y=4`, and OFS is newline, then output lines are `x 1`, then `y 2`, then an extra newline, then `x 3`, then `y 4`. This means: to customize XTAB, set `OFS` rather than `ORS`.
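
The XTAB note above can be sketched as a small formatter. This is an illustration under stated assumptions, not Miller's writer: the function name `xtab_format` and the `ops` parameter are hypothetical, and the sketch skips the value alignment that XTAB output normally applies.

```python
def xtab_format(records, ofs="\n", ops=" "):
    """Join key/value pairs with OFS; separate records with OFS repeated twice."""
    blocks = [
        ofs.join(f"{key}{ops}{value}" for key, value in record.items())
        for record in records
    ]
    # The record separator is the field separator repeated, as the note describes.
    return (ofs + ofs).join(blocks) + ofs


records = [{"x": 1, "y": 2}, {"x": 3, "y": 4}]
print(xtab_format(records), end="")
# x 1
# y 2
#
# x 3
# y 4
```

Because the record separator is derived from `ofs`, changing `ofs` alone changes both the line breaks within a record and the gap between records.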
3 changes: 2 additions & 1 deletion docs/src/reference-main-separators.md.in
@@ -151,8 +151,9 @@ GENMD-EOF

Notes:

* If CSV field separator is tab, we have TSV; see more examples (ASV, USV, etc.) at in the [CSV section](file-formats.md#csvtsvasvusvetc).
* CSV IRS and ORS must be newline, and CSV IFS must be a single character. (CSV-lite does not have these restrictions.)
* TSV IRS and ORS must be newline, and TSV IFS must be a tab. (TSV-lite does not have these restrictions.)
* See the [CSV section](file-formats.md#csvtsvasvusvetc) for information about ASV and USV.
* JSON: ignores all separator flags from the command line.
* Headerless CSV overlaps quite a bit with NIDX format using comma for IFS. See also the page on [CSV with and without headers](csv-with-and-without-headers.md).
* For XTAB, the record separator is a repetition of the field separator. For example, if one record has `x=1,y=2` and the next has `x=3,y=4`, and OFS is newline, then output lines are `x 1`, then `y 2`, then an extra newline, then `x 3`, then `y 4`. This means: to customize XTAB, set `OFS` rather than `ORS`.