diff --git a/.vimrc b/.vimrc index 97c60ada49..fb07498b29 100644 --- a/.vimrc +++ b/.vimrc @@ -1,4 +1,5 @@ map \d :w:!clear;echo Building ...; echo; make mlr map \f :w:!clear;echo Building ...; echo; make ut -map \r :w:!clear;echo Building ...; echo; make ut-scan ut-mlv +"map \r :w:!clear;echo Building ...; echo; make ut-scan ut-mlv +map \r :w:!clear;echo Building ...; echo; make ut-lib map \t :w:!clear;go test github.com/johnkerl/miller/internal/pkg/transformers/... diff --git a/docs/src/file-formats.md b/docs/src/file-formats.md index 3838cb1bca..05353ef14f 100644 --- a/docs/src/file-formats.md +++ b/docs/src/file-formats.md @@ -104,36 +104,34 @@ NIDX: implicitly numerically indexed (Unix-toolkit style) When `mlr` is invoked with the `--csv` or `--csvlite` option, key names are found on the first record and values are taken from subsequent records. This includes the case of CSV-formatted files. See [Record Heterogeneity](record-heterogeneity.md) for how Miller handles changes of field names within a single data stream. -Miller has record separator `RS` and field separator `FS`, just as `awk` does. For TSV, use `--fs tab`; to convert TSV to CSV, use `--ifs tab --ofs comma`, etc. (See also the [separators page](reference-main-separators.md).) +Miller has record separator `RS` and field separator `FS`, just as `awk` does. (See also the [separators page](reference-main-separators.md).) -**TSV (tab-separated values):** the following are synonymous pairs: +**TSV (tab-separated values):** `FS` is tab and `RS` is newline (or carriage return + linefeed for +Windows). On input, if fields have `\r`, `\n`, `\t`, or `\\`, those are decoded as carriage return, +newline, tab, and backslash, respectively. On output, the reverse is done -- for example, if a field +has an embedded newline, that newline is replaced by `\n`. -* `--tsv` and `--csv --fs tab` -* `--itsv` and `--icsv --ifs tab` -* `--otsv` and `--ocsv --ofs tab` -* `--tsvlite` and `--csvlite --fs tab` -* `--itsvlite` and `--icsvlite --ifs tab` -* `--otsvlite` and `--ocsvlite --ofs tab` +**ASV (ASCII-separated values):** the flags `--asv`, `--iasv`, `--oasv`, `--asvlite`, `--iasvlite`, and `--oasvlite` are analogous except they use ASCII FS and RS `0x1f` and `0x1e`, respectively. -**ASV (ASCII-separated values):** the flags `--asv`, `--iasv`, `--oasv`, `--asvlite`, `--iasvlite`, and `--oasvlite` are analogous except they use ASCII FS and RS 0x1f and 0x1e, respectively. - -**USV (Unicode-separated values):** likewise, the flags `--usv`, `--iusv`, `--ousv`, `--usvlite`, `--iusvlite`, and `--ousvlite` use Unicode FS and RS U+241F (UTF-8 0x0xe2909f) and U+241E (UTF-8 0xe2909e), respectively. +**USV (Unicode-separated values):** likewise, the flags `--usv`, `--iusv`, `--ousv`, `--usvlite`, `--iusvlite`, and `--ousvlite` use Unicode FS and RS `U+241F` (UTF-8 `0xe2909f`) and `U+241E` (UTF-8 `0xe2909e`), respectively. Miller's `--csv` flag supports [RFC-4180 CSV](https://tools.ietf.org/html/rfc4180). This includes CRLF line-terminators by default, regardless of platform. Here are the differences between CSV and CSV-lite: +* CSV-lite naively splits lines on newline, and fields on comma -- embedded commas and newlines are not escaped in any way. + * CSV supports [RFC-4180](https://tools.ietf.org/html/rfc4180)-style double-quoting, including the ability to have commas and/or LF/CRLF line-endings contained within an input field; CSV-lite does not.
* CSV does not allow heterogeneous data; CSV-lite does (see also [Record Heterogeneity](record-heterogeneity.md)). -* The CSV-lite input-reading code is fractionally more efficient than the CSV input-reader. +* TSV-lite is simply CSV-lite with field separator set to tab instead of comma. -Here are things they have in common: +* CSV-lite allows changing FS and/or RS to any values, perhaps multi-character. -* The ability to specify record/field separators other than the default, e.g. CR-LF vs. LF, or tab instead of comma for TSV, and so on. +* In short, use-cases for CSV-lite and TSV-lite are often found when dealing with CSV/TSV files which are formatted in some non-standard way -- you have a little more flexibility available to you. (As an example of this flexibility: ASV and USV are nothing more than CSV-lite with different values for FS and RS.) -* The `--implicit-csv-header` flag for input and the `--headerless-csv-output` flag for output. +CSV, TSV, CSV-lite, and TSV-lite have in common the `--implicit-csv-header` flag for input and the `--headerless-csv-output` flag for output. ## JSON diff --git a/docs/src/file-formats.md.in b/docs/src/file-formats.md.in index 5a17fa91e8..1fc321f1b6 100644 --- a/docs/src/file-formats.md.in +++ b/docs/src/file-formats.md.in @@ -16,36 +16,34 @@ GENMD-EOF When `mlr` is invoked with the `--csv` or `--csvlite` option, key names are found on the first record and values are taken from subsequent records. This includes the case of CSV-formatted files. See [Record Heterogeneity](record-heterogeneity.md) for how Miller handles changes of field names within a single data stream. -Miller has record separator `RS` and field separator `FS`, just as `awk` does. For TSV, use `--fs tab`; to convert TSV to CSV, use `--ifs tab --ofs comma`, etc. (See also the [separators page](reference-main-separators.md).) +Miller has record separator `RS` and field separator `FS`, just as `awk` does. (See also the [separators page](reference-main-separators.md).) -**TSV (tab-separated values):** the following are synonymous pairs: +**TSV (tab-separated values):** `FS` is tab and `RS` is newline (or carriage return + linefeed for +Windows). On input, if fields have `\r`, `\n`, `\t`, or `\\`, those are decoded as carriage return, +newline, tab, and backslash, respectively. On output, the reverse is done -- for example, if a field +has an embedded newline, that newline is replaced by `\n`. -* `--tsv` and `--csv --fs tab` -* `--itsv` and `--icsv --ifs tab` -* `--otsv` and `--ocsv --ofs tab` -* `--tsvlite` and `--csvlite --fs tab` -* `--itsvlite` and `--icsvlite --ifs tab` -* `--otsvlite` and `--ocsvlite --ofs tab` +**ASV (ASCII-separated values):** the flags `--asv`, `--iasv`, `--oasv`, `--asvlite`, `--iasvlite`, and `--oasvlite` are analogous except they use ASCII FS and RS `0x1f` and `0x1e`, respectively. -**ASV (ASCII-separated values):** the flags `--asv`, `--iasv`, `--oasv`, `--asvlite`, `--iasvlite`, and `--oasvlite` are analogous except they use ASCII FS and RS 0x1f and 0x1e, respectively. - -**USV (Unicode-separated values):** likewise, the flags `--usv`, `--iusv`, `--ousv`, `--usvlite`, `--iusvlite`, and `--ousvlite` use Unicode FS and RS U+241F (UTF-8 0x0xe2909f) and U+241E (UTF-8 0xe2909e), respectively. +**USV (Unicode-separated values):** likewise, the flags `--usv`, `--iusv`, `--ousv`, `--usvlite`, `--iusvlite`, and `--ousvlite` use Unicode FS and RS `U+241F` (UTF-8 `0xe2909f`) and `U+241E` (UTF-8 `0xe2909e`), respectively.
Miller's `--csv` flag supports [RFC-4180 CSV](https://tools.ietf.org/html/rfc4180). This includes CRLF line-terminators by default, regardless of platform. Here are the differences between CSV and CSV-lite: +* CSV-lite naively splits lines on newline, and fields on comma -- embedded commas and newlines are not escaped in any way. + * CSV supports [RFC-4180](https://tools.ietf.org/html/rfc4180)-style double-quoting, including the ability to have commas and/or LF/CRLF line-endings contained within an input field; CSV-lite does not. * CSV does not allow heterogeneous data; CSV-lite does (see also [Record Heterogeneity](record-heterogeneity.md)). -* The CSV-lite input-reading code is fractionally more efficient than the CSV input-reader. +* TSV-lite is simply CSV-lite with field separator set to tab instead of comma. -Here are things they have in common: +* CSV-lite allows changing FS and/or RS to any values, perhaps multi-character. -* The ability to specify record/field separators other than the default, e.g. CR-LF vs. LF, or tab instead of comma for TSV, and so on. +* In short, use-cases for CSV-lite and TSV-lite are often found when dealing with CSV/TSV files which are formatted in some non-standard way -- you have a little more flexibility available to you. (As an example of this flexibility: ASV and USV are nothing more than CSV-lite with different values for FS and RS.) -* The `--implicit-csv-header` flag for input and the `--headerless-csv-output` flag for output. +CSV, TSV, CSV-lite, and TSV-lite have in common the `--implicit-csv-header` flag for input and the `--headerless-csv-output` flag for output. ## JSON diff --git a/docs/src/keystroke-savers.md b/docs/src/keystroke-savers.md index c6dc27f1a8..1cc2485a12 100644 --- a/docs/src/keystroke-savers.md +++ b/docs/src/keystroke-savers.md @@ -92,11 +92,11 @@ If there's more than one input file, you can use `--mfrom`, then however many fi The following have even shorter versions: * `-c` is the same as `--csv` -* `-t` is the same as `--tsvlite` +* `-t` is the same as `--tsv` * `-j` is the same as `--json` I don't use these within these documents, since I want the docs to be self-explanatory on every page, and -I think `mlr --csv ...` explains itself better than `mlr -c ...`. Nonetheless, they're there for you to use. +I think `mlr --csv ...` explains itself better than `mlr -c ...`. Nonetheless, they're always there for you to use. ## .mlrrc file diff --git a/docs/src/keystroke-savers.md.in b/docs/src/keystroke-savers.md.in index db4d67eb34..b8cb2b3c50 100644 --- a/docs/src/keystroke-savers.md.in +++ b/docs/src/keystroke-savers.md.in @@ -37,11 +37,11 @@ GENMD-EOF The following have even shorter versions: * `-c` is the same as `--csv` -* `-t` is the same as `--tsvlite` +* `-t` is the same as `--tsv` * `-j` is the same as `--json` I don't use these within these documents, since I want the docs to be self-explanatory on every page, and -I think `mlr --csv ...` explains itself better than `mlr -c ...`. Nonetheless, they're there for you to use. +I think `mlr --csv ...` explains itself better than `mlr -c ...`. Nonetheless, they're always there for you to use. ## .mlrrc file diff --git a/docs/src/manpage.md b/docs/src/manpage.md index e7cad2062a..1650a33698 100644 --- a/docs/src/manpage.md +++ b/docs/src/manpage.md @@ -386,7 +386,7 @@ FILE-FORMAT FLAGS --oxtab Use XTAB format for output data. --pprint Use PPRINT format for input and output data. --tsv Use TSV format for input and output data. 
- --tsvlite or -t Use TSV-lite format for input and output data. + --tsv or -t Use TSV-lite format for input and output data. --usv or --usvlite Use USV format for input and output data. --xtab Use XTAB format for input and output data. -i {format name} Use format name for input data. For example: `-i csv` @@ -708,7 +708,6 @@ SEPARATOR FLAGS alignment impossible. * OPS may be multi-character for XTAB format, in which case alignment is disabled. - * TSV is simply CSV using tab as field separator (`--fs tab`). * FS/PS are ignored for markdown format; RS is used. * All FS and PS options are ignored for JSON format, since they are not relevant to the JSON format. @@ -763,6 +762,7 @@ SEPARATOR FLAGS markdown " " N/A "\n" nidx " " N/A "\n" pprint " " N/A "\n" + tsv " " N/A "\n" xtab "\n" " " "\n\n" --fs {string} Specify FS for input and output. @@ -3157,5 +3157,5 @@ SEE ALSO - 2022-02-05 MILLER(1) + 2022-02-06 MILLER(1) diff --git a/docs/src/manpage.txt b/docs/src/manpage.txt index 880960425e..a1addc1e3d 100644 --- a/docs/src/manpage.txt +++ b/docs/src/manpage.txt @@ -365,7 +365,7 @@ FILE-FORMAT FLAGS --oxtab Use XTAB format for output data. --pprint Use PPRINT format for input and output data. --tsv Use TSV format for input and output data. - --tsvlite or -t Use TSV-lite format for input and output data. + --tsv or -t Use TSV-lite format for input and output data. --usv or --usvlite Use USV format for input and output data. --xtab Use XTAB format for input and output data. -i {format name} Use format name for input data. For example: `-i csv` @@ -687,7 +687,6 @@ SEPARATOR FLAGS alignment impossible. * OPS may be multi-character for XTAB format, in which case alignment is disabled. - * TSV is simply CSV using tab as field separator (`--fs tab`). * FS/PS are ignored for markdown format; RS is used. * All FS and PS options are ignored for JSON format, since they are not relevant to the JSON format. @@ -742,6 +741,7 @@ SEPARATOR FLAGS markdown " " N/A "\n" nidx " " N/A "\n" pprint " " N/A "\n" + tsv " " N/A "\n" xtab "\n" " " "\n\n" --fs {string} Specify FS for input and output. @@ -3136,4 +3136,4 @@ SEE ALSO - 2022-02-05 MILLER(1) + 2022-02-06 MILLER(1) diff --git a/docs/src/reference-main-flag-list.md b/docs/src/reference-main-flag-list.md index a32a6615b2..f5aeb32dc2 100644 --- a/docs/src/reference-main-flag-list.md +++ b/docs/src/reference-main-flag-list.md @@ -177,7 +177,7 @@ are overridden in all cases by setting output format to `format2`. * `--oxtab`: Use XTAB format for output data. * `--pprint`: Use PPRINT format for input and output data. * `--tsv`: Use TSV format for input and output data. -* `--tsvlite or -t`: Use TSV-lite format for input and output data. +* `--tsv`: Use TSV format for input and output data. * `--usv or --usvlite`: Use USV format for input and output data. * `--xtab`: Use XTAB format for input and output data. * `-i {format name}`: Use format name for input data. For example: `-i csv` is the same as `--icsv`. @@ -405,7 +405,6 @@ Notes about all other separators: alignment impossible. * OPS may be multi-character for XTAB format, in which case alignment is disabled. -* TSV is simply CSV using tab as field separator (`--fs tab`). * FS/PS are ignored for markdown format; RS is used. * All FS and PS options are ignored for JSON format, since they are not relevant to the JSON format. 
@@ -460,6 +459,7 @@ Notes about all other separators: markdown " " N/A "\n" nidx " " N/A "\n" pprint " " N/A "\n" + tsv " " N/A "\n" xtab "\n" " " "\n\n" diff --git a/docs/src/reference-main-separators.md b/docs/src/reference-main-separators.md index 8b939dbca8..c13241e659 100644 --- a/docs/src/reference-main-separators.md +++ b/docs/src/reference-main-separators.md @@ -261,8 +261,9 @@ a:4;b:5;c:6;d:>>>,|||;<<< Notes: -* If CSV field separator is tab, we have TSV; see more examples (ASV, USV, etc.) at in the [CSV section](file-formats.md#csvtsvasvusvetc). * CSV IRS and ORS must be newline, and CSV IFS must be a single character. (CSV-lite does not have these restrictions.) +* TSV IRS and ORS must be newline, and TSV IFS must be a tab. (TSV-lite does not have these restrictions.) +* See the [CSV section](file-formats.md#csvtsvasvusvetc) for information about ASV and USV. * JSON: ignores all separator flags from the command line. * Headerless CSV overlaps quite a bit with NIDX format using comma for IFS. See also the page on [CSV with and without headers](csv-with-and-without-headers.md). * For XTAB, the record separator is a repetition of the field separator. For example, if one record has `x=1,y=2` and the next has `x=3,y=4`, and OFS is newline, then output lines are `x 1`, then `y 2`, then an extra newline, then `x 3`, then `y 4`. This means: to customize XTAB, set `OFS` rather than `ORS`. diff --git a/docs/src/reference-main-separators.md.in b/docs/src/reference-main-separators.md.in index 921b3098ce..5ed4c63c73 100644 --- a/docs/src/reference-main-separators.md.in +++ b/docs/src/reference-main-separators.md.in @@ -151,8 +151,9 @@ GENMD-EOF Notes: -* If CSV field separator is tab, we have TSV; see more examples (ASV, USV, etc.) at in the [CSV section](file-formats.md#csvtsvasvusvetc). * CSV IRS and ORS must be newline, and CSV IFS must be a single character. (CSV-lite does not have these restrictions.) +* TSV IRS and ORS must be newline, and TSV IFS must be a tab. (TSV-lite does not have these restrictions.) +* See the [CSV section](file-formats.md#csvtsvasvusvetc) for information about ASV and USV. * JSON: ignores all separator flags from the command line. * Headerless CSV overlaps quite a bit with NIDX format using comma for IFS. See also the page on [CSV with and without headers](csv-with-and-without-headers.md). * For XTAB, the record separator is a repetition of the field separator. For example, if one record has `x=1,y=2` and the next has `x=3,y=4`, and OFS is newline, then output lines are `x 1`, then `y 2`, then an extra newline, then `x 3`, then `y 4`. This means: to customize XTAB, set `OFS` rather than `ORS`. diff --git a/internal/pkg/cli/option_parse.go b/internal/pkg/cli/option_parse.go index a9abf650f0..0b6e46b76f 100644 --- a/internal/pkg/cli/option_parse.go +++ b/internal/pkg/cli/option_parse.go @@ -147,7 +147,6 @@ Notes about all other separators: alignment impossible. * OPS may be multi-character for XTAB format, in which case alignment is disabled. -* TSV is simply CSV using tab as field separator (` + "`--fs tab`" + `). * FS/PS are ignored for markdown format; RS is used. * All FS and PS options are ignored for JSON format, since they are not relevant to the JSON format. 
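Note on the separator rules above (illustrative sketch, not part of the patch): the "TSV IRS and ORS must be newline, and TSV IFS must be a tab" restriction documented above is enforced at construction time by the new TSV record-reader added later in this diff. A minimal sketch of what a hypothetical caller would see, using the constructor signature and option fields introduced in this diff; it assumes the code lives inside the Miller module, since these packages are internal:

    // Hypothetical caller, for illustration only.
    readerOptions := &cli.TReaderOptions{IFS: ",", IRS: "\n"} // IFS deliberately not a tab
    _, err := input.NewRecordReaderTSV(readerOptions, 500)
    if err != nil {
        fmt.Println(err) // for TSV, IFS cannot be altered
    }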
@@ -629,9 +628,7 @@ var FileFormatFlagSection = FlagSection{ name: "--itsv", help: "Use TSV format for input data.", parser: func(args []string, argc int, pargi *int, options *TOptions) { - options.ReaderOptions.InputFileFormat = "csv" - options.ReaderOptions.IFS = "\t" - options.ReaderOptions.ifsWasSpecified = true + options.ReaderOptions.InputFileFormat = "tsv" *pargi += 1 }, }, @@ -824,7 +821,7 @@ var FileFormatFlagSection = FlagSection{ name: "--otsv", help: "Use TSV format for output data.", parser: func(args []string, argc int, pargi *int, options *TOptions) { - options.WriterOptions.OutputFileFormat = "csv" + options.WriterOptions.OutputFileFormat = "tsv" options.WriterOptions.OFS = "\t" options.WriterOptions.ofsWasSpecified = true *pargi += 1 @@ -981,27 +978,19 @@ var FileFormatFlagSection = FlagSection{ name: "--tsv", help: "Use TSV format for input and output data.", parser: func(args []string, argc int, pargi *int, options *TOptions) { - options.ReaderOptions.InputFileFormat = "csv" - options.WriterOptions.OutputFileFormat = "csv" - options.ReaderOptions.IFS = "\t" - options.WriterOptions.OFS = "\t" - options.ReaderOptions.ifsWasSpecified = true - options.WriterOptions.ofsWasSpecified = true + options.ReaderOptions.InputFileFormat = "tsv" + options.WriterOptions.OutputFileFormat = "tsv" *pargi += 1 }, }, { - name: "--tsvlite", + name: "--tsv", help: "Use TSV-lite format for input and output data.", altNames: []string{"-t"}, parser: func(args []string, argc int, pargi *int, options *TOptions) { - options.ReaderOptions.InputFileFormat = "csvlite" - options.WriterOptions.OutputFileFormat = "csvlite" - options.ReaderOptions.IFS = "\t" - options.WriterOptions.OFS = "\t" - options.ReaderOptions.ifsWasSpecified = true - options.WriterOptions.ofsWasSpecified = true + options.ReaderOptions.InputFileFormat = "tsv" + options.WriterOptions.OutputFileFormat = "tsv" *pargi += 1 }, }, @@ -1181,11 +1170,8 @@ var FormatConversionKeystrokeSaverFlagSection = FlagSection{ suppressFlagEnumeration: true, parser: func(args []string, argc int, pargi *int, options *TOptions) { options.ReaderOptions.InputFileFormat = "csv" - options.WriterOptions.OutputFileFormat = "csv" - options.WriterOptions.OFS = "\t" + options.WriterOptions.OutputFileFormat = "tsv" options.ReaderOptions.irsWasSpecified = true - options.WriterOptions.ofsWasSpecified = true - options.WriterOptions.orsWasSpecified = true *pargi += 1 }, }, @@ -1308,12 +1294,8 @@ var FormatConversionKeystrokeSaverFlagSection = FlagSection{ // need to print a tedious 60-line list. suppressFlagEnumeration: true, parser: func(args []string, argc int, pargi *int, options *TOptions) { - options.ReaderOptions.InputFileFormat = "csv" - options.ReaderOptions.IFS = "\t" + options.ReaderOptions.InputFileFormat = "tsv" options.WriterOptions.OutputFileFormat = "csv" - options.ReaderOptions.ifsWasSpecified = true - options.ReaderOptions.irsWasSpecified = true - options.WriterOptions.orsWasSpecified = true *pargi += 1 }, }, @@ -1324,11 +1306,8 @@ var FormatConversionKeystrokeSaverFlagSection = FlagSection{ // need to print a tedious 60-line list. 
suppressFlagEnumeration: true, parser: func(args []string, argc int, pargi *int, options *TOptions) { - options.ReaderOptions.InputFileFormat = "csv" - options.ReaderOptions.IFS = "\t" + options.ReaderOptions.InputFileFormat = "tsv" options.WriterOptions.OutputFileFormat = "dkvp" - options.ReaderOptions.ifsWasSpecified = true - options.ReaderOptions.irsWasSpecified = true *pargi += 1 }, }, @@ -1339,12 +1318,9 @@ var FormatConversionKeystrokeSaverFlagSection = FlagSection{ // need to print a tedious 60-line list. suppressFlagEnumeration: true, parser: func(args []string, argc int, pargi *int, options *TOptions) { - options.ReaderOptions.InputFileFormat = "csv" - options.ReaderOptions.IFS = "\t" + options.ReaderOptions.InputFileFormat = "tsv" options.WriterOptions.OutputFileFormat = "nidx" options.WriterOptions.OFS = " " - options.ReaderOptions.ifsWasSpecified = true - options.ReaderOptions.irsWasSpecified = true options.WriterOptions.ofsWasSpecified = true *pargi += 1 }, @@ -1356,13 +1332,10 @@ var FormatConversionKeystrokeSaverFlagSection = FlagSection{ // need to print a tedious 60-line list. suppressFlagEnumeration: true, parser: func(args []string, argc int, pargi *int, options *TOptions) { - options.ReaderOptions.InputFileFormat = "csv" - options.ReaderOptions.IFS = "\t" + options.ReaderOptions.InputFileFormat = "tsv" options.WriterOptions.OutputFileFormat = "json" options.WriterOptions.WrapJSONOutputInOuterList = true options.WriterOptions.JSONOutputMultiline = true - options.ReaderOptions.ifsWasSpecified = true - options.ReaderOptions.irsWasSpecified = true *pargi += 1 }, }, @@ -1373,13 +1346,10 @@ var FormatConversionKeystrokeSaverFlagSection = FlagSection{ // need to print a tedious 60-line list. suppressFlagEnumeration: true, parser: func(args []string, argc int, pargi *int, options *TOptions) { - options.ReaderOptions.InputFileFormat = "csv" - options.ReaderOptions.IFS = "\t" + options.ReaderOptions.InputFileFormat = "tsv" options.WriterOptions.OutputFileFormat = "json" options.WriterOptions.WrapJSONOutputInOuterList = false options.WriterOptions.JSONOutputMultiline = false - options.ReaderOptions.ifsWasSpecified = true - options.ReaderOptions.irsWasSpecified = true *pargi += 1 }, }, @@ -1390,11 +1360,8 @@ var FormatConversionKeystrokeSaverFlagSection = FlagSection{ // need to print a tedious 60-line list. suppressFlagEnumeration: true, parser: func(args []string, argc int, pargi *int, options *TOptions) { - options.ReaderOptions.InputFileFormat = "csv" - options.ReaderOptions.IFS = "\t" + options.ReaderOptions.InputFileFormat = "tsv" options.WriterOptions.OutputFileFormat = "pprint" - options.ReaderOptions.ifsWasSpecified = true - options.ReaderOptions.irsWasSpecified = true *pargi += 1 }, }, @@ -1405,12 +1372,9 @@ var FormatConversionKeystrokeSaverFlagSection = FlagSection{ // need to print a tedious 60-line list. suppressFlagEnumeration: true, parser: func(args []string, argc int, pargi *int, options *TOptions) { - options.ReaderOptions.InputFileFormat = "csv" - options.ReaderOptions.IFS = "\t" + options.ReaderOptions.InputFileFormat = "tsv" options.WriterOptions.OutputFileFormat = "pprint" options.WriterOptions.BarredPprintOutput = true - options.ReaderOptions.ifsWasSpecified = true - options.ReaderOptions.irsWasSpecified = true *pargi += 1 }, }, @@ -1421,11 +1385,8 @@ var FormatConversionKeystrokeSaverFlagSection = FlagSection{ // need to print a tedious 60-line list. 
suppressFlagEnumeration: true, parser: func(args []string, argc int, pargi *int, options *TOptions) { - options.ReaderOptions.InputFileFormat = "csv" - options.ReaderOptions.IFS = "\t" + options.ReaderOptions.InputFileFormat = "tsv" options.WriterOptions.OutputFileFormat = "xtab" - options.ReaderOptions.ifsWasSpecified = true - options.ReaderOptions.irsWasSpecified = true *pargi += 1 }, }, @@ -1436,11 +1397,8 @@ var FormatConversionKeystrokeSaverFlagSection = FlagSection{ // need to print a tedious 60-line list. suppressFlagEnumeration: true, parser: func(args []string, argc int, pargi *int, options *TOptions) { - options.ReaderOptions.InputFileFormat = "csv" - options.ReaderOptions.IFS = "\t" + options.ReaderOptions.InputFileFormat = "tsv" options.WriterOptions.OutputFileFormat = "markdown" - options.ReaderOptions.ifsWasSpecified = true - options.ReaderOptions.irsWasSpecified = true *pargi += 1 }, }, @@ -1465,7 +1423,7 @@ var FormatConversionKeystrokeSaverFlagSection = FlagSection{ suppressFlagEnumeration: true, parser: func(args []string, argc int, pargi *int, options *TOptions) { options.ReaderOptions.InputFileFormat = "dkvp" - options.WriterOptions.OutputFileFormat = "csv" + options.WriterOptions.OutputFileFormat = "tsv" options.WriterOptions.OFS = "\t" options.WriterOptions.ofsWasSpecified = true options.WriterOptions.orsWasSpecified = true @@ -1585,10 +1543,7 @@ var FormatConversionKeystrokeSaverFlagSection = FlagSection{ suppressFlagEnumeration: true, parser: func(args []string, argc int, pargi *int, options *TOptions) { options.ReaderOptions.InputFileFormat = "nidx" - options.WriterOptions.OutputFileFormat = "csv" - options.WriterOptions.OFS = "\t" - options.WriterOptions.ofsWasSpecified = true - options.WriterOptions.orsWasSpecified = true + options.WriterOptions.OutputFileFormat = "tsv" *pargi += 1 }, }, @@ -1703,10 +1658,7 @@ var FormatConversionKeystrokeSaverFlagSection = FlagSection{ suppressFlagEnumeration: true, parser: func(args []string, argc int, pargi *int, options *TOptions) { options.ReaderOptions.InputFileFormat = "json" - options.WriterOptions.OutputFileFormat = "csv" - options.WriterOptions.OFS = "\t" - options.WriterOptions.ofsWasSpecified = true - options.WriterOptions.orsWasSpecified = true + options.WriterOptions.OutputFileFormat = "tsv" *pargi += 1 }, }, @@ -1805,10 +1757,7 @@ var FormatConversionKeystrokeSaverFlagSection = FlagSection{ suppressFlagEnumeration: true, parser: func(args []string, argc int, pargi *int, options *TOptions) { options.ReaderOptions.InputFileFormat = "json" - options.WriterOptions.OutputFileFormat = "csv" - options.WriterOptions.OFS = "\t" - options.WriterOptions.ofsWasSpecified = true - options.WriterOptions.orsWasSpecified = true + options.WriterOptions.OutputFileFormat = "tsv" *pargi += 1 }, }, @@ -1910,11 +1859,8 @@ var FormatConversionKeystrokeSaverFlagSection = FlagSection{ parser: func(args []string, argc int, pargi *int, options *TOptions) { options.ReaderOptions.InputFileFormat = "pprint" options.ReaderOptions.IFS = " " - options.WriterOptions.OutputFileFormat = "csv" - options.WriterOptions.OFS = "\t" + options.WriterOptions.OutputFileFormat = "tsv" options.ReaderOptions.ifsWasSpecified = true - options.WriterOptions.ofsWasSpecified = true - options.WriterOptions.orsWasSpecified = true *pargi += 1 }, }, @@ -2028,10 +1974,7 @@ var FormatConversionKeystrokeSaverFlagSection = FlagSection{ suppressFlagEnumeration: true, parser: func(args []string, argc int, pargi *int, options *TOptions) { options.ReaderOptions.InputFileFormat = 
"xtab" - options.WriterOptions.OutputFileFormat = "csv" - options.WriterOptions.OFS = "\t" - options.WriterOptions.ofsWasSpecified = true - options.WriterOptions.orsWasSpecified = true + options.WriterOptions.OutputFileFormat = "tsv" *pargi += 1 }, }, diff --git a/internal/pkg/cli/separators.go b/internal/pkg/cli/separators.go index e5e0c385f1..6a52c3f2c0 100644 --- a/internal/pkg/cli/separators.go +++ b/internal/pkg/cli/separators.go @@ -89,6 +89,7 @@ var defaultFSes = map[string]string{ "nidx": " ", "markdown": " ", "pprint": " ", + "tsv": "\t", "xtab": "\n", // todo: windows-dependent ... } @@ -100,6 +101,7 @@ var defaultPSes = map[string]string{ "markdown": "N/A", "nidx": "N/A", "pprint": "N/A", + "tsv": "N/A", "xtab": " ", } @@ -111,6 +113,7 @@ var defaultRSes = map[string]string{ "markdown": "\n", "nidx": "\n", "pprint": "\n", + "tsv": "\n", "xtab": "\n\n", // todo: maybe jettison the idea of this being alterable } @@ -122,5 +125,6 @@ var defaultAllowRepeatIFSes = map[string]bool{ "markdown": false, "nidx": false, "pprint": true, + "tsv": false, "xtab": false, } diff --git a/internal/pkg/input/record_reader_factory.go b/internal/pkg/input/record_reader_factory.go index 9edb3789e4..2a501831b5 100644 --- a/internal/pkg/input/record_reader_factory.go +++ b/internal/pkg/input/record_reader_factory.go @@ -20,6 +20,8 @@ func Create(readerOptions *cli.TReaderOptions, recordsPerBatch int64) (IRecordRe return NewRecordReaderNIDX(readerOptions, recordsPerBatch) case "pprint": return NewRecordReaderPPRINT(readerOptions, recordsPerBatch) + case "tsv": + return NewRecordReaderTSV(readerOptions, recordsPerBatch) case "xtab": return NewRecordReaderXTAB(readerOptions, recordsPerBatch) case "gen": diff --git a/internal/pkg/input/record_reader_tsv.go b/internal/pkg/input/record_reader_tsv.go new file mode 100644 index 0000000000..c7a06d2de6 --- /dev/null +++ b/internal/pkg/input/record_reader_tsv.go @@ -0,0 +1,378 @@ +package input + +import ( + "container/list" + "fmt" + "io" + "strconv" + "strings" + + "github.com/johnkerl/miller/internal/pkg/cli" + "github.com/johnkerl/miller/internal/pkg/lib" + "github.com/johnkerl/miller/internal/pkg/mlrval" + "github.com/johnkerl/miller/internal/pkg/types" +) + +// recordBatchGetterTSV points to either an explicit-TSV-header or +// implicit-TSV-header record-batch getter. 
+type recordBatchGetterTSV func( + reader *RecordReaderTSV, + linesChannel <-chan *list.List, + filename string, + context *types.Context, + errorChannel chan error, +) ( + recordsAndContexts *list.List, + eof bool, +) + +type RecordReaderTSV struct { + readerOptions *cli.TReaderOptions + recordsPerBatch int64 // distinct from readerOptions.RecordsPerBatch for join/repl + + fieldSplitter iFieldSplitter + recordBatchGetter recordBatchGetterTSV + + inputLineNumber int64 + headerStrings []string +} + +func NewRecordReaderTSV( + readerOptions *cli.TReaderOptions, + recordsPerBatch int64, +) (*RecordReaderTSV, error) { + if readerOptions.IFS != "\t" { + return nil, fmt.Errorf("for TSV, IFS cannot be altered") + } + if readerOptions.IRS != "\n" && readerOptions.IRS != "\r\n" { + return nil, fmt.Errorf("for TSV, IRS cannot be altered; LF vs CR/LF is autodetected") + } + reader := &RecordReaderTSV{ + readerOptions: readerOptions, + recordsPerBatch: recordsPerBatch, + fieldSplitter: newFieldSplitter(readerOptions), + } + if reader.readerOptions.UseImplicitCSVHeader { + reader.recordBatchGetter = getRecordBatchImplicitTSVHeader + } else { + reader.recordBatchGetter = getRecordBatchExplicitTSVHeader + } + return reader, nil +} + +func (reader *RecordReaderTSV) Read( + filenames []string, + context types.Context, + readerChannel chan<- *list.List, // list of *types.RecordAndContext + errorChannel chan error, + downstreamDoneChannel <-chan bool, // for mlr head +) { + if filenames != nil { // nil for mlr -n + if len(filenames) == 0 { // read from stdin + handle, err := lib.OpenStdin( + reader.readerOptions.Prepipe, + reader.readerOptions.PrepipeIsRaw, + reader.readerOptions.FileInputEncoding, + ) + if err != nil { + errorChannel <- err + return + } + reader.processHandle( + handle, + "(stdin)", + &context, + readerChannel, + errorChannel, + downstreamDoneChannel, + ) + } else { + for _, filename := range filenames { + handle, err := lib.OpenFileForRead( + filename, + reader.readerOptions.Prepipe, + reader.readerOptions.PrepipeIsRaw, + reader.readerOptions.FileInputEncoding, + ) + if err != nil { + errorChannel <- err + return + } + reader.processHandle( + handle, + filename, + &context, + readerChannel, + errorChannel, + downstreamDoneChannel, + ) + handle.Close() + } + } + } + readerChannel <- types.NewEndOfStreamMarkerList(&context) +} + +func (reader *RecordReaderTSV) processHandle( + handle io.Reader, + filename string, + context *types.Context, + readerChannel chan<- *list.List, // list of *types.RecordAndContext + errorChannel chan error, + downstreamDoneChannel <-chan bool, // for mlr head +) { + context.UpdateForStartOfFile(filename) + reader.inputLineNumber = 0 + reader.headerStrings = nil + + recordsPerBatch := reader.recordsPerBatch + lineScanner := NewLineScanner(handle, reader.readerOptions.IRS) + linesChannel := make(chan *list.List, recordsPerBatch) + go channelizedLineScanner(lineScanner, linesChannel, downstreamDoneChannel, recordsPerBatch) + + for { + recordsAndContexts, eof := reader.recordBatchGetter(reader, linesChannel, filename, context, errorChannel) + if recordsAndContexts.Len() > 0 { + readerChannel <- recordsAndContexts + } + if eof { + break + } + } +} + +func getRecordBatchExplicitTSVHeader( + reader *RecordReaderTSV, + linesChannel <-chan *list.List, + filename string, + context *types.Context, + errorChannel chan error, +) ( + recordsAndContexts *list.List, + eof bool, +) { + recordsAndContexts = list.New() + dedupeFieldNames := reader.readerOptions.DedupeFieldNames + + 
lines, more := <-linesChannel + if !more { + return recordsAndContexts, true + } + + for e := lines.Front(); e != nil; e = e.Next() { + line := e.Value.(string) + + reader.inputLineNumber++ + + // Check for comments-in-data feature + // TODO: function-pointer this away + if reader.readerOptions.CommentHandling != cli.CommentsAreData { + if strings.HasPrefix(line, reader.readerOptions.CommentString) { + if reader.readerOptions.CommentHandling == cli.PassComments { + recordsAndContexts.PushBack(types.NewOutputString(line+"\n", context)) + continue + } else if reader.readerOptions.CommentHandling == cli.SkipComments { + continue + } + // else comments are data + } + } + + if line == "" { + // Reset to new schema + reader.headerStrings = nil + continue + } + + fields := reader.fieldSplitter.Split(line) + + if reader.headerStrings == nil { + reader.headerStrings = fields + // Get data lines on subsequent loop iterations + } else { + if !reader.readerOptions.AllowRaggedCSVInput && len(reader.headerStrings) != len(fields) { + err := fmt.Errorf( + "mlr: TSV header/data length mismatch %d != %d "+ + "at filename %s line %d.\n", + len(reader.headerStrings), len(fields), filename, reader.inputLineNumber, + ) + errorChannel <- err + return + } + + record := mlrval.NewMlrmapAsRecord() + if !reader.readerOptions.AllowRaggedCSVInput { + for i, field := range fields { + field = lib.TSVDecodeField(field) + value := mlrval.FromDeferredType(field) + _, err := record.PutReferenceMaybeDedupe(reader.headerStrings[i], value, dedupeFieldNames) + if err != nil { + errorChannel <- err + return + } + } + } else { + nh := int64(len(reader.headerStrings)) + nd := int64(len(fields)) + n := lib.IntMin2(nh, nd) + var i int64 + for i = 0; i < n; i++ { + field := lib.TSVDecodeField(fields[i]) + value := mlrval.FromDeferredType(field) + _, err := record.PutReferenceMaybeDedupe(reader.headerStrings[i], value, dedupeFieldNames) + if err != nil { + errorChannel <- err + return + } + } + if nh < nd { + // if header shorter than data: use 1-up itoa keys + for i = nh; i < nd; i++ { + key := strconv.FormatInt(i+1, 10) + field := lib.TSVDecodeField(fields[i]) + value := mlrval.FromDeferredType(field) + _, err := record.PutReferenceMaybeDedupe(key, value, dedupeFieldNames) + if err != nil { + errorChannel <- err + return + } + } + } + if nh > nd { + // if header longer than data: use "" values + for i = nd; i < nh; i++ { + record.PutCopy(reader.headerStrings[i], mlrval.VOID) + } + } + } + + context.UpdateForInputRecord() + recordsAndContexts.PushBack(types.NewRecordAndContext(record, context)) + } + } + + return recordsAndContexts, false +} + +func getRecordBatchImplicitTSVHeader( + reader *RecordReaderTSV, + linesChannel <-chan *list.List, + filename string, + context *types.Context, + errorChannel chan error, +) ( + recordsAndContexts *list.List, + eof bool, +) { + recordsAndContexts = list.New() + dedupeFieldNames := reader.readerOptions.DedupeFieldNames + + lines, more := <-linesChannel + if !more { + return recordsAndContexts, true + } + + for e := lines.Front(); e != nil; e = e.Next() { + line := e.Value.(string) + + reader.inputLineNumber++ + + // Check for comments-in-data feature + // TODO: function-pointer this away + if reader.readerOptions.CommentHandling != cli.CommentsAreData { + if strings.HasPrefix(line, reader.readerOptions.CommentString) { + if reader.readerOptions.CommentHandling == cli.PassComments { + recordsAndContexts.PushBack(types.NewOutputString(line+"\n", context)) + continue + } else if 
reader.readerOptions.CommentHandling == cli.SkipComments { + continue + } + // else comments are data + } + } + + // This is how to do a chomp: + line = strings.TrimRight(line, reader.readerOptions.IRS) + + line = strings.TrimRight(line, "\r") + + if line == "" { + // Reset to new schema + reader.headerStrings = nil + continue + } + + fields := reader.fieldSplitter.Split(line) + + if reader.headerStrings == nil { + n := len(fields) + reader.headerStrings = make([]string, n) + for i := 0; i < n; i++ { + reader.headerStrings[i] = strconv.Itoa(i + 1) + } + } else { + if !reader.readerOptions.AllowRaggedCSVInput && len(reader.headerStrings) != len(fields) { + err := fmt.Errorf( + "mlr: TSV header/data length mismatch %d != %d "+ + "at filename %s line %d.\n", + len(reader.headerStrings), len(fields), filename, reader.inputLineNumber, + ) + errorChannel <- err + return + } + } + + record := mlrval.NewMlrmapAsRecord() + if !reader.readerOptions.AllowRaggedCSVInput { + for i, field := range fields { + field = lib.TSVDecodeField(field) + value := mlrval.FromDeferredType(field) + _, err := record.PutReferenceMaybeDedupe(reader.headerStrings[i], value, dedupeFieldNames) + if err != nil { + errorChannel <- err + return + } + } + } else { + nh := int64(len(reader.headerStrings)) + nd := int64(len(fields)) + n := lib.IntMin2(nh, nd) + var i int64 + for i = 0; i < n; i++ { + field := lib.TSVDecodeField(fields[i]) + value := mlrval.FromDeferredType(field) + _, err := record.PutReferenceMaybeDedupe(reader.headerStrings[i], value, dedupeFieldNames) + if err != nil { + errorChannel <- err + return + } + } + if nh < nd { + // if header shorter than data: use 1-up itoa keys + for i = nh; i < nd; i++ { + key := strconv.FormatInt(i+1, 10) + field := lib.TSVDecodeField(fields[i]) + value := mlrval.FromDeferredType(field) + _, err := record.PutReferenceMaybeDedupe(key, value, dedupeFieldNames) + if err != nil { + errorChannel <- err + return + } + } + } + if nh > nd { + // if header longer than data: use "" values + for i = nd; i < nh; i++ { + _, err := record.PutReferenceMaybeDedupe(reader.headerStrings[i], mlrval.VOID.Copy(), dedupeFieldNames) + if err != nil { + errorChannel <- err + return + } + } + } + } + + context.UpdateForInputRecord() + recordsAndContexts.PushBack(types.NewRecordAndContext(record, context)) + } + + return recordsAndContexts, false +} diff --git a/internal/pkg/lib/tsv_codec.go b/internal/pkg/lib/tsv_codec.go new file mode 100644 index 0000000000..2320320d0b --- /dev/null +++ b/internal/pkg/lib/tsv_codec.go @@ -0,0 +1,68 @@ +package lib + +import ( + "bytes" +) + +// * https://en.wikipedia.org/wiki/Tab-separated_values +// * https://www.iana.org/assignments/media-types/text/tab-separated-values +// \n for newline, +// \r for carriage return, +// \t for tab, +// \\ for backslash. + +// TSVDecodeField is for the TSV record-reader. +func TSVDecodeField(input string) string { + var buffer bytes.Buffer + n := len(input) + for i := 0; i < n; /* increment in loop */ { + c := input[i] + if c == '\\' && i < n-1 { + d := input[i+1] + if d == '\\' { + buffer.WriteByte('\\') + i += 2 + } else if d == 'n' { + buffer.WriteByte('\n') + i += 2 + } else if d == 'r' { + buffer.WriteByte('\r') + i += 2 + } else if d == 't' { + buffer.WriteByte('\t') + i += 2 + } else { + buffer.WriteByte(c) + i++ + } + } else { + buffer.WriteByte(c) + i++ + } + } + return buffer.String() +} + +// TSVEncodeField is for the TSV record-writer.
+func TSVEncodeField(input string) string { + var buffer bytes.Buffer + for i := 0; i < len(input); i++ { + c := input[i] + if c == '\\' { + buffer.WriteByte('\\') + buffer.WriteByte('\\') + } else if c == '\n' { + buffer.WriteByte('\\') + buffer.WriteByte('n') + } else if c == '\r' { + buffer.WriteByte('\\') + buffer.WriteByte('r') + } else if c == '\t' { + buffer.WriteByte('\\') + buffer.WriteByte('t') + } else { + buffer.WriteByte(c) + } + } + return buffer.String() +} diff --git a/internal/pkg/lib/tsv_codec_test.go b/internal/pkg/lib/tsv_codec_test.go new file mode 100644 index 0000000000..0deb6a66ea --- /dev/null +++ b/internal/pkg/lib/tsv_codec_test.go @@ -0,0 +1,35 @@ +package lib + +import ( + "testing" + + "github.com/stretchr/testify/assert" +) + +func TestTSVDecodeField(t *testing.T) { + assert.Equal(t, "", TSVDecodeField("")) + assert.Equal(t, "a", TSVDecodeField("a")) + assert.Equal(t, "abc", TSVDecodeField("abc")) + assert.Equal(t, `\`, TSVDecodeField(`\`)) + assert.Equal(t, "\n", TSVDecodeField(`\n`)) + assert.Equal(t, "\r", TSVDecodeField(`\r`)) + assert.Equal(t, "\t", TSVDecodeField(`\t`)) + assert.Equal(t, "\\", TSVDecodeField(`\\`)) + assert.Equal(t, `\n`, TSVDecodeField(`\\n`)) + assert.Equal(t, "\\\n", TSVDecodeField(`\\\n`)) + assert.Equal(t, "abc\r\ndef\r\n", TSVDecodeField(`abc\r\ndef\r\n`)) +} + +func TestTSVEncodeField(t *testing.T) { + assert.Equal(t, "", TSVEncodeField("")) + assert.Equal(t, "a", TSVEncodeField("a")) + assert.Equal(t, "abc", TSVEncodeField("abc")) + assert.Equal(t, `\\`, TSVEncodeField(`\`)) + assert.Equal(t, `\n`, TSVEncodeField("\n")) + assert.Equal(t, `\r`, TSVEncodeField("\r")) + assert.Equal(t, `\t`, TSVEncodeField("\t")) + assert.Equal(t, `\\`, TSVEncodeField("\\")) + assert.Equal(t, `\\n`, TSVEncodeField("\\n")) + assert.Equal(t, `\\\n`, TSVEncodeField("\\\n")) + assert.Equal(t, `abc\r\ndef\r\n`, TSVEncodeField("abc\r\ndef\r\n")) +} diff --git a/internal/pkg/output/record_writer_factory.go b/internal/pkg/output/record_writer_factory.go index 3649f25bb8..a48c68f995 100644 --- a/internal/pkg/output/record_writer_factory.go +++ b/internal/pkg/output/record_writer_factory.go @@ -22,6 +22,8 @@ func Create(writerOptions *cli.TWriterOptions) (IRecordWriter, error) { return NewRecordWriterNIDX(writerOptions) case "pprint": return NewRecordWriterPPRINT(writerOptions) + case "tsv": + return NewRecordWriterTSV(writerOptions) case "xtab": return NewRecordWriterXTAB(writerOptions) default: diff --git a/internal/pkg/output/record_writer_tsv.go b/internal/pkg/output/record_writer_tsv.go new file mode 100644 index 0000000000..3a7b539531 --- /dev/null +++ b/internal/pkg/output/record_writer_tsv.go @@ -0,0 +1,104 @@ +package output + +import ( + "bufio" + "fmt" + "strings" + + "github.com/johnkerl/miller/internal/pkg/cli" + "github.com/johnkerl/miller/internal/pkg/colorizer" + "github.com/johnkerl/miller/internal/pkg/lib" + "github.com/johnkerl/miller/internal/pkg/mlrval" +) + +type RecordWriterTSV struct { + writerOptions *cli.TWriterOptions + // For reporting schema changes: we print a newline and the new header + lastJoinedHeader *string + // Only write one blank line for schema changes / blank input lines + justWroteEmptyLine bool +} + +func NewRecordWriterTSV(writerOptions *cli.TWriterOptions) (*RecordWriterTSV, error) { + if writerOptions.OFS != "\t" { + return nil, fmt.Errorf("for TSV, OFS cannot be altered") + } + if writerOptions.ORS != "\n" && writerOptions.ORS != "\r\n" { + return nil, fmt.Errorf("for TSV, ORS cannot be altered") + } + return
&RecordWriterTSV{ + writerOptions: writerOptions, + lastJoinedHeader: nil, + justWroteEmptyLine: false, + }, nil +} + +func (writer *RecordWriterTSV) Write( + outrec *mlrval.Mlrmap, + bufferedOutputStream *bufio.Writer, + outputIsStdout bool, +) { + // End of record stream: nothing special for this output format + if outrec == nil { + return + } + + if outrec.IsEmpty() { + if !writer.justWroteEmptyLine { + bufferedOutputStream.WriteString(writer.writerOptions.ORS) + } + joinedHeader := "" + writer.lastJoinedHeader = &joinedHeader + writer.justWroteEmptyLine = true + return + } + + needToPrintHeader := false + joinedHeader := strings.Join(outrec.GetKeys(), ",") + if writer.lastJoinedHeader == nil || *writer.lastJoinedHeader != joinedHeader { + if writer.lastJoinedHeader != nil { + if !writer.justWroteEmptyLine { + bufferedOutputStream.WriteString(writer.writerOptions.ORS) + } + writer.justWroteEmptyLine = true + } + writer.lastJoinedHeader = &joinedHeader + needToPrintHeader = true + } + + if needToPrintHeader && !writer.writerOptions.HeaderlessCSVOutput { + for pe := outrec.Head; pe != nil; pe = pe.Next { + bufferedOutputStream.WriteString( + colorizer.MaybeColorizeKey( + lib.TSVEncodeField( + pe.Key, + ), + outputIsStdout, + ), + ) + + if pe.Next != nil { + bufferedOutputStream.WriteString(writer.writerOptions.OFS) + } + } + + bufferedOutputStream.WriteString(writer.writerOptions.ORS) + } + + for pe := outrec.Head; pe != nil; pe = pe.Next { + bufferedOutputStream.WriteString( + colorizer.MaybeColorizeValue( + lib.TSVEncodeField( + pe.Value.String(), + ), + outputIsStdout, + ), + ) + if pe.Next != nil { + bufferedOutputStream.WriteString(writer.writerOptions.OFS) + } + } + bufferedOutputStream.WriteString(writer.writerOptions.ORS) + + writer.justWroteEmptyLine = false +} diff --git a/man/manpage.txt b/man/manpage.txt index 880960425e..a1addc1e3d 100644 --- a/man/manpage.txt +++ b/man/manpage.txt @@ -365,7 +365,7 @@ FILE-FORMAT FLAGS --oxtab Use XTAB format for output data. --pprint Use PPRINT format for input and output data. --tsv Use TSV format for input and output data. - --tsvlite or -t Use TSV-lite format for input and output data. + --tsv or -t Use TSV-lite format for input and output data. --usv or --usvlite Use USV format for input and output data. --xtab Use XTAB format for input and output data. -i {format name} Use format name for input data. For example: `-i csv` @@ -687,7 +687,6 @@ SEPARATOR FLAGS alignment impossible. * OPS may be multi-character for XTAB format, in which case alignment is disabled. - * TSV is simply CSV using tab as field separator (`--fs tab`). * FS/PS are ignored for markdown format; RS is used. * All FS and PS options are ignored for JSON format, since they are not relevant to the JSON format. @@ -742,6 +741,7 @@ SEPARATOR FLAGS markdown " " N/A "\n" nidx " " N/A "\n" pprint " " N/A "\n" + tsv " " N/A "\n" xtab "\n" " " "\n\n" --fs {string} Specify FS for input and output. 
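Note on the new TSV record-writer above (illustrative sketch, not part of the patch): when the key set changes mid-stream, RecordWriterTSV.Write emits one blank line and then the new header before continuing with data lines. A hypothetical driver using only constructors and signatures introduced in this diff, with invented record values; it assumes the code lives inside the Miller module:

    // Hypothetical driver, for illustration only.
    w, _ := output.NewRecordWriterTSV(&cli.TWriterOptions{OFS: "\t", ORS: "\n"})
    buf := bufio.NewWriter(os.Stdout)

    rec1 := mlrval.NewMlrmapAsRecord()
    rec1.PutCopy("a", mlrval.FromDeferredType("1"))
    rec1.PutCopy("b", mlrval.FromDeferredType("2"))
    w.Write(rec1, buf, false) // header line "a<TAB>b", then data line "1<TAB>2"

    rec2 := mlrval.NewMlrmapAsRecord()
    rec2.PutCopy("x", mlrval.FromDeferredType("7"))
    w.Write(rec2, buf, false) // schema change: one blank line, then "x", then "7"

    w.Write(nil, buf, false) // end of stream is a no-op for this writer
    buf.Flush()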
@@ -3136,4 +3136,4 @@ SEE ALSO - 2022-02-05 MILLER(1) + 2022-02-06 MILLER(1) diff --git a/man/mlr.1 b/man/mlr.1 index 39e84305ed..21021a5198 100644 --- a/man/mlr.1 +++ b/man/mlr.1 @@ -2,12 +2,12 @@ .\" Title: mlr .\" Author: [see the "AUTHOR" section] .\" Generator: ./mkman.rb -.\" Date: 2022-02-05 +.\" Date: 2022-02-06 .\" Manual: \ \& .\" Source: \ \& .\" Language: English .\" -.TH "MILLER" "1" "2022-02-05" "\ \&" "\ \&" +.TH "MILLER" "1" "2022-02-06" "\ \&" "\ \&" .\" ----------------------------------------------------------------- .\" * Portability definitions .\" ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ @@ -444,7 +444,7 @@ are overridden in all cases by setting output format to `format2`. --oxtab Use XTAB format for output data. --pprint Use PPRINT format for input and output data. --tsv Use TSV format for input and output data. ---tsvlite or -t Use TSV-lite format for input and output data. +--tsv or -t Use TSV-lite format for input and output data. --usv or --usvlite Use USV format for input and output data. --xtab Use XTAB format for input and output data. -i {format name} Use format name for input data. For example: `-i csv` @@ -830,7 +830,6 @@ Notes about all other separators: alignment impossible. * OPS may be multi-character for XTAB format, in which case alignment is disabled. -* TSV is simply CSV using tab as field separator (`--fs tab`). * FS/PS are ignored for markdown format; RS is used. * All FS and PS options are ignored for JSON format, since they are not relevant to the JSON format. @@ -885,6 +884,7 @@ Notes about all other separators: markdown " " N/A "\en" nidx " " N/A "\en" pprint " " N/A "\en" + tsv " " N/A "\en" xtab "\en" " " "\en\en" --fs {string} Specify FS for input and output. diff --git a/test/cases/io-multi/0030/expout b/test/cases/io-multi/0030/expout index b7aa374ab7..d772fae2b1 100644 --- a/test/cases/io-multi/0030/expout +++ b/test/cases/io-multi/0030/expout @@ -1,5 +1,5 @@ -"a b i x y" -"pan pan 1 0.3467901443380824 0.7268028627434533" -"eks pan 2 0.7586799647899636 0.5221511083334797" -"wye wye 3 0.20460330576630303 0.33831852551664776" -"eks wye 4 0.38139939387114097 0.13418874328430463" +a\tb\ti\tx\ty +pan\tpan\t1\t0.3467901443380824\t0.7268028627434533 +eks\tpan\t2\t0.7586799647899636\t0.5221511083334797 +wye\twye\t3\t0.20460330576630303\t0.33831852551664776 +eks\twye\t4\t0.38139939387114097\t0.13418874328430463 diff --git a/test/cases/io-spec-tsv/0001/cmd b/test/cases/io-spec-tsv/0001/cmd new file mode 100644 index 0000000000..0344d8df8d --- /dev/null +++ b/test/cases/io-spec-tsv/0001/cmd @@ -0,0 +1 @@ +mlr --itsv --ojson cat ${CASEDIR}/data.tsv diff --git a/test/cases/io-spec-tsv/0001/data.tsv b/test/cases/io-spec-tsv/0001/data.tsv new file mode 100644 index 0000000000..497f14443c --- /dev/null +++ b/test/cases/io-spec-tsv/0001/data.tsv @@ -0,0 +1,2 @@ +a\tb,c\nd,e +1\r2,3\\4,5 diff --git a/test/cases/io-spec-tsv/0001/experr b/test/cases/io-spec-tsv/0001/experr new file mode 100644 index 0000000000..e69de29bb2 diff --git a/test/cases/io-spec-tsv/0001/expout b/test/cases/io-spec-tsv/0001/expout new file mode 100644 index 0000000000..b5678c4aa2 --- /dev/null +++ b/test/cases/io-spec-tsv/0001/expout @@ -0,0 +1,5 @@ +[ +{ + "a\\tb,c\\nd,e": "1\r2,3\\4,5" +} +] diff --git a/test/cases/io-spec-tsv/0002/cmd b/test/cases/io-spec-tsv/0002/cmd new file mode 100644 index 0000000000..2819081d36 --- /dev/null +++ b/test/cases/io-spec-tsv/0002/cmd @@ -0,0 +1 @@ +mlr --ijson --otsv cat ${CASEDIR}/data.json diff --git 
a/test/cases/io-spec-tsv/0002/data.json b/test/cases/io-spec-tsv/0002/data.json new file mode 100644 index 0000000000..b5678c4aa2 --- /dev/null +++ b/test/cases/io-spec-tsv/0002/data.json @@ -0,0 +1,5 @@ +[ +{ + "a\\tb,c\\nd,e": "1\r2,3\\4,5" +} +] diff --git a/test/cases/io-spec-tsv/0002/experr b/test/cases/io-spec-tsv/0002/experr new file mode 100644 index 0000000000..e69de29bb2 diff --git a/test/cases/io-spec-tsv/0002/expout b/test/cases/io-spec-tsv/0002/expout new file mode 100644 index 0000000000..257ae848aa --- /dev/null +++ b/test/cases/io-spec-tsv/0002/expout @@ -0,0 +1,2 @@ +a\\tb,c\\nd,e +1\r2,3\\4,5 diff --git a/todo.txt b/todo.txt index 41ad4345f6..4edafd8c76 100644 --- a/todo.txt +++ b/todo.txt @@ -2,26 +2,41 @@ RELEASES * plan 6.1.0 + ! IANA-TSV w/ \{X} ? w/ natural sort order ? strptime ? datediff et al. ? mlr join --left-fields a,b,c ? rank - ? ?foo and ??foo @ repl help o fmt/unfmt/regex doc o FAQ/examples reorg k default colors; bold/underline/reverse k array concat k format/unformat - k split + k split verb k slwin & shift-lead + m unicode string literals k 0o.. octal literals in the DSL + k codeql/codespell/goreleaser binaries/zips + k :rb + k ?foo and ??foo @ repl help + k doc-improves * plan 6.2.0 ? YAML ================================================================ FEATURES +---------------------------------------------------------------- +TSV etc + +? also: some escapes perhaps for dkvp, xtab, pprint -- ? + o nidx is a particular pure-text, leave-as-is +? try out nidx single-line w/ \001, \002 FS/PS & \n or \n\n RS + o make/publicize a shorthand for this -- ? + o --words && --lines & --paragraphs -- ? +* still need csv --lazy-quotes ---------------------------------------------------------------- * natural sort order https://github.com/facette/natsort
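A closing note on the TSV codec added in internal/pkg/lib/tsv_codec.go (illustrative, not part of the patch): the encoder and decoder are inverses for the four escapes they handle (\n, \r, \t, \\), which is what lets field values survive a TSV write/read round trip. A sketch of a round-trip test in the style of tsv_codec_test.go above; it would have to live in that same package:

    func TestTSVCodecRoundTrip(t *testing.T) {
        inputs := []string{"", "plain", "has\ttab", "has\nnewline", "has\\backslash", "crlf\r\nmix\t\\end"}
        for _, input := range inputs {
            assert.Equal(t, input, TSVDecodeField(TSVEncodeField(input)))
        }
    }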