Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pick a meaning of TSV and use it consistently #1566

Open
7 of 28 tasks
tsibley opened this issue Aug 2, 2024 · 12 comments
Open
7 of 28 tasks

Pick a meaning of TSV and use it consistently #1566

tsibley opened this issue Aug 2, 2024 · 12 comments

Comments

@tsibley
Copy link
Member

tsibley commented Aug 2, 2024

Copied from #1566 (comment)

This seems settled on emitting RFC 4180 CSV-like TSVs.

We should

To properly handle CSV-like TSVs, I believe our rules should be

  1. Use csvtk. The --lazy (-l) option and fix-quotes/del-quotes commands should not be necessary and should be avoided.
  2. Use tsv-utils, but only after first passing input thru csv2tsv --csv-delim $'\t' (to convert from CSV-like TSV to IANA TSV, possibly lossily) and last passing output thru csvtk fix-quotes --tabs (to convert back from IANA TSV to CSV-like TSV).

original issue context

We should pick between either IANA TSVs or RFC 4180 CSV-like TSVs and then be consistent about that choice everywhere. Problems arise when mixing the two, e.g. expecting one but getting the other.

IANA TSVs prohibit tabs and newlines in field values. Parsing is much simpler because of this, but generators need to ensure tabs and newlines are stripped/replaced or cause an error, lest they risk generating corrupt output. The programs in tsv-utils work with IANA TSVs.

CSV-like TSVs can represent all values without limitation. Parsing is more complicated because of quoting and escaping (but libraries exist). The programs in csvtk work with CSV-like TSVs. The Python csv module works with CSV-like TSVs by default, but can be configured to work with IANA-like TSVs instead. (IANA-like, because it will error on some outputs despite them having valid IANA TSVs forms. Some inputs may also misbehave, despite being valid.)

Augur mostly uses CSV-like TSV with the csv module configured in QUOTE_MINIMAL quoting mode (the default). The exception is augur curate commands which use IANA-like TSV with the csv module configured in QUOTE_NONE mode, meaning it will error if field values contain a tab or newline.

Nextclade outputs CSV-like TSV.

Related to:

@genehack
Copy link
Contributor

genehack commented Aug 2, 2024

Making sure I understand: IANA TSVs can be used as if they're CSV-like TSVs — in other words, anything that accepts CSV-like TSVs will accept IANA TSVs without complaint.

The reverse is not true: something expecting an IANA TSV will not accept a CSV-like TSV that contains a field with an embedded tab or newline.

If both of those statements are accurate, settling on CSV-like TSVs feels like the most compatible option — why would we not want to do that?

@tsibley
Copy link
Member Author

tsibley commented Aug 2, 2024

Correct. And I agree. Actually, not correct.

While this issue is about consistency within Augur itself, it relates to consistency within the broader Nextstrain ecosystem when using TSVs. We use tsv-utils (IANA TSVs) a lot (e.g. in our workflows, in ad-hoc data processing) but don't habitually pass inputs thru csv2tsv --csv-delim $'\t' to convert from CSV-like TSVs to IANA TSVs first. We might need to start doing more of that if we keep Augur producing CSV-like TSVs.

@tsibley tsibley mentioned this issue Aug 2, 2024
7 tasks
@corneliusroemer
Copy link
Member

corneliusroemer commented Aug 3, 2024

Thanks @tsibley for raising this issue (and teaching me something new)! I was not fully conscious of there being 2 incompatible ways libraries/tools parse and generate TSVs.

Technically, RFC 4180 does not mention TSVs or any separators other than comma at all. So if we wanted to do what the most authoritative spec says, it would clearly be IANA TSV we should be using. However, it is true that many tools seem to treat TSVs as RFC 4180 CSVs with the only difference being the separator used.

Making sure I understand: IANA TSVs can be used as if they're CSV-like TSVs — in other words, anything that accepts CSV-like TSVs will accept IANA TSVs without complaint.

That's not correct. For each type, there are files that are valid in exactly one of them: a valid single column IANA TSV line like " is not valid per RFC CSV specs, so there's undefined behaviour, it might be read correctly, but it might also error. Also roundtripping through RFC-CSV might not be idempotent. E.g. Implementing tools one single quote is perfectly valdouble quotes ("""") will be wrongly (and silently wrongly) interpreted by a library that expects CSV-TSVs as only containing a single double quote ".

Example: Python's csv package if configured with delimiter="\t" misinterprets the valid IANA-TSV file A\n" (one column A, two lines, non-header line contains ") as having empty string on the second line. Round tripping causes a change, the output has 2 double quotes, instead of 1.

>>> import csv; d=list(csv.reader(['A','"'],delimiter="\t")); print(d);
[['A'], ['']]
>>> import io; output=io.StringIO(); csv.writer(output, delimiter="\t").writerows(d); print(output.getvalue())
A
""

The nice thing about the Python csv package is that it can be configured in both ways: compliant with IANA or compliant with RFC-CSV-TSV. So we can implement both in Augur with minimal (though breaking) changes. Whatever we decide, we break some augur command's handling of TSVs.

IANA TSVs are a good fit if your input fields never contain tabs nor newlines. Arguably that's the case in bioinformatics - i'm not sure I've ever encountered a tab or newline character in any metadata field from GISAID or NCBI. In fact, I would speculate that the reason TSVs are so commonly used in bioinformatics is precisely because one doesn't need to escape commas and double quotes. What's the point of using TSVs if you end up having to escape common characters (double quotes)? I'm pretty sure that double quotes are more commonly encountered in common augur input data than tabs or newlines.

Let me list all the options we have here, I think there's actually more than picking one or the other:

  • A) Interpret all TSVs according to IANA. If \t (and \n within fields) are encountered, throw error, telling the user their data is not in line with IANA TSV spec and that they should use csv instead. If double quotes are encountered, treat them literally
  • B) Interpret all TSVs according to RFC spec, using Python's minimal quoting dialect, which only adds double quote escapes in case of fields containing ", \t or \n - potentially
  • C) Support both, exposing the type of TSV to use as input/output as a command line option, choose either one as default. Input and output modes could possibly be different. That would mean we'd have two options: one for input type, one for output type.

If we were to support both, one could also add an "auto" input mode, that dynamically decides the mode when encountering the first line that positively identifies the file as being according to IANA or RFC spec (i.e. when it encounters a line that's invalid as IANA but valid as RFC, use RFC mode and vice versa, sometimes both might be valid but have semantic differences, e.g. a "" field - in that case auto mode could either fail or use a default mode). Possibly, input "auto" mode could also be used to set the output mode automatically.

I think it would really help us if we analyzed input TSV files in our repos and see whether they are a) valid for both, or b) valid only for IANA, c) valid only for RFC. With some data, it's easier to make a good decision here.

Sorry for the essay here - this is quite a fascinating topic :)

We use tsv-utils (IANA TSVs) a lot (e.g. in our workflows, in ad-hoc data processing) but don't habitually pass inputs thru csv2tsv --csv-delim $'\t' to convert from CSV-like TSVs to IANA TSVs first. We might need to start doing more of that if we keep Augur producing CSV-like TSVs.

It's good to know that some RFC-TSVs can be converted to IANA as long as only " but not \t or \n are used in fields. But we should only do this if all of the downstream usage assumes IANA tsvs, otherwise, one would have to unnecessarily re-escape ".

I wonder if it might make sense to start using non-ambiguous extensions, e.g. .itsv (IANA) and .ctsv (RFC-CSV), to be explicit about which spec the file follows.

@tsibley
Copy link
Member Author

tsibley commented Aug 13, 2024

Making sure I understand: IANA TSVs can be used as if they're CSV-like TSVs — in other words, anything that accepts CSV-like TSVs will accept IANA TSVs without complaint.

That's not correct. For each type, there are files that are valid in exactly one of them: a valid single column IANA TSV line like " is not valid per RFC CSV specs, so there's undefined behaviour, it might be read correctly, but it might also error.

Ah, yes, you're right. Thanks for this catch!

Example: Python's csv package if configured with delimiter="\t" misinterprets the valid IANA-TSV file A\n" (one column A, two lines, non-header line contains ") as having empty string on the second line. Round tripping causes a change, the output has 2 double quotes, instead of 1.

>>> import csv; d=list(csv.reader(['A','"'],delimiter="\t")); print(d);
[['A'], ['']]
>>> import io; output=io.StringIO(); csv.writer(output, delimiter="\t").writerows(d); print(output.getvalue())
A
""

The nice thing about the Python csv package is that it can be configured in both ways: compliant with IANA or compliant with RFC-CSV-TSV

In that example, the single double quote, ", isn't being doubled per se. It's being parsed to an empty string and then when writing that empty string, it's represented in quotes per RFC standards.

You can get the output you expect when writing a value of " if you do more than set delimiter, for example:

>>> csv.writer(sys.stdout, delimiter="\t", quoting=csv.QUOTE_NONE, doublequote=False, quotechar="\t", escapechar=None).writerows([['A'],['"']]);
A
"

but that's abusing quotechar and will break for values with a literal tab in them (which is ~expected, but it breaks for the wrong reason (no escape char instead of no quote char)).

>>> csv.writer(sys.stdout, delimiter="\t", quoting=csv.QUOTE_NONE, doublequote=False, quotechar="\t", escapechar=None).writerows([['A'],['\t']]);
A
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
_csv.Error: need to escape, but no escapechar set

But even the dialect options I set above don't handle this case below of a single empty field:

>>> csv.writer(sys.stdout, delimiter="\t", quoting=csv.QUOTE_NONE, doublequote=False, quotechar="\t", escapechar=None).writerows([['A'],['']]);
A
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
_csv.Error: single empty field record must be quoted

despite that having a valid IANA representation.

The csv module is really meant for RFC CSVs and related formats and it's non-trivial to make it produce true IANA TSVs. Maybe even impossible.

IANA TSVs are a good fit if your input fields never contain tabs nor newlines. Arguably that's the case in bioinformatics - i'm not sure I've ever encountered a tab or newline character in any metadata field from GISAID or NCBI.

I've seen plenty of tabs and newlines and other stuff in metadata before. If we're going to emit IANA TSVs for data we read from other formats, we need to be consistent about removing/replacing those chars before serialization. Using RFC TSVs means we don't have to worry about it: the csv library will take care of it for us.

Let me list all the options we have here, I think there's actually more than picking one or the other:

There are other options, like the ones you've listed, yes, but I think it would be much much simpler to reason about data and files and write documentation, help folks out, etc. if we could pick one format for Augur (and really, the broader Nextstrain ecosystem more generally).

@tsibley
Copy link
Member Author

tsibley commented Aug 13, 2024

(I've updated the issue description to differentiate between IANA TSVs and IANA-like TSVs, i.e. what we can do with Python's csv.)

@tsibley
Copy link
Member Author

tsibley commented Sep 3, 2024

I think it would be much much simpler to reason about data and files and write documentation, help folks out, etc. if we could pick one format for Augur (and really, the broader Nextstrain ecosystem more generally).

This is my main point with this issue. Currently to properly handle various Augur outputs requires knowing a priori which particular variant of TSVs it produces. It is not straightforward or clear, much less documented. We should pick one and stick to it and document it (and patterns for working with them).

From my viewpoint, since Python's csv module is so handy and used broadly in Augur but (AFAICT) can't actually produce IANA TSVs, the easiest option is to standardize our TSVs as RFC 4180 CSV-like TSVs.

@genehack
Copy link
Contributor

genehack commented Sep 4, 2024

From my viewpoint, since Python's csv module is so handy and used broadly in Augur but (AFAICT) can't actually produce IANA TSVs, the easiest option is to standardize our TSVs as RFC 4180 CSV-like TSVs.

+1

tsibley added a commit to nextstrain/measles that referenced this issue Sep 10, 2024
This construction reads much clearer and cleaner.

Moves the Nextclade field map directly and more conveniently into the
YAML config instead of referencing a separate TSV file.  Putting the
field map into a separate file seemed to be only for the sake of the
--kv-file (-k) interface provided by `cvstk rename2`, which we're no
longer using here.  For backwards compatibility, configs that reference
a TSV file are still supported and will be handled on-the-fly.

Note that `augur curate` commands currently emit CSV-like TSVs that are
limited to be IANA-like¹ such that parsing them with tsv-utils is most
appropriate, hence the switch from `csvtk cut` to `tsv-select`.

¹ See <nextstrain/augur#1566>.
tsibley added a commit to nextstrain/measles that referenced this issue Sep 10, 2024
This construction reads a bit clearer and cleaner.  It's also a good
example of how to use `augur merge`.

The limitation on non-seekable streams means the rule now uses
additional transient disk space, but it typically shouldn't be an issue.
The way Augur's slow start up time impacts `augur merge` also
contributes to a longer rule execution time, but it should be negligible
in the context of the larger workflow and presumably we'll fix the slow
start up eventually.¹

The output is semantically identical but has some syntactic changes re:
quoting.  It's worth noting that the pre-existing TSV format was _not_
IANA TSV, despite it (still) being treated as such in a few places, but
was (and remains) a CSV-like TSV with some quoted fields (and some
mangled quotes², e.g. the "institution" column for accession KJ556895).
We really need to sort out our TSV formats³, but that's for a larger
project.

¹ <nextstrain/augur#1628>
² <nextstrain/augur#1565>
³ <nextstrain/augur#1566>
tsibley added a commit to nextstrain/measles that referenced this issue Sep 10, 2024
This construction reads a bit clearer and cleaner.  It's also a good
example of how to use `augur merge`.

The limitation on non-seekable streams means the workflow now uses
additional transient disk space, but it typically shouldn't be an issue.
The way Augur's slow start up time impacts `augur merge` also
contributes to a longer rule execution time, but it should be negligible
in the context of the larger workflow and presumably we'll fix the slow
start up eventually.¹

The output is semantically identical but has some syntactic changes re:
quoting.  It's worth noting that the pre-existing TSV format was _not_
IANA TSV, despite it (still) being treated as such in a few places, but
was (and remains) a CSV-like TSV with some quoted fields (and some
mangled quotes², e.g. the "institution" column for accession KJ556895).
We really need to sort out our TSV formats³, but that's for a larger
project.

¹ <nextstrain/augur#1628>
² <nextstrain/augur#1565>
³ <nextstrain/augur#1566>
@joverlee521
Copy link
Contributor

Discussion in nextstrain/measles#52 (comment), prompted me to look into how Nextclade writes TSVs. I assume it's producing CSV-like TSVs since it's using the csv::WriteBuilder and only specifies the delimiter without changing any of the other defaults such as quoting.

@tsibley
Copy link
Member Author

tsibley commented Sep 12, 2024

@joverlee521 Ah! Yes, I concur. I'd looked into that a week or two ago and forgot to write it down/forgot I did! orz

But I came to the same conclusion as you. I also validated it by finding CSV quoting in the ncov/open nextclade.tsv.

@tsibley
Copy link
Member Author

tsibley commented Sep 17, 2024

This seems settled on emitting RFC 4180 CSV-like TSVs.

We should

  • Ensure all Augur TSV output is compliant
  • Update workflows and other places that we manipulate TSVs to properly handle them

To properly handle CSV-like TSVs, I believe our rules should be

  1. Use csvtk. The --lazy (-l) option and fix-quotes/del-quotes commands should not be necessary and should be avoided.
  2. Use tsv-utils, but only after first passing input thru csv2tsv --csv-delim $'\t' (to convert from CSV-like TSV to IANA TSV, possibly lossily) and last passing output thru csvtk fix-quotes --tabs (to convert back from IANA TSV to CSV-like TSV).

tsibley added a commit to nextstrain/pathogen-repo-guide that referenced this issue Oct 3, 2024
This construction reads much clearer and cleaner.

Moves the Nextclade field map directly and more conveniently into the
YAML config instead of referencing a separate TSV file.  Putting the
field map into a separate file seemed to be only for the sake of the
--kv-file (-k) interface provided by `cvstk rename2`, which we're no
longer using here.  Backwards compatibility with configs that name a TSV
file is not preserved since this pathogen-repo-guide is expected to be
used to stamp out new repos, and we don't have any particular
process/plan for how to update previously stamped out repos.

Note that `augur curate` commands currently emit CSV-like TSVs that are
limited to be IANA-like¹ such that parsing them with tsv-utils is most
appropriate, hence the switch from `csvtk cut` to `tsv-select`.

¹ See <nextstrain/augur#1566>.

Ported-from: <nextstrain/measles@faebd64>
Related-to: <nextstrain/measles#52>
Related-to: <#65>
tsibley added a commit to nextstrain/pathogen-repo-guide that referenced this issue Oct 3, 2024
This construction reads a bit clearer and cleaner.  It's also a good
example of how to use `augur merge`.

The limitation on non-seekable streams means the workflow now uses
additional transient disk space, but it typically shouldn't be an issue.
The way Augur's slow start up time impacts `augur merge` also
contributes to a longer rule execution time, but it should be negligible
in the context of the larger workflow and presumably we'll fix the slow
start up eventually.¹

The output is semantically identical but has some syntactic changes re:
quoting.  It's worth noting that the pre-existing TSV format was _not_
IANA TSV, despite it (still) being treated as such in a few places, but
was (and remains) a CSV-like TSV with some quoted fields.  We really
need to sort out our TSV formats³, but that's for a larger project.

¹ <nextstrain/augur#1628>
² <nextstrain/augur#1565>
³ <nextstrain/augur#1566>

Ported-from: <nextstrain/measles@4d73b7f>
Related-to: <nextstrain/measles#52>
Related-to: <#65>
joverlee521 added a commit to nextstrain/docs.nextstrain.org that referenced this issue Nov 26, 2024
Proper handling of TSVs with `csvtk`/`tsv-utils` was originally
recommended by @tsibley

<nextstrain/augur#1566 (comment)>
@ivan-aksamentov
Copy link
Member

ivan-aksamentov commented Dec 6, 2024

While checking on whether rust-csv specifies its output format for TSV files, I noticed that RFC 4180 also requires CRLF (\r\n) line endings. Important if we want to be pedantic here. Though, in practice will probably cause more harm than good.

@tsibley
Copy link
Member Author

tsibley commented Dec 9, 2024

@ivan-aksamentov Yeah, all MIME-y things are standardized on CRLF (e.g. SMTP, HTTP, IMAP, etc.), but agreed it would be more harmful than good here. The RFC does say:

As per section 4.1.1. of RFC 2046 [3], this media type uses CRLF to denote line breaks. However, implementors should be aware that some implementations may use other values.

And also note that we're already defining our TSVs as "RFC 4180-like" because the RFC doesn't technically allow for alternate separators either. But it's very common in practice.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants