-
Notifications
You must be signed in to change notification settings - Fork 128
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Pick a meaning of TSV and use it consistently #1566
Comments
Making sure I understand: IANA TSVs can be used as if they're CSV-like TSVs — in other words, anything that accepts CSV-like TSVs will accept IANA TSVs without complaint. The reverse is not true: something expecting an IANA TSV will not accept a CSV-like TSV that contains a field with an embedded tab or newline. If both of those statements are accurate, settling on CSV-like TSVs feels like the most compatible option — why would we not want to do that? |
While this issue is about consistency within Augur itself, it relates to consistency within the broader Nextstrain ecosystem when using TSVs. We use tsv-utils (IANA TSVs) a lot (e.g. in our workflows, in ad-hoc data processing) but don't habitually pass inputs thru |
Thanks @tsibley for raising this issue (and teaching me something new)! I was not fully conscious of there being 2 incompatible ways libraries/tools parse and generate TSVs. Technically, RFC 4180 does not mention TSVs or any separators other than comma at all. So if we wanted to do what the most authoritative spec says, it would clearly be IANA TSV we should be using. However, it is true that many tools seem to treat TSVs as RFC 4180 CSVs with the only difference being the separator used.
That's not correct. For each type, there are files that are valid in exactly one of them: a valid single column IANA TSV line like Example: Python's csv package if configured with
The nice thing about the Python csv package is that it can be configured in both ways: compliant with IANA or compliant with RFC-CSV-TSV. So we can implement both in Augur with minimal (though breaking) changes. Whatever we decide, we break some augur command's handling of TSVs. IANA TSVs are a good fit if your input fields never contain tabs nor newlines. Arguably that's the case in bioinformatics - i'm not sure I've ever encountered a tab or newline character in any metadata field from GISAID or NCBI. In fact, I would speculate that the reason TSVs are so commonly used in bioinformatics is precisely because one doesn't need to escape commas and double quotes. What's the point of using TSVs if you end up having to escape common characters (double quotes)? I'm pretty sure that double quotes are more commonly encountered in common augur input data than tabs or newlines. Let me list all the options we have here, I think there's actually more than picking one or the other:
If we were to support both, one could also add an "auto" input mode, that dynamically decides the mode when encountering the first line that positively identifies the file as being according to IANA or RFC spec (i.e. when it encounters a line that's invalid as IANA but valid as RFC, use RFC mode and vice versa, sometimes both might be valid but have semantic differences, e.g. a I think it would really help us if we analyzed input TSV files in our repos and see whether they are a) valid for both, or b) valid only for IANA, c) valid only for RFC. With some data, it's easier to make a good decision here. Sorry for the essay here - this is quite a fascinating topic :)
It's good to know that some RFC-TSVs can be converted to IANA as long as only I wonder if it might make sense to start using non-ambiguous extensions, e.g. |
Ah, yes, you're right. Thanks for this catch!
In that example, the single double quote, You can get the output you expect when writing a value of >>> csv.writer(sys.stdout, delimiter="\t", quoting=csv.QUOTE_NONE, doublequote=False, quotechar="\t", escapechar=None).writerows([['A'],['"']]);
A
" but that's abusing >>> csv.writer(sys.stdout, delimiter="\t", quoting=csv.QUOTE_NONE, doublequote=False, quotechar="\t", escapechar=None).writerows([['A'],['\t']]);
A
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
_csv.Error: need to escape, but no escapechar set But even the dialect options I set above don't handle this case below of a single empty field: >>> csv.writer(sys.stdout, delimiter="\t", quoting=csv.QUOTE_NONE, doublequote=False, quotechar="\t", escapechar=None).writerows([['A'],['']]);
A
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
_csv.Error: single empty field record must be quoted despite that having a valid IANA representation. The
I've seen plenty of tabs and newlines and other stuff in metadata before. If we're going to emit IANA TSVs for data we read from other formats, we need to be consistent about removing/replacing those chars before serialization. Using RFC TSVs means we don't have to worry about it: the
There are other options, like the ones you've listed, yes, but I think it would be much much simpler to reason about data and files and write documentation, help folks out, etc. if we could pick one format for Augur (and really, the broader Nextstrain ecosystem more generally). |
(I've updated the issue description to differentiate between IANA TSVs and IANA-like TSVs, i.e. what we can do with Python's |
This is my main point with this issue. Currently to properly handle various Augur outputs requires knowing a priori which particular variant of TSVs it produces. It is not straightforward or clear, much less documented. We should pick one and stick to it and document it (and patterns for working with them). From my viewpoint, since Python's |
+1 |
This construction reads much clearer and cleaner. Moves the Nextclade field map directly and more conveniently into the YAML config instead of referencing a separate TSV file. Putting the field map into a separate file seemed to be only for the sake of the --kv-file (-k) interface provided by `cvstk rename2`, which we're no longer using here. For backwards compatibility, configs that reference a TSV file are still supported and will be handled on-the-fly. Note that `augur curate` commands currently emit CSV-like TSVs that are limited to be IANA-like¹ such that parsing them with tsv-utils is most appropriate, hence the switch from `csvtk cut` to `tsv-select`. ¹ See <nextstrain/augur#1566>.
This construction reads a bit clearer and cleaner. It's also a good example of how to use `augur merge`. The limitation on non-seekable streams means the rule now uses additional transient disk space, but it typically shouldn't be an issue. The way Augur's slow start up time impacts `augur merge` also contributes to a longer rule execution time, but it should be negligible in the context of the larger workflow and presumably we'll fix the slow start up eventually.¹ The output is semantically identical but has some syntactic changes re: quoting. It's worth noting that the pre-existing TSV format was _not_ IANA TSV, despite it (still) being treated as such in a few places, but was (and remains) a CSV-like TSV with some quoted fields (and some mangled quotes², e.g. the "institution" column for accession KJ556895). We really need to sort out our TSV formats³, but that's for a larger project. ¹ <nextstrain/augur#1628> ² <nextstrain/augur#1565> ³ <nextstrain/augur#1566>
This construction reads a bit clearer and cleaner. It's also a good example of how to use `augur merge`. The limitation on non-seekable streams means the workflow now uses additional transient disk space, but it typically shouldn't be an issue. The way Augur's slow start up time impacts `augur merge` also contributes to a longer rule execution time, but it should be negligible in the context of the larger workflow and presumably we'll fix the slow start up eventually.¹ The output is semantically identical but has some syntactic changes re: quoting. It's worth noting that the pre-existing TSV format was _not_ IANA TSV, despite it (still) being treated as such in a few places, but was (and remains) a CSV-like TSV with some quoted fields (and some mangled quotes², e.g. the "institution" column for accession KJ556895). We really need to sort out our TSV formats³, but that's for a larger project. ¹ <nextstrain/augur#1628> ² <nextstrain/augur#1565> ³ <nextstrain/augur#1566>
Discussion in nextstrain/measles#52 (comment), prompted me to look into how Nextclade writes TSVs. I assume it's producing CSV-like TSVs since it's using the csv::WriteBuilder and only specifies the delimiter without changing any of the other defaults such as quoting. |
@joverlee521 Ah! Yes, I concur. I'd looked into that a week or two ago and forgot to write it down/forgot I did! orz But I came to the same conclusion as you. I also validated it by finding CSV quoting in the ncov/open |
This seems settled on emitting RFC 4180 CSV-like TSVs. We should
To properly handle CSV-like TSVs, I believe our rules should be
|
This construction reads much clearer and cleaner. Moves the Nextclade field map directly and more conveniently into the YAML config instead of referencing a separate TSV file. Putting the field map into a separate file seemed to be only for the sake of the --kv-file (-k) interface provided by `cvstk rename2`, which we're no longer using here. Backwards compatibility with configs that name a TSV file is not preserved since this pathogen-repo-guide is expected to be used to stamp out new repos, and we don't have any particular process/plan for how to update previously stamped out repos. Note that `augur curate` commands currently emit CSV-like TSVs that are limited to be IANA-like¹ such that parsing them with tsv-utils is most appropriate, hence the switch from `csvtk cut` to `tsv-select`. ¹ See <nextstrain/augur#1566>. Ported-from: <nextstrain/measles@faebd64> Related-to: <nextstrain/measles#52> Related-to: <#65>
This construction reads a bit clearer and cleaner. It's also a good example of how to use `augur merge`. The limitation on non-seekable streams means the workflow now uses additional transient disk space, but it typically shouldn't be an issue. The way Augur's slow start up time impacts `augur merge` also contributes to a longer rule execution time, but it should be negligible in the context of the larger workflow and presumably we'll fix the slow start up eventually.¹ The output is semantically identical but has some syntactic changes re: quoting. It's worth noting that the pre-existing TSV format was _not_ IANA TSV, despite it (still) being treated as such in a few places, but was (and remains) a CSV-like TSV with some quoted fields. We really need to sort out our TSV formats³, but that's for a larger project. ¹ <nextstrain/augur#1628> ² <nextstrain/augur#1565> ³ <nextstrain/augur#1566> Ported-from: <nextstrain/measles@4d73b7f> Related-to: <nextstrain/measles#52> Related-to: <#65>
Proper handling of TSVs with `csvtk`/`tsv-utils` was originally recommended by @tsibley <nextstrain/augur#1566 (comment)>
@ivan-aksamentov Yeah, all MIME-y things are standardized on CRLF (e.g. SMTP, HTTP, IMAP, etc.), but agreed it would be more harmful than good here. The RFC does say:
And also note that we're already defining our TSVs as "RFC 4180-like" because the RFC doesn't technically allow for alternate separators either. But it's very common in practice. |
Copied from #1566 (comment)
This seems settled on emitting RFC 4180 CSV-like TSVs.
We should
To properly handle CSV-like TSVs, I believe our rules should be
csvtk
. The--lazy
(-l
) option andfix-quotes
/del-quotes
commands should not be necessary and should be avoided.tsv-utils
, but only after first passing input thrucsv2tsv --csv-delim $'\t'
(to convert from CSV-like TSV to IANA TSV, possibly lossily) and last passing output thrucsvtk fix-quotes --tabs
(to convert back from IANA TSV to CSV-like TSV).original issue context
We should pick between either IANA TSVs or RFC 4180 CSV-like TSVs and then be consistent about that choice everywhere. Problems arise when mixing the two, e.g. expecting one but getting the other.
IANA TSVs prohibit tabs and newlines in field values. Parsing is much simpler because of this, but generators need to ensure tabs and newlines are stripped/replaced or cause an error, lest they risk generating corrupt output. The programs in tsv-utils work with IANA TSVs.
CSV-like TSVs can represent all values without limitation. Parsing is more complicated because of quoting and escaping (but libraries exist). The programs in csvtk work with CSV-like TSVs. The Python csv module works with CSV-like TSVs by default, but can be configured to work with IANA-like TSVs instead. (IANA-like, because it will error on some outputs despite them having valid IANA TSVs forms. Some inputs may also misbehave, despite being valid.)
Augur mostly uses CSV-like TSV with the csv module configured in
QUOTE_MINIMAL
quoting mode (the default). The exception isaugur curate
commands which use IANA-like TSV with the csv module configured inQUOTE_NONE
mode, meaning it will error if field values contain a tab or newline.Nextclade outputs CSV-like TSV.
Related to:
augur curate passthru
can add double quotations #1312 (comment)csvtk
The text was updated successfully, but these errors were encountered: