io: Parse metadata with C engine, restrict to either CSV or TSV #812
You could also use |
I'm not sure if this is an acceptable trade-off. Diving into the history, Augur has supported CSV for metadata since at least 2018 and delimiter sniffing since mid-2020. @huddlej may have thoughts here too.
I agree, @tsibley, that we can't drop CSV support. The original context for the current implementation is described in #574. There are two separate problems:
In older Augur implementations, we addressed problem 1 by inspecting the extension of the input filename. This led to the problems in #574. We opted for the convenience of pandas's delimiter sniffer in Python parser mode to solve this problem, at the expense of a slower solution to problem 2. As @fanninpm points out, we could use |
Codecov Report
Patch coverage:
Additional details and impacted files

@@            Coverage Diff             @@
##           master     #812      +/-   ##
==========================================
+ Coverage   68.39%   68.42%   +0.02%
==========================================
  Files          63       63
  Lines        6812     6818       +6
  Branches     1671     1672       +1
==========================================
+ Hits         4659     4665       +6
  Misses       1843     1843
  Partials      310      310
LGTM!
@joverlee521 I pushed a couple touch-ups (70d8150...740a330), do those LGTY?
@victorlin Changes look good! (Ignoring the unrelated failing Cram test.)
Previously, the delimiter could be arbitrary. However, all Augur subcommands that use this function only advertise compatibility with CSV and TSV, and I don't think there's a good reason to support arbitrary delimiters.
The Python engine was only used to detect the delimiter. Now that the delimiter is detected separately, use the C engine since it is faster.
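For illustration, here is a minimal sketch of detecting the delimiter separately while accepting only a comma or a tab. It is not the code from this PR: the names `VALID_DELIMITERS` and `detect_delimiter` are hypothetical, and it assumes the standard library's `csv.Sniffer` is an acceptable way to do the detection.

```python
import csv

# Assumption: only these two delimiters are accepted, matching the
# "CSV or TSV" restriction described in the commit message above.
VALID_DELIMITERS = (",", "\t")


def detect_delimiter(sample: str) -> str:
    """Return ',' or a tab for a text sample; raise ValueError otherwise.

    Hypothetical helper for illustration only.
    """
    try:
        # Restrict the sniffer to the allowed delimiters so that other
        # characters (for example ';') are never reported.
        dialect = csv.Sniffer().sniff(sample, delimiters="".join(VALID_DELIMITERS))
    except csv.Error as error:
        raise ValueError(
            "Could not determine the delimiter; expected CSV or TSV."
        ) from error
    return dialect.delimiter
```

The detected character could then be passed as `sep` to `pandas.read_csv` together with `engine="c"`, so that pandas no longer needs the slower Python engine for sniffing.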
Avoids redefining this list at each use site and prevents the copies from getting out of sync.
Description of proposed changes
See commit messages.
Related issue(s)
Thinking about this since I'm making similar changes for the augur filter database implementation.
Testing
Checklist