Read a subset of metadata columns #1294
Merged

Commits (10):
- 76d036d  io: Add class interface for working with metadata files (victorlin)
- 982bc1c  filter: Stream filtered metadata to outputs directly (victorlin)
- 0c5a137  filter: Add function to extract variables from Pandas query string (victorlin)
- f2b807b  filter: Don't apply numerical conversion to columns not in query (victorlin)
- b0a0d11  filter: Add --query-columns option (victorlin)
- 3c6d090  read_metadata: Add option to read a subset of columns (victorlin)
- d0f36a1  filter: Read a subset of metadata columns (victorlin)
- dce0374  frequencies: Read a subset of metadata columns (victorlin)
- 00a600f  refine: Read a subset of metadata columns (victorlin)
- b56f699  Update changelog (victorlin)
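
Of these, `--query-columns` is the main new user-facing option in `augur filter`. Going by the commit messages, it lets you declare which metadata columns a `--query` expression relies on (and how to type them), so that type conversion is only applied where the query needs it. A hedged sketch of how that might be invoked — the `column:type` syntax and the column names here are assumptions for illustration, not copied from this diff:

```bash
# Assumed usage of --query-columns: name the columns the pandas-style --query
# expression touches, with their types. Column names are illustrative only.
augur filter \
  --metadata metadata.tsv \
  --query "coverage >= 0.95 & region == 'Africa'" \
  --query-columns coverage:float region:str \
  --output-strains filtered_strains.txt
```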
Real-world testing
I ran this against the metadata file produced by ncov-ingest (s3://nextstrain-data/files/ncov/open/metadata.tsv.zst), which has 8.5 million rows × 58 columns, using an `augur filter` command to sample 10 random sequences.
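The command was along these lines (the paths and options shown are a representative sketch, not necessarily the exact invocation that was run):

```bash
# Illustrative only: randomly subsample 10 strains from the full metadata file.
augur filter \
  --metadata metadata.tsv \
  --subsample-max-sequences 10 \
  --output-strains sampled_strains.txt
```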
This took 8m21s to run on `master` and 6m21s with the changes from this PR. Here are the profiling results before and after, which I visualized in Snakeviz. A summary:
- `to_csv` takes just a fraction of a second because the metadata for the 10 sequences is already loaded into memory.
- The example command sees a net improvement in run time: although writing time increased due to 5173cb7, reading time decreased even more due to ac23e80.
This was a "best case scenario" for these changes, though, since no metadata columns were used other than the strain names. I should probably test with `--group-by`, `--min-date`, and other options that load additional columns to get a better picture; a sketch of that kind of command is below. I did not do any memory profiling. Memory usage is not an issue without these changes, and it should be even less of an issue with them.
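To be concrete about "options that load additional columns", a hypothetical command of that shape (option values are made up for illustration):

```bash
# Illustrative: --min-date and --group-by force additional metadata columns
# (date, region) to be read, unlike the strain-only sampling above.
augur filter \
  --metadata metadata.tsv \
  --min-date 2021-01-01 \
  --group-by region year month \
  --subsample-max-sequences 10 \
  --output-strains sampled_strains.txt
```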
Did you get round to doing more testing/profiling with `--group-by`, `--min-date`, and whatnot?
No, not yet. Still planning to do so before merging.
Tested using the ncov 100k subsample as input to an `augur filter` command I grabbed from an ncov build. Run time was 16s on `master` and 7.37s with these changes (cProfile files). Summary:
- `readers.py:read`
- `indexing.py`
- `write_metadata_based_outputs`
I triggered an ncov GISAID trial run using a Docker image that includes these changes; it completed successfully in 6h 36m 55s. This is pretty much the same as another trial run two days earlier at 6h 40m 27s. I don't know how much variance there is between run times, and I don't want to compare against non-trial runs or older runs (those have additional Slack notifications and different input metadata sizes). So by this comparison alone, there doesn't seem to be a significant performance benefit for the ncov workflow with the GISAID configuration.
Hmm. It seems to me that the last time I looked at ncov's execution profile, by far the slowest step was TreeTime, so it's not altogether surprising that filter's speed doesn't have a big impact in the context of a full build.
I grabbed the `benchmarks/subsample_*` files from those runs to get a little more granular insight into the differences in wall clock time and max RSS for each subsample rule invocation. `avg(after - before)` for wall clock time was -112 s, so the changes shaved roughly 2 minutes off each subsample step on average. The equivalent for max RSS is -276 (MB, I believe).
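For reference, this kind of per-rule comparison can be computed directly from the Snakemake benchmark TSVs. The sketch below is not the script actually used: it assumes a hypothetical layout with the two runs' files under `before/benchmarks/` and `after/benchmarks/`, and the standard Snakemake benchmark columns (`s` first, `max_rss` third).

```bash
# Hypothetical layout: matching subsample_* benchmark files from the two runs.
for f in after/benchmarks/subsample_*; do
  b="before/benchmarks/$(basename "$f")"
  # Drop the header row of each TSV, then pair the after and before rows.
  paste <(tail -n +2 "$f") <(tail -n +2 "$b")
done | awk '{
  dt   += $1 - $11   # field 1: wall clock seconds ("s" column)
  drss += $3 - $13   # field 3: max RSS in MB
  n++
} END {
  printf "avg wall clock delta: %.1f s\n", dt / n
  printf "avg max RSS delta: %.1f MB\n", drss / n
}'
```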