MRG: Add taxonomic utilities for LINs and enable tax metagenome
#2469
Codecov Report
@@ Coverage Diff @@
## latest #2469 +/- ##
==========================================
+ Coverage 84.81% 85.08% +0.27%
==========================================
Files 133 133
Lines 14814 15062 +248
Branches 2513 2585 +72
==========================================
+ Hits 12564 12816 +252
+ Misses 1948 1944 -4
Partials 302 302
We could simplify the |
random thought: what about just |
done:
done, using |
Something to think about in this PR: lin positions are 0-based throughout the code. This really only affects users in one spot: when they use the |
I added this to #2519 to avoid merge conflicts from modifying the same code. |
Very nice work!
Add taxonomic utilities for LINs; enable and test `tax metagenome`
With taxonomy refactoring (#2437, #2439, #2443, #2446, #2466, #2467), we are (mostly) no longer tied to named ranks. Here, I add a class for LIN taxonomies and use it within `tax metagenome` to allow summarization up LINs and reporting at specified `lingroups`.
With this PR, users can now use the flag `--lins` to read and use `lin` taxonomies from the provided tax (`-t`, `--taxonomy`) file. If used, `sourmash tax` will look for a `lin` column in the taxonomy file instead of looking for `superkingdom`...`strain` columns. The `lin` column should contain `;`-separated LINs, preferably with a standard number of positions (e.g. all 20 positions in length or all 10 positions in length).
For `tax metagenome`:
By default, `tax metagenome` will summarize up all available ranks/LIN positions. If a `lingroup` file is provided, we will also report a subset of this summary: just the LIN prefixes that match groups in the `lingroup` file. The `lingroup` file requires two columns in any order: `name`, the name of the group, and `lin`, the LIN prefix of the group. The prefix will be used to select results from the full summary for reporting. The `lingroup` format will build a file with the following name: `{base}.lingroup.tsv`, where `{base}` is the name provided via the `-o`, `--output-base` option.
Demo / Tutorial
A draft tutorial is available here. Note that it does not contain the installation info for this branch (see below). You can run the interactive version via binder here
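As a rough illustration of the behaviour described for `tax metagenome` above, the sketch below reads a `lin` column, sums match fractions up every LIN prefix, and then selects only the prefixes named in a `lingroup` file. It is not the code from this PR and does not use `LINLineageInfo`; the function names, the sample identifiers, and the match fractions are all hypothetical and exist only to make the idea concrete.

```python
import csv
import io
from collections import defaultdict

# Hypothetical inputs, invented for illustration only.
TAXONOMY_CSV = """ident,lin
GCF_1,0;0;0;1
GCF_2,0;0;1;0
GCF_3,1;0;0;0
"""

# Columns may appear in any order; they are looked up by header name.
LINGROUP_CSV = """lin,name
0;0,groupA
1;0;0,groupB
"""

# Hypothetical fraction of the metagenome matched to each genome.
MATCH_FRACTIONS = {"GCF_1": 0.25, "GCF_2": 0.25, "GCF_3": 0.125}


def parse_lin(lin_string):
    """Split a ';'-separated LIN into a tuple of integer positions."""
    return tuple(int(pos) for pos in lin_string.split(";"))


def read_lin_column(csv_text, key_column):
    """Map each row's key_column value to the parsed LIN from its 'lin' column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[key_column]: parse_lin(row["lin"]) for row in reader}


def summarize_up_positions(ident_to_lin, fractions):
    """Sum match fractions at every prefix of each LIN (all positions)."""
    summary = defaultdict(float)
    for ident, fraction in fractions.items():
        lin = ident_to_lin[ident]
        for end in range(1, len(lin) + 1):
            summary[lin[:end]] += fraction
    return dict(summary)


def select_lingroups(summary, name_to_prefix):
    """Keep only the summary entries whose prefix is listed as a lingroup."""
    return {name: summary.get(prefix, 0.0) for name, prefix in name_to_prefix.items()}


ident_to_lin = read_lin_column(TAXONOMY_CSV, "ident")
name_to_prefix = read_lin_column(LINGROUP_CSV, "name")
full_summary = summarize_up_positions(ident_to_lin, MATCH_FRACTIONS)
lingroup_report = select_lingroups(full_summary, name_to_prefix)

for name, fraction in lingroup_report.items():
    print(f"{name}\t{fraction}")
```

Storing each LIN as a tuple makes a prefix just a slice, so selecting a lingroup is a dictionary lookup on the full summary rather than a second pass over the genomes.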
Testing
Option A: Use the Demo Binder
You can test via the binder. You can add new cells or modify any existing cells, and even download additional files for testing. The downside is that you'll have to make sure to download and save your results, since the binder won't save them for you.
Option B: Alternatively, install on your own computer/cluster:
Here is one way to test this code before it gets fully integrated into sourmash:
`mamba`, instructions here instead.
`mamba`, replace the word `conda` with `mamba` in the following commands.
Download an environment file that points to this branch:
Create a virtual environment using this file:
Activate that environment:
Make sure `--lins` is in the `--help` for `sourmash tax metagenome`:
Command to run
The command to run is this one:
Types of files you'll need
(`ident`, `lin`)
(`name`, `lin`)
To exit the environment when you're done testing, use `conda deactivate`.
Example `lingroup` output format. Note that the `1;0`.. paths are always grouped together, but may come before or after the `0;0` and `2;0` groups.
A few implementation details:
In `tax_utils.py`, I add a `LINLineageInfo` class for using and manipulating LIN taxonomies. It implements new methods specifically for reading `LIN` taxonomies into the class, but otherwise uses the taxonomic utilities available in `BaseLineageInfo`, e.g. taxonomic summarization up ranks and assessing whether two taxonomies match at a given rank.
In `tax_utils.py`, I add functionality for reading `lingroup` information and reporting taxonomic summarization specifically at these ranks.
Changes and Additions:
`LINLineageInfo` for working with `LIN` taxonomies
`LIN`s into `LineageDB`
`LINgroups` and summarizing to these
`LineageInfo` to perform `build_tree`, `find_lca` functions (originally in `lca_utils.py`) and produce an ordered list of lineage paths
`LIN`s taxonomy in:
The following require additional changes and will be punted to an issue/separate PR (see #2499):