Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MRG: Add taxonomic utilities for LINs and enable tax metagenome #2469

Merged
merged 72 commits into from
Apr 5, 2023

Conversation

bluegenes
Copy link
Contributor

@bluegenes bluegenes commented Feb 9, 2023

Add taxonomic utilities for LINs; enable and test tax metagenome

With taxonomy refactoring (#2437, #2439, #2443, #2446, #2466, #2467), we are (mostly) no longer tied to named ranks. Here, I add a class for LIN taxonomies and use it within tax metagenome to allow summarization up LINs and reporting at specified lingroups.

With this PR, users can now use the flag --lins to read and use lin taxonomies from the provided tax (-t, --taxonomy) file. If used, sourmash tax will look for a lin column in the taxonomy file instead of looking for superkingdom...strain columns. The lin column should contain ;-separated LINs, preferably with a standard number of positions (e.g. all 20 positions in length or all 10 positions in length).

For tax metagenome:

By default, tax metagenome will summarize up all available ranks/LIN positions. If a lingroup file is provided, we will also report a subset of this summary: just the LIN prefixes that match groups in the lingroup file. The lingroup file requires two columns in any order: name, the name of the group, and lin, the lin prefix of the group. The prefix will be used to select results from the full summary for reporting. The lingroup format will build a file with the following name: {base}.lingroup.tsv, where {base} is the name provided via the -o, --output-base option.

Demo / Tutorial

A draft tutorial is available here. Note that it does not contain the installation info for this branch (see below). You can run the interactive version via binder here

Testing

Option A: Use the Demo Binder

You can test via the binder. You can add new cells or modify any existing cells, and even download additional files for testing. The downside is that you'll have to make sure to download and save your results, since the binder won't save them for you.

Option B: Alternatively, install on your own computer/cluster:

Here is one way to test this code before it gets fully integrated into sourmash:

  • If you don't have conda, I'd recommend installing mamba, instructions here instead.
    • if you do have mamba, replace the word conda with mamba in the following commands.

Download an environment file that points to this branch:

curl -JLO https://raw.githubusercontent.com/bluegenes/2023-demo-sourmash-LIN/main/sourmashLIN.yml

Create a virtual environment using this file:

conda env create -f sourmashLIN.yml

Activate that environment:

conda activate smashLIN

make sure --lins is in the --help for sourmash tax metagenome:

sourmash tax metagenome --help

Command to run

The command to run is this one:

sourmash tax metagenome -g $gather_csv  -t $taxonomy_csv \
                        --lins --lingroup $lingroups_csv

Types of files you'll need

  1. sketches of query metagenome
  2. sketches of reference genomes (database)
  3. taxonomy file with LIN information (two columns required: ident, lin)
  4. lingroup information file (two columns required: name, lin)

To exit the environment when you're done testing, use conda deactivate

Reminder, if you have mamba, you can use it in place of conda in the commands above.

example lingroup output format. Note that the 1;0.. paths are always grouped together, but may come before or after the 0;0 and 2;0 groups.

name	lin	percent_containment	num_bp_contained
lg3	2;0;0	1.56	192000
lg1	0;0;0	5.82	714000
lg2	1;0;0	5.05	620000
lg3	1;0;1	0.65	80000
lg4	1;0;1;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0	0.65	80000
name	lin	percent_containment	num_bp_contained
lg2	1;0;0	5.05	620000
lg3	1;0;1	0.65	80000
lg4	1;0;1;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0	0.65	80000
lg1	0;0;0	5.82	714000
lg3	2;0;0	1.56	192000

A few implementation details:

  • In tax_utils.py, I add a LINLineageInfo class for using and manipulated LIN taxonomies. It implements new methods to enable specifically reading in LIN taxonomies into the class, but otherwise uses the taxonomic utilities available in BaseLineageInfo, e.g. taxonomic summarization up ranks, assessing whether two taxonomies are a match at a given rank.
  • In tax_utils.py, I add functionality for reading lingroup information and reporting taxonomic summarization specifically at these ranks.

Changes and Additions:

  • Add LINLineageInfo for working with LIN taxonomies
  • Add method for reading LINs into LineageDB
  • Add methods for reading LINgroups and summarizing to these
  • Add LineageTree that can use LineageInfo to perform build_tree, find_lca functions (originally in lca_utils.py) and produce an ordered list of lineage paths
  • Add code + tests to use LINs taxonomy in:
    • tax metagenome
    • tax annotate
    • tax summarize

The following require additional changes and will be punted to an issue/separate PR (see #2499):

  • tax genome
  • tax prepare
  • tax grep

@codecov
Copy link

codecov bot commented Feb 9, 2023

Codecov Report

Merging #2469 (5804124) into latest (43ec4ec) will increase coverage by 0.27%.
The diff coverage is 98.44%.

@@            Coverage Diff             @@
##           latest    #2469      +/-   ##
==========================================
+ Coverage   84.81%   85.08%   +0.27%     
==========================================
  Files         133      133              
  Lines       14814    15062     +248     
  Branches     2513     2585      +72     
==========================================
+ Hits        12564    12816     +252     
+ Misses       1948     1944       -4     
  Partials      302      302              
Flag Coverage Δ
python 92.76% <98.44%> (+0.18%) ⬆️
rust 51.30% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
src/sourmash/lca/lca_utils.py 92.37% <ø> (ø)
src/sourmash/cli/tax/genome.py 97.36% <93.33%> (+6.45%) ⬆️
src/sourmash/cli/utils.py 98.83% <96.55%> (-1.17%) ⬇️
src/sourmash/tax/tax_utils.py 98.29% <98.76%> (+0.39%) ⬆️
src/sourmash/cli/tax/annotate.py 100.00% <100.00%> (ø)
src/sourmash/cli/tax/metagenome.py 100.00% <100.00%> (+6.66%) ⬆️
src/sourmash/cli/tax/summarize.py 100.00% <100.00%> (ø)
src/sourmash/tax/__main__.py 94.09% <100.00%> (+0.32%) ⬆️

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

Base automatically changed from alt-lindb to latest February 13, 2023 16:01
@bluegenes
Copy link
Contributor Author

bluegenes commented Mar 7, 2023

We could simplify the lingroups file a bit: perhaps shorten lingroup_name to lingroup and lingroup_prefix to lin or prefix?

@ctb
Copy link
Contributor

ctb commented Mar 7, 2023

--LINgroup_report --> lingroup_report

random thought: what about just lingroup? You're already using -F/--report-format for this, so it should be clear it's for a report.

@bluegenes
Copy link
Contributor Author

We could simplify the lingroups file a bit: perhaps shorten lingroup_name to lingroup and lingroup_prefix to lin or prefix?

done: name, lin

random thought: what about just lingroup? You're already using -F/--report-format for this, so it should be clear it's for a report.

done, using lingroup. Also modified the cli to allow --lingroup as the file, to avoid confusion between --lingroups file and "lingroup" format.

@bluegenes
Copy link
Contributor Author

Something to think about in this PR: lin positions are 0-based throughout the code. This really only affects users in one spot: when they use the --rank/--position argument with tax functions. Perhaps we should just include a reminder that lin positions are 0-based in the help for this argument?

@bluegenes
Copy link
Contributor Author

bluegenes commented Mar 9, 2023

Something to think about in this PR: lin positions are 0-based throughout the code. This really only affects users in one spot: when they use the --rank/--position argument with tax functions. Perhaps we should just include a reminder that lin positions are 0-based in the help for this argument?

I added this to #2519 to avoid merge conflicts from modifying the same code.

Copy link
Contributor

@ctb ctb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice work!

@ctb ctb mentioned this pull request Apr 5, 2023
1 task
@bluegenes bluegenes merged commit 827b897 into latest Apr 5, 2023
@bluegenes bluegenes deleted the lins-v2 branch April 5, 2023 14:00
@bluegenes bluegenes restored the lins-v2 branch April 5, 2023 14:19
@bluegenes bluegenes deleted the lins-v2 branch April 5, 2023 15:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants