MRG: Add taxonomic utilities for LINs and enable tax metagenome #2469

Merged on Apr 5, 2023 (72 commits)
7cc6e3f
fix LineagePair usage?
ctb Feb 7, 2023
95bcf8e
read in taxids if avail and use for kreport
bluegenes Feb 8, 2023
3418594
fix comment
bluegenes Feb 8, 2023
956c158
mod lineage_dict init for taxpath
bluegenes Feb 8, 2023
50619cd
use RankLineageInfo to read and lineages csv
bluegenes Feb 9, 2023
03cf9e3
addl tests
bluegenes Feb 9, 2023
01f8196
Merge branch 'latest' into alt-lindb
bluegenes Feb 9, 2023
8f722af
err if n_positions insufficient for provided lineage_str
bluegenes Jan 17, 2023
4ae1c7c
wording
bluegenes Jan 17, 2023
0db767e
test init fail
bluegenes Jan 17, 2023
a3cf4a1
fix
bluegenes Feb 9, 2023
7386e5b
fix2
bluegenes Feb 9, 2023
6b5f2cd
resolve issues from merge
bluegenes Feb 9, 2023
0f882d7
test for missing taxids; taxpath shorter than provided rank names
bluegenes Feb 9, 2023
842ba39
clarify comment
bluegenes Feb 9, 2023
ce3c991
clarify comment2
bluegenes Feb 9, 2023
87f7e50
undelete line
bluegenes Feb 10, 2023
de496c4
Merge branch 'alt-lindb' into lins-v2
bluegenes Feb 10, 2023
9b139b6
add filled_pos
bluegenes Feb 10, 2023
d72df57
read LIN into LineageDB
bluegenes Feb 10, 2023
f39aa54
actually add LIN test taxonomy
bluegenes Feb 10, 2023
360cfd9
allow LIN with tax metagenome
bluegenes Feb 10, 2023
6a80bc1
Merge branch 'latest' into lins-v2
bluegenes Feb 13, 2023
ee2bb20
actually save conflict resolution
bluegenes Feb 13, 2023
9c72740
add init LINSLineageInfo from tuples (for LineageDB compatibility)
bluegenes Feb 13, 2023
97e52cb
naming
bluegenes Feb 14, 2023
e7efbf7
tmp save
bluegenes Feb 14, 2023
d573cb1
add LINgroup summarization utilities
bluegenes Feb 15, 2023
69ed6a9
add LINgroup summarization
bluegenes Feb 15, 2023
871708b
add lingroup summarization method
bluegenes Feb 15, 2023
49558d9
add fn to read LINgroups file into dict
bluegenes Feb 15, 2023
6f26e0b
fix assigned; add full lg test
bluegenes Feb 15, 2023
a040a4b
test more lg reading failures
bluegenes Feb 15, 2023
7cb5700
test bad cli inputs
bluegenes Feb 15, 2023
10ad4e6
Merge branch 'latest' into lins-v2
bluegenes Feb 16, 2023
6e6a34c
rm print
bluegenes Feb 16, 2023
e0eff6e
Merge branch 'latest' into lins-v2
bluegenes Feb 17, 2023
acfc843
lingroup output as tsv
bluegenes Feb 17, 2023
bb1aea3
rm remaining lca_utils usage in tax
bluegenes Feb 17, 2023
7dba708
rm remaining lca_utils usage in tax main
bluegenes Feb 18, 2023
1bb6990
rm print st
bluegenes Feb 18, 2023
4558644
enable LIN for summarize to rm lca utilities
bluegenes Feb 18, 2023
6fe08f1
LIN pos for human summary; test LIN pos
bluegenes Feb 20, 2023
8fcd26b
allow completely empty LIN initialization
bluegenes Feb 20, 2023
8ebdfe2
add find_lca method to LineageInfo classes
bluegenes Feb 20, 2023
7379b02
Merge branch 'latest' into lins-v2
bluegenes Feb 28, 2023
6a4449c
enable LIN for tax annotate
bluegenes Mar 3, 2023
3402c73
punt tax genome to separate PR
bluegenes Mar 3, 2023
ad367f2
change LINs test filename
bluegenes Mar 3, 2023
2e82b19
clean up
bluegenes Mar 3, 2023
95ef04b
Merge branch 'latest' into lins-v2
bluegenes Mar 3, 2023
caa42f6
add some docs
bluegenes Mar 3, 2023
2dd45b6
MRG: LineageTree class to help with LINGroup ordering (#2496)
bluegenes Mar 6, 2023
d117fca
Merge branch 'latest' into lins-v2
bluegenes Mar 6, 2023
a08b46b
simplify linputs
bluegenes Mar 6, 2023
c57f688
allow --lins or --lin-taxonomy
bluegenes Mar 6, 2023
f21eb7c
add demo as tutorial
bluegenes Mar 7, 2023
ac32ff3
Merge branch 'latest' into lins-v2
bluegenes Mar 7, 2023
68e9afa
add data ref
bluegenes Mar 7, 2023
b89f826
fix typo
bluegenes Mar 7, 2023
edd360a
Merge branch 'latest' into lins-v2
bluegenes Mar 7, 2023
9617bea
fix typo
bluegenes Mar 7, 2023
035e4b2
better content headers
bluegenes Mar 7, 2023
39b6010
add refs for sourmash tax
bluegenes Mar 7, 2023
8f8d9a6
simplify lingroup file colnames and lingroup report name
bluegenes Mar 7, 2023
10e5dda
flex
bluegenes Mar 7, 2023
aac8669
more description
bluegenes Mar 7, 2023
7fcef3c
Merge branch 'latest' into lins-v2
bluegenes Mar 7, 2023
2751ebe
more description for tutorial
bluegenes Mar 7, 2023
e8cb7a0
better lingroup output documentation
bluegenes Mar 7, 2023
907b74c
rank arg tests
bluegenes Mar 7, 2023
5804124
Merge branch 'latest' into lins-v2
bluegenes Apr 5, 2023
36 changes: 33 additions & 3 deletions doc/command-line.md
@@ -91,6 +91,7 @@ information; these are grouped under the `sourmash tax` and
* `tax metagenome` - summarize metagenome gather results at each taxonomic rank.
* `tax genome` - summarize single-genome gather results and report most likely classification.
* `tax annotate` - annotate gather results with lineage information (no summarization or classification).
* `tax prepare` - prepare and/or combine taxonomy files.
* `tax grep` - subset taxonomies and create picklists based on taxonomy string matches.
* `tax summarize` - print summary information (counts of lineages) for a taxonomy lineages file or database.

@@ -491,7 +492,8 @@ The sourmash `tax` or `taxonomy` commands integrate taxonomic
`gather` command (we cannot combine separate `gather` runs for the
same query). For supported databases (e.g. GTDB, NCBI), we provide
taxonomy csv files, but they can also be generated for user-generated
databases. For more information, see [databases](databases.md).
databases. As of v4.8, some sourmash taxonomy commands can also use `LIN`
lineage information. For more information, see [databases](databases.md).

`tax` commands rely upon the fact that `gather` provides both the total
fraction of the query matched to each database matched, as well as a
Expand Down Expand Up @@ -530,8 +532,13 @@ sourmash tax metagenome
--taxonomy gtdb-rs202.taxonomy.v2.csv
```

There are three possible output formats, `csv_summary`, `lineage_summary`, and
`krona`.
The possible output formats are:
- `human`
- `csv_summary`
- `lineage_summary`
- `krona`
- `kreport`
- `lingroup_report`

#### `csv_summary` output format

Expand Down Expand Up @@ -707,6 +714,29 @@ example sourmash `{output-name}.kreport.txt`:
```


#### `lingroup` output format

When using LIN taxonomic information, you can optionally also provide a `lingroup` file with two required columns: `name` and `lin`. If provided, we will produce a file, `{base}.lingroups.tsv`, where `{base}` is the name provided via the `-o`, `--output-base` option. This output selects the information from the full summary that matches the LIN prefixes provided as groups.

This output format consists of four columns:
- `name` and `lin`, taken directly from the `--lingroup` file
- `percent_containment`, the total percent of the dataset contained in this lingroup and all descendants
- `num_bp_contained`, the estimated number of base pairs contained in this lingroup and all descendants

Similar to `kreport` above, we use the wording "contained" rather than "assigned," because `sourmash` assigns matches at the genome level, and the `tax` functions summarize this information.

example output:
```
name lin percent_containment num_bp_contained
lg1 0;0;0 5.82 714000
lg2 1;0;0 5.05 620000
lg3 2;0;0 1.56 192000
lg3 1;0;1 0.65 80000
lg4 1;0;1;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0 0.65 80000
```

Related lingroup subpaths will be grouped in output, but exact ordering may change between runs.
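
The prefix matching described in this section can be illustrated with a short standalone Python sketch. It does not use sourmash's internal classes: the function names (`lin_has_prefix`, `summarize_lingroups`), the tuple layout, and all input values are hypothetical, chosen only to show how a lingroup's totals would include every LIN that descends from its prefix.

```python
def lin_has_prefix(lin, prefix):
    """Return True if `lin` equals `prefix` or descends from it.

    Both arguments are ';'-separated LIN strings. Comparison is done
    position by position, so '1;0' is not treated as a prefix of '10;0'.
    """
    lin_pos = lin.split(";")
    prefix_pos = prefix.split(";")
    return lin_pos[:len(prefix_pos)] == prefix_pos


def summarize_lingroups(lingroups, genome_matches):
    """Sum containment for each lingroup prefix and all of its descendants.

    lingroups: dict mapping lingroup name -> LIN prefix
    genome_matches: list of (lin, fraction_of_query, bp) tuples; this is a
        hypothetical layout, not sourmash's actual data structure
    """
    rows = []
    for name, prefix in lingroups.items():
        matched = [m for m in genome_matches if lin_has_prefix(m[0], prefix)]
        percent = round(100 * sum(f for _, f, _ in matched), 2)
        bp = sum(b for _, _, b in matched)
        rows.append((name, prefix, percent, bp))
    return rows


# Made-up inputs for illustration only.
lingroups = {"lgA": "0;0", "lgB": "1;0;1"}
genome_matches = [
    ("0;0;0;1", 0.03, 300000),
    ("0;0;2;0", 0.02, 200000),
    ("1;0;1;0", 0.01, 100000),
    ("10;0;1;0", 0.04, 400000),
]

# Print one tab-separated row per lingroup, mirroring the column order above.
for row in summarize_lingroups(lingroups, genome_matches):
    print("\t".join(str(x) for x in row))
```

Splitting on `;` before comparing is the important design choice here: a plain string `startswith` check would wrongly place `10;0;1;0` under the prefix `1;0;1`.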

### `sourmash tax genome` - classify a genome using `gather` results

`sourmash tax genome` reports likely classification for each query,
7 changes: 7 additions & 0 deletions doc/databases.md
@@ -15,6 +15,13 @@ You can read more about the different database and index types [here](https://so

Note that the SBT and LCA databases can be used with sourmash v3.5 and later, while Zipfile collections can only be used with sourmash v4.1 and up.

## Taxonomic Information (for non-LCA databases)

For each prepared database, we have also made taxonomic information available linking each genome with its assigned lineage (`GTDB` or `NCBI` as appropriate).
For private databases, users can create their own `taxonomy` files: the critical columns are `ident`, containing the genome accession (e.g. `GCA_1234567.1`) and
a column for each taxonomic rank, `superkingdom` to `species`. If a `strain` column is provided, it will also be used.
As of v4.8, we can also use LIN taxonomic information in tax commands that accept the `--lins` flag. When `--lins` is used, `sourmash tax` commands require a `lin` column in the taxonomy file. This column should contain `;`-separated LINs, preferably with a standard number of positions (e.g. all 20 positions or all 10 positions in length). Some taxonomy commands also accept a `lingroups` file, a two-column file (`name`, `lin`) giving the name and LIN prefix of each LINgroup to be used for taxonomic summarization.
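
As a rough way to check a private LIN taxonomy file before using it, the sketch below reads the `ident` and `lin` columns with Python's standard `csv` module and reports whether all LINs have the same number of positions. The helper names (`read_lin_taxonomy`, `position_counts`) and the accessions are hypothetical, and the check that every position is an integer is an assumption of this sketch rather than a documented sourmash requirement.

```python
import csv
import io


def read_lin_taxonomy(handle):
    """Read rows with `ident` and `lin` columns.

    Returns a dict mapping ident -> tuple of LIN positions (as strings).
    """
    reader = csv.DictReader(handle)
    missing = {"ident", "lin"} - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing required column(s): {sorted(missing)}")
    lins = {}
    for row in reader:
        positions = tuple(row["lin"].split(";"))
        # Assumption of this sketch: each LIN position is a non-negative integer.
        if not all(p.isdigit() for p in positions):
            raise ValueError(f"non-numeric LIN position for {row['ident']}")
        lins[row["ident"]] = positions
    return lins


def position_counts(lins):
    """Return the set of distinct LIN lengths, to spot non-standard entries."""
    return {len(p) for p in lins.values()}


# Made-up taxonomy content; the accessions are placeholders.
example = io.StringIO(
    "ident,lin\n"
    "GCA_0000001.1,0;0;0;1\n"
    "GCA_0000002.1,0;0;2;0\n"
    "GCA_0000003.1,1;0;1\n"
)
lins = read_lin_taxonomy(example)
lengths = position_counts(lins)
if len(lengths) > 1:
    print(f"warning: LINs have differing numbers of positions: {sorted(lengths)}")
```

For a real file, the `io.StringIO` object would be replaced by `open(path, newline="")`.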

## Downloading and using the databases

All databases below can be downloaded via the command line with `curl -JLO <url>`, where `<url>` is the URL below. This will download an appropriately named file; you can name it yourself by specifying `-o <output>` to set the local filename.