diff --git a/doc/command-line.md b/doc/command-line.md
index 15f9591c44..e98e337d8b 100644
--- a/doc/command-line.md
+++ b/doc/command-line.md
@@ -91,6 +91,7 @@ information; these are grouped under the `sourmash tax` and
 * `tax metagenome` - summarize metagenome gather results at each taxonomic rank.
 * `tax genome` - summarize single-genome gather results and report most likely classification.
 * `tax annotate` - annotate gather results with lineage information (no summarization or classification).
+* `tax prepare` - prepare and/or combine taxonomy files.
 * `tax grep` - subset taxonomies and create picklists based on taxonomy string matches.
 * `tax summarize` - print summary information (counts of lineages) for a taxonomy lineages file or database.
@@ -491,7 +492,8 @@ The sourmash `tax` or `taxonomy` commands integrate taxonomic
  `gather` command (we cannot combine separate `gather` runs for the
  same query). For supported databases (e.g. GTDB, NCBI), we provide
  taxonomy csv files, but they can also be generated for user-generated
- databases. For more information, see [databases](databases.md).
+ databases. As of v4.8, some sourmash taxonomy commands can also use `LIN`
+ lineage information. For more information, see [databases](databases.md).

 `tax` commands rely upon the fact that `gather` provides both the total
 fraction of the query matched to each database matched, as well as a
@@ -530,8 +532,13 @@
 sourmash tax metagenome
     --taxonomy gtdb-rs202.taxonomy.v2.csv
 ```

-There are three possible output formats, `csv_summary`, `lineage_summary`, and
- `krona`.
+The possible output formats are:
+- `human`
+- `csv_summary`
+- `lineage_summary`
+- `krona`
+- `kreport`
+- `lingroup`

 #### `csv_summary` output format
@@ -707,6 +714,29 @@ example sourmash `{output-name}.kreport.txt`:
 ```

+#### `lingroup` output format
+
+When using LIN taxonomic information, you can optionally also provide a `lingroup` file with two required columns: `name` and `lin`. If provided, we will produce a file, `{base}.lingroup.tsv`, where `{base}` is the name provided via the `-o`/`--output-base` option. This output selects the entries from the full summary that match the LIN prefixes provided as groups.
+
+This output format consists of four columns:
+- `name` and `lin`, taken directly from the `--lingroup` file
+- `percent_containment`, the total percent of the dataset contained in this lingroup and all descendants
+- `num_bp_contained`, the estimated number of base pairs contained in this lingroup and all descendants
+
+Similar to `kreport` above, we use the wording "contained" rather than "assigned," because `sourmash` assigns matches at the genome level, and the `tax` functions summarize this information.
+
+example output:
+```
+name lin percent_containment num_bp_contained
+lg1 0;0;0 5.82 714000
+lg2 1;0;0 5.05 620000
+lg3 2;0;0 1.56 192000
+lg3 1;0;1 0.65 80000
+lg4 1;0;1;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0 0.65 80000
+```
+
+Related lingroup subpaths will be grouped in the output, but the exact ordering may change between runs.
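+
+Because the lingroup report is a plain tab-separated file with the four columns above, it is easy to post-process. Below is a minimal, hypothetical sketch (not part of sourmash itself) showing one way to sort a lingroup report from broadest to most specific LIN prefix using Python's standard library; the filename `metagenome.lingroup.tsv` stands in for whatever `{base}.lingroup.tsv` you produced:
+
+```
+import csv
+
+# read the lingroup report (tab-separated; columns: name, lin,
+# percent_containment, num_bp_contained)
+with open("metagenome.lingroup.tsv", newline="") as fp:
+    rows = list(csv.DictReader(fp, delimiter="\t"))
+
+# sort by the number of LIN positions in the prefix, so broader groups come first
+rows.sort(key=lambda row: len(row["lin"].split(";")))
+
+for row in rows:
+    print(row["name"], row["lin"],
+          row["percent_containment"], row["num_bp_contained"])
+```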
+ ### `sourmash tax genome` - classify a genome using `gather` results `sourmash tax genome` reports likely classification for each query, diff --git a/doc/databases.md b/doc/databases.md index 0111625306..38d9430764 100644 --- a/doc/databases.md +++ b/doc/databases.md @@ -15,6 +15,13 @@ You can read more about the different database and index types [here](https://so Note that the SBT and LCA databases can be used with sourmash v3.5 and later, while Zipfile collections can only be used with sourmash v4.1 and up. +## Taxonomic Information (for non-LCA databases) + +For each prepared database, we have also made taxonomic information available linking each genome with its assigned lineage (`GTDB` or `NCBI` as appropriate). +For private databases, users can create their own `taxonomy` files: the critical columns are `ident`, containing the genome accession (e.g. `GCA_1234567.1`) and +a column for each taxonomic rank, `superkingdom` to `species`. If a `strain` column is provided, it will also be used. +As of v4.8, we can also use LIN taxonomic information in tax commands that accept the `--lins` flag. If used, `sourmash tax` commands will require a `lin` column in the taxonomy file which should contain `;`-separated LINs, preferably with a standard number of positions (e.g. all 20 positions in length or all 10 positions in length). Some taxonomy commands also accept a `lingroups` file, which is a two-column file (`name`, `lin`) describing the name and LIN prefix of LINgroups to be used for taxonomic summarization. + ## Downloading and using the databases All databases below can be downloaded via the command line with `curl -JLO `, where `` is the URL below. This will download an appropriately named file; you can name it yourself by specify `'-o ` to specify the local filename. diff --git a/doc/tutorial-lin-taxonomy.md b/doc/tutorial-lin-taxonomy.md new file mode 100644 index 0000000000..5512a6115f --- /dev/null +++ b/doc/tutorial-lin-taxonomy.md @@ -0,0 +1,528 @@ +# Analyzing Metagenome Composition using the LIN taxonomic framework + +Tessa Pierce Ward + +March 2023 + +requires sourmash v4.8+ + +--- + +```{contents} + :depth: 2 +``` + +This tutorial uses the `sourmash taxonomy` module, which was introduced via [blog post](https://bluegenes.github.io/sourmash-tax/) +and was recently shown to perfom well for taxonomic profiling of long (and short) reads in [Evaluation of taxonomic classification and profiling methods for long-read shotgun metagenomic sequencing datasets](https://link.springer.com/article/10.1186/s12859-022-05103-0), Portik et al., 2022. + + +In this tutorial, we'll use sourmash gather to analyze metagenomes using the [LIN taxonomic framework](https://dl.acm.org/doi/pdf/10.1145/3535508.3545546). +Specifically, we will analyze plant metagenomes with a low-level pathogen spike-in. +The goal is to see if we can correctly assign the pathogen sequence to its LINgroup, which includes +all known pathogenic strains. + +- `barcode1` - highest spike-in (75 picogram/microliter pathogen DNA) +- `barcode3` - lower spike-in (7.5 picogram/microliter pathogen DNA) +- `barcode5` - no spike-in + +The pathogen is `Ralstonia solanacearum` in the `Phylum IIB sequevar 1` group. + +This data is courtesy of [The Laboratory of Plant & Atmospheric Microbiology & (Meta)Genomics](https://sites.google.com/vt.edu/lab-vinatzer/home) in collaboration with USDA APHIS. + +## Install sourmash + +First, we need to install the software! We'll use conda/mamba to do this. 
+ +The below command installs [sourmash](http://sourmash.readthedocs.io/). + +Install the software: +``` +# create a new environment +mamba create -n smash -y -c conda-forge -c bioconda sourmash +``` + +then activate the conda environment: +``` +conda activate smash +``` + +> Victory conditions: your prompt should start with +> `(smash) ` +> and you should now be able to run `sourmash` and have it output usage information!! + +## Create a working subdirectory + +Make a directory named `smash_lin`, change into it: +``` +mkdir -p ~/smash_lin +cd ~/smash_lin +``` + +Now make a couple useful folders: +``` +mkdir -p inputs +mkdir -p databases +``` + +## Download relevant data + +### First, download a database and taxonomic information + +Here, we know the spike-in is a pathogenic seqevar of Ralstonia. We will download a database +containing signatures of 27 Ralstonia genomes (pathogenic and not) and the corresponding taxonomic and lingroup information. + +``` +# database +curl -JLO https://osf.io/vxsta/download +mv ralstonia*.zip ./databases/ralstonia.zip + +# taxonomy csv +curl -JLO https://raw.githubusercontent.com/bluegenes/2023-demo-sourmash-LIN/main/databases/ralstonia-lin.taxonomy.GCA-GCF.csv +mv ralstonia-lin.taxonomy.GCA-GCF.csv ./databases + +# lingroup csv +curl -JLO https://raw.githubusercontent.com/bluegenes/2023-demo-sourmash-LIN/main/inputs/ralstonia.lingroups.csv +mv ralstonia.lingroups.csv ./databases + +ls databases # look at the database files +``` + +### Next, download pre-made sourmash signatures made from the input metagenomes + +``` +# download barcode 1 sig +curl -JLO https://osf.io/ujntr/download +mv barcode1_22142.sig.zip ./inputs/ + +# download barcode 3 signature +curl -JLO https://osf.io/2h9wx/download +mv barcode3_31543.sig.zip ./inputs + +# download barcode 5 signature +curl -JLO https://osf.io/k8nw5/download +mv barcode5_36481.sig.zip ./inputs + +# look at available input files +ls inputs +``` + +## Look at the signatures + +Let's start with the `barcode1` (highest spike-in) sample + +### First, let's look at the metagenome signature. + +By running `sourmash sig fileinfo`, we can see information on the signatures available within the zip file. + +Here, you can see I've generated the metagenome signature with `scaled=1000` and built two ksizes, `k=31` and `k=51` + +Run: +``` +sourmash sig fileinfo ./inputs/barcode1_22142.sig.zip +``` + +In the output, you should see: +``` +** loading from './inputs/barcode1_22142.sig.zip' +path filetype: ZipFileLinearIndex +location: /home/jovyan/smash_lin/inputs/barcode1_22142.sig.zip +is database? yes +has manifest? yes +num signatures: 2 +total hashes: 914328 +summary of sketches: + 1 sketches with DNA, k=31, scaled=1000, abund 426673 total hashes + 1 sketches with DNA, k=51, scaled=1000, abund 487655 total hashes +``` + +### We can also look at the database + +Here, you can see I've generated the database with `scaled=1000` and built three ksizes, `k=21`, `k=31` and `k=51` + +Run: +``` +sourmash sig fileinfo ./databases/ralstonia.zip +``` + +In the output, you should see: + +``` +** loading from './databases/ralstonia.zip' +path filetype: ZipFileLinearIndex +location: /home/jovyan/databases/ralstonia.zip +is database? yes +has manifest? yes +num signatures: 81 +** examining manifest... 
+total hashes: 445041 +summary of sketches: + 27 sketches with DNA, k=21, scaled=1000, abund 148324 total hashes + 27 sketches with DNA, k=31, scaled=1000, abund 148111 total hashes + 27 sketches with DNA, k=51, scaled=1000, abund 148606 total hashes +``` +There's a lot of things to digest in this output but the two main ones are: +* there are 27 genomes represented in this database, each of which are sketched at k=21,k=31,k=51 +* this database represents ~445 *million* k-mers (multiply number of hashes by the scaled number) + + +## Run sourmash gather using ksize 51 + +Now let's run `sourmash gather` to find the closest reference genome(s) in the database. +If you want to read more about what sourmash is doing, please see [Lightweight compositional analysis of metagenomes with FracMinHash and minimum metagenome covers](https://www.biorxiv.org/content/10.1101/2022.01.11.475838v2), Irber et al., 2022. + +Run: +``` +query="inputs/barcode1_22142.sig.zip" +database="databases/ralstonia.zip" + +gather_csv_output="barcode1_22141.k51.gather.csv" + +sourmash gather $query $database -k 51 -o $gather_csv_output +``` + +You should see the following output: +``` +selecting specified query k=51 +loaded query: barcode1_22142... (k=51, DNA) +--ading from 'databases/ralstonia.zip'... +loaded 81 total signatures from 1 locations. +after selecting signatures compatible with search, 27 remain. +Starting prefetch sweep across databases. + +Found 7 signatures via prefetch; now doing gather. + +overlap p_query p_match avg_abund +--------- ------- ------- --------- +105.0 kbp 0.0% 2.0% 1.0 GCA_002251655.1 Ralstonia solanacear... +found less than 50.0 kbp in common. => exiting + +found 1 matches total; +the recovered matches hit 0.0% of the abundance-weighted query. +the recovered matches hit 0.0% of the query k-mers (unweighted). +``` + +The first step of gather found all potential matches (7), and the greedy algorithm narrowed this to a single best match, `GCA_002251655.1` which shared an estimated 105 kbp with the metagenome (a very small percentage of the total dataset.) This is expected, though, since the dataset is a plant metagenome with a small `Ralstonia` spike-in. + +## Add taxonomic information and summarize up lingroups + +`sourmash gather` finds the smallest set of reference genomes that contains all the known information (k-mers) in the metagenome. In most cases, `gather` will find many metagenome matches. Here, we're only looking for `Ralstonia` matches and we only have a single gather result. Regardless, let's use `sourmash tax metagenome` to add taxonomic information and see if we've correctly assigned the pathogenic sequence. + +### First, let's look at the relevant taxonomy files. + +These commands will show the first few lines of each file. If you prefer, you can look at a more human-friendly view by opening the files in a spreadsheet program. + +- **taxonomy_csv:** `databases/ralstonia-lin.taxonomy.GCA-GCF.csv` + - the essential columns are `lin` (`14;1;0;...`) and `ident` (`GCF_00`...) 
+- **lingroups information:** `databases/ralstonia.lingroups.csv`
+  - both columns are essential (`name`, `lin`)
+
+
+Look at the taxonomy file:
+```
+head -n 5 databases/ralstonia-lin.taxonomy.GCA-GCF.csv
+```
+
+You should see:
+```
+lin,species,strain,filename,accession,ident
+14;1;0;0;0;0;0;0;0;0;6;0;1;0;1;0;0;0;0;0,Ralstonia solanacearum,OE1_1,GCF_001879565.1_ASM187956v1_genomic.fna,GCF_001879565.1,GCF_001879565.1
+14;1;0;0;0;0;0;0;0;0;6;0;1;0;0;0;0;0;0;0,Ralstonia solanacearum,PSS1308,GCF_001870805.1_ASM187080v1_genomic.fna,GCF_001870805.1,GCF_001870805.1
+14;1;0;0;0;0;0;0;0;0;2;1;0;0;0;0;0;0;0;0,Ralstonia solanacearum,FJAT_1458,GCF_001887535.1_ASM188753v1_genomic.fna,GCF_001887535.1,GCF_001887535.1
+14;1;0;0;0;0;0;0;0;0;2;0;0;4;4;0;0;0;0;0,Ralstonia solanacearum,Pe_13,GCF_012062595.1_ASM1206259v1_genomic.fna,GCF_012062595.1,GCF_012062595.1
+```
+> The key columns are:
+> - `ident`, containing identifiers matching the database sketches
+> - `lin`, containing the LIN lineage assignment.
+
+Now, let's look at the lingroups file:
+```
+head -n5 databases/ralstonia.lingroups.csv
+```
+
+You should see:
+```
+name,lin
+Phyl II,14;1;0;0;0;3;0
+Phyl IIA,14;1;0;0;0;3;0;1;0;0
+Phyl IIB,14;1;0;0;0;3;0;0
+Phyl IIB seq1 and seq2,14;1;0;0;0;3;0;0;0;0;1;0;0;0;0
+```
+> Here, we have two columns:
+> - `name` - the name for each lingroup.
+> - `lin` - the LIN prefix corresponding to each group.
+
+
+### Now, run `sourmash tax metagenome` to integrate taxonomic information into `gather` results
+
+Using the `gather` output we generated above, we can integrate taxonomic information and summarize up "ranks" (LIN positions). We can produce several different types of outputs, including a `lingroup` report.
+
+`lingroup` format summarizes the taxonomic information at each `lingroup`, and produces a report with four columns:
+- `name` (from the lingroups file)
+- `lin` (from the lingroups file)
+- `percent_containment` - total % of the file matched to this lingroup
+- `num_bp_contained` - estimated number of bp matched to this lingroup
+
+> Since sourmash assigns all k-mers to individual genomes, no reads/base pairs are "assigned" to higher taxonomic ranks or lingroups (as they would be with Kraken-style LCA). Here, "percent_containment" and "num_bp_contained" are calculated by summarizing the assignments made to all genomes in a lingroup. This is akin to the "contained" information in Kraken-style reports.
+
+Run `tax metagenome`:
+```
+gather_csv_output="barcode1_22141.k51.gather.csv"
+taxonomy_csv="databases/ralstonia-lin.taxonomy.GCA-GCF.csv"
+lingroups_csv="databases/ralstonia.lingroups.csv"
+
+sourmash tax metagenome -g $gather_csv_output -t $taxonomy_csv \
+                        --lins --lingroup $lingroups_csv
+```
+
+You should see:
+```
+loaded 1 gather results from 'barcode1_22141.k51.gather.csv'.
+loaded results for 1 queries from 1 gather CSVs
+Starting summarization up rank(s): 19, 18, 17, 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0
+Read 11 lingroup rows and found 11 distinct lingroup prefixes.
+```
+
+and the results:
+```
+name lin percent_containment num_bp_contained
+Phyl II 14;1;0;0;0;3;0 0.02 108000
+Phyl IIB 14;1;0;0;0;3;0;0 0.02 108000
+Phyl IIB seq1 and seq2 14;1;0;0;0;3;0;0;0;0;1;0;0;0;0 0.02 108000
+IIB seq1 14;1;0;0;0;3;0;0;0;0;1;0;0;0;0;0;0 0.02 108000
+```
+
+Here, the most specific lingroup we assign to is `Phyl IIB seq1`, which is the pathogenic lingroup that was spiked in, yay! Note that the other groups in the output all contain this group.
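+
+To make the lingroup summarization concrete, here is a minimal sketch of the prefix-containment idea in Python (an illustration of the calculation, not the code `sourmash tax metagenome` actually runs). The `genome_assignments` dictionary is hypothetical and simply re-uses the single genome-level result from this example:
+
+```
+# hypothetical per-genome summary: LIN -> (percent of dataset, bp contained)
+genome_assignments = {
+    "14;1;0;0;0;3;0;0;0;0;1;0;0;0;0;0;0": (0.02, 108000),
+}
+
+lingroups = {
+    "Phyl II": "14;1;0;0;0;3;0",
+    "Phyl IIB": "14;1;0;0;0;3;0;0",
+    "Phyl IIB seq1 and seq2": "14;1;0;0;0;3;0;0;0;0;1;0;0;0;0",
+    "IIB seq1": "14;1;0;0;0;3;0;0;0;0;1;0;0;0;0;0;0",
+}
+
+def in_lingroup(lin, prefix):
+    # a genome belongs to a lingroup if its LIN starts with the group's prefix
+    lin_positions = lin.split(";")
+    prefix_positions = prefix.split(";")
+    return lin_positions[:len(prefix_positions)] == prefix_positions
+
+for name, prefix in lingroups.items():
+    members = [v for lin, v in genome_assignments.items() if in_lingroup(lin, prefix)]
+    percent = sum(p for p, bp in members)
+    bp = sum(bp for p, bp in members)
+    print(name, prefix, percent, bp)
+```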
+ + +#### Now output the lingroup report to a file (instead of to the terminal) + +use `-o` to provide an output basename for taxonomic output. + +``` +gather_csv_output="barcode1_22141.k51.gather.csv" +taxonomy_csv="databases/ralstonia-lin.taxonomy.GCA-GCF.csv" +lingroups_csv="databases/ralstonia.lingroups.csv" + +sourmash tax metagenome -g $gather_csv_output -t $taxonomy_csv \ + --lins --lingroup $lingroups_csv \ + -o "barcode1" +``` + +> You should see `saving 'lingroup' output to 'barcode1.lingroup.tsv'` in the output. + +#### Optionally, write multiple output formats + +You can use `-F` to specify additional output formats. Here, I've added `csv_summary`. Note that while the `lingroup` format will be generated automatically if you specify the `--lingroup` file, you can also specify it with `-F lingroup` if you want, as I've done here. + +Run: +``` +gather_csv_output="barcode1_22141.k51.gather.csv" +taxonomy_csv="databases/ralstonia-lin.taxonomy.GCA-GCF.csv" +lingroups_csv="databases/ralstonia.lingroups.csv" + +sourmash tax metagenome -g $gather_csv_output -t $taxonomy_csv \ + --lins --lingroup $lingroups_csv \ + -F lingroup csv_summary -o "barcode1" +``` + + +You should see the following in the output: + +``` +saving 'csv_summary' output to 'barcode1.summarized.csv'. +saving 'lingroup' output to 'barcode1.lingroup.txt'. +``` + +The `csv_summary` format is the **full** summary of this sample, e.g. the summary at each taxonomic rank (LIN position). It also includes an entry with the `unclassified` portion at each rank. + +> Note: Multiple output formats require the `-o` `--output-base` to be specified, as each must be written to a file. + +Here's an abbreviated version of the `gather` results for `barcode1`, with lingroup information added: + +| | **ksize** | **scaled** | **best overlap** | **gather match(es)** | **lingroup** | **lin** | +| ------- | --------- | ---------- | ---------------- | -------------------- | ------------ | ---------------------------------- | +| **bc1** | 51 | 1000 | 105 kb | GCA_002251655.1 | IIB seq1 | 14;1;0;0;0;3;0;0;0;0;1;0;0;0;0;0;0 | +| **bc1** | 31 | 1000 | 173 kb | GCA_002251655.1 | IIB seq1 | 14;1;0;0;0;3;0;0;0;0;1;0;0;0;0;0;0 | + + +### Now run with `barcode3` sample + +#### sourmash gather +Run: +``` +query="inputs/barcode3_31543.sig.zip" +database="databases/ralstonia.zip" + +gather_csv_output="barcode3_31543.dna.k51.gather.csv" + +sourmash gather $query $database -k 51 -o $gather_csv_output +``` + +You should see: +``` +selecting specified query k=51 +loaded query: barcode3_31543... (k=51, DNA) +loading from 'databases/ralstonia.zip'... +loaded 81 total signatures from 1 locations. +after selecting signatures compatible with search, 27 remain. +Starting prefetch sweep across databases. +Found 0 signatures via prefetch; now doing gather. +found less than 50.0 kbp in common. => exiting + + +found 0 matches total; +the recovered matches hit 0.0% of the query k-mers (unweighted). +``` + +#### gather found no sequence matches! But, we can lower the detection threshold: + +``` +query="inputs/barcode3_31543.sig.zip" +database="databases/ralstonia.zip" +gather_csv_output="barcode3_31543.k51.gather.csv" + +# use a 10kb detection threshold +sourmash gather $query $database -k 51 --threshold-bp 10000 -o $gather_csv_output +``` + +This time, you should see: +``` +selecting specified query k=51 +loaded query: barcode3_31543... (k=51, DNA) +loading from 'databases/ralstonia.zip'... +loaded 81 total signatures from 1 locations. 
+after selecting signatures compatible with search, 27 remain.
+Starting prefetch sweep across databases.
+Found 6 signatures via prefetch; now doing gather.
+
+
+overlap p_query p_match avg_abund
+--------- ------- ------- ---------
+12.0 kbp 0.0% 0.2% 1.0 GCA_000750575.1 Ralstonia solanacear...
+
+found 1 matches total;
+the recovered matches hit 0.0% of the abundance-weighted query.
+the recovered matches hit 0.0% of the query k-mers (unweighted).
+
+```
+
+You'll notice that while we have an estimated ~12 kbp overlap, the matched genome (`GCA_000750575.1`) is different from the one matched above for `barcode1`. If you run `sourmash tax metagenome` on this output, you'll see that this genome belongs to the `Phyl IIB seq2` group, which is a sister group to the correct `Phyl IIB seq1` group that we expected. So we have a match, but it's not the right one -- why not?
+
+### What happened? Use `prefetch` to investigate
+
+`sourmash gather` has two steps: first, it runs a `prefetch` to find ALL genome matches, and then it uses a greedy approach to select the smallest set of genomes that contains ('covers') all known sequence content. Let's run `prefetch` independently so we can look at the results of the first step. Here, let's use `--threshold-bp 0` to get all possible matches.
+
+Run:
+```
+query="inputs/barcode3_31543.sig.zip"
+prefetch_csv_output="barcode3_31543.k51.prefetch.csv"
+database="databases/ralstonia.zip"
+
+sourmash prefetch $query $database -k 51 --threshold-bp 0 -o $prefetch_csv_output
+```
+
+You should see:
+```
+selecting specified query k=51
+loaded query: barcode3_31543... (k=51, DNA)
+query sketch has scaled=1000; will be dynamically downsampled as needed.
+loading from 'databases/ralstonia.zip'...
+loaded 81 total signatures from 1 locations.
+after selecting signatures compatible with search, 27 remain.
+--
+total of 15 matching signatures.
+saved 15 matches to CSV file 'barcode3_31543.k51.prefetch.csv'
+of 487043 distinct query hashes, 12 were found in matches above threshold.
+a total of 487031 query hashes remain unmatched.
+final scaled value (max across query and all matches) is 1000
+```
+
+Here, the output is telling us that we found matches to 15 of the 27 Ralstonia genomes, but only **12 k-mers** were shared between the metagenome sample and those genomes. Remember that sourmash uses a representative subsample of all k-mers, so these 12 k-mers represent ~12 kbp of sequence (12 * scaled). We've found that this is sufficient to detect the presence of an organism, but at this low level it can be hard to distinguish between closely related genomes. Let's open the prefetch output to see how those 12 k-mers matched the different genomes.
+
+#### Look at the `barcode3_31543.k51.prefetch.csv` file
+
+> Use a spreadsheet program on your computer or use `less -S barcode3_31543.k51.prefetch.csv` to see the file on the terminal. If using `less`, hit `q` when you want to exit and return to your terminal prompt.
+
+The first column contains the estimated number of base pairs matched between our query and each matching reference genome. You'll notice there are four genomes that match 12 kbp of sequence, one of which is the "correct" genome (`GCA_002251655.1`, which is in the `IIB seq1` lingroup).
+
+**What is happening here?**
+
+When faced with equally good matches, `sourmash gather` makes a random choice about which genome to assign these k-mers to. This happens primarily with highly similar genomes and/or very small sequence matches.
If this happens and you need to distinguish between these genomes, we recommend trying a lower scaled value (higher resolution). "scaled" refers to the systematic downsampling: we keep roughly 1/scaled of the distinct k-mers (`scaled=1000` keeps ~1 of every 1000 unique k-mers). `scaled=1` keeps all k-mers, but our signature storage is not optimized for this use case.
+
+To see if we could robustly assign the correct sequevar for `barcode3` using a higher resolution sketch, I also ran `gather` using `scaled=100`.
+
+Here's an abbreviated version of the `gather` results for `barcode3`, with lingroup information added:
+
+
+
+| | **ksize** | **scaled** | **best overlap** | **gather match(es)** | **lingroup** | **lin** |
+| ------- | --------- | ---------- | ---------------- | -------------------- | ------------ | ---------------------------------- |
+| **bc3** | 51 | 1000 | 12 kb | GCA_000750575.1 | IIB seq2 | 14;1;0;0;0;3;0;0;0;0;1;0;0;0;0;1;0 |
+| **bc3** | 31 | 1000 | 28 kb | GCA_002251655.1 | IIB seq1 | 14;1;0;0;0;3;0;0;0;0;1;0;0;0;0;0;0 |
+| **bc3** | 51 | 100 | 14.8 kb | GCA_002251655.1 | IIB seq1 | 14;1;0;0;0;3;0;0;0;0;1;0;0;0;0;0;0 |
+| **bc3** | 31 | 100 | 21.1 kb | GCA_002251655.1 | IIB seq1 | 14;1;0;0;0;3;0;0;0;0;1;0;0;0;0;0;0 |
+
+We typically use k=51 for strain-level matching and k=31 for species-level matching. Notice that running at k=31 with scaled=1000 found the right match. However, if you run prefetch for `k=31`, you will see there are three matches with 28 kb overlap, so we just got lucky that `gather` selected the right one for this test case.
+
+In contrast, by sketching the `Ralstonia` genomes and metagenome at higher resolution (`scaled=100`), we had sufficient information to correctly assign the sequence to the `IIB seq1` lingroup at either ksize.
+
+
+### Now try the `barcode5` sample
+
+You can also run the `barcode5` file using the same commands as above:
+
+```
+query="inputs/barcode5_36481.sig.zip"
+database="databases/ralstonia.zip"
+
+gather_csv_output="barcode5_36481.dna.k51.gather.csv"
+
+sourmash gather $query $database -k 51 -o $gather_csv_output
+```
+
+You should see:
+
+```
+selecting specified query k=51
+loaded query: barcode5_36481... (k=51, DNA)
+--
+loaded 81 total signatures from 1 locations.
+after selecting signatures compatible with search, 27 remain.
+
+Starting prefetch sweep across databases.
+Found 0 signatures via prefetch; now doing gather.
+found less than 50.0 kbp in common. => exiting
+
+found 0 matches total;
+the recovered matches hit 0.0% of the query k-mers (unweighted).
+```
+
+
+No matches are found. If you drop the threshold-bp to 0 (`--threshold-bp 0`), you can find ~1 kbp of overlap (a single k-mer match!). **Note: we do not recommend trusting/using results with fewer than 3 k-mer matches (3 kbp at scaled=1000).** Especially in larger databases (e.g. NCBI/GTDB), a single k-mer match might actually be from contamination in the reference genome rather than true genome content, so you may end up assigning the wrong lineage. Requiring 3 k-mers (representing ~3 kb of matching sequence) makes it more likely your matches represent true genome content.
+
+I then ran this file at higher resolution to see how the results changed. In each case, very few k-mers matched and we could not robustly identify a specific `Ralstonia` genome or lingroup. As it turns out, `barcode5` does not have a `Ralstonia` spike-in, so this is a good thing!
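+
+Since the comparisons below vary the scaled value, here is a minimal sketch of the FracMinHash "scaled" downsampling idea described above. This is an illustration only: the hash function used here (`blake2b`) is a stand-in, not the hash sourmash actually uses, and this code is not part of sourmash.
+
+```
+import hashlib
+
+def keep_kmer(kmer, scaled=1000):
+    # Hash the k-mer to a 64-bit value and keep it only if it falls in the
+    # bottom 1/scaled fraction of hash space. Because hash values are
+    # (approximately) uniform, this keeps roughly 1 of every `scaled`
+    # distinct k-mers.
+    h = int.from_bytes(hashlib.blake2b(kmer.encode(), digest_size=8).digest(), "big")
+    return h < 2**64 // scaled
+
+# scaled=100 keeps ~10x more k-mers than scaled=1000 (higher resolution, larger
+# sketches); because the cutoff is a fixed fraction of hash space, a scaled=1000
+# sketch is always a subset of the same sample's scaled=100 sketch.
+kmer = "ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACG"
+print(keep_kmer(kmer, scaled=1000), keep_kmer(kmer, scaled=100))
+```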
+ +Here's an abbreviated version of the `gather` results for `barcode5`, with lingroup information added in cases with a single gather match: + +| | **ksize** | **scaled** | **best overlap** | **gather match(es)** | **lingroup** | **lin** | +| ------- | --------- | ---------- | ---------------- | -------------------- | ------------ | ---------------------------------- | +| **bc5** | 51 | 1000 | 1 kbp | GCA_000750575.1 | IIB seq2 | 14;1;0;0;0;3;0;0;0;0;1;0;0;0;0;1;0 | +| **bc5** | 31 | 1000 | 0 | N/A | | | +| **bc5** | 51 | 100 | 300bp | all | | | +| **bc5** | 31 | 100 | 1.2 kb | all | | | +| **bc5** | 51 | 10 | 120 bp | all | | | +| **bc5** | 31 | 10 | 670 bp | all | | | +| **bc5** | 51 | 5 | 150 bp | all | | | +| **bc5** | 31 | 5 | 500 bp | all | | | + + +**Again, while I've used a threshold-bp of 0 to get the gather match at scaled=1000, we do not typically trust gather matches with less than `3*scaled` overlap (< 3 k-mers matched).** Even at very high resolution (scaled=5), we matched nearly all Ralstonia genomes and could not distinguish a single lingroup. + +We typically recommend running at `scaled=1000` (our default), as this works for most microbial use cases. You can run at higher resolution (lower scaled) if you need to, but higher resolution signatures are larger and can take significantly longer to build and search - use at your own risk :). + +## Summary and concluding thoughts + +The LIN taxonomic framework may be useful distinguishing groups below the species level. +We can now use LINs and lingroups with `sourmash tax metagenome`. For low level matches, the gather greedy +approach can struggle. We are working on ways to better warn users about this behavior and welcome +feedback and suggestions on our [issue tracker](https://github.com/sourmash-bio/sourmash/issues/new). \ No newline at end of file diff --git a/doc/tutorials.md b/doc/tutorials.md index bb201fbca4..c089822574 100644 --- a/doc/tutorials.md +++ b/doc/tutorials.md @@ -13,7 +13,7 @@ X and Linux. They require about 5 GB of disk space and 5 GB of RAM. ## Background and details -These next three tutorials are all notebooks that you can view, run +These next four tutorials are all notebooks that you can view, run yourself, or run interactively online via the [binder](https://mybinder.org) service. @@ -23,6 +23,8 @@ yourself, or run interactively online via the * [Working with private collections of signatures.](sourmash-collections.md) +* [Using `sourmash taxonomy` with the LIN taxonomic framework.](tutorial-lin-taxonomy.md) + ## More information For more information on analyzing sequencing data with sourmash, check out our [longer tutorial](tutorial-long.md). diff --git a/src/sourmash/cli/tax/annotate.py b/src/sourmash/cli/tax/annotate.py index e0c17b0019..501a02fd58 100644 --- a/src/sourmash/cli/tax/annotate.py +++ b/src/sourmash/cli/tax/annotate.py @@ -59,6 +59,10 @@ def subparser(subparsers): '-f', '--force', action = 'store_true', help='continue past errors in file and taxonomy loading', ) + subparser.add_argument( + '--lins', '--lin-taxonomy', action='store_true', default=False, + help='use LIN taxonomy in place of standard taxonomic ranks. Note that the taxonomy CSV must contain LIN lineage information.' 
+ ) def main(args): import sourmash diff --git a/src/sourmash/cli/tax/genome.py b/src/sourmash/cli/tax/genome.py index 54c40fd681..555f812a25 100644 --- a/src/sourmash/cli/tax/genome.py +++ b/src/sourmash/cli/tax/genome.py @@ -95,16 +95,23 @@ def subparser(subparsers): def main(args): import sourmash - if not args.gather_csv and not args.from_file: - raise ValueError(f"No gather CSVs found! Please input via '-g' or '--from-file'.") - if len(args.output_format) > 1: - if args.output_base == "-": - raise TypeError(f"Writing to stdout is incompatible with multiple output formats {args.output_format}") - if not args.rank: - if any(x in ["krona"] for x in args.output_format): - raise ValueError(f"Rank (--rank) is required for krona output format.") - if not args.output_format: - # change to "human" for 5.0 - args.output_format = ["csv_summary"] + try: + if not args.gather_csv and not args.from_file: + raise ValueError(f"No gather CSVs found! Please input via '-g' or '--from-file'.") + # handle output formats + print(args.output_format) + if not args.rank: + if any(x in ["krona"] for x in args.output_format): + raise ValueError(f"Rank (--rank) is required for krona output format.") + if len(args.output_format) > 1: + if args.output_base == "-": + raise ValueError(f"Writing to stdout is incompatible with multiple output formats {args.output_format}") + elif not args.output_format: + # change to "human" for 5.0 + args.output_format = ["csv_summary"] + + except ValueError as exc: + error(f"ERROR: {str(exc)}") + import sys; sys.exit(-1) return sourmash.tax.__main__.genome(args) diff --git a/src/sourmash/cli/tax/metagenome.py b/src/sourmash/cli/tax/metagenome.py index df81789360..cbcca18fad 100644 --- a/src/sourmash/cli/tax/metagenome.py +++ b/src/sourmash/cli/tax/metagenome.py @@ -23,6 +23,8 @@ import sourmash from sourmash.logging import notify, print_results, error +from sourmash.cli.utils import add_rank_arg, check_rank, check_tax_outputs + def subparser(subparsers): @@ -67,29 +69,34 @@ def subparser(subparsers): ) subparser.add_argument( '-F', '--output-format', default=[], nargs='*', action="extend", - choices=["human", "csv_summary", "krona", "lineage_summary", "kreport"], + choices=["human", "csv_summary", "krona", "lineage_summary", "kreport", "lingroup"], help='choose output format(s)', ) - subparser.add_argument( - '-r', '--rank', choices=['strain','species', 'genus', 'family', 'order', 'class', 'phylum', 'superkingdom'], - help='For non-default output formats: Summarize genome taxonomy at this rank and above. Note that the taxonomy CSV must contain lineage information at this rank.' - ) subparser.add_argument( '-f', '--force', action = 'store_true', help='continue past errors in taxonomy database loading', ) + subparser.add_argument( + '--lins', '--lin-taxonomy', action='store_true', default=False, + help="use LIN taxonomy in place of standard taxonomic ranks. Note that the taxonomy CSV must contain 'lin' lineage information." + ) + subparser.add_argument( + '--lingroup', '--lingroups', metavar='FILE', default=None, + help="CSV containing 'name', 'lin' columns, where 'lin' is the lingroup prefix. Will produce a 'lingroup' report containing taxonomic summarization for each group." + ) + add_rank_arg(subparser) def main(args): import sourmash - if not args.gather_csv and not args.from_file: - raise ValueError(f"No gather CSVs found! 
Please input via '-g' or '--from-file'.") - if len(args.output_format) > 1: - if args.output_base == "-": - raise TypeError(f"Writing to stdout is incompatible with multiple output formats {args.output_format}") - if not args.rank: - if any(x in ["krona", "lineage_summary"] for x in args.output_format): - raise ValueError(f"Rank (--rank) is required for krona and lineage_summary output formats.") - if not args.output_format: - # change to "human" for 5.0 - args.output_format = ["csv_summary"] + try: + if not args.gather_csv and not args.from_file: + raise ValueError(f"No gather CSVs found! Please input via '-g' or '--from-file'.") + if args.rank: + args.rank = check_rank(args) + args.output_format = check_tax_outputs(args, rank_required = ['krona', 'lineage_summary']) + + except ValueError as exc: + error(f"ERROR: {str(exc)}") + import sys; sys.exit(-1) + return sourmash.tax.__main__.metagenome(args) diff --git a/src/sourmash/cli/tax/summarize.py b/src/sourmash/cli/tax/summarize.py index 7fca17e837..06a109e95c 100644 --- a/src/sourmash/cli/tax/summarize.py +++ b/src/sourmash/cli/tax/summarize.py @@ -46,6 +46,10 @@ def subparser(subparsers): '-f', '--force', action = 'store_true', help='continue past errors in file and taxonomy loading', ) + subparser.add_argument( + '--lins', '--lin-taxonomy', action='store_true', default=False, + help='use LIN taxonomy in place of standard taxonomic ranks.' + ) def main(args): import sourmash diff --git a/src/sourmash/cli/utils.py b/src/sourmash/cli/utils.py index d92c726b2d..4e1e2d9ff6 100644 --- a/src/sourmash/cli/utils.py +++ b/src/sourmash/cli/utils.py @@ -132,3 +132,54 @@ def add_num_arg(parser, default=0): '-n', '--num-hashes', '--num', metavar='N', type=check_num_bounds, default=default, help='num value should be between 50 and 50000' ) + + +def check_rank(args): + """ Check '--rank'/'--position'/'--lin-position' argument matches selected taxonomy.""" + standard_ranks =['strain', 'species', 'genus', 'family', 'order', 'class', 'phylum', 'superkingdom'] + if args.lins: + if args.rank.isdigit(): + #if isinstance(args.rank, int): + return str(args.rank) + raise argparse.ArgumentTypeError(f"Invalid '--rank'/'--position' input: '{args.rank}'. '--lins' is specified. Rank must be an integer corresponding to a LIN position.") + elif args.rank in standard_ranks: + return args.rank + else: + raise argparse.ArgumentTypeError(f"Invalid '--rank'/'--position' input: '{args.rank}'. Please choose: 'strain', 'species', 'genus', 'family', 'order', 'class', 'phylum', 'superkingdom'") + + +def add_rank_arg(parser): + parser.add_argument( + '-r', '--rank', + '--position', '--lin-position', + help="For non-default output formats: Summarize genome taxonomy at this rank (or LIN position) and above. \ + Note that the taxonomy CSV must contain lineage information at this rank (or LIN position). \ + Choices: 'strain', 'species', 'genus', 'family', 'order', 'class', 'phylum', 'superkingdom' or an integer LIN position" + ) + +def check_tax_outputs(args, rank_required = ["krona"]): + "Handle ouput format combinations" + # check that rank is passed for formats requiring rank. 
+ if not args.rank: + if any(x in rank_required for x in args.output_format): + raise ValueError(f"Rank (--rank) is required for {', '.join(rank_required)} output formats.") + + # check that '--lins' is specified and '--lingroup' file exists if needed + if args.lins: + if args.lingroup: + if "lingroup" not in args.output_format: + args.output_format.append("lingroup") + elif "lingroup" in args.output_format: + raise ValueError(f"Must provide lingroup csv via '--lingroup' in order to output a lingroup report.") + elif args.lingroup or "lingroup" in args.output_format: + raise ValueError(f"Must enable LIN taxonomy via '--lins' in order to use lingroups.") + + # check that only one output format is specified if writing to stdout + if len(args.output_format) > 1: + if args.output_base == "-": + raise ValueError(f"Writing to stdout is incompatible with multiple output formats {args.output_format}") + elif not args.output_format: + # change to "human" for 5.0 + args.output_format = ["csv_summary"] + + return args.output_format diff --git a/src/sourmash/lca/lca_utils.py b/src/sourmash/lca/lca_utils.py index b9864ed0a8..8ee9340ed7 100644 --- a/src/sourmash/lca/lca_utils.py +++ b/src/sourmash/lca/lca_utils.py @@ -120,10 +120,6 @@ def build_tree(assignments, initial=None): for assignment in assignments: node = tree - # when we switch LineagePair over, will need ot add this. - #if isinstance(assignment, (BaseLineageInfo, RankLineageInfo, LINSLineageInfo)): - # assignment = assignment.filled_lineage - for lineage_tup in assignment: if lineage_tup.name: child = node.get(lineage_tup, {}) diff --git a/src/sourmash/tax/__main__.py b/src/sourmash/tax/__main__.py index 1d127cefb7..dc01abaded 100644 --- a/src/sourmash/tax/__main__.py +++ b/src/sourmash/tax/__main__.py @@ -11,10 +11,9 @@ import sourmash from ..sourmash_args import FileOutputCSV, FileOutput from sourmash.logging import set_quiet, error, notify, print_results -from sourmash.lca.lca_utils import zip_lineage from . import tax_utils -from .tax_utils import MultiLineageDB, GatherRow +from .tax_utils import MultiLineageDB, GatherRow, RankLineageInfo, LINLineageInfo usage=''' sourmash taxonomy [] - manipulate/work with taxonomy information. 
@@ -43,6 +42,7 @@ 'human': '.human.txt', 'lineage_csv': '.lineage.csv', 'kreport': ".kreport.txt", + 'lingroup': ".lingroup.tsv" } def make_outfile(base, output_type, *, output_dir = ""): @@ -72,7 +72,7 @@ def metagenome(args): tax_assign = MultiLineageDB.load(args.taxonomy_csv, keep_full_identifiers=args.keep_full_identifiers, keep_identifier_versions=args.keep_identifier_versions, - force=args.force) + force=args.force, lins=args.lins) available_ranks = tax_assign.available_ranks except ValueError as exc: error(f"ERROR: {str(exc)}") @@ -93,6 +93,7 @@ def metagenome(args): fail_on_missing_taxonomy=args.fail_on_missing_taxonomy, keep_full_identifiers=args.keep_full_identifiers, keep_identifier_versions = args.keep_identifier_versions, + lins=args.lins, ) except ValueError as exc: error(f"ERROR: {str(exc)}") @@ -145,7 +146,11 @@ def metagenome(args): summary_outfile, limit_float = make_outfile(args.output_base, "human", output_dir=args.output_dir) with FileOutput(summary_outfile) as out_fp: - tax_utils.write_human_summary(query_gather_results, out_fp, args.rank or "species") + human_display_rank = args.rank or "species" + if args.lins and not args.rank: + human_display_rank = query_gather_results[0].ranks[-1] # lowest rank + + tax_utils.write_human_summary(query_gather_results, out_fp, human_display_rank) # write summarized output csv single_query_results = query_gather_results[0] @@ -162,6 +167,20 @@ def metagenome(args): header, kreport_results = single_query_results.make_kreport_results() tax_utils.write_output(header, kreport_results, out_fp, sep="\t", write_header=False) + # write summarized --> LINgroup output tsv + if "lingroup" in args.output_format: + try: + lingroups = tax_utils.read_lingroups(args.lingroup) + except ValueError as exc: + error(f"ERROR: {str(exc)}") + sys.exit(-1) + + lingroupfile, limit_float = make_outfile(args.output_base, "lingroup", output_dir=args.output_dir) + + with FileOutputCSV(lingroupfile) as out_fp: + header, lgreport_results = single_query_results.make_lingroup_results(LINgroupsD = lingroups) + tax_utils.write_output(header, lgreport_results, out_fp, sep="\t", write_header=True) + def genome(args): """ @@ -271,7 +290,7 @@ def annotate(args): tax_assign = MultiLineageDB.load(args.taxonomy_csv, keep_full_identifiers=args.keep_full_identifiers, keep_identifier_versions=args.keep_identifier_versions, - force=args.force) + force=args.force, lins=args.lins) except ValueError as exc: error(f"ERROR: {str(exc)}") sys.exit(-1) @@ -289,7 +308,7 @@ def annotate(args): fail_on_missing_taxonomy=args.fail_on_missing_taxonomy, keep_full_identifiers=args.keep_full_identifiers, keep_identifier_versions = args.keep_identifier_versions, - ) + lins=args.lins) if not query_gather_results: continue @@ -383,8 +402,7 @@ def search_pattern(l, r): else: with FileOutputCSV(args.output) as fp: w = csv.writer(fp) - - w.writerow(['ident'] + list(sourmash.lca.taxlist(include_strain=False))) + w.writerow(['ident'] + list(RankLineageInfo().taxlist[:-1])) for ident, lineage in sorted(match_ident): w.writerow([ident] + [ x.name for x in lineage ]) @@ -398,7 +416,8 @@ def summarize(args): tax_assign = MultiLineageDB.load(args.taxonomy_files, force=args.force, keep_full_identifiers=args.keep_full_identifiers, - keep_identifier_versions=args.keep_identifier_versions) + keep_identifier_versions=args.keep_identifier_versions, + lins=args.lins) except ValueError as exc: error("ERROR while loading taxonomies!") error(str(exc)) @@ -443,7 +462,11 @@ def summarize(args): # output in order of most 
common for lineage, count in lineage_counts.most_common(): rank = lineage[-1].rank - lin = ";".join(zip_lineage(lineage, truncate_empty=True)) + if args.lins: + inf = LINLineageInfo(lineage=lineage) + else: + inf = RankLineageInfo(lineage=lineage) + lin = inf.display_lineage() w.writerow([rank, str(count), lin]) n = len(lineage_counts) diff --git a/src/sourmash/tax/tax_utils.py b/src/sourmash/tax/tax_utils.py index 844fd1ceef..6cb83f5eda 100644 --- a/src/sourmash/tax/tax_utils.py +++ b/src/sourmash/tax/tax_utils.py @@ -3,8 +3,7 @@ """ import os import csv -from collections import namedtuple, defaultdict -from collections import abc +from collections import abc, defaultdict from itertools import zip_longest from typing import NamedTuple from dataclasses import dataclass, field, replace, asdict @@ -22,7 +21,7 @@ 'report_missing_and_skipped_identities', 'aggregate_by_lineage_at_rank' 'format_for_krona', 'combine_sumgather_csvs_by_lineage', 'write_lineage_sample_frac', - 'MultiLineageDB', 'RankLineageInfo'] + 'MultiLineageDB', 'RankLineageInfo', 'LINLineageInfo'] from sourmash.logging import notify from sourmash.sourmash_args import load_pathlist_from_file @@ -30,10 +29,6 @@ RANKCODE = { "superkingdom": "D", "kingdom": "K", "phylum": "P", "class": "C", "order": "O", "family":"F", "genus": "G", "species": "S", "unclassified": "U"} -# import lca utils as needed for now -from sourmash.lca import lca_utils -from sourmash.lca.lca_utils import (taxlist) - class LineagePair(NamedTuple): rank: str name: str = None @@ -146,25 +141,21 @@ def _init_from_lineage_tuples(self): new_lineage.append(LineagePair(rank=rank)) for lin_tup in self.lineage: # now add input tuples in correct spots. This corrects for order and allows empty values. - if not isinstance(lin_tup, (LineagePair, lca_utils.LineagePair)): - raise ValueError(f"{lin_tup} is not LineagePair.") - # find index for this rank + if not isinstance(lin_tup, LineagePair): + raise ValueError(f"{lin_tup} is not tax_utils LineagePair.") if lin_tup.rank: # skip this tuple if rank is None or "" (empty lineage tuple. is this needed?) 
try: + # find index for this rank rank_idx = self.rank_index(lin_tup.rank) except ValueError as e: raise ValueError(f"Rank '{lin_tup.rank}' not present in {', '.join(self.ranks)}") from e - # make sure we're adding tax_utils.LineagePairs, not lca_utils.LineagePairs for consistency - if isinstance(lin_tup, lca_utils.LineagePair): - new_lineage[rank_idx] = LineagePair(rank=lin_tup.rank, name=lin_tup.name) - else: - new_lineage[rank_idx] = lin_tup - + new_lineage[rank_idx] = lin_tup + # build list of filled ranks - filled_ranks = [a.rank for a in new_lineage if a.name] + filled_ranks = [a.rank for a in new_lineage if a.name is not None] # set lineage and filled_ranks object.__setattr__(self, "lineage", tuple(new_lineage)) - object.__setattr__(self, "filled_ranks", filled_ranks) + object.__setattr__(self, "filled_ranks", tuple(filled_ranks)) def _init_from_lineage_str(self): """ @@ -175,9 +166,9 @@ def _init_from_lineage_str(self): new_lineage = self.lineage_str.split(',') new_lineage = [ LineagePair(rank=rank, name=n) for (rank, n) in zip_longest(self.ranks, new_lineage) ] # build list of filled ranks - filled_ranks = [a.rank for a in new_lineage if a.name] + filled_ranks = [a.rank for a in new_lineage if a.name is not None] object.__setattr__(self, "lineage", tuple(new_lineage)) - object.__setattr__(self, "filled_ranks", filled_ranks) + object.__setattr__(self, "filled_ranks", tuple(filled_ranks)) def zip_lineage(self, truncate_empty=False): """ @@ -222,7 +213,7 @@ def check_rank_availability(self, rank): if rank in self.ranks: # rank is available return True raise ValueError(f"Desired Rank '{rank}' not available for this lineage.") - + def rank_is_filled(self, rank, other=None): self.check_rank_availability(rank) if other is not None: @@ -232,12 +223,17 @@ def rank_is_filled(self, rank, other=None): return True return False + def is_compatible(self, other): + if self.ranks == other.ranks: + return True + return False + def is_lineage_match(self, other, rank): """ check to see if two lineages are a match down to given rank. """ self.check_rank_availability(rank) - if not other.ranks == self.ranks: # check same ranks + if not self.is_compatible(other): raise ValueError("Cannot compare lineages from taxonomies with different ranks.") # always return false if rank is not filled in either of the two lineages if self.rank_is_filled(rank, other=other): @@ -262,8 +258,7 @@ def pop_to_rank(self, rank): return new def lineage_at_rank(self, rank): - "non-destructive pop_to_rank. Returns tuple of LineagePairs" - "Returns tuple of LineagePairs at given rank." + "Return tuple of LineagePairs at specified rank." # are we already above rank? self.check_rank_availability(rank) if not self.rank_is_filled(rank): @@ -272,6 +267,16 @@ def lineage_at_rank(self, rank): rank_idx = self.rank_index(rank) return self.filled_lineage[:rank_idx+1] + def find_lca(self, other): + """ + If an LCA match exists between self and other, + find and report LCA lineage. If not, return None. 
+ """ + for rank in self.ascending_taxlist: + if self.is_lineage_match(other, rank): + return self.pop_to_rank(rank) + return None + @dataclass(frozen=True, order=True) class RankLineageInfo(BaseLineageInfo): @@ -355,7 +360,210 @@ def _init_from_lineage_dict(self): filled_ranks = [a.rank for a in new_lineage if a.name] # set lineage and filled_ranks object.__setattr__(self, "lineage", tuple(new_lineage)) - object.__setattr__(self, "filled_ranks", filled_ranks) + object.__setattr__(self, "filled_ranks", tuple(filled_ranks)) + + +@dataclass(frozen=True, order=True) +class LINLineageInfo(BaseLineageInfo): + """ + This LINLineageInfo class uses the BaseLineageInfo methods for hierarchical LIN taxonomic 'ranks'. + + Inputs (at least one required): + n_lin_positions: the number of lineage positions + lineage_str: `;`- or `,`-separated LINS string + + If both `n_lin_positions` and `lineage_str` are provided, we will initialize a `LINLineageInfo` + with the provided n_lin_positions, and fill positions with `lineage_str` values. If the number of + positions is less than provided lineages, initialization will fail. Otherwise, we will insert blanks + beyond provided data in `lineage_str`. + + If no information is passed, an empty LINLineageInfo will be initialized (n_lin_positions=0). + + Input lineage information is only used for initialization of the final `lineage` + and will not be used or compared in any other class methods. + """ + ranks: tuple = field(default=None, init=False, compare=False)# we will set this within class instead + lineage: tuple = None + # init with n_positions if you want to set a specific number of positions + n_lin_positions: int = field(default=None, compare=False) + + def __post_init__(self): + "Initialize according to passed values" + # ranks must be tuple for hashability + if self.lineage is not None: + self._init_from_lineage_tuples() + elif self.lineage_str is not None: + self._init_from_lineage_str() + else: + self._init_empty() + + def __eq__(self, other): + """ + Check if two LINLineageInfo match. Since we sometimes want to match LINprefixes, which have fewer + total ranks, with full LINs, we only check for the filled_lineage to match and don't check that + the number of lin_positions match. + """ + if other == (): # if comparing to a null tuple, don't try to find its lineage before returning False + return False + return self.filled_lineage==other.filled_lineage + + def _init_ranks_from_n_lin_positions(self): + new_ranks = [str(x) for x in range(0, self.n_lin_positions)] + object.__setattr__(self, "ranks", new_ranks) + + def _init_empty(self): + "initialize empty genome lineage" + # first, set ranks from n_positions + if self.n_lin_positions is None: + # set n_lin_positions to 0 for completely empty LINLineageInfo + object.__setattr__(self, "n_lin_positions", 0) + self._init_ranks_from_n_lin_positions() + new_lineage=[] + for rank in self.ranks: + new_lineage.append(LineagePair(rank=rank)) + # set lineage and filled_ranks (because frozen, need to do it this way) + object.__setattr__(self, "lineage", tuple(new_lineage)) + object.__setattr__(self, "filled_ranks", ()) + object.__setattr__(self, "n_filled_pos", 0) + + def _init_from_lineage_str(self): + """ + Turn a ; or ,-separated set of lineages into a list of LineagePair objs. 
+ """ + new_lineage = self.lineage_str.split(';') + if len(new_lineage) == 1: + new_lineage = self.lineage_str.split(',') + if self.n_lin_positions is not None: + if self.n_lin_positions < len(new_lineage): + raise(ValueError("Provided 'n_lin_positions' has fewer positions than provided 'lineage_str'.")) + self._init_ranks_from_n_lin_positions() + else: + n_lin_positions = len(new_lineage) + object.__setattr__(self, "n_lin_positions", n_lin_positions) + self._init_ranks_from_n_lin_positions() + + # build lineage and n_filled_pos, filled_ranks + new_lineage = [ LineagePair(rank=rank, name=n) for (rank, n) in zip_longest(self.ranks, new_lineage) ] + filled_ranks = [a.rank for a in new_lineage if a.name is not None] + object.__setattr__(self, "lineage", tuple(new_lineage)) + object.__setattr__(self, "filled_ranks", tuple(filled_ranks)) + object.__setattr__(self, "n_filled_pos", len(filled_ranks)) + + def _init_from_lineage_tuples(self): + 'initialize from tuple/list of LineagePairs, building ranks as you go' + new_lineage = [] + ranks = [] + # check this is a list or tuple of lineage tuples: + for lin_tup in self.lineage: + # make sure we're adding tax_utils.LineagePairs + if not isinstance(lin_tup, LineagePair): + raise ValueError(f"{lin_tup} is not tax_utils LineagePair.") + new_lineage.append(lin_tup) + ranks.append(lin_tup.rank) + # build list of filled ranks + filled_ranks = [a.rank for a in new_lineage if a.name is not None] + # set lineage and filled_ranks + object.__setattr__(self, "lineage", tuple(new_lineage)) + object.__setattr__(self, "n_lin_positions", len(new_lineage)) + object.__setattr__(self, "ranks", tuple(ranks)) + object.__setattr__(self, "filled_ranks", tuple(filled_ranks)) + object.__setattr__(self, "n_filled_pos", len(filled_ranks)) + + + def is_compatible(self, other): + """ + Since we sometimes want to match LINprefixes with full LINs, + we don't want to enforce identical ranks. Here we just look to + make sure self and other share any ranks (LIN positions). + + Since ranks are positions, this should be true for LINLineageInfo + unless one is empty. However, it should prevent comparison between + other LineageInfo instances and LINLineageInfo. + """ + # do self and other share any ranks? + if any(x in self.ranks for x in other.ranks): + return True + return False + + + +@dataclass +class LineageTree: + """ + Builds a tree of dictionaries from lists of LineagePair or + LineageInfo objects in 'assignments'. This tree can then be used + to find lowest common ancestor agreements/confusion. + """ + assignments: list = field(compare=False) + + def __post_init__(self): + self.tree = {} + self.add_lineages(self.assignments) + + def add_lineage(self, lineage): + if isinstance(lineage, (BaseLineageInfo, RankLineageInfo, LINLineageInfo)): + lineage = lineage.filled_lineage + node = self.tree + for lineage_tup in lineage: + if lineage_tup.name: + child = node.get(lineage_tup, {}) + node[lineage_tup] = child + # shift -> down in tree + node = child + + def add_lineages(self, lineages): + if not lineages: + raise ValueError("empty assignment passed to build_tree") + if not isinstance(lineages, abc.Iterable): + raise ValueError("Must pass in an iterable containing LineagePair or LineageInfo objects.") + for lineageInf in lineages: + self.add_lineage(lineageInf) + + def find_lca(self): + """ + Given a LineageTree tree, find the first node with multiple + children, OR the only leaf in the tree. 
Return (lineage_tup, reason), + where 'reason' is the number of children of the returned node, i.e. + 0 if it's a leaf and > 1 if it's an internal node. + """ + node = self.tree + lca = [] + while 1: + if len(node) == 1: # descend to only child; track path + lineage_tup = next(iter(node.keys())) + lca.append(lineage_tup) + node = node[lineage_tup] + elif len(node) == 0: # at leaf; end + return tuple(lca), 0 + else: # len(node) > 1 => confusion!! + return tuple(lca), len(node) + + def ordered_paths(self, include_internal=False): + """ + Find all paths in the nested dict in a depth-first manner. + Each path is a tuple of lineage tuples that lead from the root + to a leaf node. Optionally include internal nodes by building + them up from leaf nodes (for ordering). + """ + paths = [] + stack = [((), self.tree)] + while stack: + path, node = stack.pop() + for key, val in node.items(): + if len(val) == 0: # leaf node + # if want internal paths, build up from leaf + if include_internal: + internal_path = path + while internal_path: + if internal_path not in paths: + paths.append(internal_path) + if isinstance(internal_path, abc.Iterable): + internal_path = internal_path[:-1] + # now add leaf path + paths.append(path + (key,)) + else: # not leaf, add to stack + stack.append((path + (key,), val)) + return paths def get_ident(ident, *, @@ -404,9 +612,30 @@ def collect_gather_csvs(cmdline_gather_input, *, from_file=None): return gather_csvs +def read_lingroups(lingroup_csv): + lingroupD = {} + n=None + with sourmash_args.FileInputCSV(lingroup_csv) as r: + header = r.fieldnames + # check for empty file + if not header: + raise ValueError(f"Cannot read lingroups from '{lingroup_csv}'. Is file empty?") + if "lin" not in header or "name" not in header: + raise ValueError(f"'{lingroup_csv}' must contain the following columns: 'name', 'lin'.") + for n, row in enumerate(r): + lingroupD[row['lin']] = row['name'] + + if n is None: + raise ValueError(f'No lingroups loaded from {lingroup_csv}.') + n_lg = len(lingroupD.keys()) + notify(f"Read {n+1} lingroup rows and found {n_lg} distinct lingroup prefixes.") + return lingroupD + + def load_gather_results(gather_csv, tax_assignments, *, seen_queries=None, force=False, skip_idents = None, fail_on_missing_taxonomy=False, - keep_full_identifiers=False, keep_identifier_versions=False): + keep_full_identifiers=False, keep_identifier_versions=False, + lins=False): "Load a single gather csv" if not seen_queries: seen_queries=set() @@ -430,13 +659,14 @@ def load_gather_results(gather_csv, tax_assignments, *, seen_queries=None, force # do not allow loading of same query from a second CSV. raise ValueError(f"Gather query {gatherRow.query_name} was found in more than one CSV. 
Cannot load from '{gather_csv}'.") taxres = TaxResult(raw=gatherRow, keep_full_identifiers=keep_full_identifiers, - keep_identifier_versions=keep_identifier_versions) + keep_identifier_versions=keep_identifier_versions, + lins=lins) taxres.get_match_lineage(tax_assignments=tax_assignments, skip_idents=skip_idents, fail_on_missing_taxonomy=fail_on_missing_taxonomy) # add to matching QueryTaxResult or create new one if not this_querytaxres or not this_querytaxres.is_compatible(taxres): # get existing or initialize new - this_querytaxres = gather_results.get(gatherRow.query_name, QueryTaxResult(taxres.query_info)) + this_querytaxres = gather_results.get(gatherRow.query_name, QueryTaxResult(taxres.query_info, lins=lins)) this_querytaxres.add_taxresult(taxres) gather_results[gatherRow.query_name] = this_querytaxres @@ -448,7 +678,7 @@ def load_gather_results(gather_csv, tax_assignments, *, seen_queries=None, force def check_and_load_gather_csvs(gather_csvs, tax_assign, *, fail_on_missing_taxonomy=False, force=False, - keep_full_identifiers=False,keep_identifier_versions=False): + keep_full_identifiers=False,keep_identifier_versions=False, lins=False): ''' Load gather csvs, checking for empties and ids missing from taxonomic assignments. ''' @@ -466,7 +696,8 @@ def check_and_load_gather_csvs(gather_csvs, tax_assign, *, fail_on_missing_taxon seen_queries=gather_results.keys(), force=force, keep_full_identifiers=keep_full_identifiers, keep_identifier_versions = keep_identifier_versions, - fail_on_missing_taxonomy=fail_on_missing_taxonomy) + fail_on_missing_taxonomy=fail_on_missing_taxonomy, + lins=lins) except ValueError as exc: if force: if "found in more than one CSV" in str(exc): @@ -726,7 +957,7 @@ def __bool__(self): @classmethod def load(cls, filename, *, delimiter=',', force=False, - keep_full_identifiers=False, keep_identifier_versions=True): + keep_full_identifiers=False, keep_identifier_versions=True, lins=False): """ Load a taxonomy assignment CSV file into a LineageDB. @@ -763,34 +994,49 @@ def load(cls, filename, *, delimiter=',', force=False, header = ["ident" if "accession" == x else x for x in header] elif 'name' in header and 'lineage' in header: return cls.load_from_gather_with_lineages(filename, - force=force) + force=force, + lins=lins) else: header_str = ",".join([repr(x) for x in header]) raise ValueError(f'No taxonomic identifiers found; headers are {header_str}') - # is "strain" an available rank? - if "strain" in header: - include_strain=True - # check that all ranks are in header - ranks = list(RankLineageInfo().taxlist) - if not include_strain: - ranks.remove('strain') - if not set(ranks).issubset(header): - # for now, just raise err if not all ranks are present. - # in future, we can define `ranks` differently if desired - # return them from this function so we can check the `available` ranks - raise ValueError('Not all taxonomy ranks present') + if lins and "lin" not in header: + raise ValueError(f"'lin' column not found: cannot read LIN taxonomy assignments from {filename}.") + + if not lins: + # is "strain" an available rank? + if "strain" in header: + include_strain=True + # check that all ranks are in header + ranks = list(RankLineageInfo().taxlist) + if not include_strain: + ranks.remove('strain') + if not set(ranks).issubset(header): + # for now, just raise err if not all ranks are present. 
+ # in future, we can define `ranks` differently if desired + # return them from this function so we can check the `available` ranks + raise ValueError('Not all taxonomy ranks present') assignments = {} num_rows = 0 n_species = 0 n_strains = 0 + n_pos = None # now parse and load lineages for n, row in enumerate(r): num_rows += 1 - # read lineage from row dictionary - lineageInfo = RankLineageInfo(lineage_dict=row) + if lins: + lineageInfo = LINLineageInfo(lineage_str=row['lin']) + if n_pos is not None: + if lineageInfo.n_lin_positions != n_pos: + raise ValueError(f"For taxonomic summarization, all LIN assignments must use the same number of LIN positions.") + else: + n_pos = lineageInfo.n_lin_positions # set n_pos with first entry + ranks=lineageInfo.ranks + else: + # read lineage from row dictionary + lineageInfo = RankLineageInfo(lineage_dict=row) # get identifier ident = row[identifier] @@ -810,17 +1056,18 @@ def load(cls, filename, *, delimiter=',', force=False, else: assignments[ident] = lineage - if lineage[-1].rank == 'species': - n_species += 1 - elif lineage[-1].rank == 'strain': - n_species += 1 - n_strains += 1 + if not lins: + if lineage[-1].rank == 'species': + n_species += 1 + elif lineage[-1].rank == 'strain': + n_species += 1 + n_strains += 1 return LineageDB(assignments, ranks) @classmethod - def load_from_gather_with_lineages(cls, filename, *, force=False): + def load_from_gather_with_lineages(cls, filename, *, force=False, lins=False): """ Load an annotated gather-with-lineages CSV file produced by 'tax annotate' into a LineageDB. @@ -841,7 +1088,7 @@ def load_from_gather_with_lineages(cls, filename, *, force=False): if "name" not in header or "lineage" not in header: raise ValueError(f"Expected headers 'name' and 'lineage' not found. Is this a with-lineages file?") - ranks = list(lca_utils.taxlist(include_strain=include_strain)) + ranks=None assignments = {} num_rows = 0 n_species = 0 @@ -853,24 +1100,32 @@ def load_from_gather_with_lineages(cls, filename, *, force=False): name = row['name'] ident = get_ident(name) - lineage = row['lineage'] - lineage = lca_utils.make_lineage(lineage) + if lins: + lineageInfo = LINLineageInfo(lineage_str=row['lineage']) + else: + lineageInfo = RankLineageInfo(lineage_str= row['lineage']) + + if ranks is None: + ranks = lineageInfo.taxlist + + lineage = lineageInfo.filled_lineage # check duplicates if ident in assignments: - if assignments[ident] != tuple(lineage): + if assignments[ident] != lineage: # this should not happen with valid # sourmash tax annotate output, but check anyway. if not force: raise ValueError(f"multiple lineages for identifier {ident}") else: - assignments[ident] = tuple(lineage) + assignments[ident] = lineage - if lineage[-1].rank == 'species': - n_species += 1 - elif lineage[-1].rank == 'strain': - n_species += 1 - n_strains += 1 + if isinstance(lineageInfo, RankLineageInfo): + if lineage[-1].rank == 'species': + n_species += 1 + elif lineage[-1].rank == 'strain': + n_species += 1 + n_strains += 1 return LineageDB(assignments, ranks) @@ -904,7 +1159,7 @@ def __init__(self, conn, *, table_name=None): # get available ranks... 
ranks = set() - for column, rank in zip(self.columns, taxlist(include_strain=True)): + for column, rank in zip(self.columns, RankLineageInfo().taxlist): query = f'SELECT COUNT({column}) FROM {self.table_name} WHERE {column} IS NOT NULL AND {column} != ""' c.execute(query) cnt, = c.fetchone() @@ -948,7 +1203,7 @@ def load(cls, location): def _make_tup(self, row): "build a tuple of LineagePairs for this sqlite row" - tup = [ LineagePair(n, r) for (n, r) in zip(taxlist(True), row) ] + tup = [ LineagePair(n, r) for (n, r) in zip(RankLineageInfo().taxlist, row) ] return tuple(tup) def __getitem__(self, ident): @@ -1148,7 +1403,7 @@ class TEXT, db.commit() def _save_csv(self, fp): - headers = ['identifiers'] + list(taxlist(include_strain=True)) + headers = ['identifiers'] + list(RankLineageInfo().taxlist) w = csv.DictWriter(fp, fieldnames=headers) w.writeheader() @@ -1317,10 +1572,10 @@ class TaxResult: query_name: str = field(init=False) query_info: QueryInfo = field(init=False) match_ident: str = field(init=False) - lineageInfo: RankLineageInfo = RankLineageInfo() skipped_ident: bool = False missed_ident: bool = False match_lineage_attempted: bool = False + lins: bool = False def __post_init__(self): self.get_ident() @@ -1338,6 +1593,10 @@ def __post_init__(self): self.f_unique_to_query = float(self.raw.f_unique_to_query) self.f_unique_weighted = float(self.raw.f_unique_weighted) self.unique_intersect_bp = int(self.raw.unique_intersect_bp) + if self.lins: + self.lineageInfo = LINLineageInfo() + else: + self.lineageInfo = RankLineageInfo() def get_ident(self): # split identifiers = split on whitespace @@ -1359,7 +1618,10 @@ def get_match_lineage(self, tax_assignments, skip_idents=None, fail_on_missing_t else: lin = tax_assignments.get(self.match_ident) if lin: - self.lineageInfo = RankLineageInfo(lineage=lin) + if self.lins: + self.lineageInfo = LINLineageInfo(lineage = lin) + else: + self.lineageInfo = RankLineageInfo(lineage = lin) else: self.missed_ident=True self.match_lineage_attempted = True @@ -1439,12 +1701,17 @@ def as_human_friendly_dict(self, query_info): return sD def as_kreport_dict(self, query_info): + """ + Produce kreport dict for named taxonomic groups. + """ lowest_assignment_rank = 'species' sD = {} sD['num_bp_assigned'] = str(0) # total percent containment, weighted to include abundance info sD['percent_containment'] = f'{self.f_weighted_at_rank * 100:.2f}' sD["num_bp_contained"] = str(int(self.f_weighted_at_rank * query_info.total_weighted_bp)) + if isinstance(self.lineage, LINLineageInfo): + raise ValueError("Cannot produce 'kreport' with LIN taxonomy.") if self.lineage != RankLineageInfo(): this_rank = self.lineage.lowest_rank sD['rank_code'] = RANKCODE[this_rank] @@ -1461,6 +1728,19 @@ def as_kreport_dict(self, query_info): sD['rank_code'] = RANKCODE['unclassified'] sD["num_bp_assigned"] = sD["num_bp_contained"] return sD + + def as_lingroup_dict(self, query_info, lg_name): + """ + Produce lingroup report dict for lingroups. + """ + sD = {} + # total percent containment, weighted to include abundance info + sD['percent_containment'] = f'{self.f_weighted_at_rank * 100:.2f}' + sD["num_bp_contained"] = str(int(self.f_weighted_at_rank * query_info.total_weighted_bp)) + sD["lin"] = self.lineage.display_lineage() + sD["name"] = lg_name + return sD + @dataclass class ClassificationResult(SummarizedGatherResult): @@ -1519,6 +1799,7 @@ class QueryTaxResult: Contains methods for formatting results for different outputs. 
""" query_info: QueryInfo # initialize with QueryInfo dataclass + lins: bool = False def __post_init__(self): self.query_name = self.query_info.query_name # for convenience @@ -1557,7 +1838,7 @@ def _init_classification_results(self): self.krona_header = [] def is_compatible(self, taxresult): - return taxresult.query_info == self.query_info + return taxresult.query_info == self.query_info and taxresult.lins == self.lins @property def ascending_ranks(self): @@ -1650,7 +1931,10 @@ def build_summarized_result(self, single_rank=None, force_resummarize=False): self.total_bp_classified[rank] += bp_intersect_at_rank # record unclassified - lineage = RankLineageInfo() + if self.lins: + lineage = LINLineageInfo() + else: + lineage = RankLineageInfo() query_ani = None f_unique = 1.0 - self.total_f_classified[rank] if f_unique > 0: @@ -1840,3 +2124,52 @@ def make_kreport_results(self): unclassified_recorded = True kreport_results.append(kresD) return header, kreport_results + + def make_lingroup_results(self, LINgroupsD): # LingroupsD is dictionary {lg_prefix: lg_name} + """ + Report results for the specified LINGroups. + Keep LCA paths in order as much as possible. + """ + self.check_summarization() + header = ["name", "lin", "percent_containment", "num_bp_contained"] + + if self.query_info.total_weighted_hashes == 0: + raise ValueError("ERROR: cannot produce 'LINgroup_report' format from gather results before sourmash v4.5.0") + + # find the ranks we need to consider + all_lgs = set() + lg_ranks = set() + for lg_prefix in LINgroupsD.keys(): + # store lineage info for LCA pathfinding + lg_info = LINLineageInfo(lineage_str=lg_prefix) + all_lgs.add(lg_info) + # store rank so we only go through summarized results at these ranks + lg_rank = int(lg_info.lowest_rank) + lg_ranks.add(lg_rank) + + # grab summarized results matching LINgroup prefixes + lg_results = {} + for rank in lg_ranks: + rank = str(rank) + rank_results = self.summarized_lineage_results[rank] + for res in rank_results: + if res.lineage in all_lgs:# is this lineage in the list of LINgroups? 
+ this_lingroup_name = LINgroupsD[res.lineage.display_lineage(truncate_empty=True)] + lg_resD = res.as_lingroup_dict(self.query_info, this_lingroup_name) + lg_results[res.lineage] = lg_resD + + # We want to return in ~ depth order: descending each specific path in order + # use LineageTree to find ordered paths + lg_tree = LineageTree(all_lgs) + ordered_paths = lg_tree.ordered_paths(include_internal = True) + # store results in order: + lingroup_results=[] + for lg in ordered_paths: + # get LINInfo object + lg_LINInfo = LINLineageInfo(lineage=lg) + # get result, if we have it + lg_res = lg_results.get(lg_LINInfo) + if lg_res: + lingroup_results.append(lg_res) + + return header, lingroup_results diff --git a/tests/test-data/tax/test.LIN-taxonomy.csv b/tests/test-data/tax/test.LIN-taxonomy.csv new file mode 100644 index 0000000000..1544b78994 --- /dev/null +++ b/tests/test-data/tax/test.LIN-taxonomy.csv @@ -0,0 +1,7 @@ +ident,lin +GCF_001881345.1,0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0 +GCF_009494285.1,1;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0 +GCF_013368705.1,2;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0 +GCF_003471795.1,1;0;1;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0 +GCF_000017325.1,1;0;1;0;0;0;0;0;0;0;0;0;0;1;0;0;0;0;0;0 +GCF_000021665.1,1;0;1;0;0;0;0;0;0;0;0;0;0;1;1;0;0;0;0;0 diff --git a/tests/test_tax.py b/tests/test_tax.py index f852825068..b5314bac3c 100644 --- a/tests/test_tax.py +++ b/tests/test_tax.py @@ -405,7 +405,8 @@ def test_metagenome_no_rank_lineage_summary(runtmp): with pytest.raises(SourmashCommandFailed) as exc: runtmp.run_sourmash('tax', 'metagenome', '-g', g_csv, '--taxonomy-csv', tax, '-o', csv_base, '--output-format', 'lineage_summary') - assert "Rank (--rank) is required for krona and lineage_summary output formats." in str(exc.value) + print(str(exc.value)) + assert "Rank (--rank) is required for krona, lineage_summary output formats." in str(exc.value) def test_metagenome_no_rank_krona(runtmp): @@ -415,7 +416,24 @@ def test_metagenome_no_rank_krona(runtmp): with pytest.raises(SourmashCommandFailed) as exc: runtmp.run_sourmash('tax', 'metagenome', '-g', g_csv, '--taxonomy-csv', tax, '-o', csv_base, '--output-format', 'krona') - assert "Rank (--rank) is required for krona and lineage_summary output formats." in str(exc.value) + print(str(exc.value)) + assert "Rank (--rank) is required for krona, lineage_summary output formats." in str(exc.value) + + +def test_metagenome_bad_rank_krona(runtmp): + g_csv = utils.get_test_data('tax/test1.gather.csv') + tax = utils.get_test_data('tax/test.taxonomy.csv') + csv_base = "out" + + with pytest.raises(SourmashCommandFailed) as exc: + runtmp.run_sourmash('tax', 'metagenome', '-g', g_csv, '--taxonomy-csv', tax, '-o', csv_base, '--output-format', 'krona', '--rank', 'NotARank') + print(str(exc.value)) + assert "Invalid '--rank'/'--position' input: 'NotARank'. Please choose: 'strain', 'species', 'genus', 'family', 'order', 'class', 'phylum', 'superkingdom'" in runtmp.last_result.err + + with pytest.raises(SourmashCommandFailed) as exc: + runtmp.run_sourmash('tax', 'metagenome', '-g', g_csv, '--taxonomy-csv', tax, '-o', csv_base, '--output-format', 'krona', '--rank', '5') + print(str(exc.value)) + assert "Invalid '--rank'/'--position' input: '5'. 
Please choose: 'strain', 'species', 'genus', 'family', 'order', 'class', 'phylum', 'superkingdom'" in runtmp.last_result.err def test_genome_no_rank_krona(runtmp): @@ -2195,6 +2213,35 @@ def test_annotate_gzipped_gather(runtmp): assert "d__Bacteria;p__Bacteroidota;c__Bacteroidia;o__Bacteroidales;f__Bacteroidaceae;g__Prevotella;s__Prevotella copri" in lin_gather_results[4] +def test_annotate_0_LIN(runtmp): + # test annotate basics + c = runtmp + + g_csv = utils.get_test_data('tax/test1.gather.csv') + tax = utils.get_test_data('tax/test.LIN-taxonomy.csv') + csvout = runtmp.output("test1.gather.with-lineages.csv") + out_dir = os.path.dirname(csvout) + + c.run_sourmash('tax', 'annotate', '--gather-csv', g_csv, '--taxonomy-csv', tax, '-o', out_dir, "--lins") + + print(c.last_result.status) + print(c.last_result.out) + print(c.last_result.err) + + assert c.last_result.status == 0 + assert os.path.exists(csvout) + + lin_gather_results = [x.rstrip() for x in open(csvout)] + print("\n".join(lin_gather_results)) + assert f"saving 'annotate' output to '{csvout}'" in runtmp.last_result.err + + assert "lineage" in lin_gather_results[0] + assert "0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0" in lin_gather_results[1] + assert "1;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0" in lin_gather_results[2] + assert "2;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0" in lin_gather_results[3] + assert "1;0;1;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0" in lin_gather_results[4] + + def test_annotate_gather_argparse(runtmp): # test annotate with two gather CSVs, second one empty, and --force. # this tests argparse handling w/extend. @@ -3287,3 +3334,321 @@ def test_tax_summarize_strain_csv_with_lineages(runtmp): assert c['2'] == 5 assert c['6'] == 1 assert c['1'] == 11 + + +def test_tax_summarize_LINS(runtmp): + # test basic operation w/LINs + taxfile = utils.get_test_data('tax/test.LIN-taxonomy.csv') + lineage_csv = runtmp.output('annotated-lin.csv') + + taxdb = tax_utils.LineageDB.load(taxfile, lins=True) + with open(lineage_csv, 'w', newline="") as fp: + w = csv.writer(fp) + w.writerow(['name', 'lineage']) + for k, v in taxdb.items(): + lin = tax_utils.LINLineageInfo(lineage=v) + linstr = lin.display_lineage(truncate_empty=False) + print(linstr) + w.writerow([k, linstr]) + + runtmp.sourmash('tax', 'summarize', lineage_csv, '-o', 'ranks.csv', '--lins') + + out = runtmp.last_result.out + err = runtmp.last_result.err + + print(out) + print(err) + + assert "number of distinct taxonomic lineages: 6" in out + assert "saved 91 lineage counts to" in err + + csv_out = runtmp.output('ranks.csv') + + with sourmash_args.FileInputCSV(csv_out) as r: + # count number across ranks as a cheap consistency check + c = Counter() + for row in r: + print(row) + val = row['lineage_count'] + c[val] += 1 + + print(list(c.most_common())) + + assert c['1'] == 77 + assert c['2'] == 1 + assert c['3'] == 11 + assert c['4'] == 2 + + +def test_metagenome_LIN(runtmp): + # test basic metagenome with LIN taxonomy + c = runtmp + + g_csv = utils.get_test_data('tax/test1.gather.csv') + tax = utils.get_test_data('tax/test.LIN-taxonomy.csv') + + c.run_sourmash('tax', 'metagenome', '-g', g_csv, '--taxonomy-csv', tax, '--lins') + + print(c.last_result.status) + print(c.last_result.out) + print(c.last_result.err) + + assert c.last_result.status == 0 + assert 'query_name,rank,fraction,lineage,query_md5,query_filename,f_weighted_at_rank,bp_match_at_rank' in c.last_result.out + # 0th rank/position + assert "test1,0,0.089,1,md5,test1.sig,0.057,444000,0.925,0" in c.last_result.out + assert 
"test1,0,0.088,0,md5,test1.sig,0.058,442000,0.925,0" in c.last_result.out + assert "test1,0,0.028,2,md5,test1.sig,0.016,138000,0.891,0" in c.last_result.out + assert "test1,0,0.796,unclassified,md5,test1.sig,0.869,3990000,,0" in c.last_result.out + # 1st rank/position + assert "test1,1,0.089,1;0,md5,test1.sig,0.057,444000,0.925,0" in c.last_result.out + assert "test1,1,0.088,0;0,md5,test1.sig,0.058,442000,0.925,0" in c.last_result.out + assert "test1,1,0.028,2;0,md5,test1.sig,0.016,138000,0.891,0" in c.last_result.out + assert "test1,1,0.796,unclassified,md5,test1.sig,0.869,3990000,,0" in c.last_result.out + # 2nd rank/position + assert "test1,2,0.088,0;0;0,md5,test1.sig,0.058,442000,0.925,0" in c.last_result.out + assert "test1,2,0.078,1;0;0,md5,test1.sig,0.050,390000,0.921,0" in c.last_result.out + assert "test1,2,0.028,2;0;0,md5,test1.sig,0.016,138000,0.891,0" in c.last_result.out + assert "test1,2,0.011,1;0;1,md5,test1.sig,0.007,54000,0.864,0" in c.last_result.out + assert "test1,2,0.796,unclassified,md5,test1.sig,0.869,3990000,,0" in c.last_result.out + # 19th rank/position + assert "test1,19,0.088,0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0,md5,test1.sig,0.058,442000,0.925,0" in c.last_result.out + assert "test1,19,0.078,1;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0,md5,test1.sig,0.050,390000,0.921,0" in c.last_result.out + assert "test1,19,0.028,2;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0,md5,test1.sig,0.016,138000,0.891,0" in c.last_result.out + assert "test1,19,0.011,1;0;1;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0,md5,test1.sig,0.007,54000,0.864,0" in c.last_result.out + assert "test1,19,0.796,unclassified,md5,test1.sig,0.869,3990000,,0" in c.last_result.out + + +def test_metagenome_LIN_lingroups(runtmp): + # test lingroups output + c = runtmp + + g_csv = utils.get_test_data('tax/test1.gather.v450.csv') + tax = utils.get_test_data('tax/test.LIN-taxonomy.csv') + + lg_file = runtmp.output("test.lg.csv") + with open(lg_file, 'w') as out: + out.write('lin,name\n') + out.write('0;0;0,lg1\n') + out.write('1;0;0,lg2\n') + out.write('2;0;0,lg3\n') + out.write('1;0;1,lg3\n') + # write a 19 so we can check the end + out.write('1;0;1;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0,lg4\n') + + c.run_sourmash('tax', 'metagenome', '-g', g_csv, '--taxonomy-csv', tax, + '--lins', '--lingroup', lg_file) + + print(c.last_result.status) + print(c.last_result.out) + print(c.last_result.err) + + assert c.last_result.status == 0 + assert "Starting summarization up rank(s): 19, 18, 17, 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0" in c.last_result.err + assert "Read 5 lingroup rows and found 5 distinct lingroup prefixes." 
in c.last_result.err + assert "name lin percent_containment num_bp_contained" in c.last_result.out + assert "lg1 0;0;0 5.82 714000" in c.last_result.out + assert "lg2 1;0;0 5.05 620000" in c.last_result.out + assert "lg3 2;0;0 1.56 192000" in c.last_result.out + assert "lg3 1;0;1 0.65 80000" in c.last_result.out + assert "lg4 1;0;1;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0 0.65 80000" in c.last_result.out + + +def test_metagenome_LIN_human_summary_no_lin_position(runtmp): + c = runtmp + + g_csv = utils.get_test_data('tax/test1.gather.v450.csv') + tax = utils.get_test_data('tax/test.LIN-taxonomy.csv') + + c.run_sourmash('tax', 'metagenome', '-g', g_csv, '--taxonomy-csv', tax, + '--lins', '-F', "human") + + print(c.last_result.status) + print(c.last_result.out) + print(c.last_result.err) + + assert c.last_result.status == 0 + assert "Starting summarization up rank(s): 19, 18, 17, 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0" in c.last_result.err + assert "sample name proportion cANI lineage" in c.last_result.out + assert "----------- ---------- ---- -------" in c.last_result.out + assert "test1 86.9% - unclassified" in c.last_result.out + assert "test1 5.8% 92.5% 0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0" in c.last_result.out + assert "test1 5.0% 92.1% 1;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0" in c.last_result.out + assert "test1 1.6% 89.1% 2;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0" in c.last_result.out + assert "test1 0.7% 86.4% 1;0;1;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0" in c.last_result.out + + +def test_metagenome_LIN_human_summary_lin_position_5(runtmp): + c = runtmp + + g_csv = utils.get_test_data('tax/test1.gather.v450.csv') + tax = utils.get_test_data('tax/test.LIN-taxonomy.csv') + + c.run_sourmash('tax', 'metagenome', '-g', g_csv, '--taxonomy-csv', tax, + '--lins', '-F', "human", '--lin-position', '5') + + print(c.last_result.status) + print(c.last_result.out) + print(c.last_result.err) + + assert c.last_result.status == 0 + assert "Starting summarization up rank(s): 19, 18, 17, 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0" in c.last_result.err + assert "sample name proportion cANI lineage" in c.last_result.out + assert "----------- ---------- ---- -------" in c.last_result.out + assert "test1 86.9% - unclassified" in c.last_result.out + assert "test1 5.8% 92.5% 0;0;0;0;0;0" in c.last_result.out + assert "test1 5.0% 92.1% 1;0;0;0;0;0" in c.last_result.out + assert "test1 1.6% 89.1% 2;0;0;0;0;0" in c.last_result.out + assert "test1 0.7% 86.4% 1;0;1;0;0;0" in c.last_result.out + + +def test_metagenome_LIN_krona_lin_position_5(runtmp): + c = runtmp + + g_csv = utils.get_test_data('tax/test1.gather.v450.csv') + tax = utils.get_test_data('tax/test.LIN-taxonomy.csv') + + c.run_sourmash('tax', 'metagenome', '-g', g_csv, '--taxonomy-csv', tax, + '--lins', '-F', "krona", '--lin-position', '5') + + print(c.last_result.status) + print(c.last_result.out) + print(c.last_result.err) + + assert c.last_result.status == 0 + assert "Starting summarization up rank(s): 19, 18, 17, 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0" in c.last_result.err + assert "fraction 0 1 2 3 4 5" in c.last_result.out + assert "0.08815317112086159 0 0 0 0 0 0" in c.last_result.out + assert "0.07778220981252493 1 0 0 0 0 0" in c.last_result.out + assert "0.027522935779816515 2 0 0 0 0 0" in c.last_result.out + assert "0.010769844435580374 1 0 1 0 0 0" in c.last_result.out + assert "0.7957718388512166 unclassified unclassified unclassified unclassified unclassified unclassified" in c.last_result.out + + 
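The lingroup tests above construct a two-column CSV and pass it via `--lingroup`; the new `read_lingroups` helper is what parses that file before the lingroup report is built. Below is a minimal sketch of that helper in isolation, assuming sourmash with this change installed; the file name is a placeholder.

```
import csv
from sourmash.tax.tax_utils import read_lingroups

# build a small lingroup file: two required columns, 'lin' and 'name'
with open('lingroups.csv', 'w', newline='') as out:
    w = csv.writer(out)
    w.writerow(['lin', 'name'])
    w.writerow(['0;0;0', 'lg1'])
    w.writerow(['1;0;0', 'lg2'])

# returns a dict mapping each LIN prefix to its group name,
# e.g. {'0;0;0': 'lg1', '1;0;0': 'lg2'}
lingroupD = read_lingroups('lingroups.csv')
```

The resulting `{lin_prefix: name}` dictionary is what `QueryTaxResult.make_lingroup_results` consumes when `--lins --lingroup` are given on the command line.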
+def test_metagenome_LIN_krona_bad_rank(runtmp): + c = runtmp + + g_csv = utils.get_test_data('tax/test1.gather.v450.csv') + tax = utils.get_test_data('tax/test.LIN-taxonomy.csv') + + with pytest.raises(SourmashCommandFailed): + c.run_sourmash('tax', 'metagenome', '-g', g_csv, '--taxonomy-csv', tax, + '--lins', '-F', "krona", '--lin-position', 'strain') + + print(c.last_result.status) + print(c.last_result.out) + print(c.last_result.err) + + assert c.last_result.status != 0 + assert "Invalid '--rank'/'--position' input: 'strain'. '--lins' is specified. Rank must be an integer corresponding to a LIN position." in c.last_result.err + + + +def test_metagenome_LIN_lingroups_empty_lg_file(runtmp): + c = runtmp + + g_csv = utils.get_test_data('tax/test1.gather.v450.csv') + tax = utils.get_test_data('tax/test.LIN-taxonomy.csv') + + lg_file = runtmp.output("test.lg.csv") + with open(lg_file, 'w') as out: + out.write("") + + with pytest.raises(SourmashCommandFailed): + c.run_sourmash('tax', 'metagenome', '-g', g_csv, '--taxonomy-csv', tax, + '--lins', '--lingroup', lg_file) + + print(c.last_result.status) + print(c.last_result.out) + print(c.last_result.err) + + assert c.last_result.status != 0 + assert "Starting summarization up rank(s): 19, 18, 17, 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0" in c.last_result.err + assert f"Cannot read lingroups from '{lg_file}'. Is file empty?" in c.last_result.err + + +def test_metagenome_LIN_lingroups_bad_cli_inputs(runtmp): + c = runtmp + + g_csv = utils.get_test_data('tax/test1.gather.v450.csv') + tax = utils.get_test_data('tax/test.LIN-taxonomy.csv') + + lg_file = runtmp.output("test.lg.csv") + with open(lg_file, 'w') as out: + out.write("") + + with pytest.raises(SourmashCommandFailed): + c.run_sourmash('tax', 'metagenome', '-g', g_csv, '--taxonomy-csv', tax, + '--lins', '-F', "lingroup") + + print(c.last_result.status) + print(c.last_result.out) + print(c.last_result.err) + + assert c.last_result.status != 0 + assert "Must provide lingroup csv via '--lingroup' in order to output a lingroup report." in c.last_result.err + + with pytest.raises(SourmashCommandFailed): + c.run_sourmash('tax', 'metagenome', '-g', g_csv, '--taxonomy-csv', tax, '-F', "lingroup") + print(c.last_result.err) + assert c.last_result.status != 0 + assert "Must enable LIN taxonomy via '--lins' in order to use lingroups." in c.last_result.err + + with pytest.raises(SourmashCommandFailed): + c.run_sourmash('tax', 'metagenome', '-g', g_csv, '--taxonomy-csv', tax, '--lingroup', lg_file) + print(c.last_result.err) + assert c.last_result.status != 0 + assert "Must enable LIN taxonomy via '--lins' in order to use lingroups." 
in c.last_result.err + + +def test_metagenome_mult_outputs_stdout_fail(runtmp): + c = runtmp + + g_csv = utils.get_test_data('tax/test1.gather.v450.csv') + tax = utils.get_test_data('tax/test.LIN-taxonomy.csv') + + with pytest.raises(SourmashCommandFailed): + c.run_sourmash('tax', 'metagenome', '-g', g_csv, '--taxonomy-csv', tax, + '-F', "kreport", 'csv_summary') + + print(c.last_result.err) + assert c.last_result.status != 0 + assert f"Writing to stdout is incompatible with multiple output formats ['kreport', 'csv_summary']" in c.last_result.err + + +def test_genome_mult_outputs_stdout_fail(runtmp): + c = runtmp + + g_csv = utils.get_test_data('tax/test1.gather.v450.csv') + tax = utils.get_test_data('tax/test.LIN-taxonomy.csv') + + with pytest.raises(SourmashCommandFailed): + c.run_sourmash('tax', 'genome', '-g', g_csv, '--taxonomy-csv', tax, + '-F', "lineage_csv", 'csv_summary') + + print(c.last_result.err) + assert c.last_result.status != 0 + assert f"Writing to stdout is incompatible with multiple output formats ['lineage_csv', 'csv_summary']" in c.last_result.err + + +def test_metagenome_LIN_lingroups_lg_only_header(runtmp): + c = runtmp + + g_csv = utils.get_test_data('tax/test1.gather.v450.csv') + tax = utils.get_test_data('tax/test.LIN-taxonomy.csv') + + lg_file = runtmp.output("test.lg.csv") + with open(lg_file, 'w') as out: + out.write('lin,name\n') + + with pytest.raises(SourmashCommandFailed): + c.run_sourmash('tax', 'metagenome', '-g', g_csv, '--taxonomy-csv', tax, + '--lins', '--lingroup', lg_file) + + print(c.last_result.status) + print(c.last_result.out) + print(c.last_result.err) + + assert c.last_result.status != 0 + assert "Starting summarization up rank(s): 19, 18, 17, 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0" in c.last_result.err + assert f"No lingroups loaded from {lg_file}" in c.last_result.err diff --git a/tests/test_tax_utils.py b/tests/test_tax_utils.py index db4a277363..412340ae37 100644 --- a/tests/test_tax_utils.py +++ b/tests/test_tax_utils.py @@ -12,22 +12,23 @@ from sourmash.tax.tax_utils import (ascending_taxlist, get_ident, load_gather_results, collect_gather_csvs, check_and_load_gather_csvs, + LineagePair, QueryInfo, GatherRow, TaxResult, QueryTaxResult, SummarizedGatherResult, ClassificationResult, - QueryInfo, GatherRow, TaxResult, QueryTaxResult, - BaseLineageInfo, RankLineageInfo, LineagePair, + BaseLineageInfo, RankLineageInfo, LINLineageInfo, aggregate_by_lineage_at_rank, format_for_krona, - write_krona, write_lineage_sample_frac, - LineageDB, LineageDB_Sqlite, MultiLineageDB) - -# import lca utils as needed -from sourmash.lca import lca_utils + write_krona, write_lineage_sample_frac, read_lingroups, + LineageTree, LineageDB, LineageDB_Sqlite, MultiLineageDB) # utility functions for testing -def make_mini_taxonomy(tax_info): +def make_mini_taxonomy(tax_info, LIN=False): #pass in list of tuples: (name, lineage) taxD = {} - for (name,lin) in tax_info: - taxD[name] = lca_utils.make_lineage(lin) + for (name, lin) in tax_info: + if LIN: + lineage = LINLineageInfo(lineage_str=lin) + else: + lineage = RankLineageInfo(lineage_str=lin) + taxD[name] = lineage.filled_lineage return taxD @@ -53,28 +54,30 @@ def make_GatherRow(gather_dict=None, exclude_cols=[]): return gatherRaw -def make_TaxResult(gather_dict=None, taxD=None, keep_full_ident=False, keep_ident_version=False, skip_idents=None): +def make_TaxResult(gather_dict=None, taxD=None, keep_full_ident=False, keep_ident_version=False, skip_idents=None, LIN=False): """Make TaxResult from 
artificial gather row (dict)""" gRow = make_GatherRow(gather_dict) - taxres = TaxResult(raw=gRow, keep_full_identifiers=keep_full_ident, keep_identifier_versions=keep_ident_version) + taxres = TaxResult(raw=gRow, keep_full_identifiers=keep_full_ident, + keep_identifier_versions=keep_ident_version, lins=LIN) if taxD is not None: taxres.get_match_lineage(tax_assignments=taxD, skip_idents=skip_idents) return taxres def make_QueryTaxResults(gather_info, taxD=None, single_query=False, keep_full_ident=False, keep_ident_version=False, - skip_idents=None, summarize=False, classify=False, classify_rank=None, c_thresh=0.1, ani_thresh=None): + skip_idents=None, summarize=False, classify=False, classify_rank=None, c_thresh=0.1, ani_thresh=None, + LIN=False): """Make QueryTaxResult(s) from artificial gather information, formatted as list of gather rows (dicts)""" gather_results = {} this_querytaxres = None for gather_infoD in gather_info: taxres = make_TaxResult(gather_infoD, taxD=taxD, keep_full_ident=keep_full_ident, - keep_ident_version=keep_ident_version, skip_idents=skip_idents) + keep_ident_version=keep_ident_version, skip_idents=skip_idents, LIN=LIN) query_name = taxres.query_name # add to matching QueryTaxResult or create new one if not this_querytaxres or not this_querytaxres.is_compatible(taxres): # get existing or initialize new - this_querytaxres = gather_results.get(query_name, QueryTaxResult(taxres.query_info)) + this_querytaxres = gather_results.get(query_name, QueryTaxResult(taxres.query_info, lins=LIN)) this_querytaxres.add_taxresult(taxres) # print('missed_ident?', taxres.missed_ident) gather_results[query_name] = this_querytaxres @@ -158,6 +161,27 @@ def test_SummarizedGatherResult(): 'family': '', 'genus': '', 'species': '', 'strain': ''} +def test_SummarizedGatherResult_LINs(): + "SummarizedGatherResult with LINs" + qInf = QueryInfo(query_name='q1', query_md5='md5', query_filename='f1',query_bp='100', + query_n_hashes='10',ksize='31',scaled='10', total_weighted_hashes='200') + sgr = SummarizedGatherResult(rank="phylum", fraction=0.2, lineage=LINLineageInfo(lineage_str="0;0;1"), + f_weighted_at_rank=0.3, bp_match_at_rank=30) + + lgD = sgr.as_lingroup_dict(query_info=qInf, lg_name="lg_name") + print(lgD) + assert lgD == {'name': "lg_name", "lin": "0;0;1", + 'percent_containment': '30.00', 'num_bp_contained': "600"} + lgD = sgr.as_lingroup_dict(query_info=qInf, lg_name="lg_name") + print(lgD) + assert lgD == {'name': "lg_name", "lin": "0;0;1", + 'percent_containment': '30.00', 'num_bp_contained': "600"} + with pytest.raises(ValueError) as exc: + sgr.as_kreport_dict(query_info=qInf) + print(str(exc)) + assert "Cannot produce 'kreport' with LIN taxonomy." 
in str(exc) + + def test_SummarizedGatherResult_set_query_ani(): "Check ANI estimation within SummarizedGatherResult dataclass" qInf = QueryInfo(query_name='q1', query_md5='md5', query_filename='f1',query_bp='100', @@ -574,6 +598,43 @@ def test_load_taxonomy_csv(): assert len(tax_assign) == 6 # should have read 6 rows +def test_load_taxonomy_csv_LIN(): + taxonomy_csv = utils.get_test_data('tax/test.LIN-taxonomy.csv') + tax_assign = MultiLineageDB.load([taxonomy_csv], lins=True) + print("taxonomy assignments: \n", tax_assign) + assert list(tax_assign.keys()) == ['GCF_001881345.1', 'GCF_009494285.1', 'GCF_013368705.1', 'GCF_003471795.1', 'GCF_000017325.1', 'GCF_000021665.1'] + #assert list(tax_assign.keys()) == ["GCF_000010525.1", "GCF_000007365.1", "GCF_000007725.1", "GCF_000009605.1", "GCF_000021065.1", "GCF_000021085.1"] + assert len(tax_assign) == 6 # should have read 6 rows + print(tax_assign.available_ranks) + assert tax_assign.available_ranks == {str(x) for x in range(0,20)} + + +def test_load_taxonomy_csv_LIN_fail(): + taxonomy_csv = utils.get_test_data('tax/test.taxonomy.csv') + with pytest.raises(ValueError) as exc: + MultiLineageDB.load([taxonomy_csv], lins=True) + assert f"'lin' column not found: cannot read LIN taxonomy assignments from {taxonomy_csv}." in str(exc.value) + + +def test_load_taxonomy_csv_LIN_mismatch_in_taxfile(runtmp): + taxonomy_csv = utils.get_test_data('tax/test.LIN-taxonomy.csv') + mimatchLIN_csv = runtmp.output('mmLIN-taxonomy.csv') + with open(mimatchLIN_csv, 'w') as mm: + tax21=[] + tax = [x.rstrip() for x in open(taxonomy_csv, 'r')] + for n, taxline in enumerate(tax): + if n == 2: # add ;0 to a LIN + taxlist = taxline.split(',') + taxlist[1] += ';0' # add 21st position to LIN + tax21.append(",".join(taxlist)) + else: + tax21.append(taxline) + mm.write("\n".join(tax21)) + with pytest.raises(ValueError) as exc: + MultiLineageDB.load([mimatchLIN_csv], lins=True) + assert "For taxonomic summarization, all LIN assignments must use the same number of LIN positions." 
in str(exc.value) + + def test_load_taxonomy_csv_gzip(runtmp): # test loading a gzipped taxonomy csv file taxonomy_csv = utils.get_test_data('tax/test.taxonomy.csv') @@ -864,13 +925,13 @@ def test_tax_multi_load_files_shadowed(runtmp): assert len(db.shadowed_identifiers()) == 6 # we should have everything including strain - assert set(lca_utils.taxlist()) == set(db.available_ranks) + assert set(RankLineageInfo().taxlist) == set(db.available_ranks) db = MultiLineageDB.load([taxonomy_csv, taxonomy_db], keep_full_identifiers=False, keep_identifier_versions=False) assert len(db.shadowed_identifiers()) == 6 - assert set(lca_utils.taxlist(include_strain=False)) == set(db.available_ranks) + assert set(RankLineageInfo().taxlist[:-1]) == set(db.available_ranks) def test_tax_multi_save_files(runtmp, keep_identifiers, keep_versions): @@ -1104,7 +1165,7 @@ def test_BaseLineageInfo_init_not_lineagepair(): with pytest.raises(ValueError) as exc: BaseLineageInfo(lineage=lin_tups, ranks=ranks) print(str(exc)) - assert "is not LineagePair" in str(exc) + assert "is not tax_utils LineagePair" in str(exc) def test_RankLineageInfo_taxlist(): @@ -1122,6 +1183,123 @@ def test_RankLineageInfo_init_lineage_str(): assert taxinf.zip_lineage()== ['a', 'b', 'c', '', '', '', '', ''] +def test_LINLineageInfo_init_empty(): + taxinf = LINLineageInfo() + assert taxinf.n_lin_positions == 0 + assert taxinf.zip_lineage()== [] + assert taxinf.display_lineage()== "" + assert taxinf.filled_ranks == () + assert taxinf.n_filled_pos == 0 + + +def test_LINLineageInfo_init_n_pos(): + n_pos = 5 + taxinf = LINLineageInfo(n_lin_positions=n_pos) + print(taxinf.lineage) + print(taxinf.lineage_str) + assert taxinf.n_lin_positions == 5 + assert taxinf.zip_lineage()== ['', '', '', '', ''] + assert taxinf.filled_ranks == () + assert taxinf.n_filled_pos == 0 + + +def test_LINLineageInfo_init_n_pos_and_lineage_str(): + x = "0;0;1" + n_pos = 5 + taxinf = LINLineageInfo(lineage_str=x, n_lin_positions=n_pos) + print(taxinf.lineage) + print(taxinf.lineage_str) + assert taxinf.n_lin_positions == 5 + assert taxinf.zip_lineage()== ['0', '0', '1', '', ''] + assert taxinf.filled_ranks == ("0","1","2") + assert taxinf.n_filled_pos == 3 + + +def test_LINLineageInfo_init_n_pos_and_lineage_str_fail(): + x = "0;0;1" + n_pos = 2 + with pytest.raises(ValueError) as exc: + LINLineageInfo(lineage_str=x, n_lin_positions=n_pos) + print(str(exc)) + assert "Provided 'n_lin_positions' has fewer positions than provided 'lineage_str'." 
in str(exc) + + +def test_LINLineageInfo_init_lineage_str_only(): + x = "0,0,1" + taxinf = LINLineageInfo(lineage_str=x) + print(taxinf.lineage) + print(taxinf.lineage_str) + assert taxinf.n_lin_positions == 3 + assert taxinf.zip_lineage()== ['0', '0', '1'] + assert taxinf.filled_ranks == ("0","1","2") + assert taxinf.n_filled_pos == 3 + + +def test_LINLineageInfo_init_not_lineagepair(): + lin_tups = (("rank1", "name1"),) + with pytest.raises(ValueError) as exc: + LINLineageInfo(lineage=lin_tups) + print(str(exc)) + assert "is not tax_utils LineagePair" in str(exc) + + +def test_LINLineageInfo_init_lineagepair(): + lin_tups = (LineagePair("rank1", "name1"), LineagePair("rank2", None),) + taxinf = LINLineageInfo(lineage=lin_tups) + print(taxinf.lineage) + assert taxinf.n_lin_positions == 2 + assert taxinf.zip_lineage()== ["name1", ""] + assert taxinf.zip_lineage(truncate_empty=True)== ["name1"] + assert taxinf.filled_ranks == ("rank1",) + assert taxinf.ranks == ("rank1", "rank2") + assert taxinf.n_filled_pos == 1 + + +def test_lca_LINLineageInfo_diff_n_pos(): + x = "0;0;1" + y = '0' + lin1 = LINLineageInfo(lineage_str=x) + lin2 = LINLineageInfo(lineage_str=y) + assert lin1.is_compatible(lin2) + assert lin2.is_compatible(lin1) + lca_from_lin1 = lin1.find_lca(lin2) + lca_from_lin2 = lin2.find_lca(lin1) + assert lca_from_lin1 == lca_from_lin2 + assert lca_from_lin1.display_lineage(truncate_empty=True) == "0" + + +def test_lca_LINLineageInfo_no_lca(): + x = "0;0;1" + y = '12;0;1' + lin1 = LINLineageInfo(lineage_str=x) + lin2 = LINLineageInfo(lineage_str=y) + assert lin1.is_compatible(lin2) + assert lin2.is_compatible(lin1) + lca_from_lin1 = lin1.find_lca(lin2) + lca_from_lin2 = lin2.find_lca(lin1) + assert lca_from_lin1 == lca_from_lin2 == None + + +def test_lca_RankLineageInfo_no_lca(): + x = "a;b;c" + y = 'd;e;f;g' + lin1 = RankLineageInfo(lineage_str=x) + lin2 = RankLineageInfo(lineage_str=y) + assert lin1.is_compatible(lin2) + assert lin2.is_compatible(lin1) + lca_from_lin1 = lin1.find_lca(lin2) + lca_from_lin2 = lin2.find_lca(lin1) + assert lca_from_lin1 == lca_from_lin2 == None + + +def test_incompatibility_LINLineageInfo_RankLineageInfo(): + x="a;b;c" + lin1 = RankLineageInfo(lineage_str=x) + lin2 = LINLineageInfo(lineage_str=x) + assert not lin1.is_compatible(lin2) + assert not lin2.is_compatible(lin1) + + def test_RankLineageInfo_init_lineage_str_with_ranks_as_list(): x = "a;b;c" taxranks = ['superkingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species'] @@ -1348,6 +1526,7 @@ def test_is_lineage_match_1(): lin1 = RankLineageInfo(lineage_str = 'd__a;p__b;c__c;o__d;f__e') lin2 = RankLineageInfo(lineage_str = 'd__a;p__b;c__c;o__d;f__f') print(lin1.lineage) + assert lin1.is_compatible(lin2) assert lin1.is_lineage_match(lin2, 'superkingdom') assert lin2.is_lineage_match(lin1, 'superkingdom') assert lin1.is_lineage_match(lin2, 'phylum') @@ -1364,11 +1543,19 @@ def test_is_lineage_match_1(): assert not lin1.is_lineage_match(lin2, 'species') assert not lin2.is_lineage_match(lin1, 'species') + lca_from_lin1 = lin1.find_lca(lin2) + print(lca_from_lin1.display_lineage()) + lca_from_lin2 = lin2.find_lca(lin1) + assert lca_from_lin1 == lca_from_lin2 + assert lca_from_lin1.display_lineage() == "d__a;p__b;c__c;o__d" + + def test_is_lineage_match_2(): # match at family, and above, levels; no genus or species to match lin1 = RankLineageInfo(lineage_str = 'd__a;p__b;c__c;o__d;f__f') lin2 = RankLineageInfo(lineage_str = 'd__a;p__b;c__c;o__d;f__f') + assert lin1.is_compatible(lin2) assert 
lin1.is_lineage_match(lin2, 'superkingdom') assert lin2.is_lineage_match(lin1, 'superkingdom') assert lin1.is_lineage_match(lin2, 'phylum') @@ -1385,12 +1572,19 @@ def test_is_lineage_match_2(): assert not lin1.is_lineage_match(lin2, 'species') assert not lin2.is_lineage_match(lin1, 'species') + lca_from_lin1 = lin1.find_lca(lin2) + print(lca_from_lin1.display_lineage()) + lca_from_lin2 = lin2.find_lca(lin1) + assert lca_from_lin1 == lca_from_lin2 + assert lca_from_lin1.display_lineage() == "d__a;p__b;c__c;o__d;f__f" + def test_is_lineage_match_3(): # one lineage is empty lin1 = RankLineageInfo() lin2 = RankLineageInfo(lineage_str = 'd__a;p__b;c__c;o__d;f__f') + assert lin1.is_compatible(lin2) assert not lin1.is_lineage_match(lin2, 'superkingdom') assert not lin2.is_lineage_match(lin1, 'superkingdom') assert not lin1.is_lineage_match(lin2, 'phylum') @@ -1413,6 +1607,7 @@ def test_is_lineage_match_incorrect_ranks(): lin1 = RankLineageInfo(lineage_str = 'd__a;p__b;c__c;o__d;f__e', ranks=taxranks[::-1]) lin2 = RankLineageInfo(lineage_str = 'd__a;p__b;c__c;o__d;f__f') print(lin1.lineage) + assert not lin1.is_compatible(lin2) with pytest.raises(ValueError) as exc: lin1.is_lineage_match(lin2, 'superkingdom') print(str(exc)) @@ -1424,6 +1619,7 @@ def test_is_lineage_match_improper_rank(): lin1 = RankLineageInfo(lineage_str = 'd__a;p__b;c__c;o__d;f__e') lin2 = RankLineageInfo(lineage_str = 'd__a;p__b;c__c;o__d;f__f') print(lin1.lineage) + assert lin1.is_compatible(lin2) with pytest.raises(ValueError) as exc: lin1.is_lineage_match(lin2, 'NotARank') print(str(exc)) @@ -2582,11 +2778,333 @@ def test_make_kreport_results_fail_pre_v450(): assert "cannot produce 'kreport' format from gather results before sourmash v4.5.0" in str(exc) -def test_make_kreport_results_fail_pre_v450(): - taxD = make_mini_taxonomy([("gA", "a;b;c"), ("gB", "a;b;d")]) +def test_make_lingroup_results(): + taxD = make_mini_taxonomy([("gA", "1;0;0"), ("gB", "1;0;1"), ("gC", "1;1;0")], LIN=True) + print(taxD) + lingroupD = {"1":"lg1", "1;0":'lg2', '1;1': "lg3"} + print(lingroupD) + gather_results = [{"total_weighted_hashes":100}, + {"name": 'gB', "total_weighted_hashes":100}, + {"name": 'gC', "total_weighted_hashes":100}] + q_res = make_QueryTaxResults(gather_info=gather_results, taxD=taxD, single_query=True, summarize=True, LIN=True) + print(q_res.summarized_lineage_results) + + header, lgD = q_res.make_lingroup_results(LINgroupsD = lingroupD) + print(header) + assert header == ['name', 'lin', 'percent_containment', 'num_bp_contained'] + # order may change, just check that each lg entry is present in list of results + lg1 = {'percent_containment': '60.00', 'num_bp_contained': '60', + 'lin': '1', 'name': 'lg1'} + lg2 = {'percent_containment': '40.00', 'num_bp_contained': '40', + 'lin': '1;0', 'name': 'lg2'} + lg3 = {'percent_containment': '20.00', 'num_bp_contained': '20', + 'lin': '1;1', 'name': 'lg3'} + assert lg1 in lgD + assert lg2 in lgD + assert lg3 in lgD + + +def test_make_lingroup_results_fail_pre_v450(): + taxD = make_mini_taxonomy([("gA", "1;0;0"), ("gB", "1;0;1"), ("gC", "1;1;0")], LIN=True) gather_results = [{}, {"name": 'gB'}] - q_res = make_QueryTaxResults(gather_info=gather_results, taxD=taxD, single_query=True, summarize=True) + q_res = make_QueryTaxResults(gather_info=gather_results, taxD=taxD, single_query=True, summarize=True, LIN=True) + lingroupD = {"1":"lg1", "1;0":'lg2', '1;1': "lg3"} with pytest.raises(ValueError) as exc: - q_res.make_kreport_results() + q_res.make_lingroup_results(lingroupD) 
print(str(exc)) - assert "cannot produce 'kreport' format from gather results before sourmash v4.5.0" in str(exc) + assert "cannot produce 'LINgroup_report' format from gather results before sourmash v4.5.0" in str(exc) + + +def test_read_lingroups(runtmp): + lg_file = runtmp.output("test.lg.csv") + with open(lg_file, 'w') as out: + out.write('lin,name\n') + out.write('1,lg1\n') + out.write('1;0,lg2\n') + out.write('1;1,lg3\n') + lgD = read_lingroups(lg_file) + + assert lgD == {"1":"lg1", "1;0":'lg2', '1;1': "lg3"} + +def test_read_lingroups_empty_file(runtmp): + lg_file = runtmp.output("test.lg.csv") + with open(lg_file, 'w') as out: + out.write("") + with pytest.raises(ValueError) as exc: + read_lingroups(lg_file) + print(str(exc)) + assert f"Cannot read lingroups from '{lg_file}'. Is file empty?" in str(exc) + + +def test_read_lingroups_only_header(runtmp): + lg_file = runtmp.output("test.lg.csv") + with open(lg_file, 'w') as out: + out.write('lin,name\n') + with pytest.raises(ValueError) as exc: + read_lingroups(lg_file) + print(str(exc)) + assert f"No lingroups loaded from {lg_file}" in str(exc) + + +def test_read_lingroups_bad_header(runtmp): + lg_file = runtmp.output("test.lg.csv") + with open(lg_file, 'w') as out: + out.write('LINgroup_pfx,LINgroup_nm\n') + with pytest.raises(ValueError) as exc: + read_lingroups(lg_file) + print(str(exc)) + assert f"'{lg_file}' must contain the following columns: 'name', 'lin'." in str(exc) + + +def test_LineageTree_init(): + x = "a;b" + lin1 = RankLineageInfo(lineage_str=x) + print(lin1) + tree = LineageTree([lin1]) + assert tree.tree == { LineagePair('superkingdom', 'a'): + { LineagePair('phylum', 'b') : {}} } + +def test_LineageTree_init_mult(): + x = "a;b" + y = "a;c" + lin1 = RankLineageInfo(lineage_str=x) + lin2 = RankLineageInfo(lineage_str=y) + print(lin1) + from sourmash.tax.tax_utils import LineageTree + tree = LineageTree([lin1, lin2]) + assert tree.tree == {LineagePair(rank='superkingdom', name='a', taxid=None): + {LineagePair(rank='phylum', name='b', taxid=None): {}, + LineagePair(rank='phylum', name='c', taxid=None): {}}} + + +def test_LineageTree_init_and_add_lineage(): + x = "a;b" + y = "a;c" + lin1 = RankLineageInfo(lineage_str=x) + lin2 = RankLineageInfo(lineage_str=y) + print(lin1) + from sourmash.tax.tax_utils import LineageTree + tree = LineageTree([lin1]) + assert tree.tree == { LineagePair('superkingdom', 'a'): + { LineagePair('phylum', 'b') : {}} } + tree.add_lineage(lin2) + assert tree.tree == {LineagePair(rank='superkingdom', name='a', taxid=None): + {LineagePair(rank='phylum', name='b', taxid=None): {}, + LineagePair(rank='phylum', name='c', taxid=None): {}}} + + +def test_LineageTree_init_and_add_lineages(): + x = "a;b" + y = "a;c" + lin1 = RankLineageInfo(lineage_str=x) + lin2 = RankLineageInfo(lineage_str=y) + print(lin1) + from sourmash.tax.tax_utils import LineageTree + tree = LineageTree([lin1]) + assert tree.tree == { LineagePair('superkingdom', 'a'): + { LineagePair('phylum', 'b') : {}} } + tree.add_lineages([lin2]) + assert tree.tree == {LineagePair(rank='superkingdom', name='a', taxid=None): + {LineagePair(rank='phylum', name='b', taxid=None): {}, + LineagePair(rank='phylum', name='c', taxid=None): {}}} + + +def test_build_tree_RankLineageInfo(): + x = "a;b" + lin1 = RankLineageInfo(lineage_str=x) + print(lin1) + tree = LineageTree([lin1]) + assert tree.tree == { LineagePair('superkingdom', 'a'): + { LineagePair('phylum', 'b') : {}} } + + +def test_build_tree_LINLineageInfo(): + x = "0;3" + lin1 = 
LINLineageInfo(lineage_str=x) + print(lin1) + tree = LineageTree([lin1]) + assert tree.tree == { LineagePair('0', '0'): + { LineagePair('1', '3') : {}} } + + +def test_build_tree_2(): + x = "a;b" + y = "a;c" + lin1 = RankLineageInfo(lineage_str=x) + lin2 = RankLineageInfo(lineage_str=y) + print(lin1) + print(lin2) + tree = LineageTree([lin1,lin2]) + + assert tree.tree == { LineagePair('superkingdom', 'a'): { LineagePair('phylum', 'b') : {}, + LineagePair('phylum', 'c') : {}} } + + +def test_build_tree_2_LineagePairs(): + # build tree from LineagePairs + tree = LineageTree([[LineagePair('superkingdom', 'a'), LineagePair('phylum', 'b')], + [LineagePair('superkingdom', 'a'), LineagePair('phylum', 'c')], + ]) + + assert tree.tree == { LineagePair('superkingdom', 'a'): { LineagePair('phylum', 'b') : {}, + LineagePair('phylum', 'c') : {}} } + + +def test_build_tree_3(): + # empty phylum name + x='a;' + lin1 = RankLineageInfo(lineage_str=x) + tree = LineageTree([lin1]) + assert tree.tree == { LineagePair('superkingdom', 'a'): {} } + + +def test_build_tree_3_LineagePairs(): + # empty phylum name: LineagePair input + lin1 = (LineagePair('superkingdom', "a", '3'), + LineagePair('phylum', '', ''),) + tree = LineageTree([lin1]) + assert tree.tree == { LineagePair('superkingdom', 'a', '3'): {} } + + +def test_build_tree_5(): + with pytest.raises(ValueError): + tree = LineageTree([]) + + +def test_build_tree_5b(): + with pytest.raises(ValueError): + tree = LineageTree("") + + +def test_build_tree_iterable(): + with pytest.raises(ValueError) as exc: + tree = LineageTree(RankLineageInfo()) + assert "Must pass in an iterable containing LineagePair or LineageInfo objects" in str(exc) + + +def test_find_lca(): + x='a;b' + lin1 = RankLineageInfo(lineage_str=x) + tree = LineageTree([lin1]) + lca = tree.find_lca() + + assert lca == ((LineagePair('superkingdom', 'a'), LineagePair('phylum', 'b'),), 0) + + +def test_find_lca_LineagePairs(): + tree = LineageTree([[LineagePair('rank1', 'name1'), LineagePair('rank2', 'name2')]]) + lca = tree.find_lca() + + assert lca == ((LineagePair('rank1', 'name1'), LineagePair('rank2', 'name2'),), 0) + + +def test_find_lca_2(): + x = "a;b" + y = "a;c" + lin1 = RankLineageInfo(lineage_str=x) + lin2 = RankLineageInfo(lineage_str=y) + + tree = LineageTree([lin1, lin2]) + lca = tree.find_lca() + + assert lca == ((LineagePair('superkingdom', 'a'),), 2) + + +def test_find_lca_LIN(): + x = "5;6" + y = "5;10" + lin1 = LINLineageInfo(lineage_str=x) + lin2 = LINLineageInfo(lineage_str=y) + + tree = LineageTree([lin1, lin2]) + lca = tree.find_lca() + + assert lca == ((LineagePair('0', '5'),), 2) + print(lca) + + +def test_find_lca_2_LineagePairs(): + tree = LineageTree([[LineagePair('rank1', 'name1'), LineagePair('rank2', 'name2a')], + [LineagePair('rank1', 'name1'), LineagePair('rank2', 'name2b')], + ]) + lca = tree.find_lca() + + assert lca == ((LineagePair('rank1', 'name1'),), 2) + + +def test_find_lca_3(): + lin1 = RankLineageInfo(lineage_str="a;b;c") + lin2 = RankLineageInfo(lineage_str="a;b") + + tree = LineageTree([lin1, lin2]) + lca, reason = tree.find_lca() + assert lca == lin1.filled_lineage # find most specific leaf node + print(lca) + + +def test_build_tree_with_initial(): + x = "a;b;c" + y = "a;b;d" + z = "a;e" + lin1 = RankLineageInfo(lineage_str=x) + lin2 = RankLineageInfo(lineage_str=y) + lin3 = RankLineageInfo(lineage_str=z) + + tree = LineageTree([lin1, lin2]) + lca = tree.find_lca() + + print(lca) + assert lca == ((LineagePair(rank='superkingdom', name='a', taxid=None), + 
LineagePair(rank='phylum', name='b', taxid=None)), 2) + tree.add_lineages([lin3]) + lca2 = tree.find_lca() + print(lca2) + assert lca2 == ((LineagePair('superkingdom', 'a'),), 2) + + +def test_LineageTree_find_ordered_paths(): + x = "a;b;c" + y = "a;b;d" + z = "a;e" + lin1 = RankLineageInfo(lineage_str=x) + lin2 = RankLineageInfo(lineage_str=y) + lin3 = RankLineageInfo(lineage_str=z) + + tree = LineageTree([lin1, lin2, lin3]) + paths = tree.ordered_paths() + + print(paths) + assert paths == [(LineagePair(rank='superkingdom', name='a', taxid=None), + LineagePair(rank='phylum', name='e', taxid=None)), + (LineagePair(rank='superkingdom', name='a', taxid=None), + LineagePair(rank='phylum', name='b', taxid=None), + LineagePair(rank='class', name='c', taxid=None)), + (LineagePair(rank='superkingdom', name='a', taxid=None), + LineagePair(rank='phylum', name='b', taxid=None), + LineagePair(rank='class', name='d', taxid=None))] + + +def test_LineageTree_find_ordered_paths_include_internal(): + x = "a;b;c" + y = "a;b;d" + z = "a;e" + lin1 = RankLineageInfo(lineage_str=x) + lin2 = RankLineageInfo(lineage_str=y) + lin3 = RankLineageInfo(lineage_str=z) + + tree = LineageTree([lin1, lin2, lin3]) + paths = tree.ordered_paths(include_internal=True) + + print(paths) + + assert paths == [(LineagePair(rank='superkingdom', name='a', taxid=None),), + (LineagePair(rank='superkingdom', name='a', taxid=None), + LineagePair(rank='phylum', name='e', taxid=None)), + (LineagePair(rank='superkingdom', name='a', taxid=None), + LineagePair(rank='phylum', name='b', taxid=None)), + (LineagePair(rank='superkingdom', name='a', taxid=None), + LineagePair(rank='phylum', name='b', taxid=None), + LineagePair(rank='class', name='c', taxid=None)), + (LineagePair(rank='superkingdom', name='a', taxid=None), + LineagePair(rank='phylum', name='b', taxid=None), + LineagePair(rank='class', name='d', taxid=None))]
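Taken together, the `LineageTree` tests above exercise the path-ordering machinery that `make_lingroup_results` relies on for LIN prefixes. Here is a minimal sketch, assuming sourmash with this change installed; the expected values in the comments follow from the behaviour shown in the tests above, and the exact ordering of related subpaths may vary between runs.

```
from sourmash.tax.tax_utils import LINLineageInfo, LineageTree

lg1 = LINLineageInfo(lineage_str="1;0;0")
lg2 = LINLineageInfo(lineage_str="1;0;1")

tree = LineageTree([lg1, lg2])

# the LCA covers the shared prefix '1;0'; the second value is 2 because
# the two paths diverge into two children at that node
lca, n_children = tree.find_lca()

# with include_internal=True, the internal prefixes ('1' and '1;0') are
# reported alongside both full leaf paths, as tuples of LineagePair objects
paths = tree.ordered_paths(include_internal=True)
```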