using `fastmultigather` to do contig-level gather and taxonomy assignment - a brief tutorial #3095

ctb · 2024-03-22T14:07:29Z

There's quite a bit of interest #2816 #3070 #3089 in using doing contig-level/long-read gather and (maybe) taxonomy assignment for contigs/long reads. Here's a short example that uses fastmultigather to do this.

The example uses contigs from genomes and searches them against the genomes themselves. This is just as an example! You can totally replace podar-ref-singletons.zip with your own queries that would be from different genomes.

A few notes -

fastmultigather is part of the branchwater plugin that can be installed with conda. See docs for fastmultigather specifically.
fastmultigather when used with a rocksdb database (built with sourmash scripts index, below) will generate a single output file with -o. Indexing will take some time for large databases but it will be worth it ;). (ref rocksdb docs)
*If you use fastgather or fastmultigather against a zipfile, you'll get multiple output files, but otherwise things will still work.

`fastmultigather` quickstart using small data sets + rocksdb

hackmd for editing: https://hackmd.io/ztM-7ZJoSYahMMPde7Q5vw?view

# make working dir
mkdir podar-ref-singleton
cd podar-ref-singleton

# download example data
curl -L https://osf.io/vbhy5/download -o podar-ref.tar.gz

# unpack
tar xzf podar-ref.tar.gz

# sketch twice - once with all contigs using --singleton, once combining each file
sourmash sketch dna --singleton *.fa -o podar-ref-singleton.zip
sourmash sketch dna --name-from-first *.fa -o podar-ref-genomes.zip

# index database so that fastmultigather can produce all gather columns
# this will take a while if you do it for large databases!
sourmash scripts index podar-ref-genomes.zip -o podar-ref.rocksdb

# run fastmultigather
sourmash scripts fastmultigather podar-ref-singleton.zip podar-ref.rocksdb -o gather.csv

# all your gather results will be in gather.csv

# grab lineage file
curl -L https://osf.io/4yhjw/download -o podar-ref.tax.csv

sourmash tax genome -g gather.csv -t podar-ref.tax.csv -F human -o out

# all results will be in out.human.txt

Related issues:

The text was updated successfully, but these errors were encountered:

yuzie0314 · 2024-05-13T03:24:14Z

Hi @ctb,
I have spent some time reviewing your tutorial and tested several times, but it failed in the final step, namely the sourmash tax genome -g gather.csv -t podar-ref.tax.csv -F human -o out step.

Good news is that fastmultigather does the magic to speed up the gather step which is fantastic ! And I am sure that the results are what we want, we saw the query_name is the contig names within a genome (query) and the match_name is the reference genome name.

The error message as follow:

singularity exec -B pwd -B /fsx /fsx/singularity/branchwater.0.8.5.sif bash test.sh 

== This is sourmash version 4.8.5. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

Exiting.
ERROR: 'gather.csv' is missing columns needed for taxonomic summarization. Please run gather with sourmash >= 4.4.

PS the content of test.sh is:
sourmash tax genome -g gather.csv -t podar-ref.tax.csv -F human -o out

the expected output is out.human.txt containing taxonomic information.

Thank you in advance,
Yuzie

ctb · 2024-05-14T14:16:02Z

ok, took me a second 😆 😭

and apologies for the complicated answer. This should be resolved in the next few weeks... but for now... it's a bit of a mess.

Question: are you using a rocksdb index? The current release of the plugin, v0.9.3, only supports full gather output when using fastmultigather against a rocksdb index.

This will be updated in the next release, since sourmash-bio/sourmash_plugin_branchwater#298 was merged!

However, the bad news is that testing has since revealed that fastmultigather against a rocksdb has a bug in it where it returns incomplete results; see sourmash-bio/sourmash_plugin_branchwater#322. (The good news, such as it is, is that the results are accurate when using fastgather/fastmultigather NOT against a rocksdb index...)

SO, for now, the solution is: use fastgather or fastmultigather WITHOUT a rocksdb index, and then run sourmash gather using a picklist, per https://github.com/sourmash-bio/sourmash_plugin_branchwater/blob/main/doc/README.md#using-fastgather-to-create-a-picklist-for-sourmash-gather.

I'll update you here when we have fixed the problems and released a new version. Apologies, things got tricky with all our different efforts to speed things up!

Related issue:

full gather results from plugin are not compatible with sourmash tax sourmash_plugin_branchwater#330

ctb · 2024-06-20T13:49:31Z

note: as of sourmash_plugin_branchwater v0.9.5 link, the results from fastgather and fastmultigather are now identical to those from sourmash gather. So a lot of the caveats above are now moot! I'll update the code above when I get a chance.

mpgriesh · 2024-06-27T21:54:24Z

Hi all - thanks for this wonderful tool! I have been using the most recent plugin distribution from conda-forge, but it still seems to have a similar error as above and additionally discussed in the plugin issue 330

sbatch run_sourmash_gather.sh /mpgriesh/data/sourmash/genomes_k51.zip /mpgriesh/data/sourmash/gather_genomes_k51.csv # produced expected output

(branchwater) sourmash tax genome -g gather_genomes_k51.csv -t /data/databases/sourmash/gtdb-rs214-k51/gtdb-rs214.lineages.csv -F human -o genomes-k51

== This is sourmash version 4.8.9. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

Exiting.
ERROR: Error: Cannot add TaxResult: query information does not match.

I'll add that I tried the tutorial listed above and it completed as expected.

ctb · 2024-07-12T03:19:03Z

hi @mpgriesh sorry for taking so long to get back to you!

I tried reproducing the error and couldn't - I'm curious what command you're running in run_sourmash_gather.sh. thanks!

And this is also a good reminder to update this issue - I'll do that tomorrow!

mpgriesh · 2024-07-12T06:21:17Z

No problem at all @ctb! Thanks again for such a thoughtful and useful suite of tools. I'm really excited about the functionality here

My apologies for not clarifying that command before.

I first ran:

(sourmash) sourmash sketch dna --name-from-first inputs/*.fa -p k=51 -o genomes_k51.zip
(sourmash) sourmash scripts index data/databases/sourmash/gtdb-rs214-k51/gtdb-rs214-k51.zip -o gtdb-rs214-k51.rocksdb

And then the command run by run_sourmash_gather.sh:

(branchwater) sourmash scripts fastmultigather /mpgriesh/data/sourmash/genomes_k51.zip gtdb-rs214-k51.rocksdb/ -o /mpgriesh/data/sourmash/gather_genomes_k51.csv -k 51 -m DNA -c 16

ctb · 2024-07-12T12:29:52Z

thanks! and hmmmm... is it possible that you have several queries with the same name or identifier?

mpgriesh · 2024-07-13T01:39:31Z

The file names are all SRA experiment accessions and are unique.

Here is the fastmultigather output:

intersect_bp,f_orig_query,f_match,f_unique_to_query,f_unique_weighted,average_abund,median_abund,std_abund,match_filename,match_name,match_md5,f_match_orig,unique_intersect_bp,gather_result_rank,remaining_bp,query_filename,query_name,query_md5,query_bp,ksize,moltype,scaled,query_n_hashes,query_abundance,query_containment_ani,match_containment_ani,average_containment_ani,max_containment_ani,n_unique_weighted_found,sum_weighted_found,total_weighted_hashes
1302000,0.20758928571428573,0.2412004446091145,0.20758928571428573,0.20758928571428573,1.0,1.0,0.0,/dev/fd/63,"GCF_004346335.1 Pseudomonas sp. LP_4_YM strain=LP_4_YM, ASM434633v1",090832325bf2a9c6731dac3d1afbbbb0,0.2412004446091145,1302000,0,4970000,inputs/DRX122001.fa,contig_1,9d7bc6917ff3ecf330001b3907dafc44,6272000,51,DNA,1000,6272,false,0.9696429886016547,0.9725003509992804,0.9710716698004676,0.9725003509992804,0,0,6272

This appears correct to me. I have two separate conda environments - sourmash and the branchwater_plugin. Is it possible the versions are incompatible? I used the sourmash distribution for sketch and index.

ctb · 2024-07-13T16:41:38Z

note: as of sourmash_plugin_branchwater v0.9.5 link, the results from fastgather and fastmultigather are now identical to those from sourmash gather. So a lot of the caveats above are now moot! I'll update the code above when I get a chance.

On re-reading the example, not much needed to be changed - I just updated the text a bit.

ctb · 2024-07-13T16:48:33Z

@mpgriesh I put your CSV in a file and ran:

% sourmash tax genome -g fmg-bug.csv -t gtdb-rs214.lineages.sqldb -F human

== This is sourmash version 4.8.11.dev0. ==
== Please cite Irber et. al (2024), doi:10.21105/joss.06830. ==

loaded 1 gather results from 'fmg-bug.csv'.
loaded results for 1 queries from 1 gather CSVs
classified 1/1 queries (100.00%). Writing results
sample name    status    proportion   cANI   lineage
-----------    ------    ----------   ----   -------
contig_1          match    20.8%     97.0%  d__Bacteria;p__Pseudomonadota;c__Gammaproteobacteria;o__Pseudomonadales;f__Pseudomonadaceae;g__Pseudomonas_E;s__Pseudomonas_E fulva

so it seems to be working now 🤔 .

I must admit to some confusion in figuring out how and why this didn't work in your situation, because a dive into the related issues suggests that the only thing that we fixed was in the generation of the CSV - and the CSV that you copy/pasted works fine 😭 .

BUT... wait... I see contig_1 in the name there... maybe when you constructed the sketches, the sketch names were not set to be different? That might be the problem. Try seeing what sourmash sig describe says on the query sequences - the signature: line is what is reported as name in the CSV output.

Update: some of the behavior we addressed in multigather with #2065 might be similar - the challenge with using the same ident/etc. I wonder if we could change tax genome/tax metagenome to care more about the md5sum when collecting queries together? Will ruminate.

ChillarAnand · 2024-08-05T03:09:05Z

How to get index script/plugin?

❯ sourmash scripts

== This is sourmash version 4.8.11. ==
== Please cite Irber et. al (2024), doi:10.21105/joss.06830. ==

options:
  -h, --help  show this help message and exit

available plugin/extension commands:
  sourmash scripts commonhash       - filter hashes by min occurrence across sketches
  sourmash scripts sketchall        - sketch many genomes efficiently using multiple threads.
  sourmash scripts venn             - create and write out a pairwise or three-way Venn set overlap diagram.
  sourmash scripts abundhist        - calculate abundance profiles from one or more abund sketches
  sourmash scripts get-genomes      - download one or more NCBI genomes + associated taxonomic info

This command is failing.

$ sourmash scripts index podar-ref-genomes.zip -o podar-ref.rocksdb

ctb · 2024-08-05T09:50:51Z

How to get index script/plugin?

hi! You'll need to install the branchwater plugin - the easiest is via conda,

conda install sourmash_plugin_branchwater

let me know if that work (or doesn't)!

ChillarAnand · 2024-08-05T10:04:56Z

Using conda worked. Thanks for the quick reply.

ChillarAnand · 2024-08-05T23:07:41Z

Based on docs, for taxonomic classification when I ran this command, it is not able to find any matches

#! /bin/bash

set -x 

# one time setup
wget -c https://farm.cse.ucdavis.edu/\~ctbrown/sourmash-db/gtdb-rs214/gtdb-rs214.lineages.csv.gz
wget -c https://farm.cse.ucdavis.edu/\~ctbrown/sourmash-db/gtdb-rs214/gtdb-rs214-reps.k31.zip
sourmash tax prepare -t gtdb-rs214.lineages.csv.gz -o gtdb-rs214.taxonomy.sqldb -F sql


# taxonomic classification

sourmash sketch dna -p k=21,k=31,k=51,scaled=1000,abund -o input_sample.sig input_sample.fastq

# sourmash gather input_sample.sig gtdb-rs214-k31.zip -o input_sample_gather_gtdb_rs214_full.csv
sourmash gather input_sample.sig gtdb-rs214-reps.k31.zip -o input_sample_gather_gtdb_rs214_reps.csv

sourmash tax annotate -g input_sample_gather_gtdb_rs214_reps.csv -t gtdb-rs214.taxonomy.sqldb

Sample file: https://github.com/ChillarAnand/avilpage.com/tree/master/scripts/data

Starting prefetch sweep across databases.
Prefetch found 0 signatures with overlap >= 50.0 kbp.
Doing gather to generate minimum metagenome cover.
found less than 50.0 kbp in common. => exiting

No matches found for --threshold-bp at 50.0 kbp.

For the same sample, with kraken, this is the classification.

From the above tutorial, I cannot understand how to process a single sample and do taxonomic classification.

ctb · 2024-08-12T03:39:56Z

From the above tutorial, I cannot understand how to process a single sample and do taxonomic classification.

it looks like you did everything right!

Two things to try -

switch to using the complete GTDB database, not just the species representatives;
you could lower the gather threshold with --threshold-bp=5000;

u-xixi · 2024-08-22T21:44:21Z

Hi I came across this thread when searching for a solution, and I just want to comment on @ChillarAnand 's case.
You suggested lowering the gather threshold to 5k, but I had a look at the query reads, they are < 2k in length. Shouldn't the threshold be lowered to 1k?
I am asking because I was trying to use manysketch and fastmultigather to classify contigs from metagenomic assemblies. I ran fastmultigather with default threshold, and then looked at the CSV produced. There was not a single record in the CSV where the query length is under 50k. I believe my reference database should be sufficient. It was genbank bacteria, archaea, viral, prozotoa, and fungi provided by sourmash. So I think the problem is the threshold being above the query length. I am running it now with --threshold-bp 1000 because 1k is the minimum length of my contigs.

ctb · 2024-08-26T16:03:25Z

Hi I came across this thread when searching for a solution, and I just want to comment on @ChillarAnand 's case. You suggested lowering the gather threshold to 5k, but I had a look at the query reads, they are < 2k in length. Shouldn't the threshold be lowered to 1k? I am asking because I was trying to use manysketch and fastmultigather to classify contigs from metagenomic assemblies. I ran fastmultigather with default threshold, and then looked at the CSV produced. There was not a single record in the CSV where the query length is under 50k. I believe my reference database should be sufficient. It was genbank bacteria, archaea, viral, prozotoa, and fungi provided by sourmash. So I think the problem is the threshold being above the query length. I am running it now with --threshold-bp 1000 because 1k is the minimum length of my contigs.

Ugh, you are completely correct - sorry about that and THANK YOU for correcting me!

ctb added doc documentation content or issues faq things to add to an FAQ or docs labels Mar 22, 2024

ctb mentioned this issue Jul 11, 2024

sourmash fastgather -> tax sourmash-bio/sourmash_plugin_branchwater#386

Closed

ctb mentioned this issue Jul 14, 2024

do tax commands pay attention to md5sum instead of just name? can they? #3251

Open

spoonbender76 mentioned this issue Aug 15, 2024

Can sourmash be used to remove microbal contamination from long reads? #3291

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

using `fastmultigather` to do contig-level gather and taxonomy assignment - a brief tutorial #3095

using `fastmultigather` to do contig-level gather and taxonomy assignment - a brief tutorial #3095

ctb commented Mar 22, 2024 •

edited

Loading

yuzie0314 commented May 13, 2024

ctb commented May 14, 2024

ctb commented Jun 20, 2024

mpgriesh commented Jun 27, 2024 •

edited by ctb

Loading

ctb commented Jul 12, 2024

mpgriesh commented Jul 12, 2024 •

edited

Loading

ctb commented Jul 12, 2024

mpgriesh commented Jul 13, 2024 •

edited by ctb

Loading

ctb commented Jul 13, 2024

ctb commented Jul 13, 2024 •

edited

Loading

ChillarAnand commented Aug 5, 2024

ctb commented Aug 5, 2024

ChillarAnand commented Aug 5, 2024 •

edited

Loading

ChillarAnand commented Aug 5, 2024

ctb commented Aug 12, 2024

u-xixi commented Aug 22, 2024

ctb commented Aug 26, 2024

using fastmultigather to do contig-level gather and taxonomy assignment - a brief tutorial #3095

using fastmultigather to do contig-level gather and taxonomy assignment - a brief tutorial #3095

Comments

ctb commented Mar 22, 2024 • edited Loading

fastmultigather quickstart using small data sets + rocksdb

yuzie0314 commented May 13, 2024

ctb commented May 14, 2024

ctb commented Jun 20, 2024

mpgriesh commented Jun 27, 2024 • edited by ctb Loading

ctb commented Jul 12, 2024

mpgriesh commented Jul 12, 2024 • edited Loading

ctb commented Jul 12, 2024

mpgriesh commented Jul 13, 2024 • edited by ctb Loading

ctb commented Jul 13, 2024

ctb commented Jul 13, 2024 • edited Loading

ChillarAnand commented Aug 5, 2024

ctb commented Aug 5, 2024

ChillarAnand commented Aug 5, 2024 • edited Loading

ChillarAnand commented Aug 5, 2024

ctb commented Aug 12, 2024

u-xixi commented Aug 22, 2024

ctb commented Aug 26, 2024

using `fastmultigather` to do contig-level gather and taxonomy assignment - a brief tutorial #3095

using `fastmultigather` to do contig-level gather and taxonomy assignment - a brief tutorial #3095

ctb commented Mar 22, 2024 •

edited

Loading

`fastmultigather` quickstart using small data sets + rocksdb

mpgriesh commented Jun 27, 2024 •

edited by ctb

Loading

mpgriesh commented Jul 12, 2024 •

edited

Loading

mpgriesh commented Jul 13, 2024 •

edited by ctb

Loading

ctb commented Jul 13, 2024 •

edited

Loading

ChillarAnand commented Aug 5, 2024 •

edited

Loading