Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

using fastmultigather to do contig-level gather and taxonomy assignment - a brief tutorial #3095

Open
ctb opened this issue Mar 22, 2024 · 17 comments
Labels
doc documentation content or issues faq things to add to an FAQ or docs

Comments

@ctb
Copy link
Contributor

ctb commented Mar 22, 2024

There's quite a bit of interest #2816 #3070 #3089 in using doing contig-level/long-read gather and (maybe) taxonomy assignment for contigs/long reads. Here's a short example that uses fastmultigather to do this.

The example uses contigs from genomes and searches them against the genomes themselves. This is just as an example! You can totally replace podar-ref-singletons.zip with your own queries that would be from different genomes.

A few notes -

  • fastmultigather is part of the branchwater plugin that can be installed with conda. See docs for fastmultigather specifically.
  • fastmultigather when used with a rocksdb database (built with sourmash scripts index, below) will generate a single output file with -o. Indexing will take some time for large databases but it will be worth it ;). (ref rocksdb docs)
  • *If you use fastgather or fastmultigather against a zipfile, you'll get multiple output files, but otherwise things will still work.

fastmultigather quickstart using small data sets + rocksdb

hackmd for editing: https://hackmd.io/ztM-7ZJoSYahMMPde7Q5vw?view

# make working dir
mkdir podar-ref-singleton
cd podar-ref-singleton

# download example data
curl -L https://osf.io/vbhy5/download -o podar-ref.tar.gz

# unpack
tar xzf podar-ref.tar.gz

# sketch twice - once with all contigs using --singleton, once combining each file
sourmash sketch dna --singleton *.fa -o podar-ref-singleton.zip
sourmash sketch dna --name-from-first *.fa -o podar-ref-genomes.zip

# index database so that fastmultigather can produce all gather columns
# this will take a while if you do it for large databases!
sourmash scripts index podar-ref-genomes.zip -o podar-ref.rocksdb

# run fastmultigather
sourmash scripts fastmultigather podar-ref-singleton.zip podar-ref.rocksdb -o gather.csv

# all your gather results will be in gather.csv

# grab lineage file
curl -L https://osf.io/4yhjw/download -o podar-ref.tax.csv

sourmash tax genome -g gather.csv -t podar-ref.tax.csv -F human -o out

# all results will be in out.human.txt

Related issues:

@yuzie0314
Copy link

Hi @ctb,
I have spent some time reviewing your tutorial and tested several times, but it failed in the final step, namely the sourmash tax genome -g gather.csv -t podar-ref.tax.csv -F human -o out step.

Good news is that fastmultigather does the magic to speed up the gather step which is fantastic ! And I am sure that the results are what we want, we saw the query_name is the contig names within a genome (query) and the match_name is the reference genome name.

The error message as follow:

singularity exec -B pwd -B /fsx /fsx/singularity/branchwater.0.8.5.sif bash test.sh 

== This is sourmash version 4.8.5. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

Exiting.
ERROR: 'gather.csv' is missing columns needed for taxonomic summarization. Please run gather with sourmash >= 4.4.

PS the content of test.sh is:
sourmash tax genome -g gather.csv -t podar-ref.tax.csv -F human -o out

the expected output is out.human.txt containing taxonomic information.

螢幕擷取畫面 2024-05-13 112309

Thank you in advance,
Yuzie

@ctb
Copy link
Contributor Author

ctb commented May 14, 2024

ok, took me a second 😆 😭

and apologies for the complicated answer. This should be resolved in the next few weeks... but for now... it's a bit of a mess.

Question: are you using a rocksdb index? The current release of the plugin, v0.9.3, only supports full gather output when using fastmultigather against a rocksdb index.

This will be updated in the next release, since sourmash-bio/sourmash_plugin_branchwater#298 was merged!

However, the bad news is that testing has since revealed that fastmultigather against a rocksdb has a bug in it where it returns incomplete results; see sourmash-bio/sourmash_plugin_branchwater#322. (The good news, such as it is, is that the results are accurate when using fastgather/fastmultigather NOT against a rocksdb index...)

SO, for now, the solution is: use fastgather or fastmultigather WITHOUT a rocksdb index, and then run sourmash gather using a picklist, per https://github.com/sourmash-bio/sourmash_plugin_branchwater/blob/main/doc/README.md#using-fastgather-to-create-a-picklist-for-sourmash-gather.

I'll update you here when we have fixed the problems and released a new version. Apologies, things got tricky with all our different efforts to speed things up!

Related issue:

@ctb
Copy link
Contributor Author

ctb commented Jun 20, 2024

note: as of sourmash_plugin_branchwater v0.9.5 link, the results from fastgather and fastmultigather are now identical to those from sourmash gather. So a lot of the caveats above are now moot! I'll update the code above when I get a chance.

@mpgriesh
Copy link

mpgriesh commented Jun 27, 2024

Hi all - thanks for this wonderful tool! I have been using the most recent plugin distribution from conda-forge, but it still seems to have a similar error as above and additionally discussed in the plugin issue 330
image

sbatch run_sourmash_gather.sh /mpgriesh/data/sourmash/genomes_k51.zip /mpgriesh/data/sourmash/gather_genomes_k51.csv # produced expected output

(branchwater) sourmash tax genome -g gather_genomes_k51.csv -t /data/databases/sourmash/gtdb-rs214-k51/gtdb-rs214.lineages.csv -F human -o genomes-k51

== This is sourmash version 4.8.9. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

Exiting.
ERROR: Error: Cannot add TaxResult: query information does not match.

I'll add that I tried the tutorial listed above and it completed as expected.

@ctb
Copy link
Contributor Author

ctb commented Jul 12, 2024

hi @mpgriesh sorry for taking so long to get back to you!

I tried reproducing the error and couldn't - I'm curious what command you're running in run_sourmash_gather.sh. thanks!

And this is also a good reminder to update this issue - I'll do that tomorrow!

@mpgriesh
Copy link

mpgriesh commented Jul 12, 2024

No problem at all @ctb! Thanks again for such a thoughtful and useful suite of tools. I'm really excited about the functionality here

My apologies for not clarifying that command before.

I first ran:

(sourmash) sourmash sketch dna --name-from-first inputs/*.fa -p k=51 -o genomes_k51.zip
(sourmash) sourmash scripts index data/databases/sourmash/gtdb-rs214-k51/gtdb-rs214-k51.zip -o gtdb-rs214-k51.rocksdb

And then the command run by run_sourmash_gather.sh:

(branchwater) sourmash scripts fastmultigather /mpgriesh/data/sourmash/genomes_k51.zip gtdb-rs214-k51.rocksdb/ -o /mpgriesh/data/sourmash/gather_genomes_k51.csv -k 51 -m DNA -c 16

@ctb
Copy link
Contributor Author

ctb commented Jul 12, 2024

thanks! and hmmmm... is it possible that you have several queries with the same name or identifier?

@mpgriesh
Copy link

mpgriesh commented Jul 13, 2024

The file names are all SRA experiment accessions and are unique.

Here is the fastmultigather output:

intersect_bp,f_orig_query,f_match,f_unique_to_query,f_unique_weighted,average_abund,median_abund,std_abund,match_filename,match_name,match_md5,f_match_orig,unique_intersect_bp,gather_result_rank,remaining_bp,query_filename,query_name,query_md5,query_bp,ksize,moltype,scaled,query_n_hashes,query_abundance,query_containment_ani,match_containment_ani,average_containment_ani,max_containment_ani,n_unique_weighted_found,sum_weighted_found,total_weighted_hashes
1302000,0.20758928571428573,0.2412004446091145,0.20758928571428573,0.20758928571428573,1.0,1.0,0.0,/dev/fd/63,"GCF_004346335.1 Pseudomonas sp. LP_4_YM strain=LP_4_YM, ASM434633v1",090832325bf2a9c6731dac3d1afbbbb0,0.2412004446091145,1302000,0,4970000,inputs/DRX122001.fa,contig_1,9d7bc6917ff3ecf330001b3907dafc44,6272000,51,DNA,1000,6272,false,0.9696429886016547,0.9725003509992804,0.9710716698004676,0.9725003509992804,0,0,6272

This appears correct to me. I have two separate conda environments - sourmash and the branchwater_plugin. Is it possible the versions are incompatible? I used the sourmash distribution for sketch and index.

@ctb
Copy link
Contributor Author

ctb commented Jul 13, 2024

note: as of sourmash_plugin_branchwater v0.9.5 link, the results from fastgather and fastmultigather are now identical to those from sourmash gather. So a lot of the caveats above are now moot! I'll update the code above when I get a chance.

On re-reading the example, not much needed to be changed - I just updated the text a bit.

@ctb
Copy link
Contributor Author

ctb commented Jul 13, 2024

@mpgriesh I put your CSV in a file and ran:

% sourmash tax genome -g fmg-bug.csv -t gtdb-rs214.lineages.sqldb -F human

== This is sourmash version 4.8.11.dev0. ==
== Please cite Irber et. al (2024), doi:10.21105/joss.06830. ==

loaded 1 gather results from 'fmg-bug.csv'.
loaded results for 1 queries from 1 gather CSVs
classified 1/1 queries (100.00%). Writing results
sample name    status    proportion   cANI   lineage
-----------    ------    ----------   ----   -------
contig_1          match    20.8%     97.0%  d__Bacteria;p__Pseudomonadota;c__Gammaproteobacteria;o__Pseudomonadales;f__Pseudomonadaceae;g__Pseudomonas_E;s__Pseudomonas_E fulva

so it seems to be working now 🤔 .

I must admit to some confusion in figuring out how and why this didn't work in your situation, because a dive into the related issues suggests that the only thing that we fixed was in the generation of the CSV - and the CSV that you copy/pasted works fine 😭 .

BUT... wait... I see contig_1 in the name there... maybe when you constructed the sketches, the sketch names were not set to be different? That might be the problem. Try seeing what sourmash sig describe says on the query sequences - the signature: line is what is reported as name in the CSV output.

Update: some of the behavior we addressed in multigather with #2065 might be similar - the challenge with using the same ident/etc. I wonder if we could change tax genome/tax metagenome to care more about the md5sum when collecting queries together? Will ruminate.

@ChillarAnand
Copy link

How to get index script/plugin?

❯ sourmash scripts

== This is sourmash version 4.8.11. ==
== Please cite Irber et. al (2024), doi:10.21105/joss.06830. ==

options:
  -h, --help  show this help message and exit

available plugin/extension commands:
  sourmash scripts commonhash       - filter hashes by min occurrence across sketches
  sourmash scripts sketchall        - sketch many genomes efficiently using multiple threads.
  sourmash scripts venn             - create and write out a pairwise or three-way Venn set overlap diagram.
  sourmash scripts abundhist        - calculate abundance profiles from one or more abund sketches
  sourmash scripts get-genomes      - download one or more NCBI genomes + associated taxonomic info

This command is failing.

$ sourmash scripts index podar-ref-genomes.zip -o podar-ref.rocksdb

@ctb
Copy link
Contributor Author

ctb commented Aug 5, 2024

How to get index script/plugin?

hi! You'll need to install the branchwater plugin - the easiest is via conda,

conda install sourmash_plugin_branchwater

let me know if that work (or doesn't)!

@ChillarAnand
Copy link

ChillarAnand commented Aug 5, 2024

Using conda worked. Thanks for the quick reply.

@ChillarAnand
Copy link

Based on docs, for taxonomic classification when I ran this command, it is not able to find any matches

#! /bin/bash

set -x 

# one time setup
wget -c https://farm.cse.ucdavis.edu/\~ctbrown/sourmash-db/gtdb-rs214/gtdb-rs214.lineages.csv.gz
wget -c https://farm.cse.ucdavis.edu/\~ctbrown/sourmash-db/gtdb-rs214/gtdb-rs214-reps.k31.zip
sourmash tax prepare -t gtdb-rs214.lineages.csv.gz -o gtdb-rs214.taxonomy.sqldb -F sql


# taxonomic classification

sourmash sketch dna -p k=21,k=31,k=51,scaled=1000,abund -o input_sample.sig input_sample.fastq

# sourmash gather input_sample.sig gtdb-rs214-k31.zip -o input_sample_gather_gtdb_rs214_full.csv
sourmash gather input_sample.sig gtdb-rs214-reps.k31.zip -o input_sample_gather_gtdb_rs214_reps.csv

sourmash tax annotate -g input_sample_gather_gtdb_rs214_reps.csv -t gtdb-rs214.taxonomy.sqldb

Sample file: https://github.com/ChillarAnand/avilpage.com/tree/master/scripts/data

Starting prefetch sweep across databases.
Prefetch found 0 signatures with overlap >= 50.0 kbp.
Doing gather to generate minimum metagenome cover.
found less than 50.0 kbp in common. => exiting

No matches found for --threshold-bp at 50.0 kbp.

For the same sample, with kraken, this is the classification.

Screenshot 2024-08-06 at 04 30 25

From the above tutorial, I cannot understand how to process a single sample and do taxonomic classification.

@ctb
Copy link
Contributor Author

ctb commented Aug 12, 2024

From the above tutorial, I cannot understand how to process a single sample and do taxonomic classification.

it looks like you did everything right!

Two things to try -

  • switch to using the complete GTDB database, not just the species representatives;
  • you could lower the gather threshold with --threshold-bp=5000;

@u-xixi
Copy link

u-xixi commented Aug 22, 2024

Hi I came across this thread when searching for a solution, and I just want to comment on @ChillarAnand 's case.
You suggested lowering the gather threshold to 5k, but I had a look at the query reads, they are < 2k in length. Shouldn't the threshold be lowered to 1k?
I am asking because I was trying to use manysketch and fastmultigather to classify contigs from metagenomic assemblies. I ran fastmultigather with default threshold, and then looked at the CSV produced. There was not a single record in the CSV where the query length is under 50k. I believe my reference database should be sufficient. It was genbank bacteria, archaea, viral, prozotoa, and fungi provided by sourmash. So I think the problem is the threshold being above the query length. I am running it now with --threshold-bp 1000 because 1k is the minimum length of my contigs.

@ctb
Copy link
Contributor Author

ctb commented Aug 26, 2024

Hi I came across this thread when searching for a solution, and I just want to comment on @ChillarAnand 's case. You suggested lowering the gather threshold to 5k, but I had a look at the query reads, they are < 2k in length. Shouldn't the threshold be lowered to 1k? I am asking because I was trying to use manysketch and fastmultigather to classify contigs from metagenomic assemblies. I ran fastmultigather with default threshold, and then looked at the CSV produced. There was not a single record in the CSV where the query length is under 50k. I believe my reference database should be sufficient. It was genbank bacteria, archaea, viral, prozotoa, and fungi provided by sourmash. So I think the problem is the threshold being above the query length. I am running it now with --threshold-bp 1000 because 1k is the minimum length of my contigs.

Ugh, you are completely correct - sorry about that and THANK YOU for correcting me!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
doc documentation content or issues faq things to add to an FAQ or docs
Projects
None yet
Development

No branches or pull requests

5 participants