Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

analyze a metagenome using 66,000 GTDB genomic representatives #14

Open
ctb opened this issue May 12, 2022 · 5 comments
Open

analyze a metagenome using 66,000 GTDB genomic representatives #14

ctb opened this issue May 12, 2022 · 5 comments
Labels
fastq working with FASTQ files gtdb-rs207 examples using GTDB RS207 intro introductory examples metagenome analyzing metagenomes

Comments

@ctb
Copy link
Contributor

ctb commented May 12, 2022

This example uses the metagenome signature prepared in #12.

You'll also need to download the GTDB database as in #13.

Now, run sourmash gather:

sourmash gather SRR5950647.sig gtdb-rs207.genomic-reps.dna.k31.zip

This should take about 5 minutes.

The output should look like this:

overlap     p_query p_match avg_abund
---------   ------- ------- ---------
383.0 kbp      2.0%    7.8%       1.6    GCF_003697165.2 Escherichia coli DSM ...
187.0 kbp      1.0%    5.0%       1.6    GCF_015074785.1 Prevotella copri stra...
142.0 kbp      0.7%    2.8%       1.4    GCF_000012825.1 Bacteroides vulgatus ...
164.0 kbp      0.3%    1.4%       1.7    GCF_019127135.1 Prevotella copri stra...
found less than 50.0 kbp in common. => exiting

found 4 matches total;
the recovered matches hit 4.0% of the abundance-weighted query

This a minimum metagenome cover for the metagenome, based on the genomes in the GTDB database: in brief, it provides a shortest list of genomes that contain all of the known content in the metagenome (in this case, about 4%).

Note: more of the metagenome might be matched if you used a larger database or a database that included eukaryotic and/or host sequence.

@ctb ctb changed the title searching the prepared GTDB genomic representatives database with a metagenome analyze a metagenome using 66,000 GTDB genomic representatives May 12, 2022
@ctb ctb added intro introductory examples fastq working with FASTQ files metagenome analyzing metagenomes gtdb-rs207 examples using GTDB RS207 labels May 12, 2022
@sapuizait
Copy link

That is pretty cool, i tried it on a metagenome (very impressed with the speed btw) but i have a couple of questions...

So, based on the output below, I assume that it identified 1030126 signatures? Then out of them only 65703 were retained... Why is that? It seems awful few as 65k out of a million signatures is almost 6% ? Then when comparing to the k31 DB, 2581 gave matches, thus I assume the rest is unclassified
I am wondering how representative that result is?

I made the signature file by combining the two fastq files (like you showed in the previous thread)
and compared the .sig file to the k31.lca.json.gz DB

#output

select query k=31 automatically.
loaded query: 1030126... (k=31, DNA)

loaded 65703 total signatures from 1 locations.
after selecting signatures compatible with search, 65703 remain.

Starting prefetch sweep across databases.

Found 2581 signatures via prefetch; now doing gather.

@ctb
Copy link
Contributor Author

ctb commented Jul 1, 2024

Hi @sapuizait a few quick notes -

  • the 1030126... is the name of the sketch. You can change this yourself with sourmash sig rename <sigfile> <newname> -o <newsigfile>
  • 65,703 signatures could be searched, because they had been built with compatible parameters (e.g. k=31, DNA, and so on)
  • of the 65,703 that could be searched, 2,581 had substantial overlap (> 50kb, by default - see --threshold-bp) with the metagenome. This is completely dependent on the metagenome - most genomes won't be in any given metagenome.
  • the percentage of the metagenome that is unclassified is given at the end of the 'gather' output. This (loosely) corresponds to fraction of reads that will not map to any genome in the reference database; see FAQ entry on mapping.

HTH!

@sapuizait
Copy link

uggghhhhh - oh boy you are right I m such an idiot - I gave a random number for a name and I forgot about it.... sorry about that.
OK then, next question: I see how the 50kb is a reasonable overlap BUT if you were willing to sacrifice some accuracy, how low would you go? 20kb?
Thanks!

@ctb
Copy link
Contributor Author

ctb commented Jul 1, 2024

no worries ;).

in re threshold, it's entirely up to you! See discussion here: sourmash-bio/sourmash#2360 (comment)

Note that we now have much faster multithreaded gather available, too; see benchmarks.

@sapuizait
Copy link

Excellent - thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
fastq working with FASTQ files gtdb-rs207 examples using GTDB RS207 intro introductory examples metagenome analyzing metagenomes
Projects
None yet
Development

No branches or pull requests

2 participants