analyze a metagenome using 66,000 GTDB genomic representatives #14

ctb · 2022-05-12T13:54:07Z

This example uses the metagenome signature prepared in #12.

You'll also need to download the GTDB database as in #13.

Now, run sourmash gather:

sourmash gather SRR5950647.sig gtdb-rs207.genomic-reps.dna.k31.zip

This should take about 5 minutes.

The output should look like this:

overlap     p_query p_match avg_abund
---------   ------- ------- ---------
383.0 kbp      2.0%    7.8%       1.6    GCF_003697165.2 Escherichia coli DSM ...
187.0 kbp      1.0%    5.0%       1.6    GCF_015074785.1 Prevotella copri stra...
142.0 kbp      0.7%    2.8%       1.4    GCF_000012825.1 Bacteroides vulgatus ...
164.0 kbp      0.3%    1.4%       1.7    GCF_019127135.1 Prevotella copri stra...
found less than 50.0 kbp in common. => exiting

found 4 matches total;
the recovered matches hit 4.0% of the abundance-weighted query

This a minimum metagenome cover for the metagenome, based on the genomes in the GTDB database: in brief, it provides a shortest list of genomes that contain all of the known content in the metagenome (in this case, about 4%).

Note: more of the metagenome might be matched if you used a larger database or a database that included eukaryotic and/or host sequence.

The text was updated successfully, but these errors were encountered:

sapuizait · 2024-07-01T12:47:51Z

That is pretty cool, i tried it on a metagenome (very impressed with the speed btw) but i have a couple of questions...

So, based on the output below, I assume that it identified 1030126 signatures? Then out of them only 65703 were retained... Why is that? It seems awful few as 65k out of a million signatures is almost 6% ? Then when comparing to the k31 DB, 2581 gave matches, thus I assume the rest is unclassified
I am wondering how representative that result is?

I made the signature file by combining the two fastq files (like you showed in the previous thread)
and compared the .sig file to the k31.lca.json.gz DB

#output

select query k=31 automatically.
loaded query: 1030126... (k=31, DNA)

loaded 65703 total signatures from 1 locations.
after selecting signatures compatible with search, 65703 remain.

Starting prefetch sweep across databases.

Found 2581 signatures via prefetch; now doing gather.

ctb · 2024-07-01T12:53:40Z

Hi @sapuizait a few quick notes -

the 1030126... is the name of the sketch. You can change this yourself with sourmash sig rename <sigfile> <newname> -o <newsigfile>
65,703 signatures could be searched, because they had been built with compatible parameters (e.g. k=31, DNA, and so on)
of the 65,703 that could be searched, 2,581 had substantial overlap (> 50kb, by default - see --threshold-bp) with the metagenome. This is completely dependent on the metagenome - most genomes won't be in any given metagenome.
the percentage of the metagenome that is unclassified is given at the end of the 'gather' output. This (loosely) corresponds to fraction of reads that will not map to any genome in the reference database; see FAQ entry on mapping.

HTH!

sapuizait · 2024-07-01T12:58:18Z

uggghhhhh - oh boy you are right I m such an idiot - I gave a random number for a name and I forgot about it.... sorry about that.
OK then, next question: I see how the 50kb is a reasonable overlap BUT if you were willing to sacrifice some accuracy, how low would you go? 20kb?
Thanks!

ctb · 2024-07-01T13:08:47Z

no worries ;).

in re threshold, it's entirely up to you! See discussion here: sourmash-bio/sourmash#2360 (comment)

Note that we now have much faster multithreaded gather available, too; see benchmarks.

sapuizait · 2024-07-01T13:12:02Z

Excellent - thanks!

ctb changed the title ~~searching the prepared GTDB genomic representatives database with a metagenome~~ analyze a metagenome using 66,000 GTDB genomic representatives May 12, 2022

ctb added intro introductory examples fastq working with FASTQ files metagenome analyzing metagenomes gtdb-rs207 examples using GTDB RS207 labels May 12, 2022

ctb mentioned this issue May 14, 2022

estimate how much of a metagenome sample can be mapped to reference genomes #15

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

analyze a metagenome using 66,000 GTDB genomic representatives #14

analyze a metagenome using 66,000 GTDB genomic representatives #14

ctb commented May 12, 2022 •

edited

Loading

sapuizait commented Jul 1, 2024

ctb commented Jul 1, 2024

sapuizait commented Jul 1, 2024

ctb commented Jul 1, 2024

sapuizait commented Jul 1, 2024

analyze a metagenome using 66,000 GTDB genomic representatives #14

analyze a metagenome using 66,000 GTDB genomic representatives #14

Comments

ctb commented May 12, 2022 • edited Loading

sapuizait commented Jul 1, 2024

select query k=31 automatically. loaded query: 1030126... (k=31, DNA)

ctb commented Jul 1, 2024

sapuizait commented Jul 1, 2024

ctb commented Jul 1, 2024

sapuizait commented Jul 1, 2024

ctb commented May 12, 2022 •

edited

Loading

select query k=31 automatically.
loaded query: 1030126... (k=31, DNA)