download and search 66,000 GTDB genomes with a query genome #13

Open
ctb opened this issue May 12, 2022 · 0 comments
Labels: fasta (working with FASTA files), genome (analyzing genomes), gtdb-rs207 (examples using GTDB RS207), intro (introductory examples)

ctb commented May 12, 2022

You'll need to build the genome signature file in #11 first.

Then, download the GTDB genomic representatives database:

curl -JLO https://osf.io/3a6gn/download

This will create a 1.7 GB file, gtdb-rs207.genomic-reps.dna.k31.zip, which contains 66,000 genome sketches from the Genome Taxonomy Database, release 207.

Now search the genome against the GTDB database:

sourmash search GCF_000005845.2_ASM584v2_genomic.fna.gz.sig gtdb-rs207.genomic-reps.dna.k31.zip

This will take about 5 minutes.

The output will look like this:

8 matches; showing first 3:
similarity   match
----------   -----
 29.9%       GCF_003697165.2 Escherichia coli DSM 30083 = JCM 1649 = A...
 14.6%       GCF_002965065.1 Escherichia sp. MOD1-EC7003 strain=MOD1-E...
 14.2%       GCF_000026225.1 Escherichia fergusonii ATCC 35469 strain=...

showing that this genome is, indeed, an E. coli genome :).

The similarity in the left column is the Jaccard similarity: the number of k-mers shared by the query genome sketch and a database genome sketch, divided by the number of distinct k-mers found in either sketch.
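
To make the Jaccard calculation concrete, here is a toy sketch using four made-up 3-mers per genome. It is an illustration only: the `jaccard` helper and the k-mer files are hypothetical, and sourmash itself compares sketches of hashed 31-mers rather than raw strings like these.

```shell
#!/bin/sh
# Toy Jaccard similarity between two k-mer sets (made-up data, not real sketches).
set -eu

# jaccard FILE_A FILE_B: each file holds one k-mer per line.
jaccard() {
    a_sorted=$(mktemp)
    b_sorted=$(mktemp)
    sort -u "$1" > "$a_sorted"
    sort -u "$2" > "$b_sorted"
    # comm -12 keeps only the lines present in both sorted inputs (the intersection).
    shared=$(comm -12 "$a_sorted" "$b_sorted" | wc -l)
    # Merging both inputs with sort -u gives the distinct k-mers in either set (the union).
    total=$(sort -u "$a_sorted" "$b_sorted" | wc -l)
    rm -f "$a_sorted" "$b_sorted"
    awk -v s="$shared" -v t="$total" 'BEGIN { printf "%.3f\n", (t > 0 ? s / t : 0) }'
}

tmpdir=$(mktemp -d)
printf '%s\n' ACG CGT GTA TAC > "$tmpdir/query.kmers"
printf '%s\n' CGT GTA TAG AGG > "$tmpdir/match.kmers"

jaccard "$tmpdir/query.kmers" "$tmpdir/match.kmers"
rm -rf "$tmpdir"
```

The two toy sets share 2 k-mers (CGT, GTA) out of 6 distinct k-mers overall, so this prints 0.333, which would be reported as 33.3% in the similarity column.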

You can increase the number of output results with -n (for example, -n 10), which here shows all 8 matches:

8 matches:
similarity   match
----------   -----
 29.9%       GCF_003697165.2 Escherichia coli DSM 30083 = JCM 1649 = A...
 14.6%       GCF_002965065.1 Escherichia sp. MOD1-EC7003 strain=MOD1-E...
 14.2%       GCF_000026225.1 Escherichia fergusonii ATCC 35469 strain=...
 14.1%       GCF_902498915.1 Escherichia ruysiae, OPT1704
 14.1%       GCF_004211955.1 Escherichia sp. E1V33 strain=E1V33, ASM42...
 13.5%       GCF_005843885.1 Escherichia sp. E4742 strain=E4742, ASM58...
 10.3%       GCF_001660175.1 Escherichia sp. B1147 strain=B1147, ASM16...
 10.1%       GCF_011881725.1 Escherichia coli strain=SCPM-O-B-8794, AS...

and you can record the results in a CSV file with -o <output.csv>.
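
A minimal sketch combining both options. The -n and -o flags come from the text above; the value 10, the output name results.csv, the `search_top` helper, and the `SOURMASH` variable are hypothetical choices for illustration. The search only runs if both input files are present in the current directory.

```shell
#!/bin/sh
# Sketch: rerun the search, printing up to 10 matches and saving them as CSV.
# search_top, SOURMASH, and results.csv are illustrative names, not part of sourmash.
set -eu

SOURMASH=${SOURMASH:-sourmash}   # override to point at a different executable

# search_top QUERY_SIG DATABASE NUM_RESULTS OUTPUT_CSV
search_top() {
    "$SOURMASH" search "$1" "$2" -n "$3" -o "$4"
}

query=GCF_000005845.2_ASM584v2_genomic.fna.gz.sig
db=gtdb-rs207.genomic-reps.dna.k31.zip

if [ -f "$query" ] && [ -f "$db" ]; then
    search_top "$query" "$db" 10 results.csv
else
    echo "missing $query or $db; build or download them first" >&2
fi
```

Wrapping the command in a small function keeps the query, database, result count, and output file in one place if you want to repeat the search with other genomes.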

@ctb changed the title from "downloading and searching the prepared GTDB genomic representatives database with a genome" to "download and search 66,000 GTDB genomes with a query genome" on May 12, 2022
@ctb added the intro, fasta, gtdb-rs207, and genome labels on May 12, 2022