GTDB #19

Open
andrewjmc opened this issue Oct 11, 2021 · 1 comment

Comments

@andrewjmc

Hi,

I would love to remove human-contaminated sequences from the GTDB bacteria and archaea (r95) and the NCBI viruses and fungi, as classifications are being badly affected in some of my samples with a high proportion of human DNA. I already have a concatenated .faa file for Kraken and a seqid2taxid.map file. However, because the database is custom-built and incorporates GTDB, the taxids bear no relation to NCBI IDs. I also have names.dmp and nodes.dmp files.

Could I tweak conterminator to process this database? It is a 120 Gb sequence database. I can't find how much RAM is required, but naively assuming linear-time scaling, I would hope to process the database in under a day.

Best wishes,

Andrew

@martin-steinegger
Collaborator

The database module should allow you to download the GTDB database. It will build names.dmp and nodes.dmp based on the GTDB taxonomy.
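
For a custom database like the one described in the question, where the taxids do not match NCBI IDs, it may be worth checking that every taxid in seqid2taxid.map actually resolves in the accompanying nodes.dmp before running any taxonomy-aware tool. The Python sketch below is only an illustration and is not part of conterminator. It assumes nodes.dmp follows the NCBI dump layout (fields separated by `|`, with the taxid first and the parent taxid second) and that seqid2taxid.map has one tab-separated `seqid<TAB>taxid` pair per line. The function names are hypothetical.

```python
import io


def read_node_parents(handle):
    """Parse NCBI-style nodes.dmp lines into {taxid: parent_taxid}."""
    parents = {}
    for line in handle:
        # Fields look like "2\t|\t1\t|\tdomain\t|"; strip the padding tabs.
        fields = [field.strip() for field in line.split("|")]
        if len(fields) < 2 or not fields[0]:
            continue
        parents[fields[0]] = fields[1]
    return parents


def read_seqid_map(handle):
    """Parse tab-separated 'seqid<TAB>taxid' lines into {seqid: taxid}."""
    mapping = {}
    for line in handle:
        parts = line.rstrip("\n").split("\t")
        if len(parts) >= 2 and parts[0]:
            mapping[parts[0]] = parts[1].strip()
    return mapping


def unresolved_seqids(mapping, parents):
    """Return the sequence IDs whose taxid has no entry in nodes.dmp."""
    return sorted(seqid for seqid, taxid in mapping.items() if taxid not in parents)


# Tiny in-memory demo; real files would be opened with open(path).
demo_nodes = io.StringIO("1\t|\t1\t|\tno rank\t|\n2\t|\t1\t|\tdomain\t|\n")
demo_map = io.StringIO("seqA\t2\nseqB\t99\n")

demo_parents = read_node_parents(demo_nodes)
demo_mapping = read_seqid_map(demo_map)
print(unresolved_seqids(demo_mapping, demo_parents))
```

With real data, the two `io.StringIO` objects would be replaced by open file handles. The map is read line by line, but both dictionaries are held in memory, so a very large seqid2taxid.map may need a streaming variant that only keeps the taxid set.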
