Setting up the database(s)
Here are the instructions for setting up the query databases for using GenEra with DIAMOND and/or with Foldseek.
First, download the nr database (warning: this is a huge FASTA file):
wget ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nr{.gz,.gz.md5} && md5sum -c *.md5
gunzip nr.gz
NOTE: Alternatively, you can download any other database whose sequence IDs can be traced back to the NCBI Taxonomy.
Then download “prot.accession2taxid” from the NCBI webpage:
wget ftp://ftp.ncbi.nih.gov:21/pub/taxonomy/accession2taxid/prot.accession2taxid.gz && gunzip prot.accession2taxid.gz
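Before moving on, it can help to confirm that the mapping file was downloaded and decompressed properly. The sketch below is not part of GenEra or DIAMOND; `check_acc2taxid` is a hypothetical helper name, and it assumes the file starts with the usual four-column, tab-separated header (accession, accession.version, taxid, gi).

```shell
# Hypothetical helper (not part of GenEra or DIAMOND): sanity-check the
# accession-to-taxid mapping file after download and decompression.
# Assumption: the first line is the tab-separated header
#   accession  accession.version  taxid  gi
check_acc2taxid() {
    file="${1:-prot.accession2taxid}"

    # The file must exist and be non-empty.
    if [ ! -s "$file" ]; then
        echo "missing or empty: $file" >&2
        return 1
    fi

    header=$(head -n 1 "$file")
    expected=$(printf 'accession\taccession.version\ttaxid\tgi')

    if [ "$header" = "$expected" ]; then
        echo "header looks fine: $file"
    else
        echo "unexpected header in $file: $header" >&2
        return 1
    fi
}
```

Running `check_acc2taxid prot.accession2taxid` returns a non-zero exit status if the file is missing, empty, or has a different header.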
Then download the taxonomy dump from the NCBI:
wget -N ftp.ncbi.nlm.nih.gov/pub/taxonomy/new_taxdump/new_taxdump.tar.gz
mkdir -p taxdump && tar zxf new_taxdump.tar.gz -C ./taxdump
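The database build step reads `nodes.dmp` and `names.dmp` from the extracted directory, so it may be worth checking that both files are present first. This is only a minimal sketch; `check_taxdump` is a hypothetical helper name, and it assumes the archive was extracted into `taxdump` as shown above.

```shell
# Hypothetical helper: confirm that the extracted taxonomy dump contains
# the two files that the database build step reads.
check_taxdump() {
    dir="${1:-taxdump}"
    status=0

    for f in nodes.dmp names.dmp; do
        if [ -s "$dir/$f" ]; then
            echo "found: $dir/$f"
        else
            echo "missing or empty: $dir/$f" >&2
            status=1
        fi
    done

    return "$status"
}
```

Calling `check_taxdump taxdump` prints one line per file and returns a non-zero exit status if either file is missing or empty.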
Finally, create a local nr database (another huge file):
diamond makedb \
--in nr \
--db nr \
--taxonmap prot.accession2taxid \
--taxonnodes taxdump/nodes.dmp \
--taxonnames taxdump/names.dmp
Congrats, you should now have the output file nr.dmnd!
You can now delete “prot.accession2taxid”, but keep the taxdump directory, as GenEra will use it later on.
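Since the mapping file takes a long time to download again, one cautious approach is to delete it only after confirming that the database file exists. The sketch below assumes the file names used above (`nr.dmnd` and `prot.accession2taxid`); `cleanup_taxonmap` is a hypothetical helper name. It leaves the taxdump directory untouched.

```shell
# Hypothetical helper: delete the accession-to-taxid mapping file only if
# the DIAMOND database file exists and is non-empty.
cleanup_taxonmap() {
    db="${1:-nr.dmnd}"
    map="${2:-prot.accession2taxid}"

    if [ -s "$db" ]; then
        rm -f "$map"
        echo "removed $map"
    else
        echo "$db not found or empty; keeping $map" >&2
        return 1
    fi
}
```

With the default arguments, `cleanup_taxonmap` checks for `nr.dmnd` in the current directory before removing `prot.accession2taxid`.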
Foldseek uses 3D structure predictions in PDB format as input, so first make sure that you have a structural prediction for each protein of your query species. This can be done using tools such as AlphaFold or OmegaFold. For example:
omegafold query_sequences.fasta output_directory
The predicted structure of each protein should be stored in its own PDB file, with all PDB files in a single directory. Make sure all the PDB files are uncompressed before running the analysis!
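If some of the predictions were saved as gzip-compressed files, the sketch below decompresses them in place and reports how many uncompressed PDB files the directory then holds. It assumes compressed files end in `.pdb.gz` and that all files sit directly inside the directory; `prepare_pdb_dir` is a hypothetical helper name.

```shell
# Hypothetical helper: decompress gzipped PDB files in a directory, then
# print the number of uncompressed .pdb files it contains.
# Assumption: compressed predictions are named *.pdb.gz and live directly
# in the given directory (no subdirectories are searched).
prepare_pdb_dir() {
    dir="$1"

    if [ ! -d "$dir" ]; then
        echo "not a directory: $dir" >&2
        return 1
    fi

    # gunzip replaces each file.pdb.gz with file.pdb; nothing runs if no
    # compressed files are found.
    find "$dir" -maxdepth 1 -type f -name '*.pdb.gz' -exec gunzip {} +

    # Count the uncompressed PDB files now present.
    find "$dir" -maxdepth 1 -type f -name '*.pdb' | wc -l | tr -d ' '
}
```

For example, `prepare_pdb_dir output_directory` could be run on the OmegaFold output directory; comparing the printed count with the number of query proteins is a quick way to spot missing predictions.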
Once you have that, use Foldseek to download the AlphaFold database (warning: this database is huge):
foldseek databases Alphafold/UniProt alphafoldDB tmp
Finally, download the taxonomy dump from the NCBI (you can skip this step if you already did it while setting up DIAMOND):
wget -N ftp.ncbi.nlm.nih.gov/pub/taxonomy/new_taxdump/new_taxdump.tar.gz
mkdir -p taxdump && tar zxf new_taxdump.tar.gz -C ./taxdump