Setting up the database(s)

Jaruwatana Sodai Lotharukpong edited this page Sep 7, 2023 · 3 revisions

These instructions explain how to set up the databases needed to run GenEra with DIAMOND and/or Foldseek.

DIAMOND database setup

First, download the nr database (warning: this is a huge FASTA file):

wget ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nr{.gz,.gz.md5} && md5sum -c *.md5
gunzip nr.gz

NOTE: Alternatively, you can download any other database whose sequence IDs can be traced back to the NCBI Taxonomy.

Then download “prot.accession2taxid” from the NCBI webpage:

wget ftp://ftp.ncbi.nih.gov:21/pub/taxonomy/accession2taxid/prot.accession2taxid.gz && gunzip prot.accession2taxid.gz
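If you chose a custom database instead of nr, the mapping file offers one way to check that its sequence IDs can be traced back to the NCBI Taxonomy. The sketch below is only an illustration: `missing_accessions` is a hypothetical helper (not part of GenEra or DIAMOND), it assumes the second tab-separated column of prot.accession2taxid holds the versioned accession, and it only inspects the first ID on each FASTA header line. It loads the whole map into memory, so it may need a lot of RAM with the full file.

```shell
# Hypothetical helper: print FASTA sequence IDs that have no entry in
# the accession-to-taxid map.
# Assumption: column 2 of the map is the versioned accession (e.g. ABC123.1)
# and the first line of the map is a header.
missing_accessions() {
    fasta="$1"   # custom protein database in FASTA format
    map="$2"     # uncompressed prot.accession2taxid
    awk -F'\t' '
        # First file (the map): remember every versioned accession.
        NR == FNR { if (FNR > 1) known[$2] = 1; next }
        # Second file (the FASTA): test the first ID of each header.
        /^>/ {
            split($0, parts, /[ \t]/)
            id = substr(parts[1], 2)
            if (!(id in known)) print id
        }
    ' "$map" "$fasta"
}
```

For example, `missing_accessions my_database.fasta prot.accession2taxid | head` would list the first few untraceable IDs, where `my_database.fasta` is a placeholder file name. No output suggests that every first-listed ID has a taxid.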

Then download the taxonomy dump from the NCBI:

wget -N ftp.ncbi.nlm.nih.gov/pub/taxonomy/new_taxdump/new_taxdump.tar.gz
mkdir -p taxdump && tar zxf new_taxdump.tar.gz -C ./taxdump
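Before building the database, it can help to confirm that all input files are in place, since a missing file would otherwise only surface once `diamond makedb` is launched. A minimal sketch, where `check_inputs` is a hypothetical helper name rather than anything shipped with DIAMOND or GenEra:

```shell
# Hypothetical pre-flight check: succeed only if every path given as an
# argument exists and is non-empty; report the offending paths on stderr.
check_inputs() {
    rc=0
    for f in "$@"; do
        if [ ! -s "$f" ]; then
            echo "missing or empty: $f" >&2
            rc=1
        fi
    done
    return "$rc"
}
```

With the file names used on this page, the call would be `check_inputs nr prot.accession2taxid taxdump/nodes.dmp taxdump/names.dmp`.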

Finally, create a local nr database (another huge file):

diamond makedb \
 --in nr \
 --db nr \
 --taxonmap prot.accession2taxid \
 --taxonnodes taxdump/nodes.dmp \
 --taxonnames taxdump/names.dmp

Congrats, you should now have the output file nr.dmnd! You can delete “prot.accession2taxid”, but keep the taxdump directory, as GenEra will use it later on.
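If you want to guard against removing the mapping file after a failed build, the cleanup can be made conditional on the database file being present. This is only a sketch, and `remove_map_if_built` is a hypothetical helper name:

```shell
# Hypothetical helper: delete the accession-to-taxid map only if the
# DIAMOND database file exists and is non-empty.
remove_map_if_built() {
    db="$1"    # e.g. nr.dmnd
    map="$2"   # e.g. prot.accession2taxid
    if [ -s "$db" ]; then
        rm -f -- "$map"
    else
        echo "database not found or empty: $db (keeping $map)" >&2
        return 1
    fi
}
```

Here the call would be `remove_map_if_built nr.dmnd prot.accession2taxid`. Note that a non-empty file is not proof of a complete database, so this check is a convenience rather than a validation.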

Foldseek database setup

Foldseek takes 3D structure predictions in PDB format as input, so first make sure that you have a structural prediction for each protein of your query species. These can be generated with tools such as AlphaFold or OmegaFold. For example:

omegafold query_sequences.fasta output_directory 

The structure prediction for each protein should be stored in its own PDB file, with all files in a single directory. Make sure all the PDB files are uncompressed before running the analysis!
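If some predictions were gzipped (for instance after downloading them from elsewhere), the directory can be tidied up in one pass. The sketch below assumes compressed files end in `.pdb.gz`; `prepare_pdb_dir` is a hypothetical helper name, and `-maxdepth` assumes GNU or BSD `find`:

```shell
# Hypothetical helper: decompress any gzipped PDB files directly inside a
# directory, then print how many uncompressed .pdb files it contains.
prepare_pdb_dir() {
    dir="$1"
    # Decompress *.pdb.gz in place (nothing is run if there are none).
    find "$dir" -maxdepth 1 -type f -name '*.pdb.gz' -exec gunzip -f {} +
    # Count the uncompressed PDB files.
    find "$dir" -maxdepth 1 -type f -name '*.pdb' | wc -l | tr -d ' '
}
```

Running `prepare_pdb_dir output_directory` prints the number of PDB files, which can be compared against the number of sequences in your query FASTA.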

Once you have that, use Foldseek to download the AlphaFold database (warning: this database is huge):

foldseek databases Alphafold/UniProt alphafoldDB tmp 

Finally, download the taxonomy dump from the NCBI:

wget -N ftp.ncbi.nlm.nih.gov/pub/taxonomy/new_taxdump/new_taxdump.tar.gz
mkdir -p taxdump && tar zxf new_taxdump.tar.gz -C ./taxdump