- Create a reference file
Prepare a TAB-separated reference file following the format described in the Reference Database section.
The file extension must be '.ref', e.g. 'your_reference.ref'.
- Format database
python $DFAST_APP_ROOT/scripts/reference_util.py formatdb -i your_reference.ref
Then, index files for GHOSTX and BLASTP will be generated in the same location as the reference file.
- Configure and run
Specify the reference file in the database attribute in the 'DBsearch' section,
e.g. "database": "/path/to/your_reference.ref"
Alternatively, you can run dfast using the --database option.
dfast --genome your_genome.fa --database /path/to/your_reference.ref
You can easily prepare a database from a FASTA file using 'reference_util.py'.
The script can parse FASTA definition lines in NCBI, UniProtKB, and Prokka styles.
- Convert a FASTA file into DFAST reference format
python $DFAST_APP_ROOT/scripts/reference_util.py fasta2dfast -i your_reference.fasta -o your_reference.ref
- Format database, configure, and run
Then, follow the same procedure as above.
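As a rough illustration of what reading a plain FASTA file involves (sequence ID followed by a free-text definition), here is a minimal sketch. It is not the code used by 'reference_util.py'; the function name `read_fasta` and the sample records are made up for this example, and the NCBI/UniProtKB/Prokka header styles are not handled.

```python
import io


def read_fasta(handle):
    """Yield (seq_id, definition, sequence) for each record in a FASTA stream."""
    seq_id, definition, chunks = None, "", []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if seq_id is not None:
                yield seq_id, definition, "".join(chunks)
            header = line[1:].strip()
            # Plain style (assumption): the first whitespace-delimited token
            # is the ID and the remainder is the definition.
            seq_id, _, definition = header.partition(" ")
            chunks = []
        elif line.strip():
            chunks.append(line.strip())
    if seq_id is not None:
        yield seq_id, definition, "".join(chunks)


# Hypothetical two-record input used only to exercise the parser.
example = io.StringIO(
    ">prot1 hypothetical protein\nMKT\nAAA\n>prot2 DNA polymerase III\nMSE\n"
)
records = list(read_fasta(example))
```

For real conversions, use the fasta2dfast command shown above, which also takes care of writing the DFAST reference format.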
OrthoSearch identifies orthologous genes based on a simple Reciprocal-Best-Hit (RBH) approach.
This is effective both in reducing running time and in transferring annotations from the reference genome of a closely related organism.
This recipe shows how to perform OrthoSearch.
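The Reciprocal-Best-Hit idea can be sketched in a few lines. This is only an illustration of the general RBH technique, not DFAST's implementation: the function names, the score dictionaries, and the tie-breaking rule (first hit seen wins) are assumptions made for this example.

```python
def best_hits(scores):
    """Map each query to its highest-scoring subject.

    scores: dict mapping (query, subject) -> alignment score.
    Ties keep the first pair encountered (an arbitrary choice for this sketch).
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(forward, reverse):
    """Return sorted (query, reference) pairs that are each other's best hit."""
    fwd = best_hits(forward)   # query protein -> best reference protein
    rev = best_hits(reverse)   # reference protein -> best query protein
    return sorted((q, r) for q, r in fwd.items() if rev.get(r) == q)


# Hypothetical scores: q2 and q3 both hit r2 best, but r2 prefers q3.
forward = {("q1", "r1"): 250.0, ("q1", "r2"): 90.0,
           ("q2", "r2"): 180.0, ("q3", "r2"): 200.0}
reverse = {("r1", "q1"): 245.0, ("r2", "q3"): 198.0, ("r2", "q2"): 175.0}

pairs = reciprocal_best_hits(forward, reverse)
```

In this toy input only q1/r1 and q3/r2 are reported, because q2's best hit (r2) does not point back to q2.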
OrthoSearch requires a 'reference proteome' file that contains all protein sequences in a genome.
The file must be in FASTA, GenBank, or DFAST reference format.
In addition to plain FASTA format (sequence ID and definition), OrthoSearch can parse FASTA definition lines in UniProt, GenBank, and Prokka styles.
The format is automatically recognized.
Our recommendation is to download a GenBank-format file from the NCBI Assembly Database and to use it as a reference.
- Download a reference proteome
python $DFAST_APP_ROOT/scripts/file_downloader.py --assembly GCF_000005845
This will download the latest version of the Escherichia coli str. K-12 genome in GenBank format into the current directory with the file name 'GCF_000005845.2.gbk'. You can use the --out option to specify the directory into which the file is downloaded.
- Run DFAST
Use --references to specify the reference proteome(s).
dfast --genome your_genome.fa --references GCF_000005845.2.gbk
You can specify multiple proteome files, separated by commas.
When multiple files are used as references, all-vs-all alignments are conducted between a query proteome and each of the reference proteomes, and the highest-scoring hit will be adopted as the result.
dfast --genome your_genome.fa --references GCF_000005845.2.gbk,GCA_000008865.1.faa
- Configuration
Reference proteomes can also be specified in the configuration file.
Set enabled to True, and specify references in the 'FUNCTIONAL_ANNOTATION' part.
{
    "component_name": "OrthoSearch",
    "enabled": True,
    "options": {
        # "cpu": 2,  # Uncomment this to set the component-specific number of CPUs.
        "skipAnnotatedFeatures": False,
        "evalue_cutoff": 1e-6,
        "qcov_cutoff": 75,
        "scov_cutoff": 75,
        "aligner": "ghostx",
        "aligner_options": {},
        "references": ["GCF_000005845.2.gbk", "GCA_000008865.1.faa"]
    },
},
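To make the cutoff options and the "highest-scoring hit" rule more concrete, here is a small sketch of one plausible reading. It assumes that evalue_cutoff is an upper bound, that qcov_cutoff and scov_cutoff are lower bounds on coverage in percent, and that candidate hits from all reference proteomes are compared by score. The `Hit` class, the function names, and the sample values are hypothetical and are not taken from DFAST's source.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    reference: str  # reference proteome file the hit came from
    subject: str    # protein ID in that reference
    evalue: float
    qcov: float     # query coverage in percent (assumed meaning)
    scov: float     # subject coverage in percent (assumed meaning)
    score: float


def passes_cutoffs(hit, options):
    """Apply the three thresholds from the 'options' block (assumed semantics)."""
    return (hit.evalue <= options["evalue_cutoff"]
            and hit.qcov >= options["qcov_cutoff"]
            and hit.scov >= options["scov_cutoff"])


def pick_best(hits, options):
    """Return the highest-scoring hit that passes all cutoffs, or None."""
    accepted = [h for h in hits if passes_cutoffs(h, options)]
    return max(accepted, key=lambda h: h.score, default=None)


options = {"evalue_cutoff": 1e-6, "qcov_cutoff": 75, "scov_cutoff": 75}

# Hypothetical candidate hits for one query protein.
hits = [
    Hit("GCF_000005845.2.gbk", "refA", 1e-50, 98.0, 97.0, 410.0),
    Hit("GCA_000008865.1.faa", "refB", 1e-60, 99.0, 99.0, 455.0),
    Hit("GCA_000008865.1.faa", "refC", 1e-80, 60.0, 99.0, 500.0),  # fails qcov
]

best = pick_best(hits, options)
```

Here refC has the highest raw score but is rejected by the query-coverage cutoff, so refB from the second reference is selected.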
BlastSearch performs protein homology searches against large reference databases, such as the pre-formatted BLAST databases (e.g. RefSeq Protein and SwissProt) available at the NCBI FTP site.
- Download a database from NCBI
wget ftp://ftp.ncbi.nlm.nih.gov//blast/db/swissprot.tar.gz
tar xvfz swissprot.tar.gz
- Create a configuration file
Set enabled to True, and specify the database to be searched against. You can also specify dbtype, but normally leaving it as 'auto' will do.
Place this part upstream of the 'DBsearch' component that searches the default database if you want to give priority to 'BlastSearch'.
{
    "component_name": "BlastSearch",
    "enabled": True,
    "options": {
        # "cpu": 2,  # Uncomment this to set the component-specific number of CPUs.
        "skipAnnotatedFeatures": False,
        "evalue_cutoff": 1e-6,
        "qcov_cutoff": 75,
        "scov_cutoff": 75,
        "aligner": "blastp",  # Must be blastp
        "aligner_options": {},
        "dbtype": "auto",  # Must be one of auto/ncbi/uniprot/plain
        "database": "/path/to/swissprot",
    },
},
- Run DFAST
dfast --genome your_genome.fa --config your_config.py
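The ordering advice in the configuration step ("upstream of 'DBsearch'") can be pictured as list manipulation. The sketch below assumes the functional annotation components form a Python list of dictionaries that are processed in the order listed; the variable name `functional_annotation`, the helper `insert_before`, and the abbreviated component entries are illustrative only, so check your own configuration file for the actual layout.

```python
def insert_before(components, new_component, target_name):
    """Return a new list with new_component placed just before the first
    entry whose "component_name" equals target_name.

    If no such entry exists, new_component is appended at the end.
    """
    result = list(components)  # copy, so the input list is left untouched
    for index, component in enumerate(result):
        if component.get("component_name") == target_name:
            result.insert(index, new_component)
            return result
    result.append(new_component)
    return result


# Abbreviated, hypothetical component list (real entries carry more options).
functional_annotation = [
    {"component_name": "OrthoSearch", "enabled": False},
    {"component_name": "DBsearch", "enabled": True},
]

blast_search = {
    "component_name": "BlastSearch",
    "enabled": True,
    "options": {"dbtype": "auto", "database": "/path/to/swissprot"},
}

ordered = insert_before(functional_annotation, blast_search, "DBsearch")
order = [component["component_name"] for component in ordered]
```

In practice you would simply edit the configuration file by hand; the point is only that 'BlastSearch' must appear earlier in the list than the 'DBsearch' entry it should take priority over.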
Here is an example of creating a database from RefSeq nonredundant archaeal proteins.
- Download FASTA files
wget ftp://ftp.ncbi.nlm.nih.gov//refseq/release/archaea/archaea.nonredundant_protein.*.protein.faa.gz
gunzip -c archaea.nonredundant_protein.*.protein.faa.gz > archaea.nonredundant_protein.faa
- Format database
Be sure to use -parse_seqids.
makeblastdb -hash_index -parse_seqids -dbtype prot -in archaea.nonredundant_protein.faa
- Create a configuration file and run DFAST
Follow the recipe described above.
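If gunzip is not available, the decompress-and-concatenate step from the download stage can be reproduced with the Python standard library. This is a minimal sketch; the function name `concatenate_gzipped` is made up for this example, and files are merged in sorted name order, which is assumed to be close enough to shell glob order for this purpose.

```python
import glob
import gzip
import shutil


def concatenate_gzipped(pattern, output_path):
    """Decompress every file matching pattern and append it to output_path.

    Returns the list of input paths in the order they were written.
    """
    paths = sorted(glob.glob(pattern))
    with open(output_path, "wb") as out:
        for path in paths:
            with gzip.open(path, "rb") as src:
                # Stream the decompressed bytes to avoid loading whole files.
                shutil.copyfileobj(src, out)
    return paths
```

For example, `concatenate_gzipped("archaea.nonredundant_protein.*.protein.faa.gz", "archaea.nonredundant_protein.faa")` would produce the same FASTA file that is passed to makeblastdb in the formatting step.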