Conditional Reciprocal Best BLAST - high confidence ortholog assignment.
CRB-BLAST is a novel method for finding orthologs between one set of sequences and another. This is particularly useful in genome and transcriptome annotation.
CRB-BLAST initially performs a standard reciprocal best BLAST. It does this by performing BLAST alignments of query->target and target->query. Reciprocal best BLAST hits are pairs in which the best match for a given query sequence in the query->target alignment also has that query sequence as its own best hit in the reverse (target->query) alignment.
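The reciprocal best hit rule can be sketched in a few lines of Ruby. This is only an illustration and not the gem's internal code: the `Hit` struct, the method names and the example data are made up, and ranking hits by bitscore is an assumption.

```ruby
# Hypothetical hit record; the real gem has its own classes.
Hit = Struct.new(:query, :target, :bitscore)

# Keep only the highest-scoring hit for each query sequence.
# Ranking by bitscore is an assumption made for this sketch.
def best_hits(hits)
  hits.each_with_object({}) do |hit, best|
    current = best[hit.query]
    best[hit.query] = hit if current.nil? || hit.bitscore > current.bitscore
  end
end

# A pair is reciprocal when each sequence is the other's best hit.
def reciprocal_best_hits(forward, reverse)
  best_fwd = best_hits(forward)
  best_rev = best_hits(reverse)
  best_fwd.each_with_object([]) do |(query, hit), pairs|
    back = best_rev[hit.target]
    pairs << [query, hit.target] if back && back.target == query
  end
end

# Made-up alignments: query->target, then target->query.
forward = [Hit.new('q1', 't1', 250.0), Hit.new('q1', 't2', 90.0),
           Hit.new('q2', 't2', 120.0)]
reverse = [Hit.new('t1', 'q1', 245.0), Hit.new('t2', 'q1', 95.0),
           Hit.new('t2', 'q2', 80.0)]

pairs = reciprocal_best_hits(forward, reverse)
pairs.each { |query, target| puts "#{query}\t#{target}" }
```

In this toy data q2's best hit is t2, but t2's best hit is q1, so only the q1/t1 pair is kept. This is what makes the plain reciprocal approach conservative.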
Reciprocal best BLAST is a very conservative way to assign orthologs. The main innovation in CRB-BLAST is to learn an appropriate e-value cutoff to apply to each pairwise alignment by taking into account the overall relatedness of the two datasets being compared. This is done by fitting a function to the distribution of alignment e-values over sequence lengths. The function provides the e-value cutoff for a sequence of given length.
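As a rough illustration of a length-dependent cutoff, the sketch below averages -log10(e-value) over reciprocal hits of similar length and accepts a non-reciprocal hit only if it is at least as significant as that average. The sliding-window mean, the window size and all names and data here are assumptions made for the example; the text above does not specify the fitted function, so this should not be read as the published method.

```ruby
# Made-up reciprocal hits as [alignment_length, evalue] pairs.
reciprocals = [[100, 1e-20], [110, 1e-24], [200, 1e-50], [210, 1e-54]]

# Mean of -log10(evalue) over reciprocal hits whose length lies within
# +/- window of the requested length. Returns nil when none are nearby.
# (Assumes evalues are greater than zero.)
def fitted_score(reciprocals, length, window = 20)
  nearby = reciprocals.select { |len, _| (len - length).abs <= window }
  return nil if nearby.empty?
  nearby.map { |_, evalue| -Math.log10(evalue) }.inject(:+) / nearby.size
end

# Accept a non-reciprocal hit if it is at least as significant as the
# fitted value for its length.
def passes_cutoff?(reciprocals, length, evalue)
  cutoff = fitted_score(reciprocals, length)
  !cutoff.nil? && -Math.log10(evalue) >= cutoff
end

puts passes_cutoff?(reciprocals, 105, 1e-30)
puts passes_cutoff?(reciprocals, 105, 1e-10)
```

The point of the design is that a fixed e-value threshold treats short and long alignments alike, whereas a cutoff derived from the reciprocal hits adapts to how closely related the two datasets are.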
CRB-BLAST greatly improves the accuracy of ortholog assignment for de-novo transcriptome assembly (Aubry et al. 2014).
The CRB-BLAST algorithm was designed by Steve Kelly, and this implementation is by Chris Boursnell and Richard Smith-Unna. The original reference implementation from the paper is available for online use at http://www.stevekellylab.com/software/conditional-orthology-assignment.
To install CRB-BLAST, simply use rubygems:
gem install crb-blast
CRB-BLAST has the following prerequisites:
- NCBI BLAST+ (preferably the latest version), installed and in your PATH.
- Ruby v2.0 or later. If you don't have Ruby, we suggest installing it with RVM:
\curl -sSL https://get.rvm.io | bash -s stable --ruby
CRB-BLAST can be run from the command-line as a standalone program, or used as a library in your own code.
CRB-BLAST can be run from the command line with:
crb-blast
The options are:
--query, -q <s>: query fasta file in nucleotide format
--target, -t <s>: target fasta file as nucleotide or protein
--evalue, -e <f>: e-value cut off for BLAST. Format 1e-5 (default: 1.0e-05)
--threads, -h <i>: number of threads to run BLAST with (default: 1)
--output, -o <s>: output file as tsv
--split, -s: split the fasta files into chunks, run multiple blast jobs and then combine them
--help, -l: Show this message
An example command is:
crb-blast --query assembly.fa --target reference_proteins.fa --threads 8 --output annotation.tsv
To include the gem in your code just require 'crb-blast'
A quick example:
blaster = CRB_Blast.new('test/query.fasta', 'test/target.fasta')
blaster.run(1e-5, 4, true) # to run with an evalue cutoff of 1e-5 and 4 threads
A longer example with each step at a time:
blaster = CRB_Blast.new('test/query.fasta', 'test/target.fasta')
blaster.makedb
blaster.run_blast(1e-5, 6, true)
blaster.load_outputs
blaster.find_reciprocals
blaster.find_secondaries
The CRB-BLAST output file is a tab-separated table whose columns are taken from the BLAST output:
query - the name of the transcript from the 'query' fasta file
target - the name of the transcript from the 'target' fasta file
id - the percent sequence identity
alnlen - the alignment length
evalue - the blast evalue
bitscore - the blast bitscore
qstart..qend - the coordinates of the alignment in the query from start to end
tstart..tend - the coordinates of the alignment in the target from start to end
qlen - the length of the query transcript
tlen - the length of the target transcript
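The columns above could be read back into Ruby along these lines. This is a sketch under assumptions: it supposes one hit per line, tab-separated fields in the order listed, and coordinates written as "start..end". The `parse_crb_line` helper and the example line are made up for illustration.

```ruby
# Assumed column order, following the list above. The two coordinate
# columns are assumed to hold "start..end" strings.
COLUMNS = [:query, :target, :id, :alnlen, :evalue, :bitscore,
           :qcoords, :tcoords, :qlen, :tlen]

# Turn one tab-separated output line into a hash with typed values.
def parse_crb_line(line)
  row = Hash[COLUMNS.zip(line.chomp.split("\t"))]
  qstart, qend = row[:qcoords].split('..').map(&:to_i)
  tstart, tend = row[:tcoords].split('..').map(&:to_i)
  {
    query: row[:query], target: row[:target],
    id: row[:id].to_f, alnlen: row[:alnlen].to_i,
    evalue: row[:evalue].to_f, bitscore: row[:bitscore].to_f,
    qstart: qstart, qend: qend, tstart: tstart, tend: tend,
    qlen: row[:qlen].to_i, tlen: row[:tlen].to_i
  }
end

# Made-up example line, not real CRB-BLAST output.
line = "contig_1\tAT1G01010\t87.5\t240\t1e-80\t295.0\t12..731\t1..240\t900\t429\n"
hit = parse_crb_line(line)

# Fraction of the target sequence covered by the alignment.
coverage = (hit[:tend] - hit[:tstart] + 1).to_f / hit[:tlen]
puts "#{hit[:query]} -> #{hit[:target]} (target coverage #{coverage.round(2)})"
```

To process a whole file, the same helper can be applied to each line, for example with `File.foreach('annotation.tsv').map { |l| parse_crb_line(l) }`.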
Please use the issue tracker if you find bugs or have trouble running CRB-BLAST.
Chris Boursnell (cmb211@cam.ac.uk) maintains this software.
This is academic software - please cite us if you use it in your work.
CRB-BLAST is released under the MIT license.