This repository has been archived by the owner on Sep 30, 2022. It is now read-only.
-
Notifications
You must be signed in to change notification settings - Fork 5
Ecogenomics/Mannotator
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
------------------------------------------------ Mannotator ------------------------------------------------ Behold! The Mannotator walks among us! ------------------------------------------------ What is it? ------------------------------------------------ With such an exciting name, you may be forgiven for thinking that the mannotator is something other than a microbial sequence annotator, BUT, you'd be wrong… If you have a microbial assembly, then you can use this tool to annotate your contigs. This tool is not perfect, nor complete. Any and all suggestions are welcome! "Annotate your contig" (currently) means: * Call open reading frames (ORFs) on your contigs or complete genome using a consensus approach (e.g. RAST, Prodigal, others possible) * Assign function (KEGG, COG/NOG, eggNOG, Gene Ontology (GO)) to these ORFs using a best blast hit approach ------------------------------------------------ What to do before I can use it? ------------------------------------------------ Step 1: Install Mannotator's dependencies. You'll need Perl and the following Perl modules: * Bioperl * Getopt::Euclid * URI::Escape Step 1a: Install optional dependancies * Ruby * bioruby Step 2: Prepare database files to run your annotations with. Mannotator has support for EMBL Uniref/COG/KEGG and NCBI Nr/Taxonomy Step 2a: Download your protein database of choice, i.e. Uniref90 (http://www.ebi.ac.uk/uniprot/database/download.html) or Nr (ftp://ftp.ncbi.nih.gov/../blast/db/FASTA/nr.gz) Step 2b: Uncompress it and format it with the BLAST formatdb utility Step 2b: Use Mannotator's idMergeUniref.pl or idMergeNr.pl to produce a file of ID mappings, i.e. Uniref_mappings.txt or Nr_mappings.txt Now you are ready to annotate your sequence using Ze Mannotator ------------------------------------------------ How do I use it? ------------------------------------------------ NOTE: These instructions are made for the install on our Eury server. They can easily be modified to work elsewhere though… Step 1: Ensure all you contig names (fasta IDs) are computer friendly. Avoid the following characters in fasta headers (always a good idea anyway…). Also, make sure that all the fasta headers are unique! Avoid these characters: | space \ & * ( ) & Step 2. Submit your sequences for annotation on the RAST server at http://rast.nmpdr.org/. When it is over, export your annotations as a GFF file. The RAST annotation may last several days, so in the meantime, start step 3. Step 3. Find ORFs in your sequences using Prodigal (on eury) $ prodigal -i IN_FILE -o OUT_FILE.gff3 -f gff For metagenomic contigs, try: $ prodigal -i IN_FILE -o OUT_FILE.gff3 -f gff -p meta Step 4. Copy the Prodigal and RAST gff3 files and the multiple fasta file containing your contigs into a working directory. Step 5. Unleash the mannotator! Here is a quick rundown what is being done by the mannotator and how to run it. 1. Combine multiple gff3 files into one and extract fasta sequences of un-annotated ORFs 2. Use Blastx against the Uniref90 or Nr databases to get best similarity for each each ORF fasta sequence 3. Use the Uniref obtained in the previous step to assign KEGG and COG/NOG IDs to each of the un-annotated fasta files. If you used Nr, then you get GI (genbank identifiers) which are used to look up the putative function and taxonomy of the proteic sequence. 4. Mix all the information together and return a single gff3 file called 'mannotatored.gff3' that contains all your annotations. You can visualise the annotations contained in this file using Artemis or GBrowse. Running the Mannotator will take a while, so perhaps you should use nohup or screen. The order of the GFF files is IMPORTANT. You should put the RAST file first, followed by the prodigal file. The most basic command to get the mannotator running is: $ mannotator -g rast.gff3,prodigal.gff3 -c contigs.fa which (on the Luca server) load the correct balst database and annotation mapping file. However specifying a custom location for both of these files is also possible: $ mannotator -g rast.gff3,prodigal.gff3 -c contigs.fa -p <uniref_path> -i <Uniref_mappings_path> $ mannotator -g rast.gff3,prodigal.gff3 -c contigs.fa -p <nr_path> -i <Nr_mappings_path> If you ran your data against both Uniref (for function assignment) and Nr (for taxonomy assignment), you might want to recombine your GFF output files into a single file using the combineGffAnnotations.pl script. Where the file for <uniref_path> can be found at ACE: Luca: /srv/whitlam/bio/db/uniprot/uniref90/uniref90.fasta Eury: /Volumes/Biodata/BLAST_databases/UniProt/UniRef90/uniref90.fasta And the file for <Uniref_mappings_path>: Luca: /srv/whitlam/bio/db/mannotator/Uniref201106_mappings.txt Eury: /Volumes/Biodata/ANN_mappings/ANN_mappings.txt On Luca you no longer need to specify these two file locations as they are automatically used by default The pipeline will output a file called “mannotatored.gff3” FIN ------------------------------------------------ More info on the scripts themselves ------------------------------------------------ The mannotator can also be run as individual scripts... The first is called combineGffOrfs. It will take any number of gff3 files and merge their ORFs into one combined file. Also, it is currently programmed to extract the fasta sequences for ORFs which have not been given a RAST/FIG annotation. This script will run as a stand alone application. The order of the gff3 files is IMPORTANT. You need to supply them in the order in which you trust them. $ combineGffOrfs -gffs|g gff_file1[,gff_file2[, ... ]] -contigs|c contigs_file The second script is called blast2Ann. It will take a Uniref90 blastx result and add KEGG and COG/NOG annotations to your contigs. Again, it can be run as a standalone. $ blast2ann -u2a|u ANN_file -gff3|g gff3_file The ANN_file is a special text file which links Uniref90 IDs to KEGG and COG/NOG IDs. The format of the ANN_mappings file is as follows: UniProtID^[;Ontology_term=COG_ID:COG-NOGID[;Ontology_term=COG_DESC:COG-NOG text description]][;Ontology_term=KEGG_ID:KEGG entryID[;Ontology_term=KEG_PATHWAY:KEGG pathway ID][;Ontology_term=KEGG_ONTOLOGY:KEGG ko ID]] Some sample lines: B8NAL8^;Ontology_term=KEGG_ID:afv:AFLA_042430 B9LMU9^;Ontology_term=KEGG_ID:hla:Hlac_1092;Ontology_term=KEGG_PATHWAY:path:hla02010;Ontology_term=KEGG_ONTOLOGY:ko:K01999 Q2S9X8^;Ontology_term=COG_ID:COG2969;Ontology_term=COG_DESC:Stringent starvation protein B;Ontology_term=KEGG_ID:hch:HCH_10003;Ontology_term=KEGG_ONTOLOGY:ko:K03600 B2TZD5^;Ontology_term=KEGG_ID:sbc:SbBS512_E3074;Ontology_term=KEGG_ONTOLOGY:ko:K01146 Q21RE3^;Ontology_term=COG_ID:COG1881;Ontology_term=COG_DESC:Phospholipid-binding protein;Ontology_term=KEGG_ID:rfr:Rfer_3962;Ontology_term=KEGG_ONTOLOGY:ko:K06910 Use this file however you please… ------------------------------------------------ Usage and command-line options ------------------------------------------------ Usage: mannotator -g[ffs] | --gffs <gff>... -c[ontigs] | --contigs <contig> [options] mannotator --help mannotator --version Required arguments: -g[ffs] <gff>... | --gffs <gff>... A comma or space separated list of gff3 formatted files -c[ontigs] <contig> | --contigs <contig> A fasta formated file containing contigs Options: -p[rotdb] <protdb> | --protein-database <protdb> Specify a custom location for a blast database of sequences Default: /srv/whitlam/bio/db/uniprot/uniref90/uniref90.fasta -i[2a] <mapping> | --i2a <mapping> Specify a custom location for the database accession number mapping file Default: /srv/whitlam/bio/db/mannotator/Uniref201106_mappings.txt -m[in_len] <minlen> | --minimum-length <minlen> Remove ORFs less than this length (in bp). Default: 100 bp -d | --seq-embed Embed sequences into GFF files (useful to view the annotation in Artemis) -f[latfile] | --flat-file Optionally create multiple genbank files for your contigs -b[last_prg] <blastprog> | --blast-program <blastprog> The type of blast to run. Default: blastx -e[value] <evalue> | --evalue <evalue> E-value cutoff when running blast. Default: 1e-08 -t[hreads] <threads> | --threads <threads> Run blast on multiple threads. Default: 1 -o[ut] <outfile> | --outfile <outfile> Filename for the final gff3 file. Default: mannotatored.gff3 -k[eep] | --keep-tmp Keep all tmp files and directories created -x | --keep-blast Keep only the blast results and delete all other tmp files -s[ims] | --blast-sims Use a precomputed blast output file -n[ew_blast] | --blast+ Use the blast+ toolkit, default is to use blastall -one | --one Skip step one of the mannotator -two | --two Skip step two of the mannotator -three | --three skip step three of the mannotator -four | --four Skip step four of the mannotator -five | --five Skip step five of the mannotator
About
Microbial annotation pipeline - in progress
Resources
Stars
Watchers
Forks
Releases
No releases published
Packages 0
No packages published