
Sequence Header Format

Toni Westbrook edited this page Jun 25, 2015 · 4 revisions

PALADIN currently encodes the following information in the sequence headers of the generated .PRO files (and these headers subsequently appear in the SAM files as well). The headers mostly carry the same fields in the same order; the meaning of each field differs slightly depending on which algorithm was used:

  1. The first line of every reference .PRO file created during the index process is an information header describing the options used during indexing. Example: ">NT=x:MF=x:VER=x" where NT indicates whether the original source was nucleotide (1) or amino acid (0), MF indicates whether the protein file contains all 6 frames per sequence (1) or just the reading frame (0), and VER contains the version of the program.

  2. When the reference was given as a nucleotide sequence and accompanying annotation (FASTA and GFF), the header for each sequence in the .PRO file will be "<Line # of CDS in GFF>:<frame #>:<original sequence header>".

  3. When the reference was given as a nucleotide sequence only (FASTA with only coding sequences, e.g. UniProt NT sequences), the header for each sequence in the .PRO file will be "<Sequence #>:<frame #>:<original sequence header>".

  4. When the reference was given as a protein sequence, the .PRO file will contain only the information header at the top and will otherwise be blank, since the sequence is already in protein format.

  5. For reads, the header for each sequence in the .PRO file will be "<Sequence #>:<ORF #>:<Relative frame #>", where ORF # identifies which ORF this is within its sequence and frame, since future algorithms (especially for much larger reads) may return multiple ORFs per frame. Relative frame # is the frame # relative to the frame of the first identified ORF.
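
The header layouts listed above can be split with a few lines of Python. This is only a sketch and not part of PALADIN: the function names are hypothetical, and it assumes that the numeric fields are plain integers, that the VER value contains no colon, and that only the original sequence header may itself contain colons (so the split stops after the first two fields).

```python
def parse_index_header(line):
    """Parse an information header such as '>NT=1:MF=0:VER=x' into a dict.

    Values are kept as strings; assumes no field value contains ':'.
    """
    fields = line.strip().lstrip(">").split(":")
    return dict(field.split("=", 1) for field in fields)


def parse_reference_header(line):
    """Split '<number>:<frame #>:<original sequence header>' into three parts.

    The first field is the GFF line number of the CDS (FASTA plus GFF
    reference) or the sequence number (FASTA-only reference).
    """
    # maxsplit=2 keeps any colons inside the original header intact.
    number, frame, original = line.strip().lstrip(">").split(":", 2)
    return int(number), int(frame), original


def parse_read_header(line):
    """Split '<Sequence #>:<ORF #>:<Relative frame #>' into three integers."""
    sequence, orf, relative_frame = line.strip().lstrip(">").split(":")
    return int(sequence), int(orf), int(relative_frame)
```

For example, `parse_reference_header(">12:2:gene:A description")` keeps `gene:A description` together as the original header rather than splitting it at the inner colon.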

Note: For our tests, we've been using the ART-generated reads created for MCBS913. Their names are in the following format (a combination of what ART generates and what we added): "<Sequence name>:-:<unique read ID>:-:<Start Location for Paired End 1>:-:<Start Location for Paired End 2>"
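
A read name in this test format could be taken apart as sketched below. The function name is hypothetical, and the sketch assumes that both start locations are plain integers with no trailing suffix and that a leading ">" or "@" (FASTA or FASTQ marker) may be present. Splitting from the right means a sequence name that happens to contain ":-:" stays in one piece.

```python
def parse_art_read_name(name):
    """Split '<Sequence name>:-:<read ID>:-:<start 1>:-:<start 2>'.

    Returns (sequence_name, read_id, start_pe1, start_pe2).
    """
    # rsplit with maxsplit=3 peels the last three fields off the right,
    # leaving everything before them as the sequence name.
    seq_name, read_id, start1, start2 = (
        name.strip().lstrip(">@").rsplit(":-:", 3)
    )
    return seq_name, read_id, int(start1), int(start2)
```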
