Sequence Header Format
PALADIN currently encodes the following information into the sequence headers of the generated .PRO files (which subsequently also appear in the SAM files). The fields are arranged systematically in the same order in every case; their meaning differs slightly depending on which algorithm was used:
- The first line of every reference .PRO file created during the index process is an information header that records the options used during indexing. Example: ">NT=x:MF=x:VER=x", where NT indicates whether the original source was nucleotide (1) or amino acid (0), MF indicates whether the protein file contains all 6 frames per sequence (1) or just the reading frame (0), and VER contains the version of the program.
- When the reference was given as a nucleotide sequence with an accompanying annotation (FASTA and GFF), the header for each sequence in the .PRO file will be "<Line # of CDS in GFF>:<frame #>:<original sequence header>".
- When the reference was given as a nucleotide sequence only (FASTA with only coding sequences, e.g. UniProt NT sequences), the header for each sequence in the .PRO file will be "<Sequence #>:<frame #>:<original sequence header>".
- When the reference was given as a protein sequence, the .PRO file will contain only the information header at the top and will otherwise be blank, since the sequence is already in protein format.
- For reads, the header for each sequence in the .PRO file will be "<Sequence #>:<ORF #>:<Relative frame #>", where ORF # identifies the ORF within each sequence and frame, as future algorithms (especially for much larger reads) may return multiple ORFs per frame. Relative frame # is the frame # relative to the frame of the first identified ORF.
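The header layouts listed above can be split back into their fields with a few lines of Python. The following is only a minimal sketch and not part of PALADIN: the function and type names are hypothetical, the numeric fields are assumed to be integers, and a leading ">" is stripped if present. The sample values in the usage lines are invented for illustration.

```python
from typing import Dict, NamedTuple, Tuple


class ProHeader(NamedTuple):
    """Fields of a reference .PRO sequence header (hypothetical helper type)."""
    index: int     # GFF line number or sequence number, depending on the reference type
    frame: int     # frame number (assumed here to be an integer)
    original: str  # original sequence header, which may itself contain colons


def parse_info_header(line: str) -> Dict[str, str]:
    """Parse an information header such as '>NT=1:MF=0:VER=x' into a dict of strings."""
    fields = line.strip().lstrip(">").split(":")
    return dict(field.split("=", 1) for field in fields)


def parse_reference_header(line: str) -> ProHeader:
    """Parse '<index>:<frame #>:<original sequence header>'."""
    # Split on the first two colons only, so colons inside the original header survive.
    index, frame, original = line.strip().lstrip(">").split(":", 2)
    return ProHeader(int(index), int(frame), original)


def parse_read_header(line: str) -> Tuple[int, int, int]:
    """Parse '<Sequence #>:<ORF #>:<Relative frame #>' for reads."""
    sequence, orf, relative_frame = line.strip().lstrip(">").split(":")
    return int(sequence), int(orf), int(relative_frame)


# Usage with made-up sample values:
info = parse_info_header(">NT=1:MF=0:VER=x")
ref = parse_reference_header("12:3:example header: with a colon")
read = parse_read_header("5:0:1")
```

Limiting the split to the first two colons in `parse_reference_header` is a deliberate choice, since the original sequence header is copied verbatim and may contain colons of its own.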
Note: For our tests, we have been using the ART-generated reads created for MCBS913. Their headers use the following format (a combination of what ART generates and what we added): "<Sequence name>:-:<unique read ID>:-:<Start Location for Paired End 1>:-:<Start Location for Paired End 2>"
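Read names in this test format can be taken apart by splitting on the ":-:" separator. The sketch below is illustrative only: the names `ArtReadName` and `parse_art_read_name` are hypothetical, the two start locations are assumed to be integers, and the sample read name is invented.

```python
from typing import NamedTuple


class ArtReadName(NamedTuple):
    """Fields of a test read name (hypothetical helper type)."""
    sequence_name: str
    read_id: str
    start_pe1: int  # start location for paired end 1 (assumed integer)
    start_pe2: int  # start location for paired end 2 (assumed integer)


def parse_art_read_name(name: str) -> ArtReadName:
    """Split '<Sequence name>:-:<unique read ID>:-:<start 1>:-:<start 2>'."""
    # Drop a leading FASTA/FASTQ marker if present, then split from the right
    # so that a ':-:' occurring inside the sequence name is left intact.
    cleaned = name.strip().lstrip(">@")
    sequence_name, read_id, start1, start2 = cleaned.rsplit(":-:", 3)
    return ArtReadName(sequence_name, read_id, int(start1), int(start2))


# Usage with a made-up read name:
parsed = parse_art_read_name("chr1:-:read42:-:100:-:350")
```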