cladeDefinitionProcess.txt
This document describes, at a technical level, how the clade definitions in BTV-GLUE are arrived at.
1. Sequences are downloaded from GenBank, up to a certain date, into the ncbi-curated source.
2. As well as GenBank annotations, the btvSegmentRecogniser module can be used to assign each sequence to the correct segment.
3. The team updates metadata from the literature, including isolate-sequence associations and whether certain
sequences should be excluded.
4. The populateCompleteSegmentField.glue script within the project build will set the "complete segment" field according to some
per-segment length criteria.
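
A minimal sketch of what a per-segment length criterion could look like, written in Python purely for illustration. The
threshold values and the names MIN_COMPLETE_LENGTH and is_complete_segment are hypothetical; the actual criteria live in
populateCompleteSegmentField.glue and may differ (for example, they may also impose an upper bound).

```python
# Hypothetical minimum lengths in nucleotides, keyed by segment name.
# Placeholder values only; NOT the thresholds used by the project.
MIN_COMPLETE_LENGTH = {"S1": 3900, "S10": 800}


def is_complete_segment(segment, sequence_length, thresholds=MIN_COMPLETE_LENGTH):
    """Return True if the sequence meets the minimum length for its segment."""
    if segment not in thresholds:
        raise KeyError(f"no length criterion defined for segment {segment!r}")
    return sequence_length >= thresholds[segment]
```

With these placeholder thresholds, is_complete_segment("S10", 500) is False, so such a sequence would not enter the
complete-segment alignment in the next step.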
5. An alignment BTV_COMPL_SEG_NT_* is made for each segment. It is either loaded from JSON during the project build or
recomputed with MAFFT using the btvCompSegNtAlignments.js script. This alignment contains the segment master reference
together with the non-excluded complete segment sequences from ncbi-curated.
6. Unaligned protein sequences are created from the sequences within the BTV_COMPL_SEG_NT_* alignments plus references and
outgroups using the btvOutgroupProteinUnaligned.js script, and stored in BTV_OUTG_UNALIGNED_*.faa files.
7. Protein alignments BTV_OUTG_ALIGNED_*.faa are generated from these unaligned protein sequence files outside GLUE using
MAFFT, coordinated by the bash script alignments/btvOutgroupProtein/mafftAlign.sh.
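
The file-by-file MAFFT step can be sketched in Python as below. This is an illustration under stated assumptions, not
the contents of mafftAlign.sh: the --auto option is an assumed choice (the options the script actually passes are not
given here), and the helper names aligned_path, mafft_command and align are hypothetical.

```python
import subprocess
from pathlib import Path


def aligned_path(unaligned):
    """Map BTV_OUTG_UNALIGNED_<x>.faa to BTV_OUTG_ALIGNED_<x>.faa in the same directory."""
    path = Path(unaligned)
    prefix = "BTV_OUTG_UNALIGNED_"
    if not path.name.startswith(prefix):
        raise ValueError(f"unexpected file name: {path.name}")
    return path.with_name("BTV_OUTG_ALIGNED_" + path.name[len(prefix):])


def mafft_command(unaligned):
    """Build the MAFFT command line; MAFFT writes the alignment to standard output."""
    # --auto lets MAFFT pick a strategy; assumed here, not taken from mafftAlign.sh.
    return ["mafft", "--auto", str(unaligned)]


def align(unaligned):
    """Run MAFFT on one unaligned protein file and write the aligned file next to it."""
    output = aligned_path(unaligned)
    with open(output, "w") as handle:
        subprocess.run(mafft_command(unaligned), stdout=handle, check=True)
    return output
```

Calling align on each BTV_OUTG_UNALIGNED_*.faa file requires the mafft executable to be on the PATH.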
8. Codon (nucleotide) alignments BTV_OUTG_CODON_* are generated using the GLUE BLAST-based alignment importer.
These are either imported from JSON during the project build or recomputed using btvOutgroupCodonAlignments.js.
Exported versions of these alignments are checked by eye.
9. Phylogenetic trees (phyloTrees/S*.tree) are built from the BTV_OUTG_CODON_* alignments using RAxML. The buildPhyloTrees.js
script will regenerate them, or they can be imported into the phylogeny field of the relevant alignment using
glue/importUnrootedTrees.js (not invoked during the project build).
10. The script rerootPhyloTrees.js assumes the unrooted trees have been loaded as in step 9, reroots them using outgroup
rerooting, removes the outgroup, then outputs them as phyloTrees/S*_og_rerooted.tree. These trees can be imported into
the phylogeny field of the relevant alignment using glue/importRootedTrees.js (not invoked during the project build).
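
The idea of outgroup rerooting followed by outgroup removal can be illustrated with a small Python sketch. It is not the
logic of rerootPhyloTrees.js: it assumes a single outgroup leaf, represents the unrooted tree as a hypothetical
adjacency dictionary, and ignores branch lengths and support values.

```python
def reroot_and_drop_outgroup(adjacency, outgroup):
    """Root an unrooted tree on the branch leading to the outgroup leaf, then
    remove the outgroup and return the ingroup as a Newick string.

    adjacency maps each node name to the set of its neighbouring node names.
    """
    neighbours = adjacency[outgroup]
    if len(neighbours) != 1:
        raise ValueError("outgroup must be a leaf with exactly one neighbour")
    (attachment,) = neighbours

    def to_newick(node, parent):
        # Children are all neighbours except the node we came from and the outgroup.
        children = sorted(n for n in adjacency[node] if n != parent and n != outgroup)
        if not children:
            return node  # leaf: emit its name
        return "(" + ",".join(to_newick(child, node) for child in children) + ")"

    # With the root on the outgroup branch, dropping the outgroup leaves the
    # attachment node as the root of the ingroup tree.
    return to_newick(attachment, outgroup) + ";"


# Unrooted example: leaves A, B, C and outgroup OG; x and y are internal nodes.
example = {
    "A": {"x"}, "B": {"x"}, "C": {"y"}, "OG": {"y"},
    "x": {"A", "B", "y"}, "y": {"x", "C", "OG"},
}
rooted = reroot_and_drop_outgroup(example, "OG")  # "(C,(A,B));"
```

In the example, OG attaches at node y, so the ingroup root separates C from the pair (A,B).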
11. One avenue we explored was to use ClusterPicker to guide / support clade definitions. The script
generateClusterAnnotations.js works on the BTV_OUTG_CODON_* alignments and S*_og_rerooted.tree files to generate
alternative cluster annotations using different ClusterPicker parameters.
12. Step 11 also saves the display trees.
13. The team reviews the trees and decides on clades. Reference sequences are also selected for each clade.
14. The results of these deliberations are captured in a JSON file, json/S*_clade_structure_and_refs.json.
- Each segment's reference sequences are stored in a separate source: ncbi-s1-refseqs, etc.
- An NCBI importer module config can be generated from the segment's JSON file using
the generateRefSeqNcbiImporter function of module btvCladeStructureProcessor.
- This source can be saved to disk and loaded during the project build.
- The BTV_GENO_CODON_* alignments can be generated using the function generateGenotypingCodonAlignment.
This simply consists of copying over the rows of BTV_OUTG_CODON_* corresponding to the selected reference sequences.
The JSON file is processed during the project build:
- The function createGlueReferenceSequences creates the reference sequences from the relevant source.
- btvAddFeatureLocations.js adds feature locations to all the reference sequences.
- The function createAlignmentTree creates the segment alignment tree.
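
The row-copying idea behind the genotyping alignments in step 14 can be sketched in Python as follows. The JSON layout
shown (keys cladeName, referenceSequences, childClades) is a hypothetical schema invented for this illustration, not the
actual format of json/S*_clade_structure_and_refs.json, and the function names are likewise hypothetical. Alignments are
modelled as plain dictionaries from sequence ID to aligned row.

```python
import json

# Hypothetical clade structure; the real file's schema may differ.
EXAMPLE_JSON = """
{
  "cladeName": "root",
  "referenceSequences": [],
  "childClades": [
    {"cladeName": "A", "referenceSequences": ["ref1", "ref2"], "childClades": []},
    {"cladeName": "B", "referenceSequences": ["ref3"], "childClades": []}
  ]
}
"""


def collect_reference_ids(clade):
    """Walk the clade tree depth-first and gather all reference sequence IDs."""
    ids = list(clade.get("referenceSequences", []))
    for child in clade.get("childClades", []):
        ids.extend(collect_reference_ids(child))
    return ids


def copy_reference_rows(source_alignment, reference_ids):
    """Build a new alignment holding only the rows of the selected references."""
    missing = [ref for ref in reference_ids if ref not in source_alignment]
    if missing:
        raise KeyError(f"references missing from source alignment: {missing}")
    return {ref: source_alignment[ref] for ref in reference_ids}


reference_ids = collect_reference_ids(json.loads(EXAMPLE_JSON))
outgroup_codon = {"ref1": "ATGGCA", "other": "ATGGCC", "ref2": "ATGGCT", "ref3": "ATGGCG"}
genotyping_codon = copy_reference_rows(outgroup_codon, reference_ids)
```

Because rows are copied rather than realigned, the genotyping alignment stays in the same coordinate space as the
alignment it was derived from.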
15. btvGenerateS2ReferencePhylogeny.glue will generate the reference phylogeny from BTV_GENO_CODON_* as
trees/reference/S2_reference.tree using RAxML.
This includes some segment-specific outgroup rerooting.