-
Notifications
You must be signed in to change notification settings - Fork 3
/
README
323 lines (270 loc) · 13.5 KB
/
README
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
All documentation for HaplotypeTools can be found at:
https://github.com/rhysf/HaplotypeTools
HaplotypeTools is a set of tools that can phase DNAseq data into haplotypes
for the purposes of identifying recombination (crossovers between samples),
identifying parental isolates or clades, or various population genetic tests.
The tools require BioPerl and Bio::DB::HTS modules (installed via cpan or cpanm.)
Inputs are Samtools indexed BAM, VCF, and reference FASTA.
Prerequisites:
--------------
R (plyr, RColorBrewer)
Perl
BioPerl
Bio::DB::HTS
Hash::Merge
Samtools v0.1.10 or higher (samtools.sourceforge.net)
FastTree (http://www.microbesonline.org/fasttree/)
Getting started / example pipeline for phasing individual sample
----------------------------------------------------------------
git clone git@github.com:rhysf/HaplotypeTools.git
cd HaplotypeTools/
perl HaplotypeTools.pl -v <vcf> -b <sorted BAM> -u <VCF sample name> -f <reference.fasta>
perl util/VCF_phased_to_PIA.pl \
-v <vcf-Phased-m-4-c-90-r-10000.vcf> \
-f <reference.fasta>
perl util/VCF_phased_and_PIA_to_FASTA.pl \
-v <vcf-Phased-m-4-c-90-r-10000.vcf> \
-l <vcf-Phased-m-4-c-90-r-10000.vcf-PIA-p-1-c-10-s-all.tab> \
-r <reference.fasta>
perl util/FASTA_compare_sequences.pl \
-f <vcf-Phased-m-4-c-90-r-10000.vcf-PIA-p-1-c-10-s-all.tab_1.fasta> \
-a <vcf-Phased-m-4-c-90-r-10000.vcf-PIA-p-1-c-10-s-all.tab_2.fasta> \
-o a > summary
Example pipeline for phasing multi sample VCF
---------------------------------------------
perl HaplotypeTools.pl -v <multi sample vcf> \
-b <sorted BAMs (separated by comma)> \
-u <VCF sample names in order of input BAM files (separated by comma)> \
-f <reference.fasta>
perl util/VCF_phased_to_PIA.pl \
-v <vcf-Phased-m-4-c-90-r-10000.vcf> \
-f <reference.fasta>
[if PIA is only for subset of samples] -s <VCF sample names to restrict analysis to>
perl util/VCF_phased_and_PIA_to_FASTA.pl \
-v <vcf-Phased-m-4-c-90-r-10000.vcf> \
-l <vcf-Phased-m-4-c-90-r-10000.vcf-PIA-p-1-c-10-s-all.tab> \
-r <reference.fasta>
-u <sample name in VCF to pull haplotypes from>
perl util/FASTA_compare_sequences.pl \
-f <vcf-Phased-m-4-c-90-r-10000.vcf-PIA-p-1-c-10-s-<opt_u>.tab_1.fasta> \
-a <vcf-Phased-m-4-c-90-r-10000.vcf-PIA-p-1-c-10-s-<opt_u>.tab_2.fasta> \
-o a > summary
Example pipeline for comparing haplotypes to other genomes
----------------------------------------------------------
perl util/VCF_and_FASTA_to_consensus_FASTA.pl \
-v <sample1.vcf> \
-r <reference.fasta>
Will make your consensus genomes. Next, we need to make a tabular file in the
following format (Name_Type_Location.tab) E.g.:
Sample1 FASTA /dir/with/your/consensus/FASTA/sample1.vcf-WGS-s-n-i-n-n-N-consensus.fasta
Sample2 FASTA /dir/with/your/consensus/FASTA/sample2.vcf-WGS-s-n-i-n-n-N-consensus.fasta
Sample3 FASTA /dir/with/your/consensus/FASTA/sample3.vcf-WGS-s-n-i-n-n-N-consensus.fasta
...
And an optional tabular file for clades/lineages E.g.:
Sample1 clade1
Sample2 clade1
Sample3 clade2
perl util/VCF_phased_to_PIA.pl \
-v <phased_sample1.vcf>,<phased_sample2.vcf>,<phased_sample3.vcf> \
-f <reference.fasta>
perl util/VCF_phased_and_PIA_to_FASTA.pl \
-v <phased_sample1.vcf> \
-l <phased_sample1.vcf-plus_other_VCFs-PIA-p-1-c-10-s-all.tab> \
-r <reference.fasta>
perl util/Haplotype_placer.pl \
-p <phased_sample1-plus_other_VCFs-PIA-p-1-c-10-s-all.tab> \
-a <phased_sample1.vcf-plus_other_VCFs-PIA-p-1-c-10-s-all.tab_1.fasta> \
-b <phased_sample1.vcf-plus_other_VCFs-PIA-p-1-c-10-s-all.tab_2.fasta> \
-n <Name_Type_Location.tab>
For window plots as well
------------------------
perl util/Haplotype_placer.pl \
-p <phased_sample1-plus_other_VCFs-PIA-p-1-c-10-s-all.tab> \
-a <phased_sample1.vcf-plus_other_VCFs-PIA-p-1-c-10-s-all.tab_1.fasta> \
-b <phased_sample1.vcf-plus_other_VCFs-PIA-p-1-c-10-s-all.tab_2.fasta> \
-n <Name_Type_Location.tab> \
-d y \
-l <reference.fasta>
perl util/Windows_haplotypes_to_R_figure.pl -w HaplotypeTools_windows
Example pipeline to identify crossovers
---------------------------------------
perl util/VCF_phased_compare_to_VCF_phased.pl \
-a <phased_sample1.vcf> \
-b <phased_sample2.vcf> \
-o sample1_vs_sample2
or
perl util/VCF_phased_compare_to_VCF_phased.pl \
-a <phased.vcf> \
-c <sample ID 1 from VCF>
-d <sample ID 2 from VCF>
-o sample1_vs_sample2
Example pipeline using test data (chr1 Bd) to phase individual sample and plot haploytpe windows
------------------------------------------------------------------------------------------------
cd HaplotypeTools/example
perl ../HaplotypeTools.pl \
-v Hybrid-SA.vcf-chr1.vcf \
-b Hybrid-SA.vcf-chr1.bam \
-u Hybrid-SA-EC3 \
-f Hybrid-SA.vcf-chr1.fasta
[note] if you can submit jobs to a cluster, use options -g, -a and -q to speed this step up.
perl ../util/VCF_phased_to_PIA.pl \
-v Hybrid-SA.vcf-chr1.vcf-Phased-m-4-c-90-r-10000.vcf \
-f Hybrid-SA.vcf-chr1.fasta
perl ../util/VCF_phased_and_PIA_to_FASTA.pl \
-v Hybrid-SA.vcf-chr1.vcf-Phased-m-4-c-90-r-10000.vcf \
-l Hybrid-SA.vcf-chr1.vcf-Phased-m-4-c-90-r-10000.vcf-PIA-p-1-c-10-s-all.tab \
-r Hybrid-SA.vcf-chr1.fasta
perl ../util/FASTA_compare_sequences.pl \
-f Hybrid-SA.vcf-chr1.vcf-Phased-m-4-c-90-r-10000.vcf-PIA-p-1-c-10-s-all.tab_1.fasta \
-a Hybrid-SA.vcf-chr1.vcf-Phased-m-4-c-90-r-10000.vcf-PIA-p-1-c-10-s-all.tab_2.fasta \
-o a > summary
perl ../util/Haplotype_placer.pl \
-p Hybrid-SA.vcf-chr1.vcf-Phased-m-4-c-90-r-10000.vcf-PIA-p-1-c-10-s-all.tab \
-a Hybrid-SA.vcf-chr1.vcf-Phased-m-4-c-90-r-10000.vcf-PIA-p-1-c-10-s-all.tab_1.fasta \
-b Hybrid-SA.vcf-chr1.vcf-Phased-m-4-c-90-r-10000.vcf-PIA-p-1-c-10-s-all.tab_2.fasta \
-n Name_Type_Location_consensus_genomes.tab \
-d y \
-l Hybrid-SA.vcf-chr1.fasta
perl util/Windows_haplotypes_to_R_figure.pl \
-w HaplotypeTools_windows -s 7000
[note] This should produce windows_plot-for-Haplotypes.R.pdf. For reference, an example copy is saved as Example-windows_plot-for-Haplotypes.R.pdf
HaplotypeTools Description
--------------------------
HaplotypeTools is a set of scripts designed to phase aligned WGS data, whereby
haplotypes are constructed from sufficient numbers of reads overlapping two or
more bi-allelic heterozygous position. Both haplotypes will need to be recovered
according to a cutoff (-c) in HaplotypeTools.pl. HaplotypeTools.pl can take a
lond time when not submitted in parallel to a cluster, so where available it should
be run in parallel (via settings -g, -a and -q). Once phased, the VCF can be compared
to consensus genomes belonging to relatives in order to identify possible parental
strains, or can be compared to other phased VCF's (or if multi VCF, samples within)
to look for crossovers.
Genomic regions that are phased in >=1 VCF (or multiple samples within a VCF)
are called 'phased in any/all; PIA'. Phased in any (-p 1) is the default, while
phased in all (-p 2) will output only regions phased in every sample.
PIA in either setting serve as an intermediate file needed to
extract haplotypes in FASTA format, as well as HaplotypePlacer. For comparing
haplotypes to consensus genomes, phased in any (-p 1) is best, as it will maximise
the number of genomic regions, or phased in just a single sample for fewer sites
that are most informative to that sample. Some applications may require identifying
regions that are phased in 2+ samples, in which case phased in all (-p 2) should
be used.
Haplotype_placer.pl will compare haplotypes from a phased VCF and consensus
genomes for other isolates, and iteratively construct trees using FastTree to
identify the nearest relative of each haplotype, which can be useful for identifying
parents of recombinants or hybrids, as well as visualising breakpoints. The output
includes both summaries and window files that can be plotted using the script XXX.
VCF_phased_compare_to_VCF_phased.pl can be used to compare two phased VCFs and
directly identify genomic regions that contain crossovers between them.
Individual script details:
--------------------------
HaplotypeTools parameters are shown below, followed by their default settings given in [].
Files are highlighted in <>.
HaplotypeTools.pl
Parameters: -v <VCF>
-b <BAM (sorted) separated by comma if phasing multiple samples in a multi VCF>
-f <reference fasta>
-u VCF sample names in order of input BAM files (separated by comma if phasing multiple samples)
Optional: -c Cut-off percent reads supporting phase group [90]
-m Minimum read depth overlapping two heterozygous SNPs [4]
-r Max phase length [10000]
-s Steps (1=process VCF, 2=process BAM, 3=assign read info to VCF, 4=validate and assign phase groups, 5=concatenate) [12345]
Parallel: -g Run commands on the grid (y/n) [n]
-a Platform (UGER, LSF, GridEngine) [UGER]
-q Queue name [short]
Outputs: -o Output folder for tmp files [opt_v-HaplotypeTools-phased-r-opt_r]
-p Phased VCF [opt_v-Phased-m-opt_m-c-opt_c-r-opt_r.vcf]
-y Phased summary [opt_v-Phased-m-opt_m-c-opt_c-r-opt_r-summary.tab]
FASTA_compare_sequences.pl
Parameters: -f <FASTA>
Optional: -o Output (p=pairwise within opt_f, s=summary of pairwise within opt_f, a=additional FASTA: 1:1 order) [p]
-g Exclude if gaps percent is greater than this [100]
-a <Additional FASTA in same order> []
Haplotype_list_count.pl <haplotypes list> <full sequence file>
Haplotype_placer.pl
Parameters: -p <phased in any/all; PIA>
-a <Haplotype file 1 FASTA>
-b <Haplotype file 2 FASTA>
-n <Name tab Type tab Location for consensus genomes>
Optional: -f Folder for output FASTA and Trees [HaplotypeTools_output]
-c Clades/Lineages/Metadata to cluster isolates by. Format (Name tab Clades/Lineages/Metadata) []
-t Location of FastTree [/seq/annotation/bio_tools/FastTree/current/FastTreeMP]
-m Mimimum length of haplotypes [100]
-v Verbose (y/n) [n]
Windows: -d Calculate windows (y/n) [n]
-l <Reference FASTA> []
-z Window length [10000]
Output: -f Output folder for genomic FASTA files and Trees [HaplotypeTools_output]
-s Output summary [HaplotypeTools_summary]
-w Output windows [HaplotypeTools_windows]
Haplotype_FASTA_files_to_compare_to_IRMS_het_sites.pl
Parameters: -a <PIA_1 FASTA>
-b <PIA_2 FASTA>
-c <IRMS HET.fasta>
-d <IRMS HET.details>
Optional: -p print to stdout (s) or (f) [s]
-o File to print to if opt_p=f <opt_a-and-2.accuracy>
-z Verbose for testing (y/n) [n]
VCF_and_FASTA_to_consensus_FASTA.pl
Parameters: -v <VCF>
-r <reference FASTA>
Optional: -i Include homozygous indels (y/n) [n]
-h Include bi-allelic heterozygous positions if ref base not included (y/n) [y]
-a For heterozygous positions, use ambiguity code (a) or first position for haploid consensus (f) [f]
-n Character for ambiguous [N]
-s Restrict to only this supercontig [n]
-t Restrict to only this isolate [n]
Notes: Prints to opt_v-isolate-name-consensus.fasta
VCF_phase.pl
Parameters: -v <VCF from step 1 of HaplotypeTools>
-a <alignment file from step 2 of HaplotypeTools>
Optional: -s Sample number
Output: -o Output [opt_v-phased-opt_s]
VCF_phase_haploid.pl <VCF file 1> > outfile.vcf
VCF_phased_and_PIA_to_FASTA.pl
Parameters: -v <VCF file>
-l <PIA file (contig tab start tab stop)>
-r <Reference FASTA>
Optional: -u Sample name in VCF to pull haplotypes from [WGS]
-p Printing option (o=outfile, s=split to opt_l_1 and _2) [s]
-e Exclude printing if haplotypes are identical (y/n) [n]
-m min length [10]
-i Include indels (y/n) [n]
-z Verbose for testing (y/n) [n]
VCF_phased_append_name_to_phase_group.pl <VCF file 1> > outfile.vcf
VCF_phased_calculate_haplotype_lengths.pl <Phased VCF file (with optional additional VCFs separated by comma)>
Optional: -s Sample name (with optional additional sample names separated by comma)
-e Output results for every sample (y/n) [n]
VCF_phased_compare_to_VCF_phased.pl
Parameters: -a <VCF file 1>
-b <VCF file 2>
or
-a <VCF>
-c Sample ID 1
-d Sample ID 2
Optional: -o Outfile extension [HaplotypeTools_VCF1_vs_VCF2]
VCF_phased_to_PIA.pl
Parameters: -v <VCF's (separated by comma)>
-f <reference FASTA>
Optional: -c Length cut-off for minimum haplotype to be considered [10]
-p Phased in any (1), Phased in all (2) [1]
-s Sample names to restrict analysis too (separated by comma)
-t Phase Tag (HaplotypeTools uses PID, Whatshap uses PS or HP etc) [PID]
Output: -u Summary [opt-v-PIA-p-Opt_p-c-Opt_c-s-Opt_s.summary]
-o Output [opt-v-PIA-p-Opt_p-c-Opt_c-s-Opt_s.tab]
VCF_phased_validate_and_assign_phase_groups.pl
Parameters: -v <VCF-phased-Sample_Number (from VCF_phase.pl)>
Optional: -c Cut-off percent reads supporting phase group [90]
-m Minimum read depth overlapping two heterozygous SNPs [4]
-z Verbose for error checking (y/n) [n]
Output: -o Output VCF [opt_v-and-assigned.tab]
-l Tallies for percent of reads agreeing with phases [opt_v-Tally-agree-disagree-opt_m-MD-opt_c-pc_cutoff.tab]
Windows_haplotypes_to_R_figure.pl
Parameters: -w <Windows haplotypes dataframe>
Optional -s Scaling factor for width (nucleotides per mm) [20000]
-h Scaling factor for height (genome per mm) [70]
-r Resolution (The nominal resolution in ppi) [100]
-c Minimum contig length to include label [1000000]
-g Optional size of axis text [2]
-l Legend size [2]
-x X-min [0]