-
Notifications
You must be signed in to change notification settings - Fork 0
/
qcreport.Rmd
260 lines (220 loc) · 12.3 KB
/
qcreport.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
---
title: "Variant calling quality control report"
output: html_document
params:
outdir: "/opt/data/db/sra/jx_161118_100samples/oprimer"
primer_file: "/opt/data/db/sra/jx_161118_100samples/memo/primers.fa"
gene_file: ""
---
## 原始数据质量控制
测序得到下机序列数据为FASTQ格式,如同以下样子:
```
@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
CCCCCGGGG#9BB<DFGGGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGGGGFFGFFFFGG
```
第一行是一长串序列编号,记录了测序序列的一些信息;第二行是序列,为测序得到的ATCGN碱基信息;第三行是分隔符加号;第四行是对应测序碱基的质量分数,质量分数通过ASCII编码转换为字符串,方便记录和代码计算。如下是对应ASCII字符与质量分数的对应关系。
```
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJ
0.2......................26...31........41
```
质量分数Q是对应碱基检测出错概率p的一个对应数字,对应公式如下:
\[
Q = -10 Log_{10}^p
\]
```{r, include=FALSE}
library(jsonlite)
library(dplyr)
library(knitr)
library(kableExtra) # https://cran.r-project.org/web/packages/kableExtra/vignettes/awesome_table_in_html.html
library(ggplot2)
library(Biostrings)
library(reshape)
library(openxlsx)
options(stringsAsFactors = F)
wd <- getwd()
fastpdir <- paste(params$outdir,"fastp/",sep = "/")
samples <- dir(path = fastpdir)
primers <- readDNAStringSet(params$primer_file)
all_primers <- sub("\\_3p|\\_5p|\\_fp|\\_rp", "", names(primers)) %>% unique()
# genes <- readLines(params$gene_file)
wb <- createWorkbook(creator="Acebiox Inc.")
```
本次数据共计`r length(samples)`个样本。原始数据统计如下,同附件工作表1:
```{r echo=FALSE, results='asis'}
compress <- function(tx) {
tx <- as.numeric(gsub("\\,", "", tx))
int <- c(1e-4, 1, 1e3, 1e6, 1e9, 1e12)
div <- findInterval(tx, int)
divisor <- c(1, 1e-2, 1, 1e3, 1e6, 1e9, 1e12)
paste(round( tx/divisor[div+1], 2), c("","%","", "K","M","B","T")[div+1] )
}
quality_curves <- data.frame()
summary_before_filtering <- data.frame()
summary_after_filtering <- data.frame()
summary_filtering_result <- data.frame()
for (sample in samples) {
json <- fromJSON(paste(fastpdir, sample, "/", sample, ".json", sep = ""))
quality_curves <- rbind(quality_curves,
data.frame(position = 1:length(json$read1_before_filtering$quality_curves$mean),
quality = json$read1_before_filtering$quality_curves$mean,
sample))
summary_before_filtering <- rbind(summary_before_filtering, data.frame(json$summary$before_filtering))
summary_after_filtering <- rbind(summary_after_filtering, data.frame(json$summary$after_filtering))
summary_filtering_result <- rbind(summary_filtering_result, data.frame(json$filtering_result))
}
rownames(summary_before_filtering) <- samples
summary_before_filtering <- t(summary_before_filtering[,c(1,2,5:7)])
rownames(summary_after_filtering) <- samples
summary_after_filtering <- t(summary_after_filtering[,c(1,2,5:7)])
rownames(summary_filtering_result) <- samples
summary_filtering_result <- t(summary_filtering_result)
summary_tbl <- rbind(summary_before_filtering, summary_after_filtering, summary_filtering_result)
rownames(summary_tbl) <- gsub(pattern = "_", replacement = " ", x = rownames(summary_tbl))
for(i in 1:nrow(summary_tbl)) {
summary_tbl[i,] <- compress(summary_tbl[i,])
}
addWorksheet(wb, sheetName = "preprocess summary")
writeData(wb, sheet = 1, startRow = 1, x = "Before filtering")
writeDataTable(wb, sheet = 1, startRow = 2, x = as.data.frame(summary_before_filtering), colNames = T, rowNames = T)
writeData(wb, sheet = 1, startRow = 8, x = "After filtering")
writeDataTable(wb, sheet = 1, startRow = 9, x = as.data.frame(summary_after_filtering), colNames = T, rowNames = T)
writeData(wb, sheet = 1, startRow = 15, x = "Filtering result")
writeDataTable(wb, sheet = 1, startRow = 16, x = as.data.frame(summary_filtering_result), colNames = T, rowNames = T)
setColWidths(wb, sheet = 1, widths = "auto", cols = 1:3)
kable(summary_tbl, caption = "Quality control summary", format="html", digits=0) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F) %>%
group_rows("Before filtering", 1, 5) %>%
group_rows("After filtering", 6, 10) %>%
group_rows("Filtering result", 11, 13)
```
```{r echo=FALSE, results='asis', fig.align="center", fig.cap="Before filtering quality curves"}
ggplot(data=quality_curves, aes(x=position, y=quality, color=sample)) +
geom_line() +
theme_bw()# + theme(legend.position="none")
```
## 扩增片段质量控制
基于片段扩增的测序是针对引物扩增片段进行测序,通过检测引物序列可以检测测序数据中,各个扩增片段的分布情况。此次实验共设计```r length(primers)/2```对PCR引物,检测```r length(genes)```个目标基因(```r paste(genes, sep=", ")```)。具体共计结果如下表,详见附件工作表2-4:
```{r echo=FALSE, results='asis'}
cutdir <- paste(params$outdir,"cutprimers/",sep = "/")
dimer_df <- data.frame()
nsa_df <- data.frame()
ps_df <- data.frame()
bad_df <- data.frame()
for (sample in samples) {
dimer <- read.table(paste(cutdir, sample, "/", "dimer.txt", sep = ""), sep = "\t",header = T)
colnames(dimer)[2] <- sample
if (nrow(dimer_df) == 0){
dimer_df <- dimer
} else {
dimer_df <- merge(dimer_df, dimer, all=T)
}
nsa <- read.table(paste(cutdir, sample, "/", "nsa.txt", sep = ""), sep = "\t",header = T)
colnames(nsa)[2] <- sample
if (nrow(nsa_df) == 0){
nsa_df <- nsa
} else {
nsa_df <- merge(nsa_df, nsa, all=T)
}
ps <- read.table(paste(cutdir, sample, "/", "ps.txt", sep = ""), sep = "\t",header = T)
# ps$Primer <- sub("\\_3p|\\_5p|\\_fp|\\_rp", "",ps[,1:2],ignore.case = T)
# ps <- ps[!duplicated(ps$Primer),]
colnames(ps)[3] <- sample
if (nrow(ps_df) == 0){
ps_df <- ps[,1:3]
} else {
ps_df <- merge(ps_df, ps[,1:3], all=T)
}
if (nrow(dimer) > 0) {
dimer_primers <- strsplit(dimer$Primer.dimer, " & ") %>% unlist() %>% sub(pattern = "\\_3p|\\_5p|\\_fp|\\_rp", replacement = "") %>% unique()
} else {
dimer_primers <- character()
}
if (nrow(nsa) > 0) {
nsa_primers <- strsplit(nsa$NSA.pair, " & ") %>% unlist() %>% sub(pattern = "\\_3p|\\_5p|\\_fp|\\_rp", replacement = "") %>% unique()
} else {
nsa_primers <- character()
}
ps_primers <- sub("\\_3p|\\_5p|\\_fp|\\_rp", "",unlist(ps[,1:2]),ignore.case = T)
dimer_nsa_primers <- setdiff(union(dimer_primers, nsa_primers), ps_primers)
not_found_primers <- setdiff(all_primers, union(ps_primers,union(dimer_primers, nsa_primers)))
bad <- data.frame(Status=c("only dimer/nsa", "primer not found"),sample=c(paste(dimer_nsa_primers, collapse = ","),paste(not_found_primers, collapse = ",")))
colnames(bad)[2] <- sample
if (nrow(bad_df) == 0){
bad_df <- bad
} else {
bad_df <- merge(bad_df, bad, all=T)
}
}
rownames(dimer_df) <- dimer_df[,1]
dimer_df <- dimer_df[,-1]
dimer_df <- dimer_df[order(apply(dimer_df,1,sum,na.rm=T),decreasing = T),]
dimer_df <- dimer_df[apply(dimer_df,1,max,na.rm=T) >= 100,]
rownames(nsa_df) <- nsa_df[,1]
nsa_df <- nsa_df[,-1]
nsa_df <- nsa_df[order(apply(nsa_df,1,sum,na.rm=T),decreasing = T),]
nsa_df <- nsa_df[apply(nsa_df,1,max,na.rm=T) >= 100,]
rownames(ps_df) <- paste(ps_df[,1],ps_df[,2],sep = "-")
ps_df <- ps_df[,-(1:2)]
ps_df <- ps_df[order(rownames(ps_df)),]
rownames(bad_df) <- bad_df[,1]
bad_df <- bad_df[,-1]
is_paired <- sapply(strsplit(rownames(ps_df), "_|-"), function(x){return(x[1] == x[3] & x[2] == "5p")})
ps_correct_df <- ps_df[is_paired,]
amplicon_tbl <- rbind(dimer_df, nsa_df, ps_correct_df, bad_df)
ps_df$is_paired <- is_paired
addWorksheet(wb, sheetName = "dimer summary")
writeDataTable(wb, sheet = 2, x = dimer_df, colNames = T, rowNames = T)
setColWidths(wb, sheet = 2, widths = "auto", cols = 1:3)
addWorksheet(wb, sheetName = "NSA summary")
writeDataTable(wb, sheet = 3, x = nsa_df, colNames = T, rowNames = T)
setColWidths(wb, sheet = 3, widths = "auto", cols = 1:3)
addWorksheet(wb, sheetName = "product summary")
writeDataTable(wb, sheet = 4, x = ps_df, colNames = T, rowNames = T)
setColWidths(wb, sheet = 4, widths = "auto", cols = 1:3)
options(knitr.kable.NA = ".")
kable(amplicon_tbl, caption = "Amplicon statistics", format="html") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F) %>%
group_rows("Dimer (at least one sample >= 100)", 1, nrow(dimer_df)) %>%
group_rows("Non-specific amplicon (at least one sample >= 100)", nrow(dimer_df)+1, nrow(dimer_df)+nrow(nsa_df)) %>%
group_rows("Amplicon product stats", nrow(dimer_df)+nrow(nsa_df)+1, nrow(dimer_df)+nrow(nsa_df)+nrow(ps_correct_df)) %>%
group_rows("Failed primer stats", nrow(dimer_df)+nrow(nsa_df)+nrow(ps_correct_df)+1, nrow(amplicon_tbl))
```
## 测序数据序列比对
将测序数据经过接头序列清理、低质量碱基清理[^1]、引物序列清理[^2]后,将序列与参考序列进行比对。比对使用的是长序列(大于70碱基)比对工具minimap2[^3]。对目标基因的所有编码外显子区域进行reads覆盖统计,结果见下表,同附件工作表5:
```{r echo=FALSE, results='asis'}
coverage_tbl <- read.table(paste(params$outdir, "/coverage.txt", sep = ""), sep = "\t",header = T, comment.char = "")
coverage_tbl$start <- coverage_tbl$start + 1
gene_pos_tbl <- coverage_tbl[coverage_tbl$sample == coverage_tbl$sample[1], c("X.chrom", "start", "end", "gene")]
addWorksheet(wb, sheetName = "coverage summary")
writeDataTable(wb, sheet = 5, x = coverage_tbl, colNames = T, rowNames = F)
setColWidths(wb, sheet = 5, widths = "auto", cols = 1:11)
kable(cast(coverage_tbl, sample~gene, value = "percentcovered"), caption = "Sample / Gene percent covered", format="html") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F)
```
## 突变发现与分析
根据测序数据比对的结果,以参考基因组(*Homo sapiens* assembly hg38)为模板进行突变发现与分析。突变分析使用Strelka2 Small Variant Caller[^4]。在目标基因区域进行突变发现,并对潜在突变做质控统计,标记数据质量较差的突变结果——如深度太低(read depth小于3),基因分型分数过低等。一下为第一个样本的突变结果,全部结果详见附件工作表6-7:
```{r echo=FALSE, results='asis'}
wide_tbl <- read.table(paste(params$outdir, "/variant/results/variants/wide.txt", sep = ""), sep = "\t",header = T)
long_tbl <- read.table(paste(params$outdir, "/variant/results/variants/long.txt", sep = ""), sep = "\t",header = T)
gene <- ""
for (i in 1:nrow(wide_tbl)) {
gene[i] <- gene_pos_tbl$gene[gene_pos_tbl$X.chrom == wide_tbl$CHROM[i] & gene_pos_tbl$start - 200 <= wide_tbl$POS[i] & gene_pos_tbl$end + 200 >= wide_tbl$POS[i]]
}
wide_tbl <- cbind(data.frame(GENE = gene), wide_tbl)
long_tbl <- cbind(data.frame(GENE = gene), long_tbl)
addWorksheet(wb, sheetName = "variant column")
writeDataTable(wb, sheet = 6, x = wide_tbl, colNames = T, rowNames = F)
setColWidths(wb, sheet = 6, widths = "auto", cols = 1:ncol(wide_tbl))
addWorksheet(wb, sheetName = "variant row")
writeDataTable(wb, sheet = 7, x = long_tbl, colNames = T, rowNames = F)
setColWidths(wb, sheet = 7, widths = "auto", cols = 1:ncol(long_tbl))
saveWorkbook(wb, paste(params$outdir,"qcreport.xlsx",sep = "/"), overwrite = T)
kable(wide_tbl[,1:11], caption = "Sample / Gene percent covered", format="html") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F)
```
[^1]: An ultra-fast all-in-one FASTQ preprocessor: [fastp](https://github.com/OpenGene/fastp), Shifu Chen, Yanqing Zhou, Yaru Chen, Jia Gu. fastp: an ultra-fast all-in-one FASTQ preprocessor. bioRxiv 274100; doi: https://doi.org/10.1101/274100
[^2]: [cutPrimers](https://github.com/ray1919/cutPrimers) trimming primer sequences from amplicon based NGS reads
[^3]: A versatile pairwise aligner for genomic and spliced nucleotide sequences https://lh3.github.io/minimap2, Li, H. (2017). Minimap2: fast pairwise alignment for long nucleotide sequences. arXiv:1708.01492
[^4]: Kim, S., Scheffler, K. et al. (2017) Strelka2: Fast and accurate variant calling for clinical sequencing applications. bioRxiv doi: 10.1101/192872