From 0c3f03c9fbe2b4ef4455e802f97e0df8139c7dd9 Mon Sep 17 00:00:00 2001
From: sroener <40714954+sroener@users.noreply.github.com>
Date: Fri, 18 Aug 2023 11:42:48 +0200
Subject: [PATCH 01/55] refactor: minor textual changes
---
README.md | 24 ++++++++++++------------
1 file changed, 12 insertions(+), 12 deletions(-)
diff --git a/README.md b/README.md
index 52ac2cd..930df8d 100644
--- a/README.md
+++ b/README.md
@@ -1,6 +1,4 @@
-# cfDNA UniFlow - Unified Preprocessing Pipeline for cell-free DNA from liquid biopsies
-
-[![Snakemake](https://img.shields.io/badge/snakemake-≥6.4.1-brightgreen.svg)](https://snakemake.bitbucket.io)
+# cfDNA UniFlow: A unified preprocessing pipeline for cell-free DNA from liquid biopsies
cfDNA UniFlow is a unified, standardized, and ready-to-use workflow for processing WGS cfDNA samples from liquid biopsies. It includes essential steps for pre-processing raw cfDNA samples, quality control and reporting. Additionally, several optional utility functions like GC bias correction and estimation of copy number state are included. Figure S1 gives a detailed overview of the workflow.
@@ -10,7 +8,7 @@ cfDNA UniFlow is a unified, standardized, and ready-to-use workflow for processi
- Figure S1: Overview of cfDNA Uniflow. Functionalities are color coded by task. Red boxes represents rules for the automatic download of public resources. Grey boxes are optional steps. Blue boxes containt the core functionailty of cfDNA Uniflow. Green boxes are optional, but highly recommended steps and yellow boxes summarize the Quality Control and reporting steps.
+ Figure S1: Overview of cfDNA UniFlow. Functionalities are color coded by task. Red boxes represent rules for the automatic download of public resources. Grey boxes are optional steps. Blue boxes contain the core functionality of cfDNA UniFlow. Green boxes are optional, but highly recommended steps and yellow boxes summarize the Quality Control and reporting steps.
@@ -23,7 +21,7 @@ cfDNA UniFlow is a unified, standardized, and ready-to-use workflow for processi
## Table of Contents
-- [cfDNA UniFlow - Unified Preprocessing Pipeline for cell-free DNA from liquid biopsies](#cfdna-uniflow---unified-preprocessing-pipeline-for-cell-free-dna-from-liquid-biopsies)
+- [cfDNA UniFlow: A unified preprocessing pipeline for cell-free DNA from liquid biopsies](#cfdna-uniflow-a-unified-preprocessing-pipeline-for-cell-free-dna-from-liquid-biopsies)
- [1 Dependencies](#1-dependencies)
- [2 Setup](#2-setup)
- [Step 1: Obtain a copy of this workflow](#step-1-obtain-a-copy-of-this-workflow)
@@ -56,7 +54,7 @@ The only exception is NGmerge, a read merging and adapter removal program, which
### Step 1: Obtain a copy of this workflow
-1. Create a new github repository using this workflow [as a template](https://help.github.com/en/articles/creating-a-repository-from-a-template).
+1. Create a new GitHub repository using this workflow [as a template](https://help.github.com/en/articles/creating-a-repository-from-a-template).
2. [Clone](https://help.github.com/en/articles/cloning-a-repository) the newly created repository to your local system, into the place where you want to perform the data analysis.
### Step 2: Install Snakemake
@@ -125,7 +123,7 @@ In the quality control step, general post-alignment statistics and graphs are ca
### 3.3 Utility functionality
-In addition to the preprocessing and quality control functionality, cfDNA UniFlow contains some utility functions. The first is the widely used tool IchorCNA, which can be used for predicting copy number alteration (CNA) states across the genome. Further, it uses this information for estimating tumor fractions in cfDNA samples.
+In addition to the preprocessing and quality control functionality, cfDNA UniFlow contains some utility functions. The first is the widely used tool ichorCNA, which can be used for predicting copy number alteration (CNA) states across the genome. Further, it uses this information for estimating tumor fractions in cfDNA samples.
The second utility function is our [inhouse GC bias estimation method](https://github.com/kircherlab/cfDNA_GCcorrection). It can not only be used for estimating fragment length and GC-content dependent technical biases, but also includes the option of attaching correction values to the reads. These can be used downstream for a wide variety of signal extraction methods, while preserving the original read coverage patterns.
@@ -141,15 +139,15 @@ Finally, all results and summary statistics for the specified samples are aggreg
# Indexing the reference sequence (Requires 28N GB memory where N is the size of the reference sequence).
./bwa-mem2 index [-p prefix]
Where
- is the path to reference sequence fasta file and
- is the prefix of the names of the files that store the resultant index. Default is in.fasta.
+ is the path to the reference sequence FASTA file and
+ is the prefix of the names of the files that store the resultant index. Default is in.fasta.
```
-* bwa-mem2 mem uses around 4GB memory per thread
+* bwa-mem2 mem uses around 4GB memory per thread.
## 4 Quickstart guide
-Our goal in developing cfDNA UniFlow is to provide a scalable, configurable and easy-to-use workflow specifically tailored towards the processing of cfDNA samples. Users only need to provide sequencing information in FASTQ or BAM format and optionally modify the configuration file to their needs. Here we provide an example with a small input file for testing the workflows functionality.
+Our goal in developing cfDNA UniFlow is to provide a scalable, configurable, and easy-to-use workflow specifically tailored towards the processing of cfDNA samples. Users only need to provide sequencing information in FASTQ or BAM format and optionally modify the configuration file to their needs. Here we provide an example with a small input file for testing the workflow's functionality.
**Note:** For simplicity, we expect a Unix system for this guide.
@@ -174,7 +172,7 @@ There are two files that are used in this example:
- config/test-config.yaml
- config/test-samples.tsv
-Both files don't need to be edited and are configured for a quick functionality test.
+Neither file needs to be edited; both are preconfigured for a quick functionality test.
### 4.3 Executing the workflow
@@ -194,4 +192,6 @@ snakemake --use-conda --configfile config/test-config.yaml --cores $N
For cluster execution, read the guidelines in the [Snakemake documentation](https://snakemake.readthedocs.io/en/stable/executing/cluster.html).
+**Note:** Creating the index for bwa-mem2 requires 28N GB of memory, where N is the size of the reference sequence file in GB. More information can be found in the [documentation](https://github.com/bwa-mem2/bwa-mem2#usage) or [this issue](https://github.com/bwa-mem2/bwa-mem2/issues/111).
+
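The 28N GB figure scales linearly with the reference size; as a back-of-the-envelope check, a minimal sketch (the file path and helper name are illustrative, not part of the workflow):

```python
import os

def bwa_mem2_index_mem_gb(reference_fasta: str) -> float:
    """Rough memory estimate for `bwa-mem2 index`: 28 GB per GB of reference sequence."""
    ref_size_gb = os.path.getsize(reference_fasta) / 1e9
    return 28 * ref_size_gb

# A ~3.1 GB human reference (e.g. hg38) would need roughly 28 * 3.1 ≈ 87 GB of memory.
# print(f"{bwa_mem2_index_mem_gb('resources/reference/hg38.fa'):.0f} GB")
```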
From d749bea4fd5c1881c009c8ff1f7d587b9f830f52 Mon Sep 17 00:00:00 2001
From: sroener <40714954+sroener@users.noreply.github.com>
Date: Fri, 18 Aug 2023 16:23:34 +0200
Subject: [PATCH 02/55] feat: overhaul support for single end sequencing
---
config/example.config.yaml | 24 +++---
config/test-config.yaml | 22 +++---
workflow/rules/bam_to_fastq.smk | 10 ++-
workflow/rules/common.smk | 111 +++++++++++++++++++++------
workflow/rules/filtering.smk | 15 +++-
workflow/rules/trimming.smk | 12 +--
workflow/schemas/config.schema.yaml | 12 ++-
workflow/scripts/bwa-mem2_wrapper.py | 7 +-
8 files changed, 149 insertions(+), 64 deletions(-)
diff --git a/config/example.config.yaml b/config/example.config.yaml
index 2919d27..2fa3ff4 100644
--- a/config/example.config.yaml
+++ b/config/example.config.yaml
@@ -27,7 +27,7 @@ SEED: 42 # seed for increased reproducibility. Mainly used in GCbias estimation
### trimming ###
-trimming_algorithm: "NGmerge" #can be either NGmerge or trimmomatic
+PE_trimming_algorithm: "NGmerge" #can be either NGmerge or trimmomatic
#### NGmerge specific options ####
@@ -35,7 +35,7 @@ length-filter:
MINLEN: 30 # min lenght of reads in additional filter steps
#### trimmomatic specific options ####
-
+phred-quality-encoding: # three options: empty = automatic detection, phred-33, or phred-64
# Illuminaclip takes a fasta file with adapter sequences and removes them in the trimming step.
# The adapter_files option takes either the path to a custom file
@@ -66,19 +66,15 @@ trimmers:
### Mapping ###
-# This option lets you add unpaired/singleton reads in the mapping step that were filtered,
-# but are either not paired or not merged. Otherwise these reads are not further processed.
+# This option lets you add unmerged/singleton or single-end reads in the mapping step.
+# Unmerged or singleton reads are paired-end reads that were filtered by samtools fastq or NGmerge.
+# Single-end reads are from single-end libraries. These categories can be excluded for specialised analyses.
mapping:
- unmerged: True # default is True
- singleton: False # default is False
-
-### Utility ###
-
-utility:
- GCbias-plot: True
- GCbias-correction: True
- ichorCNA: True
-
+ paired_end:
+ unmerged: True # default is True. Reads not merged by NGmerge.
+ singleton: False # default is False. Reads that are from paired end libraries without a matching pair.
+ single_end:
+    SEreads: True # default is True. This option is essential for single-end libraries. Setting it to True for PE libraries has no effect on the output.
#### ichorCNA ####
diff --git a/config/test-config.yaml b/config/test-config.yaml
index 1c82140..005272e 100644
--- a/config/test-config.yaml
+++ b/config/test-config.yaml
@@ -27,7 +27,7 @@ SEED: 42 # seed for increased reproducibility. Mainly used in GCbias estimation
### trimming ###
-trimming_algorithm: "NGmerge" #can be either NGmerge or trimmomatic
+PE_trimming_algorithm: "NGmerge" #can be either NGmerge or trimmomatic
#### NGmerge specific options ####
@@ -36,6 +36,7 @@ length-filter:
#### trimmomatic specific options ####
+phred-quality-encoding: # three options: empty = automatic detection, phred-33, or phred-64
# Illuminaclip takes a fasta file with adapter sequences and removes them in the trimming step.
# The adapter_files option takes either the path to a custom file
@@ -66,18 +67,15 @@ trimmers:
### Mapping ###
-# This option lets you add unpaired/singleton reads in the mapping step that were filtered,
-# but are either not paired or not merged. Otherwise these reads are not further processed.
+# This option lets you add unmerged/singleton or single-end reads in the mapping step.
+# Unmerged or singleton reads are paired-end reads that were filtered by samtools fastq or NGmerge.
+# Single-end reads are from single-end libraries. These categories can be excluded for specialised analyses.
mapping:
- unmerged: True # default is True
- singleton: True # default is False
-
-### Utility ###
-
-utility:
- GCbias-plot: True
- GCbias-correction: True
- ichorCNA: True
+ paired_end:
+ unmerged: True # default is True. Reads not merged by NGmerge.
+ singleton: False # default is False. Reads that are from paired end libraries without a matching pair.
+ single_end:
+    SEreads: True # default is True. This option is essential for single-end libraries. Setting it to True for PE libraries has no effect on the output.
diff --git a/workflow/rules/bam_to_fastq.smk b/workflow/rules/bam_to_fastq.smk
index ef45898..f284050 100644
--- a/workflow/rules/bam_to_fastq.smk
+++ b/workflow/rules/bam_to_fastq.smk
@@ -7,13 +7,14 @@ rule bam_to_fastq:
bam=lambda wildcards: samples["bam"][wildcards.SAMPLE],
bai=lambda wildcards: samples["bam"][wildcards.SAMPLE] + ".bai",
output:
- r1=temp("results/{ID}/fastq/{SAMPLE}_R1.fastq.gz"),
- r2=temp("results/{ID}/fastq/{SAMPLE}_R2.fastq.gz"),
- s1=temp("results/{ID}/fastq/{SAMPLE}_single_read.fastq.gz"),
+ r1=temp("results/{ID}/fastq/{SAMPLE}_R1.fastq.gz"), # Read 1 in Paired End Sequencing
+ r2=temp("results/{ID}/fastq/{SAMPLE}_R2.fastq.gz"), # Read 2 in Paired End Sequencing
+ s1=temp("results/{ID}/fastq/{SAMPLE}_PEsingleton.fastq.gz"), # Singletons (paired, but only one read present) in Paired End Sequencing
+ SR=temp("results/{ID}/fastq/{SAMPLE}_single_read.fastq.gz"), # File for Single End Sequencing (flags for both R1 and R2 set, or both unset)
log:
"results/logs/{ID}/bam_to_fastq/{SAMPLE}.log",
params:
- TMPDIR=config["TMPDIR"],
+ TMPDIR=config["TMPDIR"]+"{SAMPLE}",
conda:
"../envs/cfDNA_prep.yaml"
threads: 32
@@ -24,4 +25,5 @@ rule bam_to_fastq:
-1 {output.r1} \
-2 {output.r2} \
-s {output.s1} \
+ -0 {output.SR} \
)2> {log}"""
diff --git a/workflow/rules/common.smk b/workflow/rules/common.smk
index ff773ea..db1e165 100644
--- a/workflow/rules/common.smk
+++ b/workflow/rules/common.smk
@@ -114,6 +114,26 @@ def get_trimming_input(wildcards):
f"Please check the your sample sheet."
)
+def get_SEtrimming_input(wildcards):
+    """Return the FASTQ file(s) to use as single-end trimming input, based on the sample sheet."""
+    sample = wildcards.SAMPLE
+ bam = samples.loc[sample].loc["bam"]
+ fq1 = samples.loc[sample].loc["fq1"]
+ fq2 = samples.loc[sample].loc["fq2"]
+
+ if ".bam" in bam.lower():
+ return ["results/{ID}/fastq/{SAMPLE}_single_read.fastq.gz"]
+ elif ".fastq" in fq1.lower() and ".fastq" in fq2.lower():
+ return ["results/{ID}/fastq/{SAMPLE}_single_read.fastq.gz"]
+ elif ".fastq" in fq1.lower() and not ".fastq" in fq2.lower():
+ return [fq1]
+ elif ".fastq" in fq2.lower() and not ".fastq" in fq1.lower():
+ return [fq2]
+ else:
+ raise ValueError(
+ f"Something went wrong for sample: {sample} (bam: {bam}; fq1: {fq1}; fq2: {fq2})."
+            f"Please check your sample sheet."
+ )
+
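For illustration, how `get_SEtrimming_input` resolves its return value for a hypothetical sample sheet; a minimal sketch in which the wildcard object and sample entries are mocked and empty cells are written as empty strings:

```python
import pandas as pd
from types import SimpleNamespace

# Hypothetical sample sheet: one BAM-based sample and one single-end FASTQ sample.
samples = pd.DataFrame(
    {
        "sample": ["s1", "s2"],
        "bam": ["input/s1.bam", ""],
        "fq1": ["", "input/s2.fastq.gz"],
        "fq2": ["", ""],
    }
).set_index("sample", drop=False)

# get_SEtrimming_input(SimpleNamespace(SAMPLE="s1"))
#   -> ["results/{ID}/fastq/{SAMPLE}_single_read.fastq.gz"]   (single reads extracted from the BAM)
# get_SEtrimming_input(SimpleNamespace(SAMPLE="s2"))
#   -> ["input/s2.fastq.gz"]                                  (the single-end FASTQ itself)
```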
### get reference based on genome_build provided in samples.tsv
def get_reference(wildcards):
@@ -247,43 +267,88 @@ def get_trimmomatic_trimmers():
return trimmers
+def get_trimmomatic_extra():
+    """Checks if phred-quality-encoding is set to phred-33 or phred-64."""
+ trimmomatic_extra = ""
+ if config["phred-quality-encoding"] == "phred-33":
+ trimmomatic_extra = "-phred33"
+ elif config["phred-quality-encoding"] == "phred-64":
+ trimmomatic_extra = "-phred64"
+
+ return trimmomatic_extra
+
def get_mapping_input(wildcards):
sample = wildcards.SAMPLE
bam = samples.loc[sample].loc["bam"]
+ fq1 = samples.loc[sample].loc["fq1"]
+ fq2 = samples.loc[sample].loc["fq2"]
mapping_input = dict()
- trimming_algorithm = config["trimming_algorithm"]
- unmerged = config["mapping"]["unmerged"]
- singleton = config["mapping"]["singleton"]
+ # Paired End parameters
+ trimming_algorithm = config["PE_trimming_algorithm"]
+ unmerged = config["mapping"]["paired_end"]["unmerged"]
+ singleton = config["mapping"]["paired_end"]["singleton"]
+ # Single End parameter
+ SEreads = config["mapping"]["single_end"]["SEreads"]
if trimming_algorithm.lower() == "ngmerge":
- mapping_input[
- "reads"
- ] = "results/{ID}/NGmerge/merged/{SAMPLE}_merged.filtered.fastq.gz"
- if unmerged:
+ # these options are for Paired End sequencing libraries
+ if ".bam" in bam.lower() or (".fastq" in fq1.lower() and ".fastq" in fq2.lower()):
+ mapping_input[
+ "reads"
+ ] = "results/{ID}/NGmerge/merged/{SAMPLE}_merged.filtered.fastq.gz"
+ if unmerged:
+ mapping_input[
+ "noadapter_R1"
+ ] = "results/{ID}/NGmerge/nonmerged/{SAMPLE}_noadapters_1.filtered.fastq.gz"
+ mapping_input[
+ "noadapter_R2"
+ ] = "results/{ID}/NGmerge/nonmerged/{SAMPLE}_noadapters_2.filtered.fastq.gz"
+ if singleton and ".bam" in bam.lower():
+ mapping_input[
+ "PEsingleton"
+ ] = "results/{ID}/fastq/{SAMPLE}_PEsingleton.filtered.fastq.gz"
+ # this option is for reads in Single End sequencing libraries
+ if SEreads and (".fastq" in fq1.lower() and not ".fastq" in fq2.lower()):
mapping_input[
- "noadapter_R1"
- ] = "results/{ID}/NGmerge/nonmerged/{SAMPLE}_noadapters_1.filtered.fastq.gz"
+ "reads"
+ ] = "results/{ID}/trimmed/trimmomatic/{SAMPLE}_single_read.filtered.fastq.gz"
+ elif SEreads and (".fastq" in fq2.lower() and not ".fastq" in fq1.lower()):
mapping_input[
- "noadapter_R2"
- ] = "results/{ID}/NGmerge/nonmerged/{SAMPLE}_noadapters_2.filtered.fastq.gz"
- if singleton and ".bam" in bam.lower():
+ "reads"
+ ] = "results/{ID}/trimmed/trimmomatic/{SAMPLE}_single_read.filtered.fastq.gz"
+ elif SEreads and not (".fastq" in fq1.lower() and ".fastq" in fq2.lower()):
mapping_input[
"single_reads"
- ] = "results/{ID}/fastq/{SAMPLE}_single_read.filtered.fastq.gz"
+ ] = "results/{ID}/trimmed/trimmomatic/{SAMPLE}_single_read.filtered.fastq.gz"
+
elif trimming_algorithm.lower() == "trimmomatic":
- mapping_input["reads"] = [
- "results/{ID}/trimmed/trimmomatic/{SAMPLE}.1.fastq.gz",
- "results/{ID}/trimmed/trimmomatic/{SAMPLE}.2.fastq.gz",
- ]
- if all_data:
+ # these options are for Paired End sequencing libraries
+ if ".bam" in bam.lower() or (".fastq" in fq1.lower() and ".fastq" in fq2.lower()):
+ mapping_input["reads"] = [
+ "results/{ID}/trimmed/trimmomatic/{SAMPLE}.1.fastq.gz",
+ "results/{ID}/trimmed/trimmomatic/{SAMPLE}.2.fastq.gz",
+ ]
+ if unmerged:
+ mapping_input[
+ "noadapter_R1"
+ ] = "results/{ID}/trimmed/trimmomatic/{SAMPLE}.1.unpaired.fastq.gz"
+ mapping_input[
+ "noadapter_R2"
+ ] = "results/{ID}/trimmed/trimmomatic/{SAMPLE}.2.unpaired.fastq.gz"
+ # this option is for reads in Single End sequencing libraries
+ if SEreads and (".fastq" in fq1.lower() and not ".fastq" in fq2.lower()):
mapping_input[
- "noadapter_R1"
- ] = "results/{ID}/trimmed/trimmomatic/{SAMPLE}.1.unpaired.fastq.gz"
+ "reads"
+ ] = "results/{ID}/trimmed/trimmomatic/{SAMPLE}_single_read.filtered.fastq.gz"
+ elif SEreads and (".fastq" in fq2.lower() and not ".fastq" in fq1.lower()):
mapping_input[
- "noadapter_R2"
- ] = "results/{ID}/trimmed/trimmomatic/{SAMPLE}.2.unpaired.fastq.gz"
-
+ "reads"
+ ] = "results/{ID}/trimmed/trimmomatic/{SAMPLE}_single_read.filtered.fastq.gz"
+ elif SEreads and not (".fastq" in fq1.lower() and ".fastq" in fq2.lower()):
+ mapping_input[
+ "single_reads"
+ ] = "results/{ID}/trimmed/trimmomatic/{SAMPLE}_single_read.filtered.fastq.gz"
return mapping_input
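As a concrete example, the dictionary this function would return for a BAM-based paired-end sample under the NGmerge settings shown in the config above (unmerged: True, singleton: False, SEreads: True); a sketch, with the Snakemake wildcards left unexpanded:

```python
mapping_input = {
    # merged read pairs from NGmerge
    "reads": "results/{ID}/NGmerge/merged/{SAMPLE}_merged.filtered.fastq.gz",
    # pairs that NGmerge could not merge (unmerged: True)
    "noadapter_R1": "results/{ID}/NGmerge/nonmerged/{SAMPLE}_noadapters_1.filtered.fastq.gz",
    "noadapter_R2": "results/{ID}/NGmerge/nonmerged/{SAMPLE}_noadapters_2.filtered.fastq.gz",
    # single reads extracted from the BAM, trimmed by trimmomatic_se (SEreads: True)
    "single_reads": "results/{ID}/trimmed/trimmomatic/{SAMPLE}_single_read.filtered.fastq.gz",
}
```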
diff --git a/workflow/rules/filtering.smk b/workflow/rules/filtering.smk
index d2180fa..ebb715f 100644
--- a/workflow/rules/filtering.smk
+++ b/workflow/rules/filtering.smk
@@ -25,12 +25,23 @@ rule filter_noadapter:
shell:
'awk \'BEGIN {{FS = "\\t" ; OFS = "\\n"}} {{header = $0 ; getline seq ; getline qheader ; getline qseq ; if (length(seq) >= {params.min_RL}) {{print header, seq, qheader, qseq}}}}\' <( zcat {input.unfiltered} ) | gzip -c > {output.filtered}'
+rule filter_PE_singleton:
+ input:
+ unfiltered="results/{ID}/fastq/{SAMPLE}_PEsingleton.fastq.gz",
+ output:
+ filtered=temp("results/{ID}/fastq/{SAMPLE}_PEsingleton.filtered.fastq.gz"),
+ params:
+ min_RL=config["length-filter"]["MINLEN"],
+ conda:
+ "../envs/cfDNA_prep.yaml"
+ shell:
+ 'awk \'BEGIN {{FS = "\\t" ; OFS = "\\n"}} {{header = $0 ; getline seq ; getline qheader ; getline qseq ; if (length(seq) >= {params.min_RL}) {{print header, seq, qheader, qseq}}}}\' <( zcat {input.unfiltered} ) | gzip -c > {output.filtered}'
rule filter_single_read:
input:
- unfiltered="results/{ID}/fastq/{SAMPLE}_single_read.fastq.gz",
+ unfiltered="results/{ID}/trimmed/trimmomatic/{SAMPLE}_single_read.trimmed.fastq.gz",
output:
- filtered=temp("results/{ID}/fastq/{SAMPLE}_single_read.filtered.fastq.gz"),
+ filtered=temp("results/{ID}/trimmed/trimmomatic/{SAMPLE}_single_read.filtered.fastq.gz"),
params:
min_RL=config["length-filter"]["MINLEN"],
conda:
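The awk one-liner used by these filter rules keeps only FASTQ records whose sequence length is at least MINLEN; an equivalent sketch in Python for readers less familiar with awk (the function name, paths, and threshold are illustrative):

```python
import gzip

def filter_fastq_by_length(in_path: str, out_path: str, min_len: int = 30) -> None:
    """Write only FASTQ records with sequence length >= min_len to out_path."""
    with gzip.open(in_path, "rt") as fin, gzip.open(out_path, "wt") as fout:
        while True:
            record = [fin.readline() for _ in range(4)]  # header, sequence, '+', qualities
            if not record[0]:  # end of file
                break
            if len(record[1].rstrip("\n")) >= min_len:
                fout.writelines(record)

# filter_fastq_by_length("sample_PEsingleton.fastq.gz", "sample_PEsingleton.filtered.fastq.gz", 30)
```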
diff --git a/workflow/rules/trimming.smk b/workflow/rules/trimming.smk
index 62ec378..79a613e 100644
--- a/workflow/rules/trimming.smk
+++ b/workflow/rules/trimming.smk
@@ -71,7 +71,7 @@ rule trimmomatic_pe:
# list of trimmers (see manual)
trimmer=get_trimmomatic_trimmers(),
# optional parameters
- extra="",
+ extra=get_trimmomatic_extra(),
compression_level="-9",
threads: 32
# optional specification of memory usage of the JVM that snakemake will respect with global
@@ -81,15 +81,15 @@ rule trimmomatic_pe:
resources:
mem_mb=1024,
wrapper:
- "v1.1.0/bio/trimmomatic/pe"
+ "v2.3.2/bio/trimmomatic/pe"
rule trimmomatic_se:
input:
- "reads/{sample}.fastq.gz", # input and output can be uncompressed or compressed
+ unpack(get_SEtrimming_input)
output:
temp(
- "results/{ID}/trimmed/trimmomatic/{SAMPLE}.trimmed.fastq.gz",
+ "results/{ID}/trimmed/trimmomatic/{SAMPLE}_single_read.trimmed.fastq.gz",
),
log:
"logs/{ID}/trimmomatic/{SAMPLE}.log",
@@ -97,7 +97,7 @@ rule trimmomatic_se:
# list of trimmers (see manual)
trimmer=get_trimmomatic_trimmers(),
# optional parameters
- extra="",
+ extra=get_trimmomatic_extra(),
# optional compression levels from -0 to -9 and -11
compression_level="-9",
threads: 32
@@ -108,4 +108,4 @@ rule trimmomatic_se:
resources:
mem_mb=1024,
wrapper:
- "v1.1.0/bio/trimmomatic/se"
+ "v2.3.2/bio/trimmomatic/se"
diff --git a/workflow/schemas/config.schema.yaml b/workflow/schemas/config.schema.yaml
index ba1fe4c..48ab90a 100644
--- a/workflow/schemas/config.schema.yaml
+++ b/workflow/schemas/config.schema.yaml
@@ -1,4 +1,4 @@
-$schema: "http://json-schema.org/draft-04/schema#"
+$schema: "http://json-schema.org/draft-06/schema#"
description: snakemake configuration file
@@ -64,6 +64,14 @@ properties:
type: number
default: 30
description: min lenght of reads in additional filter steps
+ phred-quality-encoding:
+ type: ["string", "null"]
+ default:
+ enum:
+ -
+ - phred-33
+ - phred-64
+        description: three options for setting trimmomatic quality score encoding; empty = automatic detection, phred-33, or phred-64
trimmers:
Illuminaclip:
adapter_files:
@@ -219,7 +227,7 @@ required:
- hg19
- hg38
- TMPDIR
- - trimming_algorithm
+ - PE_trimming_algorithm
- length-filter
- trimmers
- mapping
diff --git a/workflow/scripts/bwa-mem2_wrapper.py b/workflow/scripts/bwa-mem2_wrapper.py
index 64c1e63..88d4400 100644
--- a/workflow/scripts/bwa-mem2_wrapper.py
+++ b/workflow/scripts/bwa-mem2_wrapper.py
@@ -36,6 +36,7 @@
non_merged_cmd = ""
singleton_cmd = ""
+singleEnd_cmd = ""
### it is okay, that all bwa-mem2 commands get the same number of CPUs, as they are run sequentially.
@@ -46,14 +47,18 @@
non_merged_cmd = f"bwa-mem2 mem -t {remaining_CPU} -R \"{{snakemake.params.RG}}\" {{snakemake.input.ref}} {{snakemake.input.noadapter_R1}} {{snakemake.input.noadapter_R2}}| grep -v \"^@\" || true ; "
# non_merged_cmd = "bwa-mem2 mem -t {snakemake.threads} -R \"{snakemake.params.RG}\" {snakemake.input.ref} {snakemake.input.noadapter_R1} {snakemake.input.noadapter_R2}| grep -v \"^@\" || true ; "
+if snakemake.input.get("PEsingleton"):
+ singleton_cmd = f"bwa-mem2 mem -t {remaining_CPU} -R \"{{snakemake.params.RG}}\" {{snakemake.input.ref}} {{snakemake.input.PEsingleton}} | grep -v \"^@\" || true ; "
+
if snakemake.input.get("single_reads"):
- singleton_cmd = f"bwa-mem2 mem -t {remaining_CPU} -R \"{{snakemake.params.RG}}\" {{snakemake.input.ref}} {{snakemake.input.single_reads}} | grep -v \"^@\" || true ; "
+ singleEnd_cmd = f"bwa-mem2 mem -t {remaining_CPU} -R \"{{snakemake.params.RG}}\" {{snakemake.input.ref}} {{snakemake.input.single_reads}} | grep -v \"^@\" || true ; "
#singleton_cmd = "bwa-mem2 mem -t {snakemake.threads} -R \"{snakemake.params.RG}\" {snakemake.input.ref} {snakemake.input.single_reads} | grep -v \"^@\" || true ; "
base_cmd = f"((bwa-mem2 mem -t {remaining_CPU} -R \"{{snakemake.params.RG}}\" {{snakemake.input.ref}} {{snakemake.input.reads}}; \
{non_merged_cmd}\
{singleton_cmd}\
+{singleEnd_cmd}\
) | samtools sort -O bam -@ {samtools_CPU} -o {{snakemake.output.mapped_reads}} - ) 2>{{snakemake.log}}"
#base_cmd = f"((bwa-mem2 mem -t {{snakemake.threads}} -R \"{{snakemake.params.RG}}\" {{snakemake.input.ref}} {{snakemake.input.reads}}; \
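The wrapper builds one shell pipeline whose optional parts collapse to empty strings when the corresponding inputs are absent, and the `grep -v "^@"` on the secondary streams drops their SAM headers so only the first bwa-mem2 call contributes one. A simplified sketch of that composition (paths, read group, and CPU counts are illustrative, not the wrapper's actual snakemake objects):

```python
remaining_CPU, samtools_CPU = 8, 4
rg = r"@RG\tID:s1\tSM:s1"

# Only the inputs present for this hypothetical sample produce a sub-command.
non_merged_cmd = (
    f'bwa-mem2 mem -t {remaining_CPU} -R "{rg}" ref.fa '
    's1_noadapters_1.fq.gz s1_noadapters_2.fq.gz | grep -v "^@" || true ; '
)
singleton_cmd = ""   # no PEsingleton input for this sample
singleEnd_cmd = ""   # no single_reads input for this sample

base_cmd = (
    f'((bwa-mem2 mem -t {remaining_CPU} -R "{rg}" ref.fa s1_merged.filtered.fq.gz; '
    f"{non_merged_cmd}{singleton_cmd}{singleEnd_cmd}"
    f") | samtools sort -O bam -@ {samtools_CPU} -o s1.bam - ) 2> s1.log"
)
print(base_cmd)
```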
From 5c6a821462f7b49c50f4e2f77c926ef2f0ee845b Mon Sep 17 00:00:00 2001
From: sroener <40714954+sroener@users.noreply.github.com>
Date: Fri, 18 Aug 2023 16:24:56 +0200
Subject: [PATCH 03/55] refactor: update environment
---
workflow/envs/GC_bias.yaml | 1 +
1 file changed, 1 insertion(+)
diff --git a/workflow/envs/GC_bias.yaml b/workflow/envs/GC_bias.yaml
index 7c864c0..744e921 100644
--- a/workflow/envs/GC_bias.yaml
+++ b/workflow/envs/GC_bias.yaml
@@ -4,6 +4,7 @@ channels:
- bioconda
dependencies:
- python=3.8
+ - bx-python
- pybedtools
- pybigwig
- bedtools
From e42fd2640399139c505bbb825c148b097dd1e273 Mon Sep 17 00:00:00 2001
From: sroener <40714954+sroener@users.noreply.github.com>
Date: Fri, 18 Aug 2023 16:33:59 +0200
Subject: [PATCH 04/55] feat: add signal extraction around target regions; add
GC correction overlay plots
---
config/example.config.yaml | 29 +
config/test-config.yaml | 37 +-
workflow/Snakefile | 1 +
workflow/rules/GC_bias.smk | 76 +++
workflow/rules/common.smk | 39 ++
workflow/rules/extract_signals.smk | 182 ++++++
workflow/schemas/config.schema.yaml | 94 ++-
workflow/scripts/extract_cfDNA_signals.py | 568 ++++++++++++++++++
workflow/scripts/plot_overlay_GCcorrection.py | 491 +++++++++++++++
9 files changed, 1504 insertions(+), 13 deletions(-)
create mode 100644 workflow/rules/extract_signals.smk
create mode 100755 workflow/scripts/extract_cfDNA_signals.py
create mode 100755 workflow/scripts/plot_overlay_GCcorrection.py
diff --git a/config/example.config.yaml b/config/example.config.yaml
index 2fa3ff4..4331a2f 100644
--- a/config/example.config.yaml
+++ b/config/example.config.yaml
@@ -25,6 +25,14 @@ TMPDIR: "./" # path to directory for writing TMP files
SEED: 42 # seed for increased reproducibility. Mainly used in GCbias estimation
+### Utility options ###
+
+utility:
+ GCbias-plot: True
+ GCbias-correction: True
+ ichorCNA: True
+ case-control-plot: True
+
### trimming ###
PE_trimming_algorithm: "NGmerge" #can be either NGmerge or trimmomatic
@@ -114,3 +122,24 @@ ichorCNA:
scStates: '"c(1,3)"'
txnE: 0.9999
txnStrength: 10000
+
+#### Signal extraction ####
+
+minRL: 120 # minimum read length for calculating WPS
+maxRL: 180 # maximum read length for calculating WPS
+bpProtection: 120 # bp protection for calculating WPS
+lengthSR: 76 # length of single reads, if used for calculating WPS
+
+#### Signal processing ####
+
+overlay_mode: "mean" # Can be either "mean" or "median". Sets overlay mode, specifying how regions should be aggregated for each sample.
+smoothing: True # Activates smoothing with Savitzky-Golay filter.
+smooth_window: 21 # Sets window size used for smoothing with Savitzky-Golay filter.
+smooth_polyorder: 2 # Sets order of polynomial used for smoothing with Savitzky-Golay filter.
+rolling: True # Activates trend removal with a rolling median filter.
+rolling_window: 1000 # Sets window size used in rolling median filter.
+flank_norm: True # Activates normalization by dividing the signals by the mean coverage in flanking intervals around the region of interest.
+flank: 2000 # Sets the size of the flanking intervals around the region of interest. Should be <= 0.5 of the extracted signals
+signal: "coverage" # can be either "coverage" or "WPS"
+display_window: [-1500,1500]
+aggregate_controls: True
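The signal-processing options above map onto standard operations (Savitzky-Golay smoothing, rolling-median trend removal, flank normalization). A minimal sketch of what each step does, assuming the parameter values above and a synthetic coverage signal; the exact order and implementation in the plotting scripts may differ:

```python
import numpy as np
import pandas as pd
from scipy.signal import savgol_filter

rng = np.random.default_rng(42)
signal = 100 + rng.normal(0, 5, size=5000)  # synthetic coverage over a 5 kb window

# smoothing / smooth_window / smooth_polyorder
smoothed = savgol_filter(signal, window_length=21, polyorder=2)

# rolling / rolling_window: remove long-range trend with a rolling median
trend = pd.Series(smoothed).rolling(1000, center=True, min_periods=1).median().to_numpy()
detrended = smoothed - trend

# flank_norm / flank: normalize by the mean coverage in the flanking intervals
flank = 2000
flank_mean = np.concatenate([signal[:flank], signal[-flank:]]).mean()
normalized = signal / flank_mean
```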
diff --git a/config/test-config.yaml b/config/test-config.yaml
index 005272e..843cf6e 100644
--- a/config/test-config.yaml
+++ b/config/test-config.yaml
@@ -1,7 +1,11 @@
# This file should contain everything to configure the workflow on a global scale.
# In case of sample based data, it should be complemented by a samples.tsv file that contains
# one row per sample. It can be parsed easily via pandas.
-samples: "config/test-samples.tsv" #"config/Delfi_samples.tsv"
+samples: "config/test-samples.tsv"
+
+regions: "config/regions_cfDNA-Uniflow.tsv"
+
+control_name: "healthy" # name of the control samples specified in the samples.tsv. Has to match the name in the status field.
### genome build specific options ###
@@ -25,6 +29,14 @@ TMPDIR: "/fast/users/roeners_c/scratch/reads/" # path to directory for writing T
SEED: 42 # seed for increased reproducibility. Mainly used in GCbias estimation
+### Utility options ###
+
+utility:
+ GCbias-plot: True
+ GCbias-correction: True
+ ichorCNA: True
+ case-control-plot: True
+
### trimming ###
PE_trimming_algorithm: "NGmerge" #can be either NGmerge or trimmomatic
@@ -78,6 +90,7 @@ mapping:
SEreads: True # default is True. This option is essential for Single End libraries. Setting to true in PE libraries has no effect on the output.
+### Utility parameters ###
#### ichorCNA ####
@@ -116,3 +129,25 @@ ichorCNA:
scStates: '"c(1,3)"'
txnE: 0.9999
txnStrength: 10000
+
+
+#### Signal extraction ####
+
+minRL: 120 # minimum read length for calculating WPS
+maxRL: 180 # maximum read length for calculating WPS
+bpProtection: 120 # bp protection for calculating WPS
+lengthSR: 76 # length of single reads, if used for calculating WPS
+
+#### Signal processing ####
+
+overlay_mode: "mean" # Can be either "mean" or "median". Sets overlay mode, specifying how regions should be aggregated for each sample.
+smoothing: True # Activates smoothing with Savitzky-Golay filter.
+smooth_window: 21 # Sets window size used for smoothing with Savitzky-Golay filter.
+smooth_polyorder: 2 # Sets order of polynomial used for smoothing with Savitzky-Golay filter.
+rolling: True # Activates trend removal with a rolling median filter.
+rolling_window: 1000 # Sets window size used in rolling median filter.
+flank_norm: True # Activates normalization by dividing the signals by the mean coverage in flanking intervals around the region of interest.
+flank: 2000 # Sets the size of the flanking intervals around the region of interest. Should be <= 0.5 of the extracted signals
+signal: "coverage" # can be either "coverage" or "WPS"
+display_window: [-1500,1500]
+aggregate_controls: True
diff --git a/workflow/Snakefile b/workflow/Snakefile
index d9cde71..f73c7d0 100644
--- a/workflow/Snakefile
+++ b/workflow/Snakefile
@@ -13,6 +13,7 @@ include: "rules/filtering.smk"
include: "rules/mapping.smk"
include: "rules/ichorCNA.smk"
include: "rules/GC_bias.smk"
+include: "rules/extract_signals.smk"
include: "rules/QualityControl.smk"
diff --git a/workflow/rules/GC_bias.smk b/workflow/rules/GC_bias.smk
index b71b76f..4e1ca32 100644
--- a/workflow/rules/GC_bias.smk
+++ b/workflow/rules/GC_bias.smk
@@ -154,3 +154,79 @@ rule correctGCbias:
{params.GC_weights} \
-o {output.gc_weighted_bam}
"""
+
+
+def get_uncorrected_signals(wildcards):
+ ID = wildcards.ID
+ status_name = wildcards.status_name
+ paths = expand("-us results/{s.ID}/signals/signal-uncorrected/{target_region}.{s.sample}-uncorrected_{signal}.{s.genome_build}.csv.gz",
+ s=list(samples.loc[(samples["ID"] == ID) & (samples["status"] == status_name)].itertuples()),
+ allow_missing=True,
+ target_region = wildcards.target_region,
+ signal = wildcards.signal,
+ )
+ return paths
+
+def get_corrected_signals(wildcards):
+ ID = wildcards.ID
+ status_name = wildcards.status_name
+ paths = expand("-cs results/{s.ID}/signals/signal-corrected/{target_region}.{s.sample}-corrected_{signal}.{s.genome_build}.csv.gz",
+ s=list(samples.loc[(samples["ID"] == ID) & (samples["status"] == status_name) ].itertuples()),
+ target_region = wildcards.target_region,
+ signal = wildcards.signal,
+ allow_missing=True,
+ )
+ return paths
+
+rule plot_GC_overlay:
+ input:
+ uncorrected_signals = lambda wc: expand(
+ "results/{ID}/signals/signal-uncorrected/{target_region}.{s.sample}-uncorrected_{signal}.{s.genome_build}.csv.gz",
+ s=list(samples.loc[(samples["ID"] == wc.ID) & (samples["status"] == wc.status_name) ].itertuples()),
+ allow_missing=True,
+ ),
+ corrected_signals = lambda wc: expand(
+ "results/{ID}/signals/signal-corrected/{target_region}.{s.sample}-corrected_{signal}.{s.genome_build}.csv.gz",
+ s=list(samples.loc[(samples["ID"] == wc.ID) & (samples["status"] == wc.status_name) ].itertuples()),
+ allow_missing=True,
+ ),
+ output:
+ "results/{ID}/signals/GCcorrection-plots/{target_region}.{status_name}-GCcorrected_{signal}.{GENOME}.png",
+ params:
+ uncorrected_samples = get_uncorrected_signals,
+ corrected_samples = get_corrected_signals,
+ name_regex = "'.*\.(.*?)-[un]*corrected_.*'", # matches everything between a '.' and '-[un]corrected_'
+ signal = "{signal}",
+ target = "{target_region}",
+ overlay_mode=config["overlay_mode"],
+ smoothing= "--smoothing" if config["smoothing"] else "",
+ smooth_window=config["smooth_window"],
+ smooth_polyorder=config["smooth_polyorder"],
+ rolling="--rolling" if config["rolling"] else "",
+ rolling_window=config["rolling_window"],
+ flank_norm="--flank_norm" if config["flank_norm"] else "",
+ flank=config["flank"],
+ display_window = config["display_window"],
+ figsize = (12, 9),
+ conda:
+ "../envs/GC_bias.yaml"
+ shell:
+ """
+ workflow/scripts/plot_overlay_GCcorrection.py \
+ --output {output} \
+ --regex {params.name_regex} \
+ --signal {params.signal} \
+ --target {params.target} \
+ --overlay_mode {params.overlay_mode} \
+ {params.smoothing} \
+ --smooth_window {params.smooth_window} \
+ --smooth_polyorder {params.smooth_polyorder} \
+ {params.rolling} \
+ --rolling_window {params.rolling_window} \
+ {params.flank_norm} \
+ --flank {params.flank} \
+ --display_window {params.display_window} \
+ --figsize {params.figsize} \
+ {params.uncorrected_samples} \
+ {params.corrected_samples}
+ """
diff --git a/workflow/rules/common.smk b/workflow/rules/common.smk
index db1e165..7a7d206 100644
--- a/workflow/rules/common.smk
+++ b/workflow/rules/common.smk
@@ -13,6 +13,9 @@ samples = pd.read_csv(config["samples"], sep="\t").set_index("sample", drop=Fals
samples.index.names = ["sample_id"]
validate(samples, schema="../schemas/samples.schema.yaml")
+regions = pd.read_csv(config["regions"], sep="\t").set_index("target", drop=False)
+regions.index.names = ["region_id"]
+
def get_final_output():
final_output = []
@@ -35,10 +38,27 @@ def get_final_output():
GENOME=samples["genome_build"],
)
)
+
final_output.extend(
expand("results/{ID}/qc/multiqc.html", ID=samples["ID"].unique())
)
+ if config["utility"]["GCbias-correction"]:
+ final_output.extend(
+ expand(
+ expand(
+ "results/{ID}/signals/signal-corrected/{target_region}.{SAMPLE}-corrected_WPS.{GENOME}.csv.gz",
+ zip,
+ ID=samples["ID"],
+ SAMPLE=samples["sample"],
+ GENOME=samples["genome_build"],
+ allow_missing=True,
+ ),
+ target_region=regions["target"],
+ #blacklist=["repeatmasker"],
+ )
+ )
+
if config["utility"]["GCbias-plot"]:
final_output.extend(
expand(
@@ -54,6 +74,25 @@ def get_final_output():
)
)
+ final_output.extend(
+ set(
+ expand(
+ expand(
+ "results/{ID}/signals/GCcorrection-plots/{target_region}.{status_name}-GCcorrected_{signal}.{GENOME}.png",
+ zip,
+ ID=samples["ID"],
+ GENOME=samples["genome_build"],
+ status_name = samples["status"],#.loc[samples["status"] != config["control_name"]],
+ allow_missing=True,
+ ),
+ signal = "COV" if config["signal"].lower() == "coverage" else "WPS",
+ target_region=regions["target"],
+ )
+ )
+ )
+
+ # add uncorrected as target
+
if config["utility"]["ichorCNA"]:
final_output.extend(
expand(
diff --git a/workflow/rules/extract_signals.smk b/workflow/rules/extract_signals.smk
new file mode 100644
index 0000000..62c237b
--- /dev/null
+++ b/workflow/rules/extract_signals.smk
@@ -0,0 +1,182 @@
+
+
+rule extract_counts:
+ input:
+ #target="results/regions/{GENOME}/target_region/{target_region}.blacklist-excluded.bed.gz",
+ target=lambda wildcards: regions["path"][wildcards.target_region],
+ BAMFILE="results/{ID}/mapped_reads/{SAMPLE}_processed.{GENOME}.bam",
+ BAIFILE="results/{ID}/mapped_reads/{SAMPLE}_processed.{GENOME}.bam.bai",
+ twobit_genome=lambda wildcards: config[wildcards.GENOME]["2bit_ref"],
+ output:
+ WPS="results/{ID}/signals/signal-uncorrected/{target_region}.{SAMPLE}-uncorrected_WPS.{GENOME}.csv.gz",
+ COV="results/{ID}/signals/signal-uncorrected/{target_region}.{SAMPLE}-uncorrected_COV.{GENOME}.csv.gz",
+ MEAN_WEIGHT="results/{ID}/signals/signal-uncorrected/{target_region}.{SAMPLE}-uncorrected_MEAN_WEIGHT.{GENOME}.csv.gz",
+ params:
+ minRL=config["minRL"],
+ maxRL=config["maxRL"],
+ lengthSR=76,
+ protection=120,
+ seed="--seed 42",
+ out_pre="results/{ID}/signals/signal-uncorrected/{target_region}.{SAMPLE}-uncorrected_%s.{GENOME}.csv.gz",
+ threads: 30
+ conda:
+ #"../envs/cfDNA_rework.yml"
+ "../envs/GC_bias.yaml"
+ shell:
+ """
+ workflow/scripts/extract_cfDNA_signals.py \
+ -b {input.BAMFILE} \
+ -g {input.twobit_genome} \
+ -r {input.target} \
+ -o {params.out_pre} \
+ -min {params.minRL} \
+ -max {params.maxRL} \
+ -w {params.protection} \
+ -l {params.lengthSR} \
+ -p {threads} \
+ {params.seed}
+ """
+
+
+rule extract_GCcorrected_counts:
+ input:
+ #target="results/regions/{GENOME}/target_region/{target_region}.blacklist-excluded.bed.gz",
+ target=lambda wildcards: regions["path"][wildcards.target_region],
+ BAMFILE="results/{ID}/mapped_reads/{SAMPLE}_processed.{GENOME}.bam",
+ BAIFILE="results/{ID}/mapped_reads/{SAMPLE}_processed.{GENOME}.bam.bai",
+ twobit_genome=lambda wildcards: config[wildcards.GENOME]["2bit_ref"],
+ #GCfreqfile="results/{ID}/GCBias/bias_table/{SAMPLE}-{computation}_GCbias_interpolated_{analysis}.{GENOME}.tsv.gz",
+ output:
+ WPS="results/{ID}/signals/signal-corrected/{target_region}.{SAMPLE}-corrected_WPS.{GENOME}.csv.gz",
+ COV="results/{ID}/signals/signal-corrected/{target_region}.{SAMPLE}-corrected_COV.{GENOME}.csv.gz",
+ MEAN_WEIGHT="results/{ID}/signals/signal-corrected/{target_region}.{SAMPLE}-corrected_MEAN_WEIGHT.{GENOME}.csv.gz",
+ params:
+ minRL=config["minRL"],
+ maxRL=config["maxRL"],
+ lengthSR=config["lengthSR"],
+ protection=config["bpProtection"],
+ seed="--seed 42",
+ out_pre="results/{ID}/signals/signal-corrected/{target_region}.{SAMPLE}-corrected_%s.{GENOME}.csv.gz",
+ threads: 30
+ conda:
+ "../envs/GC_bias.yaml"
+ shell:
+ """
+ workflow/scripts/extract_cfDNA_signals.py \
+ -b {input.BAMFILE} \
+ -g {input.twobit_genome} \
+ -r {input.target} \
+ -o {params.out_pre} \
+ -gc \
+ -min {params.minRL} \
+ -max {params.maxRL} \
+ -w {params.protection} \
+ -l {params.lengthSR} \
+ -p {threads} \
+ {params.seed}
+ """
+
+
+def get_control_files(wildcards):
+ control_name = config["control_name"]
+ path_template = "results/{s.ID}/signals/signal-{correction}/{target_region}.{s.sample}-{correction}_{signal}.{s.genome_build}.csv.gz"
+ paths = expand(
+ path_template,
+ s=list(samples.loc[samples["status"] == control_name].itertuples()),
+ allow_missing=True,
+ )
+ return paths
+
+
+rule aggregate_controls:
+ input:
+ control_files=get_control_files,
+ #control_files=["results/cfDNA_UniFlow/signals/signal-corrected/LYL1.EGAF00002727162-corrected_COV.hg38.csv.gz",
+ # "results/cfDNA_UniFlow/signals/signal-corrected/LYL1.EGAF00002727163-corrected_COV.hg38.csv.gz"]
+ output:
+ "results/{ID}/signals/controls/{target_region}.{control_name}-{correction}_{signal}.{GENOME}.tsv.gz",
+ params:
+ control_name=config["control_name"],
+ overlay_mode=config["overlay_mode"],
+ smoothing= "--smoothing" if config["smoothing"] else "",
+ smooth_window=config["smooth_window"],
+ smooth_polyorder=config["smooth_polyorder"],
+ rolling="--rolling" if config["rolling"] else "",
+ rolling_window=config["rolling_window"],
+ flank_norm="--flank_norm" if config["flank_norm"] else "",
+ flank=config["flank"],
+ conda:
+ "../envs/GC_bias.yaml"
+ shell:
+ """
+ workflow/scripts/aggregate_controls.py \
+ --output {output}\
+ --control_name {params.control_name} \
+ --overlay_mode {params.overlay_mode} \
+ {params.smoothing} \
+ --smooth_window {params.smooth_window} \
+ --smooth_polyorder {params.smooth_polyorder} \
+ {params.rolling} \
+ --rolling_window {params.rolling_window} \
+ {params.flank_norm} \
+ --flank {params.flank} \
+ {input.control_files}
+ """
+
+def get_case_files(wildcards):
+ case_name = wildcards.case_name
+ path_template = "results/{s.ID}/signals/signal-{correction}/{target_region}.{s.sample}-{correction}_{signal}.{s.genome_build}.csv.gz"
+ paths = expand(
+ path_template,
+ s=list(samples.loc[samples["status"] == case_name].itertuples()),
+ allow_missing=True,
+ )
+ return paths
+
+rule plot_case_control:
+ input:
+ control_file= "results/{ID}/signals/controls/{target_region}.{control_name}-{correction}_{signal}.{GENOME}.tsv.gz",
+ case_files = get_case_files,
+ output:
+ "results/{ID}/signals/case-control-plots/{target_region}.{case_name}-vs-{control_name}-{correction}_{signal}.{GENOME}.png",
+ params:
+ signal = "{signal}",
+ target = "{target_region}",
+ case_name="{case_name}",
+ overlay_mode=config["overlay_mode"],
+ smoothing= "--smoothing" if config["smoothing"] else "",
+ smooth_window=config["smooth_window"],
+ smooth_polyorder=config["smooth_polyorder"],
+ rolling="--rolling" if config["rolling"] else "",
+ rolling_window=config["rolling_window"],
+ flank_norm="--flank_norm" if config["flank_norm"] else "",
+ flank=config["flank"],
+ display_window = config["display_window"],
+ aggregate_controls = "--aggregate_controls" if config["aggregate_controls"] else "",
+ GC_corrected = "--GC_corrected" if config["utility"]["GCbias-correction"] else "",
+ figsize = (12, 9),
+ conda:
+ "../envs/GC_bias.yaml"
+ shell:
+ """
+ workflow/scripts/plot_overlay_case-control.py \
+ --control_samples {input.control_file} \
+ --output {output} \
+ --signal {params.signal} \
+ --target {params.target} \
+ --case_name {params.case_name} \
+ --overlay_mode {params.overlay_mode} \
+ {params.smoothing} \
+ --smooth_window {params.smooth_window} \
+ --smooth_polyorder {params.smooth_polyorder} \
+ {params.rolling} \
+ --rolling_window {params.rolling_window} \
+ {params.flank_norm} \
+ --flank {params.flank} \
+ --display_window {params.display_window} \
+ {params.aggregate_controls} \
+ {params.GC_corrected} \
+ --figsize {params.figsize} \
+ {input.case_files}
+ """
+
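For reference, the per-sample path expansion used in `get_control_files` and `get_case_files` above can be emulated with plain Python. A sketch with a hypothetical sample sheet; Snakemake's `expand` with `allow_missing=True` keeps `{correction}`, `{target_region}`, and `{signal}` as wildcards, which are fixed here for illustration:

```python
import pandas as pd

samples = pd.DataFrame(
    {
        "sample": ["ctrl_1", "ctrl_2"],
        "ID": ["testrun", "testrun"],
        "genome_build": ["hg38", "hg38"],
        "status": ["healthy", "healthy"],
    }
).set_index("sample", drop=False)

template = (
    "results/{ID}/signals/signal-{correction}/"
    "{target_region}.{sample}-{correction}_{signal}.{genome_build}.csv.gz"
)

paths = [
    template.format(
        ID=s.ID, sample=s.sample, genome_build=s.genome_build,
        correction="corrected", target_region="LYL1", signal="COV",
    )
    for s in samples.loc[samples["status"] == "healthy"].itertuples()
]
# -> ['results/testrun/signals/signal-corrected/LYL1.ctrl_1-corrected_COV.hg38.csv.gz',
#     'results/testrun/signals/signal-corrected/LYL1.ctrl_2-corrected_COV.hg38.csv.gz']
```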
diff --git a/workflow/schemas/config.schema.yaml b/workflow/schemas/config.schema.yaml
index 48ab90a..4e15804 100644
--- a/workflow/schemas/config.schema.yaml
+++ b/workflow/schemas/config.schema.yaml
@@ -52,7 +52,20 @@ properties:
SEED:
type: number
description: seed for increased reproducibility
- trimming_algorithm:
+ utility:
+ GCbias-plot:
+ type: boolean
+ default: True
+ GCbias-correction:
+ type: boolean
+ default: True
+ ichorCNA:
+ type: boolean
+ default: True
+ case-control-plot:
+ type: boolean
+ default: True
+ PE_trimming_algorithm:
type: string
default: NGmerge
enum:
@@ -112,16 +125,6 @@ properties:
singleton:
type: boolean
default: False # default is False
- utility:
- GCbias-plot:
- type: boolean
- default: True
- GCbias-correction:
- type: boolean
- default: True
- ichorCNA:
- type: boolean
- default: True
read_counter:
window:
type: number
@@ -221,7 +224,74 @@ properties:
txnStrength:
type: number
default: 10000
-
+ minRL:
+ type: number
+ default: 120
+ description: minimum read length for calculating WPS
+ maxRL:
+ type: number
+ default: 180
+ description: maximum read length for calculating WPS
+ bpProtection:
+ type: number
+ default: 120
+ description: bp protection for calculating WPS
+ lengthSR:
+ type: number
+ default: 76
+ description: length of single reads, if used for calculating WPS
+ overlay_mode:
+ type: string
+ default: mean
+ enum:
+ - mean
+ - median
+ description: Can be either "mean" or "median". Sets overlay mode, specifying how regions should be aggregated for each sample.
+ smoothing:
+ type: boolean
+ default: True
+ description: Activates smoothing with Savitzky-Golay filter.
+ smooth_window:
+ type: number
+ default: 21
+        description: Sets window size used for smoothing with Savitzky-Golay filter.
+ smooth_polyorder:
+ type: number
+ default: 2
+ description: Sets order of polynomial used for smoothing with Savitzky-Golay filter.
+ rolling:
+ type: boolean
+ default: True
+ description: Activates trend removal with a rolling median filter.
+ rolling_window:
+ type: number
+ default: 1000
+ description: Sets window size used in rolling median filter.
+ flank_norm:
+ type: boolean
+ default: True
+ description: Activates normalization by dividing the signals by the mean coverage in flanking intervals around the region of interest.
+ flank:
+ type: number
+ default: 2000
+ description: Sets the size of the flanking intervals around the region of interest. Should be <= 0.5 of the extracted signals
+ signal:
+ type: string
+ default: "coverage"
+ enum:
+ - coverage
+ - WPS
+ description: can be either "coverage" or "WPS"
+ display_window:
+ type: array
+ default: [-1500,1500]
+ items:
+ type: number
+ description: Sets window around center of region of interest for plotting purposes.
+ aggregate_controls:
+ type: boolean
+ default: True
+        description: Switch between aggregating controls or plotting them separately.
required:
- samples
- hg19
diff --git a/workflow/scripts/extract_cfDNA_signals.py b/workflow/scripts/extract_cfDNA_signals.py
new file mode 100755
index 0000000..9a607b1
--- /dev/null
+++ b/workflow/scripts/extract_cfDNA_signals.py
@@ -0,0 +1,568 @@
+#!/usr/bin/env python
+
+import argparse
+import gzip
+import math
+import os
+import random
+import sys
+from collections import defaultdict
+import re
+
+import click
+import numpy as np
+import pandas as pd
+import py2bit
+import pybedtools as pbt
+import pysam
+from bx.intervals.intersection import Intersecter, Interval
+from cfDNA_GCcorrection import bamHandler
+from cfDNA_GCcorrection.utilities import (getGC_content, hash_file, map_chroms,
+ read_precomputed_table)
+from csaps import csaps
+from loguru import logger
+from matplotlib import pyplot as plt
+from mpire import WorkerPool
+from scipy import stats
+from scipy.signal import savgol_filter
+from scipy.ndimage import median_filter
+
+### define functions ###
+
+
+def isSoftClipped(cigar):
+ # Op BAM Description
+ # M 0 alignment match (can be a sequence match or mismatch)
+ # I 1 insertion to the reference
+ # D 2 deletion from the reference
+ # N 3 skipped region from the reference
+ # S 4 soft clipping (clipped sequences present in SEQ)
+ # H 5 hard clipping (clipped sequences NOT present in SEQ)
+ # P 6 padding (silent deletion from padded reference)
+ # = 7 sequence match
+ # X 8 sequence mismatch
+ for (op, count) in cigar:
+ if op in [4, 5, 6]:
+ return True
+ return False
+
+
+def aln_length(cigarlist):
+ tlength = 0
+ for operation, length in cigarlist:
+ if operation == 0 or operation == 2 or operation == 3 or operation >= 6:
+ tlength += length
+ return tlength
+
+
+STANDARD_CHROMOSOMES = (
+ [str(i) for i in range(1, 23)]
+ + ["X", "Y"]
+ + ["chr" + str(i) for i in range(1, 23)]
+ + ["chrX", "chrY"]
+)
+
+
+def roundGCLenghtBias(gc):
+ gc_frac, gc_int = math.modf(round(gc * 100, 2))
+ gc_new = gc_int + rng.binomial(1, gc_frac)
+ return int(gc_new)
+
+
+def filter_read(read):
+ if read.is_duplicate or read.is_qcfail or read.is_unmapped:
+ return None
+ if isSoftClipped(read.cigar):
+ return None
+ if read.is_paired:
+ if read.mate_is_unmapped:
+ return None
+ if read.next_reference_id != read.reference_id:
+ return None
+ return read
+
+
+def calculate_signals(
+ chrom, start, end, bam, reference, chr_name_bam_to_bit_mapping, R_gc_dict=None, name="unnamed_ROI", strand=".", verbose=False, downsample=None, minInsSize = None, maxInsSize = None, protection=60, gc_correct=False, merged=False, trimmed=False, lengthSR=76
+):
+ bam = bamHandler.openBam(bam)
+ reference = py2bit.open(reference)
+ tbit_chrom = chr_name_bam_to_bit_mapping[chrom]
+ #logger.debug(f"chrom: {chrom}; mapped tbit_chrom: {tbit_chrom}")
+ ID = f"{name}"
+ coordinate = f"{chrom}:{start}-{end}"
+ regionStart,regionEnd = int(start),int(end)
+ tag = 1
+ # posRange is a dict of lists of two values.
+ # Each key is a position, the values are coverage, endpoints, average value, number of reads.
+ # The dict get updated read per read and position per position.
+ posRange = defaultdict(lambda: [0, 0, 0, 0]) #defaultdict(lambda: [0, 0, 0])#
+ # filteredReads is an instance of an interval tree (intersecter) class of the bx python package.
+ # It is a data structure for performing intersect queries on a set of intervals
+ # In this case, intervals represent reads that passed the filters and contain additional information
+ filteredReads = Intersecter()
+
+ for read in bam.fetch(chrom, max(regionStart - protection - 1, 0), regionEnd + protection + 1,):
+ read = filter_read(read)
+ if not read:
+ continue
+ elif read.is_paired:
+ if read.is_read1 or (read.is_read2 and read.pnext + read.qlen < regionStart - protection - 1):
+ if read.isize == 0:
+ continue
+ if downsample != None and random.random() >= downsample:
+ continue
+ rstart = min(read.pos, read.pnext) + 1 # 1-based
+ lseq = abs(read.isize)
+ r_len = abs(read.template_length)
+ rend = rstart + lseq - 1 # end included
+ if minInsSize != None and ((lseq < minInsSize) or (lseq > maxInsSize)):
+ continue
+ else:
+ continue
+ else:
+ if downsample != None and random.random() >= downsample:
+ continue
+ rstart = read.pos + 1 # 1-based
+ lseq = aln_length(read.cigar)
+ r_len = read.query_length
+ rend = rstart + lseq - 1 # end included
+ #logger.debug(rstart, rend)
+ if minInsSize != None and ((lseq < minInsSize) or (lseq > maxInsSize)):
+ continue
+ # if GC should be used, change the value of the tag to the corresponding bias value
+
+ if gc_correct:
+ try:
+ gc = getGC_content(
+ reference,
+ tbit_chrom,
+ read.reference_start,
+ read.reference_end,
+ fraction=True,
+ )
+ gc = roundGCLenghtBias(gc)
+ except Exception as detail:
+ if verbose:
+ logger.exception(detail)
+ continue
+ if R_gc_dict:
+ try:
+                    tag = round(float(1) / R_gc_dict[r_len][str(gc)], 5)  # precision should eventually be set via round_digits
+ except KeyError as e:
+ tag = 1
+ else:
+ try:
+ tag=round(dict(read.tags)["YC"],5)
+ except KeyError as e:
+ tag=1
+ # add tag to interval. tag defaults to 1 if there is no YC weight attached to the read or gc_correct is falsy
+ filteredReads.add_interval(Interval(rstart, rend, value={"YC": tag}))
+ if read.is_paired:
+ for i in range(rstart, rend + 1):
+ if i >= regionStart and i <= regionEnd:
+ posRange[i][0] += tag
+ posRange[i][2] += 1
+ posRange[i][3] += (tag-posRange[i][3])/posRange[i][2]
+ if rstart >= regionStart and rstart <= regionEnd:
+ posRange[rstart][1] += tag
+ if rend >= regionStart and rend <= regionEnd:
+ posRange[rend][1] += tag
+ else:
+ for i in range(rstart, rend + 1):
+ if i >= regionStart and i <= regionEnd:
+ posRange[i][0] += tag
+ posRange[i][2] += 1
+ posRange[i][3] += (tag-posRange[i][3])/posRange[i][2]
+ if (merged or read.qname.startswith("M_")) or (
+ (trimmed or read.qname.startswith("T_"))
+ and read.qlen <= lengthSR - 10
+ ):
+ if rstart >= regionStart and rstart <= regionEnd:
+ posRange[rstart][1] += tag
+ if rend >= regionStart and rend <= regionEnd:
+ posRange[rend][1] += tag
+ elif read.is_reverse:
+ if rend >= regionStart and rend <= regionEnd:
+ posRange[rend][1] += tag
+ else:
+ if rstart >= regionStart and rstart <= regionEnd:
+ posRange[rstart][1] += tag
+
+ if verbose:
+ sys.stderr.write("Evaluating posRange vector...\n")
+ # filename = outfile%cid
+ # outfile = gzip.open(filename,'w')
+ cov_sites = 0
+ outLines = []
+ wps_list = []
+ cov_list = []
+ starts_list = []
+ mean_weight_list = []
+
+ for pos in range(regionStart, regionEnd + 1):
+ rstart, rend = pos - protection, pos + protection
+ gcount, bcount, ecount = 0.0, 0.0, 0.0
+ for read in filteredReads.find(rstart, rend):
+ if (read.start > rstart) or (read.end < rend):
+ bcount += read.value["YC"]
+ else:
+ gcount += read.value["YC"]
+
+ covCount, startCount, n_reads, mean_value = posRange[pos]
+ #logger.debug(posRange[pos])
+ #covCount, startCount, n_reads = posRange[pos]
+ #mean_value = covCount/n_reads
+ cov_sites += covCount
+ wpsValue = gcount - (bcount + ecount)
+
+
+ wps_list.append(wpsValue)
+ cov_list.append(covCount)
+ starts_list.append(startCount)
+ mean_weight_list.append(mean_value)
+
+ if strand == "-":
+ wps_list = wps_list[::-1]
+ cov_list = cov_list[::-1]
+ starts_list = starts_list[::-1]
+ mean_weight_list = mean_weight_list[::-1]
+
+
+ return ID,coordinate,wps_list,cov_list,starts_list,mean_weight_list
+
+
+@click.command()
+@click.option(
+ "--bamfile",
+ "-b",
+ "bam_file",
+ required=True,
+ type=click.Path(exists=True, readable=True),
+ help="Sorted BAM file.",
+)
+@click.option(
+ "--genome",
+ "-g",
+ "reference_file",
+ required=True,
+ type=click.Path(exists=True, readable=True),
+ help="""Genome reference in two bit format. Most genomes can be
+ found here: http://hgdownload.cse.ucsc.edu/gbdb/
+ Search for the .2bit ending. Otherwise, fasta
+ files can be converted to 2bit using the UCSC
+    program called faToTwoBit, available for different
+    platforms at
+ http://hgdownload.cse.ucsc.edu/admin/exe/""",
+)
+@click.option(
+ "--regions",
+ "-r",
+ "regions",
+ required=True,
+ type=click.Path(exists=True, readable=True),
+    help="Regions of interest in .bed format.",
+)
+@click.option(
+ "--outfile",
+ "-o",
+ "outfile",
+ show_default=True,
+ type=click.Path(writable=True),
+ help="""Outfile prefix to save the extracted signals.
+    The filename must contain %s or {}, which will be substituted with "WPS" and "COV".
+ Output file will be gzipped tab separated files.""",
+ default="signal_{}.tsv.gz",
+)
+@click.option(
+ "-gc",
+ "--gc_correct",
+ "GC_correct",
+ is_flag=True,
+    help="""Flag: indicate if GC correction should be used.
+ Expects bam file with GC weights attached to reads as YC tags.
+ Is automatically set to true if a GCbiasFrequencyfile is supplied.""",
+)
+@click.option(
+ "--GCbiasFrequenciesFile",
+ "-freq",
+ "GC_file",
+ type=click.Path(exists=True, readable=True),
+ help="""Indicate the output file from computeGCBias containing
+ the coverage bias values per GC-content and fragment length.
+ This file should be a (gzipped) tab separated file.""",
+)
+@click.option(
+ "--minInsSize",
+ "-min",
+ "minInsSize",
+ default=None,
+ type=click.INT,
+ help="Minimum read length threshold to consider for signal extraction.",
+)
+@click.option(
+ "--maxInsSize",
+ "-max",
+ "maxInsSize",
+ default=None,
+ type=click.INT,
+ help="Maximum read length threshold to consider for signal extraction.",
+)
+@click.option(
+ "--max_length",
+ "max_length",
+ default=1000,
+ type=click.INT,
+ show_default=True,
+ help="Assumed maximum insert size.",
+)
+@click.option(
+ "--protection",
+ "-w",
+ "protection",
+ default=120,
+ type=click.INT,
+ help="Base pair protection window size.",
+)
+@click.option(
+ "--lengthSR",
+ "-l",
+ "lengthSR",
+ default=76,
+ type=click.INT,
+ help="Length of full reads.",
+)
+@click.option(
+ "--merged",
+ "-m",
+ "merged",
+ is_flag=True,
+ default=False,
+ show_default=True,
+ help="""Assume reads are merged.""",
+)
+@click.option(
+ "--trimmed",
+ "-t",
+ "trimmed",
+ is_flag=True,
+ default=False,
+ show_default=True,
+ help="""Assume reads are trimmed.""",
+)
+@click.option(
+ "--num_cpus",
+ "-p",
+ "num_cpus",
+ type=click.INT,
+ default=1,
+ show_default=True,
+ help="Number of processors to use.",
+)
+@click.option(
+ "--seed",
+ "seed",
+ default=None,
+ type=click.INT,
+ help="""Set seed for reproducibility.""",
+)
+@click.option("-v", "--verbose", "verbose", is_flag=True, help="Enables verbose mode")
+@click.option("--debug", "debug", is_flag=True, help="Enables debug mode")
+@click.option(
+ "--filter_size",
+ "filter_size",
+ default=None,
+ type=click.INT,
+ help="Filter size.",
+)
+@click.option(
+ "--round",
+ "round_digits",
+ default=2,
+ type=click.INT,
+ help="Digits to round the correction values to.",
+)
+@click.option(
+ "-sc",
+ "--standard_chroms",
+ "standard_chroms",
+ is_flag=True,
+ help="Flag: filter chromosomes to human standard chromosomes.",
+)
+def main(
+ bam_file,
+ reference_file,
+ regions,
+ outfile,
+ GC_correct,
+ GC_file,
+ minInsSize,
+ maxInsSize,
+ max_length,
+ protection,
+ lengthSR,
+ merged,
+ trimmed,
+ num_cpus,
+ seed,
+ verbose,
+ debug,
+ filter_size,
+ round_digits,
+ standard_chroms
+):
+
+ ### initial setup
+ progress_bar = False
+
+ if debug:
+ debug_format = "