Develop #37

Merged
merged 56 commits into from
Oct 23, 2023
0c3f03c
refactor: minor textual changes
sroener Aug 18, 2023
d749bea
feat: overhaul support for single end sequencing
sroener Aug 18, 2023
5c6a821
refactor: update environment
sroener Aug 18, 2023
e42fd26
feat: add signal extraction around target regions; add GC correction o…
sroener Aug 18, 2023
91495fc
feat: add case-control plots
sroener Aug 18, 2023
15d6d3d
refactor: update QC wrappers
sroener Aug 18, 2023
5332589
fix: fix small format error
sroener Aug 18, 2023
1496bfe
feat: add region data validation
sroener Aug 18, 2023
2c8e6fb
refactor: update json schema version
sroener Aug 18, 2023
b27a458
fix: correct wrong validation schema
sroener Aug 18, 2023
e226c9d
refactor: snakefmt
sroener Aug 18, 2023
18d9cfb
feat: new report sections
sroener Sep 6, 2023
4752f36
feat: add new plots to report categories
sroener Sep 6, 2023
c874373
feat: add options for y-axis min range; refactor
sroener Sep 6, 2023
e060239
refactor: clarify descriptions
sroener Sep 26, 2023
340b125
refactor: change default figsize to (12,9)
sroener Sep 26, 2023
2c0efb0
perf: optimize data types during data loading
sroener Sep 26, 2023
819c3db
perf: optimize memory usage in SavGol filter application
sroener Sep 26, 2023
ee53f50
perf: optimize data types during data loading
sroener Sep 26, 2023
f6cc864
perf: optimize memory usage in SavGol filter application
sroener Sep 26, 2023
9388782
style: multiple style fixes
sroener Sep 26, 2023
ba2d5c5
fix: catch KeyError in chromosome mapping
sroener Sep 26, 2023
c1f3f85
feat: set min lower and upper limits for Y axis of plots
sroener Sep 26, 2023
d4aac98
refactor: optimize plotting for report
sroener Sep 26, 2023
60b9181
fix: remove name duplication in plot legend
sroener Sep 28, 2023
fa957ac
feat: add more extensive figure description
sroener Sep 28, 2023
5a4a772
feat: add more extensive figure description
sroener Sep 28, 2023
c84a463
feat: add more extensive figure description
sroener Sep 28, 2023
eb233cc
style: increase readability
sroener Sep 28, 2023
d6d1299
perf: update GC correction to optimized version
sroener Oct 16, 2023
8a1458d
refactor: pin version numbers
sroener Oct 16, 2023
4597927
refactor: set phred-quality-encoding default to phred 33
sroener Oct 16, 2023
79ee96b
test: add testregions for integration test
sroener Oct 16, 2023
f6aabe7
test: correct path to test regions
sroener Oct 16, 2023
634ed08
refactor: whitelist testregions
sroener Oct 16, 2023
464d88b
initial commit
sroener Oct 16, 2023
a80ab10
test: test case hg38
sroener Oct 16, 2023
5af18a4
refactor: consistent logging
sroener Oct 17, 2023
9faeaa9
doc: update abstract
sroener Oct 17, 2023
91a4904
doc: update overview figure
sroener Oct 17, 2023
84253e1
doc: generalize paths
sroener Oct 17, 2023
87e576e
doc: restructure documentation
sroener Oct 17, 2023
4f8bf19
feat: disable absolute paths in multiQC report
sroener Oct 18, 2023
484bded
refactor: improve labeling of QC report
sroener Oct 18, 2023
9f5b1cc
refactor: improve logging
sroener Oct 18, 2023
3c374c3
fix: multiqc wrapper now uses provided files only
sroener Oct 18, 2023
b77bbab
refactor: update example report
sroener Oct 18, 2023
f94c721
doc: add figures
sroener Oct 18, 2023
f55c207
doc: fix figure
sroener Oct 18, 2023
8a406df
refactor: spacing
sroener Oct 19, 2023
c67127f
doc: increase readability with background and font color
sroener Oct 19, 2023
2719e11
refactor: update GCbias plot
sroener Oct 23, 2023
d9ad271
doc: update GC bias plots
sroener Oct 23, 2023
c285833
feat: add option for normalized spline interpolation
sroener Oct 23, 2023
9fa4720
refactor: style format with snakefmt
sroener Oct 23, 2023
3e0e613
Merge branch 'main' into develop
sroener Oct 23, 2023
4 changes: 4 additions & 0 deletions .gitignore
@@ -3,6 +3,8 @@
!resources/
!resources/blacklists/
!resources/blacklists/**
!resources/testregions/
!resources/testregions/**

!supplement/
!supplement/**
@@ -26,9 +28,11 @@
!config/
!config/example.config.yaml
!config/example.samples.tsv
!config/example.regions.tsv
!config/multiqc_config.yaml
!config/test-config.yaml
!config/test-samples.tsv
!config/test-regions.tsv

!resources/qual_profile.txt

277 changes: 192 additions & 85 deletions README.md

Large diffs are not rendered by default.

66 changes: 51 additions & 15 deletions config/example.config.yaml
@@ -1,8 +1,11 @@
# This file should contain everything to configure the workflow on a global scale.
# In case of sample based data, it should be complemented by a samples.tsv file that contains
# one row per sample. It can be parsed easily via pandas.
samples: "config/samples.tsv"
samples: "config/example.samples.tsv"

regions: "config/example.regions.tsv"

control_name: "healthy" # name of the control samples specified in the samples.tsv. Has to match the name in the status field.

### genome build specific options ###

@@ -25,17 +28,25 @@ TMPDIR: "./" # path to directory for writing TMP files

SEED: 42 # seed for increased reproducibility. Mainly used in GCbias estimation

### Utility options ###

utility:
GCbias-plot: True
GCbias-correction: True
ichorCNA: True
case-control-plot: True

### trimming ###

trimming_algorithm: "NGmerge" #can be either NGmerge or trimmomatic
PE_trimming_algorithm: "NGmerge" #can be either NGmerge or trimmomatic

#### NGmerge specific options ####

length-filter:
MINLEN: 30 # min length of reads in additional filter steps

#### trimmomatic specific options ####

phred-quality-encoding: phred-33 # three options: empty = automatic detection, phred-33 and phred-64

# Illuminaclip takes a fasta file with adapter sequences and removes them in the trimming step.
# The adapter_files option takes either the path to a custom file </PATH/TO/CUSTOM/ADAPTER.fa>
@@ -66,20 +77,17 @@ trimmers:

### Mapping ###

# This option lets you add unpaired/singleton reads in the mapping step that were filtered,
# but are either not paired or not merged. Otherwise these reads are not further processed.
# This option lets you add unmerged/singleton or single-end reads in the mapping step.
# Unmerged or singleton reads are paired end reads that were filtered by samtools fastq or NGmerge.
# Single-end reads are from single end libraries. These categories can be excluded for specialised analyses.
mapping:
unmerged: True # default is True
singleton: False # default is False

### Utility ###

utility:
GCbias-plot: True
GCbias-correction: True
ichorCNA: True

paired_end:
unmerged: True # default is True. Reads not merged by NGmerge.
singleton: False # default is False. Reads that are from paired end libraries without a matching pair.
single_end:
SEreads: True # default is True. This option is essential for Single End libraries. Setting to true in PE libraries has no effect on the output.

### Utility parameters ###

#### ichorCNA ####

@@ -118,3 +126,31 @@ ichorCNA:
scStates: '"c(1,3)"'
txnE: 0.9999
txnStrength: 10000

#### GCbias ####

##### GCbias estimation #####

GCbias_estimation:
normalized_interpolation: True # boolean, True or False. If True, the smooth parameter is normalized such that results are invariant to xdata range and less sensitive to nonuniformity of weights and xdata clumping.

#### Signal extraction ####

minRL: 120 # minimum read length for calculating WPS
maxRL: 180 # maximum read length for calculating WPS
bpProtection: 120 # bp protection for calculating WPS
lengthSR: 76 # length of single reads, if used for calculating WPS

#### Signal processing ####

overlay_mode: "mean" # Can be either "mean" or "median". Sets overlay mode, specifying how regions should be aggregated for each sample.
smoothing: True # Activates smoothing with Savitzky-Golay filter.
smooth_window: 21 # Sets window size used for smoothing with Savitzky-Golay filter.
smooth_polyorder: 2 # Sets order of polynomial used for smoothing with Savitzky-Golay filter.
rolling: True # Activates trend removal with a rolling median filter.
rolling_window: 1000 # Sets window size used in rolling median filter.
flank_norm: True # Activates normalization by dividing the signals by the mean coverage in flanking intervals around the region of interest.
flank: 2000 # Sets the size of the flanking intervals around the region of interest. Should be at most half the length of the extracted signal.
signal: "coverage" # can be either "coverage" or "WPS"
display_window: [-1500,1500]
aggregate_controls: True
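
For orientation, the signal processing options above (`overlay_mode`, `rolling`, `rolling_window`, `flank_norm`, `flank`) can be read as the following steps. This is a minimal standard-library sketch, not the workflow's implementation: the function names are hypothetical, subtracting (rather than dividing by) the rolling median is an assumption, and so are the edge handling and the order of the steps. The Savitzky-Golay step is left out; SciPy's `scipy.signal.savgol_filter` is the usual choice for it.

```python
from statistics import mean, median


def aggregate(regions, mode="mean"):
    """Collapse equal-length per-region signals into one profile, position by position."""
    reducer = {"mean": mean, "median": median}[mode]
    return [reducer(column) for column in zip(*regions)]


def flank_normalize(signal, flank):
    """Divide a signal by the mean of its left and right flanking intervals."""
    if flank <= 0 or 2 * flank > len(signal):
        raise ValueError("flank must be positive and at most half the signal length")
    flank_mean = mean(signal[:flank] + signal[-flank:])
    return [value / flank_mean for value in signal]


def remove_trend(signal, window):
    """Subtract a centered rolling median (assumption: windows are truncated at the edges)."""
    half = window // 2
    detrended = []
    for i, value in enumerate(signal):
        lo, hi = max(0, i - half), min(len(signal), i + half + 1)
        detrended.append(value - median(signal[lo:hi]))
    return detrended


profile = aggregate([[1.0, 2.0, 3.0, 4.0], [3.0, 4.0, 5.0, 6.0]], mode="mean")
print(profile)
print(flank_normalize([2.0, 8.0, 8.0, 2.0], flank=1))
print(remove_trend([1.0, 1.0, 10.0, 1.0, 1.0], window=3))
```

The check in `flank_normalize` mirrors the constraint in the `flank` comment: the two flanks together cannot exceed the extracted signal.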
3 changes: 3 additions & 0 deletions config/example.regions.tsv
@@ -0,0 +1,3 @@
target path
region1 PATH/TO/region1.bed
region2 PATH/TO/region2.bed
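
A regions table like the one above could be read and sanity-checked roughly as follows. This is a sketch under assumptions: the file is taken to be tab-separated (the diff view does not show the delimiter), `read_regions` is a hypothetical helper, and the duplicate check is illustrative rather than the validation added in this PR. The standard-library `csv` module is used only to keep the example dependency-free.

```python
import csv
import io


def read_regions(handle):
    """Return {target: path} from a tab-separated table with 'target' and 'path' columns."""
    reader = csv.DictReader(handle, delimiter="\t")
    missing = {"target", "path"} - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing column(s): {sorted(missing)}")
    regions = {}
    for row in reader:
        target = row["target"]
        if target in regions:
            raise ValueError(f"duplicate target: {target}")
        regions[target] = row["path"]
    return regions


example = "target\tpath\nregion1\tPATH/TO/region1.bed\nregion2\tPATH/TO/region2.bed\n"
regions = read_regions(io.StringIO(example))
print(regions)
```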
6 changes: 3 additions & 3 deletions config/example.samples.tsv
@@ -1,3 +1,3 @@
ID sample bam fq1 fq2 genome_build library_name platform info
experiment_ID samplename1 PATH/TO/BAM - - some_library_kit Sequencing_platform healthy
experiment_ID samplename2 - PATH/TO/FQ1 PATH/TO/FQ2 some_library_kit Sequencing_platform some_condition
ID sample bam fq1 fq2 genome_build library_name platform status info
experiment_ID samplename1 PATH/TO/BAM - - some_library_kit Sequencing_platform healthy SomeAdditionalInfoForReadGroup/ID
experiment_ID samplename2 - PATH/TO/FQ1 PATH/TO/FQ2 some_library_kit Sequencing_platform some_condition SomeAdditionalInfoForReadGroup/ID
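
The new `status` column is what `control_name` in the config is matched against. A minimal sketch of that split, assuming a tab-separated samples table and an exact string comparison (so a status such as `healthy-case` does not count as a control when `control_name` is `healthy`); `split_by_status` and the placeholder paths are hypothetical:

```python
import csv
import io


def split_by_status(handle, control_name):
    """Split sample names into (controls, cases) by exact match on the status column."""
    controls, cases = [], []
    for row in csv.DictReader(handle, delimiter="\t"):
        target = controls if row["status"] == control_name else cases
        target.append(row["sample"])
    return controls, cases


header = ["ID", "sample", "bam", "fq1", "fq2", "genome_build",
          "library_name", "platform", "status", "info"]
rows = [
    ["test-run", "test38_chr20-22", "PATH/TO/BAM", "-", "-", "hg38",
     "ThruPLEX_DNA-seq", "Illumina_NextSeq_500", "healthy", ""],
    ["test-run", "test38_chr20-22-case", "PATH/TO/BAM", "-", "-", "hg38",
     "ThruPLEX_DNA-seq", "Illumina_NextSeq_500", "healthy-case", ""],
]
table = "\n".join("\t".join(fields) for fields in [header, *rows]) + "\n"
controls, cases = split_by_status(io.StringIO(table), control_name="healthy")
print(controls, cases)
```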
2 changes: 2 additions & 0 deletions config/multiqc_config.yaml
@@ -8,3 +8,5 @@ max_table_rows: 600
use_filename_as_sample_name:
- fastqc/zip
- fastqc/data

show_analysis_paths: False
66 changes: 53 additions & 13 deletions config/test-config.yaml
@@ -1,7 +1,11 @@
# This file should contain everything to configure the workflow on a global scale.
# In case of sample based data, it should be complemented by a samples.tsv file that contains
# one row per sample. It can be parsed easily via pandas.
samples: "config/test-samples.tsv" #"config/Delfi_samples.tsv"
samples: "config/test-samples.tsv"

regions: "config/test-regions.tsv"

control_name: "healthy" # name of the control samples specified in the samples.tsv. Has to match the name in the status field.


### genome build specific options ###
@@ -25,9 +29,17 @@ TMPDIR: "$TMPDIR" # path to directory for writing TMP files

SEED: 42 # seed for increased reproducibility. Mainly used in GCbias estimation

### Utility options ###

utility:
GCbias-plot: True
GCbias-correction: True
ichorCNA: True
case-control-plot: True

### trimming ###

trimming_algorithm: "NGmerge" #can be either NGmerge or trimmomatic
PE_trimming_algorithm: "NGmerge" #can be either NGmerge or trimmomatic

#### NGmerge specific options ####

@@ -36,6 +48,7 @@ length-filter:

#### trimmomatic specific options ####

phred-quality-encoding: phred-33 # three options: empty = automatic detection, phred-33 and phred-64

# Illuminaclip takes a fasta file with adapter sequences and removes them in the trimming step.
# The adapter_files option takes either the path to a custom file </PATH/TO/CUSTOM/ADAPTER.fa>
@@ -66,20 +79,18 @@ trimmers:

### Mapping ###

# This option lets you add unpaired/singleton reads in the mapping step that were filtered,
# but are either not paired or not merged. Otherwise these reads are not further processed.
# This option lets you add unmerged/singleton or single-end reads in the mapping step.
# Unmerged or singleton reads are paired end reads that were filtered by samtools fastq or NGmerge.
# Single-end reads are from single end libraries. These categories can be excluded for specialised analyses.
mapping:
unmerged: True # default is True
singleton: True # default is False

### Utility ###

utility:
GCbias-plot: True
GCbias-correction: True
ichorCNA: True
paired_end:
unmerged: True # default is True. Reads not merged by NGmerge.
singleton: False # default is False. Reads that are from paired end libraries without a matching pair.
single_end:
SEreads: True # default is True. This option is essential for Single End libraries. Setting to true in PE libraries has no effect on the output.


### Utility parameters ###

#### ichorCNA ####

@@ -118,3 +129,32 @@ ichorCNA:
scStates: '"c(1,3)"'
txnE: 0.9999
txnStrength: 10000

#### GCbias ####

##### GCbias estimation #####

GCbias_estimation:
normalized_interpolation: True # boolean, True or False. If True, the smooth parameter is normalized such that results are invariant to xdata range and less sensitive to nonuniformity of weights and xdata clumping.


#### Signal extraction ####

minRL: 120 # minimum read length for calculating WPS
maxRL: 180 # maximum read length for calculating WPS
bpProtection: 120 # bp protection for calculating WPS
lengthSR: 76 # length of single reads, if used for calculating WPS

#### Signal processing ####

overlay_mode: "mean" # Can be either "mean" or "median". Sets overlay mode, specifying how regions should be aggregated for each sample.
smoothing: True # Activates smoothing with Savitzky-Golay filter.
smooth_window: 21 # Sets window size used for smoothing with Savitzky-Golay filter.
smooth_polyorder: 2 # Sets order of polynomial used for smoothing with Savitzky-Golay filter.
rolling: True # Activates trend removal with a rolling median filter.
rolling_window: 1000 # Sets window size used in rolling median filter.
flank_norm: True # Activates normalization by dividing the signals by the mean coverage in flanking intervals around the region of interest.
flank: 2000 # Sets the size of the flanking intervals around the region of interest. Should be at most half the length of the extracted signal.
signal: "coverage" # can be either "coverage" or "WPS"
display_window: [-1500,1500]
aggregate_controls: True
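
The `minRL`, `maxRL` and `bpProtection` options in this config parameterise the windowed protection score (WPS). As a rough illustration of one common formulation: for each position, count the fragments that span a window of `bpProtection` bases centered on it, and subtract the fragments that have an endpoint inside that window. The sketch below is not the workflow's code. The function name, the half-open coordinates, the brute-force loop and the handling of fragments that start exactly at the window edge are all assumptions, and the single-read case (`lengthSR`) is not covered.

```python
def wps(fragments, start, end, protection=120, min_len=120, max_len=180):
    """Windowed protection score for positions start..end-1.

    fragments: iterable of (frag_start, frag_end), 0-based, end-exclusive.
    Only fragments with min_len <= length <= max_len are counted.
    """
    half = protection // 2
    kept = [(s, e) for s, e in fragments if min_len <= e - s <= max_len]
    scores = []
    for pos in range(start, end):
        win_start, win_end = pos - half, pos + half  # half-open window
        score = 0
        for s, e in kept:
            if s <= win_start and e >= win_end:
                score += 1  # fragment covers the whole window
            elif win_start <= s < win_end or win_start < e <= win_end:
                score -= 1  # first or last base of the fragment lies in the window
        scores.append(score)
    return scores


# Toy values so the result is easy to follow by hand; real runs would use
# the config defaults (protection=120, min_len=120, max_len=180).
print(wps([(0, 10)], start=0, end=10, protection=4, min_len=5, max_len=20))
```

The nested loop is quadratic and only meant to make the definition readable; a production implementation would typically use interval indexing or pileup-style counting instead.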
3 changes: 3 additions & 0 deletions config/test-regions.tsv
@@ -0,0 +1,3 @@
target path
LYL1 resources/testregions/LYL1.hg38.bed
GRHL2 resources/testregions/GRHL2.hg38.bed
6 changes: 3 additions & 3 deletions config/test-samples.tsv
@@ -1,3 +1,3 @@
ID sample bam fq1 fq2 genome_build library_name platform info
test-run test19_chr20-22 resources/testsample/testsample_hg19_1x_chr20-22.bam - - hg19 ThruPLEX_DNA-seq Illumina_NextSeq_500
test-run test38_chr20-22 resources/testsample/testsample_hg19_1x_chr20-22.bam - - hg38 ThruPLEX_DNA-seq Illumina_NextSeq_500
ID sample bam fq1 fq2 genome_build library_name platform status info
test-run test38_chr20-22 resources/testsample/testsample_hg19_1x_chr20-22.bam - - hg38 ThruPLEX_DNA-seq Illumina_NextSeq_500 healthy
test-run test38_chr20-22-case resources/testsample/testsample_hg19_1x_chr20-22.bam - - hg38 ThruPLEX_DNA-seq Illumina_NextSeq_500 healthy-case