[Feat] Add UMI Handling to the pipeline #164

Merged Jan 11, 2024: 30 commits (changes shown from 26 commits)

Commits
d783ef9
REVERT CHANGES
May 10, 2022
1043932
INCLUDE UMITOOLS WORKFLOW
May 10, 2022
27fd482
ADD DOCUMENTATION
May 13, 2022
ee673b0
ADD SAMTOOLS BAM2FQ MODULE
Jun 13, 2022
0bc65e4
ADD UMITOOLS EXTRACT ARGS
Jun 14, 2022
8d14f90
UPDATE MODULES.CONFIG
Jun 15, 2022
23f96d8
INCLUDE UMITOOLS DEDUP WORKFLOW
Jun 15, 2022
944d277
INCLUDE UMITOOLS DEDUP
Jun 15, 2022
ddb3dba
ADD SAMTOOLS SORT CONFIG
CKComputomics Jun 15, 2022
b2ef66a
FIX TYPO
CKComputomics Jun 15, 2022
29ec7da
ADD DEDUP DOCUMENTATION
CKComputomics Jun 15, 2022
afa1ad7
ADD DEDUP STEP
CKComputomics Jun 15, 2022
c72ac5b
ADD UMITOOLS VERSION
CKComputomics Jun 15, 2022
f442289
MERGE DEDUPLICATED AND UNMAPPED READS AFTER DEDUPLICATION
CKComputomics Jun 20, 2022
f9ca542
ADD MISSING OPTION
CKComputomics Jun 20, 2022
b974717
ADD NEWLINE
CKComputomics Jun 20, 2022
4610be1
CLEAN CODE
CKComputomics Jun 21, 2022
67b2cac
ADD DOCUMENTATION
CKComputomics Jun 21, 2022
23fc985
ADD UMI_MERGE_UNMAPPED COMMAND
CKComputomics Jun 21, 2022
be241ea
FINALIZE DOCUMENTATION
CKComputomics Jun 21, 2022
8b433f1
UPDATE MAIL TEMPLATE
CKComputomics Jun 21, 2022
0e732ed
CHANGE DAG OUTPUT TO HTML
CKComputomics Jun 21, 2022
8f426b5
PLEASE PRETTIER
CKComputomics Jun 21, 2022
8e132fb
Merge branch 'dev' into umitools
CKComputomics Jun 21, 2022
039843f
FIX MERGE ERROR
CKComputomics Jun 21, 2022
53c097c
MAKE PRETTIER HAPPY
CKComputomics Jun 21, 2022
608c414
ADD NF-CORE CAT
CKComputomics Jun 22, 2022
6d305c2
REPLACE CUSTOM CAT WITH NF-CORE CAT
CKComputomics Jun 22, 2022
57a8dba
REMOVE UNUSED MODULE
CKComputomics Jun 22, 2022
fcc3ef0
Merge branch 'umi-handling' into umitools
apeltzer Jan 11, 2024
48 changes: 22 additions & 26 deletions CHANGELOG.md
@@ -16,20 +16,26 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

### Parameters

| Old parameter | New parameter |
| ------------- | ------------------------ |
| | `--mirGeneDB` |
| | `--mirGeneDB_species` |
| | `--mirGeneDB_gff` |
| | `--mirGeneDB_mature` |
| | `--mirGeneDB_hairpin` |
| | `--contamination_filter` |
| | `--rrna` |
| | `--trna` |
| | `--cdna` |
| | `--ncrna` |
| | `--pirna` |
| | `--other_contamination` |
| Old parameter | New parameter |
| ------------- | --------------------------- |
| | `--mirGeneDB` |
| | `--mirGeneDB_species` |
| | `--mirGeneDB_gff` |
| | `--mirGeneDB_mature` |
| | `--mirGeneDB_hairpin` |
| | `--contamination_filter` |
| | `--rrna` |
| | `--trna` |
| | `--cdna` |
| | `--ncrna` |
| | `--pirna` |
| | `--other_contamination` |
| | `--with_umi` |
| | `--umitools_extract_method` |
| | `--umitools_bc_pattern` |
| | `--umi_discard_read` |
| | `--save_umi_intermeds` |
| | `--umi_merge_unmapped` |

## [v2.0.0](https://github.com/nf-core/smrnaseq/releases/tag/2.0.0) - 2022-05-31 Aqua Zinc Chihuahua

@@ -48,20 +54,9 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
### Other enhancements & fixes

- [#134](https://github.com/nf-core/smrnaseq/issues/134) - Fixed colSum of zero issues for edgeR_miRBase.R script
- [#49](https://github.com/nf-core/smrnaseq/issues/49) - Integrated the existing umitools modules into the pipeline and extended the deduplication step.
- [#55](https://github.com/lpantano/seqcluster/pull/55) - Updated seqcluster to fix a UMI-detection bug

### Parameters

| Old parameter | New parameter |
| -------------------- | ---------------- |
| `--conda` | `--enable_conda` |
| `--clusterOptions` | |
| `--publish_dir_mode` | |

> **NB:** Parameter has been **updated** if both old and new parameter information is present.
> **NB:** Parameter has been **added** if just the new parameter information is present.
> **NB:** Parameter has been **removed** if parameter information isn't present.

### Software dependencies

Note, since the pipeline is now using Nextflow DSL2, each process will be run with its own [Biocontainer](https://biocontainers.pro/#/registry). This means that on occasion it is entirely possible for the pipeline to be using different versions of the same tool. However, the overall software dependency changes compared to the last release have been listed below for reference.
@@ -83,6 +78,7 @@ Note, since the pipeline is now using Nextflow DSL2, each process will be run wi
| `seqkit` | 0.16.0 | 2.0.0 |
| `trim-galore` | 0.6.6 | 0.6.7 |
| `bioconvert` | - | 0.4.3 |
| `umi_tools` | - | 1.1.2 |
| `htseq` | - | - |
| `markdown` | - | - |
| `pymdown-extensions` | - | - |
24 changes: 13 additions & 11 deletions README.md
@@ -28,28 +28,30 @@ On release, automated continuous integration tests run the pipeline on a full-si
## Pipeline summary

1. Raw read QC ([`FastQC`](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/))
2. Adapter trimming ([`Trim Galore!`](https://www.bioinformatics.babraham.ac.uk/projects/trim_galore/))
2. UMI barcode extraction ([`UMI-tools`](https://github.com/CGATOxford/UMI-tools))
3. Adapter trimming ([`Trim Galore!`](https://www.bioinformatics.babraham.ac.uk/projects/trim_galore/))
1. Insert Size calculation
2. Collapse reads ([`seqcluster`](https://seqcluster.readthedocs.io/mirna_annotation.html#processing-of-reads))
3. Contamination filtering ([`Bowtie2`](http://bowtie-bio.sourceforge.net/bowtie2/index.shtml))
4. Alignment against miRBase mature miRNA ([`Bowtie1`](http://bowtie-bio.sourceforge.net/index.shtml))
5. Alignment against miRBase hairpin
4. UMI barcode deduplication ([`UMI-tools`](https://github.com/CGATOxford/UMI-tools))
5. Contamination filtering ([`Bowtie2`](http://bowtie-bio.sourceforge.net/bowtie2/index.shtml))
6. Alignment against miRBase mature miRNA ([`Bowtie1`](http://bowtie-bio.sourceforge.net/index.shtml))
7. Alignment against miRBase hairpin
1. Unaligned reads from step 3 ([`Bowtie1`](http://bowtie-bio.sourceforge.net/index.shtml))
2. Collapsed reads from step 2.2 ([`Bowtie1`](http://bowtie-bio.sourceforge.net/index.shtml))
6. Post-alignment processing of miRBase hairpin
8. Post-alignment processing of miRBase hairpin
1. Basic statistics from step 3 and step 4.1 ([`SAMtools`](https://sourceforge.net/projects/samtools/files/samtools/))
2. Analysis on miRBase, or MirGeneDB hairpin counts ([`edgeR`](https://bioconductor.org/packages/release/bioc/html/edgeR.html))
- TMM normalization and a table of top expression hairpin
- MDS plot clustering samples
- Heatmap of sample similarities
3. miRNA and isomiR annotation from step 4.1 ([`mirtop`](https://github.com/miRTop/mirtop))
7. Alignment against host reference genome ([`Bowtie1`](http://bowtie-bio.sourceforge.net/index.shtml))
9. Alignment against host reference genome ([`Bowtie1`](http://bowtie-bio.sourceforge.net/index.shtml))
1. Post-alignment processing of alignment against host reference genome ([`SAMtools`](https://sourceforge.net/projects/samtools/files/samtools/))
8. Novel miRNAs and known miRNAs discovery ([`MiRDeep2`](https://www.mdc-berlin.de/content/mirdeep2-documentation))
1. Mapping against reference genome with the mapper module
2. Known and novel miRNA discovery with the mirdeep2 module
9. miRNA quality control ([`mirtrace`](https://github.com/friedlanderlab/mirtrace))
10. Present QC for raw read, alignment, and expression results ([`MultiQC`](http://multiqc.info/))
10. Novel miRNAs and known miRNAs discovery ([`MiRDeep2`](https://www.mdc-berlin.de/content/mirdeep2-documentation))
11. Mapping against reference genome with the mapper module
12. Known and novel miRNA discovery with the mirdeep2 module
13. miRNA quality control ([`mirtrace`](https://github.com/friedlanderlab/mirtrace))
14. Present QC for raw read, alignment, and expression results ([`MultiQC`](http://multiqc.info/))

## Quick Start

89 changes: 87 additions & 2 deletions conf/modules.config
@@ -77,6 +77,7 @@ process {
//
// Read QC and trimming options
//

process {
withName: 'MIRTRACE_RUN' {
publishDir = [
@@ -89,15 +90,15 @@ process {

if (!(params.skip_fastqc || params.skip_qc)) {
process {
withName: '.*:FASTQC_TRIMGALORE:FASTQC' {
withName: '.*:FASTQC_UMITOOLS_TRIMGALORE:FASTQC' {
ext.args = '--quiet'
}
}
}

if (!params.skip_trimming) {
process {
withName: '.*:FASTQC_TRIMGALORE:TRIMGALORE' {
withName: '.*:FASTQC_UMITOOLS_TRIMGALORE:TRIMGALORE' {
ext.args = '--fastqc'
publishDir = [
[
@@ -115,6 +116,90 @@ if (!params.skip_trimming) {
}
}

if (params.with_umi && !params.skip_umi_extract) {
process {
withName: '.*:FASTQC_UMITOOLS_TRIMGALORE:UMITOOLS_EXTRACT' {
ext.args = [
params.umitools_extract_method ? "--extract-method=${params.umitools_extract_method}" : '',
params.umitools_bc_pattern ? "--bc-pattern='${params.umitools_bc_pattern}'" : '',
].join(' ').trim()
publishDir = [
[
path: { "${params.outdir}/umitools" },
mode: params.publish_dir_mode,
pattern: "*.log"
],
[
path: { "${params.outdir}/umitools" },
mode: params.publish_dir_mode,
pattern: "*.fastq.gz",
enabled: params.save_umi_intermeds
]
]
}
}
}

//
// UMI tools deduplication
//

if (params.with_umi) {
process {
withName: '.*:DEDUPLICATE_UMIS:UMITOOLS_DEDUP' {
ext.args = { meta.single_end ? '' : '--unpaired-reads=discard --chimeric-pairs=discard' }
ext.prefix = { "${meta.id}.umi_dedup.sorted" }
publishDir = [
[
path: { "${params.outdir}/umi_dedup/umitools" },
mode: params.publish_dir_mode,
pattern: '*.tsv'
],
[
path: { "${params.outdir}/umi_dedup" },
mode: params.publish_dir_mode,
pattern: '*.bam',
enabled: (
params.save_umi_intermeds
)
]
]
}

withName: '.*:DEDUPLICATE_UMIS:BAM_SORT_SAMTOOLS:SAMTOOLS_SORT' {
ext.prefix = { "${meta.id}.sorted" }
publishDir = [
path: { "${params.outdir}/umi_dedup" },
mode: params.publish_dir_mode,
pattern: '*.{bam}',
enabled: (
params.save_umi_intermeds
)
]
}

withName: '.*:DEDUPLICATE_UMIS:BAM_SORT_SAMTOOLS:SAMTOOLS_INDEX' {
ext.prefix = { "${meta.id}.sorted" }
publishDir = [
path: { "${params.outdir}/umi_dedup" },
mode: params.publish_dir_mode,
pattern: '*.{bai,csi}',
enabled: (
params.save_umi_intermeds
)
]
}

withName: '.*:DEDUPLICATE_UMIS:BAM_SORT_SAMTOOLS:BAM_STATS_SAMTOOLS:.*' {
publishDir = [
path: { "${params.outdir}/umi_dedup/samtools_stats" },
mode: params.publish_dir_mode,
pattern: '*.{stats,flagstat,idxstats}'
]
}
}
}

//
// Quantification
//
30 changes: 30 additions & 0 deletions docs/output.md
@@ -13,7 +13,9 @@ The directories listed below will be created in the results directory after the
The pipeline is built using [Nextflow](https://www.nextflow.io/) and processes data using the following steps:

- [FastQC](#fastqc) - read quality control
- [UMI-tools extract](#umi-tools-extract) - UMI barcode extraction
- [TrimGalore](#trimgalore) - adapter trimming
- [UMI-tools deduplicate](#umi-tools-deduplicate) - read deduplication
- [Bowtie2](#bowtie2) - contamination filtering
- [Bowtie](#bowtie) - alignment against mature miRNAs and miRNA precursors (hairpins)
- [SAMtools](#samtools) - alignment result processing and feature counting
@@ -40,6 +42,21 @@ The pipeline is built using [Nextflow](https://www.nextflow.io/) and processes d

![MultiQC - FastQC sequence counts plot](images/mqc_fastqc_counts.png)

## UMI-tools extract

<details markdown="1">
<summary>Output files</summary>

- `umitools/`
- `*.fastq.gz`: If `--save_umi_intermeds` is specified, FastQ files **after** UMI extraction will be placed in this directory.
- `*.log`: Log file generated by the UMI-tools `extract` command.

</details>

[UMI-tools](https://github.com/CGATOxford/UMI-tools) deduplicates reads based on unique molecular identifiers (UMIs) to address PCR bias. First, the UMI-tools `extract` command removes the UMI barcode from the read sequence and appends it to the read name. Second, after mapping, reads are deduplicated based on their UMI, as described in the [UMI-tools deduplicate](#umi-tools-deduplicate) section.

For input data that already has the UMI barcode embedded in the read name, specify `--skip_umi_extract` together with `--with_umi` to skip the extraction step.
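
The extraction step described above can be illustrated with a minimal Python sketch. It is not the UMI-tools implementation: the function name `extract_umi` is hypothetical, and the sketch assumes the simplest case, where the UMI occupies the first bases of the read (a barcode pattern such as `NNNNNN`) and is appended to the read ID with an underscore.

```python
def extract_umi(name, seq, qual, bc_pattern="NNNNNN"):
    """Move a 5' UMI from the read sequence into the read name.

    Hypothetical helper for illustration only. Assumes every position in
    bc_pattern is a UMI base located at the start of the read.
    """
    umi_len = len(bc_pattern)
    if len(seq) < umi_len:
        raise ValueError("read is shorter than the UMI pattern")
    umi = seq[:umi_len]
    # Keep any comment (text after the first space) behind the tagged read ID.
    read_id, sep, comment = name.partition(" ")
    new_name = f"{read_id}_{umi}{sep}{comment}"
    # Trim sequence and quality string by the same number of bases.
    return new_name, seq[umi_len:], qual[umi_len:]


record = extract_umi("@read1 1:N:0", "ACGTACTTTTGGGG", "I" * 14)
print(record)
```

Because the UMI now travels in the read name, it survives adapter trimming and alignment and is still available when the mapped reads are deduplicated.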

## TrimGalore

[TrimGalore](http://www.bioinformatics.babraham.ac.uk/projects/trim_galore/) is used for removal of adapter contamination and trimming of low quality regions. TrimGalore uses [Cutadapt](https://github.com/marcelm/cutadapt) for adapter trimming and runs FastQC after it finishes.
@@ -59,6 +76,19 @@ This is an example of the output we can get:

![cutadapt](images/cutadapt_plot.png)

## UMI-tools deduplicate

<details markdown="1">
<summary>Output files</summary>

- `umi_dedup/`
- `*.tsv`: Results statistics files detailing the UMI deduplication results.
- `*.bam`: If `--save_umi_intermeds` is specified, the BAM files **after** UMI deduplication are placed in this directory, together with their sorted and indexed versions.
- `samtools_stats/`
- `*.{stats,flagstat,idxstats}`: Statistics on the mappings underlying the UMI deduplication.
</details>

[UMI-tools](https://github.com/CGATOxford/UMI-tools) deduplicates reads based on unique molecular identifiers (UMIs) to address PCR bias. First, the UMI-tools `extract` command removes the UMI barcode from the read sequence and appends it to the read name, as described in the [UMI-tools extract](#umi-tools-extract) section. Second, the reads are aligned against the full genome of the species and deduplicated based on that alignment. The deduplicated reads are then converted back to FASTQ format and merged with the reads that remained unmapped, to reduce potential reference bias. This merging can be disabled by setting `--umi_merge_unmapped false`. The resulting FASTQ files are used in the remaining steps of the pipeline.
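
As a rough illustration of alignment-based deduplication, the sketch below keeps one read per combination of mapping position and UMI. It is a simplification under stated assumptions, not what the pipeline runs: `Read` and `dedup_exact` are hypothetical names, the UMI is assumed to be the suffix after the last underscore in the read name, and only exact UMI matches are collapsed, whereas UMI-tools by default also accounts for sequencing errors in the UMI.

```python
from collections import namedtuple

# Hypothetical minimal read record; a real pipeline would read these from BAM.
Read = namedtuple("Read", "name chrom pos strand mapq")


def dedup_exact(reads):
    """Keep the highest-MAPQ read for each (position, strand, UMI) group."""
    best = {}
    for read in reads:
        # Assumption: the UMI was appended to the read name as "_<UMI>".
        umi = read.name.rsplit("_", 1)[-1]
        key = (read.chrom, read.pos, read.strand, umi)
        if key not in best or read.mapq > best[key].mapq:
            best[key] = read
    return list(best.values())


reads = [
    Read("r1_ACGT", "chr1", 100, "+", 30),
    Read("r2_ACGT", "chr1", 100, "+", 40),  # same position and UMI as r1
    Read("r3_TTTT", "chr1", 100, "+", 20),  # same position, different UMI
    Read("r4_ACGT", "chr1", 200, "+", 10),  # same UMI, different position
]
print([r.name for r in dedup_exact(reads)])
```

Reads that share a position but carry different UMIs are kept as distinct molecules, which is the point of using UMIs rather than position alone. Unmapped reads have no position to group on, which is why the pipeline handles them separately and can merge them back afterwards.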

## Bowtie2

[Bowtie2](http://bowtie-bio.sourceforge.net/bowtie2/index.shtml) is used to align the reads to user-defined databases of contaminants.
9 changes: 9 additions & 0 deletions modules.json
@@ -15,6 +15,9 @@
"multiqc": {
"git_sha": "49b18b1639f4f7104187058866a8fab33332bdfe"
},
"samtools/bam2fq": {
"git_sha": "5510ea39fe638594bc26ac34cadf4a84bf27d159"
},
"samtools/flagstat": {
"git_sha": "1ad73f1b2abdea9398680d6d20014838135c9a35"
},
@@ -32,6 +35,12 @@
},
"trimgalore": {
"git_sha": "e745e167c1020928ef20ea1397b6b4d230681b4d"
},
"umitools/dedup": {
"git_sha": "f425aa3cea10015fe9b345b9d6dcc2336b53155f"
},
"umitools/extract": {
"git_sha": "e745e167c1020928ef20ea1397b6b4d230681b4d"
}
}
}
21 changes: 21 additions & 0 deletions modules/local/join_reads.nf
@@ -0,0 +1,21 @@
process JOIN_FASTQS {
tag "$meta.id"
label 'process_medium'

conda (params.enable_conda ? 'bioconda::samtools=1.13' : null)
container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ?
'https://depot.galaxyproject.org/singularity/mulled-v2-ffbf83a6b0ab6ec567a336cf349b80637135bca3:40128b496751b037e2bd85f6789e83d4ff8a4837-0' :
'quay.io/biocontainers/mulled-v2-ffbf83a6b0ab6ec567a336cf349b80637135bca3:40128b496751b037e2bd85f6789e83d4ff8a4837-0' }"

input:
tuple val(meta), path(reads)
tuple val(unmapped_meta), path(unmapped_reads)

output:
tuple val(meta), path('*_merged.fq.gz'), emit: merged
script:
"""
cat ${reads} ${unmapped_reads} > ${meta.id}_merged.fq.gz
"""

}
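
The `cat` call in `JOIN_FASTQS` writes a `.fq.gz` file without recompressing. This works if both inputs are already gzip-compressed, because the gzip format permits several compressed members back to back in one stream. The short Python check below demonstrates that property with made-up FASTQ records; `concat_gzip` is a hypothetical helper, not part of the pipeline.

```python
import gzip


def concat_gzip(chunks):
    """Compress each chunk separately and join the raw gzip members,
    mimicking `cat a.fq.gz b.fq.gz > merged.fq.gz`."""
    return b"".join(gzip.compress(chunk) for chunk in chunks)


# Made-up records standing in for deduplicated and unmapped reads.
dedup = b"@r1_ACGT\nTTTT\n+\nIIII\n"
unmapped = b"@r2_GGCA\nCCCC\n+\nIIII\n"

merged = concat_gzip([dedup, unmapped])

# Decompressing the joined stream yields both inputs in order.
assert gzip.decompress(merged) == dedup + unmapped
```

If either input were uncompressed, the merged file would no longer be a valid gzip stream, so the sketch only holds under the assumption that both channels emit gzipped FASTQ.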
1 change: 0 additions & 1 deletion modules/local/mirdeep2_run.nf
@@ -37,4 +37,3 @@ process MIRDEEP2_RUN {
END_VERSIONS
"""
}

56 changes: 56 additions & 0 deletions modules/nf-core/modules/samtools/bam2fq/main.nf
