Merge pull request #164 from CKComputomics/umitools
Merging this to the umihandling branch to be able to fix remaining bits there myself more easily. Please wait for the next PR to be opened, will post a link here.
apeltzer authored Jan 11, 2024
2 parents f2541d2 + fcc3ef0 commit 069beb1
Showing 18 changed files with 795 additions and 47 deletions.
42 changes: 13 additions & 29 deletions CHANGELOG.md
@@ -5,7 +5,17 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

## [dev](https://github.com/nf-core/smrnaseq/branch/dev)

- _nothing yet done_
### Parameters

| Old parameter | New parameter |
| ------------- | --------------------------- |
| | `--with_umi` |
| | `--umitools_extract_method` |
| | `--umitools_bc_pattern` |
| | `--umi_discard_read` |
| | `--save_umi_intermeds` |
| | `--umi_merge_unmapped` |


## [v2.2.4](https://github.com/nf-core/smrnaseq/releases/tag/2.2.4) - 2023-11-03

@@ -64,22 +74,6 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- [[#188](https://github.com/nf-core/smrnaseq/pull/188)] - Dropped TrimGalore in favor of fastp QC and adapter trimming, improved handling of adapters and trimming parameters
- [[#194](https://github.com/nf-core/smrnaseq/issues/194)] - Added default adapters file for FastP improved miRNA adapter trimming

### Parameters

| Old parameter | New parameter |
| ------------- | ------------------------ |
| | `--mirgenedb` |
| | `--mirgenedb_species` |
| | `--mirgenedb_gff` |
| | `--mirgenedb_mature` |
| | `--mirgenedb_hairpin` |
| | `--contamination_filter` |
| | `--rrna` |
| | `--trna` |
| | `--cdna` |
| | `--ncrna` |
| | `--pirna` |
| | `--other_contamination` |

## [v2.0.0](https://github.com/nf-core/smrnaseq/releases/tag/2.0.0) - 2022-05-31 Aqua Zinc Chihuahua

@@ -98,20 +92,9 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
### Other enhancements & fixes

- [#134](https://github.com/nf-core/smrnaseq/issues/134) - Fixed colSum of zero issues for edgeR_miRBase.R script
- [#49](https://github.com/nf-core/smrnaseq/issues/49) - Integrated the existing umitools modules into the pipeline and extend the deduplication step.
- [#55](https://github.com/lpantano/seqcluster/pull/55) - update seqcluster to fix UMI-detecting bug

### Parameters

| Old parameter | New parameter |
| -------------------- | ---------------- |
| `--conda` | `--enable_conda` |
| `--clusterOptions` | |
| `--publish_dir_mode` | |

> **NB:** Parameter has been **updated** if both old and new parameter information is present.
> **NB:** Parameter has been **added** if just the new parameter information is present.
> **NB:** Parameter has been **removed** if parameter information isn't present.
### Software dependencies

Note, since the pipeline is now using Nextflow DSL2, each process will be run with its own [Biocontainer](https://biocontainers.pro/#/registry). This means that on occasion it is entirely possible for the pipeline to be using different versions of the same tool. However, the overall software dependency changes compared to the last release have been listed below for reference.
@@ -133,6 +116,7 @@ Note, since the pipeline is now using Nextflow DSL2, each process will be run wi
| `seqkit` | 0.16.0 | 2.0.0 |
| `trim-galore` | 0.6.6 | 0.6.7 |
| `bioconvert` | - | 0.4.3 |
| `umi_tools` | - | 1.1.2 |
| `htseq` | - | - |
| `markdown` | - | - |
| `pymdown-extensions` | - | - |
24 changes: 13 additions & 11 deletions README.md
@@ -28,28 +28,30 @@ You can find numerous talks on the nf-core events page from various topics inclu
## Pipeline summary

1. Raw read QC ([`FastQC`](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/))
2. Adapter trimming ([`Trim Galore!`](https://www.bioinformatics.babraham.ac.uk/projects/trim_galore/))
2. UMI barcode extraction ([`UMI-tools`](https://github.com/CGATOxford/UMI-tools))
3. Adapter trimming ([`Trim Galore!`](https://www.bioinformatics.babraham.ac.uk/projects/trim_galore/))
1. Insert Size calculation
2. Collapse reads ([`seqcluster`](https://seqcluster.readthedocs.io/mirna_annotation.html#processing-of-reads))
3. Contamination filtering ([`Bowtie2`](http://bowtie-bio.sourceforge.net/bowtie2/index.shtml))
4. Alignment against miRBase mature miRNA ([`Bowtie1`](http://bowtie-bio.sourceforge.net/index.shtml))
5. Alignment against miRBase hairpin
4. UMI barcode deduplication ([`UMI-tools`](https://github.com/CGATOxford/UMI-tools))
5. Contamination filtering ([`Bowtie2`](http://bowtie-bio.sourceforge.net/bowtie2/index.shtml))
6. Alignment against miRBase mature miRNA ([`Bowtie1`](http://bowtie-bio.sourceforge.net/index.shtml))
7. Alignment against miRBase hairpin
1. Unaligned reads from step 3 ([`Bowtie1`](http://bowtie-bio.sourceforge.net/index.shtml))
2. Collapsed reads from step 2.2 ([`Bowtie1`](http://bowtie-bio.sourceforge.net/index.shtml))
6. Post-alignment processing of miRBase hairpin
8. Post-alignment processing of miRBase hairpin
1. Basic statistics from step 3 and step 4.1 ([`SAMtools`](https://sourceforge.net/projects/samtools/files/samtools/))
2. Analysis on miRBase, or MirGeneDB hairpin counts ([`edgeR`](https://bioconductor.org/packages/release/bioc/html/edgeR.html))
- TMM normalization and a table of top expression hairpin
- MDS plot clustering samples
- Heatmap of sample similarities
3. miRNA and isomiR annotation from step 4.1 ([`mirtop`](https://github.com/miRTop/mirtop))
7. Alignment against host reference genome ([`Bowtie1`](http://bowtie-bio.sourceforge.net/index.shtml))
9. Alignment against host reference genome ([`Bowtie1`](http://bowtie-bio.sourceforge.net/index.shtml))
1. Post-alignment processing of alignment against host reference genome ([`SAMtools`](https://sourceforge.net/projects/samtools/files/samtools/))
8. Novel miRNAs and known miRNAs discovery ([`MiRDeep2`](https://www.mdc-berlin.de/content/mirdeep2-documentation))
1. Mapping against reference genome with the mapper module
2. Known and novel miRNA discovery with the mirdeep2 module
9. miRNA quality control ([`mirtrace`](https://github.com/friedlanderlab/mirtrace))
10. Present QC for raw read, alignment, and expression results ([`MultiQC`](http://multiqc.info/))
10. Novel miRNAs and known miRNAs discovery ([`MiRDeep2`](https://www.mdc-berlin.de/content/mirdeep2-documentation))
11. Mapping against reference genome with the mapper module
12. Known and novel miRNA discovery with the mirdeep2 module
13. miRNA quality control ([`mirtrace`](https://github.com/friedlanderlab/mirtrace))
14. Present QC for raw read, alignment, and expression results ([`MultiQC`](http://multiqc.info/))

## Usage

89 changes: 87 additions & 2 deletions conf/modules.config
@@ -77,6 +77,7 @@ process {
//
// Read QC and trimming options
//

process {
withName: 'MIRTRACE_RUN' {
publishDir = [
@@ -89,15 +90,15 @@ process {

if (!(params.skip_fastqc)) {
process {
withName: '.*:FASTQC_FASTP:FASTQC_.*' {
withName: '.*:FASTQC_UMITOOLS_FASTP:FASTQC_.*' {
ext.args = '--quiet'
}
}
}

if (!params.skip_fastp) {
process {
withName: 'FASTP' {
withName: '.*:FASTQC_UMITOOLS_FASTP:FASTP' {
ext.args = [ "",
params.trim_fastq ? "" : "--disable_adapter_trimming",
params.clip_r1 > 0 ? "--trim_front1 ${params.clip_r1}" : "", // Remove bp from the 5' end of read 1.
@@ -142,6 +143,90 @@ if (!params.skip_fastp) {
}
}

if (params.with_umi && !params.skip_umi_extract) {
    process {
        withName: '.*:FASTQC_UMITOOLS_TRIMGALORE:UMITOOLS_EXTRACT' {
            ext.args = [
                params.umitools_extract_method ? "--extract-method=${params.umitools_extract_method}" : '',
                params.umitools_bc_pattern ? "--bc-pattern='${params.umitools_bc_pattern}'" : '',
            ].join(' ').trim()
            publishDir = [
                [
                    path: { "${params.outdir}/umitools" },
                    mode: params.publish_dir_mode,
                    pattern: "*.log"
                ],
                [
                    path: { "${params.outdir}/umitools" },
                    mode: params.publish_dir_mode,
                    pattern: "*.fastq.gz",
                    enabled: params.save_umi_intermeds
                ]
            ]
        }
    }
}

//
// UMI tools deduplication
//

if (params.with_umi) {
    process {
        withName: '.*:DEDUPLICATE_UMIS:UMITOOLS_DEDUP' {
            ext.args = { meta.single_end ? '' : '--unpaired-reads=discard --chimeric-pairs=discard' }
            ext.prefix = { "${meta.id}.umi_dedup.sorted" }
            publishDir = [
                [
                    path: { "${params.outdir}/umi_dedup/umitools" },
                    mode: params.publish_dir_mode,
                    pattern: '*.tsv'
                ],
                [
                    path: { "${params.outdir}/umi_dedup" },
                    mode: params.publish_dir_mode,
                    pattern: '*.bam',
                    enabled: (
                        params.save_umi_intermeds
                    )
                ]
            ]
        }

        withName: '.*:DEDUPLICATE_UMIS:BAM_SORT_SAMTOOLS:SAMTOOLS_SORT' {
            ext.prefix = { "${meta.id}.sorted" }
            publishDir = [
                path: { "${params.outdir}/umi_dedup" },
                mode: params.publish_dir_mode,
                pattern: '*.{bam}',
                enabled: (
                    params.save_umi_intermeds
                )
            ]
        }

        withName: '.*:DEDUPLICATE_UMIS:BAM_SORT_SAMTOOLS:SAMTOOLS_INDEX' {
            ext.prefix = { "${meta.id}.sorted" }
            publishDir = [
                path: { "${params.outdir}/umi_dedup" },
                mode: params.publish_dir_mode,
                pattern: '*.{bai,csi}',
                enabled: (
                    params.save_umi_intermeds
                )
            ]
        }

        withName: '.*:DEDUPLICATE_UMIS:BAM_SORT_SAMTOOLS:BAM_STATS_SAMTOOLS:.*' {
            publishDir = [
                path: { "${params.outdir}/umi_dedup/samtools_stats" },
                mode: params.publish_dir_mode,
                pattern: '*.{stats,flagstat,idxstats}'
            ]
        }
    }
}

//
// Quantification
//
30 changes: 30 additions & 0 deletions docs/output.md
@@ -13,6 +13,8 @@ The directories listed below will be created in the results directory after the
The pipeline is built using [Nextflow](https://www.nextflow.io/) and processes data using the following steps:

- [FastQC](#fastqc) - read quality control
- [UMI-tools extract](#umi-tools-extract) - UMI barcode extraction
- [UMI-tools deduplicate](#umi-tools-deduplicate) - read deduplication
- [FastP](#fastp) - adapter trimming
- [Bowtie2](#bowtie2) - contamination filtering
- [Bowtie](#bowtie) - alignment against mature miRNAs and miRNA precursors (hairpins)
@@ -40,6 +42,21 @@ The pipeline is built using [Nextflow](https://www.nextflow.io/) and processes d

![MultiQC - FastQC sequence counts plot](images/mqc_fastqc_counts.png)

## UMI-tools extract

<details markdown="1">
<summary>Output files</summary>

- `umitools/`
  - `*.fastq.gz`: If `--save_umi_intermeds` is specified, FastQ files **after** UMI extraction will be placed in this directory.
  - `*.log`: Log file generated by the UMI-tools `extract` command.

</details>

[UMI-tools](https://github.com/CGATOxford/UMI-tools) deduplicates reads based on unique molecular identifiers (UMIs) to address PCR-bias. Firstly, the UMI-tools `extract` command removes the UMI barcode information from the read sequence and adds it to the read name. Secondly, reads are deduplicated based on UMI identifier after mapping as highlighted in the [UMI-tools deduplicate](#umi-tools-deduplicate) section.

To facilitate processing of input data which has the UMI barcode already embedded in the read name from the start, `--skip_umi_extract` can be specified in conjunction with `--with_umi`.
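
The extraction step described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the UMI-tools implementation: the function name `extract_umi` is hypothetical, it only handles a 5' pattern made entirely of `N` characters (such as `NNNNNN`), and it assumes the convention of appending the UMI to the read identifier after an underscore.

```python
def extract_umi(read_name, sequence, quality, bc_pattern="NNNNNN"):
    """Move the leading UMI bases from the read sequence into the read name.

    Hypothetical helper for illustration only; supports patterns made of 'N'.
    """
    if not bc_pattern or set(bc_pattern) != {"N"}:
        raise ValueError("this sketch only supports patterns made of 'N'")
    umi_len = len(bc_pattern)
    if len(sequence) < umi_len:
        raise ValueError("read is shorter than the UMI pattern")

    umi = sequence[:umi_len]

    # Append the UMI to the read identifier (the first whitespace-delimited
    # token), leaving any trailing description untouched.
    parts = read_name.split(" ", 1)
    parts[0] = f"{parts[0]}_{umi}"
    new_name = " ".join(parts)

    # Trim the UMI bases from both the sequence and the quality string.
    return new_name, sequence[umi_len:], quality[umi_len:]


if __name__ == "__main__":
    print(extract_umi("@read1 1:N:0", "ACGTAGTTTTGGGG", "IIIIIIIIIIIIII"))
```

When reads already carry the UMI in their name in this form, no trimming is needed, which is the situation `--skip_umi_extract` is meant for.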

## FastP

[FastP](https://github.com/OpenGene/fastp) is used for removal of adapter contamination and trimming of low quality regions.
@@ -55,6 +72,19 @@ Contains FastQ files with quality and adapter trimmed reads for each sample, alo

FastP can automatically detect adapter sequences when not specified directly by the user - the pipeline also comes with a feature and a supplied miRNA adapters file to ensure adapters auto-detected are more accurate. If there are needs to add more known miRNA adapters to this list, please open a pull request.

## UMI-tools deduplicate

<details markdown="1">
<summary>Output files</summary>

- `umi_dedup/`
  - `*.tsv`: Results statistics files detailing the UMI deduplication results.
  - `*.bam`: If `--save_umi_intermeds` is specified, the deduplicated bam files **after** UMI deduplication will be placed in this directory. In addition the sorted and indexed files will be placed there as well.
  - `samtools_stats/`
    - `*.{stats,flagstat,idxstats}`: Statistics on the mappings underlying the UMI deduplication.
</details>

[UMI-tools](https://github.com/CGATOxford/UMI-tools) deduplicates reads based on unique molecular identifiers (UMIs) to address PCR-bias. Firstly, the UMI-tools `extract` command removes the UMI barcode information from the read sequence and adds it to the read name as highlighted in the [UMI-tools extract](#umi-tools-extract) section. The reads are deduplicated based on an alignment against the full genome of the species. The deduplicated reads are then converted into fastq format and merged with the reads that remained unmapped in order to reduce potential reference bias. This behavior can be stopped by setting `--umi_merge_unmapped false`. The resulting fastq files are used in the remaining steps of the pipeline.
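
As a rough illustration of position-plus-UMI deduplication, the sketch below keeps one read per (contig, position, UMI) group. It is a simplification under stated assumptions: the function name `dedup_by_umi` and the tuple layout are hypothetical, UMIs are compared by exact match only, and the UMI is assumed to be the suffix after the last underscore in the read name. UMI-tools itself also offers error-aware grouping methods, which this sketch does not attempt to reproduce.

```python
def dedup_by_umi(alignments):
    """Keep the first read seen for each (contig, position, UMI) group.

    `alignments` is an iterable of (read_name, contig, position) tuples.
    Hypothetical helper for illustration; uses exact UMI matching only.
    """
    kept = {}
    for read_name, contig, position in alignments:
        # The UMI is assumed to follow the last underscore in the read name.
        umi = read_name.rsplit("_", 1)[-1]
        key = (contig, position, umi)
        # setdefault stores the read only if the group has not been seen yet.
        kept.setdefault(key, read_name)
    # Dicts preserve insertion order, so reads come back in first-seen order.
    return list(kept.values())


if __name__ == "__main__":
    reads = [
        ("r1_AAAA", "chr1", 100),
        ("r2_AAAA", "chr1", 100),  # same position and UMI as r1: duplicate
        ("r3_CCCC", "chr1", 100),  # same position, different UMI: kept
        ("r4_AAAA", "chr1", 200),  # same UMI, different position: kept
    ]
    print(dedup_by_umi(reads))
```

Reads that did not align have no position to group on, which is why the pipeline handles them separately and, unless `--umi_merge_unmapped false` is set, merges them back with the deduplicated reads.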

## Bowtie2

[Bowtie2](http://bowtie-bio.sourceforge.net/bowtie2/index.shtml) is used to align the reads to user-defined databases of contaminants.
1 change: 0 additions & 1 deletion modules/local/mirdeep2_run.nf
@@ -40,4 +40,3 @@ process MIRDEEP2_RUN {
END_VERSIONS
"""
}

62 changes: 62 additions & 0 deletions modules/nf-core/modules/cat/cat/main.nf

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

37 changes: 37 additions & 0 deletions modules/nf-core/modules/cat/cat/meta.yml

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.
