Skip to content

Commit

Permalink
Merge pull request #156 from CKComputomics/contaminationFiltering
Browse files Browse the repository at this point in the history
Contamination filtering
  • Loading branch information
apeltzer authored Jun 17, 2022
2 parents fb058cc + 6395315 commit 8d66121
Show file tree
Hide file tree
Showing 12 changed files with 416 additions and 24 deletions.
33 changes: 22 additions & 11 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -12,16 +12,24 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
### Other enhancements

- [#55](https://github.com/nf-core/smrnaseq/issues/12) - Enabled the use of `MirGeneDB` as an alternative database insted of `miRBase`
- [#113](https://github.com/nf-core/smrnaseq/issues/113) - Added a optional contamination filtering step, including MultiQC plot.

### Parameters

| Old parameter | New parameter |
| ------------- | --------------------- |
| | `--mirGeneDB` |
| | `--mirGeneDB_species` |
| | `--mirGeneDB_gff` |
| | `--mirGeneDB_mature` |
| | `--mirGeneDB_hairpin` |
| Old parameter | New parameter |
| ------------- | ------------------------ |
| | `--mirGeneDB` |
| | `--mirGeneDB_species` |
| | `--mirGeneDB_gff` |
| | `--mirGeneDB_mature` |
| | `--mirGeneDB_hairpin` |
| | `--contamination_filter` |
| | `--rrna` |
| | `--trna` |
| | `--cdna` |
| | `--ncrna` |
| | `--pirna` |
| | `--other_contamination` |

## [v2.0.0](https://github.com/nf-core/smrnaseq/releases/tag/2.0.0) - 2022-05-31 Aqua Zinc Chihuahua

Expand All @@ -44,10 +52,11 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

### Parameters

| Old parameter | New parameter |
| ------------------ | ---------------- |
| `--conda` | `--enable_conda` |
| `--clusterOptions` | |
| Old parameter | New parameter |
| -------------------- | ---------------- |
| `--conda` | `--enable_conda` |
| `--clusterOptions` | |
| `--publish_dir_mode` | |

> **NB:** Parameter has been **updated** if both old and new parameter information is present.
> **NB:** Parameter has been **added** if just the new parameter information is present.
Expand Down Expand Up @@ -79,6 +88,8 @@ Note, since the pipeline is now using Nextflow DSL2, each process will be run wi
| `pymdown-extensions` | - | - |
| `pygments` | - | - |
| `r-r.methodss3` | - | - |
| `bowtie2` | - | 2.4.5 |
| `blat` | - | 36 |

> **NB:** Dependency has been **updated** if both old and new version information is present.
> **NB:** Dependency has been **added** if just the new version information is present.
Expand Down
15 changes: 8 additions & 7 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -31,24 +31,25 @@ On release, automated continuous integration tests run the pipeline on a full-si
2. Adapter trimming ([`Trim Galore!`](https://www.bioinformatics.babraham.ac.uk/projects/trim_galore/))
1. Insert Size calculation
2. Collapse reads ([`seqcluster`](https://seqcluster.readthedocs.io/mirna_annotation.html#processing-of-reads))
3. Alignment against miRBase or MirGeneDB mature miRNA ([`Bowtie1`](http://bowtie-bio.sourceforge.net/index.shtml))
4. Alignment against miRBase or MirGeneDB hairpin
3. Contamination filtering ([`Bowtie2`](http://bowtie-bio.sourceforge.net/bowtie2/index.shtml))
4. Alignment against miRBase mature miRNA ([`Bowtie1`](http://bowtie-bio.sourceforge.net/index.shtml))
5. Alignment against miRBase hairpin
1. Unaligned reads from step 3 ([`Bowtie1`](http://bowtie-bio.sourceforge.net/index.shtml))
2. Collapsed reads from step 2.2 ([`Bowtie1`](http://bowtie-bio.sourceforge.net/index.shtml))
5. Post-alignment processing of miRBase, or MirGeneDB hairpin
6. Post-alignment processing of miRBase hairpin
1. Basic statistics from step 3 and step 4.1 ([`SAMtools`](https://sourceforge.net/projects/samtools/files/samtools/))
2. Analysis on miRBase, or MirGeneDB hairpin counts ([`edgeR`](https://bioconductor.org/packages/release/bioc/html/edgeR.html))
- TMM normalization and a table of top expression hairpin
- MDS plot clustering samples
- Heatmap of sample similarities
3. miRNA and isomiR annotation from step 4.1 ([`mirtop`](https://github.com/miRTop/mirtop))
6. Alignment against host reference genome ([`Bowtie1`](http://bowtie-bio.sourceforge.net/index.shtml))
7. Alignment against host reference genome ([`Bowtie1`](http://bowtie-bio.sourceforge.net/index.shtml))
1. Post-alignment processing of alignment against host reference genome ([`SAMtools`](https://sourceforge.net/projects/samtools/files/samtools/))
7. Novel miRNAs and known miRNAs discovery ([`MiRDeep2`](https://www.mdc-berlin.de/content/mirdeep2-documentation))
8. Novel miRNAs and known miRNAs discovery ([`MiRDeep2`](https://www.mdc-berlin.de/content/mirdeep2-documentation))
1. Mapping against reference genome with the mapper module
2. Known and novel miRNA discovery with the mirdeep2 module
8. miRNA quality control ([`mirtrace`](https://github.com/friedlanderlab/mirtrace))
9. Present QC for raw read, alignment, and expression results ([`MultiQC`](http://multiqc.info/))
9. miRNA quality control ([`mirtrace`](https://github.com/friedlanderlab/mirtrace))
10. Present QC for raw read, alignment, and expression results ([`MultiQC`](http://multiqc.info/))

## Quick Start

Expand Down
7 changes: 7 additions & 0 deletions docs/output.md
Original file line number Diff line number Diff line change
Expand Up @@ -14,6 +14,7 @@ The pipeline is built using [Nextflow](https://www.nextflow.io/) and processes d

- [FastQC](#fastqc) - read quality control
- [TrimGalore](#trimgalore) - adapter trimming
- [Bowtie2](#bowtie2) - contamination filtering
- [Bowtie](#bowtie) - alignment against mature miRNAs and miRNA precursors (hairpins)
- [SAMtools](#samtools) - alignment result processing and feature counting
- [edgeR](#edger) - normalization, MDS plot and sample pairwise distance heatmap
Expand Down Expand Up @@ -58,6 +59,12 @@ This is an example of the output we can get:

![cutadapt](images/cutadapt_plot.png)

## Bowtie2

[Bowtie2](http://bowtie-bio.sourceforge.net/bowtie2/index.shtml) is used to align the reads to user-defined databases of contaminants.

MultiQC reports the number of reads that were removed by each of the contaminant databases.

## Bowtie

[Bowtie](http://bowtie-bio.sourceforge.net/index.shtml) is used for mapping adapter trimmed reads against the mature miRNAs and miRNA precursors (hairpins) of the chosen database [miRBase](http://www.mirbase.org/) or [MirGeneDB](https://mirgenedb.org/).
Expand Down
13 changes: 13 additions & 0 deletions docs/usage.md
Original file line number Diff line number Diff line change
Expand Up @@ -39,6 +39,19 @@ If `MirGeneDB` should be used instead it needs to be specified using `--mirGeneD
- `fasta`: the reference genome FASTA file
- `bt_indices`: points to the folder containing the `bowtie2` indices for the genome reference specified by `fasta`. **Note:** if the FASTA file in `fasta` is not the same file used to generate the `bowtie2` indices, then the pipeline will fail.

### Contamination filtering

This step has, until now, only been tested for human data. Unexpected behaviour can occur when using it with a different species.

Contamination filtering of the sequencing reads is optional and can be invoked using `filter_contamination`. FASTA files with contamination sequences to use need to be supplied using the following commands. Otherwise the contamination filtering of the specific type will be omitted.

- `rrna`: Used to supply a FASTA file containing rRNA contamination sequence.
- `trna`: Used to supply a FASTA file containing tRNA contamination sequence. e.g. `http://gtrnadb.ucsc.edu/genomes/eukaryota/Hsapi38/hg38-tRNAs.fa`
- `cdna`: Used to supply a FASTA file containing cDNA contamination sequence. e.g. `ftp://ftp.ensembl.org/pub/release-86/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz` The FASTA file is first compared to the available miRNA sequences and overlaps are removed.
- `ncrna`: Used to supply a FASTA file containing ncRNA contamination sequence. e.g. `ftp://ftp.ensembl.org/pub/release-86/fasta/homo_sapiens/ncrna/Homo_sapiens.GRCh38.ncrna.fa.gz` The FASTA file is first compared to the available miRNA sequences and overlaps are removed.
- `pirna`: Used to supply a FASTA file containing piRNA contamination sequence. e.g. The FASTA file is first compared to the available miRNA sequences and overlaps are removed.
- `other_contamination`: Used to supply an additional filtering set. The FASTA file is first compared to the available miRNA sequences and overlaps are removed.

## Samplesheet input

You will need to create a samplesheet with information about the samples you would like to analyse before running the pipeline. Use this parameter to specify its location. It has to be a comma-separated file with 3 columns, and a header row as shown in the examples below.
Expand Down
59 changes: 59 additions & 0 deletions modules/local/blat_mirna.nf
Original file line number Diff line number Diff line change
@@ -0,0 +1,59 @@
process BLAT_MIRNA {
tag "$fasta"
label 'process_medium'

conda (params.enable_conda ? 'bioconda::blat=36' : null)
container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ?
'https://depot.galaxyproject.org/singularity/blat:36--0' :
'quay.io/biocontainers/blat:36--0' }"

input:
val db_type
path mirna
path contaminants


output:
path 'filtered.fa' , emit: filtered_set
path "versions.yml" , emit: versions

script:

if ( db_type == "cdna" )
"""
echo $db_type
awk '/^>/ { x=index(\$6, "transcript_biotype:miRNA") } { if(!x) print }' $contaminants > subset.fa
blat -out=blast8 $mirna subset.fa /dev/stdout | awk 'BEGIN{FS="\t"}{if(\$11 < 1e-5)print \$1;}' | uniq > mirnahit.txt
awk 'BEGIN { while((getline<"mirnahit.txt")>0) l[">"\$1]=1 } /^>/ {x = l[\$1]} {if(!x) print }' subset.fa > filtered.fa
cat <<-END_VERSIONS > versions.yml
"${task.process}":
blat: \$(echo \$(blat) | grep Standalone | awk '{ if (match(\$0,/[0-9]*[0-9]/,m)) print m[0] }')
END_VERSIONS
"""

else if ( db_type == "ncrna" )
"""
echo $db_type
awk '/^>/ { x=(index(\$6, "transcript_biotype:rRNA") || index(\$6, "transcript_biotype:miRNA")) } { if(!x) print }' $contaminants > subset.fa
blat -out=blast8 $mirna subset.fa /dev/stdout | awk 'BEGIN{FS="\t"}{if(\$11 < 1e-5)print \$1;}' | uniq > mirnahit.txt
awk 'BEGIN { while((getline<"mirnahit.txt")>0) l[">"\$1]=1 } /^>/ {x = l[\$1]} {if(!x) print }' subset.fa > filtered.fa
cat <<-END_VERSIONS > versions.yml
"${task.process}":
blat: \$(echo \$(blat) | grep Standalone | awk '{ if (match(\$0,/[0-9]*[0-9]/,m)) print m[0] }')
END_VERSIONS
"""

else
"""
echo $db_type
blat -out=blast8 $mirna $contaminants /dev/stdout | awk 'BEGIN{FS="\t"}{if(\$11 < 1e-5)print \$1;}' | uniq > mirnahit.txt
awk 'BEGIN { while((getline<"mirnahit.txt")>0) l[">"\$1]=1 } /^>/ {x = l[\$1]} {if(!x) print }' $contaminants > filtered.fa
cat <<-END_VERSIONS > versions.yml
"${task.process}":
blat: \$(echo \$(blat) | grep Standalone | awk '{ if (match(\$0,/[0-9]*[0-9]/,m)) print m[0] }')
END_VERSIONS
"""

}
26 changes: 26 additions & 0 deletions modules/local/bowtie_contaminants.nf
Original file line number Diff line number Diff line change
@@ -0,0 +1,26 @@
process INDEX_CONTAMINANTS {
label 'process_medium'

conda (params.enable_conda ? 'bowtie2=2.4.5' : null)
container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ?
'https://depot.galaxyproject.org/singularity/bowtie2:2.4.5--py39hd2f7db1_2' :
'quay.io/biocontainers/bowtie2:2.4.5--py36hfca12d5_2' }"

input:
path fasta

output:
path 'fasta_bidx*' , emit: bt_indices
path "versions.yml" , emit: versions

script:
"""
bowtie2-build ${fasta} fasta_bidx --threads ${task.cpus}
cat <<-END_VERSIONS > versions.yml
"${task.process}":
bowtie2: \$(echo \$(bowtie2 --version 2>&1) | sed 's/^.*bowtie2-align-s version //; s/ .*\$//')
END_VERSIONS
"""

}
41 changes: 41 additions & 0 deletions modules/local/bowtie_map_contaminants.nf
Original file line number Diff line number Diff line change
@@ -0,0 +1,41 @@
process BOWTIE_MAP_CONTAMINANTS {
label 'process_medium'

conda (params.enable_conda ? 'bowtie2=2.4.5' : null)
container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ?
'https://depot.galaxyproject.org/singularity/bowtie2:2.4.5--py39hd2f7db1_2' :
'quay.io/biocontainers/bowtie2:2.4.5--py36hfca12d5_2' }"

input:
tuple val(meta), path(reads)
path index
val contaminant_type

output:
tuple val(meta), path("*sam") , emit: bam
tuple val(meta), path('*.filter.unmapped.contaminant.fastq'), emit: unmapped
path "versions.yml" , emit: versions
path "filtered.*.stats" , emit: stats

script:
def index_base = index.toString().tokenize(' ')[0].tokenize('.')[0]
"""
bowtie2 \\
--threads ${task.cpus} \\
--very-sensitive-local \\
-k 1 \\
-x $index_base \\
--un ${meta.id}.${contaminant_type}.filter.unmapped.contaminant.fastq \\
${reads} \\
-S ${meta.id}.filter.contaminant.sam > ${meta.id}.contaminant_bowtie.log 2>&1
# extracting number of reads from bowtie logs
awk -v type=${contaminant_type} 'BEGIN{tot=0} {if(NR==4 || NR == 5){tot += \$1}} END {print "\\""type"\\": "tot }' ${meta.id}.contaminant_bowtie.log | tr -d , > filtered.${meta.id}_${contaminant_type}.stats
cat <<-END_VERSIONS > versions.yml
"${task.process}":
bowtie2: \$(echo \$(bowtie2 --version 2>&1) | sed 's/^.*bowtie2-align-s version //; s/ .*\$//' | tr -d '\0')
END_VERSIONS
"""

}
25 changes: 25 additions & 0 deletions modules/local/filter_stats.nf
Original file line number Diff line number Diff line change
@@ -0,0 +1,25 @@
process FILTER_STATS {
label 'process_medium'

conda (params.enable_conda ? 'bowtie2=2.4.5' : null)
container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ?
'https://depot.galaxyproject.org/singularity/bowtie2:2.4.5--py39hd2f7db1_2' :
'quay.io/biocontainers/bowtie2:2.4.5--py36hfca12d5_2' }"

input:
tuple val(meta), path(reads)
path stats_files

output:
path "*_mqc.yaml" , emit: stats
tuple val(meta), path('*.filtered.fastq.gz') , emit: reads

script:
"""
readnumber=\$(wc -l ${reads} | awk '{ print \$1/4 }')
cat ./filtered.${meta.id}_*.stats | \\
tr '\n' ', ' | \\
awk -v sample=${meta.id} -v readnumber=\$readnumber '{ print "id: \\"my_pca_section\\"\\nsection_name: \\"Contamination Filtering\\"\\ndescription: \\"This plot shows the amount of reads filtered by contaminant type.\\"\\nplot_type: \\"bargraph\\"\\npconfig:\\n id: \\"contamination_filter_plot\\"\\n title: \\"Contamination Plot\\"\\n ylab: \\"Number of reads\\"\\ndata:\\n "sample": {"\$0"\\"remaining reads\\": "readnumber"}" }' > ${meta.id}.contamination_mqc.yaml
gzip -c ${reads} > ${meta.id}.filtered.fastq.gz
"""
}
9 changes: 9 additions & 0 deletions nextflow.config
Original file line number Diff line number Diff line change
Expand Up @@ -45,6 +45,15 @@ params {
save_reference = false
trim_galore_max_length = 40

// Contamination filtering
filter_contamination = false
rrna = null
trna = null
cdna = null
ncrna = null
pirna = null
other_contamination = null

// MultiQC options
multiqc_config = null
multiqc_title = null
Expand Down
46 changes: 46 additions & 0 deletions nextflow_schema.json
Original file line number Diff line number Diff line change
Expand Up @@ -194,6 +194,49 @@
}
}
},
"contamination_filtering": {
"title": "Contamination filter options",
"type": "object",
"description": "Options to remove contamination from the reads.",
"fa_icon": "fas fa-cut",
"properties": {
"filter_contamination": {
"type": "boolean",
"default": false,
"description": "Enables the contamination filtering."
},
"rrna": {
"type": "string",
"format": "file-path",
"description": "Path to the rRNA fasta file to be used as contamination database."
},
"trna": {
"type": "string",
"format": "file-path",
"description": "Path to the tRNA fasta file to be used as contamination database."
},
"cdna": {
"type": "string",
"format": "file-path",
"description": "Path to the cDNA fasta file to be used as contamination database."
},
"ncrna": {
"type": "string",
"format": "file-path",
"description": "Path to the ncRNA fasta file to be used as contamination database."
},
"pirna": {
"type": "string",
"format": "file-path",
"description": "Path to the piRNA fasta file to be used as contamination database."
},
"other_contamination": {
"type": "string",
"format": "file-path",
"description": "Path to an additional fasta file to be used as contamination database."
}
}
},
"skipping_pipeline_steps": {
"title": "Skipping pipeline steps",
"type": "object",
Expand Down Expand Up @@ -411,6 +454,9 @@
{
"$ref": "#/definitions/trimming_options"
},
{
"$ref": "#/definitions/contamination_filtering"
},
{
"$ref": "#/definitions/skipping_pipeline_steps"
},
Expand Down
Loading

0 comments on commit 8d66121

Please sign in to comment.