Merge pull request #156 from CKComputomics/contaminationFiltering

Contamination filtering
nf-core · Jun 17, 2022 · 8d66121 · 8d66121
2 parents fb058cc + 6395315
commit 8d66121
Show file tree

Hide file tree

Showing 12 changed files with 416 additions and 24 deletions.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -12,16 +12,24 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 ### Other enhancements
 
 - [#55](https://github.com/nf-core/smrnaseq/issues/12) - Enabled the use of `MirGeneDB` as an alternative database insted of `miRBase`
+- [#113](https://github.com/nf-core/smrnaseq/issues/113) - Added a optional contamination filtering step, including MultiQC plot.
 
 ### Parameters
 
-| Old parameter | New parameter         |
-| ------------- | --------------------- |
-|               | `--mirGeneDB`         |
-|               | `--mirGeneDB_species` |
-|               | `--mirGeneDB_gff`     |
-|               | `--mirGeneDB_mature`  |
-|               | `--mirGeneDB_hairpin` |
+| Old parameter | New parameter            |
+| ------------- | ------------------------ |
+|               | `--mirGeneDB`            |
+|               | `--mirGeneDB_species`    |
+|               | `--mirGeneDB_gff`        |
+|               | `--mirGeneDB_mature`     |
+|               | `--mirGeneDB_hairpin`    |
+|               | `--contamination_filter` |
+|               | `--rrna`                 |
+|               | `--trna`                 |
+|               | `--cdna`                 |
+|               | `--ncrna`                |
+|               | `--pirna`                |
+|               | `--other_contamination`  |
 
 ## [v2.0.0](https://github.com/nf-core/smrnaseq/releases/tag/2.0.0) - 2022-05-31 Aqua Zinc Chihuahua
 
@@ -44,10 +52,11 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 ### Parameters
 
-| Old parameter      | New parameter    |
-| ------------------ | ---------------- |
-| `--conda`          | `--enable_conda` |
-| `--clusterOptions` |                  |
+| Old parameter        | New parameter    |
+| -------------------- | ---------------- |
+| `--conda`            | `--enable_conda` |
+| `--clusterOptions`   |                  |
+| `--publish_dir_mode` |                  |
 
 > **NB:** Parameter has been **updated** if both old and new parameter information is present.
 > **NB:** Parameter has been **added** if just the new parameter information is present.
@@ -79,6 +88,8 @@ Note, since the pipeline is now using Nextflow DSL2, each process will be run wi
 | `pymdown-extensions` | -           | -           |
 | `pygments`           | -           | -           |
 | `r-r.methodss3`      | -           | -           |
+| `bowtie2`            | -           | 2.4.5       |
+| `blat`               | -           | 36          |
 
 > **NB:** Dependency has been **updated** if both old and new version information is present.
 > **NB:** Dependency has been **added** if just the new version information is present.

diff --git a/README.md b/README.md
@@ -31,24 +31,25 @@ On release, automated continuous integration tests run the pipeline on a full-si
 2. Adapter trimming ([`Trim Galore!`](https://www.bioinformatics.babraham.ac.uk/projects/trim_galore/))
    1. Insert Size calculation
    2. Collapse reads ([`seqcluster`](https://seqcluster.readthedocs.io/mirna_annotation.html#processing-of-reads))
-3. Alignment against miRBase or MirGeneDB mature miRNA ([`Bowtie1`](http://bowtie-bio.sourceforge.net/index.shtml))
-4. Alignment against miRBase or MirGeneDB hairpin
+3. Contamination filtering ([`Bowtie2`](http://bowtie-bio.sourceforge.net/bowtie2/index.shtml))
+4. Alignment against miRBase mature miRNA ([`Bowtie1`](http://bowtie-bio.sourceforge.net/index.shtml))
+5. Alignment against miRBase hairpin
    1. Unaligned reads from step 3 ([`Bowtie1`](http://bowtie-bio.sourceforge.net/index.shtml))
    2. Collapsed reads from step 2.2 ([`Bowtie1`](http://bowtie-bio.sourceforge.net/index.shtml))
-5. Post-alignment processing of miRBase, or MirGeneDB hairpin
+6. Post-alignment processing of miRBase hairpin
    1. Basic statistics from step 3 and step 4.1 ([`SAMtools`](https://sourceforge.net/projects/samtools/files/samtools/))
    2. Analysis on miRBase, or MirGeneDB hairpin counts ([`edgeR`](https://bioconductor.org/packages/release/bioc/html/edgeR.html))
       - TMM normalization and a table of top expression hairpin
       - MDS plot clustering samples
       - Heatmap of sample similarities
    3. miRNA and isomiR annotation from step 4.1 ([`mirtop`](https://github.com/miRTop/mirtop))
-6. Alignment against host reference genome ([`Bowtie1`](http://bowtie-bio.sourceforge.net/index.shtml))
+7. Alignment against host reference genome ([`Bowtie1`](http://bowtie-bio.sourceforge.net/index.shtml))
    1. Post-alignment processing of alignment against host reference genome ([`SAMtools`](https://sourceforge.net/projects/samtools/files/samtools/))
-7. Novel miRNAs and known miRNAs discovery ([`MiRDeep2`](https://www.mdc-berlin.de/content/mirdeep2-documentation))
+8. Novel miRNAs and known miRNAs discovery ([`MiRDeep2`](https://www.mdc-berlin.de/content/mirdeep2-documentation))
    1. Mapping against reference genome with the mapper module
    2. Known and novel miRNA discovery with the mirdeep2 module
-8. miRNA quality control ([`mirtrace`](https://github.com/friedlanderlab/mirtrace))
-9. Present QC for raw read, alignment, and expression results ([`MultiQC`](http://multiqc.info/))
+9. miRNA quality control ([`mirtrace`](https://github.com/friedlanderlab/mirtrace))
+10. Present QC for raw read, alignment, and expression results ([`MultiQC`](http://multiqc.info/))
 
 ## Quick Start
 

diff --git a/docs/output.md b/docs/output.md
@@ -14,6 +14,7 @@ The pipeline is built using [Nextflow](https://www.nextflow.io/) and processes d
 
 - [FastQC](#fastqc) - read quality control
 - [TrimGalore](#trimgalore) - adapter trimming
+- [Bowtie2](#bowtie2) - contamination filtering
 - [Bowtie](#bowtie) - alignment against mature miRNAs and miRNA precursors (hairpins)
 - [SAMtools](#samtools) - alignment result processing and feature counting
 - [edgeR](#edger) - normalization, MDS plot and sample pairwise distance heatmap
@@ -58,6 +59,12 @@ This is an example of the output we can get:
 
 ![cutadapt](images/cutadapt_plot.png)
 
+## Bowtie2
+
+[Bowtie2](http://bowtie-bio.sourceforge.net/bowtie2/index.shtml) is used to align the reads to user-defined databases of contaminants.
+
+MultiQC reports the number of reads that were removed by each of the contaminant databases.
+
 ## Bowtie
 
 [Bowtie](http://bowtie-bio.sourceforge.net/index.shtml) is used for mapping adapter trimmed reads against the mature miRNAs and miRNA precursors (hairpins) of the chosen database [miRBase](http://www.mirbase.org/) or [MirGeneDB](https://mirgenedb.org/).

diff --git a/docs/usage.md b/docs/usage.md
@@ -39,6 +39,19 @@ If `MirGeneDB` should be used instead it needs to be specified using `--mirGeneD
 - `fasta`: the reference genome FASTA file
 - `bt_indices`: points to the folder containing the `bowtie2` indices for the genome reference specified by `fasta`. **Note:** if the FASTA file in `fasta` is not the same file used to generate the `bowtie2` indices, then the pipeline will fail.
 
+### Contamination filtering
+
+This step has, until now, only been tested for human data. Unexpected behaviour can occur when using it with a different species.
+
+Contamination filtering of the sequencing reads is optional and can be invoked using `filter_contamination`. FASTA files with contamination sequences to use need to be supplied using the following commands. Otherwise the contamination filtering of the specific type will be omitted.
+
+- `rrna`: Used to supply a FASTA file containing rRNA contamination sequence.
+- `trna`: Used to supply a FASTA file containing tRNA contamination sequence. e.g. `http://gtrnadb.ucsc.edu/genomes/eukaryota/Hsapi38/hg38-tRNAs.fa`
+- `cdna`: Used to supply a FASTA file containing cDNA contamination sequence. e.g. `ftp://ftp.ensembl.org/pub/release-86/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz` The FASTA file is first compared to the available miRNA sequences and overlaps are removed.
+- `ncrna`: Used to supply a FASTA file containing ncRNA contamination sequence. e.g. `ftp://ftp.ensembl.org/pub/release-86/fasta/homo_sapiens/ncrna/Homo_sapiens.GRCh38.ncrna.fa.gz` The FASTA file is first compared to the available miRNA sequences and overlaps are removed.
+- `pirna`: Used to supply a FASTA file containing piRNA contamination sequence. e.g. The FASTA file is first compared to the available miRNA sequences and overlaps are removed.
+- `other_contamination`: Used to supply an additional filtering set. The FASTA file is first compared to the available miRNA sequences and overlaps are removed.
+
 ## Samplesheet input
 
 You will need to create a samplesheet with information about the samples you would like to analyse before running the pipeline. Use this parameter to specify its location. It has to be a comma-separated file with 3 columns, and a header row as shown in the examples below.

diff --git a/modules/local/blat_mirna.nf b/modules/local/blat_mirna.nf
@@ -0,0 +1,59 @@
+process BLAT_MIRNA {
+    tag "$fasta"
+    label 'process_medium'
+
+    conda (params.enable_conda ? 'bioconda::blat=36' : null)
+    container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ?
+        'https://depot.galaxyproject.org/singularity/blat:36--0' :
+        'quay.io/biocontainers/blat:36--0' }"
+
+    input:
+    val db_type
+    path mirna
+    path contaminants
+
+
+    output:
+    path 'filtered.fa'  , emit: filtered_set
+    path "versions.yml" , emit: versions
+
+    script:
+
+    if ( db_type == "cdna" )
+        """
+        echo $db_type
+        awk '/^>/ { x=index(\$6, "transcript_biotype:miRNA") } { if(!x) print }' $contaminants > subset.fa
+        blat -out=blast8 $mirna subset.fa /dev/stdout | awk 'BEGIN{FS="\t"}{if(\$11 < 1e-5)print \$1;}' | uniq > mirnahit.txt
+        awk 'BEGIN { while((getline<"mirnahit.txt")>0) l[">"\$1]=1 } /^>/ {x = l[\$1]} {if(!x) print }' subset.fa  > filtered.fa
+
+        cat <<-END_VERSIONS > versions.yml
+        "${task.process}":
+            blat: \$(echo \$(blat) | grep Standalone | awk '{ if (match(\$0,/[0-9]*[0-9]/,m)) print m[0] }')
+        END_VERSIONS
+        """
+
+    else if ( db_type == "ncrna" )
+        """
+        echo $db_type
+        awk '/^>/ { x=(index(\$6, "transcript_biotype:rRNA") || index(\$6, "transcript_biotype:miRNA")) } { if(!x) print }' $contaminants > subset.fa
+        blat -out=blast8 $mirna subset.fa /dev/stdout | awk 'BEGIN{FS="\t"}{if(\$11 < 1e-5)print \$1;}' | uniq > mirnahit.txt
+        awk 'BEGIN { while((getline<"mirnahit.txt")>0) l[">"\$1]=1 } /^>/ {x = l[\$1]} {if(!x) print }' subset.fa  > filtered.fa
+
+        cat <<-END_VERSIONS > versions.yml
+        "${task.process}":
+            blat: \$(echo \$(blat) | grep Standalone | awk '{ if (match(\$0,/[0-9]*[0-9]/,m)) print m[0] }')
+        END_VERSIONS
+        """
+
+    else
+        """
+        echo $db_type
+        blat -out=blast8 $mirna $contaminants /dev/stdout | awk 'BEGIN{FS="\t"}{if(\$11 < 1e-5)print \$1;}' | uniq > mirnahit.txt
+        awk 'BEGIN { while((getline<"mirnahit.txt")>0) l[">"\$1]=1 } /^>/ {x = l[\$1]} {if(!x) print }' $contaminants  > filtered.fa
+        cat <<-END_VERSIONS > versions.yml
+        "${task.process}":
+            blat: \$(echo \$(blat) | grep Standalone | awk '{ if (match(\$0,/[0-9]*[0-9]/,m)) print m[0] }')
+        END_VERSIONS
+        """
+
+}
diff --git a/modules/local/bowtie_contaminants.nf b/modules/local/bowtie_contaminants.nf
@@ -0,0 +1,26 @@
+process INDEX_CONTAMINANTS {
+    label 'process_medium'
+
+    conda (params.enable_conda ? 'bowtie2=2.4.5' : null)
+    container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ?
+        'https://depot.galaxyproject.org/singularity/bowtie2:2.4.5--py39hd2f7db1_2' :
+        'quay.io/biocontainers/bowtie2:2.4.5--py36hfca12d5_2' }"
+
+    input:
+    path fasta
+
+    output:
+    path 'fasta_bidx*'  , emit: bt_indices
+    path "versions.yml" , emit: versions
+
+    script:
+    """
+    bowtie2-build ${fasta} fasta_bidx --threads ${task.cpus}
+
+    cat <<-END_VERSIONS > versions.yml
+    "${task.process}":
+        bowtie2: \$(echo \$(bowtie2 --version 2>&1) | sed 's/^.*bowtie2-align-s version //; s/ .*\$//')
+    END_VERSIONS
+    """
+
+}
diff --git a/modules/local/bowtie_map_contaminants.nf b/modules/local/bowtie_map_contaminants.nf
@@ -0,0 +1,41 @@
+process BOWTIE_MAP_CONTAMINANTS {
+    label 'process_medium'
+
+    conda (params.enable_conda ? 'bowtie2=2.4.5' : null)
+    container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ?
+        'https://depot.galaxyproject.org/singularity/bowtie2:2.4.5--py39hd2f7db1_2' :
+        'quay.io/biocontainers/bowtie2:2.4.5--py36hfca12d5_2' }"
+
+    input:
+    tuple val(meta), path(reads)
+    path index
+    val contaminant_type
+
+    output:
+    tuple val(meta), path("*sam")                               , emit: bam
+    tuple val(meta), path('*.filter.unmapped.contaminant.fastq'), emit: unmapped
+    path "versions.yml"                                         , emit: versions
+    path "filtered.*.stats"                                     , emit: stats
+
+    script:
+    def index_base = index.toString().tokenize(' ')[0].tokenize('.')[0]
+    """
+    bowtie2 \\
+        --threads ${task.cpus} \\
+        --very-sensitive-local \\
+        -k 1 \\
+        -x $index_base \\
+        --un ${meta.id}.${contaminant_type}.filter.unmapped.contaminant.fastq \\
+        ${reads} \\
+        -S ${meta.id}.filter.contaminant.sam > ${meta.id}.contaminant_bowtie.log 2>&1
+
+    # extracting number of reads from bowtie logs
+    awk -v type=${contaminant_type} 'BEGIN{tot=0} {if(NR==4 || NR == 5){tot += \$1}} END {print "\\""type"\\": "tot }' ${meta.id}.contaminant_bowtie.log | tr -d , > filtered.${meta.id}_${contaminant_type}.stats
+
+    cat <<-END_VERSIONS > versions.yml
+    "${task.process}":
+        bowtie2: \$(echo \$(bowtie2 --version 2>&1) | sed 's/^.*bowtie2-align-s version //; s/ .*\$//' | tr -d '\0')
+    END_VERSIONS
+    """
+
+}
diff --git a/modules/local/filter_stats.nf b/modules/local/filter_stats.nf
@@ -0,0 +1,25 @@
+process FILTER_STATS {
+    label 'process_medium'
+
+    conda (params.enable_conda ? 'bowtie2=2.4.5' : null)
+    container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ?
+        'https://depot.galaxyproject.org/singularity/bowtie2:2.4.5--py39hd2f7db1_2' :
+        'quay.io/biocontainers/bowtie2:2.4.5--py36hfca12d5_2' }"
+
+    input:
+    tuple val(meta), path(reads)
+    path stats_files
+
+    output:
+    path "*_mqc.yaml"                                            , emit: stats
+    tuple val(meta), path('*.filtered.fastq.gz')                 , emit: reads
+
+    script:
+    """
+    readnumber=\$(wc -l ${reads} | awk '{ print \$1/4 }')
+    cat ./filtered.${meta.id}_*.stats | \\
+    tr '\n' ', ' | \\
+    awk -v sample=${meta.id} -v readnumber=\$readnumber '{ print "id: \\"my_pca_section\\"\\nsection_name: \\"Contamination Filtering\\"\\ndescription: \\"This plot shows the amount of reads filtered by contaminant type.\\"\\nplot_type: \\"bargraph\\"\\npconfig:\\n  id: \\"contamination_filter_plot\\"\\n  title: \\"Contamination Plot\\"\\n  ylab: \\"Number of reads\\"\\ndata:\\n    "sample": {"\$0"\\"remaining reads\\": "readnumber"}" }' > ${meta.id}.contamination_mqc.yaml
+    gzip -c ${reads} > ${meta.id}.filtered.fastq.gz
+    """
+}
diff --git a/nextflow.config b/nextflow.config
@@ -45,6 +45,15 @@ params {
     save_reference             = false
     trim_galore_max_length     = 40
 
+    // Contamination filtering
+    filter_contamination       = false
+    rrna                       = null
+    trna                       = null
+    cdna                       = null
+    ncrna                      = null
+    pirna                      = null
+    other_contamination        = null
+
     // MultiQC options
     multiqc_config             = null
     multiqc_title              = null

diff --git a/nextflow_schema.json b/nextflow_schema.json
@@ -194,6 +194,49 @@
                 }
             }
         },
+        "contamination_filtering": {
+            "title": "Contamination filter options",
+            "type": "object",
+            "description": "Options to remove contamination from the reads.",
+            "fa_icon": "fas fa-cut",
+            "properties": {
+                "filter_contamination": {
+                    "type": "boolean",
+                    "default": false,
+                    "description": "Enables the contamination filtering."
+                },
+                "rrna": {
+                    "type": "string",
+                    "format": "file-path",
+                    "description": "Path to the rRNA fasta file to be used as contamination database."
+                },
+                "trna": {
+                    "type": "string",
+                    "format": "file-path",
+                    "description": "Path to the tRNA fasta file to be used as contamination database."
+                },
+                "cdna": {
+                    "type": "string",
+                    "format": "file-path",
+                    "description": "Path to the cDNA fasta file to be used as contamination database."
+                },
+                "ncrna": {
+                    "type": "string",
+                    "format": "file-path",
+                    "description": "Path to the ncRNA fasta file to be used as contamination database."
+                },
+                "pirna": {
+                    "type": "string",
+                    "format": "file-path",
+                    "description": "Path to the piRNA fasta file to be used as contamination database."
+                },
+                "other_contamination": {
+                    "type": "string",
+                    "format": "file-path",
+                    "description": "Path to an additional fasta file to be used as contamination database."
+                }
+            }
+        },
         "skipping_pipeline_steps": {
             "title": "Skipping pipeline steps",
             "type": "object",
@@ -411,6 +454,9 @@
         {
             "$ref": "#/definitions/trimming_options"
         },
+        {
+            "$ref": "#/definitions/contamination_filtering"
+        },
         {
             "$ref": "#/definitions/skipping_pipeline_steps"
         },