README filled. Check fastq replicas unstable

nf-core · Oct 11, 2024 · 83b7d48
1 parent bc96d04

Showing 14 changed files with 1,110 additions and 65 deletions.
diff --git a/.devcontainer/Dockerfile b/.devcontainer/Dockerfile
diff --git a/.devcontainer/environment.yml b/.devcontainer/environment.yml
diff --git a/README.md b/README.md
@@ -10,81 +10,81 @@
 [![nf-test](https://img.shields.io/badge/unit_tests-nf--test-337ab7.svg)](https://www.nf-test.com)
 
 [![Nextflow](https://img.shields.io/badge/nextflow%20DSL2-%E2%89%A523.04.0-23aa62.svg)](https://www.nextflow.io/)
-[![run with conda](http://img.shields.io/badge/run%20with-conda-3EB049?labelColor=000000&logo=anaconda)](https://docs.conda.io/en/latest/)
+<!-- [![run with conda](http://img.shields.io/badge/run%20with-conda-3EB049?labelColor=000000&logo=anaconda)](https://docs.conda.io/en/latest/) -->
 [![run with docker](https://img.shields.io/badge/run%20with-docker-0db7ed?labelColor=000000&logo=docker)](https://www.docker.com/)
-[![run with singularity](https://img.shields.io/badge/run%20with-singularity-1d355c.svg?labelColor=000000)](https://sylabs.io/docs/)
+<!-- [![run with singularity](https://img.shields.io/badge/run%20with-singularity-1d355c.svg?labelColor=000000)](https://sylabs.io/docs/) -->
 [![Launch on Seqera Platform](https://img.shields.io/badge/Launch%20%F0%9F%9A%80-Seqera%20Platform-%234256e7)](https://cloud.seqera.io/launch?pipeline=https://github.com/nf-core/fastqrepair)
 
 [![Get help on Slack](http://img.shields.io/badge/slack-nf--core%20%23fastqrepair-4A154B?labelColor=000000&logo=slack)](https://nfcore.slack.com/channels/fastqrepair)[![Follow on Twitter](http://img.shields.io/badge/twitter-%40nf__core-1DA1F2?labelColor=000000&logo=twitter)](https://twitter.com/nf_core)[![Follow on Mastodon](https://img.shields.io/badge/mastodon-nf__core-6364ff?labelColor=FFFFFF&logo=mastodon)](https://mstdn.science/@nf_core)[![Watch on YouTube](http://img.shields.io/badge/youtube-nf--core-FF0000?labelColor=000000&logo=youtube)](https://www.youtube.com/c/nf-core)
 
 ## Introduction
 
-**nf-core/fastqrepair** is a bioinformatics pipeline that ...
+**nf-core/fastqrepair** is a bioinformatics pipeline that recovers corrupted `FASTQ.gz` files, drops or fixes non-compliant reads, removes unpaired reads, and restores the order of reads that have become disordered. It takes a `samplesheet` and FASTQ/FASTQ.gz files as input (both single-end and paired-end) and produces clean FASTQ files and a QC report.
 
-<!-- TODO nf-core:
-   Complete this sentence with a 2-3 sentence summary of what types of data the pipeline ingests, a brief overview of the
-   major pipeline sections and the types of output it produces. You're giving an overview to someone new
-   to nf-core here, in 15-20 seconds. For an example, see https://github.com/nf-core/rnaseq/blob/master/README.md#introduction
--->
+![pipeline_diagram](docs/images/fastqrepair-flow-diagram-v1.0.svg)
 
-<!-- TODO nf-core: Include a figure that guides the user through the major workflow steps. Many nf-core
-     workflows use the "tube map" design for that. See https://nf-co.re/docs/contributing/design_guidelines#examples for examples.   -->
-<!-- TODO nf-core: Fill in short bullet-pointed list of the default steps in the pipeline -->
-
-1. Read QC ([`FastQC`](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/))
-2. Present QC for raw reads ([`MultiQC`](http://multiqc.info/))
+1. Recover reads from corrupted fastq.gz file ([`gzrt`](https://github.com/arenn/gzrt))
+2. Make recovered reads well-formed ([`fastqwiper`](https://github.com/mazzalab/fastqwiper))
+3. Drop unpaired reads ([`trimmomatic`](http://www.usadellab.org/cms/index.php?page=trimmomatic))
+4. Re-pair reads ([`bbmap/repair.sh`](https://sourceforge.net/projects/bbmap/))
 
 ## Usage
 
 > [!NOTE]
 > If you are new to Nextflow and nf-core, please refer to [this page](https://nf-co.re/docs/usage/installation) on how to set-up Nextflow. Make sure to [test your setup](https://nf-co.re/docs/usage/introduction#how-to-run-a-pipeline) with `-profile test` before running the workflow on actual data.
 
-<!-- TODO nf-core: Describe the minimum required steps to execute the pipeline, e.g. how to prepare samplesheets.
-     Explain what rows and columns represent. For instance (please edit as appropriate):
-
 First, prepare a samplesheet with your input data that looks as follows:
 
-`samplesheet.csv`:
+**samplesheet.csv**:
 
-```csv
+```
 sample,fastq_1,fastq_2
-CONTROL_REP1,AEG588A1_S1_L002_R1_001.fastq.gz,AEG588A1_S1_L002_R2_001.fastq.gz
+mysampleA,sample_R1.fastq.gz,sample_R2.fastq.gz
+mysampleB,sample_R3.fastq.gz,sample_R4.fastq.gz
+mysampleC,sample_R5.fastq.gz
 ```
 
-Each row represents a fastq file (single-end) or a pair of fastq files (paired end).
-
--->
+Each row represents a FASTQ file (single-end) or a pair of FASTQ files (paired-end). Sample identifiers must be unique across rows, and the same file name must not appear in more than one row, even under different sample identifiers.
 
 Now, you can run the pipeline using:
 
-<!-- TODO nf-core: update the following command to include all required parameters for a minimal example -->
-
 ```bash
 nextflow run nf-core/fastqrepair \
    -profile <docker/singularity/.../institute> \
    --input samplesheet.csv \
    --outdir <OUTDIR>
 ```
 
+Optional parameters are:
+```txt
+--chunk_size <int multiple of 4>
+--qin <33/64>
+--alphabet <ACGTN>
+```
+where:
+> `chunk_size` is the number of lines in each chunk that the original FASTQ file is split into (caution: values that are too large or too small may significantly degrade performance); `qin` is the ASCII offset of the quality scores (33 = Sanger, 64 = old Solexa); `alphabet` is the set of characters allowed in the SEQ line of the FASTQ file.
+
+
 > [!WARNING]
 > Please provide pipeline parameters via the CLI or Nextflow `-params-file` option. Custom config files including those provided by the `-c` Nextflow option can be used to provide any configuration _**except for parameters**_;
 > see [docs](https://nf-co.re/usage/configuration#custom-configuration-files).
 
 For more details and further functionality, please refer to the [usage documentation](https://nf-co.re/fastqrepair/usage) and the [parameter documentation](https://nf-co.re/fastqrepair/parameters).
 
 ## Pipeline output
+This pipeline produces clean and well-formed fastq files together with short textual reports of the cleaning actions. 
 
 To see the results of an example test run with a full size dataset refer to the [results](https://nf-co.re/fastqrepair/results) tab on the nf-core website pipeline page.
 For more details about the output files and reports, please refer to the
 [output documentation](https://nf-co.re/fastqrepair/output).
 
 ## Credits
 
-nf-core/fastqrepair was originally written by Tommaso Mazza.
+nf-core/fastqrepair was designed and written by [Tommaso Mazza](https://github.com/mazzalab).
 
-We thank the following people for their extensive assistance in the development of this pipeline:
+<!-- We thank the following people for their extensive assistance in the development of this pipeline: -->
 
-<!-- TODO nf-core: If applicable, make list of people who have also contributed -->
+<!-- nf-core: If applicable, make list of people who have also contributed -->
 
 ## Contributions and Support
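
Review note on the optional parameters documented in this README hunk: a FASTQ record spans four lines, which is presumably why `--chunk_size` must be a multiple of 4, since any other value would cut a record across two chunks. A minimal shell sketch of a pre-flight check that a user could run before launching the pipeline is given here. The function names `check_chunk_size` and `check_qin` are hypothetical and are not part of the pipeline.

```bash
#!/bin/sh
# Hypothetical pre-flight checks for the optional parameters described in the README.
# These helpers are illustrative only; the pipeline does not ship them.

# Succeeds when the value is a positive integer and a multiple of 4
# (one FASTQ record = 4 lines, so every chunk then holds whole records).
check_chunk_size() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;   # reject empty or non-numeric input
    esac
    [ "$1" -gt 0 ] && [ $(( $1 % 4 )) -eq 0 ]
}

# Succeeds when the quality offset is one of the two values the README lists.
check_qin() {
    [ "$1" = "33" ] || [ "$1" = "64" ]
}
```

A launch script could chain the checks in front of the real command, for example `check_chunk_size "$CHUNK" && check_qin "$QIN" && nextflow run ...`, so that an invalid value fails fast instead of surfacing inside a running task.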
 

diff --git a/custom.config b/custom.config
@@ -0,0 +1,6 @@
+process {
+    withName: 'NFCORE_FASTQREPAIR:FASTQREPAIR:GZRT' {
+        // docker.registry = 'registry.hub.docker.com/mazzalab/fastqwiper:latest'
+        container = 'registry.hub.docker.com/mazzalab/fastqwiper:latest'
+    }
+}
diff --git a/docs/images/fastqrepair-flow-diagram-v1.0.svg b/docs/images/fastqrepair-flow-diagram-v1.0.svg
diff --git a/docs/images/mqc_fastqc_adapter.png b/docs/images/mqc_fastqc_adapter.png
diff --git a/docs/images/mqc_fastqc_counts.png b/docs/images/mqc_fastqc_counts.png
diff --git a/docs/images/mqc_fastqc_quality.png b/docs/images/mqc_fastqc_quality.png
diff --git a/modules/local/gzrt.nf b/modules/local/gzrt.nf
@@ -58,11 +58,16 @@ process GZRT {
 
 
     """
-    gzrecover -o ${filename}_recovered.fastq ${fastqgz} -v
+    ver_line=""
+    if [[ $fastqgz == *.fastq ]] || [[ $fastqgz == *.fq ]]; then
+        mv $fastqgz ${filename}_recovered.fastq
+    else
+        gzrecover -o ${filename}_recovered.fastq ${fastqgz} -v
+        ver_line="${task.process}: gzrt: \$(gzrecover -V |& sed '1!d ; s/gzrecover //')"
+    fi
 
     cat <<-END_VERSIONS > versions.yml
-    "${task.process}":
-        gzrt: \$(gzrecover -V |& sed '1!d ; s/gzrecover //')
+    "\${ver_line}"
     END_VERSIONS
     """
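
Review note on the `GZRT` hunk: the new branch routes plain `.fastq`/`.fq` inputs around `gzrecover`. Two points may be worth double-checking. When the input is uncompressed, `ver_line` stays empty, so `versions.yml` would contain only an empty quoted string. In the compressed case, the whole `process: gzrt: version` text is written as one quoted scalar rather than a nested mapping, which downstream version collection may not parse. The sketch below mirrors the extension test and emits a nested mapping in both cases; `needs_recovery` and `versions_yaml` are hypothetical helper names, not pipeline code.

```bash
#!/bin/sh
# Hypothetical helper mirroring the extension test in the GZRT hunk.
# Returns 0 (true) when the file should go through gzrecover, 1 otherwise.
needs_recovery() {
    case "$1" in
        *.fastq|*.fq) return 1 ;;  # already uncompressed: skip recovery
        *)            return 0 ;;  # anything else (e.g. *.fastq.gz) is recovered
    esac
}

# Hypothetical helper that prints a versions.yml body as a nested mapping.
# $1 = process name, $2 = tool version string (or a placeholder when skipped).
versions_yaml() {
    printf '"%s":\n    gzrt: %s\n' "$1" "$2"
}
```

With helpers like these, both branches can write a structurally identical `versions.yml`, and only the version value differs between the recovered and the skipped case.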
 

diff --git a/modules/local/wipertools/gather.nf b/modules/local/wipertools/gather.nf
@@ -2,11 +2,15 @@ process GATHER {
     tag "$meta.id"
     label 'process_single'
 
+    container 'docker.io/mazzalab/fastqrepair_nf_env:1.0.1'
+
     input:
     tuple val(filename), val(meta), path(fastq_list)
+    tuple val(filename), val(meta), path(report_list)
 
     output:
     tuple val(meta), path("*merged_wiped.fastq.gz"), emit: fastq_merged_fixed
+    path("*merged_report.txt")                     , emit: report_merged
     path "versions.yml"                            , emit: versions
 
     // when:
@@ -19,6 +23,7 @@ process GATHER {
 
     """
     cat ${fastq_list} > ${filename}_merged_wiped.fastq.gz
+    wipertools summarygather -s ${report_list} -f ${filename}_merged_report.txt
 
     cat <<-END_VERSIONS > versions.yml
     "${task.process}":
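
Review note on `GATHER`: the wiped chunks are merged with a plain `cat`. This works because the gzip format allows several compressed members to be concatenated into one valid stream, which `gunzip` decompresses as if it were a single file. A small sketch that demonstrates the property with throwaway files follows; all file names are illustrative and unrelated to the pipeline's real outputs.

```bash
#!/bin/sh
# Demonstrates that concatenated gzip members decompress to the concatenated content.
# File names are illustrative only.

workdir=$(mktemp -d)

# Two tiny FASTQ chunks, one record (4 lines) each.
printf '@read1\nACGT\n+\nIIII\n' > "$workdir/chunk1.fastq"
printf '@read2\nTTGA\n+\nIIII\n' > "$workdir/chunk2.fastq"

# Compress each chunk independently, producing two separate gzip members.
gzip -c "$workdir/chunk1.fastq" > "$workdir/chunk1.fastq.gz"
gzip -c "$workdir/chunk2.fastq" > "$workdir/chunk2.fastq.gz"

# Merge by placing one member directly after the other.
cat "$workdir/chunk1.fastq.gz" "$workdir/chunk2.fastq.gz" > "$workdir/merged.fastq.gz"

# Decompress the merged file and count its lines.
merged_lines=$(gunzip -c "$workdir/merged.fastq.gz" | wc -l | tr -d ' ')
first_header=$(gunzip -c "$workdir/merged.fastq.gz" | head -n 1)
echo "merged lines: $merged_lines, first header: $first_header"

# Clean up the temporary directory.
rm -rf "$workdir"
```

The merged file keeps the records in the order the chunks were passed to `cat`, so the order of `${fastq_list}` determines the read order of the final output.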

diff --git a/modules/local/wipertools/wipe.nf b/modules/local/wipertools/wipe.nf
@@ -37,26 +37,27 @@ process WIPER {
 
     output:
     tuple val(meta), path("*_wiped.fastq.gz"), emit: fixed_fastq
+    tuple val(meta), path("*_report.txt")    , emit: report
     path "versions.yml"                      , emit: versions
 
-    // when:
-    // task.ext.when == null || task.ext.when
+    when:
+    task.ext.when == null || task.ext.when
 
     script:
     def args = task.ext.args ?: ''
     def prefix = task.ext.prefix ?: "${meta.id}"
     def filename = "${fastq.baseName}"
     def VERSION = '1.0.0' // WARN: Version information not provided by tool on CLI. Please update this string when bumping container versions.
+    def log_freq = params.chunk_size / 10 as Integer
 
     // TODO nf-core: Where possible, a command MUST be provided to obtain the version number of the software e.g. 1.10
     //               If the software is unable to output a version number on the command-line then it can be manually specified
     //               e.g. https://github.com/nf-core/modules/blob/master/modules/nf-core/homer/annotatepeaks/main.nf
     //               Each software used MUST provide the software name and version number in the YAML version file (versions.yml)
     // TODO nf-core: It MUST be possible to pass additional parameters to the tool as a command-line string via the "task.ext.args" directive
 
-    // TODO: SET THESE [-l [LOG_OUT]] [-f [LOG_FREQUENCY]] [-a [ALPHABET]] -v [to output the version]
     """
-    wipertools fastqwiper -i $fastq -o ${filename}_wiped.fastq.gz 
+    wipertools fastqwiper -i $fastq -o ${filename}_wiped.fastq.gz -f ${log_freq} -a ${params.alphabet} -l ${filename}_report.txt
 
     cat <<-END_VERSIONS > versions.yml
     "${task.process}":

diff --git a/subworkflows/local/scatter_wipe_gather/main.nf b/subworkflows/local/scatter_wipe_gather/main.nf
@@ -21,20 +21,25 @@ workflow SCATTER_WIPE_GATHER {
         ch_wiper
     }
 
-    ch_gather = Channel.empty()
-    ch_gather = WIPER.out.fixed_fastq.map{ metaData, fastq -> tuple( (fastq.baseName =~ /(.+)_chunk/)[0][1], metaData, fastq ) }
+    ch_fastq_gather  = Channel.empty()
+    ch_report_gather = Channel.empty()
+    ch_fastq_gather  = WIPER.out.fixed_fastq.map{ metaData, fastq -> tuple( (fastq.baseName =~ /(.+)_chunk/)[0][1], metaData, fastq ) }
                                      .groupTuple()
                                      .map{ basename, metadata, fastq -> tuple(basename, metadata.first(), fastq) }
-
-    GATHER {
-        ch_gather
-    }
+    ch_report_gather = WIPER.out.report.map{ metaData, report -> tuple( (report.baseName =~ /(.+)_chunk/)[0][1], metaData, report ) }
+                                       .groupTuple()
+                                       .map{ basename, metadata, report -> tuple(basename, metadata.first(), report) }
+    GATHER(
+        ch_fastq_gather,
+        ch_report_gather
+    )
 
     ch_versions = Channel.empty()
     ch_versions = ch_versions.mix(WIPER.out.versions.first())
 
     emit:
     fixed_fastq = GATHER.out.fastq_merged_fixed     // channel: [ val(meta), [ .fastq ] ]
+    report      = GATHER.out.report_merged
     versions    = ch_versions                       // channel: [ versions.yml ]
 }
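
Review note on the grouping key: both gather channels derive their key from the file name with the regular expression `(.+)_chunk`. Because `.+` is greedy, the key is everything before the last `_chunk`. The same extraction can be sketched in shell with suffix removal. The helper names `chunk_group_key` and `group_chunks` and the sample file names are hypothetical, since the chunk naming scheme is not visible in this diff.

```bash
#!/bin/sh
# Hypothetical helper: derive the grouping key used to collect chunks of one file.
# Mirrors the greedy regex (.+)_chunk by cutting at the LAST "_chunk" in the name.
chunk_group_key() {
    name=$(basename "$1")
    case "$name" in
        ?*_chunk*) printf '%s\n' "${name%_chunk*}" ;;  # strip shortest suffix starting at "_chunk"
        *)         return 1 ;;                          # no "_chunk" marker: no key
    esac
}

# Print "key<TAB>file" for every argument, sorted so chunks of one file are adjacent.
group_chunks() {
    for f in "$@"; do
        key=$(chunk_group_key "$f") || continue
        printf '%s\t%s\n' "$key" "$f"
    done | sort
}
```

Since `GATHER` now receives the FASTQ groups and the report groups as two separate input channels, it may be worth confirming that both arrive in the same order, or joining them on this key before the process call.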
 
diff --git a/subworkflows/local/utils_nfcore_fastqrepair_pipeline/main.nf b/subworkflows/local/utils_nfcore_fastqrepair_pipeline/main.nf
@@ -98,14 +98,32 @@ workflow PIPELINE_INITIALISATION {
 
     //
     // Check that any fastq file is not analyzed multiple times
-    //
-    validateInputSamplesheet2(ch_samplesheet)
+    // .flatten()
+    temp1 = ch_samplesheet.map{ meta, fastq -> fastq}.collect()
+    // hasDuplicates = temp1.size() != temp1.toSet().size()
+    hasDuplicates = temp1.count() != temp1.unique().count()
+    temp1.view()
+    temp1.unique().view()
+    println hasDuplicates
+
+    // println(temp1.unique().count().toInteger().view() == 1)
+    // println(temp1.count().toInteger().view() == '5')
+    // if(temp1.count().toInteger().view(){ $it } == 5){
+    //     print("HEREEEE")
+    // }
+
+    // println(temp1)
+    // println(temp2)
+
+    // validateInputSamplesheet2(ch_samplesheet.map{ meta, fastq -> fastq}.collect().toList())
 
     emit:
     samplesheet = ch_samplesheet
     versions    = ch_versions
 }
 
+
+
 /*
 ========================================================================================
     SUBWORKFLOW FOR PIPELINE COMPLETION
@@ -171,15 +189,18 @@ def validateInputSamplesheet(input) {
 //
 // Same fastq files are not allowed to be analyzed multiple times in the same run
 //
-def validateInputSamplesheet2(ch_samplesheet) {
-    all_fastq_files = ch_samplesheet.map{ meta, fastq -> fastq}.collect()
-    all_fastq_files.count().view()
-    all_fastq_files_unique = all_fastq_files.unique()
-    all_fastq_files_unique.count().view()
-
-    if(all_fastq_files.count() != all_fastq_files_unique.count()){
-        error("\nPlease check input samplesheet -> Multiple runs of a fastq file are not allowed")
-    }
+def validateInputSamplesheet2(input) {
+    def all_fastq_files = input.map(m -> m)
+    def all_fastq_files_unique = all_fastq_files.unique()
+
+    println "Validation!"
+    println all_fastq_files
+    println all_fastq_files.view()
+    println all_fastq_files_unique
+
+    // if(all_fastq_files.count() != all_fastq_files_unique.size){
+    //     error("\nPlease check input samplesheet -> Multiple runs of a fastq file are not allowed")
+    // }
 }
 
 //
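
Review note on the duplicate check: the README states that the same FASTQ file must not appear in more than one row, and the hunks above are still experimenting with how to enforce that. Note that `temp1.count()` and `temp1.unique().count()` return channels, so comparing them with `!=` likely compares channel objects rather than the counted values. As an independent illustration of the intended rule, the shell sketch below lists file names that occur more than once in a samplesheet. `duplicated_fastq_files` and `samplesheet_is_unique` are hypothetical helpers, and the sketch assumes a simple CSV without quoted fields.

```bash
#!/bin/sh
# Hypothetical helper: print every FASTQ path that appears more than once
# in columns 2 and 3 of a samplesheet (header: sample,fastq_1,fastq_2).
# Assumes plain comma-separated fields with no quoting.
duplicated_fastq_files() {
    tail -n +2 "$1" |                 # skip the header row
        cut -d, -f2,3 |               # keep the two FASTQ columns
        tr ',' '\n' |                 # one path per line
        sed '/^[[:space:]]*$/d' |     # drop empty fastq_2 cells of single-end rows
        sort | uniq -d                # keep only paths seen at least twice
}

# Succeeds when the samplesheet contains no repeated FASTQ path.
samplesheet_is_unique() {
    [ -z "$(duplicated_fastq_files "$1")" ]
}
```

Running `samplesheet_is_unique samplesheet.csv || duplicated_fastq_files samplesheet.csv` before launching the pipeline would print the offending paths. The equivalent check inside the workflow would need to operate on the collected list value (for example inside a `map` closure) rather than on the channel objects themselves.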

diff --git a/workflows/fastqrepair.nf b/workflows/fastqrepair.nf
@@ -32,7 +32,7 @@ workflow FASTQREPAIR {
     // Decouple paired-end reads
     ch_decoupled = ch_samplesheet.flatMap { metaData, filePaths -> filePaths.collect { file -> [metaData, file] } }
 
-    // Recover fastq files
+    // Recover fastq.gz and skip *.fastq or *.fq
     GZRT (
         ch_decoupled
     )
@@ -59,20 +59,10 @@ workflow FASTQREPAIR {
     BBMAPREPAIR {
         TRIMMOMATIC.out.trimmed_reads
     }
-
-
-    // SCATTER_WIPE_GATHER.out.fixed_fastq.view()
-
-
-
-    // Collect the values from both channels into lists
-    // ch_samplesheet.map { metaData, filePaths -> metaData }
-    //                 .combine(GZRT.out.fastq.toList())
-    //                 .set { ch1 }
 
-    // MODULE: Run FastQC
+    // Assess QC
     // FASTQC (
-    //     ch1
+    //     BBMAPREPAIR.out.interleaved_fastq
     // )