Commit

README filled. Check fastq replicas unstable
mazzalab committed Oct 11, 2024
1 parent bc96d04 commit 83b7d48
Showing 14 changed files with 1,110 additions and 65 deletions.
Empty file modified .devcontainer/Dockerfile
100755 → 100644
Empty file.
Empty file modified .devcontainer/environment.yml
100755 → 100644
Empty file.
56 changes: 28 additions & 28 deletions README.md
@@ -10,81 +10,81 @@
[![nf-test](https://img.shields.io/badge/unit_tests-nf--test-337ab7.svg)](https://www.nf-test.com)

[![Nextflow](https://img.shields.io/badge/nextflow%20DSL2-%E2%89%A523.04.0-23aa62.svg)](https://www.nextflow.io/)
[![run with conda](http://img.shields.io/badge/run%20with-conda-3EB049?labelColor=000000&logo=anaconda)](https://docs.conda.io/en/latest/)
<!-- [![run with conda](http://img.shields.io/badge/run%20with-conda-3EB049?labelColor=000000&logo=anaconda)](https://docs.conda.io/en/latest/) -->
[![run with docker](https://img.shields.io/badge/run%20with-docker-0db7ed?labelColor=000000&logo=docker)](https://www.docker.com/)
[![run with singularity](https://img.shields.io/badge/run%20with-singularity-1d355c.svg?labelColor=000000)](https://sylabs.io/docs/)
<!-- [![run with singularity](https://img.shields.io/badge/run%20with-singularity-1d355c.svg?labelColor=000000)](https://sylabs.io/docs/) -->
[![Launch on Seqera Platform](https://img.shields.io/badge/Launch%20%F0%9F%9A%80-Seqera%20Platform-%234256e7)](https://cloud.seqera.io/launch?pipeline=https://github.com/nf-core/fastqrepair)

[![Get help on Slack](http://img.shields.io/badge/slack-nf--core%20%23fastqrepair-4A154B?labelColor=000000&logo=slack)](https://nfcore.slack.com/channels/fastqrepair)[![Follow on Twitter](http://img.shields.io/badge/twitter-%40nf__core-1DA1F2?labelColor=000000&logo=twitter)](https://twitter.com/nf_core)[![Follow on Mastodon](https://img.shields.io/badge/mastodon-nf__core-6364ff?labelColor=FFFFFF&logo=mastodon)](https://mstdn.science/@nf_core)[![Watch on YouTube](http://img.shields.io/badge/youtube-nf--core-FF0000?labelColor=000000&logo=youtube)](https://www.youtube.com/c/nf-core)

## Introduction

**nf-core/fastqrepair** is a bioinformatics pipeline that ...
**nf-core/fastqrepair** is a bioinformatics pipeline that recovers corrupted `FASTQ.gz` files, drops or fixes non-compliant reads, removes unpaired reads, and restores the order of reads that became disordered. It takes a `samplesheet` and FASTQ/FASTQ.gz files as input (both single-end and paired-end) and produces clean FASTQ files and a QC report.

<!-- TODO nf-core:
Complete this sentence with a 2-3 sentence summary of what types of data the pipeline ingests, a brief overview of the
major pipeline sections and the types of output it produces. You're giving an overview to someone new
to nf-core here, in 15-20 seconds. For an example, see https://github.com/nf-core/rnaseq/blob/master/README.md#introduction
-->
![pipeline_diagram](docs/images/fastqrepair-flow-diagram-v1.0.svg)

<!-- TODO nf-core: Include a figure that guides the user through the major workflow steps. Many nf-core
workflows use the "tube map" design for that. See https://nf-co.re/docs/contributing/design_guidelines#examples for examples. -->
<!-- TODO nf-core: Fill in short bullet-pointed list of the default steps in the pipeline -->

1. Read QC ([`FastQC`](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/))
2. Present QC for raw reads ([`MultiQC`](http://multiqc.info/))
1. Recover reads from corrupted fastq.gz file ([`gzrt`](https://github.com/arenn/gzrt))
2. Make recovered reads well-formed ([`fastqwiper`](https://github.com/mazzalab/fastqwiper))
3. Drop unpaired reads ([`trimmomatic`](http://www.usadellab.org/cms/index.php?page=trimmomatic))
4. Re-pair reads ([`bbmap/repair.sh`](https://sourceforge.net/projects/bbmap/))

## Usage

> [!NOTE]
> If you are new to Nextflow and nf-core, please refer to [this page](https://nf-co.re/docs/usage/installation) on how to set-up Nextflow. Make sure to [test your setup](https://nf-co.re/docs/usage/introduction#how-to-run-a-pipeline) with `-profile test` before running the workflow on actual data.
<!-- TODO nf-core: Describe the minimum required steps to execute the pipeline, e.g. how to prepare samplesheets.
Explain what rows and columns represent. For instance (please edit as appropriate):
First, prepare a samplesheet with your input data that looks as follows:

`samplesheet.csv`:
**samplesheet.csv**:

```csv
```
sample,fastq_1,fastq_2
CONTROL_REP1,AEG588A1_S1_L002_R1_001.fastq.gz,AEG588A1_S1_L002_R2_001.fastq.gz
mysampleA,sample_R1.fastq.gz,sample_R2.fastq.gz
mysampleB,sample_R3.fastq.gz,sample_R4.fastq.gz
mysampleC,sample_R5.fastq.gz
```

Each row represents a fastq file (single-end) or a pair of fastq files (paired end).
-->
Each row represents a fastq file (single-end) or a pair of fastq files (paired-end). Rows with the same sample identifier are not allowed. Rows with different sample identifiers but the same file names are not allowed either.
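
The two samplesheet rules above can be checked before launching a run. The sketch below is a hypothetical pre-flight helper, not part of the pipeline: the function name `check_samplesheet` is an assumption, and it expects a plain comma-separated file with a single header row and no quoted fields.

```shell
# Hypothetical pre-flight check, not shipped with nf-core/fastqrepair.
# Prints one line per problem and returns non-zero if a sample identifier
# or a file name occurs more than once in the samplesheet.
check_samplesheet() {
    awk -F',' '
        NR == 1 { next }                   # skip the header row
        $0 ~ /^[[:space:]]*$/ { next }     # skip blank lines
        {
            if (seen_id[$1]++) { print "duplicate sample: " $1; bad = 1 }
            for (i = 2; i <= NF; i++) {
                if ($i == "") continue     # empty fastq_2 on single-end rows
                if (seen_file[$i]++) { print "duplicate file: " $i; bad = 1 }
            }
        }
        END { exit (bad ? 1 : 0) }
    ' "$1"
}
```

One way to use it is `check_samplesheet samplesheet.csv` right before `nextflow run`, stopping if it returns a non-zero status.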

Now, you can run the pipeline using:

<!-- TODO nf-core: update the following command to include all required parameters for a minimal example -->

```bash
nextflow run nf-core/fastqrepair \
-profile <docker/singularity/.../institute> \
--input samplesheet.csv \
--outdir <OUTDIR>
```

Optional parameters are:
```txt
--chunk_size <int multiple of 4>
--qin <33/64>
--alphabet <ACGTN>
```
where
> `chunk_size` is the number of lines in each chunk that the original fastq file is split into (caution: values that are too large or too small may significantly affect performance); `qin` is the ASCII offset of the quality scores (33 = Sanger, 64 = old Solexa); `alphabet` is the set of characters allowed in the SEQ line of the FASTQ file.
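
The constraints on `--chunk_size` and `--qin` can be verified in a wrapper script before the pipeline starts. This is a minimal sketch under assumptions: the helper names `is_valid_chunk_size` and `is_valid_qin` are hypothetical, and the multiple-of-4 rule is read here as a consequence of a FASTQ record usually spanning four lines.

```shell
# Hypothetical helpers, not part of the pipeline.

# Returns 0 when the argument is a positive integer multiple of 4.
is_valid_chunk_size() {
    case "$1" in
        ''|0*|*[!0-9]*) return 1 ;;   # reject empty, zero or zero-padded, non-numeric
    esac
    [ $(( $1 % 4 )) -eq 0 ]
}

# Returns 0 when the argument is one of the two documented ASCII offsets.
is_valid_qin() {
    [ "$1" = "33" ] || [ "$1" = "64" ]
}
```

For example, `is_valid_chunk_size 400000 && is_valid_qin 33` succeeds, while `is_valid_chunk_size 402` fails.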

> [!WARNING]
> Please provide pipeline parameters via the CLI or Nextflow `-params-file` option. Custom config files including those provided by the `-c` Nextflow option can be used to provide any configuration _**except for parameters**_;
> see [docs](https://nf-co.re/usage/configuration#custom-configuration-files).
For more details and further functionality, please refer to the [usage documentation](https://nf-co.re/fastqrepair/usage) and the [parameter documentation](https://nf-co.re/fastqrepair/parameters).

## Pipeline output
This pipeline produces clean and well-formed fastq files together with short textual reports of the cleaning actions.

To see the results of an example test run with a full size dataset refer to the [results](https://nf-co.re/fastqrepair/results) tab on the nf-core website pipeline page.
For more details about the output files and reports, please refer to the
[output documentation](https://nf-co.re/fastqrepair/output).

## Credits

nf-core/fastqrepair was originally written by Tommaso Mazza.
nf-core/fastqrepair was designed and written by [Tommaso Mazza](https://github.com/mazzalab).

We thank the following people for their extensive assistance in the development of this pipeline:
<!-- We thank the following people for their extensive assistance in the development of this pipeline: -->

<!-- TODO nf-core: If applicable, make list of people who have also contributed -->
<!-- nf-core: If applicable, make list of people who have also contributed -->

## Contributions and Support

6 changes: 6 additions & 0 deletions custom.config
@@ -0,0 +1,6 @@
process {
withName: 'NFCORE_FASTQREPAIR:FASTQREPAIR:GZRT' {
// docker.registry = 'registry.hub.docker.com/mazzalab/fastqwiper:latest'
container = 'registry.hub.docker.com/mazzalab/fastqwiper:latest'
}
}
1,012 changes: 1,012 additions & 0 deletions docs/images/fastqrepair-flow-diagram-v1.0.svg
Binary file removed docs/images/mqc_fastqc_adapter.png
Binary file not shown.
Binary file removed docs/images/mqc_fastqc_counts.png
Binary file not shown.
Binary file removed docs/images/mqc_fastqc_quality.png
Binary file not shown.
11 changes: 8 additions & 3 deletions modules/local/gzrt.nf
@@ -58,11 +58,16 @@ process GZRT {


"""
gzrecover -o ${filename}_recovered.fastq ${fastqgz} -v
ver_line=""
if [[ $fastqgz == *.fastq ]] || [[ $fastqgz == *.fq ]]; then
mv $fastqgz ${filename}_recovered.fastq
else
gzrecover -o ${filename}_recovered.fastq ${fastqgz} -v
ver_line="${task.process}: gzrt: \$(gzrecover -V |& sed '1!d ; s/gzrecover //')"
fi
cat <<-END_VERSIONS > versions.yml
"${task.process}":
gzrt: \$(gzrecover -V |& sed '1!d ; s/gzrecover //')
"\${ver_line}"
END_VERSIONS
"""

5 changes: 5 additions & 0 deletions modules/local/wipertools/gather.nf
@@ -2,11 +2,15 @@ process GATHER {
tag "$meta.id"
label 'process_single'

container 'docker.io/mazzalab/fastqrepair_nf_env:1.0.1'

input:
tuple val(filename), val(meta), path(fastq_list)
tuple val(filename), val(meta), path(report_list)

output:
tuple val(meta), path("*merged_wiped.fastq.gz"), emit: fastq_merged_fixed
path("*merged_report.txt") , emit: report_merged
path "versions.yml" , emit: versions

// when:
@@ -19,6 +23,7 @@ process GATHER {

"""
cat ${fastq_list} > ${filename}_merged_wiped.fastq.gz
wipertools summarygather -s ${report_list} -f ${filename}_merged_report.txt
cat <<-END_VERSIONS > versions.yml
"${task.process}":
9 changes: 5 additions & 4 deletions modules/local/wipertools/wipe.nf
@@ -37,26 +37,27 @@ process WIPER {

output:
tuple val(meta), path("*_wiped.fastq.gz"), emit: fixed_fastq
tuple val(meta), path("*_report.txt") , emit: report
path "versions.yml" , emit: versions

// when:
// task.ext.when == null || task.ext.when
when:
task.ext.when == null || task.ext.when

script:
def args = task.ext.args ?: ''
def prefix = task.ext.prefix ?: "${meta.id}"
def filename = "${fastq.baseName}"
def VERSION = '1.0.0' // WARN: Version information not provided by tool on CLI. Please update this string when bumping container versions.
def log_freq = params.chunk_size / 10 as Integer

// TODO nf-core: Where possible, a command MUST be provided to obtain the version number of the software e.g. 1.10
// If the software is unable to output a version number on the command-line then it can be manually specified
// e.g. https://github.com/nf-core/modules/blob/master/modules/nf-core/homer/annotatepeaks/main.nf
// Each software used MUST provide the software name and version number in the YAML version file (versions.yml)
// TODO nf-core: It MUST be possible to pass additional parameters to the tool as a command-line string via the "task.ext.args" directive

// TODO: SET THESE [-l [LOG_OUT]] [-f [LOG_FREQUENCY]] [-a [ALPHABET]] -v [to output the version]
"""
wipertools fastqwiper -i $fastq -o ${filename}_wiped.fastq.gz
wipertools fastqwiper -i $fastq -o ${filename}_wiped.fastq.gz -f ${log_freq} -a ${params.alphabet} -l ${filename}_report.txt
cat <<-END_VERSIONS > versions.yml
"${task.process}":
17 changes: 11 additions & 6 deletions subworkflows/local/scatter_wipe_gather/main.nf
@@ -21,20 +21,25 @@ workflow SCATTER_WIPE_GATHER {
ch_wiper
}

ch_gather = Channel.empty()
ch_gather = WIPER.out.fixed_fastq.map{ metaData, fastq -> tuple( (fastq.baseName =~ /(.+)_chunk/)[0][1], metaData, fastq ) }
ch_fastq_gather = Channel.empty()
ch_report_gather = Channel.empty()
ch_fastq_gather = WIPER.out.fixed_fastq.map{ metaData, fastq -> tuple( (fastq.baseName =~ /(.+)_chunk/)[0][1], metaData, fastq ) }
.groupTuple()
.map{ basename, metadata, fastq -> tuple(basename, metadata.first(), fastq) }

GATHER {
ch_gather
}
ch_report_gather = WIPER.out.report.map{ metaData, report -> tuple( (report.baseName =~ /(.+)_chunk/)[0][1], metaData, report ) }
.groupTuple()
.map{ basename, metadata, report -> tuple(basename, metadata.first(), report) }
GATHER(
ch_fastq_gather,
ch_report_gather
)

ch_versions = Channel.empty()
ch_versions = ch_versions.mix(WIPER.out.versions.first())

emit:
fixed_fastq = GATHER.out.fastq_merged_fixed // channel: [ val(meta), [ .fastq ] ]
report = GATHER.out.report_merged
versions = ch_versions // channel: [ versions.yml ]
}
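
The grouping key used above is the part of the chunk file name that precedes `_chunk`. The sketch below shows that extraction outside Nextflow; the function names and the example file names are hypothetical, and unlike the Groovy expression it returns the name unchanged when `_chunk` is absent instead of failing.

```shell
# Hypothetical sketch: chunk file names here are invented for illustration.
# Prints the text before the last "_chunk" in a file name, i.e. what the
# greedy capture group in /(.+)_chunk/ keeps.
chunk_group_key() {
    printf '%s\n' "$1" | sed 's/^\(..*\)_chunk.*$/\1/'
}

# Prints the distinct group keys for a list of chunk files, one per line.
chunk_group_keys() {
    for f in "$@"; do
        chunk_group_key "$f"
    done | sort -u
}
```

Files that share a key are the ones that would end up in the same tuple after `groupTuple()`.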

43 changes: 32 additions & 11 deletions subworkflows/local/utils_nfcore_fastqrepair_pipeline/main.nf
@@ -98,14 +98,32 @@ workflow PIPELINE_INITIALISATION {

//
// Check that any fastq file is not analyzed multiple times
//
validateInputSamplesheet2(ch_samplesheet)
// .flatten()
temp1 = ch_samplesheet.map{ meta, fastq -> fastq}.collect()
// hasDuplicates = temp1.size() != temp1.toSet().size()
hasDuplicates = temp1.count() != temp1.unique().count()
temp1.view()
temp1.unique().view()
println hasDuplicates

// println(temp1.unique().count().toInteger().view() == 1)
// println(temp1.count().toInteger().view() == '5')
// if(temp1.count().toInteger().view(){ $it } == 5){
// print("HEREEEE")
// }

// println(temp1)
// println(temp2)

// validateInputSamplesheet2(ch_samplesheet.map{ meta, fastq -> fastq}.collect().toList())

emit:
samplesheet = ch_samplesheet
versions = ch_versions
}



/*
========================================================================================
SUBWORKFLOW FOR PIPELINE COMPLETION
@@ -171,15 +189,18 @@ def validateInputSamplesheet(input) {
//
// Same fastq files are not allowed to be analyzed multiple times in the same run
//
def validateInputSamplesheet2(ch_samplesheet) {
all_fastq_files = ch_samplesheet.map{ meta, fastq -> fastq}.collect()
all_fastq_files.count().view()
all_fastq_files_unique = all_fastq_files.unique()
all_fastq_files_unique.count().view()

if(all_fastq_files.count() != all_fastq_files_unique.count()){
error("\nPlease check input samplesheet -> Multiple runs of a fastq file are not allowed")
}
def validateInputSamplesheet2(input) {
def all_fastq_files = input.map(m -> m)
def all_fastq_files_unique = all_fastq_files.unique()

println "Validation!"
println all_fastq_files
println all_fastq_files.view()
println all_fastq_files_unique

// if(all_fastq_files.count() != all_fastq_files_unique.size){
// error("\nPlease check input samplesheet -> Multiple runs of a fastq file are not allowed")
// }
}

//
16 changes: 3 additions & 13 deletions workflows/fastqrepair.nf
@@ -32,7 +32,7 @@ workflow FASTQREPAIR {
// Decouple paired-end reads
ch_decoupled = ch_samplesheet.flatMap { metaData, filePaths -> filePaths.collect { file -> [metaData, file] } }

// Recover fastq files
// Recover fastq.gz and skip *.fastq or *.fq
GZRT (
ch_decoupled
)
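
The skip rule named in the comment above depends only on the file extension. A minimal sketch of that decision as a standalone shell function follows; the name `needs_recovery` is hypothetical, and the commented usage reuses only the `gzrecover -o ... -v` call that appears in `modules/local/gzrt.nf`.

```shell
# Hypothetical helper mirroring the extension test in the GZRT module:
# returns 1 for uncompressed *.fastq / *.fq inputs, 0 for anything else.
needs_recovery() {
    case "$1" in
        *.fastq|*.fq) return 1 ;;   # plain text input: nothing to recover
        *) return 0 ;;              # treated as compressed and passed to gzrecover
    esac
}

# Possible usage (assumes gzrecover is on PATH and "$f" is the input file):
#   if needs_recovery "$f"; then
#       gzrecover -o recovered.fastq "$f" -v
#   else
#       mv "$f" recovered.fastq
#   fi
```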
@@ -59,20 +59,10 @@ workflow FASTQREPAIR {
BBMAPREPAIR {
TRIMMOMATIC.out.trimmed_reads
}


// SCATTER_WIPE_GATHER.out.fixed_fastq.view()



// Collect the values from both channels into lists
// ch_samplesheet.map { metaData, filePaths -> metaData }
// .combine(GZRT.out.fastq.toList())
// .set { ch1 }

// MODULE: Run FastQC
// Assess QC
// FASTQC (
// ch1
// BBMAPREPAIR.out.interleaved_fastq
// )

