Skip to content
This repository has been archived by the owner on Jan 27, 2020. It is now read-only.

Remapping #717

Merged
merged 17 commits into from
Feb 20, 2019
Merged
Show file tree
Hide file tree
Changes from 16 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
10 changes: 8 additions & 2 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -8,15 +8,21 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.
## [Unreleased]

### `Added`

- [#712](https://github.com/SciLifeLab/Sarek/pull/712), [#718](https://github.com/SciLifeLab/Sarek/pull/718) - Added possibilities to run Sarek with `conda`

### `Changed`

- [#710](https://github.com/SciLifeLab/Sarek/pull/710) - Improve release checklist and script
- [#711](https://github.com/SciLifeLab/Sarek/pull/711) - Improve configuration priorities
- [#719](https://github.com/SciLifeLab/Sarek/pull/719) - `vepCacheVersion` is now defined in `conf/genomes.config` or `conf/igenomes.config`
- [#719](https://github.com/SciLifeLab/Sarek/pull/719) - `snpeff` and `vep` containers are now built with conda
- [#716](https://github.com/SciLifeLab/Sarek/pull/716) - Update paths to containers and iGenomes
- [#717](https://github.com/SciLifeLab/Sarek/pull/717) - `mapping` step can now map BAM files too
- [#717](https://github.com/SciLifeLab/Sarek/pull/717) - `fastqFiles` renamed to `inputFiles`
- [#717](https://github.com/SciLifeLab/Sarek/pull/717) - `MapReads` can now convert BAM to FASTQ and feed it to BWA on the fly
- [#717](https://github.com/SciLifeLab/Sarek/pull/717) - `checkFileExtension` has changed to `hasExtension`, and now only verify if file has extension
- [#717](https://github.com/SciLifeLab/Sarek/pull/717) - Update documentation
- [#719](https://github.com/SciLifeLab/Sarek/pull/719) - `snpeff` and `vep` containers are now built with conda
- [#719](https://github.com/SciLifeLab/Sarek/pull/719) - `vepCacheVersion` is now defined in `conf/genomes.config` or `conf/igenomes.config`
- [#724](https://github.com/SciLifeLab/Sarek/pull/724) - Improved AwsBatch configuration

### `Added`
Expand Down
30 changes: 16 additions & 14 deletions docs/INPUT.md
Original file line number Diff line number Diff line change
@@ -1,6 +1,7 @@
# TSV file for sample

Input files for Sarek can be specified using a tsv file given to the `--sample` parameter. The tsv file is a Tab Separated Value file with columns: `subject gender status sample lane fastq1 fastq2` or `subject gender status sample bam bai`.
Input files for Sarek can be specified using a TSV file given to the `--sample` parameter.
The TSV file is a Tab Separated Value file with columns: `subject gender status sample lane fastq1 fastq2`, `subject gender status sample lane bam` or `subject gender status sample bam bai`.
The content of these columns should be quite straight-forward:

- `subject` designate the subject, it should be the ID of the Patient, or if you don't have one, it could be the Normal ID Sample.
Expand All @@ -13,11 +14,14 @@ The content of these columns should be quite straight-forward:
- `bam` is the bam file
- `bai` is the index

All examples are given for a normal/tumor pair. If no tumors are listed in the TSV file, then the workflow will proceed as if it was a single normal sample instead of a normal/tumor pair.
All examples are given for a normal/tumor pair.
If no tumors are listed in the TSV file, then the workflow will proceed as if it is a single normal sample instead of a normal/tumor pair.

# Example TSV file for a normal/tumor pair with FASTQ files

In this sample for the normal case there are 3 read groups, and 2 for the tumor. It is recommended to add the absolute path of the paired FASTQ files, but relative path should work also. Note, the delimiter is the tab (\t) character:
In this sample for the normal case there are 3 read groups, and 2 for the tumor.
It is recommended to add the absolute path of the paired FASTQ files, but relative path should work also.
Note, the delimiter is the tab (\t) character:

```
G15511 XX 0 C09DFN C09DF_1 pathToFiles/C09DFACXX111207.1_1.fastq.gz pathToFiles/C09DFACXX111207.1_2.fastq.gz
Expand All @@ -29,23 +33,21 @@ G15511 XX 1 D0ENMT D0ENM_2 pathToFiles/D0ENMACXX111207.2_1.fastq.

# Example TSV file for a normal/tumor pair with BAM files

On the other hand, if you have BAMs (T/N pairs that were not realigned together) and their indexes, you should use a structure like:
In this sample for the normal case there are 3 read groups, and 2 for the tumor.
It is recommended to add the absolute path of BAM files, but relative path should work also.
Note, the delimiter is the tab (\t) character:

```
G15511 XX 0 C09DFN pathToFiles/G15511.C09DFN.md.real.bam pathToFiles/G15511.C09DFN.md.real.bai
G15511 XX 1 D0ENMT pathToFiles/G15511.D0ENMT.md.real.bam pathToFiles/G15511.D0ENMT.md.real.bai
```

All the files will be created in the Preprocessing/NonRealigned/ directory, and by default a corresponding TSV file will also be deposited there. Generally, getting MuTect1 and Strelka calls on the preprocessed files should be done by:

```bash
nextflow run SciLifeLab/Sarek/main.nf --sample Preprocessing/NonRealigned/mysample.tsv --step realign
nextflow run SciLifeLab/Sarek/somaticVC.nf --sample Preprocessing/Recalibrated/mysample.tsv --tools Mutect2,Strelka
G15511 XX 0 C09DFN C09DF_1 pathToFiles/C09DFAC_1.bam
G15511 XX 0 C09DFN C09DF_2 pathToFiles/C09DFAC_2.bam
G15511 XX 0 C09DFN C09DF_3 pathToFiles/C09DFAC_3.bam
G15511 XX 1 D0ENMT D0ENM_1 pathToFiles/D0ENMAC_1.bam
G15511 XX 1 D0ENMT D0ENM_2 pathToFiles/D0ENMAC_2.bam
```

# Example TSV file for a normal/tumor pair with recalibrated BAM files

The same way, if you have recalibrated BAMs (T/N pairs that were realigned together) and their indexes, you should use a structure like:
The same way, if you have recalibrated BAMs and their indexes, you should use a structure like:

```
G15511 XX 0 C09DFN pathToFiles/G15511.C09DFN.md.real.bam pathToFiles/G15511.C09DFN.md.real.bai
Expand Down
14 changes: 7 additions & 7 deletions lib/SarekUtils.groovy
Original file line number Diff line number Diff line change
Expand Up @@ -3,11 +3,6 @@ import nextflow.Channel

class SarekUtils {

// Check file extension
static def checkFileExtension(it, extension) {
if (!it.toString().toLowerCase().endsWith(extension.toLowerCase())) exit 1, "File: ${it} has the wrong extension: ${extension} see --help for more information"
}

// Check if a row has the expected number of item
static def checkNumberOfItem(row, number) {
if (row.size() != number) exit 1, "Malformed row in TSV file: ${row}, see --help for more information"
Expand Down Expand Up @@ -173,8 +168,8 @@ class SarekUtils {
def bamFile = SarekUtils.returnFile(row[4])
def baiFile = SarekUtils.returnFile(row[5])

SarekUtils.checkFileExtension(bamFile,".bam")
SarekUtils.checkFileExtension(baiFile,".bai")
if (!SarekUtils.hasExtension(bamFile,".bam")) exit 1, "File: ${bamFile} has the wrong extension. See --help for more information"
if (!SarekUtils.hasExtension(baiFile,".bai")) exit 1, "File: ${baiFile} has the wrong extension. See --help for more information"

if (mode == "germline") return [ idPatient, status, idSample, bamFile, baiFile ]
else return [ idPatient, gender, status, idSample, bamFile, baiFile ]
Expand All @@ -193,6 +188,11 @@ class SarekUtils {
[genders, channel]
}

// Check file extension
static def hasExtension(it, extension) {
it.toString().toLowerCase().endsWith(extension.toLowerCase())
}

// Compare params to list of verified params
static def isAllowedParams(params) {
def test = true
Expand Down
88 changes: 58 additions & 30 deletions main.nf
Original file line number Diff line number Diff line change
Expand Up @@ -80,20 +80,20 @@ if (!params.sample && !params.sampleDir) {
if (params.test || step != 'mapping') tsvPath = tsvPaths[step]
}

// Set up the fastqFiles and bamFiles channels. One of them remains empty
fastqFiles = Channel.empty()
// Set up the inputFiles and bamFiles channels. One of them will remain empty
inputFiles = Channel.empty()
bamFiles = Channel.empty()
if (tsvPath) {
tsvFile = file(tsvPath)
switch (step) {
case 'mapping': fastqFiles = extractFastq(tsvFile); break
case 'mapping': inputFiles = extractSample(tsvFile); break
case 'recalibrate': bamFiles = extractRecal(tsvFile); break
default: exit 1, "Unknown step ${step}"
}
} else if (params.sampleDir) {
if (step != 'mapping') exit 1, '--sampleDir does not support steps other than "mapping"'
fastqFiles = extractFastqFromDir(params.sampleDir)
(fastqFiles, fastqTmp) = fastqFiles.into(2)
inputFiles = extractFastqFromDir(params.sampleDir)
(inputFiles, fastqTmp) = inputFiles.into(2)
fastqTmp.toList().subscribe onNext: {
if (it.size() == 0) {
exit 1, "No FASTQ files found in --sampleDir directory '${params.sampleDir}'"
Expand All @@ -102,8 +102,8 @@ if (tsvPath) {
tsvFile = params.sampleDir // used in the reports
} else exit 1, 'No sample were defined, see --help'

if (step == 'mapping') (patientGenders, fastqFiles) = SarekUtils.extractGenders(fastqFiles)
else (patientGenders, bamFiles) = SarekUtils.extractGenders(bamFiles)
if (step == 'recalibrate') (patientGenders, bamFiles) = SarekUtils.extractGenders(bamFiles)
else (patientGenders, inputFiles) = SarekUtils.extractGenders(inputFiles)

/*
================================================================================
Expand All @@ -113,9 +113,9 @@ else (patientGenders, bamFiles) = SarekUtils.extractGenders(bamFiles)

startMessage()

(fastqFiles, fastqFilesforFastQC) = fastqFiles.into(2)
(inputFiles, inputFilesforFastQC) = inputFiles.into(2)

if (params.verbose) fastqFiles = fastqFiles.view {
if (params.verbose) inputFiles = inputFiles.view {
"FASTQs to preprocess:\n\
ID : ${it[0]}\tStatus: ${it[1]}\tSample: ${it[2]}\tRun : ${it[3]}\n\
Files : [${it[4].fileName}, ${it[5].fileName}]"
Expand All @@ -133,16 +133,17 @@ process RunFastQC {
publishDir "${directoryMap.fastQC}/${idRun}", mode: params.publishDirMode

input:
set idPatient, status, idSample, idRun, file(fastqFile1), file(fastqFile2) from fastqFilesforFastQC
set idPatient, status, idSample, idRun, file(inputFile1), file(inputFile2) from inputFilesforFastQC

output:
file "*_fastqc.{zip,html}" into fastQCreport

when: step == 'mapping' && !params.noReports

script:
inputFiles = SarekUtils.hasExtension(inputFile1,"fastq.gz") ? "${inputFile1} ${inputFile2}" : "${inputFile1}"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably allow fq.gz as well right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can think of it in the extractSample() function, but currently I can't think of anyone asking me about this particular .fq.gz extension.

"""
fastqc -t 2 -q ${fastqFile1} ${fastqFile2}
fastqc -t 2 -q ${inputFiles}
"""
}

Expand All @@ -155,7 +156,7 @@ process MapReads {
tag {idPatient + "-" + idRun}

input:
set idPatient, status, idSample, idRun, file(fastqFile1), file(fastqFile2) from fastqFiles
set idPatient, status, idSample, idRun, file(inputFile1), file(inputFile2) from inputFiles
set file(genomeFile), file(bwaIndex) from Channel.value([referenceMap.genomeFile, referenceMap.bwaIndex])

output:
Expand All @@ -168,11 +169,31 @@ process MapReads {
readGroup = "@RG\\tID:${idRun}\\t${CN}PU:${idRun}\\tSM:${idSample}\\tLB:${idSample}\\tPL:illumina"
// adjust mismatch penalty for tumor samples
extra = status == 1 ? "-B 3" : ""
"""
bwa mem -R \"${readGroup}\" ${extra} -t ${task.cpus} -M \
${genomeFile} ${fastqFile1} ${fastqFile2} | \
samtools sort --threads ${task.cpus} -m 2G - > ${idRun}.bam
"""
if (SarekUtils.hasExtension(inputFile1,"fastq.gz"))
"""
bwa mem -R \"${readGroup}\" ${extra} -t ${task.cpus} -M \
${genomeFile} ${inputFile1} ${inputFile2} | \
samtools sort --threads ${task.cpus} -m 2G - > ${idRun}.bam
"""
else if (SarekUtils.hasExtension(inputFile1,"bam"))
// -K is an hidden option, used to fix the number of reads processed by bwa mem
// Chunk size can affect bwa results, if not specified, the number of threads can change
// which can give not deterministic result.
// cf https://github.com/CCDG/Pipeline-Standardization/blob/master/PipelineStandard.md
// and https://github.com/gatk-workflows/gatk4-data-processing/blob/8ffa26ff4580df4ac3a5aa9e272a4ff6bab44ba2/processing-for-variant-discovery-gatk4.b37.wgs.inputs.json#L29
"""
gatk --java-options -Xmx${task.memory.toGiga()}g \
SamToFastq \
--INPUT=${inputFile1} \
--FASTQ=/dev/stdout \
--INTERLEAVE=true \
--NON_PF=true \
| \
bwa mem -K 100000000 -p -R \"${readGroup}\" ${extra} -t ${task.cpus} -M ${genomeFile} \
/dev/stdin - 2> >(tee ${inputFile1}.bwa.stderr.log >&2) \
| \
samtools sort --threads ${task.cpus} -m 2G - > ${idRun}.bam
"""
}

if (params.verbose) mappedBam = mappedBam.view {
Expand Down Expand Up @@ -343,7 +364,7 @@ process CreateRecalibrationTable {
set idPatient, status, idSample, file("${idSample}.recal.table") into recalibrationTable
set idPatient, status, idSample, val("${idSample}_${status}.md.bam"), val("${idSample}_${status}.md.bai"), val("${idSample}.recal.table") into recalibrationTableTSV

when: ( step == 'mapping' ) && !params.onlyQC
when: step == 'mapping' && !params.onlyQC

script:
known = knownIndels.collect{ "--known-sites ${it}" }.join(' ')
Expand Down Expand Up @@ -533,25 +554,32 @@ def defineStepList() {
]
}

def extractFastq(tsvFile) {
// Channeling the TSV file containing FASTQ.
def extractSample(tsvFile) {
// Channeling the TSV file containing FASTQ or BAM
// Format is: "subject gender status sample lane fastq1 fastq2"
// or: "subject gender status sample lane bam"
Channel.from(tsvFile)
.splitCsv(sep: '\t')
.map { row ->
SarekUtils.checkNumberOfItem(row, 7)
def idPatient = row[0]
def gender = row[1]
def status = SarekUtils.returnStatus(row[2].toInteger())
def idSample = row[3]
def idRun = row[4]
def fastqFile1 = SarekUtils.returnFile(row[5])
def fastqFile2 = SarekUtils.returnFile(row[6])

SarekUtils.checkFileExtension(fastqFile1,".fastq.gz")
SarekUtils.checkFileExtension(fastqFile2,".fastq.gz")
def file1 = SarekUtils.returnFile(row[5])
def file2 = file("null")
if (file1.toString().toLowerCase().endsWith(".fastq.gz")) {
SarekUtils.checkNumberOfItem(row, 7)
file2 = SarekUtils.returnFile(row[6])
if (!SarekUtils.hasExtension(file2,"fastq.gz")) exit 1, "File: ${file2} has the wrong extension. See --help for more information"
}
else if (file1.toString().toLowerCase().endsWith(".bam")) {
SarekUtils.checkNumberOfItem(row, 6)
if (!SarekUtils.hasExtension(file1,"bam")) exit 1, "File: ${file1} has the wrong extension. See --help for more information"
}
else "No recognisable extention for input file: ${file1}"

[idPatient, gender, status, idSample, idRun, fastqFile1, fastqFile2]
[idPatient, gender, status, idSample, idRun, file1, file2]
}
}

Expand Down Expand Up @@ -603,9 +631,9 @@ def extractRecal(tsvFile) {
def baiFile = SarekUtils.returnFile(row[5])
def recalTable = SarekUtils.returnFile(row[6])

SarekUtils.checkFileExtension(bamFile,".bam")
SarekUtils.checkFileExtension(baiFile,".bai")
SarekUtils.checkFileExtension(recalTable,".recal.table")
if (!SarekUtils.hasExtension(bamFile,"bam")) exit 1, "File: ${bamFile} has the wrong extension. See --help for more information"
if (!SarekUtils.hasExtension(baiFile,"bai")) exit 1, "File: ${baiFile} has the wrong extension. See --help for more information"
if (!SarekUtils.hasExtension(recalTable,"recal.table")) exit 1, "File: ${recalTable} has the wrong extension. See --help for more information"

[ idPatient, gender, status, idSample, bamFile, baiFile, recalTable ]
}
Expand Down