Skip to content
This repository has been archived by the owner on Jan 27, 2020. It is now read-only.

Commit

Permalink
Merge pull request #717 from MaxUlysse/remapping
Browse files Browse the repository at this point in the history
Remapping
  • Loading branch information
Szilveszter Juhos authored Feb 20, 2019
2 parents 7fdab77 + 5af588e commit e331795
Show file tree
Hide file tree
Showing 4 changed files with 89 additions and 53 deletions.
10 changes: 8 additions & 2 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -8,15 +8,21 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.
## [Unreleased]

### `Added`

- [#712](https://github.com/SciLifeLab/Sarek/pull/712), [#718](https://github.com/SciLifeLab/Sarek/pull/718) - Added possibilities to run Sarek with `conda`

### `Changed`

- [#710](https://github.com/SciLifeLab/Sarek/pull/710) - Improve release checklist and script
- [#711](https://github.com/SciLifeLab/Sarek/pull/711) - Improve configuration priorities
- [#719](https://github.com/SciLifeLab/Sarek/pull/719) - `vepCacheVersion` is now defined in `conf/genomes.config` or `conf/igenomes.config`
- [#719](https://github.com/SciLifeLab/Sarek/pull/719) - `snpeff` and `vep` containers are now built with conda
- [#716](https://github.com/SciLifeLab/Sarek/pull/716) - Update paths to containers and iGenomes
- [#717](https://github.com/SciLifeLab/Sarek/pull/717) - `mapping` step can now map BAM files too
- [#717](https://github.com/SciLifeLab/Sarek/pull/717) - `fastqFiles` renamed to `inputFiles`
- [#717](https://github.com/SciLifeLab/Sarek/pull/717) - `MapReads` can now convert BAM to FASTQ and feed it to BWA on the fly
- [#717](https://github.com/SciLifeLab/Sarek/pull/717) - `checkFileExtension` has changed to `hasExtension`, and now only verify if file has extension
- [#717](https://github.com/SciLifeLab/Sarek/pull/717) - Update documentation
- [#719](https://github.com/SciLifeLab/Sarek/pull/719) - `snpeff` and `vep` containers are now built with conda
- [#719](https://github.com/SciLifeLab/Sarek/pull/719) - `vepCacheVersion` is now defined in `conf/genomes.config` or `conf/igenomes.config`
- [#722](https://github.com/SciLifeLab/Sarek/pull/722) - Update `Sarek-data` submodule
- [#723](https://github.com/SciLifeLab/Sarek/pull/723), [#725](https://github.com/SciLifeLab/Sarek/pull/725) - Update docs
- [#724](https://github.com/SciLifeLab/Sarek/pull/724) - Improved AwsBatch configuration
Expand Down
30 changes: 16 additions & 14 deletions docs/INPUT.md
Original file line number Diff line number Diff line change
@@ -1,6 +1,7 @@
# TSV file for sample

Input files for Sarek can be specified using a tsv file given to the `--sample` parameter. The tsv file is a Tab Separated Value file with columns: `subject gender status sample lane fastq1 fastq2` or `subject gender status sample bam bai`.
Input files for Sarek can be specified using a TSV file given to the `--sample` parameter.
The TSV file is a Tab Separated Value file with columns: `subject gender status sample lane fastq1 fastq2`, `subject gender status sample lane bam` or `subject gender status sample bam bai`.
The content of these columns should be quite straight-forward:

- `subject` designate the subject, it should be the ID of the Patient, or if you don't have one, it could be the Normal ID Sample.
Expand All @@ -13,11 +14,14 @@ The content of these columns should be quite straight-forward:
- `bam` is the bam file
- `bai` is the index

All examples are given for a normal/tumor pair. If no tumors are listed in the TSV file, then the workflow will proceed as if it was a single normal sample instead of a normal/tumor pair.
All examples are given for a normal/tumor pair.
If no tumors are listed in the TSV file, then the workflow will proceed as if it is a single normal sample instead of a normal/tumor pair.

# Example TSV file for a normal/tumor pair with FASTQ files

In this sample for the normal case there are 3 read groups, and 2 for the tumor. It is recommended to add the absolute path of the paired FASTQ files, but relative path should work also. Note, the delimiter is the tab (\t) character:
In this sample for the normal case there are 3 read groups, and 2 for the tumor.
It is recommended to add the absolute path of the paired FASTQ files, but relative path should work also.
Note, the delimiter is the tab (\t) character:

```
G15511 XX 0 C09DFN C09DF_1 pathToFiles/C09DFACXX111207.1_1.fastq.gz pathToFiles/C09DFACXX111207.1_2.fastq.gz
Expand All @@ -29,23 +33,21 @@ G15511 XX 1 D0ENMT D0ENM_2 pathToFiles/D0ENMACXX111207.2_1.fastq.

# Example TSV file for a normal/tumor pair with BAM files

On the other hand, if you have BAMs (T/N pairs that were not realigned together) and their indexes, you should use a structure like:
In this sample for the normal case there are 3 read groups, and 2 for the tumor.
It is recommended to add the absolute path of BAM files, but relative path should work also.
Note, the delimiter is the tab (\t) character:

```
G15511 XX 0 C09DFN pathToFiles/G15511.C09DFN.md.real.bam pathToFiles/G15511.C09DFN.md.real.bai
G15511 XX 1 D0ENMT pathToFiles/G15511.D0ENMT.md.real.bam pathToFiles/G15511.D0ENMT.md.real.bai
```

All the files will be created in the Preprocessing/NonRealigned/ directory, and by default a corresponding TSV file will also be deposited there. Generally, getting MuTect1 and Strelka calls on the preprocessed files should be done by:

```bash
nextflow run SciLifeLab/Sarek/main.nf --sample Preprocessing/NonRealigned/mysample.tsv --step realign
nextflow run SciLifeLab/Sarek/somaticVC.nf --sample Preprocessing/Recalibrated/mysample.tsv --tools Mutect2,Strelka
G15511 XX 0 C09DFN C09DF_1 pathToFiles/C09DFAC_1.bam
G15511 XX 0 C09DFN C09DF_2 pathToFiles/C09DFAC_2.bam
G15511 XX 0 C09DFN C09DF_3 pathToFiles/C09DFAC_3.bam
G15511 XX 1 D0ENMT D0ENM_1 pathToFiles/D0ENMAC_1.bam
G15511 XX 1 D0ENMT D0ENM_2 pathToFiles/D0ENMAC_2.bam
```

# Example TSV file for a normal/tumor pair with recalibrated BAM files

The same way, if you have recalibrated BAMs (T/N pairs that were realigned together) and their indexes, you should use a structure like:
The same way, if you have recalibrated BAMs and their indexes, you should use a structure like:

```
G15511 XX 0 C09DFN pathToFiles/G15511.C09DFN.md.real.bam pathToFiles/G15511.C09DFN.md.real.bai
Expand Down
14 changes: 7 additions & 7 deletions lib/SarekUtils.groovy
Original file line number Diff line number Diff line change
Expand Up @@ -3,11 +3,6 @@ import nextflow.Channel

class SarekUtils {

// Check file extension
static def checkFileExtension(it, extension) {
if (!it.toString().toLowerCase().endsWith(extension.toLowerCase())) exit 1, "File: ${it} has the wrong extension: ${extension} see --help for more information"
}

// Check if a row has the expected number of item
static def checkNumberOfItem(row, number) {
if (row.size() != number) exit 1, "Malformed row in TSV file: ${row}, see --help for more information"
Expand Down Expand Up @@ -173,8 +168,8 @@ class SarekUtils {
def bamFile = SarekUtils.returnFile(row[4])
def baiFile = SarekUtils.returnFile(row[5])

SarekUtils.checkFileExtension(bamFile,".bam")
SarekUtils.checkFileExtension(baiFile,".bai")
if (!SarekUtils.hasExtension(bamFile,".bam")) exit 1, "File: ${bamFile} has the wrong extension. See --help for more information"
if (!SarekUtils.hasExtension(baiFile,".bai")) exit 1, "File: ${baiFile} has the wrong extension. See --help for more information"

if (mode == "germline") return [ idPatient, status, idSample, bamFile, baiFile ]
else return [ idPatient, gender, status, idSample, bamFile, baiFile ]
Expand All @@ -193,6 +188,11 @@ class SarekUtils {
[genders, channel]
}

// Check file extension
static def hasExtension(it, extension) {
it.toString().toLowerCase().endsWith(extension.toLowerCase())
}

// Compare params to list of verified params
static def isAllowedParams(params) {
def test = true
Expand Down
88 changes: 58 additions & 30 deletions main.nf
Original file line number Diff line number Diff line change
Expand Up @@ -80,20 +80,20 @@ if (!params.sample && !params.sampleDir) {
if (params.test || step != 'mapping') tsvPath = tsvPaths[step]
}

// Set up the fastqFiles and bamFiles channels. One of them remains empty
fastqFiles = Channel.empty()
// Set up the inputFiles and bamFiles channels. One of them will remain empty
inputFiles = Channel.empty()
bamFiles = Channel.empty()
if (tsvPath) {
tsvFile = file(tsvPath)
switch (step) {
case 'mapping': fastqFiles = extractFastq(tsvFile); break
case 'mapping': inputFiles = extractSample(tsvFile); break
case 'recalibrate': bamFiles = extractRecal(tsvFile); break
default: exit 1, "Unknown step ${step}"
}
} else if (params.sampleDir) {
if (step != 'mapping') exit 1, '--sampleDir does not support steps other than "mapping"'
fastqFiles = extractFastqFromDir(params.sampleDir)
(fastqFiles, fastqTmp) = fastqFiles.into(2)
inputFiles = extractFastqFromDir(params.sampleDir)
(inputFiles, fastqTmp) = inputFiles.into(2)
fastqTmp.toList().subscribe onNext: {
if (it.size() == 0) {
exit 1, "No FASTQ files found in --sampleDir directory '${params.sampleDir}'"
Expand All @@ -102,8 +102,8 @@ if (tsvPath) {
tsvFile = params.sampleDir // used in the reports
} else exit 1, 'No sample were defined, see --help'

if (step == 'mapping') (patientGenders, fastqFiles) = SarekUtils.extractGenders(fastqFiles)
else (patientGenders, bamFiles) = SarekUtils.extractGenders(bamFiles)
if (step == 'recalibrate') (patientGenders, bamFiles) = SarekUtils.extractGenders(bamFiles)
else (patientGenders, inputFiles) = SarekUtils.extractGenders(inputFiles)

/*
================================================================================
Expand All @@ -113,9 +113,9 @@ else (patientGenders, bamFiles) = SarekUtils.extractGenders(bamFiles)

startMessage()

(fastqFiles, fastqFilesforFastQC) = fastqFiles.into(2)
(inputFiles, inputFilesforFastQC) = inputFiles.into(2)

if (params.verbose) fastqFiles = fastqFiles.view {
if (params.verbose) inputFiles = inputFiles.view {
"FASTQs to preprocess:\n\
ID : ${it[0]}\tStatus: ${it[1]}\tSample: ${it[2]}\tRun : ${it[3]}\n\
Files : [${it[4].fileName}, ${it[5].fileName}]"
Expand All @@ -133,16 +133,17 @@ process RunFastQC {
publishDir "${directoryMap.fastQC}/${idRun}", mode: params.publishDirMode

input:
set idPatient, status, idSample, idRun, file(fastqFile1), file(fastqFile2) from fastqFilesforFastQC
set idPatient, status, idSample, idRun, file(inputFile1), file(inputFile2) from inputFilesforFastQC

output:
file "*_fastqc.{zip,html}" into fastQCreport

when: step == 'mapping' && !params.noReports

script:
inputFiles = SarekUtils.hasExtension(inputFile1,"fastq.gz") ? "${inputFile1} ${inputFile2}" : "${inputFile1}"
"""
fastqc -t 2 -q ${fastqFile1} ${fastqFile2}
fastqc -t 2 -q ${inputFiles}
"""
}

Expand All @@ -155,7 +156,7 @@ process MapReads {
tag {idPatient + "-" + idRun}

input:
set idPatient, status, idSample, idRun, file(fastqFile1), file(fastqFile2) from fastqFiles
set idPatient, status, idSample, idRun, file(inputFile1), file(inputFile2) from inputFiles
set file(genomeFile), file(bwaIndex) from Channel.value([referenceMap.genomeFile, referenceMap.bwaIndex])

output:
Expand All @@ -168,11 +169,31 @@ process MapReads {
readGroup = "@RG\\tID:${idRun}\\t${CN}PU:${idRun}\\tSM:${idSample}\\tLB:${idSample}\\tPL:illumina"
// adjust mismatch penalty for tumor samples
extra = status == 1 ? "-B 3" : ""
"""
bwa mem -R \"${readGroup}\" ${extra} -t ${task.cpus} -M \
${genomeFile} ${fastqFile1} ${fastqFile2} | \
samtools sort --threads ${task.cpus} -m 2G - > ${idRun}.bam
"""
if (SarekUtils.hasExtension(inputFile1,"fastq.gz"))
"""
bwa mem -R \"${readGroup}\" ${extra} -t ${task.cpus} -M \
${genomeFile} ${inputFile1} ${inputFile2} | \
samtools sort --threads ${task.cpus} -m 2G - > ${idRun}.bam
"""
else if (SarekUtils.hasExtension(inputFile1,"bam"))
// -K is an hidden option, used to fix the number of reads processed by bwa mem
// Chunk size can affect bwa results, if not specified, the number of threads can change
// which can give not deterministic result.
// cf https://github.com/CCDG/Pipeline-Standardization/blob/master/PipelineStandard.md
// and https://github.com/gatk-workflows/gatk4-data-processing/blob/8ffa26ff4580df4ac3a5aa9e272a4ff6bab44ba2/processing-for-variant-discovery-gatk4.b37.wgs.inputs.json#L29
"""
gatk --java-options -Xmx${task.memory.toGiga()}g \
SamToFastq \
--INPUT=${inputFile1} \
--FASTQ=/dev/stdout \
--INTERLEAVE=true \
--NON_PF=true \
| \
bwa mem -K 100000000 -p -R \"${readGroup}\" ${extra} -t ${task.cpus} -M ${genomeFile} \
/dev/stdin - 2> >(tee ${inputFile1}.bwa.stderr.log >&2) \
| \
samtools sort --threads ${task.cpus} -m 2G - > ${idRun}.bam
"""
}

if (params.verbose) mappedBam = mappedBam.view {
Expand Down Expand Up @@ -343,7 +364,7 @@ process CreateRecalibrationTable {
set idPatient, status, idSample, file("${idSample}.recal.table") into recalibrationTable
set idPatient, status, idSample, val("${idSample}_${status}.md.bam"), val("${idSample}_${status}.md.bai"), val("${idSample}.recal.table") into recalibrationTableTSV

when: ( step == 'mapping' ) && !params.onlyQC
when: step == 'mapping' && !params.onlyQC

script:
known = knownIndels.collect{ "--known-sites ${it}" }.join(' ')
Expand Down Expand Up @@ -533,25 +554,32 @@ def defineStepList() {
]
}

def extractFastq(tsvFile) {
// Channeling the TSV file containing FASTQ.
def extractSample(tsvFile) {
// Channeling the TSV file containing FASTQ or BAM
// Format is: "subject gender status sample lane fastq1 fastq2"
// or: "subject gender status sample lane bam"
Channel.from(tsvFile)
.splitCsv(sep: '\t')
.map { row ->
SarekUtils.checkNumberOfItem(row, 7)
def idPatient = row[0]
def gender = row[1]
def status = SarekUtils.returnStatus(row[2].toInteger())
def idSample = row[3]
def idRun = row[4]
def fastqFile1 = SarekUtils.returnFile(row[5])
def fastqFile2 = SarekUtils.returnFile(row[6])

SarekUtils.checkFileExtension(fastqFile1,".fastq.gz")
SarekUtils.checkFileExtension(fastqFile2,".fastq.gz")
def file1 = SarekUtils.returnFile(row[5])
def file2 = file("null")
if (file1.toString().toLowerCase().endsWith(".fastq.gz")) {
SarekUtils.checkNumberOfItem(row, 7)
file2 = SarekUtils.returnFile(row[6])
if (!SarekUtils.hasExtension(file2,"fastq.gz")) exit 1, "File: ${file2} has the wrong extension. See --help for more information"
}
else if (file1.toString().toLowerCase().endsWith(".bam")) {
SarekUtils.checkNumberOfItem(row, 6)
if (!SarekUtils.hasExtension(file1,"bam")) exit 1, "File: ${file1} has the wrong extension. See --help for more information"
}
else "No recognisable extention for input file: ${file1}"

[idPatient, gender, status, idSample, idRun, fastqFile1, fastqFile2]
[idPatient, gender, status, idSample, idRun, file1, file2]
}
}

Expand Down Expand Up @@ -603,9 +631,9 @@ def extractRecal(tsvFile) {
def baiFile = SarekUtils.returnFile(row[5])
def recalTable = SarekUtils.returnFile(row[6])

SarekUtils.checkFileExtension(bamFile,".bam")
SarekUtils.checkFileExtension(baiFile,".bai")
SarekUtils.checkFileExtension(recalTable,".recal.table")
if (!SarekUtils.hasExtension(bamFile,"bam")) exit 1, "File: ${bamFile} has the wrong extension. See --help for more information"
if (!SarekUtils.hasExtension(baiFile,"bai")) exit 1, "File: ${baiFile} has the wrong extension. See --help for more information"
if (!SarekUtils.hasExtension(recalTable,"recal.table")) exit 1, "File: ${recalTable} has the wrong extension. See --help for more information"

[ idPatient, gender, status, idSample, bamFile, baiFile, recalTable ]
}
Expand Down

0 comments on commit e331795

Please sign in to comment.