Merge pull request #717 from MaxUlysse/remapping

Remapping
SciLifeLab · Feb 20, 2019 · e331795 · e331795
2 parents 7fdab77 + 5af588e
commit e331795
Show file tree

Hide file tree

Showing 4 changed files with 89 additions and 53 deletions.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -8,15 +8,21 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.
 ## [Unreleased]
 
 ### `Added`
+
 -   [#712](https://github.com/SciLifeLab/Sarek/pull/712), [#718](https://github.com/SciLifeLab/Sarek/pull/718) - Added possibilities to run Sarek with `conda`
 
 ### `Changed`
 
 -   [#710](https://github.com/SciLifeLab/Sarek/pull/710) - Improve release checklist and script
 -   [#711](https://github.com/SciLifeLab/Sarek/pull/711) - Improve configuration priorities
--   [#719](https://github.com/SciLifeLab/Sarek/pull/719) - `vepCacheVersion` is now defined in `conf/genomes.config` or `conf/igenomes.config`
--   [#719](https://github.com/SciLifeLab/Sarek/pull/719) - `snpeff` and `vep` containers are now built with conda
 -   [#716](https://github.com/SciLifeLab/Sarek/pull/716) - Update paths to containers and iGenomes
+-   [#717](https://github.com/SciLifeLab/Sarek/pull/717) - `mapping` step can now map BAM files too
+-   [#717](https://github.com/SciLifeLab/Sarek/pull/717) - `fastqFiles` renamed to `inputFiles`
+-   [#717](https://github.com/SciLifeLab/Sarek/pull/717) - `MapReads` can now convert BAM to FASTQ and feed it to BWA on the fly
+-   [#717](https://github.com/SciLifeLab/Sarek/pull/717) - `checkFileExtension` has changed to `hasExtension`, and now only verify if file has extension
+-   [#717](https://github.com/SciLifeLab/Sarek/pull/717) - Update documentation
+-   [#719](https://github.com/SciLifeLab/Sarek/pull/719) - `snpeff` and `vep` containers are now built with conda
+-   [#719](https://github.com/SciLifeLab/Sarek/pull/719) - `vepCacheVersion` is now defined in `conf/genomes.config` or `conf/igenomes.config`
 -   [#722](https://github.com/SciLifeLab/Sarek/pull/722) - Update `Sarek-data` submodule
 -   [#723](https://github.com/SciLifeLab/Sarek/pull/723), [#725](https://github.com/SciLifeLab/Sarek/pull/725) - Update docs
 -   [#724](https://github.com/SciLifeLab/Sarek/pull/724) - Improved AwsBatch configuration

diff --git a/docs/INPUT.md b/docs/INPUT.md
@@ -1,6 +1,7 @@
 # TSV file for sample
 
-Input files for Sarek can be specified using a tsv file given to the `--sample` parameter. The tsv file is a Tab Separated Value file with columns: `subject gender status sample lane fastq1 fastq2` or `subject gender status sample bam bai`.
+Input files for Sarek can be specified using a TSV file given to the `--sample` parameter.
+The TSV file is a Tab Separated Value file with columns: `subject gender status sample lane fastq1 fastq2`, `subject gender status sample lane bam` or `subject gender status sample bam bai`.
 The content of these columns should be quite straight-forward:
 
 - `subject` designate the subject, it should be the ID of the Patient, or if you don't have one, it could be the Normal ID Sample.
@@ -13,11 +14,14 @@ The content of these columns should be quite straight-forward:
 - `bam` is the bam file
 - `bai` is the index
 
-All examples are given for a normal/tumor pair. If no tumors are listed in the TSV file, then the workflow will proceed as if it was a single normal sample instead of a normal/tumor pair.
+All examples are given for a normal/tumor pair.
+If no tumors are listed in the TSV file, then the workflow will proceed as if it is a single normal sample instead of a normal/tumor pair.
 
 # Example TSV file for a normal/tumor pair with FASTQ files
 
-In this sample for the normal case there are 3 read groups, and 2 for the tumor. It is recommended to add the absolute path of the paired FASTQ files, but relative path should work also. Note, the delimiter is the tab (\t) character:
+In this sample for the normal case there are 3 read groups, and 2 for the tumor.
+It is recommended to add the absolute path of the paired FASTQ files, but relative path should work also.
+Note, the delimiter is the tab (\t) character:
 
 ```
 G15511    XX    0    C09DFN    C09DF_1    pathToFiles/C09DFACXX111207.1_1.fastq.gz    pathToFiles/C09DFACXX111207.1_2.fastq.gz
@@ -29,23 +33,21 @@ G15511    XX    1    D0ENMT    D0ENM_2    pathToFiles/D0ENMACXX111207.2_1.fastq.
 
 # Example TSV file for a normal/tumor pair with BAM files
 
-On the other hand, if you have BAMs (T/N pairs that were not realigned together) and their indexes, you should use a structure like:
+In this sample for the normal case there are 3 read groups, and 2 for the tumor.
+It is recommended to add the absolute path of BAM files, but relative path should work also.
+Note, the delimiter is the tab (\t) character:
 
 ```
-G15511    XX    0    C09DFN    pathToFiles/G15511.C09DFN.md.real.bam    pathToFiles/G15511.C09DFN.md.real.bai
-G15511    XX    1    D0ENMT    pathToFiles/G15511.D0ENMT.md.real.bam pathToFiles/G15511.D0ENMT.md.real.bai
-```
-
-All the files will be created in the Preprocessing/NonRealigned/ directory, and by default a corresponding TSV file will also be deposited there. Generally, getting MuTect1 and Strelka calls on the preprocessed files should be done by:
-
-```bash
-nextflow run SciLifeLab/Sarek/main.nf --sample Preprocessing/NonRealigned/mysample.tsv --step realign
-nextflow run SciLifeLab/Sarek/somaticVC.nf --sample Preprocessing/Recalibrated/mysample.tsv --tools Mutect2,Strelka
+G15511    XX    0    C09DFN    C09DF_1    pathToFiles/C09DFAC_1.bam
+G15511    XX    0    C09DFN    C09DF_2    pathToFiles/C09DFAC_2.bam
+G15511    XX    0    C09DFN    C09DF_3    pathToFiles/C09DFAC_3.bam
+G15511    XX    1    D0ENMT    D0ENM_1    pathToFiles/D0ENMAC_1.bam
+G15511    XX    1    D0ENMT    D0ENM_2    pathToFiles/D0ENMAC_2.bam
 ```
 
 # Example TSV file for a normal/tumor pair with recalibrated BAM files
 
-The same way, if you have recalibrated BAMs (T/N pairs that were realigned together) and their indexes, you should use a structure like:
+The same way, if you have recalibrated BAMs and their indexes, you should use a structure like:
 
 ```
 G15511    XX    0    C09DFN    pathToFiles/G15511.C09DFN.md.real.bam    pathToFiles/G15511.C09DFN.md.real.bai

diff --git a/lib/SarekUtils.groovy b/lib/SarekUtils.groovy
@@ -3,11 +3,6 @@ import nextflow.Channel
 
 class SarekUtils {
 
-  // Check file extension
-  static def checkFileExtension(it, extension) {
-    if (!it.toString().toLowerCase().endsWith(extension.toLowerCase())) exit 1, "File: ${it} has the wrong extension: ${extension} see --help for more information"
-  }
-
   // Check if a row has the expected number of item
   static def checkNumberOfItem(row, number) {
     if (row.size() != number) exit 1, "Malformed row in TSV file: ${row}, see --help for more information"
@@ -173,8 +168,8 @@ class SarekUtils {
         def bamFile   = SarekUtils.returnFile(row[4])
         def baiFile   = SarekUtils.returnFile(row[5])
 
-        SarekUtils.checkFileExtension(bamFile,".bam")
-        SarekUtils.checkFileExtension(baiFile,".bai")
+        if (!SarekUtils.hasExtension(bamFile,".bam")) exit 1, "File: ${bamFile} has the wrong extension. See --help for more information"
+        if (!SarekUtils.hasExtension(baiFile,".bai")) exit 1, "File: ${baiFile} has the wrong extension. See --help for more information"
 
         if (mode == "germline") return [ idPatient, status, idSample, bamFile, baiFile ]
         else return [ idPatient, gender, status, idSample, bamFile, baiFile ]
@@ -193,6 +188,11 @@ class SarekUtils {
     [genders, channel]
   }
 
+  // Check file extension
+  static def hasExtension(it, extension) {
+    it.toString().toLowerCase().endsWith(extension.toLowerCase())
+  }
+
   // Compare params to list of verified params
   static def isAllowedParams(params) {
     def test = true

diff --git a/main.nf b/main.nf
@@ -80,20 +80,20 @@ if (!params.sample && !params.sampleDir) {
   if (params.test || step != 'mapping') tsvPath = tsvPaths[step]
 }
 
-// Set up the fastqFiles and bamFiles channels. One of them remains empty
-fastqFiles = Channel.empty()
+// Set up the inputFiles and bamFiles channels. One of them will remain empty
+inputFiles = Channel.empty()
 bamFiles = Channel.empty()
 if (tsvPath) {
   tsvFile = file(tsvPath)
   switch (step) {
-    case 'mapping': fastqFiles = extractFastq(tsvFile); break
+    case 'mapping': inputFiles = extractSample(tsvFile); break
     case 'recalibrate': bamFiles = extractRecal(tsvFile); break
     default: exit 1, "Unknown step ${step}"
   }
 } else if (params.sampleDir) {
   if (step != 'mapping') exit 1, '--sampleDir does not support steps other than "mapping"'
-  fastqFiles = extractFastqFromDir(params.sampleDir)
-  (fastqFiles, fastqTmp) = fastqFiles.into(2)
+  inputFiles = extractFastqFromDir(params.sampleDir)
+  (inputFiles, fastqTmp) = inputFiles.into(2)
   fastqTmp.toList().subscribe onNext: {
     if (it.size() == 0) {
       exit 1, "No FASTQ files found in --sampleDir directory '${params.sampleDir}'"
@@ -102,8 +102,8 @@ if (tsvPath) {
   tsvFile = params.sampleDir  // used in the reports
 } else exit 1, 'No sample were defined, see --help'
 
-if (step == 'mapping') (patientGenders, fastqFiles) = SarekUtils.extractGenders(fastqFiles)
-else (patientGenders, bamFiles) = SarekUtils.extractGenders(bamFiles)
+if (step == 'recalibrate') (patientGenders, bamFiles) = SarekUtils.extractGenders(bamFiles)
+else (patientGenders, inputFiles) = SarekUtils.extractGenders(inputFiles)
 
 /*
 ================================================================================
@@ -113,9 +113,9 @@ else (patientGenders, bamFiles) = SarekUtils.extractGenders(bamFiles)
 
 startMessage()
 
-(fastqFiles, fastqFilesforFastQC) = fastqFiles.into(2)
+(inputFiles, inputFilesforFastQC) = inputFiles.into(2)
 
-if (params.verbose) fastqFiles = fastqFiles.view {
+if (params.verbose) inputFiles = inputFiles.view {
   "FASTQs to preprocess:\n\
   ID    : ${it[0]}\tStatus: ${it[1]}\tSample: ${it[2]}\tRun   : ${it[3]}\n\
   Files : [${it[4].fileName}, ${it[5].fileName}]"
@@ -133,16 +133,17 @@ process RunFastQC {
   publishDir "${directoryMap.fastQC}/${idRun}", mode: params.publishDirMode
 
   input:
-    set idPatient, status, idSample, idRun, file(fastqFile1), file(fastqFile2) from fastqFilesforFastQC
+    set idPatient, status, idSample, idRun, file(inputFile1), file(inputFile2) from inputFilesforFastQC
 
   output:
     file "*_fastqc.{zip,html}" into fastQCreport
 
   when: step == 'mapping' && !params.noReports
 
   script:
+  inputFiles = SarekUtils.hasExtension(inputFile1,"fastq.gz") ? "${inputFile1} ${inputFile2}" : "${inputFile1}"
   """
-  fastqc -t 2 -q ${fastqFile1} ${fastqFile2}
+  fastqc -t 2 -q ${inputFiles}
   """
 }
 
@@ -155,7 +156,7 @@ process MapReads {
   tag {idPatient + "-" + idRun}
 
   input:
-    set idPatient, status, idSample, idRun, file(fastqFile1), file(fastqFile2) from fastqFiles
+    set idPatient, status, idSample, idRun, file(inputFile1), file(inputFile2) from inputFiles
     set file(genomeFile), file(bwaIndex) from Channel.value([referenceMap.genomeFile, referenceMap.bwaIndex])
 
   output:
@@ -168,11 +169,31 @@ process MapReads {
   readGroup = "@RG\\tID:${idRun}\\t${CN}PU:${idRun}\\tSM:${idSample}\\tLB:${idSample}\\tPL:illumina"
   // adjust mismatch penalty for tumor samples
   extra = status == 1 ? "-B 3" : ""
-  """
-  bwa mem -R \"${readGroup}\" ${extra} -t ${task.cpus} -M \
-  ${genomeFile} ${fastqFile1} ${fastqFile2} | \
-  samtools sort --threads ${task.cpus} -m 2G - > ${idRun}.bam
-  """
+  if (SarekUtils.hasExtension(inputFile1,"fastq.gz"))
+    """
+    bwa mem -R \"${readGroup}\" ${extra} -t ${task.cpus} -M \
+    ${genomeFile} ${inputFile1} ${inputFile2} | \
+    samtools sort --threads ${task.cpus} -m 2G - > ${idRun}.bam
+    """
+  else if (SarekUtils.hasExtension(inputFile1,"bam"))
+  // -K is an hidden option, used to fix the number of reads processed by bwa mem
+  // Chunk size can affect bwa results, if not specified, the number of threads can change
+  // which can give not deterministic result.
+  // cf https://github.com/CCDG/Pipeline-Standardization/blob/master/PipelineStandard.md
+  // and https://github.com/gatk-workflows/gatk4-data-processing/blob/8ffa26ff4580df4ac3a5aa9e272a4ff6bab44ba2/processing-for-variant-discovery-gatk4.b37.wgs.inputs.json#L29
+    """
+    gatk --java-options -Xmx${task.memory.toGiga()}g \
+    SamToFastq \
+    --INPUT=${inputFile1} \
+    --FASTQ=/dev/stdout \
+    --INTERLEAVE=true \
+    --NON_PF=true \
+    | \
+    bwa mem -K 100000000 -p -R \"${readGroup}\" ${extra} -t ${task.cpus} -M ${genomeFile} \
+    /dev/stdin - 2> >(tee ${inputFile1}.bwa.stderr.log >&2) \
+    | \
+    samtools sort --threads ${task.cpus} -m 2G - > ${idRun}.bam
+    """
 }
 
 if (params.verbose) mappedBam = mappedBam.view {
@@ -343,7 +364,7 @@ process CreateRecalibrationTable {
     set idPatient, status, idSample, file("${idSample}.recal.table") into recalibrationTable
     set idPatient, status, idSample, val("${idSample}_${status}.md.bam"), val("${idSample}_${status}.md.bai"), val("${idSample}.recal.table") into recalibrationTableTSV
 
-  when: ( step == 'mapping' ) && !params.onlyQC
+  when: step == 'mapping' && !params.onlyQC
 
   script:
   known = knownIndels.collect{ "--known-sites ${it}" }.join(' ')
@@ -533,25 +554,32 @@ def defineStepList() {
   ]
 }
 
-def extractFastq(tsvFile) {
-  // Channeling the TSV file containing FASTQ.
+def extractSample(tsvFile) {
+  // Channeling the TSV file containing FASTQ or BAM
   // Format is: "subject gender status sample lane fastq1 fastq2"
+  // or: "subject gender status sample lane bam"
   Channel.from(tsvFile)
   .splitCsv(sep: '\t')
   .map { row ->
-    SarekUtils.checkNumberOfItem(row, 7)
     def idPatient  = row[0]
     def gender     = row[1]
     def status     = SarekUtils.returnStatus(row[2].toInteger())
     def idSample   = row[3]
     def idRun      = row[4]
-    def fastqFile1 = SarekUtils.returnFile(row[5])
-    def fastqFile2 = SarekUtils.returnFile(row[6])
-
-    SarekUtils.checkFileExtension(fastqFile1,".fastq.gz")
-    SarekUtils.checkFileExtension(fastqFile2,".fastq.gz")
+    def file1      = SarekUtils.returnFile(row[5])
+    def file2      = file("null")
+    if (file1.toString().toLowerCase().endsWith(".fastq.gz")) {
+      SarekUtils.checkNumberOfItem(row, 7)
+      file2 = SarekUtils.returnFile(row[6])
+      if (!SarekUtils.hasExtension(file2,"fastq.gz")) exit 1, "File: ${file2} has the wrong extension. See --help for more information"
+    }
+    else if (file1.toString().toLowerCase().endsWith(".bam")) {
+      SarekUtils.checkNumberOfItem(row, 6)
+      if (!SarekUtils.hasExtension(file1,"bam")) exit 1, "File: ${file1} has the wrong extension. See --help for more information"
+    }
+    else "No recognisable extention for input file: ${file1}"
 
-    [idPatient, gender, status, idSample, idRun, fastqFile1, fastqFile2]
+    [idPatient, gender, status, idSample, idRun, file1, file2]
   }
 }
 
@@ -603,9 +631,9 @@ def extractRecal(tsvFile) {
     def baiFile    = SarekUtils.returnFile(row[5])
     def recalTable = SarekUtils.returnFile(row[6])
 
-    SarekUtils.checkFileExtension(bamFile,".bam")
-    SarekUtils.checkFileExtension(baiFile,".bai")
-    SarekUtils.checkFileExtension(recalTable,".recal.table")
+    if (!SarekUtils.hasExtension(bamFile,"bam")) exit 1, "File: ${bamFile} has the wrong extension. See --help for more information"
+    if (!SarekUtils.hasExtension(baiFile,"bai")) exit 1, "File: ${baiFile} has the wrong extension. See --help for more information"
+    if (!SarekUtils.hasExtension(recalTable,"recal.table")) exit 1, "File: ${recalTable} has the wrong extension. See --help for more information"
 
     [ idPatient, gender, status, idSample, bamFile, baiFile, recalTable ]
   }