Skip to content
This repository has been archived by the owner on Jan 27, 2020. It is now read-only.

To process targeted sequencing with a target BED #635

Merged
merged 36 commits into from
Sep 21, 2018
Merged
Show file tree
Hide file tree
Changes from 30 commits
Commits
Show all changes
36 commits
Select commit Hold shift + click to select a range
a7e54aa
removing spurious VEP directory
Aug 16, 2018
e89d52d
Merge remote-tracking branch 'upstream/master'
Sep 3, 2018
ae126ae
Strelka targeted is working fine
Sep 4, 2018
5082a7a
changes to ConcatVCF to accomodate bcftools isec in germline targets
Sep 6, 2018
f980e11
concatenateVCF.sh is now a separate script to avoid code duplication
Sep 6, 2018
df4d8d1
added concatOptions and Strelka fix
Sep 6, 2018
60d15cc
added getopts, fixed existence check and set +u
Sep 6, 2018
9ca0c93
Somatic ConcatVCF also simplified
Sep 6, 2018
f09345e
Merge remote-tracking branch 'upstream/master'
Sep 6, 2018
fa8aec8
updated documentation
Sep 6, 2018
4bd45cf
\n or \n that is the question
Sep 10, 2018
738c137
exclamation misplaced
Sep 10, 2018
2f69352
killing me softly with a VEP line
Sep 10, 2018
210d050
Merge branch 'dev' into master
Sep 10, 2018
1331a79
Merge remote-tracking branch 'upstream/dev'
Sep 10, 2018
fba72df
concatVCF.sh moved to bin
Sep 11, 2018
ce331ea
Added --cpus directive
Sep 11, 2018
b0a0da2
Merge branch 'master' of github.com:szilvajuhos/Sarek
Sep 11, 2018
70b40d2
Sarek-data updated
Sep 11, 2018
43ea430
putting concatenateVCF.sh to bin
Sep 12, 2018
d237b8f
adding targetBED to tests and wrapper
Sep 12, 2018
5d5407a
temporary fix for vepgrch37 container path problem
Sep 12, 2018
def35d8
added target report at the end, targetBED=false in base.config
Sep 12, 2018
213b07e
resolved merge conflict
Sep 12, 2018
39fdfd4
falling back to tiny.tsv
Sep 12, 2018
72c0c2c
simplified tests without singularity
Sep 13, 2018
a227630
Merge branch 'dev' into master
Sep 13, 2018
be3cc7f
typo fix
Sep 13, 2018
ae0add8
Merge branch 'master' of github.com:szilvajuhos/Sarek
Sep 13, 2018
813ba52
CHANGELOG changes changed
Sep 13, 2018
8204c7a
even less somatic test
Sep 13, 2018
aa056c9
fixed spacing
Sep 13, 2018
d8c35d5
Fixing fixed fixes in CHANGELOG
Sep 13, 2018
fef3c1f
Zenodo REST API to upload data
Sep 14, 2018
44c7ffa
Conflict resolve
Sep 21, 2018
60da87c
Zenodo tests
Sep 21, 2018
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 1 addition & 3 deletions .travis.yml
Original file line number Diff line number Diff line change
Expand Up @@ -11,14 +11,12 @@ env:
global:
- NXF_VER=0.31.0 SGT_VER=2.5.1
matrix:
- CE=singularity TEST=SOMATIC
- CE=docker TEST=SOMATIC
- CE=docker TEST=ANNOTATEVEP
- CE=singularity TEST=ANNOTATESNPEFF
- CE=docker TEST=ANNOTATESNPEFF
- CE=singularity TEST=GERMLINE
- CE=docker TEST=GERMLINE


maxulysse marked this conversation as resolved.
Show resolved Hide resolved
install:
# Install Nextflow (and Singularity if needed)
- "./scripts/install.sh --engine $CE"
Expand Down
7 changes: 4 additions & 3 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -12,10 +12,11 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.
- [#613](https://github.com/SciLifeLab/Sarek/pull/613) - Add Issue Templates (bug report and feature request)
- [#614](https://github.com/SciLifeLab/Sarek/pull/614) - Add PR Template
- [#615](https://github.com/SciLifeLab/Sarek/pull/615) - Add presentation
- [#615](https://github.com/SciLifeLab/Sarek/pull/615) - Update documentation
- [#616](https://github.com/SciLifeLab/Sarek/pull/616) - Update documentation
- [#620](https://github.com/SciLifeLab/Sarek/pull/620) - Add `tmp/` to `.gitignore`
- [#625](https://github.com/SciLifeLab/Sarek/pull/625) - Add [`pathfindr`](https://github.com/NBISweden/pathfindr) as a submodule
- [#639](https://github.com/SciLifeLab/Sarek/pull/639) - Add a complete example analysis to docs
- [#629](https://github.com/SciLifeLab/Sarek/pull/629) - Add a complete example analysis to docs
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually this one really was the #639

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, we should keep the PR names in the description lines, even if they are long. Now I am confused since both #629 and #639 are docs-related PRs, I guess it is #639 here since that is "Added" not only updated, is it right?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It has not to be the PR name.
It could be a shorter description, since a PR can fall into Added, Changed, Removedor evenFixedcategory at the same time. For meAddedis anything new andChanged` is anything updated

- [#635](https://github.com/SciLifeLab/Sarek/pull/635) - To process targeted sequencing with a target BED
- [#640](https://github.com/SciLifeLab/Sarek/pull/640) - Add helper script for changing version number

### `Changed`
Expand All @@ -31,7 +32,7 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.
- [#637](https://github.com/SciLifeLab/Sarek/pull/637) - Update tool version gathering
- [#638](https://github.com/SciLifeLab/Sarek/pull/638) - Use correct `.simg` extension for Singularity images
- [#639](https://github.com/SciLifeLab/Sarek/pull/639) - Smaller refactoring of the docs
- [#640](https://github.com/SciLifeLab/Sarek/pull/640) - Update RELEASE_CHECKLIST
- [#640](https://github.com/SciLifeLab/Sarek/pull/640) - Update RELEASE\_CHECKLIST

### `Removed`

Expand Down
2 changes: 1 addition & 1 deletion README.md
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
# [![Sarek](https://raw.githubusercontent.com/SciLifeLab/Sarek/master/docs/images/Sarek_logo.png "Sarek")](http://sarek.scilifelab.se/)

#### An open-source analysis pipeline to detect germline or somatic variants from whole genome sequencing
#### An open-source analysis pipeline to detect germline or somatic variants from whole genome or targeted sequencing
maxulysse marked this conversation as resolved.
Show resolved Hide resolved

[![Nextflow version][nextflow-badge]][nextflow-link]
[![Travis build status][travis-badge]][travis-link]
Expand Down
2 changes: 1 addition & 1 deletion Sarek-data
Submodule Sarek-data updated 2 files
+7 −18 README.md
+2 −0 testdata/target.bed
2 changes: 1 addition & 1 deletion annotate.nf
Original file line number Diff line number Diff line change
Expand Up @@ -216,7 +216,7 @@ process RunVEP {
finalannotator = annotator == "snpeff" ? 'merge' : 'vep'
genome = params.genome == 'smallGRCh37' ? 'GRCh37' : params.genome
"""
vep --dir /opt/vep/.vep/ \
/opt/vep/src/ensembl-vep/vep --dir /opt/vep/.vep/ \
-i ${vcf} \
-o ${vcf.simpleName}_VEP.ann.vcf \
--assembly ${genome} \
Expand Down
85 changes: 85 additions & 0 deletions bin/concatenateVCFs.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,85 @@
#!/usr/bin/env bash
# this script concatenates all VCFs that are in the local directory: the
# purpose is to make a single VCF from all the VCFs that were created from different intervals

usage() { echo "Usage: $0 [-i genome_index_file] [-o output.file.no.gz.extension] <-t target.bed> <-c cpus>" 1>&2; exit 1; }

while getopts "i:c:o:t:" p; do
case "${p}" in
i)
genomeIndex=${OPTARG}
;;
c)
cpus=${OPTARG}
;;
o)
outputFile=${OPTARG}
;;
t)
targetBED=${OPTARG}
;;
*)
usage
;;
esac
done
shift $((OPTIND-1))

if [ -z ${genomeIndex} ]; then echo "Missing index file "; usage; fi
if [ -z ${cpus} ]; then echo "No CPUs defined: setting to 1"; cpus=1; fi
if [ -z ${outputFile} ]; then echo "Missing output file name"; usage; fi

set -euo pipefail

# first make a header from one of the VCF intervals
# get rid of interval information only from the GATK command-line, but leave the rest
FIRSTVCF=$(ls *.vcf | head -n 1)
sed -n '/^[^#]/q;p' $FIRSTVCF | \
awk '!/GATKCommandLine/{print}/GATKCommandLine/{for(i=1;i<=NF;i++){if($i!~/intervals=/ && $i !~ /out=/){printf("%s ",$i)}}printf("\n")}' \
> header

# Get list of contigs from the FASTA index (.fai). We cannot use the ##contig
# header in the VCF as it is optional (FreeBayes does not save it, for example)
CONTIGS=($(cut -f1 ${genomeIndex}))

# concatenate VCFs in the correct order
(
cat header

for chr in "${CONTIGS[@]}"; do
# Skip if globbing would not match any file to avoid errors such as
# "ls: cannot access chr3_*.vcf: No such file or directory" when chr3
# was not processed.
pattern="${chr}_*.vcf"
if ! compgen -G "${pattern}" > /dev/null; then continue; fi

# ls -v sorts by numeric value ("version"), which means that chr1_100_
# is sorted *after* chr1_99_.
for vcf in $(ls -v ${pattern}); do
# Determine length of header.
# The 'q' command makes sed exit when it sees the first non-header
# line, which avoids reading in the entire file.
L=$(sed -n '/^[^#]/q;p' ${vcf} | wc -l)

# Then print all non-header lines. Since tail is very fast (nearly as
# fast as cat), this is way more efficient than using a single sed,
# awk or grep command.
tail -n +$((L+1)) ${vcf}
done
done
) | bgzip -@${cpus} > rawcalls.vcf.gz
tabix rawcalls.vcf.gz

set +u

# now we have the concatenated VCF file, check for WES/panel targets, and generate a subset if there is a BED provided
echo "target is $targetBED"
if [ ! -z ${targetBED+x} ]; then
echo "Selecting subset..."
bcftools isec --targets-file ${targetBED} rawcalls.vcf.gz | bgzip -@${cpus} > ${outputFile}.gz
tabix ${outputFile}.gz
else
# simply rename the raw calls as WGS results
for f in rawcalls*; do mv -v $f ${outputFile}${f#rawcalls.vcf}; done
fi

1 change: 1 addition & 0 deletions conf/base.config
Original file line number Diff line number Diff line change
Expand Up @@ -33,6 +33,7 @@ params {
step = 'mapping' // Default step is mapping
strelkaBP = false // Don't use Manta's candidate indels as input to Strelka
tag = 'latest' // Default tag is latest, to be overwritten by --tag <version>
targetBED = false // no targets by default
test = false // Not testing by default
verbose = false // Enable for more verbose information
version = '2.1.0' // Workflow version
Expand Down
11 changes: 10 additions & 1 deletion docs/USE_CASES.md
Original file line number Diff line number Diff line change
Expand Up @@ -164,7 +164,16 @@ SUBJECT_ID XX 0 SAMPLEIDN /samples/SAMPLEIDN.bam /samples/SAMPLEIDN
SUBJECT_ID XX 1 SAMPLEIDT /samples/SAMPLEIDT.bam /samples/SAMPLEIDT.bai
SUBJECT_ID XX 1 SAMPLEIDR /samples/SAMPLEIDR.bam /samples/SAMPLEIDR.bai
```

If you want to restart a previous run of the pipeline, you may not have a recalibrated BAM file.
This is the case if HaplotypeCaller was the only tool (recalibration is done on-the-fly with HaplotypeCaller to improve performance and save space).
In this case, you need to start with `--step=recalibrate` (see previous section).

## Processing targeted (whole exome or panel) sequencing data

The recommended flow for thrgeted sequencing data is to use the whole genome workflow as it is, but also provide a BED file containing targets for variant calling.
The Strelka part of the workflow will pick up these intervals, and activate the `--exome` flag to process deeper coverage. It is adviced to pad the variant calling
regions (exons or the target) to some extent before submitting to the workflow. To add the target BED file configure the flow like:

```bash
nextflow run SciLifeLab/Sarek/germlineVC.nf --tools haplotypecaller,strelka,mutect2 --targetBED targets.bed --sample my_panel.tsv
```
96 changes: 37 additions & 59 deletions germlineVC.nf
Original file line number Diff line number Diff line change
Expand Up @@ -363,7 +363,8 @@ process ConcatVCF {
file(genomeIndex) from Channel.value(referenceMap.genomeIndex)

output:
set variantCaller, idPatient, idSampleNormal, idSampleTumor, file("*.vcf.gz"), file("*.vcf.gz.tbi") into vcfConcatenated
// we have this funny *_* pattern to avoid copying the raw calls to publishdir
set variantCaller, idPatient, idSampleNormal, idSampleTumor, file("*_*.vcf.gz"), file("*_*.vcf.gz.tbi") into vcfConcatenated


when: ( 'haplotypecaller' in tools || 'mutect2' in tools || 'freebayes' in tools ) && !params.onlyQC
Expand All @@ -373,47 +374,14 @@ process ConcatVCF {
else if (variantCaller == 'gvcf-hc') outputFile = "haplotypecaller_${idSampleNormal}.g.vcf"
else outputFile = "${variantCaller}_${idSampleTumor}_vs_${idSampleNormal}.vcf"

"""
set -euo pipefail
# first make a header from one of the VCF intervals
# get rid of interval information only from the GATK command-line, but leave the rest
FIRSTVCF=\$(ls *.vcf | head -n 1)
sed -n '/^[^#]/q;p' \$FIRSTVCF | \
awk '!/GATKCommandLine/{print}/GATKCommandLine/{for(i=1;i<=NF;i++){if(\$i!~/intervals=/ && \$i !~ /out=/){printf("%s ",\$i)}}printf("\\n")}' \
> header

# Get list of contigs from the FASTA index (.fai). We cannot use the ##contig
# header in the VCF as it is optional (FreeBayes does not save it, for example)
CONTIGS=(\$(cut -f1 ${genomeIndex}))

# concatenate VCFs in the correct order
(
cat header

for chr in "\${CONTIGS[@]}"; do
# Skip if globbing would not match any file to avoid errors such as
# "ls: cannot access chr3_*.vcf: No such file or directory" when chr3
# was not processed.
pattern="\${chr}_*.vcf"
if ! compgen -G "\${pattern}" > /dev/null; then continue; fi

# ls -v sorts by numeric value ("version"), which means that chr1_100_
# is sorted *after* chr1_99_.
for vcf in \$(ls -v \${pattern}); do
# Determine length of header.
# The 'q' command makes sed exit when it sees the first non-header
# line, which avoids reading in the entire file.
L=\$(sed -n '/^[^#]/q;p' \${vcf} | wc -l)

# Then print all non-header lines. Since tail is very fast (nearly as
# fast as cat), this is way more efficient than using a single sed,
# awk or grep command.
tail -n +\$((L+1)) \${vcf}
done
done
) | bgzip > ${outputFile}.gz
tabix ${outputFile}.gz
"""
if(params.targetBED) // targeted
concatOptions = "-i ${genomeIndex} -c ${task.cpus} -o ${outputFile} -t ${params.targetBED}"
else // WGS
concatOptions = "-i ${genomeIndex} -c ${task.cpus} -o ${outputFile} "

"""
concatenateVCFs.sh ${concatOptions}
"""
}

if (params.verbose) vcfConcatenated = vcfConcatenated.view {
Expand Down Expand Up @@ -441,23 +409,32 @@ process RunSingleStrelka {
when: 'strelka' in tools && !params.onlyQC

script:
"""
configureStrelkaGermlineWorkflow.py \
--bam ${bam} \
--referenceFasta ${genomeFile} \
--runDir Strelka

python Strelka/runWorkflow.py -m local -j ${task.cpus}

mv Strelka/results/variants/genome.*.vcf.gz \
Strelka_${idSample}_genome.vcf.gz
mv Strelka/results/variants/genome.*.vcf.gz.tbi \
Strelka_${idSample}_genome.vcf.gz.tbi
mv Strelka/results/variants/variants.vcf.gz \
Strelka_${idSample}_variants.vcf.gz
mv Strelka/results/variants/variants.vcf.gz.tbi \
Strelka_${idSample}_variants.vcf.gz.tbi
"""
"""
if [ ! -s "${params.targetBED}" ]; then
# do WGS
configureStrelkaGermlineWorkflow.py \
--bam ${bam} \
--referenceFasta ${genomeFile} \
--runDir Strelka
else
# WES or targeted
bgzip --threads ${task.cpus} -c ${params.targetBED} > call_targets.bed.gz
tabix call_targets.bed.gz
configureStrelkaGermlineWorkflow.py \
--bam ${bam} \
--referenceFasta ${genomeFile} \
--exome \
--callRegions call_targets.bed.gz \
--runDir Strelka
fi

# always run this part
python Strelka/runWorkflow.py -m local -j ${task.cpus}
mv Strelka/results/variants/genome.*.vcf.gz Strelka_${idSample}_genome.vcf.gz
mv Strelka/results/variants/genome.*.vcf.gz.tbi Strelka_${idSample}_genome.vcf.gz.tbi
mv Strelka/results/variants/variants.vcf.gz Strelka_${idSample}_variants.vcf.gz
mv Strelka/results/variants/variants.vcf.gz.tbi Strelka_${idSample}_variants.vcf.gz.tbi
"""
}

if (params.verbose) singleStrelkaOutput = singleStrelkaOutput.view {
Expand Down Expand Up @@ -675,6 +652,7 @@ def minimalInformationMessage() {
log.info "TSV file : " + tsvFile
log.info "Genome : " + params.genome
log.info "Genome_base : " + params.genome_base
log.info "Target BED : " + params.targetBED
log.info "Tools : " + tools.join(', ')
log.info "Containers"
if (params.repository != "") log.info " Repository : " + params.repository
Expand Down
2 changes: 1 addition & 1 deletion lib/QC.groovy
Original file line number Diff line number Diff line change
Expand Up @@ -60,7 +60,7 @@ class QC {
// Get VEP version
static def getVersionVEP() {
"""
vep --help > v_vep.txt
/opt/vep/src/ensembl-vep/vep --help > v_vep.txt
"""
}
}
2 changes: 2 additions & 0 deletions lib/SarekUtils.groovy
Original file line number Diff line number Diff line change
Expand Up @@ -88,6 +88,8 @@ class SarekUtils {
'strelka-BP',
'strelkaBP',
'tag',
'target-BED',
'targetBED',
'test',
'tools',
'total-memory',
Expand Down
25 changes: 17 additions & 8 deletions scripts/test.sh
Original file line number Diff line number Diff line change
Expand Up @@ -8,6 +8,7 @@ PROFILE=singularity
SAMPLE=Sarek-data/testdata/tsv/tiny.tsv
TEST=ALL
TRAVIS=${TRAVIS:-false}
CPUS=2

TMPDIR=`pwd`/tmp
mkdir -p $TMPDIR
Expand Down Expand Up @@ -51,14 +52,18 @@ do
BUILD=true
shift # past value
;;
-c|--cpus)
CPUS=$2
shift # past value
;;
*) # unknown option
shift # past argument
;;
esac
done

function run_wrapper() {
./scripts/wrapper.sh $@ --profile $PROFILE --genome $GENOME --genomeBase $PWD/References/$GENOME --verbose
./scripts/wrapper.sh $@ --profile $PROFILE --genome $GENOME --genomeBase $PWD/References/$GENOME --verbose --cpus ${CPUS}
}

function clean_repo() {
Expand All @@ -84,18 +89,22 @@ then
fi
fi


if [[ ALL,GERMLINE =~ $TEST ]]
then
run_wrapper --germline --sampleDir Sarek-data/testdata/tiny/normal --variantCalling --tools HaplotypeCaller
run_wrapper --germline --step recalibrate --noReports
clean_repo
# Added Strelka to germline test (no Strelka best practices test for this small data) and not asking for reports
run_wrapper --germline --sampleDir Sarek-data/testdata/tiny/normal --variantCalling --tools HaplotypeCaller,Strelka
run_wrapper --germline --sampleDir Sarek-data/testdata/tiny/normal --variantCalling --tools HaplotypeCaller,Strelka --bed `pwd`/Sarek-data/testdata/target.bed --noReports
run_wrapper --germline --step recalibrate --noReports
clean_repo
fi

if [[ ALL,SOMATIC =~ $TEST ]]
then
run_wrapper --somatic --sample Sarek-data/testdata/tsv/tiny-manta.tsv --variantCalling --tools FreeBayes,HaplotypeCaller,Manta,Mutect2 --noReports
run_wrapper --somatic --sample Sarek-data/testdata/tsv/tiny-manta.tsv --variantCalling --tools Manta,Strelka --noReports --strelkaBP
clean_repo
run_wrapper --somatic --sample Sarek-data/testdata/tsv/tiny-manta.tsv --variantCalling --tools FreeBayes,HaplotypeCaller,Manta,Mutect2 --noReports
run_wrapper --somatic --sample Sarek-data/testdata/tsv/tiny-manta.tsv --variantCalling --tools Manta,Strelka --noReports --strelkaBP
run_wrapper --somatic --sample Sarek-data/testdata/tsv/tiny.tsv --variantCalling --tools FreeBayes,HaplotypeCaller,Mutect2,Strelka --bed `pwd`/Sarek-data/testdata/target.bed
clean_repo
fi

if [[ ALL,ANNOTATEALL,ANNOTATESNPEFF,ANNOTATEVEP =~ $TEST ]]
Expand All @@ -115,7 +124,7 @@ then
clean_repo
fi

if [[ ALL,BUILDCONTAINERS =~ $TEST ]] && [[ $PROFILE == docker ]]
if [[ BUILDCONTAINERS =~ $TEST ]] && [[ $PROFILE == docker ]]
then
./scripts/do_all.sh --genome $GENOME
fi
Loading