Sentieon docs #1131

Merged
merged 23 commits into from
Jun 28, 2023
120 changes: 112 additions & 8 deletions docs/output.md
@@ -18,8 +18,10 @@ The pipeline is built using [Nextflow](https://www.nextflow.io/) and processes d
- [BWA](#bwa)
- [BWA-mem2](#bwa-mem2)
- [DragMap](#dragmap)
- [Sentieon bwa mem](#sentieon-bwa-mem)
- [Duplicate Marking](#mark-duplicates)
- [GATK MarkDuplicates (Spark)](#gatk-markduplicates-spark)
- [Sentieon LocusCollector and Dedup](#sentieon-locuscollector-dedup)
- [Base Quality Score Recalibration](#base-quality-score-recalibration)
- [GATK BaseRecalibrator (Spark)](#gatk-baserecalibrator-spark)
- [GATK ApplyBQSR (Spark)](#gatk-applybqsr-spark)
@@ -29,6 +31,7 @@ The pipeline is built using [Nextflow](https://www.nextflow.io/) and processes d
- [DeepVariant](#deepvariant)
- [FreeBayes](#freebayes)
- [GATK HaplotypeCaller](#gatk-haplotypecaller)
- [Sentieon Haplotyper](#sentieon-haplotyper)
- [GATK Mutect2](#gatk-mutect2)
- [bcftools](#bcftools)
- [Strelka2](#strelka2)
@@ -49,6 +52,7 @@ The pipeline is built using [Nextflow](https://www.nextflow.io/) and processes d
- [FastQC](#fastqc)
- [FastP](#fastp)
- [GATK MarkDuplicates reports](#gatk-markduplicates-reports)
- [Sentieon Dedup reports](#sentieon-dedup-reports)
- [mosdepth](#mosdepth)
- [samtools stats](#samtools-stats)
- [bcftools stats](#bcftools-stats)
@@ -150,30 +154,34 @@ These files are intermediate and by default not placed in the output-folder kept

[BWA](https://github.com/lh3/bwa) is a software package for mapping low-divergent sequences against a large reference genome. The aligned reads are then coordinate-sorted (or name-sorted if [`GATK MarkDuplicatesSpark`](https://gatk.broadinstitute.org/hc/en-us/articles/5358833264411-MarkDuplicatesSpark) is used for duplicate marking) with [samtools](https://www.htslib.org/doc/samtools.html).

These files are intermediate and by default not placed in the output-folder kept in the final files delivered to users. Set `--save_mapped` to enable publishing in CRAM format, furthermore add the flag `save_output_as_bam` for publishing in BAM format.

#### BWA-mem2

[BWA-mem2](https://github.com/bwa-mem2/bwa-mem2) is a software package for mapping low-divergent sequences against a large reference genome. The aligned reads are then coordinate-sorted (or name-sorted if [`GATK MarkDuplicatesSpark`](https://gatk.broadinstitute.org/hc/en-us/articles/5358833264411-MarkDuplicatesSpark) is used for duplicate marking) with [samtools](https://www.htslib.org/doc/samtools.html).

These files are intermediate and by default not placed in the output-folder kept in the final files delivered to users. Set `--save_mapped` to enable publishing, furthermore add the flag `save_output_as_bam` for publishing in BAM format.

#### DragMap

[DragMap](https://github.com/Illumina/dragmap) is an open-source software implementation of the DRAGEN mapper, which the Illumina team created so that we would have an open-source way to produce the same results as their proprietary DRAGEN hardware. The aligned reads are then coordinate-sorted (or name-sorted if [`GATK MarkDuplicatesSpark`](https://gatk.broadinstitute.org/hc/en-us/articles/5358833264411-MarkDuplicatesSpark) is used for duplicate marking) with [samtools](https://www.htslib.org/doc/samtools.html).

These files are intermediate and by default not placed in the output-folder kept in the final files delivered to users. Set `--save_mapped` to enable publishing, furthermore add the flag `save_output_as_bam` for publishing in BAM format.

#### Sentieon BWA mem

Sentieon [bwa mem](https://support.sentieon.com/manual/usages/general/#bwa-mem-syntax) is a subroutine for mapping low-divergent sequences against a large reference genome. It is part of the proprietary software package [DNAseq](https://www.sentieon.com/detailed-description-of-pipelines/#dnaseq) from [Sentieon](https://www.sentieon.com/).

The aligned reads are coordinate-sorted with Sentieon.

<details markdown="1">
<summary>Output files for all mappers and samples</summary>

The alignment files (BAM or CRAM) produced by the chosen aligner are not published by default. CRAM output files will not be saved in the output-folder (`outdir`), unless the flag `--save_mapped` is used. BAM output can be selected by setting the flag `--save_output_as_bam`.

**Output directory: `{outdir}/preprocessing/mapped/<sample>/`**

- if `--save_mapped`: `<sample>.cram` and `<sample>.cram.crai`
- if `--save_mapped`: `<sample>.sorted.cram` and `<sample>.sorted.cram.crai`

- CRAM file and index

- if `--save_mapped --save_output_as_bam`: `<sample>.bam` and `<sample>.bam.bai`
- if `--save_mapped --save_output_as_bam`: `<sample>.sorted.bam` and `<sample>.sorted.bam.bai`
- BAM file and index
</details>

@@ -203,6 +211,26 @@ The resulting CRAM files are delivered to the users.

</details>

### Sentieon LocusCollector and Dedup

The subroutines LocusCollector and Dedup are part of Sentieon's DNAseq package, which provides sped-up versions of standard GATK tools. Together, the two subroutines correspond to GATK's MarkDuplicates.

The subroutine [LocusCollector](https://support.sentieon.com/manual/usages/general/#driver-algorithm-syntax) collects read information that will be used for removing or tagging duplicate reads; its output is the score file indicating which reads are likely duplicates.

The subroutine [Dedup](https://support.sentieon.com/manual/usages/general/#dedup-algorithm) marks or removes duplicate reads based on the score file supplied by LocusCollector, and produces a BAM or CRAM file.

<details markdown="1">
<summary>Output files for all samples</summary>

**Output directory: `{outdir}/preprocessing/sentieon_dedup/<sample>/`**

- `<sample>.dedup.cram` and `<sample>.dedup.cram.crai`
- CRAM file and index
- if `--save_output_as_bam`:
- `<sample>.dedup.bam` and `<sample>.dedup.bam.bai`

</details>
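
The two-step relationship between LocusCollector and Dedup can be sketched as a dry run. This is a sketch that follows the general `sentieon driver` pattern from Sentieon's manual; the file names, the thread count and the `run` helper are placeholders for illustration, and the exact options Sarek passes may differ.

```bash
#!/usr/bin/env bash
# Dry-run sketch: prints the two commands instead of executing them.
# All values below are hypothetical placeholders, not Sarek's actual settings.
threads=4
reference="genome.fasta"
input="sample1.sorted.cram"
score="sample1.score"
metrics="sample1.dedup.cram.metrics"
output="sample1.dedup.cram"

# Hypothetical helper: echo the command line rather than running it.
run() { echo "+ $*"; }

# Step 1: LocusCollector reads the alignment and writes the score file.
run sentieon driver -t "$threads" -r "$reference" -i "$input" \
  --algo LocusCollector --fun score_info "$score"

# Step 2: Dedup uses the score file to mark duplicates and writes
# the metrics report plus the deduplicated alignment file.
run sentieon driver -t "$threads" -r "$reference" -i "$input" \
  --algo Dedup --score_info "$score" --metrics "$metrics" "$output"
```

Replacing the body of `run` with `"$@"` would execute the commands, which requires a Sentieon installation and a valid license.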

### Base Quality Score Recalibration

During Base Quality Score Recalibration, systematic errors in the base quality scores are corrected by applying machine learning to detect and correct for them. This is important for evaluating the correct call of a variant during the variant discovery process. However, this is not needed for all combinations of tools in Sarek. Notably, this should be turned off when having UMI tagged reads or using DragMap (see [here](https://gatk.broadinstitute.org/hc/en-us/articles/4407897446939--How-to-Run-germline-single-sample-short-variant-discovery-in-DRAGEN-mode)) as mapper.
@@ -248,7 +276,7 @@ The resulting recalibrated CRAM files are delivered to the user. Recalibrated CR

The CSV files are auto-generated and can be used by Sarek for further processing and/or variant calling.

See the [`--input`](usage.md#--input) section in the usage documentation for further reading and documentation on how to make the most of them.
See the [`input`](usage#input-sample-sheet-configurations) section in the usage documentation for further reading and documentation on how to make the most of them.

<details markdown="1">
<summary>Output files:</summary>
@@ -268,6 +296,10 @@ See the [`--input`](usage.md#--input) section in the usage documentation for fur
- CSV containing an entry for each sample with the columns `patient,sample,vcf`
</details>

#### Sentieon QualCal (BQSR)

Currently, Sentieon's version of BQSR, QualCal, is not available in Sarek. Recent Illumina sequencers tend to provide well-calibrated BQs, so BQSR may not provide much benefit. By default Sarek runs GATK's BQSR; that can be skipped by adding the option `--skip_tools baserecalibrator`.

## Variant Calling

The results regarding variant calling are collected in `{outdir}/variantcalling/`.
@@ -358,12 +390,20 @@ If the haplotype-called VCF files are not filtered, then Sarek should be run wit

[GATK Joint germline Variant Calling](https://gatk.broadinstitute.org/hc/en-us/articles/360035535932-Germline-short-variant-discovery-SNPs-Indels-) uses Haplotypecaller per sample in `gvcf` mode. Next, the gVCFs are consolidated from multiple samples into a [GenomicsDB](https://gatk.broadinstitute.org/hc/en-us/articles/5358869876891-GenomicsDBImport) datastore. After joint [genotyping](https://gatk.broadinstitute.org/hc/en-us/articles/5358906861083-GenotypeGVCFs), [VQSR](https://gatk.broadinstitute.org/hc/en-us/articles/5358906115227-VariantRecalibrator) is applied for filtering to produce the final multisample callset with the desired balance of precision and sensitivity.

<details markdown="1">
<summary>Output files from joint germline variant calling</summary>

**Output directory: `{outdir}/variantcalling/haplotypecaller/<sample>/`**

- `<sample>.haplotypecaller.g.vcf.gz` and `<sample>.haplotypecaller.g.vcf.gz.tbi`
- VCF with tabix index

**Output directory: `{outdir}/variantcalling/sentieon_haplotyper/joint_variant_calling/`**

- `joint_germline.vcf.gz` and `joint_germline.vcf.gz.tbi`
- VCF with tabix index
- `joint_germline_recalibrated.vcf.gz` and `joint_germline_recalibrated.vcf.gz.tbi`
- variant recalibrated VCF with tabix index
- variant recalibrated VCF with tabix index (if VQSR is applied)

</details>

@@ -399,6 +439,57 @@ Files created:

</details>

#### Sentieon Haplotyper

[Sentieon Haplotyper](https://support.sentieon.com/manual/usages/general/#haplotyper-algorithm) is Sentieon's sped-up version of GATK's HaplotypeCaller (see above).

<details markdown="1">
<summary>Unfiltered VCF-files for normal samples</summary>

**Output directory: `{outdir}/variantcalling/sentieon_haplotyper/<sample>/`**

- `<sample>.haplotyper.unfiltered.vcf.gz` and `<sample>.haplotyper.unfiltered.vcf.gz.tbi`
- VCF with tabix index

</details>

The output from Sentieon's Haplotyper can be controlled through the option `--sentieon_haplotyper_emit_mode` for Sarek, see [Basic usage of Sentieon functions in Sarek](https://github.com/nf-core/sarek/blob/sentieon_docs/docs/usage.md#basic-usage-of-sentieon-functions-in-sarek).

Unless `haplotyper_filter` is listed under `--skip_tools` in the nextflow command, GATK's CNNScoreVariants and FilterVariantTranches (see above) are applied to the unfiltered VCF-files in order to obtain filtered VCF-files.

<details markdown="1">
<summary>Filtered VCF-files for normal samples</summary>

**Output directory: `{outdir}/variantcalling/sentieon_haplotyper/<sample>/`**

- `<sample>.haplotyper.filtered.vcf.gz` and `<sample>.haplotyper.filtered.vcf.gz.tbi`
- VCF with tabix index

</details>

##### Sentieon Joint Germline Variant Calling

In Sentieon's package DNAseq, joint germline variant calling is done by first running Sentieon's Haplotyper in emit-mode `gvcf` for each sample and then running Sentieon's [GVCFtyper](https://support.sentieon.com/manual/usages/general/#gvcftyper-algorithm) on the set of gVCF-files. See [Basic usage of Sentieon functions in Sarek](https://github.com/nf-core/sarek/blob/sentieon_docs/docs/usage.md#basic-usage-of-sentieon-functions-in-sarek) for information on how joint germline variant calling can be done in Sarek using Sentieon's DNAseq.

Sarek's implementation of joint germline variant calling using DNAseq does not include the usage of [GenomicsDB](https://gatk.broadinstitute.org/hc/en-us/articles/5358869876891-GenomicsDBImport) datastore. After joint genotyping, Sentieon's version of VQSR ([VarCal](https://support.sentieon.com/manual/usages/general/#varcal-algorithm) and [ApplyVarCal](https://support.sentieon.com/manual/usages/general/#applyvarcal-algorithm)) is applied for filtering to produce the final multisample callset with the desired balance of precision and sensitivity.

<details markdown="1">
<summary>Output files from joint germline variant calling</summary>

**Output directory: `{outdir}/variantcalling/sentieon_haplotyper/<sample>/`**

- `<sample>.haplotypecaller.g.vcf.gz` and `<sample>.haplotypecaller.g.vcf.gz.tbi`
- VCF with tabix index

**Output directory: `{outdir}/variantcalling/sentieon_haplotyper/joint_variant_calling/`**

- `joint_germline.vcf.gz` and `joint_germline.vcf.gz.tbi`
- VCF with tabix index
- `joint_germline_recalibrated.vcf.gz` and `joint_germline_recalibrated.vcf.gz.tbi`
- variant recalibrated VCF with tabix index (if VarCal is applied)

</details>
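
The two stages described above (per-sample gVCFs, then joint genotyping) can be sketched as a dry run. This is only an illustration of the general `sentieon driver` pattern from Sentieon's manual: the sample names, file names and the `run` helper are assumptions, the VarCal/ApplyVarCal step is left out, and the options Sarek actually passes may differ.

```bash
#!/usr/bin/env bash
# Dry-run sketch: prints the commands instead of executing them.
# Sample and file names are hypothetical placeholders.
reference="genome.fasta"

# Hypothetical helper: echo the command line rather than running it.
run() { echo "+ $*"; }

# Stage 1: one gVCF per sample, using Haplotyper in emit-mode gvcf.
for sample in sample1 sample2; do
  run sentieon driver -r "$reference" -i "$sample.dedup.cram" \
    --algo Haplotyper --emit_mode gvcf "$sample.g.vcf.gz"
done

# Stage 2: GVCFtyper genotypes all gVCFs jointly into one VCF.
run sentieon driver -r "$reference" --algo GVCFtyper \
  -v sample1.g.vcf.gz -v sample2.g.vcf.gz joint_germline.vcf.gz
```

No GenomicsDB import appears between the two stages, which matches the description above: the gVCF files are passed to GVCFtyper directly.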

#### Strelka2

[Strelka2](https://github.com/Illumina/strelka) is a fast and accurate small variant caller optimized for analysis of germline variation in small cohorts and somatic variation in tumor/normal sample pairs. For further reading and documentation see the [Strelka2 user guide](https://github.com/Illumina/strelka/blob/master/docs/userGuide/README.md). If [Strelka2](https://github.com/Illumina/strelka) is used for somatic variant calling and [Manta](https://github.com/Illumina/manta) is also specified in tools, the output candidate indels from [Manta](https://github.com/Illumina/manta) are used according to [Strelka Best Practices](https://github.com/Illumina/strelka/blob/master/docs/userGuide/README.md#somatic-configuration-example).
@@ -831,6 +922,19 @@ The plot will show:
- file used by [MultiQC](https://multiqc.info/)
</details>

#### Sentieon Dedup reports

Sentieon's DNAseq subroutine Dedup produces a metrics report much like the one produced by GATK's MarkDuplicates. The Dedup metrics are imported into MultiQC as custom content and displayed in a table.

<details markdown="1">
<summary>Output files for all samples</summary>

**Output directory: `{outdir}/reports/sentieon_dedup/<sample>`**

- `<sample>.dedup.cram.metrics`
- file used by [MultiQC](https://multiqc.info/).
</details>

#### samtools stats

[samtools stats](https://www.htslib.org/doc/samtools.html) collects statistics from CRAM files and outputs in a text format.
40 changes: 35 additions & 5 deletions docs/usage.md
@@ -305,9 +305,43 @@ test,sample4_vs_sample3,manta,sample4_vs_sample3.diploid_sv.vcf.gz
test,sample4_vs_sample3,manta,sample4_vs_sample3.somatic_sv.vcf.gz
```

## Sentieon

[Sentieon](https://www.sentieon.com/) is a commercial solution to process genomics data with high computing efficiency, fast turnaround time, exceptionally high accuracy, and 100% consistency.

In particular, Sentieon contains what may be viewed as sped-up versions of some standard tools, such as BWA-MEM (`bwa mem`) and GATK's HaplotypeCaller (`Haplotyper`). Sarek now supports some of the Sentieon modules. In order to use the Sentieon modules of Sarek, the user will need to supply the Sarek pipeline with a license for Sentieon.

### Setup of Sentieon license for Sarek

Sentieon supplies the license either as a string value (a URL) or as a file. It should be base64-encoded and stored in a Nextflow secret named `SENTIEON_LICENSE_BASE64`. If a license string (URL) is supplied, then the Nextflow secret should be set like this:

```bash
nextflow secrets set SENTIEON_LICENSE_BASE64 $(echo -n <sentieon_license_string> | base64 -w 0)
```

If a license file is supplied, then the nextflow secret should be set like this:

```bash
nextflow secrets set SENTIEON_LICENSE_BASE64 $(cat <sentieon_license_file.lic> | base64 -w 0)
```
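
Before storing the secret, it can be worth checking that the encoded value decodes back to the original license. The following is a minimal sketch, assuming GNU coreutils `base64` (the `-w 0` and `-d` options) and a made-up license string; the `nextflow secrets` lines are left commented out so the snippet runs without Nextflow installed.

```bash
#!/usr/bin/env bash
# Hypothetical license string, for illustration only.
license_string="license.example.org:8990"

# Encode without line wrapping, then decode again to check the round trip.
encoded=$(printf '%s' "$license_string" | base64 -w 0)
decoded=$(printf '%s' "$encoded" | base64 -d)

if [ "$decoded" = "$license_string" ]; then
  echo "round trip OK"
else
  echo "encoded value does not decode to the original string" >&2
fi

# Once the check passes, store the secret and confirm that it is registered:
# nextflow secrets set SENTIEON_LICENSE_BASE64 "$encoded"
# nextflow secrets list
```

Using `printf '%s'` avoids a trailing newline ending up in the encoded value, which is the same purpose `echo -n` serves in the command above.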

### Available Sentieon functions

Sarek contains the following Sentieon functions: [bwa mem](https://support.sentieon.com/manual/usages/general/#bwa-mem-syntax), [LocusCollector](https://support.sentieon.com/manual/usages/general/#locuscollector-algorithm) + [Dedup](https://support.sentieon.com/manual/usages/general/#dedup-algorithm), [Haplotyper](https://support.sentieon.com/manual/usages/general/#haplotyper-algorithm), [GVCFtyper](https://support.sentieon.com/manual/usages/general/#gvcftyper-algorithm) and [VarCal](https://support.sentieon.com/manual/usages/general/#varcal-algorithm) + [ApplyVarCal](https://support.sentieon.com/manual/usages/general/#applyvarcal-algorithm). The basic processing from alignment of FASTQ files to VCF files can therefore be done using the sped-up Sentieon functions.

### Basic usage of Sentieon functions in Sarek

To use Sentieon's aligner `bwa mem`, set the aligner option `sentieon-bwamem`. (This can, for example, be done by adding `--aligner sentieon-bwamem` to the nextflow run command.)

To use Sentieon's function `Dedup`, specify `sentieon_dedup` as one of the tools. (This can, for example, be done by adding `--tools sentieon_dedup` to the nextflow run command.)

To use Sentieon's function `Haplotyper`, specify `sentieon_haplotyper` as one of the tools. This can, for example, be done by adding `--tools sentieon_haplotyper` to the nextflow run command. In order to skip the GATK-based variant-filter, one may add `--skip_tools haplotyper_filter` to the nextflow run command. Sarek also provides the option `sentieon_haplotyper_emit_mode` which can be used to set the [emit-mode](https://support.sentieon.com/manual/usages/general/#haplotyper-algorithm) of Sentieon's haplotyper. Sentieon's haplotyper can output both a vcf-file and a gvcf-file in the same run; this is achieved by setting `sentieon_haplotyper_emit_mode` to `<vcf_emit_mode>,gvcf`, where `<vcf_emit_mode>` is `variant`, `confident` or `all`.
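
The emit-mode values described above can be checked with a small shell helper before launching a run. This is a hypothetical convenience function, not part of Sarek; it accepts only the forms named in the text (`variant`, `confident`, `all`, `gvcf`, and `<vcf_emit_mode>,gvcf`), so whether Sarek tolerates other spellings or orderings is not covered here.

```bash
#!/usr/bin/env bash
# Hypothetical helper, not part of Sarek: returns 0 if the value matches
# one of the emit-mode forms described in the text, 1 otherwise.
valid_emit_mode() {
  case "$1" in
    variant|confident|all|gvcf) return 0 ;;
    variant,gvcf|confident,gvcf|all,gvcf) return 0 ;;
    *) return 1 ;;
  esac
}

# Example checks: the first value is accepted, the second is rejected.
if valid_emit_mode "variant,gvcf"; then echo "accepted: variant,gvcf"; fi
if ! valid_emit_mode "gvcf,gvcf"; then echo "rejected: gvcf,gvcf"; fi
```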

To use Sentieon's function `GVCFtyper` along with Sentieon's version of VQSR (`VarCal` and `ApplyVarCal`) for joint-germline genotyping, specify `sentieon_haplotyper` as one of the tools, set the option `sentieon_haplotyper_emit_mode` to `gvcf`, and add the option `joint_germline`. This can, for example, be done by adding `--tools sentieon_haplotyper --joint_germline --sentieon_haplotyper_emit_mode gvcf` to the nextflow run command.
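
Putting the options from this section together, a full command line might look like the sketch below. The input sample sheet, the output directory and the `docker` profile are placeholder assumptions, and combining `sentieon_dedup` with the joint-germline options in one run is an illustration rather than a tested recipe. The command is printed rather than executed so that it can be inspected first.

```bash
#!/usr/bin/env bash
# Sketch only: assembles the command as a string and prints it.
# samplesheet.csv and results are hypothetical paths.
cmd="nextflow run nf-core/sarek \
  -profile docker \
  --input samplesheet.csv \
  --outdir results \
  --aligner sentieon-bwamem \
  --tools sentieon_dedup,sentieon_haplotyper \
  --sentieon_haplotyper_emit_mode gvcf \
  --joint_germline"

echo "$cmd"
```

The Nextflow secret `SENTIEON_LICENSE_BASE64` needs to be set before such a command is actually run.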

## Updating the pipeline

When you run the above command, Nextflow automatically pulls the pipeline code from GitHub and stores it as a cached version. When running the pipeline after this, it will always use the cached version if available - even if the pipeline has been updated since. To make sure that you're running the latest version of the pipeline, make sure that you regularly update the cached version of the pipeline:
When you launch a pipeline from the command-line with `nextflow run nf-core/sarek -profile docker -params-file params.yaml`, Nextflow will automatically pull the pipeline code from GitHub and store it as a cached version. When running the pipeline after this, it will always use the cached version if available - even if the pipeline has been updated since. To make sure that you're running the latest version of the pipeline, make sure that you regularly update the cached version of the pipeline:

```bash
nextflow pull nf-core/sarek
@@ -1006,7 +1040,3 @@ ERRORS: Some errors were detected
Error type Number of errors
ERROR_CHROMOSOME_NOT_FOUND 17522411
```

## How to set up sarek to use sentieon

Sarek is currently not supporting sentieon. It is planned for the upcoming release 3.3. In the meantime, please revert to the last release 2.7.2.