From f68f5aa00a57a53269738f6f580d2b8dc7893a7f Mon Sep 17 00:00:00 2001 From: Johannes Alneberg Date: Mon, 10 Sep 2018 13:30:31 +0200 Subject: [PATCH 1/7] Renamed tsv to input --- docs/{TSV.md => INPUT.md} | 41 +++++++++++++++++++++++++++++++++++++++ 1 file changed, 41 insertions(+) rename docs/{TSV.md => INPUT.md} (73%) diff --git a/docs/TSV.md b/docs/INPUT.md similarity index 73% rename from docs/TSV.md rename to docs/INPUT.md index 084adb1c93..4468f02320 100644 --- a/docs/TSV.md +++ b/docs/INPUT.md @@ -57,3 +57,44 @@ All the files will be in he Preprocessing/Recalibrated/ directory, and by defaul ```bash nextflow run SciLifeLab/Sarek/somaticVC.nf --sample Preprocessing/Recalibrated/mysample.tsv --tools Mutect2,Strelka ``` + +## Input FASTQ file name best practices + +The input folder, containing the FASTQ files for one individual (ID), should be organized into one subfolder for every sample. +All fastq files for that sample should be collected here. + +``` +ID ++--sample1 ++------sample1_lib_flowcell-index_lane_R1_1000.fastq.gz ++------sample1_lib_flowcell-index_lane_R2_1000.fastq.gz ++------sample1_lib_flowcell-index_lane_R1_1000.fastq.gz ++------sample1_lib_flowcell-index_lane_R2_1000.fastq.gz ++--sample2 ++------sample2_lib_flowcell-index_lane_R1_1000.fastq.gz ++------sample2_lib_flowcell-index_lane_R2_1000.fastq.gz ++--sample3 ++------sample3_lib_flowcell-index_lane_R1_1000.fastq.gz ++------sample3_lib_flowcell-index_lane_R2_1000.fastq.gz ++------sample3_lib_flowcell-index_lane_R1_1000.fastq.gz ++------sample3_lib_flowcell-index_lane_R2_1000.fastq.gz +``` + +Fastq filename structure: + +- `sample_lib_flowcell-index_lane_R1_1000.fastq.gz` and +- `sample_lib_flowcell-index_lane_R2_1000.fastq.gz` + +Where: + +- `sample` = sample id +- `lib` = identifier of library preparation +- `flowcell` = identifier of flow cell for the sequencing run +- `lane` = identifier of the lane of the sequencing run + +Read group information will be parsed from fastq file names
according to this: + +- `RGID` = "sample_lib_flowcell_index_lane" +- `RGPL` = "Illumina" +- `PU` = sample +- `RGLB` = lib From 9e224081a0817c10d6083a23b8fa578a814cae50 Mon Sep 17 00:00:00 2001 From: Johannes Alneberg Date: Mon, 10 Sep 2018 13:34:03 +0200 Subject: [PATCH 2/7] Started with the beginners docs --- docs/USAGE.md | 41 ++++------------------------------------- 1 file changed, 4 insertions(+), 37 deletions(-) diff --git a/docs/USAGE.md b/docs/USAGE.md index b1474d0f31..e7ab0b66f7 100644 --- a/docs/USAGE.md +++ b/docs/USAGE.md @@ -9,47 +9,14 @@ The workflow is started for a sample, or a set of samples from the same Individu Each different physical samples is identified by its own ID. For example in a Tumour/Normal settings, this ID could correspond to "Normal", "Tumour_1", "Tumour_2" etc. corresponding to all physical samples from the same patient. -## Input FASTQ file name best practices - -The input folder, containing the FASTQ files for one individual (ID) should be organized into one subfolder for every sample. -All fastq files for that sample should be collected here. +## Preparing to run Sarek +Sarek will start the analysis by parsing a supplied input file in tsv format. 
+This file contains all the necessary information about the data and should have at least one tab-separated line: ``` -ID -+--sample1 -+------sample1_lib_flowcell-index_lane_R1_1000.fastq.gz -+------sample1_lib_flowcell-index_lane_R2_1000.fastq.gz -+------sample1_lib_flowcell-index_lane_R1_1000.fastq.gz -+------sample1_lib_flowcell-index_lane_R2_1000.fastq.gz -+--sample2 -+------sample2_lib_flowcell-index_lane_R1_1000.fastq.gz -+------sample2_lib_flowcell-index_lane_R2_1000.fastq.gz -+--sample3 -+------sample3_lib_flowcell-index_lane_R1_1000.fastq.gz -+------sample3_lib_flowcell-index_lane_R2_1000.fastq.gz -+------sample3_lib_flowcell-index_lane_R1_1000.fastq.gz -+------sample3_lib_flowcell-index_lane_R2_1000.fastq.gz +SUBJECT_ID XX 0 SAMPLEID 1 /samples/normal_1.fastq.gz /samples/normal_2.fastq.gz ``` -Fastq filename structure: - -- `sample_lib_flowcell-index_lane_R1_1000.fastq.gz` and -- `sample_lib_flowcell-index_lane_R2_1000.fastq.gz` - -Where: - -- `sample` = sample id -- `lib` = indentifier of libaray preparation -- `flowcell` = identifyer of flow cell for the sequencing run -- `lane` = identifier of the lane of the sequencing run - -Read group information will be parsed from fastq file names according to this: - -- `RGID` = "sample_lib_flowcell_index_lane" -- `RGPL` = "Illumina" -- `PU` = sample -- `RGLB` = lib - ## Scripts Sarek uses several scripts, a wrapper is currently being made to simplify the command lines. 
From cfffbf90679d7936f48f9aafa1610358b9e40117 Mon Sep 17 00:00:00 2001 From: Johannes Alneberg Date: Wed, 12 Sep 2018 16:31:02 +0200 Subject: [PATCH 3/7] Updated the config docs --- docs/CONFIG.md | 24 ++++++++++++++++++++---- 1 file changed, 20 insertions(+), 4 deletions(-) diff --git a/docs/CONFIG.md b/docs/CONFIG.md index 85cae631a8..3f973ee90e 100644 --- a/docs/CONFIG.md +++ b/docs/CONFIG.md @@ -5,7 +5,8 @@ For more informations on how to use configuration files, have a look at the [Nex For more informations about profiles, have a look at the [Nextflow documentation](https://www.nextflow.io/docs/latest/config.html#config-profiles) We provides several configuration files and profiles for Sarek. -The standard ones are designed to work on a Swedish UPPMAX clusters, and can be modified and tailored to your own need. +The standard ones are designed to work on a Swedish UPPMAX cluster, but can be modified and tailored to your own needs. + ## Configuration files @@ -51,10 +52,14 @@ To be used for Travis (2 cpus) or on small computer for testing purpose Slurm configuration for a UPPMAX cluster Will run the workflow on `/scratch` using the Nextflow [`scratch`](https://www.nextflow.io/docs/latest/process.html#scratch) directive -## profiles +## Profiles +A profile is a convenient way of specifying which set of configuration files to use. +The default profile is `standard`, but Sarek has multiple predefined profiles, listed below, which can be selected with `-profile `: + +```bash +nextflow run SciLifeLab/Sarek --sample mysample.tsv -profile myprofile +``` -Every profile can be modified for your own use. -To use a profile, you'll need to specify `-profile ` ### `docker` @@ -82,3 +87,14 @@ Singularity images will be pulled automatically. This is the profile for Singularity testing on a small machine, or on Travis CI. Singularity images will be pulled automatically.
+ +## Customisation +The recommended way to use custom settings is to supply Sarek with an additional configuration file. You can use the files in the [`conf/`](https://github.com/SciLifeLab/Sarek/tree/master/conf) directory as a starting point for this new `.config` file and specify it using the `-c` flag: + +```bash +nextflow run SciLifeLab/Sarek --sample mysample.tsv -c conf/personal.config +``` + +Any configuration field specified in this file takes precedence over the predefined configurations, while any field left out of the file is set by the normal configuration files included in the specified (or `standard`) profile. + +To find out which configuration files each profile uses, see the profile definitions in [`nextflow.config`](https://github.com/SciLifeLab/Sarek/blob/master/nextflow.config). From 1258ad029e39eeadcd02adad85af0d72d64116b2 Mon Sep 17 00:00:00 2001 From: Johannes Alneberg Date: Wed, 12 Sep 2018 16:32:06 +0200 Subject: [PATCH 4/7] Whitespace change on INPUT docs --- docs/INPUT.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/INPUT.md b/docs/INPUT.md index 4468f02320..edab568bff 100644 --- a/docs/INPUT.md +++ b/docs/INPUT.md @@ -3,7 +3,7 @@ Input files for Sarek can be specified using a tsv file given to the `--sample` parameter. The tsv file is a Tab Separated Value file with columns: `subject gender status sample lane fastq1 fastq2` or `subject gender status sample bam bai`. The content of these columns should be quite straight-forward: -- `subject` designate the subject, it should be the ID of the Patient, or if you don't have one, il could be the Normal ID Sample. +- `subject` designates the subject, it should be the ID of the Patient, or if you don't have one, it could be the Normal ID Sample.
- `gender` is the gender of the Patient, (XX or XY) - `status` is the status of the Patient, (0 for Normal or 1 for Tumor) - `sample` designate the Sample, it should be the ID of the Sample (it is possible to have more than one tumor sample for each patient) From d888fb9994f0fdc02e854385ee37969822f9aa1f Mon Sep 17 00:00:00 2001 From: Johannes Alneberg Date: Wed, 12 Sep 2018 16:32:57 +0200 Subject: [PATCH 5/7] Major rewrite of the usage docs, from a beginners perspective --- docs/PARAMETERS.md | 139 ++++++++++++++++++++ docs/USAGE.md | 315 +++++++++++++++++++++------------------------ 2 files changed, 288 insertions(+), 166 deletions(-) create mode 100644 docs/PARAMETERS.md diff --git a/docs/PARAMETERS.md b/docs/PARAMETERS.md new file mode 100644 index 0000000000..399fade15c --- /dev/null +++ b/docs/PARAMETERS.md @@ -0,0 +1,139 @@ +# Parameters + +A list of all possible parameters that can be used for the different scripts included in Sarek. + +## Common for all scripts + +### --help + +Display help + +### --noReports + +Disable all QC tools and MultiQC. + +### --outDir + +Choose an output directory + +### --project `ProjectID` + +Specify a project number ID on a UPPMAX cluster. +(optional if not on such a cluster) + +### --sample `file.tsv` + +Use the given TSV file as sample input (cf [input documentation](INPUT.md)). +Is not used for `annotate.nf` and `runMultiQC.nf`. + +### --tools `tool1[,tool2,tool3...]` + +Choose which tools will be used in the workflow. +Different tools to be separated by commas.
+Possible values are: + +- haplotypecaller (use `HaplotypeCaller` for VC) (germlineVC.nf) +- manta (use `Manta` for SV) (germlineVC.nf,somaticVC.nf) +- strelka (use `Strelka` for VC) (germlineVC.nf,somaticVC.nf) +- ascat (use `ASCAT` for CNV) (somaticVC.nf) +- mutect2 (use `MuTect2` for VC) (somaticVC.nf) +- snpeff (use `snpEff` for Annotation) (annotate.nf) +- vep (use `VEP` for Annotation) (annotate.nf) + +`--tools` option is case insensitive to avoid easy introduction of errors when choosing tools. +So you can write `--tools mutect2,ascat` or `--tools MuTect2,ASCAT` without worrying about case sensitivity. + +### --verbose + +Display more information about files being processed. + +## Preprocessing script (`main.nf`) +### --step `step` + +Choose from which step the workflow will start. +Choose only one step. +Possible values are: + +- mapping (default, will start workflow with FASTQ files) +- recalibrate (will start workflow with BAM files and Recalibration Tables) + +`--step` option is case insensitive to avoid easy introduction of errors when choosing a step. + +### --test + +Test run Sarek on a smaller dataset, that way you don't have to specify `--sample Sarek-data/testdata/tsv/tiny.tsv` + +### --onlyQC + +Run only QC tools and MultiQC to generate an HTML report. + + +## Annotate script (`annotate.nf`) + +### --annotateTools `tool1[,tool2,tool3...]` + +Choose which tools to annotate. +Different tools to be separated by commas. +Possible values are: +- haplotypecaller (Annotate `HaplotypeCaller` output) +- manta (Annotate `Manta` output) +- mutect2 (Annotate `MuTect2` output) +- strelka (Annotate `Strelka` output) + +### --annotateVCF `file1[,file2,file3...]` + +Choose vcf to annotate. +Different vcfs to be separated by commas.
+ + +## MultiQC script (`runMultiQC.nf`) +### --callName `Name` + +Specify a name for MultiQC report (optional) + +### --contactMail `email` + +Specify an email for MultiQC report (optional) + + +## References + +For most use cases, the reference information is already in the configuration file [`conf/genomes.config`](https://github.com/SciLifeLab/Sarek/blob/master/conf/genomes.config). +However, if needed, you can specify any reference file at the command line. + +### --acLoci `acLoci file` + +### --bwaIndex `bwaIndex file` + +### --cosmic `cosmic file` + +### --cosmicIndex `cosmicIndex file` + +### --dbsnp `dbsnp file` + +### --dbsnpIndex `dbsnpIndex file` + +### --genomeDict `genomeDict file` + +### --genomeFile `genomeFile file` + +### --genomeIndex `genomeIndex file` + +### --intervals `intervals file` + +### --knownIndels `knownIndels file` + +### --knownIndelsIndex `knownIndelsIndex file` + +### --snpeffDb `snpeffDb file` + +## Hardware Parameters + +For most use cases, these settings are already in the appropriate [configuration files](https://github.com/SciLifeLab/Sarek/blob/master/conf/). +However, it is still possible to specify these parameters at the command line. + +### --runTime `time` + +### --singleCPUMem `memory` + +### --totalMemory `memory` diff --git a/docs/USAGE.md b/docs/USAGE.md index c4e32bd9c4..4a4ff7c5a3 100644 --- a/docs/USAGE.md +++ b/docs/USAGE.md @@ -1,201 +1,184 @@ -# Usage +# How to run Sarek -I would recommend to run Nextflow within a [screen](https://www.gnu.org/software/screen/) or [tmux](https://tmux.github.io/) session. +This guide will take you through your first run of Sarek. +It is divided into two parts corresponding to the two main types of analysis offered by Sarek: + - Run a Germline Analysis + - Run a Somatic Analysis -## Project folder structure +This guide assumes you have internet access on the server where the analysis will take place.
If you do not have that, please look into the [installation instructions](INSTALL_BIANCA.md) for the restricted access server Bianca at Uppmax, which should give an idea of how to adjust the following examples accordingly. -The workflow is started for a sample, or a set of samples from the same Individual. -Each different physical samples is identified by its own ID. -For example in a Tumour/Normal settings, this ID could correspond to "Normal", "Tumour_1", "Tumour_2" etc. corresponding to all physical samples from the same patient. +It is recommended to run Sarek within a [screen](https://www.gnu.org/software/screen/) or [tmux](https://tmux.github.io/) session. +This helps Sarek run uninterrupted until the analysis has finished. +Furthermore, Sarek is designed to be run on a single sample for a germline analysis or a set of samples from the same individual for a somatic analysis. +If more than one individual will be analysed, it is recommended that this is done in separate directories that are analysed separately. -## Preparing to run Sarek -Sarek will start the analysis by parsing a supplied input file in tsv format. -This file contains all the necessary information about the data and should have at least one tab-separated line: -``` -SUBJECT_ID XX 0 SAMPLEID 1 /samples/normal_1.fastq.gz /samples/normal_2.fastq.gz -``` - -## Scripts +## Update to latest version -Sarek uses several scripts, a wrapper is currently being made to simplify the command lines.
-Currently the typical reduced command lines are: +To make sure that you have the latest version of Sarek, use: ```bash -nextflow run SciLifeLab/Sarek/main.nf --sample --step -nextflow run SciLifeLab/Sarek/germlineVC.nf --sample --tools -nextflow run SciLifeLab/Sarek/somaticVC.nf --sample --tools -nextflow run SciLifeLab/Sarek/annotate.nf --tools (--annotateTools ||--annotateVCF ) -nextflow run SciLifeLab/Sarek/runMultiQC.nf +nextflow pull SciLifeLab/Sarek ``` -All parameters, options and variables can be specified with configuration files and profile (cf [configuration documentation](#profiles)). - -## Options - -### --callName `Name` - -Specify a name for MultiQC report (optional) - -### --contactMail `email` - -Specify an email for MultiQC report (optional) - -### --help - -Display help - -### --noReports - -Disable all QC tools and MultiQC to generate a HTML report. - -### --onlyQC - -Run only QC tools and MultiQC to generate a HTML report. - -### --outDir - -Choose an output directory - -### --project `ProjectID` - -Specify a project number ID on a UPPMAX cluster. -(optional if not on such a cluster) - -### --sample `file.tsv` - -Use the given TSV file as sample (cf [TSV documentation](TSV.md)). - -### --step `step` - -Choose from wich step the workflow will start. -Choose only one step. -Possible values are: - -- mapping (default, will start workflow with FASTQ files) -- recalibrate (will start workflow with BAM files and Recalibration Tables - -`--step` option is case insensitive to avoid easy introduction of errors when choosing a step. - -### --test - -Test run Sarek on a smaller dataset, that way you don't have to specify `--sample data/tsv/tiny.tsv` - -### --tools `tool1[,tool2,tool3...]` - -Choose which tools will be used in the workflow. -Different tools to be separated by commas. 
-Possible values are: - -- haplotypecaller (use `HaplotypeCaller` for VC) (germlineVC) -- manta (use `Manta` for SV) (germlineVC,somaticVC) -- strelka (use `Strelka` for VC) (germlineVC,somaticVC) -- ascat (use `ASCAT` for CNV) (somaticVC) -- mutect2 (use `MuTect2` for VC) (somaticVC) -- snpeff (use `snpEff` for Annotation) (annotate) -- vep (use `VEP` for Annotation) (annotate) - -`--tools` option is case insensitive to avoid easy introduction of errors when choosing tools. -So you can write `--tools mutect2,ascat` or `--tools MuTect2,ASCAT` without worrying about case sensitivity. - -### --annotateTools `tool1[,tool2,tool3...]` - -Choose which tools to annotate. -Different tools to be separated by commas. -Possible values are: -- haplotypecaller (Annotate `HaplotypeCaller` output) -- manta (Annotate `Manta` output) -- mutect2 (Annotate `MuTect2` output) -- strelka (Annotate `Strelka` output) - -### --annotateVCF `file1[,file2,file3...]` - -Choose vcf to annotate. -Different vcfs to be separated by commas. - -### --verbose - -Display more information about files being processed. - -## Containers - -### --containerPath `Path to the singularity containers (default=containers/)` - -### --repository `Docker-hub repository (default=maxulysse)` - -### --tag `tag of the containers to use (default=current version)` - -## References - -If needed, you can specify each reference file by command line. - -### --acLoci `acLoci file` - -### --bwaIndex `bwaIndex file` - -### --cosmic `cosmic file` - -### --cosmicIndex `cosmicIndex file` - -### --dbsnp `dbsnp file` +## Run the latest version -### --dbsnpIndex `dbsnpIndex file` +If there is a feature or bugfix you want to use in a resumed or re-analyzed run, you have to update the workflow to the latest version. +By default it is not updated automatically, so use something like: -### --genomeDict `genomeDict file` +```bash +nextflow run -latest SciLifeLab/Sarek/main.nf ... 
-resume +``` -### --genomeFile `genomeFile file` +## Not on Uppmax +The commands used in this guide are written for running on a cluster at Uppmax. +To run these examples on a different infrastructure, there are a few things that need to be changed. -### --genomeIndex `genomeIndex file` + - Most likely, the `slurm` profile is not suitable to use. + Find a more suitable one (or design your own) using the [configuration documentation](CONFIG.md) + - The path for where reference genomes are located (specified in the `--genome_base` parameter) needs to be modified. + Use the instructions in the [reference documentation](REFERENCES.md) to make sure all the reference files are available. -### --intervals `intervals file` -### --knownIndels `knownIndels file` +## Run a Germline Analysis +This section presents complete instructions for running a germline analysis using Sarek on a single sample. +Sarek will start the analysis by parsing a supplied input file in TSV format. +This file contains all the necessary information about the data and for the germline analysis it should have at least one line. +For more detailed information about how to construct TSV files for custom data, see [input documentation](INPUT.md). -### --knownIndelsIndex `knownIndelsIndex file` +For example, the file can be called `samples_germline.tsv` with the content (corresponding to columns: `subject gender status sample lane fastq1 fastq2`): -### --snpeffDb `snpeffDb file` +``` +SUBJECT_ID XX 0 SAMPLEID 1 /samples/normal_1.fastq.gz /samples/normal_2.fastq.gz +``` -## Parameters +The first workflow that will be run is contained in the `main.nf` file and performs the preprocessing step consisting of mapping, marking of duplicates and base recalibration. Running this command will launch a Nextflow process in the terminal which in turn submits jobs (processes) to the SLURM queue.
+``` +nextflow run SciLifeLab/Sarek/main.nf \ +--sample samples_germline.tsv \ +-profile slurm \ +--project \ +--genome_base /sw/data/uppnex/ToolBox/hg38bundle \ +--genome GRCh38 +``` -Simpler to specify in the configuration files, but it's still possible to specify every thing in the command line. +When the workflow has finished successfully it should print something similar to this: +``` +Completed at: Fri Aug 31 05:10:07 CEST 2018 +Duration : 1d 13h 24m 51s +Success : true +Exit status : 0 +``` +Make sure to check that the output states `Success : true` and not `Success : false`. +The results of the first step are located in the `Preprocessing` directory. +These files will be used in the next step, where the actual variant calling takes place. +Among other things, the preprocessing step should have created a new TSV file which is intended to be used as input for the variant calling step: +``` +nextflow run SciLifeLab/Sarek/germlineVC.nf \ +--sample Preprocessing/Recalibrated/recalibrated.tsv \ +-profile slurm \ +--project \ +--genome_base /sw/data/uppnex/ToolBox/hg38bundle \ +--genome GRCh38 \ +--tools HaplotypeCaller +``` +When successful (`Success : true`), this step should produce vcf file(s) within a `VariantCalling` directory. +The next workflow will annotate the found variants. +It is possible to specify the tools used for annotation (here VEP) and the variant-calling tools to use as input for annotation (here HaplotypeCaller). +``` +nextflow run SciLifeLab/Sarek/annotate.nf \ +--annotateTools HaplotypeCaller \ +-profile slurm \ +--project \ +--genome_base ~/Sarek/References/smallGRCh37 \ +--tools VEP +``` -### --runTime `time` +Finally, run MultiQC to get an easily accessible report of all your analysis.
+``` +nextflow run SciLifeLab/Sarek/runMultiQC.nf \ +-profile slurm \ +--project +``` +## Run a Somatic Analysis -### --singleCPUMem `memory` +This section presents complete instructions for running a somatic analysis using Sarek on two samples from the same individual. In this case one normal sample and one tumour sample will be used. However, Sarek can also accept more than one tumour sample (e.g. relapses) for the same individual. -### --totalMemory `memory` +Note: Four out of five of the steps included in this example are identical or very similar to the steps included in the germline analysis example. Therefore, much of the information in this example is redundant compared to the first example. -## Configuration and profiles +Sarek will start the analysis by parsing a supplied input file in TSV format. +This file contains all the necessary information about the data and for the somatic analysis it should have at least two lines. +These lines have columns corresponding to `subject gender status sample lane fastq1 fastq2`. +For more detailed information about how to construct TSV files for custom data, see [input documentation](INPUT.md). -More informations on the [Nextflow documentation](https://www.nextflow.io/docs/latest/config.html). -The default profile is `standard`. -You can use your own profile: +For example, the file can be called `samples_somatic.tsv` with the content: -```bash -nextflow run SciLifeLab/Sarek --sample mysample.tsv -profile myprofile ``` - -A standard profile is defined in [`nextflow.config`](https://github.com/SciLifeLab/Sarek/blob/master/nextflow.config).
-You can use the files in the [`conf/`](https://github.com/SciLifeLab/Sarek/tree/master/conf) directory as a base to make a new `.config` file that you can specify directly (or add as a profile): - -```bash -nextflow run SciLifeLab/Sarek --sample mysample.tsv -c conf/personnal.config +SUBJECT_ID XX 0 SAMPLEID1 1 /samples/normal_1.fastq.gz /samples/normal_2.fastq.gz +SUBJECT_ID XX 1 SAMPLEID2 1 /samples/tumour_1.fastq.gz /samples/tumour_2.fastq.gz +``` +The first workflow that will be run is contained in the `main.nf` file and performs the preprocessing step consisting of mapping, marking of duplicates and base recalibration. Running this command will launch a Nextflow process in the terminal which in turn submits jobs (processes) to the SLURM queue. +``` +nextflow run SciLifeLab/Sarek/main.nf \ +--sample samples_somatic.tsv \ +-profile slurm \ +--project \ +--genome_base /sw/data/uppnex/ToolBox/hg38bundle \ +--genome GRCh38 ``` -## Update to latest version - -To update workflow to the latest version use: - -```bash -nextflow pull SciLifeLab/Sarek +When the workflow has finished successfully it should print something similar to this: +``` +Completed at: Fri Aug 31 05:10:07 CEST 2018 +Duration : 1d 13h 24m 51s +Success : true +Exit status : 0 ``` -## Run the latest version +Make sure to check that the output states `Success : true` and not `Success : false`. +The results of the first step are located in the `Preprocessing` directory. +These files will be used in the next two steps, where the actual variant calling takes place. +Among other things, the preprocessing step should have created a new TSV file which is intended to be used as input for the variant calling steps: -If there is a feature or bugfix you want to use in a resumed or re-analyzed run, you have to update the workflow to the latest version.
-By default it is not updated automatically, so use something like: +``` +nextflow run SciLifeLab/Sarek/germlineVC.nf \ +--sample Preprocessing/Recalibrated/recalibrated.tsv \ +-profile slurm \ +--project \ +--genome_base /sw/data/uppnex/ToolBox/hg38bundle \ +--genome GRCh38 \ +--tools HaplotypeCaller +``` +When successful (`Success : true`), this step should produce vcf file(s) within a `VariantCalling` directory. +The first variant calling step is actually the one from the germline analysis. +This is included here since information regarding germline variants is still useful for analysis of somatic variants. +The next variant calling step is the somatic specific analysis: +``` +nextflow run SciLifeLab/Sarek/somaticVC.nf \ +--sample Preprocessing/Recalibrated/recalibrated.tsv \ +-profile slurm \ +--project \ +--genome_base /sw/data/uppnex/ToolBox/hg38bundle \ +--genome GRCh38 \ +--tools Strelka +``` +When successful (`Success : true`), this step should produce vcf file(s) within the `VariantCalling` directory separate from the germline vcf file. +The next workflow will annotate the found variants. +It is possible to specify the tools used for annotation (here VEP) and the variant-calling tools to use as input for annotation (here HaplotypeCaller and Strelka). +``` +nextflow run SciLifeLab/Sarek/annotate.nf \ +--annotateTools HaplotypeCaller,Strelka \ +-profile slurm \ +--project \ +--genome_base ~/Sarek/References/smallGRCh37 \ +--containerPath \ +--tools VEP +``` -```bash -nextflow run -latest SciLifeLab/Sarek/main.nf ... -resume +Finally, run MultiQC to get an easily accessible report of all your analysis. 
+``` +nextflow run SciLifeLab/Sarek/runMultiQC.nf \ +-profile slurm \ +--project ``` From afabfcc767d1460c3f4e4bcf1cc63c8ae8737f9c Mon Sep 17 00:00:00 2001 From: Johannes Alneberg Date: Wed, 12 Sep 2018 16:52:30 +0200 Subject: [PATCH 6/7] Updated changelog --- CHANGELOG.md | 2 ++ 1 file changed, 2 insertions(+) diff --git a/CHANGELOG.md b/CHANGELOG.md index 9b7750e652..93585f8f9e 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -12,6 +12,7 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0. - [#615](https://github.com/SciLifeLab/Sarek/pull/615) - Update documentation - [#620](https://github.com/SciLifeLab/Sarek/pull/620) - Add `tmp/` to `.gitignore` - [#625](https://github.com/SciLifeLab/Sarek/pull/625) - Add [`pathfindr`](https://github.com/NBISweden/pathfindr) as a submodule +- [#639](https://github.com/SciLifeLab/Sarek/pull/639) - Add a complete example analysis to docs ### `Changed` - [#608](https://github.com/SciLifeLab/Sarek/pull/608) - Update Nextflow required version @@ -24,6 +25,7 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.
- [#632](https://github.com/SciLifeLab/Sarek/pull/632) - Use 2 threads and 2 cpus FastQC processes - [#637](https://github.com/SciLifeLab/Sarek/pull/637) - Update tool version gathering - [#638](https://github.com/SciLifeLab/Sarek/pull/638) - Use correct `.simg` extension for Singularity images +- [#639](https://github.com/SciLifeLab/Sarek/pull/639) - Smaller refactoring of the docs ### `Removed` - [#616](https://github.com/SciLifeLab/Sarek/pull/616) - Remove old Issue Template From 6d6c4d3a1798a91a214a45e2204355bd949cf5c4 Mon Sep 17 00:00:00 2001 From: Johannes Alneberg Date: Wed, 12 Sep 2018 16:52:43 +0200 Subject: [PATCH 7/7] Updated README --- README.md | 13 +++++++------ 1 file changed, 7 insertions(+), 6 deletions(-) diff --git a/README.md b/README.md index d479dbd4ea..a6f1fccf1a 100644 --- a/README.md +++ b/README.md @@ -82,12 +82,13 @@ The Sarek pipeline comes with documentation in the `docs/` directory: 06. [Configuration and profiles documentation](https://github.com/SciLifeLab/Sarek/blob/master/docs/CONFIG.md) 07. [Intervals documentation](https://github.com/SciLifeLab/Sarek/blob/master/docs/INTERVALS.md) 08. [Running the pipeline](https://github.com/SciLifeLab/Sarek/blob/master/docs/USAGE.md) -09. [Examples](https://github.com/SciLifeLab/Sarek/blob/master/docs/USE_CASES.md) -10. [TSV file documentation](https://github.com/SciLifeLab/Sarek/blob/master/docs/TSV.md) -11. [Processes documentation](https://github.com/SciLifeLab/Sarek/blob/master/docs/PROCESS.md) -12. [Documentation about containers](https://github.com/SciLifeLab/Sarek/blob/master/docs/CONTAINERS.md) -13. [More information about ASCAT](https://github.com/SciLifeLab/Sarek/blob/master/docs/ASCAT.md) -14. [Output documentation structure](https://github.com/SciLifeLab/Sarek/blob/master/docs/OUTPUT.md) +09. [Command line parameters](https://github.com/SciLifeLab/Sarek/blob/master/docs/PARAMETERS.md) +10. [Examples](https://github.com/SciLifeLab/Sarek/blob/master/docs/USE_CASES.md) +11. 
[Input files documentation](https://github.com/SciLifeLab/Sarek/blob/master/docs/INPUT.md) +12. [Processes documentation](https://github.com/SciLifeLab/Sarek/blob/master/docs/PROCESS.md) +13. [Documentation about containers](https://github.com/SciLifeLab/Sarek/blob/master/docs/CONTAINERS.md) +14. [More information about ASCAT](https://github.com/SciLifeLab/Sarek/blob/master/docs/ASCAT.md) +15. [Output documentation structure](https://github.com/SciLifeLab/Sarek/blob/master/docs/OUTPUT.md) ## Contributions & Support