From c5ae7ecdb30abdeb1203a7c4a65945400f864f7f Mon Sep 17 00:00:00 2001
From: ggabernet
Date: Wed, 15 Apr 2020 18:45:11 +0200
Subject: [PATCH 01/12] improve imput.md

---
 docs/input.md | 22 +++++++++++++---------
 1 file changed, 13 insertions(+), 9 deletions(-)

diff --git a/docs/input.md b/docs/input.md
index 5443c8bc90..edea6b14ad 100644
--- a/docs/input.md
+++ b/docs/input.md
@@ -6,17 +6,17 @@ Input files for Sarek can be specified using a TSV file given to the `--input` c
 The TSV file is a Tab Separated Value file with columns:

 - `subject sex status sample lane fastq1 fastq2` for step `mapping` with paired-end FASTQs
-- `subject sex status sample lane bam` for step `mapping` with unmapped BAMs
-- `subject sex status sample bam bai recaltable` for step `recalibrate` with BAMs
+- `subject sex status sample lane bam` for step `mapping` with unmapped BAMs (uBAMs)
+- `subject sex status sample bam bai recaltable` for step `recalibrate` with mapped BAMs
 - `subject sex status sample bam bai` for step `variantcalling` with BAMs

 The content of these columns is quite straight-forward:

-- `subject` designate the subject, it should be the ID of the Patient, and it must design only one patient
+- `subject` designates the subject, it should be the ID of the Patient, and it must be unique for each patient
 - `sex` are the sex chromosomes of the Patient, (XX or XY)
-- `status` is the status of the Patient, (0 for Normal or 1 for Tumor)
-- `sample` designate the Sample, it should be the ID of the sample (it is possible to have more than one tumor sample for each patient, i.e. a tumor and a relapse), it must design only one sample
-- `lane` is used when the sample is multiplexed on several lanes, it must be unique for each lane in the same sample
+- `status` is the status of the measured sample, (0 for Normal or 1 for Tumor)
+- `sample` designates the Sample, it should be the ID of the sample (it is possible to have more than one tumor sample for each patient, i.e. a tumor and a relapse), it must be unique for each sample
+- `lane` is used when the sample is multiplexed on several lanes, it must be unique for each lane in the same sample (but does not need to be the original lane name), and must contain at least one letter
 - `fastq1` is the path to the first pair of the fastq file
 - `fastq2` is the path to the second pair of the fastq file
 - `bam` is the bam file
@@ -24,12 +24,12 @@ The content of these columns is quite straight-forward:
 - `recaltable` is the recalibration table

 It is recommended to add the absolute path of the files, but relative path should work also.
-Note, the delimiter is the tab (`\t`) character:
+Note, the delimiter is the tab (`\t`) character.

 All examples are given for a normal/tumor pair.
-If no tumors are listed in the TSV file, then the workflow will proceed as if it is a normal sample instead of a normal/tumor pair.
+If no tumors are listed in the TSV file, then the workflow will proceed as if it is a normal sample instead of a normal/tumor pair, producing the germline variant calling results only.

-Sarek will output results is a different directory for each sample.
+Sarek will output results in a different directory for each sample.
 If multiple samples are specified in the TSV file, Sarek will consider all files to be from different samples.
 Multiple TSV files can be specified if the path is enclosed in quotes.

@@ -117,6 +117,8 @@ G15511    XX    0    C09DFN    pathToFiles/G15511.C09DFN.md.bam    pathToFiles/G
 G15511    XX    1    D0ENMT    pathToFiles/G15511.D0ENMT.md.bam    pathToFiles/G15511.D0ENMT.md.bai    pathToFiles/G15511.D0ENMT.md.recal.table
 ```

+When starting Sarek from the mapping step, a TSV file is generated automatically after the MarkDuplicates process. This TSV file is stored under `results/Preprocessing/TSV/duplicateMarked.tsv` and can be used to restart Sarek from the non-recalibrated BAM files, giving it as `--input` and setting the step `--step recalibrate`.
+
 ## Example TSV file for a normal/tumor pair with recalibrated BAM files (step variantcalling)

 The same way, if you have recalibrated BAMs and their indexes, you should use a structure like:
@@ -126,6 +128,8 @@ G15511    XX    0    C09DFN    pathToFiles/G15511.C09DFN.md.recal.bam    pathToF
 G15511    XX    1    D0ENMT    pathToFiles/G15511.D0ENMT.md.recal.bam    pathToFiles/G15511.D0ENMT.md.recal.bai
 ```

+When starting Sarek from the mapping or recalibrate steps, a TSV file is generated automatically after the recalibration processes. This TSV file is stored under `results/Preprocessing/TSV/recalibrated.tsv` and can be used to restart Sarek from the recalibrated BAM files, giving it as `--input` and setting the step `--step variantcalling`.
+
 ## VCF files for annotation

 Input files for Sarek can be specified using the path to a VCF directory given to the `--input` command only with the `annotate` step.

From 28cd6da0db6113f7cbc274dbab7af0e61af9132b Mon Sep 17 00:00:00 2001
From: ggabernet
Date: Wed, 15 Apr 2020 18:49:13 +0200
Subject: [PATCH 02/12] update changelog

---
 CHANGELOG.md | 1 +
 1 file changed, 1 insertion(+)

diff --git a/CHANGELOG.md b/CHANGELOG.md
index 90ba06b799..98f7ab620c 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -56,6 +56,7 @@ Piellorieppe is one of the main massif in the Sarek National Park.
 - [#152](https://github.com/nf-core/sarek/pull/152), [#158](https://github.com/nf-core/sarek/pull/158), [#164](https://github.com/nf-core/sarek/pull/164), [#174](https://github.com/nf-core/sarek/pull/174) - Update docs
 - [#164](https://github.com/nf-core/sarek/pull/164) - Update `gatk4-spark` from `4.1.4.1` to `4.1.6.0`
 - [#180](https://github.com/nf-core/sarek/pull/180) - Improve minimal setting
+- [#183](https://github.com/nf-core/sarek/pull/183) - Update input.md documentation

 ### Fixed - [2.6dev]


From 97980407d81c94f7614172d89bba123739a1df3e Mon Sep 17 00:00:00 2001
From: Gisela Gabernet
Date: Thu, 16 Apr 2020 09:19:31 +0200
Subject: [PATCH 03/12] Update docs/input.md

Co-Authored-By: Maxime Garcia
---
 docs/input.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/input.md b/docs/input.md
index edea6b14ad..05006035f2 100644
--- a/docs/input.md
+++ b/docs/input.md
@@ -7,7 +7,7 @@ The TSV file is a Tab Separated Value file with columns:

 - `subject sex status sample lane fastq1 fastq2` for step `mapping` with paired-end FASTQs
 - `subject sex status sample lane bam` for step `mapping` with unmapped BAMs (uBAMs)
-- `subject sex status sample bam bai recaltable` for step `recalibrate` with mapped BAMs
+- `subject sex status sample bam bai recaltable` for step `recalibrate` with mapped BAMs and corresponding recalibration table
 - `subject sex status sample bam bai` for step `variantcalling` with BAMs

 The content of these columns is quite straight-forward:

From 58a7cd2dd6446a27c013b507f2188601f93cb43d Mon Sep 17 00:00:00 2001
From: Gisela Gabernet
Date: Thu, 16 Apr 2020 09:20:47 +0200
Subject: [PATCH 04/12] Update docs/input.md

Co-Authored-By: Maxime Garcia
---
 docs/input.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/input.md b/docs/input.md
index 05006035f2..5172345fe3 100644
--- a/docs/input.md
+++ b/docs/input.md
@@ -16,7 +16,7 @@ The content of these columns is quite straight-forward:
 - `sex` are the sex chromosomes of the Patient, (XX or XY)
 - `status` is the status of the measured sample, (0 for Normal or 1 for Tumor)
 - `sample` designates the Sample, it should be the ID of the sample (it is possible to have more than one tumor sample for each patient, i.e. a tumor and a relapse), it must be unique for each sample
-- `lane` is used when the sample is multiplexed on several lanes, it must be unique for each lane in the same sample (but does not need to be the original lane name), and must contain at least one letter
+- `lane` is used when the sample is multiplexed on several lanes, it must be unique for each lane in the same sample (but does not need to be the original lane name), and must contain at least one character
 - `fastq1` is the path to the first pair of the fastq file
 - `fastq2` is the path to the second pair of the fastq file
 - `bam` is the bam file

From e14459edfd752923fd8a65244b9ff7eb584486ea Mon Sep 17 00:00:00 2001
From: Gisela Gabernet
Date: Thu, 16 Apr 2020 09:21:16 +0200
Subject: [PATCH 05/12] Update docs/input.md

Co-Authored-By: Maxime Garcia
---
 docs/input.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/input.md b/docs/input.md
index 5172345fe3..17821813aa 100644
--- a/docs/input.md
+++ b/docs/input.md
@@ -117,7 +117,7 @@ G15511    XX    0    C09DFN    pathToFiles/G15511.C09DFN.md.bam    pathToFiles/G
 G15511    XX    1    D0ENMT    pathToFiles/G15511.D0ENMT.md.bam    pathToFiles/G15511.D0ENMT.md.bai    pathToFiles/G15511.D0ENMT.md.recal.table
 ```

-When starting Sarek from the mapping step, a TSV file is generated automatically after the MarkDuplicates process. This TSV file is stored under `results/Preprocessing/TSV/duplicateMarked.tsv` and can be used to restart Sarek from the non-recalibrated BAM files, giving it as `--input` and setting the step `--step recalibrate`.
+When starting Sarek from the mapping step, a TSV file is generated automatically after the `MarkDuplicates` process. This TSV file is stored under `results/Preprocessing/TSV/duplicateMarked.tsv` and can be used to restart Sarek from the non-recalibrated BAM files, giving it as `--input` and setting the step `--step recalibrate`.

 ## Example TSV file for a normal/tumor pair with recalibrated BAM files (step variantcalling)


From 695f52fdc9c00aef4b59632b5f114ecfa1304216 Mon Sep 17 00:00:00 2001
From: ggabernet
Date: Thu, 16 Apr 2020 10:11:42 +0200
Subject: [PATCH 06/12] more refactoring input.md

---
 docs/input.md | 64 +++++++++++++++++++++++++++++++++++---------------
 1 file changed, 44 insertions(+), 20 deletions(-)

diff --git a/docs/input.md b/docs/input.md
index 17821813aa..4886e7725e 100644
--- a/docs/input.md
+++ b/docs/input.md
@@ -1,16 +1,10 @@
 # Input Documentation

-## Information about the TSV files
+## General information about the TSV files

 Input files for Sarek can be specified using a TSV file given to the `--input` command.
-The TSV file is a Tab Separated Value file with columns:
-
-- `subject sex status sample lane fastq1 fastq2` for step `mapping` with paired-end FASTQs
-- `subject sex status sample lane bam` for step `mapping` with unmapped BAMs (uBAMs)
-- `subject sex status sample bam bai recaltable` for step `recalibrate` with mapped BAMs and corresponding recalibration table
-- `subject sex status sample bam bai` for step `variantcalling` with BAMs
-
-The content of these columns is quite straight-forward:
+There are different kinds of TSV files that can be used as input, depending on the input files available (fastq, uBAM, BAM...).
+For all possible TSV files, described in the next sections, here is an explanation of what the columns refer to:

 - `subject` designates the subject, it should be the ID of the Patient, and it must be unique for each patient
 - `sex` are the sex chromosomes of the Patient, (XX or XY)
@@ -35,7 +29,19 @@ Multiple TSV files can be specified if the path is enclosed in quotes.

 Somatic variant calling output will be in a specific directory for each normal/tumor pair.

-## Example TSV file for a normal/tumor pair with FASTQ files (step mapping)
+## Starting from the mapping step
+
+When starting from the mapping step (`--step mapping`), the first step of Sarek, the input can have three different forms:
+
+- A TSV file containing the sample metadata and the path to the fastq files.
+- The Path to a directory containing the fastq files
+- A TSV file containing the sample metadata and the path to the unmapped BAM (uBAM) files.
+
+### Providing a TSV file with the path to FASTQ files
+
+The TSV file to start with the step mapping with paired-end FASTQs should contain the columns:
+
+`subject sex status sample lane fastq1 fastq2`

 In this sample for the normal case there are 3 read groups, and 2 for the tumor.

@@ -47,7 +53,7 @@ G15511    XX    1    D0ENMT    D0ENM_1    pathToFiles/D0ENMACXX111207.1_1.fastq.
 G15511    XX    1    D0ENMT    D0ENM_2    pathToFiles/D0ENMACXX111207.2_1.fastq.gz    pathToFiles/D0ENMACXX111207.2_2.fastq.gz
 ```

-## Path to a FASTQ directory for a single normal sample (step mapping)
+### Providing the path to a FASTQ directory

 Input files for Sarek can be specified using the path to a FASTQ directory given to the `--input` command only with the `mapping` step.

@@ -55,9 +61,9 @@ Input files for Sarek can be specified using the path to a FASTQ directory given
 nextflow run nf-core/sarek --input pathToDirectory ...
 ```

-### Input FASTQ file name best practices
+#### Input FASTQ file name best practices

-The input folder, containing the FASTQ files for one individual (ID) should be organized into one sub-folder for every sample.
+The input folder, containing the FASTQ files for one subject (ID) should be organized into one sub-folder for every sample.
 All fastq files for that sample should be collected here.

 ```text
@@ -96,7 +102,11 @@ Read group information will be parsed from fastq file names according to this:
 - `PU` = sample
 - `RGLB` = lib

-## Example TSV file for a normal/tumor pair with uBAM files (step mapping)
+### Providing a TSV file with the paths to uBAM files
+
+The TSV (Tab Separated Values) file for starting the mapping from uBAM files should contain the columns:
+
+- `subject sex status sample lane bam`

 In this sample for the normal case there are 3 read groups, and 2 for the tumor.

@@ -108,7 +118,12 @@ G15511    XX    1    D0ENMT    D0ENM_1    pathToFiles/D0ENMAC_1.bam
 G15511    XX    1    D0ENMT    D0ENM_2    pathToFiles/D0ENMAC_2.bam
 ```

-## Example TSV file for a normal/tumor pair with non recalibrated BAM files (step recalibrate)
+## Starting from the BAM recalibration step
+
+To start from the recalibration step (`--step recalibrate`), a TSV file for a normal/tumor pair needs to be given as input containing the paths to the non recalibrated but already mapped BAM files.
+The TSV needs to contain the following columns:
+
+- `subject sex status sample bam bai recaltable`

 The same way, if you have non recalibrated BAMs, their indexes and their recalibration tables, you should use a structure like:

@@ -117,22 +132,31 @@ G15511    XX    0    C09DFN    pathToFiles/G15511.C09DFN.md.bam    pathToFiles/G
 G15511    XX    1    D0ENMT    pathToFiles/G15511.D0ENMT.md.bam    pathToFiles/G15511.D0ENMT.md.bai    pathToFiles/G15511.D0ENMT.md.recal.table
 ```

-When starting Sarek from the mapping step, a TSV file is generated automatically after the `MarkDuplicates` process. This TSV file is stored under `results/Preprocessing/TSV/duplicateMarked.tsv` and can be used to restart Sarek from the non-recalibrated BAM files, giving it as `--input` and setting the step `--step recalibrate`.
+When starting Sarek from the mapping step, a TSV file is generated automatically after the `MarkDuplicates` process. This TSV file is stored under `results/Preprocessing/TSV/duplicateMarked.tsv` and can be used to restart Sarek from the non-recalibrated BAM files. Setting the step `--step recalibrate` will automatically take this file as input.
+
+Additionally, individual TSV files for each sample (`duplicateMarked_[SAMPLE].tsv`) can be found in the same directory.

-## Example TSV file for a normal/tumor pair with recalibrated BAM files (step variantcalling)
+## Starting from the variant calling step

-The same way, if you have recalibrated BAMs and their indexes, you should use a structure like:
+A TSV file for a normal/tumor pair with recalibrated BAM files and their indexes can be provided to start Sarek from the variant calling step (`--step variantcalling`).
+The TSV file should contain the columns:
+
+- `subject sex status sample bam bai`
+
+Here is an example for two samples from the same subject:

 ```text
 G15511    XX    0    C09DFN    pathToFiles/G15511.C09DFN.md.recal.bam    pathToFiles/G15511.C09DFN.md.recal.bai
 G15511    XX    1    D0ENMT    pathToFiles/G15511.D0ENMT.md.recal.bam    pathToFiles/G15511.D0ENMT.md.recal.bai
 ```

-When starting Sarek from the mapping or recalibrate steps, a TSV file is generated automatically after the recalibration processes. This TSV file is stored under `results/Preprocessing/TSV/recalibrated.tsv` and can be used to restart Sarek from the recalibrated BAM files, giving it as `--input` and setting the step `--step variantcalling`.
+When starting Sarek from the mapping or recalibrate steps, a TSV file is generated automatically after the recalibration processes. This TSV file is stored under `results/Preprocessing/TSV/recalibrated.tsv` and can be used to restart Sarek from the recalibrated BAM files. Setting the step `--step variantcalling` will automatically take this file as input.
+
+Additionally, individual TSV files for each sample (`recalibrated_[SAMPLE].tsv`) can be found in the same directory.

 ## VCF files for annotation

-Input files for Sarek can be specified using the path to a VCF directory given to the `--input` command only with the `annotate` step.
+Input files for Sarek can be specified using the path to a VCF directory given to the `--input` command only with the annotation step (`--step annotate`).
 As Sarek will use `bgzip` and `tabix` to compress and index VCF files annotated, it expects VCF files to be sorted.
 Multiple VCF files can be specified, using a [glob path](https://docs.oracle.com/javase/tutorial/essential/io/fileOps.html#glob), if enclosed in quotes.
 For example:

From d903bb333f3ea4417ae6b0345a7ab9636f90ee30 Mon Sep 17 00:00:00 2001
From: Gisela Gabernet
Date: Thu, 16 Apr 2020 12:24:30 +0200
Subject: [PATCH 07/12] Update docs/input.md

Co-Authored-By: Maxime Garcia
---
 docs/input.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/input.md b/docs/input.md
index 4886e7725e..a2322209f8 100644
--- a/docs/input.md
+++ b/docs/input.md
@@ -9,7 +9,7 @@ For all possible TSV files, described in the next sections, here is an explanati
 - `subject` designates the subject, it should be the ID of the Patient, and it must be unique for each patient
 - `sex` are the sex chromosomes of the Patient, (XX or XY)
 - `status` is the status of the measured sample, (0 for Normal or 1 for Tumor)
-- `sample` designates the Sample, it should be the ID of the sample (it is possible to have more than one tumor sample for each patient, i.e. a tumor and a relapse), it must be unique for each sample
+- `sample` designates the sample, it should be the ID of the sample (it is possible to have more than one tumor sample for each patient, i.e. a tumor and a relapse), it must be unique for each sample
 - `lane` is used when the sample is multiplexed on several lanes, it must be unique for each lane in the same sample (but does not need to be the original lane name), and must contain at least one character
 - `fastq1` is the path to the first pair of the fastq file
 - `fastq2` is the path to the second pair of the fastq file

From 69164c26c6ea412c748733b8b065c25b9bea3c13 Mon Sep 17 00:00:00 2001
From: Gisela Gabernet
Date: Thu, 16 Apr 2020 12:25:09 +0200
Subject: [PATCH 08/12] Update docs/input.md

Co-Authored-By: Maxime Garcia
---
 docs/input.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/input.md b/docs/input.md
index a2322209f8..f62b7dca5a 100644
--- a/docs/input.md
+++ b/docs/input.md
@@ -13,7 +13,7 @@ For all possible TSV files, described in the next sections, here is an explanati
 - `lane` is used when the sample is multiplexed on several lanes, it must be unique for each lane in the same sample (but does not need to be the original lane name), and must contain at least one character
 - `fastq1` is the path to the first pair of the fastq file
 - `fastq2` is the path to the second pair of the fastq file
-- `bam` is the bam file
+- `bam` is the path to the bam file
 - `bai` is the bam index file
 - `recaltable` is the recalibration table


From 82d7b467daae90da668719ef4edc258ca6bbb936 Mon Sep 17 00:00:00 2001
From: Gisela Gabernet
Date: Thu, 16 Apr 2020 12:25:15 +0200
Subject: [PATCH 09/12] Update docs/input.md

Co-Authored-By: Maxime Garcia
---
 docs/input.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/input.md b/docs/input.md
index f62b7dca5a..3d0831e355 100644
--- a/docs/input.md
+++ b/docs/input.md
@@ -14,7 +14,7 @@ For all possible TSV files, described in the next sections, here is an explanati
 - `fastq1` is the path to the first pair of the fastq file
 - `fastq2` is the path to the second pair of the fastq file
 - `bam` is the path to the bam file
-- `bai` is the bam index file
+- `bai` is the path to the bam index file
 - `recaltable` is the recalibration table

 It is recommended to add the absolute path of the files, but relative path should work also.

From 847d78e4058b3c8fb452cb0a04ad1227f6379a66 Mon Sep 17 00:00:00 2001
From: Gisela Gabernet
Date: Thu, 16 Apr 2020 12:25:21 +0200
Subject: [PATCH 10/12] Update docs/input.md

Co-Authored-By: Maxime Garcia
---
 docs/input.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/input.md b/docs/input.md
index 3d0831e355..5d6b62bb98 100644
--- a/docs/input.md
+++ b/docs/input.md
@@ -15,7 +15,7 @@ For all possible TSV files, described in the next sections, here is an explanati
 - `fastq2` is the path to the second pair of the fastq file
 - `bam` is the path to the bam file
 - `bai` is the path to the bam index file
-- `recaltable` is the recalibration table
+- `recaltable` is the path to the recalibration table

 It is recommended to add the absolute path of the files, but relative path should work also.
 Note, the delimiter is the tab (`\t`) character.

From c55087611fe5c649191f67dae88dd215510cc603 Mon Sep 17 00:00:00 2001
From: Gisela Gabernet
Date: Thu, 16 Apr 2020 12:35:37 +0200
Subject: [PATCH 11/12] Update docs/input.md

Co-Authored-By: Maxime Garcia
---
 docs/input.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/input.md b/docs/input.md
index 5d6b62bb98..9460111254 100644
--- a/docs/input.md
+++ b/docs/input.md
@@ -9,7 +9,7 @@ For all possible TSV files, described in the next sections, here is an explanati
 - `subject` designates the subject, it should be the ID of the Patient, and it must be unique for each patient
 - `sex` are the sex chromosomes of the Patient, (XX or XY)
 - `status` is the status of the measured sample, (0 for Normal or 1 for Tumor)
-- `sample` designates the sample, it should be the ID of the sample (it is possible to have more than one tumor sample for each patient, i.e. a tumor and a relapse), it must be unique for each sample
+- `sample` designates the sample, it should be the ID of the sample (it is possible to have more than one tumor sample for each patient, i.e. a tumor and a relapse), it must be unique, but samples can have multiple lanes (which will later be merged)
 - `lane` is used when the sample is multiplexed on several lanes, it must be unique for each lane in the same sample (but does not need to be the original lane name), and must contain at least one character
 - `fastq1` is the path to the first pair of the fastq file
 - `fastq2` is the path to the second pair of the fastq file

From 4f5c2d035c731aa5c5651a0b1a085de3746f8c5d Mon Sep 17 00:00:00 2001
From: Maxime Garcia
Date: Thu, 16 Apr 2020 12:37:24 +0200
Subject: [PATCH 12/12] Update docs/input.md

Co-Authored-By: Gisela Gabernet
---
 docs/input.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/input.md b/docs/input.md
index 9460111254..6ce8f34bea 100644
--- a/docs/input.md
+++ b/docs/input.md
@@ -6,7 +6,7 @@ Input files for Sarek can be specified using a TSV file given to the `--input` c
 There are different kinds of TSV files that can be used as input, depending on the input files available (fastq, uBAM, BAM...).
 For all possible TSV files, described in the next sections, here is an explanation of what the columns refer to:

-- `subject` designates the subject, it should be the ID of the Patient, and it must be unique for each patient
+- `subject` designates the subject, it should be the ID of the patient, and it must be unique for each patient, but one patient can have multiple samples (e.g. normal and tumor)
 - `sex` are the sex chromosomes of the Patient, (XX or XY)
 - `status` is the status of the measured sample, (0 for Normal or 1 for Tumor)
 - `sample` designates the sample, it should be the ID of the sample (it is possible to have more than one tumor sample for each patient, i.e. a tumor and a relapse), it must be unique, but samples can have multiple lanes (which will later be merged)
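
Reviewer note: the column rules that this series documents in `docs/input.md` (`sex` is XX or XY, `status` is 0 or 1, `lane` is non-empty and unique within a sample, and each step has a fixed column layout) could be checked mechanically before launching a run. The following is a minimal Python sketch under those assumptions. It is not part of Sarek; `COLUMNS` and `check_tsv` are hypothetical names introduced here for illustration, and the check covers only the rules quoted in the patches above.

```python
import csv
import io

# Column layouts per step, as listed in the patched docs/input.md.
# Keyed by (step, number of columns) because the mapping step accepts two layouts.
BASE = ["subject", "sex", "status", "sample"]
COLUMNS = {
    ("mapping", 7): BASE + ["lane", "fastq1", "fastq2"],
    ("mapping", 6): BASE + ["lane", "bam"],
    ("recalibrate", 7): BASE + ["bam", "bai", "recaltable"],
    ("variantcalling", 6): BASE + ["bam", "bai"],
}


def check_tsv(text, step):
    """Return a list of problems found in a Sarek-style TSV (hypothetical helper)."""
    problems = []
    seen_lanes = set()
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    for lineno, row in enumerate(reader, start=1):
        if not row:
            continue  # skip empty lines
        names = COLUMNS.get((step, len(row)))
        if names is None:
            problems.append(f"line {lineno}: {len(row)} columns is not valid for step {step}")
            continue
        rec = dict(zip(names, row))
        if rec["sex"] not in ("XX", "XY"):
            problems.append(f"line {lineno}: sex must be XX or XY")
        if rec["status"] not in ("0", "1"):
            problems.append(f"line {lineno}: status must be 0 (normal) or 1 (tumor)")
        if "lane" in rec:
            key = (rec["sample"], rec["lane"])
            if not rec["lane"]:
                problems.append(f"line {lineno}: lane must contain at least one character")
            elif key in seen_lanes:
                problems.append(f"line {lineno}: lane {rec['lane']} repeated in sample {rec['sample']}")
            seen_lanes.add(key)
    return problems
```

Calling `check_tsv(open("samples.tsv").read(), "mapping")` (the file name is only an example) returns an empty list when every row satisfies the documented rules. File existence and uniqueness of subject IDs are deliberately left out of this sketch.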