From b4f54a32ba5d3396cdf1a334d8070420642181ae Mon Sep 17 00:00:00 2001 From: ameynert Date: Tue, 28 Mar 2023 12:08:44 +0100 Subject: [PATCH 1/8] Added how to generate ASCAT reference files for exome/targeted sequencing --- docs/usage.md | 111 +++++++++++++++++++++++++++++++++++++++++++++++--- 1 file changed, 106 insertions(+), 5 deletions(-) diff --git a/docs/usage.md b/docs/usage.md index 445cd9730a..a778b1d974 100644 --- a/docs/usage.md +++ b/docs/usage.md @@ -608,14 +608,115 @@ This list is by no means exhaustive and it will depend on the specific analysis While the ASCAT implementation in sarek is capable of running with whole-exome sequencing data, the needed references are currently not provided with the igenomes.config. According to the [developers](https://github.com/VanLoo-lab/ascat/issues/97) of ASCAT, loci and allele files (one file per chromosome) can be downloaded directly from the [Battenberg repository](https://ora.ox.ac.uk/objects/uuid:08e24957-7e76-438a-bd38-66c48008cf52). -The GC correction file needs to be derived, so one has to concatenate all chromosomes into a single file and modify the header so it fits [this example](https://github.com/VanLoo-lab/ascat/tree/master/LogRcorrection#gc-correction-file-creation). +Please note that: +* Row names (for GC and RT correction files) should be `${chr}_${position}` (there is no SNP/probe ID for HTS data). +* All row names in GC and RT correction files should also appear in the loci files +* Loci and allele files must contain the same set of SNPs +* ASCAT developers strongly recommend using a BED file for WES/TS data. This prevents considering SNPs covered by off-target reads that would add noise to log/BAF tracks. +* The total number of GC correction loci in a sample must be at least 10% of the number of loci with logR values. If the number of GC correction loci is too small compared to the total number of loci, ASCAT will throw an error. -The RT correction file is missing for hg38 but can be derived using [ASCAT scripts](https://github.com/VanLoo-lab/ascat/tree/master/LogRcorrection#replication-timing-correction-file-creation) for hg19. For hg38, one needs to lift-over hg38 to hg19, run the script on hg19 positions and set coordinates back to hg38. +From 'Reference files' https://github.com/VanLoo-lab/ascat: -Please note that: +> For WES and targeted sequencing, we recommend using the reference files (loci, allele and logR correction files) as part of the Battenberg package. Because they require a high-resolution input, our reference files for WGS are not suitable for WES and targeted sequencing. For WES, loci and allele files from the Battenberg package can be fed into ascat.prepareHTS. For targeted sequencing, allele files from the Battenberg package can be fed into ascat.prepareTargetedSeq, which will generate cleaned loci and allele files that can be fed into ascat.prepareHTS. + +### How to generate ASCAT resources for exome or targeted sequencing + +1. Fetch the GC content correction and replication timing (RT) correction files from the (Dropbox links provided by the ASCAT developers)[https://github.com/VanLoo-lab/ascat/tree/master/ReferenceFiles/WGS] and intersect the SNP coordinates with the exome target coordinates. If the target file has 'chr' prefixes, make a copy with these removed first. Extract the GC and RT information for only the on target SNPs and zip the results. + +``` +sed -e 's/chr//' targets_with_chr.bed > targets.bed + +for t in GC RT +do + unzip ${t}_G1000_hg38.zip + + cut -f 1-3 ${t}_G1000_hg38.txt > ascat_${t}_snps_hg38.txt + tail -n +2 ascat_${t}_snps_hg38.txt | awk '{ print $2 "\t" $3-1 "\t" $3 "\t" $1 }' > ascat_${t}_snps_hg38.bed + bedtools intersect -a ascat_${t}_snps_hg38.bed -b targets.bed | awk '{ print $1 "_" $3 }' > ascat_${t}_snps_on_target_hg38.txt + + head -n 1 ${t}_G1000_hg38.txt > ${t}_G1000_on_target_hg38.txt + grep -f ascat_${t}_snps_on_target_hg38.txt ${t}_G1000_hg38.txt >> ${t}_G1000_on_target_hg38.txt + zip ${t}_G1000_on_target_hg38.zip ${t}_G1000_on_target_hg38.txt + + rm ${t}_G1000_hg38.zip +done +``` + +2. Download the Battenberg 1000G loci and alleles files. The steps below follow downloading from the (Battenberg repository at the Oxford University Research Archive)[https://ora.ox.ac.uk/objects/uuid:08e24957-7e76-438a-bd38-66c48008cf52]. The files are also available via Dropbox links from the same page as the GC and RT correction files above. + +``` +wget https://ora.ox.ac.uk/objects/uuid:08e24957-7e76-438a-bd38-66c48008cf52/files/rt435gd52w +mv rt345gd52w battenberg.zip +jar xf battenberg.zip + +unzip 1000G_loci_hg38.zip +cd 1000G_loci_hg38 +mkdir battenberg_alleles_on_target_hg38 +mv *allele* battenberg_alleles_on_target_hg38/ +mkdir battenberg_loci_on_target_hg38 +mv *loci* battenberg_loci_on_target_hg38/ +``` + +3. Copy the `targets.bed` and `ascat_GC_snps_on_target_hg38.txt` files into the newly created `battenberg_loci_on_target_hg38` folder before running the next set of steps. ASCAT generates a list of GC correction loci with sufficient coverage in a sample, then intersects that with the list of all loci with tumour logR values in that sample. If the intersection is <10% the size of the latter, it will fail with an error. Because the Battenberg loci/allele sets are very dense, subsetting to on-target regions is still too many loci. This script ensures that all SNPs with GC correction information are included in the loci list, plus a random sample of another 30% of all on target loci. You may need to vary this proportion depending on your set of targets. A good rule of thumb is that the size of your GC correction loci list should be about 15% the size of your total loci list. This allows for a margin of error. + +``` +cd battenberg_loci_on_target_hg38/ +rm *chrstring* +rm 1kg.phase3.v5a_GRCh38nounref_loci_chr23.txt +for i in {1..22} X +do + awk '{ print $1 "\t" $2-1 "\t" $2 }' 1kg.phase3.v5a_GRCh38nounref_loci_chr${i}.txt > chr${i}.bed + grep "^${i}_" ascat_GC_snps_on_target_hg38.txt > chr${i}.txt + bedtools intersect -a chr${i}.bed -b targets.bed | awk '{ print $1 "_" $3 }' > chr${i}_on_target.txt + n=`wc -l chr${i}_on_target.txt | awk '{ print $1 }'` + count=$((n * 3 / 10)) + grep -xf chr${i}.txt chr${i}_on_target.txt > chr${i}.temp + shuf -n $count chr${i}_on_target.txt >> chr${i}.temp + sort -n -k2 -t '_' chr${i}.temp | uniq > battenberg_loci_on_target_hg38_chr${i}.txt +done +zip battenberg_loci_on_target_hg38.zip battenberg_loci_on_target_hg38_chr*.txt +``` + +4. Extract the alleles for the same set of SNPs. Uses a short R script defined below. -Row names (for GC and RT correction files) should be `${chr}_${position}` (there is no SNP/probe ID for HTS data). -ASCAT developers strongly recommend using a BED file for WES/TS data. This prevents considering SNPs covered by off-targeted reads that would add noise to log/BAF tracks. +``` +cd ../battenberg_alleles_on_target_hg38/ +rm 1kg.phase3.v5a_GRCh38nounref_allele_index_chr23.txt +for i in {1..22} X +do + Rscript intersect_ascat_alleles.R ../battenberg_loci_on_target_hg38/battenberg_loci_on_target_hg38_chr${i}.txt \ + 1kg.phase3.v5a_GRCh38nounref_allele_index_chr${i}.txt battenberg_alleles_on_target_hg38_chr${i}.txt +done +zip battenberg_alleles_on_target_hg38.zip battenberg_alleles_on_target_hg38_chr*.txt +``` + +Rscript `intersect_ascat_alleles.R` + +``` +#!/usr/bin/env Rscript + +args = commandArgs(trailingOnly=TRUE) + +loci = read.table(args[1], header=F, sep="_", stringsAsFactors=F) +alleles = read.table(args[2], header=T, sep="\t", stringsAsFactors=F) + +i = intersect(loci$V2, alleles$position) + +out = subset(alleles, alleles$position %in% i) +write.table(out, args[3], col.names=T, row.names=F, quote=F, sep="\t") +``` + +5. Move or copy all of the zip files you've created to a suitable location. Specify these in your parameters, e.g. + +``` +{ + "ascat_alleles": "/path/to/battenberg_alleles_on_target_hg38.zip", + "ascat_loci": "/path/to/battenberg_loci_on_target_hg38.zip", + "ascat_loci_gc": "/path/to/GC_G1000_on_target_hg38.zip", + "ascat_loci_rt": "/path/to/RT_G1000_on_target_hg38.zip" +} + +``` ## What are the bwa/bwa-mem2 parameters? From 7cf1d3b90dace0ecfe902638866b12e939c1eaa9 Mon Sep 17 00:00:00 2001 From: ameynert Date: Tue, 28 Mar 2023 12:12:24 +0100 Subject: [PATCH 2/8] Fixed link formatting --- docs/usage.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/usage.md b/docs/usage.md index a778b1d974..f177760071 100644 --- a/docs/usage.md +++ b/docs/usage.md @@ -621,7 +621,7 @@ From 'Reference files' https://github.com/VanLoo-lab/ascat: ### How to generate ASCAT resources for exome or targeted sequencing -1. Fetch the GC content correction and replication timing (RT) correction files from the (Dropbox links provided by the ASCAT developers)[https://github.com/VanLoo-lab/ascat/tree/master/ReferenceFiles/WGS] and intersect the SNP coordinates with the exome target coordinates. If the target file has 'chr' prefixes, make a copy with these removed first. Extract the GC and RT information for only the on target SNPs and zip the results. +1. Fetch the GC content correction and replication timing (RT) correction files from the [Dropbox links provided by the ASCAT developers](https://github.com/VanLoo-lab/ascat/tree/master/ReferenceFiles/WGS) and intersect the SNP coordinates with the exome target coordinates. If the target file has 'chr' prefixes, make a copy with these removed first. Extract the GC and RT information for only the on target SNPs and zip the results. ``` sed -e 's/chr//' targets_with_chr.bed > targets.bed @@ -642,7 +642,7 @@ do done ``` -2. Download the Battenberg 1000G loci and alleles files. The steps below follow downloading from the (Battenberg repository at the Oxford University Research Archive)[https://ora.ox.ac.uk/objects/uuid:08e24957-7e76-438a-bd38-66c48008cf52]. The files are also available via Dropbox links from the same page as the GC and RT correction files above. +2. Download the Battenberg 1000G loci and alleles files. The steps below follow downloading from the [Battenberg repository at the Oxford University Research Archive](https://ora.ox.ac.uk/objects/uuid:08e24957-7e76-438a-bd38-66c48008cf52). The files are also available via Dropbox links from the same page as the GC and RT correction files above. ``` wget https://ora.ox.ac.uk/objects/uuid:08e24957-7e76-438a-bd38-66c48008cf52/files/rt435gd52w From a214a4dab29220e9c5dbd820103252e464f5c2d2 Mon Sep 17 00:00:00 2001 From: ameynert Date: Tue, 28 Mar 2023 13:34:11 +0100 Subject: [PATCH 3/8] Tab delimited output rather than underscore for loci files --- docs/usage.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/usage.md b/docs/usage.md index f177760071..99552fdea7 100644 --- a/docs/usage.md +++ b/docs/usage.md @@ -672,7 +672,7 @@ do count=$((n * 3 / 10)) grep -xf chr${i}.txt chr${i}_on_target.txt > chr${i}.temp shuf -n $count chr${i}_on_target.txt >> chr${i}.temp - sort -n -k2 -t '_' chr${i}.temp | uniq > battenberg_loci_on_target_hg38_chr${i}.txt + sort -n -k2 -t '_' chr${i}.temp | uniq | awk 'BEGIN { FS="_" } ; { print $1 "\t" $2 }' > battenberg_loci_on_target_hg38_chr${i}.txt done zip battenberg_loci_on_target_hg38.zip battenberg_loci_on_target_hg38_chr*.txt ``` From cb1972bfcf987c369f74ba69958b27ab83da80a4 Mon Sep 17 00:00:00 2001 From: ameynert Date: Tue, 28 Mar 2023 14:15:12 +0100 Subject: [PATCH 4/8] Use 'chr' prefix for loci and alleles --- docs/usage.md | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/docs/usage.md b/docs/usage.md index 99552fdea7..eec56f4e3a 100644 --- a/docs/usage.md +++ b/docs/usage.md @@ -628,7 +628,7 @@ sed -e 's/chr//' targets_with_chr.bed > targets.bed for t in GC RT do - unzip ${t}_G1000_hg38.zip + unzip ${t}_G1000_hg38.zip cut -f 1-3 ${t}_G1000_hg38.txt > ascat_${t}_snps_hg38.txt tail -n +2 ascat_${t}_snps_hg38.txt | awk '{ print $2 "\t" $3-1 "\t" $3 "\t" $1 }' > ascat_${t}_snps_hg38.bed @@ -649,7 +649,7 @@ wget https://ora.ox.ac.uk/objects/uuid:08e24957-7e76-438a-bd38-66c48008cf52/file mv rt345gd52w battenberg.zip jar xf battenberg.zip -unzip 1000G_loci_hg38.zip +unzip 1000G_loci_hg38_chr.zip cd 1000G_loci_hg38 mkdir battenberg_alleles_on_target_hg38 mv *allele* battenberg_alleles_on_target_hg38/ @@ -657,7 +657,7 @@ mkdir battenberg_loci_on_target_hg38 mv *loci* battenberg_loci_on_target_hg38/ ``` -3. Copy the `targets.bed` and `ascat_GC_snps_on_target_hg38.txt` files into the newly created `battenberg_loci_on_target_hg38` folder before running the next set of steps. ASCAT generates a list of GC correction loci with sufficient coverage in a sample, then intersects that with the list of all loci with tumour logR values in that sample. If the intersection is <10% the size of the latter, it will fail with an error. Because the Battenberg loci/allele sets are very dense, subsetting to on-target regions is still too many loci. This script ensures that all SNPs with GC correction information are included in the loci list, plus a random sample of another 30% of all on target loci. You may need to vary this proportion depending on your set of targets. A good rule of thumb is that the size of your GC correction loci list should be about 15% the size of your total loci list. This allows for a margin of error. +3. Copy the `targets_with_chr.bed` and `GC_G1000_on_target_hg38.txt` files into the newly created `battenberg_loci_on_target_hg38` folder before running the next set of steps. ASCAT generates a list of GC correction loci with sufficient coverage in a sample, then intersects that with the list of all loci with tumour logR values in that sample. If the intersection is <10% the size of the latter, it will fail with an error. Because the Battenberg loci/allele sets are very dense, subsetting to on-target regions is still too many loci. This script ensures that all SNPs with GC correction information are included in the loci list, plus a random sample of another 30% of all on target loci. You may need to vary this proportion depending on your set of targets. A good rule of thumb is that the size of your GC correction loci list should be about 15% the size of your total loci list. This allows for a margin of error. ``` cd battenberg_loci_on_target_hg38/ @@ -666,8 +666,8 @@ rm 1kg.phase3.v5a_GRCh38nounref_loci_chr23.txt for i in {1..22} X do awk '{ print $1 "\t" $2-1 "\t" $2 }' 1kg.phase3.v5a_GRCh38nounref_loci_chr${i}.txt > chr${i}.bed - grep "^${i}_" ascat_GC_snps_on_target_hg38.txt > chr${i}.txt - bedtools intersect -a chr${i}.bed -b targets.bed | awk '{ print $1 "_" $3 }' > chr${i}_on_target.txt + grep "^${i}_" GC_G1000_on_target_hg38.txt | awk '{ print "chr" $1 }' > chr${i}.txt + bedtools intersect -a chr${i}.bed -b targets_with_chr.bed | awk '{ print $1 "_" $3 }' > chr${i}_on_target.txt n=`wc -l chr${i}_on_target.txt | awk '{ print $1 }'` count=$((n * 3 / 10)) grep -xf chr${i}.txt chr${i}_on_target.txt > chr${i}.temp @@ -697,7 +697,7 @@ Rscript `intersect_ascat_alleles.R` args = commandArgs(trailingOnly=TRUE) -loci = read.table(args[1], header=F, sep="_", stringsAsFactors=F) +loci = read.table(args[1], header=F, sep="\t", stringsAsFactors=F) alleles = read.table(args[2], header=T, sep="\t", stringsAsFactors=F) i = intersect(loci$V2, alleles$position) From 0418dc14d6053720a6613fb2e340cbfe9cc16725 Mon Sep 17 00:00:00 2001 From: Alison Meynert Date: Tue, 28 Mar 2023 14:34:59 +0100 Subject: [PATCH 5/8] Prettier linting --- docs/usage.md | 13 +++++++------ 1 file changed, 7 insertions(+), 6 deletions(-) diff --git a/docs/usage.md b/docs/usage.md index eec56f4e3a..7fe7868f95 100644 --- a/docs/usage.md +++ b/docs/usage.md @@ -609,11 +609,12 @@ This list is by no means exhaustive and it will depend on the specific analysis While the ASCAT implementation in sarek is capable of running with whole-exome sequencing data, the needed references are currently not provided with the igenomes.config. According to the [developers](https://github.com/VanLoo-lab/ascat/issues/97) of ASCAT, loci and allele files (one file per chromosome) can be downloaded directly from the [Battenberg repository](https://ora.ox.ac.uk/objects/uuid:08e24957-7e76-438a-bd38-66c48008cf52). Please note that: -* Row names (for GC and RT correction files) should be `${chr}_${position}` (there is no SNP/probe ID for HTS data). -* All row names in GC and RT correction files should also appear in the loci files -* Loci and allele files must contain the same set of SNPs -* ASCAT developers strongly recommend using a BED file for WES/TS data. This prevents considering SNPs covered by off-target reads that would add noise to log/BAF tracks. -* The total number of GC correction loci in a sample must be at least 10% of the number of loci with logR values. If the number of GC correction loci is too small compared to the total number of loci, ASCAT will throw an error. + +- Row names (for GC and RT correction files) should be `${chr}_${position}` (there is no SNP/probe ID for HTS data). +- All row names in GC and RT correction files should also appear in the loci files +- Loci and allele files must contain the same set of SNPs +- ASCAT developers strongly recommend using a BED file for WES/TS data. This prevents considering SNPs covered by off-target reads that would add noise to log/BAF tracks. +- The total number of GC correction loci in a sample must be at least 10% of the number of loci with logR values. If the number of GC correction loci is too small compared to the total number of loci, ASCAT will throw an error. From 'Reference files' https://github.com/VanLoo-lab/ascat: @@ -637,7 +638,7 @@ do head -n 1 ${t}_G1000_hg38.txt > ${t}_G1000_on_target_hg38.txt grep -f ascat_${t}_snps_on_target_hg38.txt ${t}_G1000_hg38.txt >> ${t}_G1000_on_target_hg38.txt zip ${t}_G1000_on_target_hg38.zip ${t}_G1000_on_target_hg38.txt - + rm ${t}_G1000_hg38.zip done ``` From 7cf0c1fe03881416c0331f12e0eab2703a2257c0 Mon Sep 17 00:00:00 2001 From: ameynert Date: Tue, 28 Mar 2023 16:20:15 +0100 Subject: [PATCH 6/8] #981 --- CHANGELOG.md | 1 + 1 file changed, 1 insertion(+) diff --git a/CHANGELOG.md b/CHANGELOG.md index 2a0d3e93cf..46977f0292 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -17,6 +17,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 - [#967](https://github.com/nf-core/sarek/pull/967) - Adding new `outdir_cache` params - [#971](https://github.com/nf-core/sarek/pull/971) - Subtle bugfix to correct mutation of FASTP output channel objects. - [#978](https://github.com/nf-core/sarek/pull/978) - Validate that patient/sample does not contain spaces. +- [#981](https://github.com/nf-core/sarek/pull/981) - Added documentation on generating ASCAT resources for exome and targeted sequencing. ### Changed From 33a9bffc918f748332fa13328c5beaa3cc2570b2 Mon Sep 17 00:00:00 2001 From: ameynert Date: Tue, 28 Mar 2023 16:58:46 +0100 Subject: [PATCH 7/8] Added Alison Meynert to contributors --- README.md | 1 + 1 file changed, 1 insertion(+) diff --git a/README.md b/README.md index 797d669112..59b8ee4d53 100644 --- a/README.md +++ b/README.md @@ -133,6 +133,7 @@ We thank the following people for their extensive assistance in the development - [cgpu](https://github.com/cgpu) - [gulfshores](https://github.com/gulfshores) - [pallolason](https://github.com/pallolason) +- [Alison Meynert](https://github.com/ameynert) ## Acknowledgements From 1ab65eb30836feac63794d3ccdd8b91ee28f3727 Mon Sep 17 00:00:00 2001 From: ameynert Date: Wed, 29 Mar 2023 09:16:05 +0100 Subject: [PATCH 8/8] Clarified that ASCAT runs out of the box on WGS data --- docs/usage.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/usage.md b/docs/usage.md index 7fe7868f95..3e2d2beaf3 100644 --- a/docs/usage.md +++ b/docs/usage.md @@ -606,7 +606,7 @@ This list is by no means exhaustive and it will depend on the specific analysis ## How to run ASCAT with whole-exome sequencing data? -While the ASCAT implementation in sarek is capable of running with whole-exome sequencing data, the needed references are currently not provided with the igenomes.config. According to the [developers](https://github.com/VanLoo-lab/ascat/issues/97) of ASCAT, loci and allele files (one file per chromosome) can be downloaded directly from the [Battenberg repository](https://ora.ox.ac.uk/objects/uuid:08e24957-7e76-438a-bd38-66c48008cf52). +ASCAT runs out of the box on whole genome sequencing data using iGenomes resources. While the ASCAT implementation in sarek is capable of running with whole-exome sequencing data, the needed references are currently not provided with the igenomes.config. According to the [developers](https://github.com/VanLoo-lab/ascat/issues/97) of ASCAT, loci and allele files (one file per chromosome) can be downloaded directly from the [Battenberg repository](https://ora.ox.ac.uk/objects/uuid:08e24957-7e76-438a-bd38-66c48008cf52). Please note that: