Calculate GC/AT dropout for WGS cases #1240

Closed
1 task
pbiology opened this issue Aug 31, 2023 · 4 comments · Fixed by #1288

Comments

@pbiology
Contributor

Need

It would be good to start tracking GC and AT dropout for all WGS cases. Right now we are just doing it for panels.

Suggested approach

Also run Picard HsMetrics on WGS cases, preferably with the same exome BED file as used in rare disease (RD). It could be worthwhile to ask the RD team whether we should update which file is used for the exome.
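
A minimal sketch of what such a call could look like, written as Python helpers that only build the argument lists. The `picard` wrapper executable, all file names, and the choice to pass the same interval list as both baits and targets (WGS has no capture kit) are assumptions, not taken from BALSAMIC. `CollectHsMetrics` expects interval lists rather than BED files, hence the `BedToIntervalList` step. Older Picard versions use the `I=...` argument style instead of `-I ...`.

```python
from typing import List


def bed_to_interval_list_cmd(bed: str, sequence_dict: str, interval_list: str) -> List[str]:
    """Build a Picard BedToIntervalList command that converts a BED file."""
    return [
        "picard", "BedToIntervalList",
        "-I", bed,              # input BED file
        "-SD", sequence_dict,   # reference sequence dictionary (.dict)
        "-O", interval_list,    # output interval_list
    ]


def collect_hs_metrics_cmd(bam: str, reference: str, interval_list: str, output: str) -> List[str]:
    """Build a Picard CollectHsMetrics command for one BAM file.

    Assumption: the same interval list is used for baits and targets.
    """
    return [
        "picard", "CollectHsMetrics",
        "-I", bam,
        "-O", output,
        "-R", reference,  # reference is needed for the AT/GC dropout metrics
        "--BAIT_INTERVALS", interval_list,
        "--TARGET_INTERVALS", interval_list,
    ]


# Hypothetical file names, for illustration only.
hs_cmd = collect_hs_metrics_cmd(
    "tumor.bam", "genome.fasta", "refGene.flat.interval_list", "tumor.hsmetrics.txt"
)
```

The resulting list can be handed to `subprocess.run(hs_cmd, check=True)` or joined into a shell command for a workflow rule.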

Considered alternatives

Could we instead compute the dropout over the whole genome, and not only the exome?

Requests/suggestions/bugs solved by the feature

Can be closed when

  • BALSAMIC captures GC and AT dropout in the multiqc.json file
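
To check this criterion, something like the sketch below could pull the two values out of the MultiQC JSON. It assumes the usual MultiQC layout where raw module data sits under `report_saved_raw_data`, in a section named `multiqc_picard_HsMetrics`; both key names should be verified against the actual file. The sample name in the example is hypothetical.

```python
from typing import Dict

DROPOUT_KEYS = ("AT_DROPOUT", "GC_DROPOUT")


def extract_dropout(
    multiqc_data: dict, section: str = "multiqc_picard_HsMetrics"
) -> Dict[str, Dict[str, float]]:
    """Return {sample: {"AT_DROPOUT": x, "GC_DROPOUT": y}} from parsed MultiQC JSON.

    Assumption: raw Picard HsMetrics values are stored under
    report_saved_raw_data[section][sample].
    """
    raw = multiqc_data.get("report_saved_raw_data", {}).get(section, {})
    result: Dict[str, Dict[str, float]] = {}
    for sample, metrics in raw.items():
        result[sample] = {
            key: float(metrics[key]) for key in DROPOUT_KEYS if key in metrics
        }
    return result


# Small in-memory stand-in for json.load(open("multiqc_data.json")).
example = {
    "report_saved_raw_data": {
        "multiqc_picard_HsMetrics": {
            "sample_1": {"AT_DROPOUT": 2.253385, "GC_DROPOUT": 0.054916}
        }
    }
}
dropout = extract_dropout(example)
```

A sample with an empty inner dict would indicate that HsMetrics ran but the dropout columns were not captured.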

Blockers

@pbiology
Contributor Author

pbiology commented Sep 1, 2023

Let's implement this as in MIP for release 13, and then re-evaluate for the next release exactly which regions should underlie the GC/AT dropout calculations.

@mathiasbio mathiasbio moved this from Todo to Planned in BALSAMIC Sep 12, 2023
@mathiasbio mathiasbio moved this from Planned to In Progress in BALSAMIC Sep 12, 2023
@mathiasbio
Collaborator

mathiasbio commented Sep 14, 2023

RD seems to be using an old bedfile from the Twist exome prep: "twistexomerefseq_9.1_hg19_design.bed.pad100.interval_list". I wonder whether it would simply be better for us to use the RefGene bedfile that we already use in BALSAMIC, as it seems to be a more general, untampered file.

It would also be easier to implement with that file, since using the same file as RD would require us to either:

  • add another reference file to download in our init, and then modify it to get the "pad100" version (as I don't think this file is available online), or
  • save this exome BED in a reference folder and add it as an argument to balsamic config case
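
For the first option, the "pad100" modification presumably means extending every region by 100 bp on each side. A minimal sketch under that assumption follows; the helper names are made up for illustration, and it neither clamps to chromosome ends nor merges regions that overlap after padding.

```python
from typing import Iterable, Iterator, Tuple

Interval = Tuple[str, int, int]  # (chrom, start, end), BED-style 0-based half-open


def parse_bed_line(line: str) -> Interval:
    """Parse the first three columns of a tab-separated BED line."""
    chrom, start, end = line.rstrip("\n").split("\t")[:3]
    return chrom, int(start), int(end)


def pad_intervals(intervals: Iterable[Interval], pad: int = 100) -> Iterator[Interval]:
    """Extend each interval by `pad` bases on both sides, never below 0."""
    for chrom, start, end in intervals:
        yield chrom, max(0, start - pad), end + pad


# Made-up coordinates, for illustration only.
bed_lines = ["1\t50\t200\tgeneA\n", "1\t1000\t1500\tgeneB\n"]
padded = list(pad_intervals(parse_bed_line(line) for line in bed_lines))
```

Here `padded` is `[("1", 0, 300), ("1", 900, 1600)]`: the first start is clamped at 0 instead of going negative.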

To start, I will do test runs with the exome file we have AND the RD one, and compare the stats we get. If they are similar, which I expect because the regions in these files are so large that the values should converge, then I think we can just run with the RefGene one. I'll write the results here.

@pbiology
Contributor Author

Sounds like a good way forward. We should not skip aligning with the RD group before making any decisions. Having comparable values seems quite valuable.

@mathiasbio
Collaborator

I ran some tests with 3 different bedfiles: the RefGene bedfile we already use in BALSAMIC, the untampered Twist bedfile for exome analysis, and the Twist bedfile that RD is using. While the coverage values are very similar across the bedfiles, GC_DROPOUT differs quite a lot, and substantially so between the untampered Twist v10 and the others. I don't know why this is; perhaps it has something to do with the many small BED regions in this file, which is the one defining feature of it that I can think of at the moment.

In the end I think the most reasonable way to implement this is to use the RefGene bedfile. The results are similar to the ones we get with the RD bedfile, and if there is any standardisation we could achieve between the pipelines, it is more reasonable to build that foundation on RefGene than on a particular exome panel.

| Bedfile | PCT_TARGET_BASES_10X | PCT_TARGET_BASES_20X | PCT_TARGET_BASES_30X | MEAN_TARGET_COVERAGE | AT_DROPOUT | GC_DROPOUT |
|---|---|---|---|---|---|---|
| twistexomerefseq_9.1_hg19_design.bed.pad100.modifiedheader.interval_list | 0.98063 | 0.931417 | 0.600774 | 31.370124 | 2.997022 | 0.061446 |
| twistexomecomprehensive_10.1_hg19_design.bed | 0.979294 | 0.92972 | 0.606413 | 31.482487 | 2.703384 | 0.441655 |
| refGene.flat.bed | 0.983179 | 0.95696 | 0.606675 | 31.776462 | 2.253385 | 0.054916 |
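
For anyone repeating this comparison, values like these can be read straight from each Picard metrics file. The sketch below relies on the standard Picard metrics layout (a `## METRICS CLASS` line, then a tab-separated header row, then a values row). The example input is a cut-down stand-in with only three of the real columns, and the function names are hypothetical.

```python
from typing import Dict, Iterable


def parse_picard_metrics(lines: Iterable[str]) -> Dict[str, str]:
    """Return the first metrics row of a Picard metrics file as {column: value}."""
    rows = iter(lines)
    for line in rows:
        if line.startswith("## METRICS CLASS"):
            # The two lines after the marker are the header and the first data row.
            header = next(rows).rstrip("\n").split("\t")
            values = next(rows).rstrip("\n").split("\t")
            return dict(zip(header, values))
    raise ValueError("no '## METRICS CLASS' section found")


def dropout_summary(metrics: Dict[str, str]) -> Dict[str, float]:
    """Pick out the two dropout metrics as floats."""
    return {key: float(metrics[key]) for key in ("AT_DROPOUT", "GC_DROPOUT")}


# Reduced example; a real HsMetrics file has many more columns.
example_lines = [
    "## htsjdk.samtools.metrics.StringHeader\n",
    "## METRICS CLASS\tpicard.analysis.directed.HsMetrics\n",
    "MEAN_TARGET_COVERAGE\tAT_DROPOUT\tGC_DROPOUT\n",
    "31.776462\t2.253385\t0.054916\n",
]
summary = dropout_summary(parse_picard_metrics(example_lines))
```

With a real file, `parse_picard_metrics(open(path))` works the same way, since a file object iterates over lines.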

@mathiasbio mathiasbio linked a pull request Oct 20, 2023 that will close this issue
8 tasks
@mathiasbio mathiasbio moved this from In Progress to Review in BALSAMIC Oct 27, 2023
@ivadym ivadym moved this from Review to WIP in BALSAMIC Nov 1, 2023
@mathiasbio mathiasbio moved this from WIP to Completed in BALSAMIC Nov 6, 2023
@pbiology pbiology moved this from Completed to Archived in BALSAMIC Nov 10, 2023