feat: add extra qc metrics #1288

Merged
merged 18 commits into from
Nov 2, 2023

mathiasbio (Collaborator) commented Oct 17, 2023

This PR:

The primary aim of this PR is to add GC dropout to WGS by adding picard hsmetrics to the WGS workflow, using the refgene bedfile.

See issue #1240 for the discussion of which bed file to use to calculate GC_dropout in WGS (either aligning with the RD pipeline, or using the RefGene bedfile already in use in balsamic).

It also:

  • adds picard_CollectGcBiasMetrics from Picard (since this is a GC bias metric that is based on looking at the entire genome, it could be an interesting metric to track specifically for WGS)
  • corrects some QC output names from TGA which still retained the realign part of their output names after the PR that removed realignment for TGA. (fix: Remove local realignment for TGA #1272)

Added:

  • picard_HsMetrics for WGS (using refgene as a target bedfile)
  • picard_CollectGcBiasMetrics for WGS
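
For readers unfamiliar with Picard metrics files, here is a minimal sketch of how GC_DROPOUT and AT_DROPOUT could be read from an HsMetrics output. It is not code from this PR: the function name `read_hs_metrics` and the example text (including its values) are hypothetical, and it assumes the usual Picard layout of `#` comment lines followed by a tab-separated header row and a value row.

```python
import csv
import io


def read_hs_metrics(text: str) -> dict:
    """Return the first metrics row of a Picard-style metrics file as a dict."""
    table_lines = []
    for line in text.splitlines():
        is_comment = line.startswith("#")
        is_blank = not line.strip()
        if is_comment or is_blank:
            if table_lines:
                break  # the metrics table has ended (e.g. a histogram follows)
            continue  # still in the header section before the table
        table_lines.append(line)
    reader = csv.DictReader(io.StringIO("\n".join(table_lines)), delimiter="\t")
    return next(reader)


# Hypothetical, heavily shortened example file; the values are placeholders.
EXAMPLE = (
    "## htsjdk.samtools.metrics.StringHeader\n"
    "# CollectHsMetrics ...\n"
    "\n"
    "## METRICS CLASS\tpicard.analysis.directed.HsMetrics\n"
    "BAIT_SET\tAT_DROPOUT\tGC_DROPOUT\n"
    "refgene\t1.5\t0.25\n"
    "\n"
    "## HISTOGRAM\tjava.lang.Integer\n"
    "coverage_or_base_quality\thigh_quality_coverage_count\n"
    "0\t10\n"
)

row = read_hs_metrics(EXAMPLE)
gc_dropout = float(row["GC_DROPOUT"])
at_dropout = float(row["AT_DROPOUT"])
```

In the workflow itself these files are passed on to MultiQC, so a parser like this would only be useful for ad hoc inspection of a single output.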

Fixed:

  • corrected names of QC metrics output-files for TGA to reflect the bamfile used there (not realigned, only deduped).

PR specific checks

Integrity:

  • TGA T/N workflow is still intact after renaming
  • TGA T/N multiqc_data.json contains the same info as in develop
  • WGS T-only workflow finishes without errors

New QC-metrics:

  • the GC_dropout metric is added to WGS multiqc_data.json
  • additional GC_bias metrics are added to WGS multiqc_data.json
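
A check like the two above could be scripted roughly as follows. This is only a sketch under assumptions: the helper `find_metric` is hypothetical, the example JSON is made up, and the search is deliberately recursive so that it does not depend on the exact section names MultiQC uses inside multiqc_data.json.

```python
import json


def find_metric(node, name, path=()):
    """Yield (path, value) for every occurrence of key `name` in nested JSON."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == name:
                yield path + (key,), value
            else:
                yield from find_metric(value, name, path + (key,))
    elif isinstance(node, list):
        for index, item in enumerate(node):
            yield from find_metric(item, name, path + (index,))


# Made-up stand-in for a multiqc_data.json file; section and sample names are hypothetical.
EXAMPLE_JSON = (
    '{"report_saved_raw_data": {"hypothetical_picard_section": '
    '{"sample_1": {"GC_DROPOUT": 0.25}}}}'
)

hits = list(find_metric(json.loads(EXAMPLE_JSON), "GC_DROPOUT"))
metric_present = len(hits) > 0
```

For a real case, the example string would be replaced by the contents of the case's multiqc_data.json, and an empty `hits` list would mean the metric did not reach the report.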

Review and tests:

  • Tests pass
  • Code review
  • New code is executed and covered by tests, and tests approve

codecov bot commented Oct 17, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Files Coverage Δ
BALSAMIC/commands/base.py 100.00% <100.00%> (ø)
BALSAMIC/commands/config/base.py 100.00% <100.00%> (ø)
BALSAMIC/commands/config/case.py 100.00% <100.00%> (+3.44%) ⬆️
BALSAMIC/commands/config/pon.py 100.00% <100.00%> (+4.87%) ⬆️
BALSAMIC/commands/init/base.py 100.00% <100.00%> (ø)
BALSAMIC/commands/options.py 100.00% <100.00%> (ø)
BALSAMIC/commands/report/base.py 100.00% <100.00%> (ø)
BALSAMIC/commands/report/deliver.py 100.00% <100.00%> (ø)
BALSAMIC/commands/report/status.py 100.00% <100.00%> (ø)
BALSAMIC/commands/run/analysis.py 100.00% <100.00%> (ø)
... and 20 more

@mathiasbio mathiasbio added this to the Release 13 milestone Oct 19, 2023
@mathiasbio mathiasbio self-assigned this Oct 19, 2023
@mathiasbio mathiasbio linked an issue Oct 20, 2023 that may be closed by this pull request
1 task
@mathiasbio mathiasbio marked this pull request as ready for review October 25, 2023 15:33
@mathiasbio mathiasbio requested a review from a team as a code owner October 25, 2023 15:33
ivadym (Contributor) left a comment

It looks good 🥇 All the new picard files are fed to Multiqc and the rule arguments look fine 💯

10 resolved (outdated) review threads on BALSAMIC/snakemake_rules/quality_control/picard_wgs.rule

ivadym (Contributor) left a comment

Fantastic work!!! ⭐ 👨‍💻 ⭐

We should add the relevant metrics to the metrics_deliverables.yaml file. A good place to do that is probably during the QC review: #1297

pbiology (Contributor) commented

So, having talked a bit to @ivadym, @fevac and @jemten, perhaps we should make an effort to keep the current QC calculations as similar as possible between the cancer and RD pipelines.
I think there is consensus that using a static file, such as RefGene, is the way forward. But to make things similar right now, we should perhaps start using the twistExome BED file used in MIP in BALSAMIC as well. We can then take a quick look at which static BED file would be the best replacement and implement it in both pipelines at a future date.

mathiasbio (Collaborator, Author) commented Oct 30, 2023

@pbiology I can see the appeal of using the same file, as it more or less unifies this metric for all WGS samples. I agree with the image that it would be nice to just look at all WGS samples as data gathered with the same lab method, and to have a unified way of getting these QC-stats.

Practically implementing this change means adding another command-line argument to config a case, and the associated change in CG, which will be quite minor, and some new logic in balsamic to maybe allow it to be optional (to not force researchers to create these interval_list files). Changes should be fairly quick however.

Part of me hurts a little at needing to implement logic in balsamic to unify one metric with another pipeline 😅 There are probably quite a few more values that would differ for the same sample if we ran it in MIP or Balsamic. And I guess for the future we should continue in this spirit of having identical QC workflows for RD and cancer, or maybe make a separate workflow just for QC.

For the purposes of trending I don't think it matters that we have slightly differently skewed GC_dropout for the different pipelines since they should both reflect some underlying truth of the lab-prep-quality regardless. Maybe it could even be good to get some different perspectives on the same stat 🤷 (as long as there's no obvious optimal way of calculating them)

But if we are using the same metric, I can see how we could sort of steal some knowledge from the MIP metrics if we want to implement a QC-threshold for WGS in balsamic. If we use RefSeq we may need to build up this knowledge over time, but if we use the same file as MIP we can just learn from their stats. Maybe that's the most important point? Then again... if we're all eventually going to switch to RefGene, one can argue that in the future the RD-pipeline could steal some QC-metrics from balsamic, since at that point we will already have been running it for a while 😅

Anyway, lots of text. It doesn't matter that much to me, I'll implement the logic to mirror the RD. I'm probably just missing some bigger-picture perspective.

pbiology (Contributor) commented

> For the purposes of trending I don't think it matters that we have slightly differently skewed GC_dropout for the different pipelines since they should both reflect some underlying truth of the lab-prep-quality regardless. Maybe it could even be good to get some different perspectives on the same stat 🤷 (as long as there's no obvious optimal way of calculating them)

I think this is the biggest point for me. Looking at the data you presented in #1240, not only was the GC/AT dropout different, but various coverage values were also quite different depending on the BED file used. It is quite appealing to say that if we look at WGS data, and we use Picard's calculate HS metrics in both, we should be able to fully compare values coming from two different pipelines.

mathiasbio (Collaborator, Author) commented Oct 30, 2023

But if what we're caring about is trending these QC-metrics from the same method, to be able to monitor the quality of the prep, I think we're already getting most of that from the MIP WGS-cases which have recorded this metric for a long time.

The extra stats from Balsamic just seem like more data-points to me, and at that point I don't see that it matters much if these data-points are skewed in a particular way because different regions are checked. And if we're already planning to change to a more general bed-file like RefSeq, maybe it's better that we've already started trending on this bedfile in parallel?

If we collect the data from both balsamic and MIP, we will say things like "GC_DROPOUT in balsamic is consistently [5% higher/lower] than in MIP due to using different bedfiles, but the trend is the same...". I don't know if it's worth adding new logic to reduce the pipeline-specific effects for this; maybe even with the same bedfile there would be some consistent difference based on trimming etc. (but probably... not)

mathiasbio (Collaborator, Author) commented

I don't think I would have time to make the change to twist before validation on Monday, especially since it also requires making it work for hg38 (which is not an issue with the RefGene bed), and since there are still some tasks left over in CG to solve before release. If everyone is ok with it, I'm going to merge this tomorrow morning 🔥 🧙 🔥

sonarqubecloud bot commented Nov 2, 2023

Kudos, SonarCloud Quality Gate passed!

0 Bugs (rating A)
0 Vulnerabilities (rating A)
0 Security Hotspots (rating A)
0 Code Smells (rating A)

No Coverage information
No Duplication information

@mathiasbio mathiasbio merged commit dbfe0ca into develop Nov 2, 2023
9 checks passed
@mathiasbio mathiasbio deleted the add_extra_qc_metrics branch November 2, 2023 09:15
@ivadym ivadym mentioned this pull request Nov 6, 2023
Successfully merging this pull request may close these issues.

Calculate GC/AT dropout for WGS cases
3 participants