feat: add extra qc metrics #1288
Conversation
Codecov Report: All modified and coverable lines are covered by tests ✅
It looks good 🥇 All the new Picard files are fed to MultiQC and the rule arguments look fine 💯
Fantastic work!!! ⭐ 👨💻 ⭐
We should add the relevant metrics to the metrics_deliverables.yaml file. A good place for that would probably be during the QC review: #1297
So having talked a bit to @ivadym, @fevac and @jemten, perhaps we should make an effort to keep the current QC calculations as similar as possible between the cancer and RD pipelines.
@pbiology I can see the appeal of using the same file, as it more or less unifies this metric for all WGS samples. I agree with the image that it would be nice to just look at all WGS samples, as data gathered from the same lab method, and to have a unified way of getting these QC stats.

Practically, implementing this change means adding another command-line argument to config a case, the associated change in CG (which will be quite minor), and some new logic in Balsamic to maybe allow it to be optional (to not force researchers to create these interval_list files). The changes should be fairly quick, however. Part of me hurts a little needing to implement logic in Balsamic to make one metric unified with another pipeline 😅 there are probably quite a few more values that would differ for the same sample if we ran it in MIP or Balsamic. And I guess for the future we should continue in this spirit of having identical QC workflows for RD and cancer, or maybe make a separate workflow just for QC.

For the purposes of trending I don't think it matters that we have slightly differently skewed GC_dropout for the different pipelines, since they should both reflect some underlying truth of the lab prep quality regardless. Maybe it could even be good to get some different perspectives on the same stat 🤷 (as long as there's no obvious optimal way of calculating them).

But if we are using the same metric, I can see how we could sort of steal some knowledge from the MIP metrics if we want to implement a QC threshold for WGS in Balsamic. If we use RefSeq we may need to build up this knowledge with time, but if we use the same as MIP we can just learn from their stats. Maybe that's the most important point? Then again... if we're all eventually going to switch to RefGene, one can argue that in the future the RD pipeline could steal some QC metrics from Balsamic, since we will at that point already have been running it for a while 😅

Anyway, lots of text. It doesn't matter that much to me, and I'll implement the logic to mirror the RD pipeline. Probably I'm just missing some bigger-picture perspective.
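For reference, the interval_list files mentioned above can be generated from a BED file with Picard's BedToIntervalList, given a sequence dictionary for the reference. A minimal sketch, with placeholder file and rule names, and not necessarily how Balsamic would wrap it:

```python
# Hypothetical helper rule; file names are placeholders. BedToIntervalList
# needs a sequence dictionary (.dict) matching the reference genome.
rule bed_to_interval_list:
    input:
        bed="reference/refgene.bed",
        seq_dict="reference/genome.dict",
    output:
        intervals="reference/refgene.interval_list",
    shell:
        "picard BedToIntervalList I={input.bed} O={output.intervals} SD={input.seq_dict}"
```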
I think this is the biggest point for me. Looking at the data you presented in #1240, not only was the GC/AT dropout different, but various coverage values were also quite different depending on the BED file used. It is quite appealing to say that if we look at WGS data, and use Picard's CollectHsMetrics in both, we should be able to fully compare values coming from the two different pipelines.
But if what we care about is trending these QC metrics from the same method, to be able to monitor the quality of the prep, I think we're already getting most of that from the MIP WGS cases, which have recorded this metric for a long time. The extra stats from Balsamic just seem like more data points to me, and at that point I don't see that it matters so much if these data points are skewed in any particular way based on checking different regions. And if we're already planning to change to a more general bed file like RefSeq, maybe it's better that we've already started trending on this bed file in parallel? What we will say, if we collect the data from both Balsamic and MIP, are things like "GC_DROPOUT in Balsamic is consistently [5% higher/lower] than in MIP due to using different bed files, but the trend is the same...". I don't know if it's worth adding new logic to reduce the pipeline-specific effects for this; maybe even with the same bed file there's some consistent difference based on trimming etc. (but probably... not).
I don't think I would have time to make the change to Twist before validation on Monday, especially since it also requires making it work for hg38 (which is not an issue with the RefGene bed), and there are still some tasks left in CG to solve before release. If everyone is OK with it, I'm going to merge this tomorrow morning 🔥 🧙 🔥
Kudos, SonarCloud Quality Gate passed!
This PR:
The aim of this PR is primarily to add GC dropout to WGS by adding Picard CollectHsMetrics to the WGS workflow, using the RefGene bed file.
See #1240 for the discussion about which bed file to use to calculate GC_dropout in WGS (align with the RD pipeline, or use the RefGene bed file already in use in Balsamic).
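For illustration, a minimal sketch of what such a rule could look like (the rule name, file paths and wildcards are assumptions for the example, not the actual Balsamic rule):

```python
# Hypothetical Snakemake rule; names and paths are placeholders,
# not Balsamic's actual implementation.
rule picard_collect_hsmetrics_wgs:
    input:
        bam="bam/{sample}.dedup.bam",
        fasta="reference/genome.fasta",
        intervals="reference/refgene.interval_list",
    output:
        hs_metrics="qc/{sample}.hsmetrics.txt",
    shell:
        """
        picard CollectHsMetrics \
            I={input.bam} \
            O={output.hs_metrics} \
            R={input.fasta} \
            BAIT_INTERVALS={input.intervals} \
            TARGET_INTERVALS={input.intervals}
        """
```

The resulting metrics file is one of the outputs MultiQC can pick up, which is how GC_DROPOUT (and AT_DROPOUT) end up in the QC report.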
It also:
Added:
Fixed:
PR specific checks
Integrity:
New QC-metrics:
Review and tests: