feat: add extra qc metrics #1288
Conversation
Codecov Report: All modified and coverable lines are covered by tests ✅
It looks good 🥇 All the new Picard files are fed to MultiQC and the rule arguments look fine 💯
Fantastic work!!! ⭐ 👨💻 ⭐
We should add the relevant metrics to the metrics_deliverables.yaml file. A good place for that would probably be during the QC review: #1297
So having talked a bit to @ivadym, @fevac and @jemten, perhaps we should make an effort to keep the current QC calculations as similar as possible between the cancer and RD pipelines.
@pbiology I can see the appeal of using the same file, as it more or less unifies this metric for all WGS samples. I agree with the image that it would be nice to just look at all WGS samples, as data gathered from the same lab method, and to have a unified way of getting these QC stats.

Practically, implementing this change means adding another command-line argument to config a case, the associated change in CG (which will be quite minor), and some new logic in Balsamic to maybe allow it to be optional (to not force researchers to create these interval_list files). The changes should be fairly quick, however. Part of me hurts a little needing to implement logic in Balsamic to make one metric unified with another pipeline 😅 there are probably quite a few more values that would differ for the same sample if we ran it in MIP or Balsamic. And I guess for the future we should continue in this spirit of having identical QC workflows for RD and cancer, or maybe make a separate workflow just for QC.

For the purposes of trending I don't think it matters that we have slightly differently skewed GC_dropout for the different pipelines, since they should both reflect some underlying truth of the lab prep quality regardless. Maybe it could even be good to get some different perspectives on the same stat 🤷 (as long as there's no obvious optimal way of calculating them).

But if we are using the same metric, I can see how we could sort of steal some knowledge from the MIP metrics if we want to implement a QC threshold for WGS in Balsamic. If we use RefSeq we may need to build up this knowledge with time, but if we use the same as MIP we can just learn from their stats. Maybe that's the most important point? Then again... if we're all eventually going to switch to RefGene, one can argue that in the future the RD pipeline could steal some QC metrics from Balsamic, since we will at that point already have been running it for a while 😅

Anyway, lots of text. It doesn't matter that much to me, and I'll implement the logic to mirror the RD pipeline. Probably I'm just missing some bigger-picture perspective.
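For reference, the interval_list files mentioned above can be generated from a BED file with Picard's BedToIntervalList, given a sequence dictionary for the reference. A minimal sketch, with placeholder file and rule names, and not necessarily how Balsamic would wrap it:

```python
# Hypothetical helper rule; file names are placeholders. BedToIntervalList
# needs a sequence dictionary (.dict) matching the reference genome.
rule bed_to_interval_list:
    input:
        bed="reference/refgene.bed",
        seq_dict="reference/genome.dict",
    output:
        intervals="reference/refgene.interval_list",
    shell:
        "picard BedToIntervalList I={input.bed} O={output.intervals} SD={input.seq_dict}"
```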
I think this is the biggest point for me. Looking at the data you presented in #1240, not only was the GC/AT dropout different, but various coverage values were also quite different depending on the BED file used. It is quite appealing to say that if we look at WGS data, and use Picard's CollectHsMetrics in both, we should be able to fully compare values coming from the two different pipelines.
But if what we care about is trending these QC metrics from the same method, to be able to monitor the quality of the prep, I think we're already getting most of that from the MIP WGS cases, which have recorded this metric for a long time. The extra stats from Balsamic just seem like more data points to me, and at that point I don't see that it matters so much if these data points are skewed in any particular way based on checking different regions. And if we're already planning to change to a more general bed file like RefSeq, maybe it's better that we've already started trending on this bed file in parallel? What we will say, if we collect the data from both Balsamic and MIP, are things like "GC_DROPOUT in Balsamic is consistently [5% higher/lower] than in MIP due to using different bed files, but the trend is the same...". I don't know if it's worth adding new logic to reduce the pipeline-specific effects for this; maybe even with the same bed file there's some consistent difference based on trimming etc. (but probably... not).
I don't think I would have time to make the change to Twist before validation on Monday, especially since it also requires making it work for hg38 (which is not an issue with the RefGene bed), and there are still some tasks left in CG to solve before release. If everyone is OK with it, I'm going to merge this tomorrow morning 🔥 🧙 🔥
Kudos, SonarCloud Quality Gate passed!
This PR:
The aim of this PR is primarily to add GC dropout to WGS by adding Picard CollectHsMetrics to the WGS workflow, using the RefGene bed file.
See #1240 for the discussion about which bed file to use to calculate GC_dropout in WGS (align with the RD pipeline, or use the RefGene bed file already in use in Balsamic).
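For illustration, a minimal sketch of what such a rule could look like (the rule name, file paths and wildcards are assumptions for the example, not the actual Balsamic rule):

```python
# Hypothetical Snakemake rule; names and paths are placeholders,
# not Balsamic's actual implementation.
rule picard_collect_hsmetrics_wgs:
    input:
        bam="bam/{sample}.dedup.bam",
        fasta="reference/genome.fasta",
        intervals="reference/refgene.interval_list",
    output:
        hs_metrics="qc/{sample}.hsmetrics.txt",
    shell:
        """
        picard CollectHsMetrics \
            I={input.bam} \
            O={output.hs_metrics} \
            R={input.fasta} \
            BAIT_INTERVALS={input.intervals} \
            TARGET_INTERVALS={input.intervals}
        """
```

The resulting metrics file is one of the outputs MultiQC can pick up, which is how GC_DROPOUT (and AT_DROPOUT) end up in the QC report.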
It also:
Added:
Fixed:
PR specific checks
Integrity:
New QC-metrics:
Review and tests: