[User Story] Artefact databases for SNVs and InDels #1377
Comments
If we switch to using TNscope for the panel data as well, this type of database would probably be useful for all panels too!
I started looking into this issue already in release 16, as a potential solution to address the increased number of variants in tumor-only TGA analyses since the addition of TNscope. I have done the following so far (see sheet: https://docs.google.com/spreadsheets/d/1MjHLPSWD78rMaEP4wvJO4HAEWIx-U2eN9c27cBWohu0/edit?gid=0#gid=0):
I have also tested filtering with the above database, after 20 groups had been added to LoqusDB, on a clinical.filtered.pass VCF from this PR: #1475, specifically a myeloid case where the number of variants had almost tripled since adding TNscope. However, even when filtering out variants that occurred as little as once in the 20 groups of merged normal BAM files, only a small subset of variants was filtered out: about 100 of the total 2000 variants.
Most of the variants that were added in this sample (and this probably applies to other tumor-only cases too) were InDels. In the bar plot, v15 corresponds to variants unique to v15 of balsamic, and v16 corresponds to variants unique to the above PR, where TNscope is added and merged with the VarDict results.
Many of them are InDels added in homopolymer regions. Here the repeat-units value comes from TNscope and counts the number of repetitive elements (for example, if a T is deleted, it counts how many T's are in a row), and AF is the allele frequency. Many of these variants are likely not interesting, and they are probably filtered out in the matched tumor + normal analysis, which is why we don't see the increase in variants there.
The allele frequency of these variants is, however, quite low, so they are unlikely to be captured even in the 7 merged WGS normal samples, which should have on average 210X coverage. As expected, a significant number of the variants that could be filtered out by the WGS normal artefact database were of this homopolymer InDel type. The plot shows only InDels with more than 5 repeat-units, along with their frequency in the WGS normal database.
The conclusion I draw from this is that the coverage obtained by merging 7 normals at 30X each is probably not high enough to capture many artefacts, which is why I now plan to test another approach.
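To make the two ideas above more concrete, here is a minimal Python sketch. It is not how TNscope computes its repeat-units annotation and not part of balsamic; both function names are hypothetical. The first function counts how many identical bases sit in a row around a position, mirroring the "how many T's in a row" description. The second estimates how likely a low-frequency artefact is to show up in at least `k` reads at a given depth, under the simplifying assumption that each read carries the artefact independently with probability equal to the allele frequency (a plain binomial model). The allele frequencies in the loop are illustrative values, not measurements from the sheet.

```python
from math import comb


def homopolymer_run_length(sequence: str, position: int) -> int:
    """Length of the run of identical bases that contains sequence[position].

    Illustration only: an assumed stand-in for a repeat-units style count.
    """
    base = sequence[position]
    start = position
    while start > 0 and sequence[start - 1] == base:
        start -= 1
    end = position
    while end < len(sequence) - 1 and sequence[end + 1] == base:
        end += 1
    return end - start + 1


def prob_at_least_k_alt_reads(depth: int, allele_frequency: float, k: int) -> float:
    """P(X >= k) for X ~ Binomial(depth, allele_frequency)."""
    p_below_k = sum(
        comb(depth, i) * allele_frequency**i * (1 - allele_frequency) ** (depth - i)
        for i in range(k)
    )
    return 1.0 - p_below_k


# 7 merged WGS normals at roughly 30X each, as described above.
merged_depth = 7 * 30

if __name__ == "__main__":
    # Hypothetical artefact allele frequencies, chosen only for illustration.
    for af in (0.005, 0.01, 0.02):
        p = prob_at_least_k_alt_reads(merged_depth, af, k=3)
        print(f"AF={af}: P(>=3 supporting reads at {merged_depth}X) = {p:.3f}")
```

Under these assumptions, the lower the artefact allele frequency, the less often it gathers enough supporting reads at 210X to be called in the merged normals, which is consistent with the small number of variants the database removed.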
New approach: Do somatic SNV calling on merged WGS normal samples where the data is extracted for ALL panel regions, see reasoning below:
Need
As a geneticist I want to see true variants and not false positive calls. Currently we have databases for annotating variants that are commonly observed as somatic in highly filtered T+N cases, as well as two databases of detected germline variants, one for variants detected in balsamic and the other for variants detected in MIP. What is lacking is a database that aims to collect artefacts, which can otherwise unnecessarily increase the workload for a geneticist, increase turnaround time (TAT), and in the worst case lead to false reports.
Suggested approach
Do somatic SNV calling on merged WGS normal samples where the data is extracted for ALL panel regions, see reasoning below:
Considered alternatives
No response
Deviation
No response
System requirements assessed
Requirements affected by this story
No response
Risk assessment needed
Risk assessment
No response
SOUPs
No response
Can be closed when
No response
Blockers
#1376
Anything else?
No response