[User Story] Artefact databases for SNVs and InDels #1377

mathiasbio · 2024-01-29T08:42:26Z

Need

As a geneticist I want to see true variants and not false-positive calls. Currently we have databases for annotating variants that are commonly observed as somatic in highly filtered T+N cases, as well as two databases for detected germline variants, one built from BALSAMIC and the other from MIP. What is lacking is a database that collects artefacts, which otherwise increase the geneticist's workload unnecessarily, increase turnaround time (TAT) and, in the worst case, lead to false reports.

Suggested approach

Do somatic SNV calling on merged WGS normal samples where the data is extracted for ALL panel regions, see reasoning below:

To avoid populating the database with common cancer-relevant somatic variants, we should do somatic SNV calling on normal samples.
WGS should be used, as it can be difficult to collect normal samples for each panel.
The WGS normal samples should be merged, as some artefacts only occur at low frequencies of around 1% and it would be unlikely to call these variants in 30X normal samples.
The WGS samples should have data extracted only from within the bedregions we're using (thanks @khurrammaqbool for this idea), so that the variant callers can run on the cluster without issue. The merged normal coverage would ideally reach around 1000X, which means around 33 normals would be required. With each final BAM file at around 40 GB, merging whole genomes would give a BAM of around 1.3 TB, which would probably crash.
Somatic SNV calling is done on the merged, extracted BAM files (maybe a total of 5 merged BAMs, each consisting of 20 merged normals), and the VCFs are uploaded to a LoqusDB and exported as a VCF for annotation and filtration.
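The coverage and file-size reasoning above can be checked with a few lines of arithmetic. This is a minimal sketch that only uses the figures quoted in this issue (about 30X per normal, about 40 GB per BAM, a target of around 1000X); the function names are made up for illustration.

```python
def merged_depth(n_normals: int, depth_per_normal: float) -> float:
    """Approximate depth after merging n normals of equal depth."""
    return n_normals * depth_per_normal


def merged_size_tb(n_normals: int, size_per_bam_gb: float) -> float:
    """Approximate size in TB of a whole-genome merge (no region extraction)."""
    return n_normals * size_per_bam_gb / 1000


if __name__ == "__main__":
    n = 33  # number of normals suggested in the issue
    print(f"{n} normals at 30X give about {merged_depth(n, 30):.0f}X")
    print(f"{n} BAMs of 40 GB give about {merged_size_tb(n, 40):.2f} TB")
```

Restricting each BAM to the panel regions before merging is what keeps the merged file far below the whole-genome size estimated here.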

Considered alternatives

No response

Deviation

No response

System requirements assessed

Yes, I have reviewed the system requirements

Requirements affected by this story

No response

Risk assessment needed

Needed
Not needed

Risk assessment

No response

SOUPs

No response

Can be closed when

No response

Blockers

#1376

Anything else?

No response

mathiasbio · 2024-04-15T08:48:03Z

Probably if we switch to using TNscope for even the panel data this type of database would be useful for all panels as well!

mathiasbio · 2024-09-25T16:04:01Z

I have started looking into this issue already in release 16, as a potential solution to address the increased number of variants in tumor-only TGA analyses since the addition of TNscope. So far I have done the following (see sheet: https://docs.google.com/spreadsheets/d/1MjHLPSWD78rMaEP4wvJO4HAEWIx-U2eN9c27cBWohu0/edit?gid=0#gid=0):

collected 28 groups of normal samples, with 7 WGS normals in each group
merged the 7 dedup.realigned BAM files within each group into 1
did a basic TNscope variant calling (see sheet for settings)
uploaded raw calls to new LoqusDB instance on stage called: loqusdb-balsamic-normal-stage (but only 24 could be uploaded due to some issues)
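The grouping and merging steps above could be scripted roughly as follows. This is only a sketch: the issue does not say which tool was used for merging, so `samtools merge` is an assumption, and the file names and output directory are hypothetical.

```python
from typing import List


def chunk(samples: List[str], size: int) -> List[List[str]]:
    """Split a list of BAM paths into consecutive groups of `size`."""
    return [samples[i:i + size] for i in range(0, len(samples), size)]


def merge_command(group_index: int, bams: List[str], out_dir: str = "merged") -> List[str]:
    """Build (but do not run) a samtools merge command for one group.

    Assumption: samtools is the merging tool; the output name is hypothetical.
    """
    out = f"{out_dir}/normal_group_{group_index:02d}.bam"
    return ["samtools", "merge", out, *bams]


if __name__ == "__main__":
    # 28 groups x 7 normals = 196 hypothetical dedup.realigned BAM files
    bams = [f"normal_{i:03d}.dedup.realigned.bam" for i in range(196)]
    for index, group in enumerate(chunk(bams, 7), start=1):
        print(" ".join(merge_command(index, group)))
```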

mathiasbio · 2024-09-25T16:19:05Z

I have also tested filtering with the above database, after 20 groups had been added to the LoqusDB, on a clinical.filtered.pass VCF from this PR: #1475, specifically a myeloid case where the number of variants had almost tripled since adding TNscope.

However, even when filtering out variants observed as few as 1 time in the 20 groups of merged normal BAM files, only a small subset of variants was filtered out: about 100 out of the total 2000 variants.
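The filtering test described above amounts to looking up each case variant in the exported database VCF and dropping it if it was observed often enough. Below is a minimal sketch of that idea in plain Python. It assumes the exported VCF carries the observation count in an INFO key named `Obs`; that key name, the function names and the threshold are assumptions for illustration, not BALSAMIC's actual implementation.

```python
from typing import Dict, Iterable, List, Tuple

VariantKey = Tuple[str, int, str, str]  # (chrom, pos, ref, alt)


def parse_obs(vcf_lines: Iterable[str]) -> Dict[VariantKey, int]:
    """Map each variant in a database VCF to its observation count.

    Assumption: the count is stored as `Obs=<int>` in the INFO column.
    """
    counts: Dict[VariantKey, int] = {}
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, pos, ref, alts, info = fields[0], int(fields[1]), fields[3], fields[4], fields[7]
        obs = 0
        for entry in info.split(";"):
            if entry.startswith("Obs="):
                obs = int(entry[len("Obs="):])
        for alt in alts.split(","):
            counts[(chrom, pos, ref, alt)] = obs
    return counts


def split_by_artefact_db(
    case_variants: Iterable[VariantKey],
    db_counts: Dict[VariantKey, int],
    min_obs: int = 1,
) -> Tuple[List[VariantKey], List[VariantKey]]:
    """Return (kept, removed); a variant is removed if seen >= min_obs times."""
    kept: List[VariantKey] = []
    removed: List[VariantKey] = []
    for key in case_variants:
        (removed if db_counts.get(key, 0) >= min_obs else kept).append(key)
    return kept, removed
```

With `min_obs=1` this is the most permissive setting, matching the test above where a single observation in the merged normals was enough to filter a variant.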

Most of the variants that were added in this sample (and probably this applies to other tumor only cases too) were InDels:

In the barplot above, v15 corresponds to variants unique to v15 of BALSAMIC, and v16 corresponds to variants unique to the above PR, where TNscope is added and merged with the VarDict results.

And a lot of them are InDels added in homopolymer regions:

Here, repeat-units comes from TNscope and counts the number of repetitive elements (for example, if a T is deleted, it counts how many T's are in a row), and AF is the allele frequency.
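To make the "how many T's are in a row" idea concrete, here is a small sketch that measures the homopolymer run around a position in a reference string. It only illustrates the concept for single-base repeats; how TNscope actually computes its repeat-unit annotation may differ, and the function name is hypothetical.

```python
def homopolymer_run_length(reference: str, index: int) -> int:
    """Length of the run of identical bases that contains reference[index]."""
    base = reference[index]
    start = index
    while start > 0 and reference[start - 1] == base:
        start -= 1
    end = index
    while end + 1 < len(reference) and reference[end + 1] == base:
        end += 1
    return end - start + 1


if __name__ == "__main__":
    # A deleted T at index 4 sits inside a run of six T's.
    print(homopolymer_run_length("ACGTTTTTTGA", 4))
```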

Likely many of these variants are not interesting, and are probably filtered out in the tumor + normal matched analysis which is why we don't see the issue of increased variants in that analysis.

As can be seen, the frequency of these variants is however quite low, and they are probably not captured even in the 7 merged WGS normal samples, which should have on average 210X coverage. And as expected, a significant number of the variants that could be filtered out by the WGS normal artefact database were of this homopolymer InDel type:

Above, only InDels with more than 5 repeat-units are shown, along with their frequency in the WGS normal database. The conclusion I draw from this is that the normal coverage after merging seven 30X normals is probably not high enough to capture many artefacts, which is why I now plan to test another approach.
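The depth argument can be made a bit more tangible with a simple binomial model. This is an idealised sketch: it assumes each read independently shows the artefact with probability equal to the allele frequency, and that a caller needs some minimum number of supporting reads. The threshold of 5 reads used below is a hypothetical value, not a TNscope setting.

```python
from math import comb


def expected_alt_reads(depth: int, af: float) -> float:
    """Expected number of reads supporting the artefact."""
    return depth * af


def prob_at_least(k: int, depth: int, af: float) -> float:
    """P(X >= k) for X ~ Binomial(depth, af)."""
    p_fewer = sum(comb(depth, i) * af**i * (1 - af) ** (depth - i) for i in range(k))
    return 1 - p_fewer


if __name__ == "__main__":
    min_alt_reads = 5  # hypothetical minimum supporting reads
    for depth in (30, 210, 1000):
        print(
            f"{depth}X: expected alt reads at 1% AF = {expected_alt_reads(depth, 0.01):.1f}, "
            f"P(>= {min_alt_reads} alt reads) = {prob_at_least(min_alt_reads, depth, 0.01):.3f}"
        )
```

Under these assumptions, a 1% artefact yields only about 2 supporting reads on average at 210X, which is consistent with the observation that few low-frequency homopolymer InDels were captured.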

mathiasbio · 2024-09-25T16:20:01Z

New approach:

Do somatic SNV calling on merged WGS normal samples where the data is extracted for ALL panel regions, see reasoning below:

To avoid populating the database with common cancer-relevant somatic variants, we should do somatic SNV calling on normal samples.
WGS should be used, as it can be difficult to collect normal samples for each panel.
The WGS normal samples should be merged, as some artefacts only occur at low frequencies of around 1% and it would be unlikely to call these variants in 30X normal samples.
The WGS samples should have data extracted only from within the bedregions we're using (thanks @khurrammaqbool for this idea), so that the variant callers can run on the cluster without issue. The merged normal coverage would ideally reach around 1000X, which means around 33 normals would be required. With each final BAM file at around 40 GB, merging whole genomes would give a BAM of around 1.3 TB, which would probably crash.
Somatic SNV calling is done on the merged, extracted BAM files (maybe a total of 5 merged BAMs, each consisting of 20 merged normals), and the VCFs are uploaded to a LoqusDB and exported as a VCF for annotation and filtration.
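One way the region-extraction step could look is sketched below. The issue does not name the tool for this step, so `samtools view -L` is an assumption, and the BED file name, output directory and file suffix are hypothetical. The code only builds the command; it does not run it.

```python
from pathlib import Path
from typing import List


def extract_command(bam: str, bed: str, out_dir: str) -> List[str]:
    """Build a samtools command that keeps only reads overlapping the BED regions.

    Assumption: samtools is used; `-b` writes BAM, `-L` restricts to a BED file.
    """
    out = str(Path(out_dir) / (Path(bam).stem + ".panel.bam"))
    return ["samtools", "view", "-b", "-L", bed, "-o", out, bam]


if __name__ == "__main__":
    # Hypothetical inputs: one WGS normal and a BED file covering all panel regions.
    print(" ".join(extract_command("/data/normal_01.bam", "all_panels.bed", "extracted")))
```

Extracting first and merging afterwards means each merged file only contains panel-region reads. With 20 normals of about 30X each, one merged BAM would hold roughly 600X within those regions.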

mathiasbio added the User-Story label (A User-Story describing new functionality) Jan 29, 2024

mathiasbio added this to BALSAMIC Jan 29, 2024

github-project-automation bot moved this to Todo in BALSAMIC Jan 29, 2024

mathiasbio added this to the Release 15 milestone Jan 29, 2024

pbiology modified the milestones: Release 16, Release 17 Apr 29, 2024

mathiasbio mentioned this issue Oct 14, 2024

feat: add artefact database argument #1481

Merged

55 tasks

mathiasbio linked a pull request Oct 14, 2024 that will close this issue

feat: add artefact database argument #1481

Merged

55 tasks

mathiasbio closed this as completed Dec 11, 2024

github-project-automation bot moved this from Todo to Completed in BALSAMIC Dec 11, 2024

mathiasbio mentioned this issue Dec 11, 2024

Filter variants agains LoqusDB #1131

Closed

3 tasks

pbiology removed this from the Release 17 milestone Dec 11, 2024
