[User Story] Artefact databases for SNVs and InDels #1377

Closed
3 tasks
mathiasbio opened this issue Jan 29, 2024 · 4 comments · Fixed by #1481
Labels
User-Story A User-Story describing new functionality

Comments

@mathiasbio
Collaborator

mathiasbio commented Jan 29, 2024

Need

As a geneticist I want to see true variants and not false positive calls. Currently we have databases for annotating variants that are commonly observed as somatic in highly filtered T+N cases, as well as two databases of detected germline variants, one from BALSAMIC and the other from MIP. What is lacking is a database that collects artefacts, which can otherwise unnecessarily increase the workload for a geneticist, increase turnaround time (TAT) and, in the worst case, lead to false reports.

Suggested approach

Do somatic SNV calling on merged WGS normal samples where the data is extracted for ALL panel regions, see reasoning below:

  • To avoid populating the database with common cancer relevant somatic variants we should do somatic SNV calling on normal samples.
  • WGS should be used, as it can be difficult to collect normal samples for each panel.
  • The WGS normal samples should be merged, as some artefacts only occur at low frequencies of around 1% and it would be unlikely to call these variants in 30X normal samples.
  • The WGS samples should have data extracted only from within the BED regions we use (thanks @khurrammaqbool for this idea), so that the variant callers can run without issue on the cluster. The merged normal coverage would ideally reach around 1000X, which requires around 33 normals. At around 40 GB per final BAM file, merging them whole would produce a BAM of about 1.3 TB, which would probably crash.
  • Somatic SNV calling is done on the merged extracted BAM files (maybe a total of 5 merged BAMs, each consisting of 20 merged normals), and the VCFs are uploaded to a LoqusDB and exported as a VCF for annotation and filtration.
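The arithmetic behind these bullets can be checked with a small sketch. The function names are hypothetical, and the defaults (30X per normal, around 40 GB per BAM, artefacts at around 1% allele frequency) are simply the figures quoted above, not measured values.

```python
def merged_coverage(n_normals: int, per_sample_coverage: float = 30.0) -> float:
    """Approximate coverage after merging n normals of equal depth."""
    return n_normals * per_sample_coverage


def merged_size_tb(n_normals: int, per_bam_gb: float = 40.0) -> float:
    """Approximate size in TB of a merged BAM if no regions are extracted first."""
    return n_normals * per_bam_gb / 1000.0


def expected_alt_reads(coverage: float, allele_frequency: float) -> float:
    """Average number of reads expected to support an artefact at a given depth."""
    return coverage * allele_frequency


if __name__ == "__main__":
    n = 33
    print(f"{n} normals: ~{merged_coverage(n):.0f}X, ~{merged_size_tb(n):.2f} TB unextracted")
    for depth in (30.0, merged_coverage(n)):
        reads = expected_alt_reads(depth, 0.01)
        print(f"{depth:.0f}X at 1% AF: ~{reads:.1f} supporting reads on average")
```

At 30X a 1% artefact is supported by less than one read on average, which is why a single normal is unlikely to reveal it, while the merged depth gives roughly ten supporting reads.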

Considered alternatives

No response

Deviation

No response

System requirements assessed

  • Yes, I have reviewed the system requirements

Requirements affected by this story

No response

Risk assessment needed

  • Needed
  • Not needed

Risk assessment

No response

SOUPs

No response

Can be closed when

No response

Blockers

#1376

Anything else?

No response

@mathiasbio mathiasbio added the User-Story A User-Story describing new functionality label Jan 29, 2024
@github-project-automation github-project-automation bot moved this to Todo in BALSAMIC Jan 29, 2024
@mathiasbio mathiasbio added this to the Release 15 milestone Jan 29, 2024
@mathiasbio
Collaborator Author

If we switch to using TNscope for the panel data as well, this type of database would probably be useful for all panels too!

@pbiology pbiology modified the milestones: Release 16, Release 17 Apr 29, 2024
@mathiasbio
Collaborator Author

mathiasbio commented Sep 25, 2024

I have started looking into this issue already in release 16, as a potential solution to address the increased number of variants in tumor-only TGA analyses since the addition of TNscope. I have done the following so far (see sheet: https://docs.google.com/spreadsheets/d/1MjHLPSWD78rMaEP4wvJO4HAEWIx-U2eN9c27cBWohu0/edit?gid=0#gid=0):

  • collected 28 groups of normal samples, with 7 WGS normals in each group
  • merged the 7 dedup.realigned BAM files within each group into 1
  • did a basic TNscope variant calling (see sheet for settings)
  • uploaded the raw calls to a new LoqusDB instance on stage called loqusdb-balsamic-normal-stage (but only 24 could be uploaded due to some issues)

@mathiasbio
Collaborator Author

I have also tested filtering with the above database, after 20 groups had been added to the LoqusDB, on a clinical.filtered.pass VCF from this PR: #1475. Specifically, I used a myeloid case where the number of variants had almost tripled since adding TNscope.

However, even when filtering out every variant that occurred as little as once in the 20 groups of merged normal BAM files, only a small subset of variants was removed: about 100 out of the total 2000 variants.

Most of the variants that were added in this sample (and this probably applies to other tumor-only cases too) were InDels:
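The filtering test described here can be sketched as a simple split on an observation count. This is only an illustration: the `artefact_obs` key and the dictionary representation of a variant are hypothetical, and this is not how BALSAMIC or LoqusDB implement the annotation.

```python
from typing import Dict, List, Tuple

Variant = Dict[str, object]


def split_by_artefact_obs(
    variants: List[Variant], min_obs: int = 1
) -> Tuple[List[Variant], List[Variant]]:
    """Split variants into (kept, removed) by artefact-database observations.

    A variant is removed when it was observed in at least `min_obs` merged
    normal groups. The "artefact_obs" key is a hypothetical annotation name.
    """
    kept: List[Variant] = []
    removed: List[Variant] = []
    for variant in variants:
        observations = int(variant.get("artefact_obs", 0))
        if observations >= min_obs:
            removed.append(variant)
        else:
            kept.append(variant)
    return kept, removed


if __name__ == "__main__":
    # Made-up records, for illustration only.
    example = [
        {"id": "var1", "artefact_obs": 0},
        {"id": "var2", "artefact_obs": 1},
        {"id": "var3", "artefact_obs": 12},
        {"id": "var4"},  # not present in the artefact database
    ]
    kept, removed = split_by_artefact_obs(example, min_obs=1)
    print("kept:", [v["id"] for v in kept])
    print("removed:", [v["id"] for v in removed])
```

Setting `min_obs=1` is the most aggressive setting, matching the test above where a single observation in the normals was enough to remove a variant.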
image

In the bar plot above, v15 corresponds to variants unique to v15 of BALSAMIC, and v16 corresponds to variants unique to the above PR, where TNscope is added and merged with the VarDict results.

And a lot of them are InDels added in homopolymer regions:
image

Here, repeat-units comes from TNscope and counts the number of repeated elements (for example, if a T is deleted, it counts how many T's are in a row), and AF is the allele frequency.
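To make the repeat-unit idea concrete, here is a minimal sketch that counts the length of a single-base run around a position in a reference string. It is a simplified illustration of the concept, not TNscope's actual computation, and the function name and 0-based indexing are assumptions.

```python
def homopolymer_run_length(reference: str, pos: int) -> int:
    """Length of the run of identical bases that contains reference[pos].

    `pos` is a 0-based index into `reference`. Only single-base repeat
    units are handled in this sketch.
    """
    if not 0 <= pos < len(reference):
        raise IndexError("pos is outside the reference sequence")
    base = reference[pos]
    start = pos
    while start > 0 and reference[start - 1] == base:
        start -= 1
    end = pos
    while end + 1 < len(reference) and reference[end + 1] == base:
        end += 1
    return end - start + 1


if __name__ == "__main__":
    # A made-up sequence with six T's in a row at indices 3 to 8.
    sequence = "ACGTTTTTTGA"
    print(homopolymer_run_length(sequence, 5))
```

A deletion of one T anywhere in that run would be reported with six repeat units under this simplified definition.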

Likely many of these variants are not interesting, and are probably filtered out in the tumor + normal matched analysis, which is why we don't see the issue of increased variants in that analysis.

As can be seen, however, the frequency of these variants is quite low, and they are unlikely to be captured even in the 7 merged WGS normal samples, which should have on average 210X coverage. And as expected, a significant number of the variants that could be filtered out by the WGS normal artefact database were of this homopolymer InDel type:

image

Above, only InDels with more than 5 repeat-units are shown, along with their frequency in the WGS normal database. The conclusion I draw from this is that the normal coverage after merging seven 30X normals is probably not high enough to capture many artefacts, which is why I now plan to test another approach.
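The coverage argument can be illustrated with a simple binomial model of read sampling. This is a sketch under stated assumptions: reads are treated as independent draws, an artefact is assumed to appear at a fixed allele frequency, and the requirement of at least 3 supporting reads is a hypothetical threshold, not a TNscope setting.

```python
from math import comb


def prob_at_least_k_alt_reads(depth: int, allele_frequency: float, min_alt_reads: int) -> float:
    """P(at least `min_alt_reads` supporting reads) under a binomial model."""
    p_fewer = sum(
        comb(depth, k)
        * allele_frequency ** k
        * (1.0 - allele_frequency) ** (depth - k)
        for k in range(min_alt_reads)
    )
    return 1.0 - p_fewer


if __name__ == "__main__":
    # 30X: one normal; 210X: seven merged normals; 1000X: the target depth.
    for depth in (30, 210, 1000):
        p = prob_at_least_k_alt_reads(depth, allele_frequency=0.01, min_alt_reads=3)
        print(f"{depth}X: P(>= 3 alt reads at 1% AF) = {p:.3f}")
```

Under these assumptions the chance of seeing a 1% artefact rises steeply between 210X and 1000X, which is consistent with the conclusion that seven merged normals are not deep enough.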

@mathiasbio
Collaborator Author

New approach:

Do somatic SNV calling on merged WGS normal samples where the data is extracted for ALL panel regions, see reasoning below:

  • To avoid populating the database with common cancer relevant somatic variants we should do somatic SNV calling on normal samples.
  • WGS should be used, as it can be difficult to collect normal samples for each panel.
  • The WGS normal samples should be merged, as some artefacts only occur at low frequencies of around 1% and it would be unlikely to call these variants in 30X normal samples.
  • The WGS samples should have data extracted only from within the BED regions we use (thanks @khurrammaqbool for this idea), so that the variant callers can run without issue on the cluster. The merged normal coverage would ideally reach around 1000X, which requires around 33 normals. At around 40 GB per final BAM file, merging them whole would produce a BAM of about 1.3 TB, which would probably crash.
  • Somatic SNV calling is done on the merged extracted BAM files (maybe a total of 5 merged BAMs, each consisting of 20 merged normals), and the VCFs are uploaded to a LoqusDB and exported as a VCF for annotation and filtration.
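The extract-then-merge step could look roughly like the sketch below, which only builds the command lines without running them. Using samtools here is an assumption (the thread does not name the tool), and all file names and function names are hypothetical.

```python
from typing import List


def extract_command(bam: str, bed: str, out_bam: str) -> List[str]:
    """Build a samtools command keeping only reads overlapping the BED regions."""
    return ["samtools", "view", "-b", "-L", bed, "-o", out_bam, bam]


def merge_command(extracted_bams: List[str], merged_bam: str) -> List[str]:
    """Build a samtools command merging the extracted (sorted) BAM files."""
    return ["samtools", "merge", merged_bam, *extracted_bams]


def build_commands(bams: List[str], bed: str, merged_bam: str) -> List[List[str]]:
    """Extract panel regions from each normal first, then merge the small BAMs."""
    commands: List[List[str]] = []
    extracted: List[str] = []
    for bam in bams:
        out_bam = bam.replace(".bam", ".extracted.bam")
        commands.append(extract_command(bam, bed, out_bam))
        extracted.append(out_bam)
    commands.append(merge_command(extracted, merged_bam))
    return commands


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    normals = ["normal1.bam", "normal2.bam"]
    for command in build_commands(normals, "all_panels.bed", "merged_normals.bam"):
        print(" ".join(command))
```

Each command list could be passed to `subprocess.run(command, check=True)`. Extracting before merging keeps the intermediate files small, which is the point of the BED-region idea above.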

@mathiasbio mathiasbio linked a pull request Oct 14, 2024 that will close this issue
55 tasks
@github-project-automation github-project-automation bot moved this from Todo to Completed in BALSAMIC Dec 11, 2024
@pbiology pbiology removed this from the Release 17 milestone Dec 11, 2024
Projects
Status: Completed
2 participants