Representing caseLevelData/zygosity with VRS alleles #57

daisieh · 2023-02-15T20:54:09Z

If I'm creating a genomicVariant specification from a VCF variant record, I can't see how I'd specify multiple alleles in a single genomicVariant: the variation property seems to be singular? For example, a variant record might have a ref A and an alt C,T. Samples in that record might have genotypes that correspond to A/C, A/A, A/T, C/T.

LegacyVariation seems to be able to capture basic VCF-format ref/alt, at least in the case where there is only one alternate allele. It does not seem like there's an option for multiple alt alleles. So I could capture zygosity/genotype for A/A and A/T as caseLevelData corresponding to one Variation, and A/A and A/C as a different one (even there, how would I know which variation to put the A/A cases in?). But how would I represent C/T samples?

VRS's MolecularVariation seems to be the preferred schema moving forward, I assume. It seems like in this schema, there is no idea of a reference allele at all: each allele is represented by a single variation. But without an ability to specify multiple variations for a genomicVariant, how would I represent zygosity for caseLevelData?


mbaudis · 2023-02-16T14:53:31Z

That's actually a VRS question in the first place - how do I express genotypes in VRS, which is not yet part of "stable": https://vrs.ga4gh.org/en/latest/terms_and_model.html?#genotype

The current Beacon v2.0 explicitly references VRS 1.2¹ which does not yet contain a genotype definition. However:

- The model just describes how you should represent record-level data in responses; it has no strong binding on how you store your data.
- You can use your own schema and reference it in the response, since you are not bound to use the Beacon default model in the response (well, it is good practice to do so, but...). So changing the VRS version to latest while leaving everything else unchanged should be an easy way to get a "forward-looking implementation", if this is needed for parsing/verification.
- IMO we will update to keep in line with VRS as soon as there is a stable update (there are also the upcoming structural variant improvements).

So for anything I'd build right now I'd just go w/ the upcoming VRS. But all this doesn't impact the query side where anyway no combined alternateBases are allowed - so strictly only alleles can be queried².

Now, this is about data representation ... Personally I'm not a fan of these direct "a variant is a genotype, at least sometimes" approaches. IMO for storage variants should always be alleles (or haplotypes, for phasing; + systemic like CNV) and then be post-composed (i.e. same analysisId). Which leads us to have a collection step at the Progenetix export stage where we create a variant from all case level instances of the same change instead of having one variant w/ all its instances ¯\_(ツ)_/¯.

This criticism does not apply to the VRS model which will provide a transparent genotype composition if/when needed and anyway isn't really for data storage (in contrast to VCF files...).

¹ https://raw.githubusercontent.com/ga4gh/vrs/1.2/schema/vrs.json#/definitions/MolecularVariation
² You can file one or more issues here :-)

daisieh · 2023-02-16T19:46:49Z

I'm still a bit confused...if each variant represents a single variation, that is, one allele, how can that same variant contain zygosity in caseLevelData? The definition for CaseLevelVariant contains an entry for zygosity, which is only apparently a word like heterozygous/homozygous, but how do you represent what the CaseLevelVariant is heterozygous or homozygous for, if there's only one variation listed?

jrambla · 2023-02-16T20:23:59Z

Hi,

Assuming you are looking for T > A:

- Heterozygous would be T + A (one for each chromosome copy)
- Homozygous would be A + A (one for each chromosome copy)

"T" is the allele in the reference genome, and the second case has both copies "mutated". Hope this clarifies.

Jordi


daisieh · 2023-02-16T20:33:22Z

Okay. So this is assuming that we wouldn't want to record a T/T homozygote at all, because they're both reference?

jrambla · 2023-02-16T20:42:02Z

Hi, we would probably need more details on the use case, but I believe that the spec allows such reference homozygous cases w/o problem. Jordi


daisieh · 2023-02-16T20:56:47Z

How would that reference homozygous caseLevelVariant be represented in the schema?

jrambla · 2023-02-16T21:10:31Z

This is a good question. Given that this seems a corner case, my colleagues could have a different view, but I would suggest using 0/0 for it. Jordi


mbaudis · 2023-02-16T21:28:45Z

It usually would be implicit, by not being reported/found, since being the default (if it is a homozygous case of the predominant allele). Which isn’t a very robust assumption.

The basic principle of genomic data exchange is that we talk about variations on some reference. However, one can only be sure about the state of any specific locus if having a confirmation that it has been assessed.

So if you have a variable locus which has been reported in your population analysis (i.e. it got its line in a VCF) you can read out that your sample didn’t have a variation.

However, such assertions are only for assessed loci; there is no guarantee that your locus has been assessed in a study (think panel or WES and intergenic region).

And Beacon instances wouldn’t usually report on the T since it isn’t a variant. So a query for it with a “no” response would not mean that it isn’t there since it is not an alternativeBase. It could be reported, though, if interpreted as “if query base is reference then report hits on reference, i.e. reference allele in variant locus”.

A good point overall: How should Beacon instances handle reference allele matches?

daisieh · 2023-02-17T02:57:30Z

The general question is that, to me, it seems like Beacon/VRS should want to capture what is possible in a VCF file. VCF specifically defines "ALT - alternate base(s): Comma-separated list of alternate non-reference alleles", and therefore the Genotype field is written as "GT (String): Genotype, encoded as allele values separated by either of / or |. The allele values are 0 for the reference allele (what is in the REF field), 1 for the first allele listed in ALT, 2 for the second allele list in ALT and so on. For diploid calls examples could be 0/1, 1 | 0, or 1/2, etc." I would expect VRS to want to be able to capture this. If a variant is only for a single alternate allele, I'm not clear on how one would capture genotypes for multiple alternate alleles; wouldn't VRS therefore be losing data representation relative to VCF?

We have VCF files from cancer patients. Each file contains two samples, TUMOUR and NORMAL. I think that our users will want to be able to query whether or not we have patients that have variants present at a site. I think it's most likely that the general-question edge case I mentioned above won't happen in our data, but it could: let's say that a patient has a germline alternative genotype at a site, but the cancer has mutated one copy to yet another allele. I'd imagine that the VCF record could look something like:

REF: A
ALT: C,T
GT-NORMAL: 1/1
GT-TUMOUR: 1/2
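
To make the record above concrete, here is a minimal Python sketch of how GT indices resolve to bases, following the VCF rule quoted earlier (0 is REF, 1..n index into ALT). The function name `decode_gt` is hypothetical and not part of any Beacon or VRS tooling.

```python
import re

def decode_gt(ref, alts, gt):
    """Translate a VCF GT string such as '1/2' into the bases it refers to."""
    alleles = [ref] + list(alts)   # index 0 is REF, 1..n are the ALT entries
    phased = "|" in gt             # '|' marks a phased call, '/' an unphased one
    calls = []
    for token in re.split(r"[/|]", gt.replace(" ", "")):
        # '.' is the VCF marker for a missing call; keep it as None
        calls.append(None if token == "." else alleles[int(token)])
    return calls, phased

normal, _ = decode_gt("A", ["C", "T"], "1/1")
tumour, _ = decode_gt("A", ["C", "T"], "1/2")
print(normal, tumour)  # ['C', 'C'] ['C', 'T']
```

The tumour sample carries two different non-reference alleles, which is exactly the case a single-allele variant record cannot express on its own.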

mrueda · 2023-02-17T09:16:53Z

My five cents.

In the reference implementation we start with VCFs, but at the database level we store each variant as biallelic. Thus, multiallelic variants are split into as many fields as there are ALT alleles (ALT > 0). Depending on how you formulate the query, you may get one hit or multiple.

You may want to take a look at this file.

Hope this helps.

Thanks,

Manu
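
The biallelic split described in the comment above could look roughly like the following Python sketch. It is not taken from the reference implementation: `split_multiallelic` is a hypothetical helper, and recording "some other ALT allele" as `None` is an assumption made here for illustration.

```python
def split_multiallelic(ref, alts, genotypes):
    """Split one multiallelic record into one biallelic record per ALT allele.

    genotypes maps a sample name to a list of VCF allele indices (0 = REF).
    In each output record the ALT in focus becomes 1, REF stays 0, and any
    other ALT allele is recorded as None (assumption: unknown in this record).
    """
    records = []
    for alt_index, alt in enumerate(alts, start=1):
        remapped = {}
        for sample, gt in genotypes.items():
            if alt_index not in gt:
                continue  # this sample does not carry the ALT in focus
            remapped[sample] = [
                0 if a == 0 else 1 if a == alt_index else None for a in gt
            ]
        records.append({"ref": ref, "alt": alt, "genotypes": remapped})
    return records

records = split_multiallelic(
    "A", ["C", "T"], {"NORMAL": [1, 1], "TUMOUR": [1, 2]}
)
for record in records:
    print(record)
```

With the TUMOUR/NORMAL example from earlier in the thread, the A>C record lists both samples and the A>T record lists only TUMOUR, so a query can return one hit or several depending on which allele it asks for.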

daisieh · 2023-02-17T18:18:32Z

Thanks, Manu. That file is one of the ones I had been looking at, so it's good to hear from you. I think you were using the LegacyVariation schema, though, which does at least allow ref/alt in a single GenomicVariant...if you were to switch to the VRS MolecularVariation schema, how would you do it?

jrambla · 2023-02-25T20:17:07Z

The description in the PR above makes "your" case clear to me.
Most of the things other contributors have said make sense: we are following VRS, we model for single variants in the genomicVariation, etc.

The solution you suggest seems reasonable to me, but also makes the spec more complex (and it is complex enough already). Our rough suggestion, as of today, would be for the Beacon client (note that I don't say user) to query for heterozygous for allele A, save the list of sample donors, then query for heterozygous for allele B, and intersect that list with the saved one. The intersection must give you the 1/2 expressed by VCF. Of course, this is a two-step solution, but we envision some non-simple queries being addressed that way.
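
The two-step approach can be sketched in Python as below. The in-memory `variants` list stands in for two separate Beacon queries; the function name, the sample IDs, and the simplified `zygosity` label are all assumptions for illustration, not actual Beacon responses.

```python
def samples_heterozygous_for(variants, allele_id):
    """Return the biosample IDs reported as heterozygous for one allele."""
    return {
        case["biosampleId"]
        for variant in variants
        if variant["variantInternalId"] == allele_id
        for case in variant["caseLevelData"]
        if case["zygosity"]["label"] == "heterozygous"
    }

het = {"label": "heterozygous"}
hom = {"label": "homozygous"}
variants = [
    {"variantInternalId": "allele_A", "caseLevelData": [
        {"biosampleId": "s1", "zygosity": het},
        {"biosampleId": "s2", "zygosity": het},
        {"biosampleId": "s3", "zygosity": hom},
    ]},
    {"variantInternalId": "allele_B", "caseLevelData": [
        {"biosampleId": "s2", "zygosity": het},
        {"biosampleId": "s4", "zygosity": het},
    ]},
]

het_a = samples_heterozygous_for(variants, "allele_A")  # first query, saved
het_b = samples_heterozygous_for(variants, "allele_B")  # second query
compound = het_a & het_b                                # samples carrying both
print(compound)  # {'s2'}
```

Note that the intersection only identifies samples heterozygous for both alleles; whether that always corresponds to a VCF 1/2 call depends on both alleles being at the same locus.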

Also, as some of the contributors have said, we are evolving the spec according to the feedback we receive and a community process... for compound heterozygosity, a simpler solution could be to add something like:

            "caseLevelData": [
              {
                "zygosity": {
                  "label": "1/2", // or 1/* or 1/?
                  "id": "GENO:GENO_0000402",
                  "secondaryAlleleIds": {
                    "genomicHGVSId": "NC_000001.11:g.55039979G>A"
                  }
                },
                "biosampleId": "HG03770"
              }
            ],

This is a very rough idea, but I hope the principle of it is clear enough.
Does it make sense?

daisieh · 2023-02-26T19:35:22Z

This would work for my system, I think. I'll try it out and see how it goes. Thank you for the idea!

jrambla · 2023-02-27T10:52:51Z

Any feedback would be appreciated, yes! Jordi


mbaudis · 2023-02-27T11:40:28Z

@daisieh I'll try to summarize this in the FAQ soonish...

Edit: Note here http://docs.genomebeacons.org/FAQ/#haplotypes; please extend/fix at https://github.com/ga4gh-beacon/beacon-v2/blob/main/docs/FAQ.md ...

daisieh · 2023-03-02T06:58:48Z

Would there be any harm in suggesting secondaryAlleleIds as a property in zygosity for every variation? That is, a sample with zygosity 0/1 would be listed in caseLevelData for the ref variation, with a secondaryAlleleId for the alt variation, and also listed in caseLevelData for the alt variation, with a secondaryAlleleId for the ref variation?

jrambla · 2023-03-02T10:49:29Z

I'm not sure that I understand your suggestion correctly, but I will risk commenting on it ;-)
If a user wants to know all variants at a given position, this is equivalent to querying for a region that only includes that base (e.g. chr22:1000-1001). This query will return all variants seen by that Beacon at that position.

daisieh · 2023-03-02T20:05:42Z

If there were three alleles seen at a specific location, like 22:1000-1001, and we had three samples with genotypes 0/0, 0/1 and 1/2, I am suggesting that you'd have:

{
    "variation": { allele_0 },
    "caseLevelData": [
        {
            "biosampleId": sample_1,
            "zygosity": {
                "label": "0/0"
            }
        },
        {
            "biosampleId": sample_2,
            "zygosity": {
                "label": "0/1",
                "secondaryAlleleId": allele_1
            }
        }
    ]
},
{
    "variation": { allele_1 },
    "caseLevelData": [
        {
            "biosampleId": sample_2,
            "zygosity": {
                "label": "0/1",
                "secondaryAlleleId": allele_0
            }
        },
        {
            "biosampleId": sample_3,
            "zygosity": {
                "label": "1/2",
                "secondaryAlleleId": allele_2
            }
        }
    ]
},
{
    "variation": { allele_2 },
    "caseLevelData": [
        {
            "biosampleId": sample_3,
            "zygosity": {
                "label": "1/2",
                "secondaryAlleleId": allele_1
            }
        }
    ]
}

So all alleles present in the samples are accounted for in the caseLevelData, and are associated with their biosample and the other allele in the genotype if there is one present.
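
As a rough illustration of the layout described above, this Python sketch groups each sample under every allele it carries and points `secondaryAlleleId` at the other allele in the genotype. The function name and the `allele_N` identifiers are placeholders, and `secondaryAlleleId` is the property proposed in this thread, not part of the current Beacon schema.

```python
from collections import defaultdict

def build_variant_records(genotypes):
    """Group samples under every allele they carry, noting the partner allele.

    genotypes maps a biosample ID to a pair of VCF allele indices.
    """
    cases = defaultdict(list)
    for sample, (a, b) in genotypes.items():
        label = f"{a}/{b}"
        if a == b:
            # Homozygous: listed once, no partner allele to point at
            cases[a].append({"biosampleId": sample, "zygosity": {"label": label}})
            continue
        # Heterozygous: listed under both alleles, each pointing at the other
        for own, other in ((a, b), (b, a)):
            cases[own].append({
                "biosampleId": sample,
                "zygosity": {"label": label, "secondaryAlleleId": f"allele_{other}"},
            })
    return [
        {"variation": f"allele_{index}", "caseLevelData": cases[index]}
        for index in sorted(cases)
    ]

records = build_variant_records(
    {"sample_1": (0, 0), "sample_2": (0, 1), "sample_3": (1, 2)}
)
for record in records:
    print(record["variation"], [c["biosampleId"] for c in record["caseLevelData"]])
```

Running it on the three example samples yields three variant records, with sample_2 appearing under allele_0 and allele_1, and sample_3 under allele_1 and allele_2.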

daisieh · 2023-03-02T21:19:57Z

I've summarized what I'm suggesting above in this yaml snippet from my openapi schema for a caseLevelVariant. Instead of just zygosity, replace with this object:

              genotype:
                type: object
                properties:
                  zygosity:
                    $ref: '#/components/schemas/OntologyTerm'
                    description: Ontology term for the zygosity in which the variant is present in the sample, from the Zygosity Ontology (GENO:0000391), e.g. `heterozygous` (GENO:0000135)
                    examples:
                      - id: GENO:0000458
                        label: simple heterozygous
                      - id: GENO:0000402
                        label: compound heterozygous
                      - id: GENO:0000136
                        label: homozygous
                  value:
                    type: string
                    description: VCF GT-style value, e.g. 0/0, 1|2
                  secondaryAlleleIds:
                    type: array
                    description: variantInternalIds of the other allele(s) present in this genotype
                    items:
                      type: string
                required:
                  - zygosity
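
A `genotype` object of this shape could be filled in from a diploid GT value along the lines of the following Python sketch. The mapping from GT patterns to the three example GENO terms is an assumption made here (0/0 and 1/1 both map to plain `homozygous`), and `genotype_object` is a hypothetical helper.

```python
ZYGOSITY_TERMS = {
    "homozygous": {"id": "GENO:0000136", "label": "homozygous"},
    "simple": {"id": "GENO:0000458", "label": "simple heterozygous"},
    "compound": {"id": "GENO:0000402", "label": "compound heterozygous"},
}

def genotype_object(value, secondary_allele_ids=()):
    """Build a genotype object from a diploid VCF GT-style value."""
    alleles = value.replace("|", "/").split("/")
    if len(alleles) != 2 or "." in alleles:
        raise ValueError(f"expected a fully called diploid genotype, got {value!r}")
    a, b = alleles
    if a == b:
        key = "homozygous"   # same allele on both copies
    elif "0" in (a, b):
        key = "simple"       # one reference copy, one alternate copy
    else:
        key = "compound"     # two different alternate alleles
    genotype = {"zygosity": ZYGOSITY_TERMS[key], "value": value}
    if secondary_allele_ids:
        genotype["secondaryAlleleIds"] = list(secondary_allele_ids)
    return genotype

print(genotype_object("1/2", ["allele_2"]))
```

Only `zygosity` is always present, matching the `required` list; `secondaryAlleleIds` is added when the caller supplies the other allele(s).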

This was referenced Feb 23, 2023

Update variation property to account for multiple alleles daisieh/beacon-v2#1

Merged

Update variation property to account for multiple alleles #58

Closed

jrambla added Model related Variants scout labels Mar 21, 2024
