Update variation property to account for multiple alleles #58

daisieh · 2023-02-23T20:05:22Z

Addresses issue #57. In order to capture zygosity (and genotype) in CaseLevelVariants completely, we need to be able to account for the situation where a caseLevelVariant contains two alternate alleles, neither of which is the reference. I would recommend requiring the first element of a variations array, element 0, to be the reference allele, and subsequent alternate alleles to be numbered accordingly. Then zygosity can be represented as in the beacon-ri implementation:

    "caseLevelData": [
      {
        "zygosity": {
          "label": "0/1",
          "id": "GENO:GENO_0000458"
        },
        "biosampleId": "HG03770"
      }
    ],

with the labeling schema extended in the style of VCF, with values like 1/2.

More specifically, this allows for the specification of the GENO:0000402 value for zygosity:

compound heterozygous: A heterozygous quality inhering in a single locus complement comprised of two different variant alleles and no wild type locus. (e.g., fgf8a/fgf8a)
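To make the VCF-style labeling concrete, here is a minimal Python sketch of how a diploid label could be sorted into zygosity categories. The function name `classify_genotype` and the category strings are assumptions made for illustration; they are not part of the Beacon schema or of beacon-ri.

```python
def classify_genotype(label: str) -> str:
    """Classify a diploid VCF-style genotype label such as "0/1" or "1|2".

    Hypothetical helper: index 0 is taken to be the reference allele and
    any higher index an alternate allele, as proposed in this PR.
    """
    # Treat phased ("|") and unphased ("/") separators the same way.
    alleles = label.replace("|", "/").split("/")
    if len(alleles) != 2 or not all(a.isdigit() for a in alleles):
        raise ValueError(f"expected a diploid label like '0/1', got {label!r}")

    first, second = (int(a) for a in alleles)
    if first == second:
        return "homozygous reference" if first == 0 else "homozygous alternate"
    if 0 in (first, second):
        # Exactly one reference allele and one alternate allele.
        return "simple heterozygous"
    # Two different alternate alleles and no reference allele.
    return "compound heterozygous"
```

With this sketch, a label of 0/1 falls under simple heterozygous (the GENO_0000458 case in the snippet above), while 1/2 falls under compound heterozygous (GENO:0000402).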

In order to capture zygosity (and genotype) in CaseLevelVariants completely, we need to be able to account for the situation where a caseLevelVariant contains two alternate alleles, neither of which is the reference.

Add examples of simple and compound heterozygosity

mrueda · 2023-02-24T06:48:56Z

Thank you for the suggestion. In addition to the zygosity object, I think that modifying other genomicVariations fields (e.g., identifiers, molecularAttributes, etc.) would also be necessary for multiallelic sites.

daisieh · 2023-02-24T23:42:20Z

I can add those changes, but as I was about to make them, I realized that one of the problems I'm having is that I'm still not completely clear on whether or not genomicVariant is meant to represent genotype-level data at all: it's not clear from the top-level description, "Schema for a genomic variant entry."

If genomicVariant is meant to represent everything that could be in a VCF record, then caseLevelData would include sample-level diploid data, as in the beacon-ri example, and variation would have to be an array, with identifiers, molecularAttributes, etc., also becoming arrays. Alternatively, these could be nested into an array of objects, with each of the properties represented per object.
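The "array of objects" option might look roughly like the following Python sketch. The record layout, the `bases` key, the sample ID, and the helper `alleles_for_case` are all hypothetical; none of this is taken from the Beacon v2 schema. It only shows how a genotype label could index into a variations array whose element 0 is the reference allele.

```python
# Hypothetical record shape, NOT the actual Beacon v2 schema.
record = {
    "variations": [
        {"bases": "G"},  # element 0: reference allele
        {"bases": "A"},  # element 1: first alternate allele
        {"bases": "T"},  # element 2: second alternate allele
    ],
    "caseLevelData": [
        {"biosampleId": "sample-1", "zygosity": {"label": "1/2"}},
    ],
}


def alleles_for_case(record: dict, case: dict) -> list:
    """Resolve a case's genotype label to the alleles it points at."""
    label = case["zygosity"]["label"].replace("|", "/")
    indices = [int(i) for i in label.split("/")]
    return [record["variations"][i]["bases"] for i in indices]
```

Here `alleles_for_case(record, record["caseLevelData"][0])` resolves the label 1/2 to the two alternate alleles, A and T. Per-allele properties such as identifiers or molecularAttributes would sit alongside `bases` in each object.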

However, the alternative scenario is that each genomicVariant is only meant to represent a single variation, in which case sample-level diploid data would not be represented here at all, and there would need to be a completely separate endpoint(?) to represent sample-level genotypic data, possibly with reference to variations by ID.

mrueda · 2023-02-25T13:25:35Z

Beacon v2 has a different function than VCFs. The Beacon v2 specification was built to facilitate data discovery (or semantic interoperability), whereas the VCF specification is meant for data analysis, storage or sharing.

As you may have noticed, in the current version of the Beacon specification, genomicVariations has properties that were created to unequivocally identify biallelic variants (e.g., genomicHGVSId). Allowing for multiallelic variants implies changing the specification for these properties.

There are other important factors to consider, such as the lack of a term/property to store variant quality or depth, either at the variant level or at the GT level.

As an implementer, my suggestion would be to simply split your multiallelic VCFs into biallelics and go from there. Another valid option is to use the beacon2-ri-tools software (which I developed) that will perform the VCF to JSON transformation for you, including the split to biallelic.
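For illustration, splitting a multiallelic site into biallelic records could be sketched in Python as below. This is not how beacon2-ri-tools does it; the function `split_multiallelic` and the choice to mark alleles belonging to other alternates as missing (".") are assumptions, and real tools make their own choice on that point.

```python
def split_multiallelic(ref: str, alts: list, genotype: str) -> list:
    """Return one (ref, alt, genotype) tuple per alternate allele.

    Hypothetical sketch: in each biallelic record the chosen alternate
    becomes allele 1, the reference stays 0, and calls that point at any
    other alternate are written as missing (".").
    """
    sep = "|" if "|" in genotype else "/"
    calls = genotype.split(sep)
    records = []
    for index, alt in enumerate(alts, start=1):
        remapped = []
        for call in calls:
            if call == "0":
                remapped.append("0")
            elif call == str(index):
                remapped.append("1")
            else:
                remapped.append(".")
        records.append((ref, alt, sep.join(remapped)))
    return records
```

Under these assumptions, a G to A,T site with genotype 1/2 becomes two records with genotypes 1/. and ./1, so neither biallelic record on its own states that the sample carries two different alternate alleles.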

Hope this helps.

Thx,

m

daisieh · 2023-02-25T14:24:44Z

So Beacon is not meant to facilitate discovery of genotypic data at all? That seems odd, since there is quite a lot of schema devoted to individuals and cases.

mrueda · 2023-02-25T14:40:00Z

Beacon v2's purpose is to facilitate data discovery (genomic data and phenoclinic data).

daisieh · 2023-02-25T14:49:32Z

But isn't genotypic data, like cases of compound heterozygosity, something that one might want to discover? Beacon just won't address that at all?

mrueda · 2023-02-25T15:22:22Z

The current version is 2.0, which was achieved through a huge community effort. The plan is for future iterations to address practical issues. I am speaking as an implementer. Changes in the spec are decided by a working group.

daisieh added 2 commits February 23, 2023 11:48

Update variation property to account for multiple alleles

b1fdde8

In order to capture zygosity (and genotype) in CaseLevelVariants completely, we need to be able to account for the situation where a caseLevelVariant contains two alternate alleles, neither of which is the reference.

Update defaultSchema.json

e45af95

Add examples of simple and compound heterozygosity

costero-e closed this Mar 28, 2023
