Update variation property to account for multiple alleles #58

Closed
daisieh wants to merge 2 commits


@daisieh daisieh commented Feb 23, 2023

Addresses issue #57. In order to capture zygosity (and genotype) in CaseLevelVariants completely, we need to be able to account for the situation where a caseLevelVariant contains two alternate alleles, neither of which is the reference. I would recommend requiring the first element of a variations array, element 0, to be the reference allele, and subsequent alternate alleles to be numbered accordingly. Then zygosity can be represented as in the beacon-ri implementation:

            "caseLevelData": [
              {
                "zygosity": {
                  "label": "0/1",
                  "id": "GENO:GENO_0000458"
                },
                "biosampleId": "HG03770"
              }
            ],

with the labeling schema extended in the style of VCF, with values like 1/2.

More specifically, this allows for the specification of the GENO:0000402 value for zygosity:

compound heterozygous: A heterozygous quality inhering in a single locus complement comprised of two different variant alleles and no wild type locus. (e.g. fgf8a/fgf8a)
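
The labeling convention proposed above can be sketched in a few lines of Python. This is only an illustration, assuming index 0 is the reference allele and higher indices are alternates, as in VCF; the function name `classify_genotype` and the returned strings are hypothetical and are not part of the Beacon schema or of the GENO ontology.

```python
def classify_genotype(label: str) -> str:
    """Classify a VCF-style diploid genotype label such as "0/1" or "1|2".

    Assumption: allele index 0 is the reference and any higher index is an
    alternate allele. The category names returned here are illustrative.
    """
    # Treat phased ("|") and unphased ("/") separators the same way.
    alleles = label.replace("|", "/").split("/")
    if len(alleles) != 2 or not all(a.isdigit() for a in alleles):
        raise ValueError(f"expected a called diploid genotype, got {label!r}")

    first, second = (int(a) for a in alleles)
    if first == second:
        return "homozygous reference" if first == 0 else "homozygous alternate"
    if 0 in (first, second):
        # One reference allele and one alternate allele.
        return "simple heterozygous"
    # Two different alternate alleles and no reference allele.
    return "compound heterozygous"
```

Under these assumptions a label of `1/2` falls in the compound heterozygous case, which a label vocabulary limited to `0/1` and `1/1` cannot express.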

In order to capture zygosity (and genotype) in CaseLevelVariants completely, we need to be able to account for the situation where a caseLevelVariant contains two alternate alleles, neither of which is the reference.
Add examples of simple and compound heterozygosity
Collaborator

mrueda commented Feb 24, 2023

Thank you for the suggestion. In addition to the zygosity object, I think that modifying other genomicVariations fields (e.g., identifiers, molecularAttributes, etc.) would also be necessary for multiallelic sites.

Author

daisieh commented Feb 24, 2023

I can add those changes, but as I was about to make them, I realized that one of the problems I'm having is that I'm still not completely clear on whether or not genomicVariant is meant to represent genotype-level data at all: it's not clear from the top-level description, "Schema for a genomic variant entry."

If genomicVariant is meant to represent everything that could be in a VCF record, then caseLevelData would include sample-level diploid data, such as in the beacon-ri example, and then variation would have to be an array, with identifiers, molecularAttributes, etc., also following as arrays. Alternatively, these could be nested into an array of objects, with each of the properties being represented per object.

However, the alternative scenario is that genomicVariant is only meant to represent a single variation each, in which case sample-level diploid data would not be represented in here at all, and there would need to be a completely separate endpoint(?) to represent sample-level genotypic data, possibly with reference to variations by ID.
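
The two scenarios above can be contrasted with a rough sketch of the data shapes involved. Everything here is hypothetical: the keys `variations`, `allele`, `role`, and `variantIds`, the IDs `var-1` and `var-2`, and the sample ID `SAMPLE-1` are invented for illustration and do not come from the Beacon v2 schema; only `caseLevelData`, `zygosity`, `label`, and `biosampleId` appear in the example quoted earlier in this thread.

```python
# Shape A (hypothetical): one entry per site, with every allele nested in an
# array of objects. Element 0 is the reference allele, as proposed above.
site_entry = {
    "variations": [
        {"allele": "A", "role": "reference"},
        {"allele": "C", "role": "alternate"},
        {"allele": "T", "role": "alternate"},
    ],
    "caseLevelData": [
        {"biosampleId": "SAMPLE-1", "zygosity": {"label": "1/2"}},
    ],
}


def alleles_for_case(entry: dict, case: dict) -> list:
    """Resolve a case's genotype label to the alleles it points at."""
    indices = case["zygosity"]["label"].replace("|", "/").split("/")
    return [entry["variations"][int(i)]["allele"] for i in indices]


# Shape B (hypothetical): one entry per alternate allele, with sample-level
# genotype data kept in a separate record that refers to variants by ID.
variant_entries = {
    "var-1": {"reference": "A", "alternate": "C"},
    "var-2": {"reference": "A", "alternate": "T"},
}
genotype_record = {"biosampleId": "SAMPLE-1", "variantIds": ["var-1", "var-2"]}
```

In shape A, `alleles_for_case(site_entry, site_entry["caseLevelData"][0])` resolves the `1/2` label to the two alternate alleles directly. In shape B the same information is only recoverable by joining the genotype record against the separate variant entries.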

Collaborator

mrueda commented Feb 25, 2023

Beacon v2 has a different function than VCFs. The Beacon v2 specification was built to facilitate data discovery (or semantic interoperability), whereas the VCF specification is meant for data analysis, storage or sharing.

As you may have noticed, in the current version of the Beacon specification, genomicVariations has properties that were created to unequivocally identify biallelic variants (e.g., genomicHGVSId). Allowing for multiallelic variants implies changing the specification for these properties.

There are other important factors to consider, such as the lack of a term/property to store variant quality or depth, either at the variant level or at the genotype (GT) level.

As an implementer, my suggestion would be to simply split your multiallelic VCF records into biallelic ones and go from there. Another valid option is to use the beacon2-ri-tools software (which I developed), which will perform the VCF to JSON transformation for you, including the split to biallelic.
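
For readers unfamiliar with the splitting step, here is a minimal Python sketch of what turning one multiallelic record into biallelic ones involves. It is not how beacon2-ri-tools or any other tool implements it; the function name `split_multiallelic` and the choice to mark an allele that belongs to a different split record as missing (`.`) are assumptions made for this illustration.

```python
def split_multiallelic(ref: str, alts: list, genotype: str) -> list:
    """Split one multiallelic site into one biallelic record per ALT allele.

    Assumption: within each split record, a call pointing at a different ALT
    allele is rewritten as missing ("."). Other conventions exist.
    """
    sep = "|" if "|" in genotype else "/"
    calls = genotype.split(sep)

    records = []
    for alt_index, alt in enumerate(alts, start=1):
        remapped = []
        for call in calls:
            if call == "0":
                remapped.append("0")   # reference stays reference
            elif call == str(alt_index):
                remapped.append("1")   # this record's ALT becomes allele 1
            else:
                remapped.append(".")   # another ALT, or already missing
        records.append({"ref": ref, "alt": alt, "genotype": sep.join(remapped)})
    return records
```

With this convention, a `1/2` genotype at a site with ALT alleles `C` and `T` becomes `1/.` in the `C` record and `./1` in the `T` record, so the fact that both alternate alleles occur in the same sample has to be reconstructed across the two records.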

Hope this helps.

Thx,

m

Author

daisieh commented Feb 25, 2023

So Beacon is not meant to facilitate discovery of genotypic data at all? That seems odd, since there is quite a lot of schema devoted to individuals and cases.

Collaborator

mrueda commented Feb 25, 2023

The purpose of Beacon v2 is to facilitate data discovery (genomic data and phenoclinic data).

Author

daisieh commented Feb 25, 2023

But isn't genotypic data, like cases of compound heterozygosity, something that one might want to discover? Beacon just won't address that at all?

Collaborator

mrueda commented Feb 25, 2023

The current version is 2.0, which was achieved through a huge community effort. The plan is for future iterations to address practical issues. I am speaking as an implementer. Changes in the spec are decided by a working group.

@costero-e costero-e closed this Mar 28, 2023
3 participants