Allow "loose" referential integrity #1528
Comments
To close this issue, we should make it possible to add variant sets without first adding a reference set. This is possible because variants use the

One way to still allow a BAM to be queryable without adding a reference would be to create a synthetic reference based on the BAM index or headers and add it to the registry. VCF headers do not contain enough information to know which references are used, but by reflecting on the tabix index we might be able to do something similar.

For RNA, making the reference set optional presents no real difference in access pattern. I think there are some features that may only be present in specific gene builds, but that relational information is captured by the FeatureSetIDs.
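As a rough illustration of the synthetic reference idea, here is a minimal sketch that derives reference names and lengths from the `@SQ` lines of a SAM/BAM text header. The `SyntheticReference` class and `references_from_sam_header` function are hypothetical names made up for this example, not part of the server, and the header string is invented sample data. Only the `@SQ`, `SN` and `LN` tags come from the SAM format itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SyntheticReference:
    """Hypothetical stand-in for a reference that has no FASTA behind it."""
    name: str
    length: int


def references_from_sam_header(header_text):
    """Build synthetic references from the @SQ lines of a SAM/BAM text header."""
    references = []
    for line in header_text.splitlines():
        if not line.startswith("@SQ"):
            continue
        # Fields after the record type are tab-separated TAG:VALUE pairs.
        fields = dict(
            field.split(":", 1)
            for field in line.split("\t")[1:]
            if ":" in field
        )
        # SN is the sequence name and LN its length; both are required on @SQ.
        references.append(SyntheticReference(fields["SN"], int(fields["LN"])))
    return references


# Invented sample header, for illustration only.
header = (
    "@HD\tVN:1.6\tSO:coordinate\n"
    "@SQ\tSN:chr1\tLN:1000\n"
    "@SQ\tSN:chr2\tLN:500\n"
)
synthetic = references_from_sam_header(header)
```

A registry could then store these name and length pairs in place of a loaded reference set, which would be enough to validate the reference names used in read queries without any sequence data.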
Here is my view: Currently, we require that the server has the references loaded. We then use internally generated ids to refer to those references from other sets (tables). The problem is that it could be unrealistic to expect every server to have all the necessary references stored internally. We should be able to use references that are defined outside of the server. We would still like to maintain referential integrity. It just doesn't need to be enforced by the database. This can be done during ingest, ensuring that all reference ids are known, either internally or externally, much like ontology terms. Alternatively, referential integrity could be checked later or even as part of compliance. In the short term, we can turn off foreign key checks. Longer term, this should be addressed in the larger discussion involving external ids, ontology terms, federated queries, etc.
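To make the ingest-time check concrete, here is a small sketch using Python's standard `sqlite3` module. It assumes a SQLite backend; the table names, the `EXTERNAL_REFERENCE_SET_IDS` registry and the `add_variant_set` helper are all hypothetical and only illustrate the idea, not the server's actual schema. The database does not enforce the foreign key, and the ingest code checks that a reference set id is known either internally or externally.

```python
import sqlite3

# Hypothetical registry of reference set ids defined outside this server.
EXTERNAL_REFERENCE_SET_IDS = {"GRCh38"}


def open_database():
    conn = sqlite3.connect(":memory:")
    # Keep foreign key enforcement off so the database accepts rows whose
    # reference set is not stored locally.
    conn.execute("PRAGMA foreign_keys = OFF")
    conn.execute("CREATE TABLE ReferenceSet (id TEXT PRIMARY KEY)")
    conn.execute(
        "CREATE TABLE VariantSet ("
        " id TEXT PRIMARY KEY,"
        " referenceSetId TEXT REFERENCES ReferenceSet(id))"
    )
    return conn


def add_variant_set(conn, variant_set_id, reference_set_id=None):
    """Insert a variant set, checking the reference set id at ingest time."""
    if reference_set_id is not None:
        row = conn.execute(
            "SELECT 1 FROM ReferenceSet WHERE id = ?", (reference_set_id,)
        ).fetchone()
        known_internally = row is not None
        known_externally = reference_set_id in EXTERNAL_REFERENCE_SET_IDS
        if not (known_internally or known_externally):
            raise ValueError("unknown reference set id: %s" % reference_set_id)
    conn.execute(
        "INSERT INTO VariantSet (id, referenceSetId) VALUES (?, ?)",
        (variant_set_id, reference_set_id),
    )


conn = open_database()
add_variant_set(conn, "vs-no-reference")         # no reference set at all
add_variant_set(conn, "vs-external", "GRCh38")   # known only externally
```

The same check could instead run later as a separate audit or compliance step, since nothing in the database depends on it.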
Thanks @ejacox, that's a good summary of the situation. The feature is: I don't need a FASTA to load data into the server. The aspiration is to maintain easy access patterns when data are distributed.
We shouldn't impose a foreign key requirement on references when someone does not have them or doesn't want to manage them. This means that a VCF should be able to be added with no other data (other than a dataset) present in a server.
The problem then is that it is unclear which reference names to use when querying a variant set. From the server's perspective, that is a data management problem; a more full-fledged offering can be made by adding a reference set, but it shouldn't be required.