scalability

Bob Dionne edited this page Dec 21, 2023 · 2 revisions

Proposed Scalability solutions

Generally the main idea is to keep the entire ontology in a database, ideally a triple store, and retrieve only what is needed to support the modelers' various usage scenarios. As they use the navigation panel, select classes to view or edit, execute associative searches, and so on, the database would be queried. Edits and additions of new classes would still be tracked separately as changesets on the server, in order to support the reconciliation workflow. The general outline of how this could work is:

  • Put the ontology in a database or triple store on the server, and have all client requests access this database, either directly or through the protege-server.
  • No longer enforce MVCC with a revision number. Clients are free to edit and commit changes to the server as they occur. When users connect, open a project, retrieve data from the database, and make and commit edits, those edits not only update the database but are also still captured as changesets.
  • Continue to track the changes modelers make in the form of changesets. In other words, the protege-server will both update the database and write changesets. The changesets may continue to be stored in files, or possibly also in the database for convenience.
  • Use the changesets for the workflow review process. They will still power the revision history panel and be used to resolve conflicts. Squashing then becomes a matter of dropping the changesets and/or making changes to the database to resolve conflicts.
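The commit flow described above can be sketched as follows. This is an illustrative model only, not actual protege-server code: `OntologyServer` and `Changeset` are hypothetical names, and a plain dictionary stands in for the triple store. The point is that each commit both applies the edit to the backing store (with no revision check, so the last edit wins) and appends a changeset for the review workflow.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class Changeset:
    # Hypothetical record of one commit: who made it and what changed.
    user: str
    changes: List[Tuple[str, str, str]]  # (op, subject, value)

@dataclass
class OntologyServer:
    # A dict stands in for the triple store; a list stands in for
    # changeset files on the server.
    store: Dict[str, str] = field(default_factory=dict)
    history: List[Changeset] = field(default_factory=list)

    def commit(self, user: str, changes: List[Tuple[str, str, str]]) -> None:
        # No MVCC revision check: commits are applied as they arrive,
        # so the last edit wins in the store.
        for op, subject, value in changes:
            if op == "add":
                self.store[subject] = value
            elif op == "remove":
                self.store.pop(subject, None)
        # Every commit is still captured as a changeset, so the revision
        # history panel and reconciliation workflow keep working.
        self.history.append(Changeset(user, changes))

server = OntologyServer()
server.commit("alice", [("add", ":Melanoma rdfs:label", "Melanoma")])
server.commit("bob", [("add", ":Melanoma rdfs:label", "Malignant melanoma")])
# The store now holds bob's (last) edit; history retains both commits
# so the manager can later reconcile them.
```

Squashing, in this sketch, would simply discard entries from `history` (and optionally patch `store`) once the review is complete.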

Note that the modelers will never hold the entire database in memory, so it makes little sense to enforce any sequentiality. They will see each other's edits and can perhaps resolve conflicts as they encounter them, but the last edit wins. Later, during reconciliation, the manager will use the revision history to resolve semantic conflicts.

Issues

The main challenge is to preserve the existing application and UI as they currently are. The modelers have incorporated many of the complex features the application supports into their workflows and are highly dependent on them.

The existing system is built on top of an object layer called protege-owl. This layer in turn sits on top of the OWLAPI, a core foundational component. Both are quite large in terms of surface area, lines of code, and legacy dependencies; much of this software is 20 years old.

OWL has a standard RDF serialization, so there is a natural way for an OWL file of axioms to be parsed as RDF triples and stored in a triple store. However, OWL is a higher-level language. The collection of OWL axioms that comprise the definition of a class, its relations through object properties to other classes, and its annotation properties may serialize to hundreds of RDF triples. This works well for the in-memory version: as the triples in an OWL file are parsed and processed, the relevant OWL objects can be built up, references cached, anonymous nodes created as needed, and so on.
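To illustrate the expansion, a single OWL axiom such as "Melanoma is a subclass of things that have some site Skin" becomes several triples under the standard OWL-to-RDF mapping, with a blank node for the restriction (the class and property names here are made up for the example):

```turtle
:Melanoma rdfs:subClassOf _:x .
_:x rdf:type owl:Restriction .
_:x owl:onProperty :hasSite .
_:x owl:someValuesFrom :Skin .
```

A full class definition with many such restrictions plus annotation properties multiplies this quickly, which is why reassembling OWL objects from a triple store requires fetching whole neighborhoods rather than single triples.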

Replacing this in-memory version with one based on a triple store could be a major rework of both the protege-owl layer and the implementation of the OWLAPI. The access patterns needed to support scalable terminologies require new SPARQL queries that retrieve both specific collections of OWL objects and the local neighborhoods of those objects.
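A query for such a local neighborhood might look roughly like the following sketch: fetch every triple whose subject is the selected class, and optionally one more hop into any blank nodes it points to (such as the restriction nodes produced by the OWL-to-RDF mapping). The prefix and class name are placeholders, and a real implementation would likely need several such queries, or a deeper traversal, to rebuild complete OWL objects.

```sparql
PREFIX :     <http://example.org/onto#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?p ?o ?bp ?bo WHERE {
  # All triples about the selected class...
  :Melanoma ?p ?o .
  # ...plus one hop into blank nodes (e.g. owl:Restriction structures).
  OPTIONAL {
    FILTER(isBlank(?o))
    ?o ?bp ?bo .
  }
}
```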

Currently we are investigating possible ways to modify protege-owl and the OWLAPI to support more scalability, whilst preserving the existing UI and feature set.