Thoughts on the ROADMAP For CoreDB #986

GavinMendelGleason (Maintainer) · 2022-02-10T14:12:35Z

ROADMAP For CoreDB

We have a number of options for continued development on CoreDB.

The following ideas have been on our backlog and need to be
prioritised.

Prefix Move

We have prefixes on our data products, but if you update them, all
data in the database is invalidated. Worse, the database probably will
not tell you this until you try to do something! This is a very
irritating "feature".

At minimum we should not allow prefix changes.

Better, we could make the prefix change operation perform a prefix move.
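
As a rough illustration of what a prefix move involves, here is a minimal sketch over an in-memory list of triples. The `move_prefix` helper and the tuple representation are assumptions made for the example, not how CoreDB stores data:

```python
from typing import Iterable, List, Tuple

Triple = Tuple[str, str, str]

def move_prefix(triples: Iterable[Triple], old: str, new: str) -> List[Triple]:
    """Rewrite every term that starts with `old` so it starts with `new`."""
    def rewrite(term: str) -> str:
        # A real store would distinguish IRIs from literals; this sketch
        # treats every term as a plain string.
        return new + term[len(old):] if term.startswith(old) else term
    return [(rewrite(s), rewrite(p), rewrite(o)) for s, p, o in triples]

triples = [
    ("http://old/data/Person/1", "http://old/schema#name", "Jane"),
    ("http://old/data/Person/1", "http://old/schema#friend", "http://old/data/Person/2"),
]
moved = move_prefix(triples, "http://old/data/", "http://new/data/")
```

Note that the move has to touch object positions as well as subjects, otherwise links between documents would dangle.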

ID Generation Fixes

Currently we have a number of subtle bugs in ID generation. Fixing
this is irritating because it means we have to introduce non-backwards
compatible changes.

Since these changes are not backward compatible, they will require
that we introduce a new database format. This might mean we have to
keep two formats in existence, or introduce a mandatory (possibly
hidden) upgrade path. Alternatively, we may provide an upgrade script
or allow manual migration by users.

If we force migration we will also potentially cause problems with
ID referencing from external systems.

Rebase

Our current rebase is no longer viable since the document upgrade. In
order for rebase to really work we need to structure it so that it
works for documents.

The features which need to be exposed are:

Patch synthesis

In order to carry out a rebase operation, we need to know which
patches to apply. To find out the patches to apply we need to perform
a diff between two documents. In order to perform a diff between two
documents we need to know which documents changed.

Currently we do not track which documents changed. Instead we track
which triples changed. From this it is technically possible to
determine the containing document, but it is not super efficient.

We can, however, perform a query to determine this, call the document
interface on the two commits, and perform a diff between the two
objects to synthesise a patch.
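
To make the diff step concrete, here is a minimal sketch that compares two versions of a document (as plain dictionaries) and emits a field-level patch. The `op`/`before`/`after`/`fields` patch shape is a hypothetical format chosen for the example, and lists, sets and arrays are compared as opaque values:

```python
from typing import Any, Dict

def diff_documents(before: Dict[str, Any], after: Dict[str, Any]) -> Dict[str, Any]:
    """Return a field-level patch that turns `before` into `after`."""
    patch: Dict[str, Any] = {}
    for key in sorted(before.keys() | after.keys()):
        if key not in after:
            patch[key] = {"op": "delete", "before": before[key]}
        elif key not in before:
            patch[key] = {"op": "insert", "after": after[key]}
        elif isinstance(before[key], dict) and isinstance(after[key], dict):
            # Subdocuments are diffed recursively.
            nested = diff_documents(before[key], after[key])
            if nested:
                patch[key] = {"op": "patch", "fields": nested}
        elif before[key] != after[key]:
            patch[key] = {"op": "swap", "before": before[key], "after": after[key]}
    return patch

old = {"@id": "Person/1", "name": "Jane", "age": 30, "address": {"city": "Dublin"}}
new = {"@id": "Person/1", "name": "Janine", "address": {"city": "Cork"}}
patch = diff_documents(old, new)
```

Recording the `before` value in each operation is what lets a later application step detect that the target document has drifted.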

We can later back-fill these operations with faster versions. First,
we can start recording which objects change (and potentially which
reads were performed to update them). We can treat this as a cache of
the query stored at the layer. This way, if we run into a question
about a layer we can fill it in as we go, allowing full backward
compatibility.
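
The "cache of the query stored at the layer" idea could look roughly like the following. This is only a sketch: `ChangedDocumentCache`, the layer ids and the `slow_query` callback are hypothetical names standing in for whatever the store actually provides.

```python
from typing import Callable, Dict, FrozenSet, Iterable

class ChangedDocumentCache:
    """Per-layer record of changed documents, back-filled on demand."""

    def __init__(self, slow_query: Callable[[str], Iterable[str]]):
        self._slow_query = slow_query            # triple-based fallback
        self._changed: Dict[str, FrozenSet[str]] = {}

    def record(self, layer_id: str, document_ids: Iterable[str]) -> None:
        # New layers record their changed documents at commit time.
        self._changed[layer_id] = frozenset(document_ids)

    def changed_documents(self, layer_id: str) -> FrozenSet[str]:
        # Old layers have no record, so run the slow query once and keep it.
        if layer_id not in self._changed:
            self._changed[layer_id] = frozenset(self._slow_query(layer_id))
        return self._changed[layer_id]

calls = []

def slow_query(layer_id: str):
    calls.append(layer_id)
    return {"Person/1"}

cache = ChangedDocumentCache(slow_query)
cache.record("layer-new", ["Person/2"])
first = cache.changed_documents("layer-old")
second = cache.changed_documents("layer-old")   # served from the cache
```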

Patch Application

Once we have patch synthesis we need to be able to apply a patch at a
layer. We need a code path which interprets a patch and writes the
required data into the graph. This can be used by users for skeletal
updates as well and could be quite convenient (with perhaps a little
syntactic sugar).
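
A minimal sketch of a patch interpreter, working on a plain dictionary rather than the graph. The patch shape (`op` being one of `insert`, `delete`, `swap` or `patch`, with `before`/`after`/`fields`) is a hypothetical format used only for illustration:

```python
import copy
from typing import Any, Dict

class PatchConflict(Exception):
    """Raised when the document does not match what the patch expects."""

def apply_patch(document: Dict[str, Any], patch: Dict[str, Any]) -> Dict[str, Any]:
    """Return a patched copy of `document`; the input is left untouched."""
    result = copy.deepcopy(document)
    for key, change in patch.items():
        op = change["op"]
        if op == "insert":
            if key in result:
                raise PatchConflict(f"{key} already present")
            result[key] = change["after"]
        elif op in ("delete", "swap"):
            # Only proceed if the current value is the one the patch saw.
            if key not in result or result[key] != change["before"]:
                raise PatchConflict(f"{key} does not match expected value")
            if op == "delete":
                del result[key]
            else:
                result[key] = change["after"]
        elif op == "patch":
            if not isinstance(result.get(key), dict):
                raise PatchConflict(f"{key} is not a subdocument")
            result[key] = apply_patch(result[key], change["fields"])
        else:
            raise ValueError(f"unknown operation: {op}")
    return result

document = {"@id": "Person/1", "name": "Jane", "age": 30, "address": {"city": "Dublin"}}
patch = {
    "name": {"op": "swap", "before": "Jane", "after": "Janine"},
    "age": {"op": "delete", "before": 30},
    "address": {"op": "patch", "fields": {"city": {"op": "swap", "before": "Dublin", "after": "Cork"}}},
}
updated = apply_patch(document, patch)
```

A skeletal update is then just a hand-written patch that mentions only the fields the user wants to change.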

Merge

Rebase is relatively easy in the sense that we can almost do it with
the pieces we already have. Unfortunately rebase is not always what
you want. Sometimes a merge would be a better option.

Merges are constructed from the confluence of multiple histories. And
this is where we run into a bit of trouble.

Currently our databases are trees and not DAGs. That is, we can
branch, but a commit never has more than one parent. This means that
all commits have a single history.

When you have a merge commit, you have more than one history. This
poses difficulties for:

History: how do we know how to perform historical operations such as
squash, delta-rollup, etc.?
Representation: which delta is in the layer for this merge? Both? Or
the whole database after the merge?
Patch synthesis: which patch history do you want to synthesise
from?

These problems require an implementation strategy.
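
To see why a second parent complicates things, here is a small sketch with a made-up commit graph (commit ids and the `parents` mapping are purely illustrative). In a tree every commit has exactly one path back to the root; once a merge commit exists, there are several:

```python
from typing import Dict, List

# Hypothetical commit graph: commit id -> list of parent ids.
parents: Dict[str, List[str]] = {
    "c0": [],
    "c1": ["c0"],
    "c2": ["c1"],        # tip of one branch
    "c3": ["c1"],        # tip of another branch
    "m": ["c2", "c3"],   # merge commit with two parents
}

def histories(commit: str, parents: Dict[str, List[str]]) -> List[List[str]]:
    """Enumerate every path from `commit` back to the root, newest first."""
    if not parents[commit]:
        return [[commit]]
    return [
        [commit] + rest
        for parent in parents[commit]
        for rest in histories(parent, parents)
    ]

single = histories("c2", parents)   # one linear history
merged = histories("m", parents)    # two histories to choose between
```

Any operation that currently walks "the" history (squash, rollup, patch synthesis) would have to say which of these paths it means, or how to combine them.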

Document Query

Our document query allows simple recovery of documents from the
document interface. However, at the moment it is incomplete and a bit
buggy.

First, it would be good if we dealt with all of our container types
appropriately: lists, sets and arrays.

Second, it would be nice if we had some way to utilise our document
queries in our inserts, updates and deletes.

Third, it would be handy if you could do queries against unfoldings of
the graph - this would give us an advantage over document stores such
as MongoDB in terms of expressivity (no need to do weird joins).

Schema

Schema Development

Currently our schemata are embedded in our commits. This is awkward
because schema development often feels like quite a different
operation, and it would be nice if you could muck about with a schema
before applying it to a dataset. Allowing this sort of schema design
would require that our commit objects point to an external schema
commit.

This could be done in a backward compatible way if we simply added a
different kind of commit type which points to an external schema
commit. However it might require some thought about how to expose this
in the endpoints.

Schema Import

Currently there is no way to import different schemas. This is a
real shame since the advantage of our URI based semantic web inspired
schema design system is that it should be composable. In the case of
GeoJSON you can see big advantages to being able to import a schema
which just implements this correctly for you. But there are many other
potential examples as well.

Remotes

Since moving to TerminusX and away from TerminusHub we've neglected
remotes. This is a shame as it is a really big advantage of TerminusDB
over other databases. We should have some sort of strategy to bring
back Push/Pull/Clone onto TerminusX and perhaps in the future begin to
surface some of the more public aspects of dataset construction such
as schema browsing etc.

PR / CR

We currently lack a pull-request/change-request architecture. This is
going to be necessary for VersionXL, but it's probably also necessary
for a lot of other clients such as Seshat.

With a PR infrastructure built on top of rebase, we could really
start to enable data curation in the large.

What is the best way of implementing this? Does the PR go in the
commit graph? Is this an external meta-dataproduct?

Pub / Sub

TerminusDB provides some advantages in creating highly distributed
data applications. However, we would be a lot more flexible and
composable if we had some mechanism of Pub / Sub. This would allow
others to be informed of updates and to take actions on the basis of
these, allowing complex git-like workflows to be built with other
applications.
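
In its simplest form the mechanism is just a registry of callbacks keyed by branch. The sketch below is in-process only, and `CommitBus`, the event fields and the branch names are all hypothetical; a real version would presumably deliver events over the network:

```python
from collections import defaultdict
from typing import Callable, DefaultDict, Dict, List

Handler = Callable[[Dict[str, str]], None]

class CommitBus:
    """Notify subscribers whenever a commit lands on a branch."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, branch: str, handler: Handler) -> None:
        self._subscribers[branch].append(handler)

    def publish(self, branch: str, commit_id: str) -> None:
        event = {"branch": branch, "commit": commit_id}
        # Copy the list so handlers may subscribe during delivery.
        for handler in list(self._subscribers.get(branch, [])):
            handler(event)

received: List[Dict[str, str]] = []
bus = CommitBus()
bus.subscribe("main", received.append)
bus.publish("main", "c1")
bus.publish("dev", "c2")   # nobody is listening on "dev"
```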

matko (Maintainer) · 2022-02-11T12:18:56Z

I'd also like to add some things.

Document Query

Improve speed

The document query mechanism is very naive, often basically defaulting to a linear scan over all document URIs of a particular type, then discarding whichever ones don't match the further criteria. There are many ways this can be optimized, and a lot of them are not super difficult.

Utilize key uniqueness

When key fields are being queried, we can cut out a lot of potential matches just by querying for document uris with those key fields first. If all are present, we should only get one (or no) document back.
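
A minimal sketch of the idea, using an in-memory dictionary of documents and a hypothetical index from key-field values to document id (none of these names come from the actual implementation):

```python
from typing import Any, Dict, List, Tuple

documents = {
    "Person/jane_1990": {"name": "jane", "birth_year": 1990, "city": "Dublin"},
    "Person/joe_1985": {"name": "joe", "birth_year": 1985, "city": "Cork"},
}
key_fields = ("name", "birth_year")

def matches(doc: Dict[str, Any], criteria: Dict[str, Any]) -> bool:
    return all(doc.get(k) == v for k, v in criteria.items())

def scan(criteria: Dict[str, Any]) -> List[str]:
    # Current behaviour: look at every document.
    return [doc_id for doc_id, doc in documents.items() if matches(doc, criteria)]

def build_key_index() -> Dict[Tuple[Any, ...], str]:
    return {tuple(doc[f] for f in key_fields): doc_id for doc_id, doc in documents.items()}

def lookup(index: Dict[Tuple[Any, ...], str], criteria: Dict[str, Any]) -> List[str]:
    if all(f in criteria for f in key_fields):
        # All key fields given: at most one candidate, found without a scan.
        doc_id = index.get(tuple(criteria[f] for f in key_fields))
        if doc_id is None:
            return []
        return [doc_id] if matches(documents[doc_id], criteria) else []
    return scan(criteria)

index = build_key_index()
by_key = lookup(index, {"name": "jane", "birth_year": 1990})
by_key_mismatch = lookup(index, {"name": "jane", "birth_year": 1990, "city": "Cork"})
fallback = lookup(index, {"city": "Cork"})
```

The remaining non-key criteria still have to be checked, but only against the single candidate.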

Operation reordering

We are currently very naive about our order of operations, which results in a lot of duplication of effort. In particular, for documents for which we match subdocuments, we will first look up a document uri, then traverse into the subdocument to find out if it matches. Often, a far better order would be to first find subdocuments that match the criteria, and then the documents which they are attached to.
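
The two orders can be compared in a small sketch. The data and the two lookup tables (`addresses_by_city`, `owner_of`) are assumptions; they stand in for the value and reverse-link lookups a triple store can answer directly:

```python
from typing import Dict, List

people = {
    "Person/1": {"name": "Jane", "address": "Address/1"},
    "Person/2": {"name": "Joe", "address": "Address/2"},
    "Person/3": {"name": "Ann", "address": "Address/3"},
}
addresses = {
    "Address/1": {"city": "Dublin"},
    "Address/2": {"city": "Cork"},
    "Address/3": {"city": "Dublin"},
}
# Assumed lookups: subdocuments by field value, and subdocument -> parent.
addresses_by_city: Dict[str, List[str]] = {
    "Dublin": ["Address/1", "Address/3"],
    "Cork": ["Address/2"],
}
owner_of = {"Address/1": "Person/1", "Address/2": "Person/2", "Address/3": "Person/3"}

def documents_first(city: str) -> List[str]:
    # Visit every person, then traverse into its subdocument to test it.
    return sorted(
        pid for pid, person in people.items()
        if addresses[person["address"]]["city"] == city
    )

def subdocuments_first(city: str) -> List[str]:
    # Find matching subdocuments, then walk back to their parents.
    return sorted(owner_of[aid] for aid in addresses_by_city.get(city, []))

slow = documents_first("Dublin")
fast = subdocuments_first("Dublin")
```

Both orders give the same answer; the second only touches the subdocuments that match, which matters when the criterion is selective.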

Parallelization

Querying is fully serial. It could easily be parallelized in various ways. First, all document matching is independent and could therefore be done multi-threaded. Second, if a query is compiled into a proper plan, we could run query steps in parallel (for example, one thread could find all subdocuments matching criteria and send them on, and a second thread could receive those and further match containing documents).
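
The first point (independent document matching) can be sketched with a thread pool. This is only an illustration of the shape: the names are made up, and in CPython threads will not speed up CPU-bound matching, so treat it as pseudocode for whatever the real runtime offers:

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Dict, List

documents = {
    "Person/1": {"name": "Jane", "city": "Dublin"},
    "Person/2": {"name": "Joe", "city": "Cork"},
    "Person/3": {"name": "Ann", "city": "Dublin"},
}

def matches(doc: Dict[str, Any], criteria: Dict[str, Any]) -> bool:
    return all(doc.get(k) == v for k, v in criteria.items())

def parallel_match(criteria: Dict[str, Any], workers: int = 4) -> List[str]:
    ids = list(documents)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each document is checked independently; map() preserves input order.
        flags = list(pool.map(lambda i: matches(documents[i], criteria), ids))
    return [doc_id for doc_id, ok in zip(ids, flags) if ok]

dubliners = parallel_match({"city": "Dublin"})
```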

Statistics

A lot of smart query reordering and plan compilation would benefit greatly from statistics, such as cardinalities for various predicates. We could investigate how to utilize that properly.
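
As a small example of how cardinalities could drive reordering, the sketch below sorts query patterns so the most selective one runs first. The numbers and the `(predicate, value)` pattern shape are invented for illustration:

```python
from typing import Dict, List, Tuple

Pattern = Tuple[str, str]   # (predicate, value)

# Hypothetical statistics: estimated number of matches per pattern.
estimated_matches: Dict[Pattern, int] = {
    ("rdf:type", "Person"): 100_000,
    ("city", "Dublin"): 500,
    ("email", "jane@example.com"): 1,
}

def order_by_selectivity(patterns: List[Pattern]) -> List[Pattern]:
    # Patterns without statistics are assumed expensive and run last.
    return sorted(patterns, key=lambda p: estimated_matches.get(p, float("inf")))

plan = order_by_selectivity([
    ("rdf:type", "Person"),
    ("city", "Dublin"),
    ("email", "jane@example.com"),
    ("nickname", "jj"),
])
```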

matko (Maintainer) · 2022-02-11T12:35:55Z

WOQL Queries

Smarter interaction with document queries

The WOQL words for querying documents currently resolve the entire document into a dictionary. Depending on the rest of the query, this may not be necessary at all, as someone may be looking up a document just because they're interested in a handful of fields. We should be more selective about how much we actually resolve, making it more viable to write WOQL queries as combinations of document queries plus extra magic.
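
A minimal sketch of selective resolution over a flat list of triples. The `resolve_document` helper and its `fields` parameter are hypothetical, not existing WOQL words:

```python
from typing import Any, Dict, Iterable, Optional, Set, Tuple

Triple = Tuple[str, str, Any]

def resolve_document(
    triples: Iterable[Triple], doc_id: str, fields: Optional[Set[str]] = None
) -> Dict[str, Any]:
    """Build a dictionary for `doc_id`, restricted to `fields` if given."""
    doc: Dict[str, Any] = {"@id": doc_id}
    for subject, predicate, value in triples:
        if subject == doc_id and (fields is None or predicate in fields):
            doc[predicate] = value
    return doc

triples = [
    ("Person/1", "name", "Jane"),
    ("Person/1", "age", 30),
    ("Person/1", "city", "Dublin"),
    ("Person/2", "name", "Joe"),
]
full = resolve_document(triples, "Person/1")
partial = resolve_document(triples, "Person/1", fields={"name"})
```

If the rest of the query only binds `name`, the query compiler could pass that field set down rather than resolving the whole document.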
