Schema Management #1380
Replies: 6 comments 7 replies
-
Some comments.

Regarding the transformations between schemas, I think one operation is missing.

Another thing: there seem to be various ways to transform one schema into another. For example, while we could be very clever about a minimal modification, doing a very crude deletion followed by an add would always also be a valid patch. Depending on what the user is actually doing, different sorts of patches may be suitable for transforming their instance data. Is the idea to allow the user to provide their own patch, which we then verify is indeed a valid patch for the sort of transformation they're trying to do?

Regarding prefix mapping, I'm not exactly sure what the idea is of mapping.

I'm very positive about keeping the URL in the metadata. This will allow interfaces to query those endpoints, figure out whether an update to such a schema is available, and notify the user of that.
-
Oh, one more thing before I forget: I was thinking "model products" would be a nice name for schema-only data products.
-
How does it currently work when a schema is changed mid-way? Would it not, in the meantime, be elegant to have the DB stop returning documents whose schema has been changed, until the developer/author updates the incompatible data to conform with the new schema?
-
I believe now that that will allow us to generate nice compressed objects on extraction, and also to ingest them without requiring transformations that add prefixes, etc.
-
More on Schema Management
I wanted to write down some additional thoughts I've had since beginning to implement schema migration.
Schema / Model products
Before any of schema migration can work, we really need to break model / schema management out of the data product into separate products. These will then have a heartbeat that allows changes to happen without requiring any knowledge of the actual data in a particular database, and will allow much more modular use of schemata.
The easiest way to do this is probably to create a new type of database which has a schema and a migration script. The migration script should be required to link one tagged version of a deployable / importable schema to the next. Tagging as deployable will require this information to be configured from some previous tag. However, this will allow intermediate updates without requiring a migration script. This will enable people to upgrade to a new schema version completely automatically, even if some changes must take place to the data (deletion, creation of edges).
Where should the migration script be kept? We could potentially keep it in the instance data, as this will not otherwise be populated. It might also be useful to keep around the upgrade information from a range of previous tags.
Schema Migration
Returning to schema migration, and the process for recording the operations: we need an operations language which can modify any schema into any other schema by a series of operations. The operations should be endowed with sufficient additional information to migrate instance data between them. This will allow us to have interactive updates to a schema through a UI, a completely code-first approach to schema migration, or a hybrid mode which shows the schema before and after, as well as the script. The operation language for a schema migration is as follows:
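As a sketch of what such a script could look like: the operation names here (AddProperty, DeleteProperty, MoveClass) are the ones defined later in this post, while the JSON framing and field names are only an assumption for illustration:

```json
[
  { "op": "AddProperty",    "class": "Person",   "property": "email", "default": "unknown" },
  { "op": "DeleteProperty", "class": "Person",   "property": "fax" },
  { "op": "MoveClass",      "from":  "Employee", "to": "Staff" }
]
```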
If, for instance, we expose a hybrid UI mode we might start with the initial "Before" schema, automatically construct the "After" schema in a read-only panel, and allow the user to update the script such that it might look like:
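Purely as an illustration (the encoding is an assumption, not a proposed format), the hybrid view might pair the two schema panels with an editable script:

```json
{
  "before": { "Person": { "name": "xsd:string" } },
  "after":  { "Person": { "name": "xsd:string", "email": "xsd:string" } },
  "script": [
    { "op": "AddProperty", "class": "Person", "property": "email", "default": "unknown" }
  ]
}
```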
We can make the process of script creation completely guided by a wizard in the UI: when adding a new property, you must specify the default value; when changing a class in the class hierarchy, you must say what to do about fields which appear or no longer appear. This can make for relatively painless updates. Operations could also be guessed from schema changes, but this produces an exponential number of possible answers, so we would probably only supply a single heuristic guess, or perhaps a script with "holes" which must be filled in.
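A minimal sketch of the single-heuristic-guess idea, assuming a toy schema encoding (a dict of class to {property: range}) and operation names from this post; holes are represented as `None` values the developer must fill in:

```python
# Sketch only: guess a migration script from a schema diff.
# Schema encoding (class -> {property: range}) is an assumption.

def guess_script(before, after):
    """Produce a single heuristic migration script; unknown values
    (e.g. defaults for new properties) are left as None 'holes'."""
    ops = []
    for cls in after:
        if cls not in before:
            ops.append(("AddClass", cls))
            continue
        for prop, rng in after[cls].items():
            if prop not in before[cls]:
                # Hole: a default value must be supplied for existing instances.
                ops.append(("AddProperty", cls, prop, None))
            elif before[cls][prop] != rng:
                ops.append(("MoveProperty", cls, prop, rng))
        for prop in before[cls]:
            if prop not in after[cls]:
                ops.append(("DeleteProperty", cls, prop))
    for cls in before:
        if cls not in after:
            ops.append(("DeleteClass", cls))
    return ops

before = {"Person": {"name": "xsd:string", "fax": "xsd:string"}}
after = {"Person": {"name": "xsd:string", "email": "xsd:string"}}
print(guess_script(before, after))
```

This deliberately emits only one guess rather than enumerating the exponential space of possible edit sequences.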
-
Further notes: Imports
The imports of a schema should contain:
The last element is important because we may need to do version upgrades, but a version upgrade that starts from the wrong schema could end up being incorrect and/or fail.
-
Schema Management
With the current architecture in TerminusDB, we have one schema which
is updated in lock step with the data.
This is good in the sense that schema changes may only be done if the
instance data is correct with respect to the schema. However, it
introduces a number of staging and schema management issues.
We would like a graceful upgrade of schema management to address these
problems such that we can deal with data product management
in-the-large.
Schema life-cycle
The life-cycle of a schema is different from that of a
data-product. For one, a schema can be used in multiple data
products. If we have, for instance, a schema of units, or of GeoJSON,
we will want to reuse it many times while building datasets.
For this reason we need schema data products which essentially live by
themselves, without the need to have any instance information. They
should be explicitly incapable of getting instance data.
These schemata could make effective use of Tags (refs which exist at a
commit) in their commit graph to ensure that we know which schema we
are talking about.
Adding these schema data products requires a backwards-compatible
upgrade to the system graph. It should not affect current data
products which just have commits with no referenced sources.
Upgrading Schemata
When you upgrade a schema in an independent, schema-only data product,
any change is acceptable: since there is no instance data, there is no
need to ensure that the instance data is still constrained to match
the schema.
However, when imposing a new schema on a data product, we need to
ensure that it remains schema controlled.
We need operations which allow us to impose a new schema, together
with whatever is necessary to fix up the data to conform to this schema.
Weakening the Schema
One avenue that ensures that a schema will always continue to be
correct for a given data product is to never strengthen, but only
weaken the schema. We call it weakening because it is strictly less
restrictive than the schema before it.
Weakening comprises:
… inheritance tree.
Modifying the Schema
If, however, we can't get away with merely weakening, we need some way
to perform changes to the schema, along with bulk changes to the
instance data.
Ideally, these operations should be capable of transitioning between
any arbitrary schema and any other schema.
AddProperty(ClassId,Property)
: adds a property to a class

DeleteProperty(ClassId,Property)
: deletes a property of a class

MoveProperty(ClassId,Property,NewRange)
: moves a property to a new range

AddClass(Class)
: adds a class to the leaves of the hierarchy

Inherit(ClassSub,ClassSuper)
: adds properties to other classes

MoveClass(ClassId1,ClassId2)
: moves a class in the hierarchy; this will cause new properties to be added or removed

DeleteClass(ClassId)
: deletes a class from the hierarchy

ModifyKey(ClassId,Key)
: key modification for a class

With these operations, we can describe all modifications as some
number of schema operations, together with a patch set (as long as
patches also have a move operation for objects, which they do not
currently).
Instance Modifications
Each of the schema modification operations yields some options
(perhaps only one) of what is required of the data to enable them.

AddProperty
: We need a default value, or one value for every instance.

DeleteProperty
: We need to delete every property and value from every class equal and below.

MoveProperty
: We need to move all data to the new range; if the range is incompatible, we need to supply alternative data (as with add).

AddClass
: No changes required.

InheritClass
: We need to add all of the class properties to every member below the class, by giving each a value, or a default value for each.

MoveClass
: We need to delete all properties below from the source, and add properties with values to all below the target, as well as move all instances from the old class to the new class, specifying new values (possibly defaults) for all properties newly inherited.

DeleteClass
: We need to delete all instances of the class, and all properties below.

ModifyKey
: We need to perform a recalculation of the ID, and an instance move, for every element of the class.
What might this look like?
The specification needs to be careful to refer only to precisely two
worlds: the world w₁ of the original schema, and the world w₂ of the
updated one. There should be no intermediate references, as these
operations need to form a single unitary transaction. It might be good
to somehow make this two-world scheme explicit in the operations.
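One way to make the two-world discipline concrete is to treat a migration as a pure function from world w₁ (schema plus instances) to world w₂; a minimal sketch, with all encodings assumed for illustration:

```python
# Sketch: a migration as a pure function from w1 to w2.
# No intermediate world is observable: the input is never mutated,
# and the output only exists once the whole function has succeeded.
import copy

def migrate(world, schema_ops, instance_fixups):
    schema, instances = copy.deepcopy(world)
    for op in schema_ops:
        schema = op(schema)
    for fix in instance_fixups:
        instances = fix(instances)
    return (schema, instances)  # w2, built in one step from w1

w1 = ({"Person": {"name": "xsd:string"}}, [{"@type": "Person", "name": "Jo"}])
w2 = migrate(
    w1,
    [lambda s: {**s, "Person": {**s["Person"], "email": "xsd:string"}}],
    [lambda docs: [{**d, "email": "unknown"} for d in docs]],
)
print(w2[1])
```

If any step raises, w₂ is simply never produced and w₁ is untouched, which is exactly the unitary-transaction behaviour described above.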
Multi-Schema
We also need a way to manage having more than one schema at the same
time. The current model is that we just build a single schema,
copy-pasting descriptions from whatever sources we have. This is less than ideal.
If schema management were independent, and we had a special schema data
product which could be imposed, we could manage change management for
the schema, as well as importing from other schema sources.
Import
It would also be extremely useful to know which schemas are imported
into a schema. This could be done by elaborating the context object
with additional information that tracks the source of changes.
Imported schemata should probably have a version tag, a commit and
possibly the data product source given as a URL so that it could
potentially be cloned by others at the same tag etc.
It might look something along the lines of:
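As a sketch of such a context object, with field names taken from the feature list that follows and everything else (nesting, URLs, values) assumed purely for illustration:

```json
{
  "@context": {
    "@imports": [
      {
        "@location": "https://example.com/org/geojson/schema",
        "@tag": "v1.2.0",
        "@prefixmap": { "geo": "http://example.com/geojson#" },
        "@renaming": { "Feature": "GeoFeature" },
        "@hiding": ["ObsoleteClass"],
        "@importing": ["Feature", "Geometry"]
      }
    ]
  }
}
```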
This demonstrates a number of proposed features:
@tag
: would get the appropriate tag from the import source.

@prefixmap
: allows us to specify how we would like prefixes interpreted in the imported schema. Retaining the fully qualified names is possible if we remap the @base and @schema prefixes, but again, this could create some difficulties for the interpretation of documents. Probably we can deal with this by allowing @context objects to be available anywhere in a document, lexically scoped as with JSON-LD.

@renaming
: would allow classes to change names from the source schema. This could obviously impose some compatibility issues with the interpretation of imports, so perhaps it should be avoided.

@hiding
: allows us to exclude a given class. This may need warnings or errors for ranges of unhidden classes.

@importing
: would allow only named classes to be imported. This may need warnings or errors for unimported classes which are ranges of imported classes.

@location
: gives us the origin of the schema so we can look it up if needed. This could even inform the user of updates.
Change tracking
In addition to the declarative information of sourcing that exists in
a schema, it is important that we track changes.
When we change the tag for an imported schema, add an imported schema,
or remove an imported schema, this change should result in a number of
schema change operations. These operations should be stored somewhere
so we can see what was required to move from one schema to another.
Each of these schema change operations gives some options about the
data migration, which will be easier to supply to the developer.
By way of example, a DeleteClass operation will yield a delete of
all instances, and the deletion of all properties downward in the
hierarchy, as the necessary patch to the instance data.
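A sketch of the patch implied by DeleteClass, assuming a toy child-to-parent encoding of the hierarchy; everything here is illustrative, not TerminusDB's actual representation:

```python
# Sketch: the instance patch implied by DeleteClass(cls).

def subclasses(hierarchy, cls):
    """All classes at or below `cls` in a child -> parent hierarchy map."""
    below = {cls}
    changed = True
    while changed:
        changed = False
        for child, parent in hierarchy.items():
            if parent in below and child not in below:
                below.add(child)
                changed = True
    return below

def delete_class_patch(instances, hierarchy, cls):
    """Deleting a class deletes every instance at or below it."""
    doomed = subclasses(hierarchy, cls)
    return [doc for doc in instances if doc["@type"] not in doomed]

hierarchy = {"Employee": "Person", "Manager": "Employee"}
docs = [{"@type": "Person"}, {"@type": "Manager"}, {"@type": "Address"}]
print(delete_class_patch(docs, hierarchy, "Person"))
```

The transitive walk matters: deleting Person must also remove Manager instances, two levels down the hierarchy.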
Schema Update User Journey
So what does all of this look like from a DX (Developer eXperience)
perspective?
… local org/db or a fully qualified one) + tag, or we will default
to a commit starting at the current head of main.
… insertion.
… additions.
… clear the fixup options available.
The commit log will make this appear as though everything has moved in
lock-step. But in fact, there is a staging here:
… updated according to fixups.
… schema.
Comments on Design
Schema migration is difficult in all data storage systems, but we'd
like to make it as clean as possible to update a schema without loss of
data.
The above is my best guess at how to do this prior to getting into the
weeds, so I'm very keen to have any ideas and input.
EDITED FROM ORIGINAL BASED ON COMMENTS