Schema Management #1380
Replies: 6 comments 7 replies
-
Some comments.

Regarding the transformations between schemas, I think one operation is missing.

Another thing: there seem to be various ways to transform one schema into another. For example, while we could be very clever about a minimal modification, doing a very crude deletion followed by an add would always also be a valid patch. Depending on what the user is actually doing, different sorts of patches may be suitable for transforming their instance data. Is the idea to allow the user to provide their own patch, which we then verify is indeed a valid patch for the sort of transformation they're trying to do?

Regarding prefix mapping, I'm not exactly sure what the idea is of mapping.

I'm very positive about keeping the URL in the metadata. This will allow interfaces to query those endpoints, figure out whether an update to such a schema is available, and notify the user of that.
-
Oh, one more thing before I forget: I was thinking "model products" would be a nice name for schema-only data products.
-
How does it currently work when a schema is changed mid-way? Would it not, in the meantime, be elegant to have the DB stop returning documents whose schema has been changed, until the developer/author updates the incompatible data to conform with the new schema?
-
I believe now that that will allow us to generate nice compressed objects on extraction, and also to ingest them without requiring transformations that add prefixes, etc.
-
More on Schema Management
I wanted to write down some additional thoughts I've had since beginning to implement schema migration.
Schema / Model products
Before any of schema migration can work, we really need to break model / schema management out of the data product into separate products. These will then have a heartbeat that allows changes to happen without requiring any knowledge of the actual data in a particular database, and will allow much more modular use of schemata.
The easiest way to do this is probably to create a new type of database which has a schema and a migration script. The migration script should be required to link one tagged version of a deployable / importable schema to the next. Tagging as deployable will require this information to be configured from some previous tag. However, this will allow intermediate updates without requiring a migration script. This will enable people to upgrade to a new schema version completely automatically, even if some changes must take place to the data (deletion, creation of edges).
Where should the migration script be kept? We could potentially keep it in the instance data, as this will not otherwise be populated. It might also be useful to keep around the upgrade information from a range of previous tags.
Schema Migration
Returning to schema migration, and the process for recording the operations: we need an operations language which can modify any schema into any other schema by a series of operations. The operations should be endowed with sufficient additional information to migrate instance data between them. This will allow us to have interactive updates to a schema through a UI, a completely code-first approach to schema migration, or a hybrid mode which shows the schema before and after, as well as the script. The operation language for a schema migration is as follows:
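As a sketch of what such a script could look like: the operation names here (AddProperty, DeleteProperty, MoveClass) are the ones defined later in this post, while the JSON framing and field names are only an assumption for illustration:

```json
[
  { "op": "AddProperty",    "class": "Person",   "property": "email", "default": "unknown" },
  { "op": "DeleteProperty", "class": "Person",   "property": "fax" },
  { "op": "MoveClass",      "from":  "Employee", "to": "Staff" }
]
```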
If, for instance, we expose a hybrid UI mode we might start with the initial "Before" schema, automatically construct the "After" schema in a read-only panel, and allow the user to update the script such that it might look like:
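Purely as an illustration (the encoding is an assumption, not a proposed format), the hybrid view might pair the two schema panels with an editable script:

```json
{
  "before": { "Person": { "name": "xsd:string" } },
  "after":  { "Person": { "name": "xsd:string", "email": "xsd:string" } },
  "script": [
    { "op": "AddProperty", "class": "Person", "property": "email", "default": "unknown" }
  ]
}
```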
We can make the process of script creation completely guided by a wizard in the UI: when adding a new property, you must specify the default value; when changing a class in the class hierarchy, you must say what to do about fields which appear or no longer appear. This can make for relatively painless updates. Operations could also be guessed from schema changes, but this produces an exponential number of possible answers, so we would probably only supply a single heuristic guess, or perhaps a script with "holes" which must be filled in.
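A minimal sketch of the single-heuristic-guess idea, assuming a toy schema encoding (a dict of class to {property: range}) and operation names from this post; holes are represented as `None` values the developer must fill in:

```python
# Sketch only: guess a migration script from a schema diff.
# Schema encoding (class -> {property: range}) is an assumption.

def guess_script(before, after):
    """Produce a single heuristic migration script; unknown values
    (e.g. defaults for new properties) are left as None 'holes'."""
    ops = []
    for cls in after:
        if cls not in before:
            ops.append(("AddClass", cls))
            continue
        for prop, rng in after[cls].items():
            if prop not in before[cls]:
                # Hole: a default value must be supplied for existing instances.
                ops.append(("AddProperty", cls, prop, None))
            elif before[cls][prop] != rng:
                ops.append(("MoveProperty", cls, prop, rng))
        for prop in before[cls]:
            if prop not in after[cls]:
                ops.append(("DeleteProperty", cls, prop))
    for cls in before:
        if cls not in after:
            ops.append(("DeleteClass", cls))
    return ops

before = {"Person": {"name": "xsd:string", "fax": "xsd:string"}}
after = {"Person": {"name": "xsd:string", "email": "xsd:string"}}
print(guess_script(before, after))
```

This deliberately emits only one guess rather than enumerating the exponential space of possible edit sequences.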
-
Further notes: Imports
The imports of a schema should contain:
The last element is important because we may need to do version upgrades, but a version upgrade that starts from the wrong schema could end up being incorrect and/or fail.
-
Schema Management
With the current architecture in TerminusDB, we have one schema which
is updated in lock step with the data.
This is good in the sense that schema changes may only be done if the
instance data is correct with respect to the schema. However, it
introduces a number of staging and schema management issues.
We would like a graceful upgrade of schema management to address these
problems such that we can deal with data product management
in-the-large.
Schema life-cycle
The life-cycle of a schema is different from that of a
data-product. For one, a schema can be used in multiple data
products. If we have, for instance, a schema of units, or of GeoJSON,
we will want to reuse it many times while building datasets.
For this reason we need schema data products which essentially live by
themselves, without the need to have any instance information. They
should be explicitly incapable of getting instance data.
These schemata could make effective use of Tags (refs which exist at a
commit) in their commit graph to ensure that we know which schema we
are talking about.
Adding these schema data products requires a backwards-compatible
upgrade to the system graph. It should not affect current data
products which just have commits with no referenced sources.
Upgrading Schemata
When you upgrade a schema in an independent, schema-only data product,
any change is acceptable: since there is no instance data, there is no
need to ensure that the instance data is still constrained to match
the schema.
However, when imposing a new schema on a data product, we need to
ensure that it remains schema controlled.
We need operations which allow us to impose a new schema, together
with whatever is necessary to fix up the data to conform to this schema.
Weakening the Schema
One avenue that ensures that a schema will always continue to be
correct for a given data product is to never strengthen, but only
weaken the schema. We call it weakening because it is strictly less
restrictive than the schema before it.
Weakening comprises:
… inheritance tree.
Modifying the Schema
If, however, we can't get away with merely weakening, we need some way
to perform changes to the schema, along with bulk changes to the
instance data.
Ideally, these operations should be capable of transitioning between
any arbitrary schema and any other schema.
AddProperty(ClassId,Property)
: adds a property to a class

DeleteProperty(ClassId,Property)
: deletes a property of a class

MoveProperty(ClassId,Property,NewRange)
: moves a property to a new range

AddClass(Class)
: adds a class to the leaves of the hierarchy

Inherit(ClassSub,ClassSuper)
: adds properties to other classes

MoveClass(ClassId1,ClassId2)
: moves a class in the hierarchy; this will cause new properties to be added or removed

DeleteClass(ClassId)
: deletes a class from the hierarchy

ModifyKey(ClassId,Key)
: key modification for a class

With these operations, we can describe all modifications as some
number of schema operations, together with a patch set (as long as
patches also have a move operation for objects, which they do not
currently).
Instance Modifications
Each of the schema modification operations yields some options
(perhaps only one) of what is required of the data to enable them.

AddProperty
: We need a default value, or one value for every instance.

DeleteProperty
: We need to delete every property and value from every class equal and below.

MoveProperty
: We need to move all data to the new range; if the range is incompatible, we need to supply alternative data (as with add).

AddClass
: No changes required.

InheritClass
: We need to add all of the class properties to every member below the class, by giving each a value, or a default value for each.

MoveClass
: We need to delete all properties below from the source, and add properties with values to all below the target, as well as move all instances from the old class to the new class, specifying new values (possibly defaults) for all properties newly inherited.

DeleteClass
: We need to delete all instances of the class, and all properties below.

ModifyKey
: We need to perform a recalculation of the ID, and an instance move, for every element of the class.
What might this look like?
The specification needs to be careful to refer only to precisely two
worlds: the world w₁ of the original schema, and the world w₂ of the
updated one. There should be no intermediate references, as these
operations need to form a single unitary transaction. It might be good
to somehow make this two-world scheme explicit in the operations.
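One way to make the two-world discipline concrete is to treat a migration as a pure function from world w₁ (schema plus instances) to world w₂; a minimal sketch, with all encodings assumed for illustration:

```python
# Sketch: a migration as a pure function from w1 to w2.
# No intermediate world is observable: the input is never mutated,
# and the output only exists once the whole function has succeeded.
import copy

def migrate(world, schema_ops, instance_fixups):
    schema, instances = copy.deepcopy(world)
    for op in schema_ops:
        schema = op(schema)
    for fix in instance_fixups:
        instances = fix(instances)
    return (schema, instances)  # w2, built in one step from w1

w1 = ({"Person": {"name": "xsd:string"}}, [{"@type": "Person", "name": "Jo"}])
w2 = migrate(
    w1,
    [lambda s: {**s, "Person": {**s["Person"], "email": "xsd:string"}}],
    [lambda docs: [{**d, "email": "unknown"} for d in docs]],
)
print(w2[1])
```

If any step raises, w₂ is simply never produced and w₁ is untouched, which is exactly the unitary-transaction behaviour described above.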
Multi-Schema
We also need a way to manage having more than one schema at the same
time. The current model is that we just build a single schema,
copy-pasting descriptions from whatever sources we have. This is less than ideal.
If schema management were independent, and we had a special schema data
product which could be imposed, we could manage change management for
the schema, as well as importing from other schema sources.
Import
It would also be extremely useful to know which schemas are imported
into a schema. This could be done by elaborating the context object
with additional information that tracks the source of changes.
Imported schemata should probably have a version tag, a commit and
possibly the data product source given as a URL so that it could
potentially be cloned by others at the same tag etc.
It might look something along the lines of:
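As a sketch of such a context object, with field names taken from the feature list that follows and everything else (nesting, URLs, values) assumed purely for illustration:

```json
{
  "@context": {
    "@imports": [
      {
        "@location": "https://example.com/org/geojson/schema",
        "@tag": "v1.2.0",
        "@prefixmap": { "geo": "http://example.com/geojson#" },
        "@renaming": { "Feature": "GeoFeature" },
        "@hiding": ["ObsoleteClass"],
        "@importing": ["Feature", "Geometry"]
      }
    ]
  }
}
```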
This demonstrates a number of proposed features:
@tag
: would get the appropriate tag from the import source.

@prefixmap
: allows us to specify how we would like prefixes interpreted in the imported schema. Retaining the fully qualified names is possible if we remap the @base and @schema prefixes, but again, this could create some difficulties for the interpretation of documents. Probably we can deal with this by allowing @context objects to be available anywhere in a document, lexically scoped as with JSON-LD.

@renaming
: would allow classes to change names from the source schema. This could obviously impose some compatibility issues with the interpretation of imports, so perhaps it should be avoided.

@hiding
: allows us to exclude a given class. This may need warnings or errors for ranges of unhidden classes.

@importing
: would allow only named classes to be imported. This may need warnings or errors for unimported classes which are ranges of imported classes.

@location
: gives us the origin of the schema so we can look it up if needed. This could even inform the user of updates.
Change tracking
In addition to the declarative information of sourcing that exists in
a schema, it is important that we track changes.
When we change the tag for an imported schema, add an imported schema,
or remove an imported schema, this change should result in a number of
schema change operations. These operations should be stored somewhere
so we can see what was required to move from one schema to another.
Each of these schema change operations gives some options about the
data migration, which will be easier to supply to the developer.
By way of example, a DeleteClass operation will yield a delete of
all instances, and the deletion of all properties downward in the
hierarchy, as the necessary patch to the instance data.
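A sketch of the patch implied by DeleteClass, assuming a toy child-to-parent encoding of the hierarchy; everything here is illustrative, not TerminusDB's actual representation:

```python
# Sketch: the instance patch implied by DeleteClass(cls).

def subclasses(hierarchy, cls):
    """All classes at or below `cls` in a child -> parent hierarchy map."""
    below = {cls}
    changed = True
    while changed:
        changed = False
        for child, parent in hierarchy.items():
            if parent in below and child not in below:
                below.add(child)
                changed = True
    return below

def delete_class_patch(instances, hierarchy, cls):
    """Deleting a class deletes every instance at or below it."""
    doomed = subclasses(hierarchy, cls)
    return [doc for doc in instances if doc["@type"] not in doomed]

hierarchy = {"Employee": "Person", "Manager": "Employee"}
docs = [{"@type": "Person"}, {"@type": "Manager"}, {"@type": "Address"}]
print(delete_class_patch(docs, hierarchy, "Person"))
```

The transitive walk matters: deleting Person must also remove Manager instances, two levels down the hierarchy.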
Schema Update User Journey
So what does all of this look like from a DX (Developer eXperience)
perspective?
… local org/db or a fully qualified one) + tag, or we will default
to a commit starting at the current head of main.
… insertion.
… additions.
… clear the fixup options available.
The commit log will make this appear as though everything has moved in
lock-step. But in fact, there is a staging here:
… updated according to fixups.
… schema.
Comments on Design
Schema migration is difficult in all data storage systems, but we'd
like to make it as clean as possible to update a schema without loss of
data.
The above is my best guess at how to do this prior to getting into the
weeds, so I'm very keen to have any ideas and input.
EDITED FROM ORIGINAL BASED ON COMMENTS