Serialise as EMMO datasets #796

jesper-friis · 2024-02-21T18:05:08Z

Description

This PR implements a new dlite.dataset module for serialisation of DLite datamodels and instances to an RDF representation based on EMMO following the representation shown in the figure below.

The main interface is exposed by four new functions:

add_dataset(): stores datamodel+mappings to a triplestore
add_data(): stores an instance (or datamodel) to a triplestore
get_dataset(): loads datamodel+mappings from a triplestore
get_data(): loads an instance (or datamodel) from a triplestore

Question: Is the naming of these functions understandable? The term dataset comes from EMMO, but as a user of DLite it may be confusing. Maybe save_datamodel(), save_instance(), load_datamodel(), load_instance() would be more intuitive?

Two tests are added using a datamodel matching what is shown in the figure.

test_dataset1_save.py first loads a FluidData datamodel and documents it semantically with the following mappings:
```
mappings = [
  (FLUID,                  EMMO.isDescriptionFor, EMMO.Fluid),
  (FLUID.LJPotential,      MAP.mapsTo,            EMMO.String),
  (FLUID.LJPotential,      EMMO.isDescriptionFor, EMMO.MolecularEntity),
  (FLUID.TemperatureField, MAP.mapsTo,            EMMO.ThermodynamicTemperature),
  (FLUID.ntimes,           MAP.mapsTo,            EMMO.Time),
  (FLUID.npositions,       MAP.mapsTo,            EMMO.Position),
]
```
Note the use of emmo:isDescriptionFor relations in the mappings. They are stored as-is in the triplestore.
The map:mapsTo relations are translated to rdfs:subClassOf when serialised to the triplestore.
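
The translation rule just described can be sketched in a few lines of plain Python. This is only an illustration of the rule, not the dlite.dataset implementation: the function name `translate_mappings` is hypothetical, and prefixed names are used as plain strings in place of full IRIs.

```python
# Illustrative sketch only; not the dlite.dataset implementation.
# Prefixed names stand in for full IRIs to keep the example short.
MAPS_TO = "map:mapsTo"
SUBCLASS_OF = "rdfs:subClassOf"


def translate_mappings(mappings):
    """Return the triples as they would be stored.

    map:mapsTo is replaced by rdfs:subClassOf; every other
    predicate (e.g. emmo:isDescriptionFor) is kept as-is.
    """
    return [
        (s, SUBCLASS_OF if p == MAPS_TO else p, o)
        for s, p, o in mappings
    ]


mappings = [
    ("fluid:LJPotential", "emmo:isDescriptionFor", "emmo:MolecularEntity"),
    ("fluid:TemperatureField", "map:mapsTo", "emmo:ThermodynamicTemperature"),
]
stored = translate_mappings(mappings)
for triple in stored:
    print(triple)
```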

Then it uses the add_dataset() function in the new dlite.dataset module and stores it as RDF in a local triplestore. The content of the triplestore corresponds now to the figure below.

Then it creates two FluidData instances and stores them (using the add_data() function) as RDF in a local triplestore as well. Each instance is represented as an individual with an rdf:JSON data property containing the instance data.
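
As a rough sketch of what "an individual with an rdf:JSON data property" could look like, the snippet below builds such triples by hand. The property name `ex:hasJSONData`, the `rdf:type` link to the datamodel, the example IRIs and the example values are all assumptions made for illustration; the actual terms used by dlite.dataset are those shown in the figure.

```python
import json


def instance_to_triples(iri, datamodel_iri, data):
    """Represent an instance as an individual carrying a JSON literal.

    Assumptions for illustration only: the individual is typed with
    the datamodel class, and the data property is the hypothetical
    "ex:hasJSONData". The literal is a (value, datatype) pair.
    """
    literal = json.dumps(data, sort_keys=True)
    return [
        (iri, "rdf:type", datamodel_iri),
        (iri, "ex:hasJSONData", (literal, "rdf:JSON")),
    ]


# Made-up instance data, only to have something to serialise
data = {"LJPotential": "example-potential", "TemperatureField": [[300.0, 301.5]]}
triples = instance_to_triples("ex:fluid1", "fluid:FluidData", data)
for triple in triples:
    print(triple)
```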

Finally the triplestore is serialised to a turtle file.
test_dataset2_load.py loads the turtle file into a local triplestore and reconstructs the FluidData datamodel as well as the mappings using the get_dataset() function.

Using the get_hash() method, it checks that the reconstructed FluidData datamodel is exactly equal to the original datamodel.

Finally it loads the two instances using the get_data() function and checks that they are exactly equal to the two original instances.
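
The idea behind the hash-based round-trip check can be sketched with the standard library alone. This is an analogy, not what get_hash() does internally: the helper `content_hash`, the datamodel URI, the dimension descriptions and the shape of TemperatureField are assumptions, and JSON stands in for the turtle serialisation used by the tests.

```python
import hashlib
import json


def content_hash(datamodel):
    """Hash a canonical JSON form, so key order does not matter."""
    canonical = json.dumps(datamodel, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Simplified stand-in for the FluidData datamodel (details assumed)
original = {
    "uri": "http://example.com/meta/0.1/FluidData",  # hypothetical URI
    "dimensions": {
        "ntimes": "Number of times.",
        "npositions": "Number of positions.",
    },
    "properties": {
        "LJPotential": {"type": "string"},
        "TemperatureField": {
            "type": "float64",
            "shape": ["ntimes", "npositions"],  # assumed shape
        },
    },
}

# Round trip: serialise, load back, compare hashes
serialised = json.dumps(original, indent=2)
reconstructed = json.loads(serialised)
print(content_hash(reconstructed) == content_hash(original))
```

Comparing hashes rather than objects makes the check independent of how the reconstructed datamodel happens to be laid out in memory.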

Type of change

Bug fix & code cleanup
New feature
Documentation update
Test update

Checklist for the reviewer

This checklist should be used as a help for the reviewer.

Is the change limited to one issue?
Does this PR close the issue?
Is the code easy to read and understand?
Do all new features have an accompanying new test?
Has the documentation been updated as necessary?

jesper-friis · 2024-05-28T20:28:33Z

First comments:

The chemistry dataset is not used and commented out. It is better to completely remove it.

OK, removed the chemistry dataset.

This is very complicated; in order to review I need some more documentation in addition to the svg-files.
The svg files are included in the diff. Here is a direct link to the latest version
https://raw.githubusercontent.com/SINTEF/dlite/652-serialise-data-models-to-tbox/doc/_static/dataset-v2.svg

Added more info to the PR description.

hothello · 2024-05-29T13:24:16Z

I do not understand why we need to specify this string in the mappings:
(FLUID.LJPotential, MAP.mapsTo, EMMO.String),
This information is already stated in the datamodel FluidData.json, as is the case, for instance, for FLUID.TemperatureField (float64).

Also, the example test_dataset1_save.py creates an instance of a dataset from some hard-coded data. Would it be possible to show how to instantiate a dataset coming from an external file, as it seems to me a more common scenario. I would suggest using the Fluid dataset from OntoTrans' OTE demonstration: isobaric_liquids_nist.xlsx and its associated datamodel. Note that the spreadsheet has three tabs: Benzene, water, and hexane, but these are not specified in the datamodel's dimensions (please check if this is correct).
The corresponding mapping could be something like:

FLUID = ts.bind("fluid", "http://ontotrans.eu/meta/1.0/isobaric_exp#")
SOLV = ts.bind("solv", "http://ontotrans.eu/meta/1.0/solvents") # IRI of the corresponding ontology

mappings = [
    (FLUID,             EMMO.isDescriptionFor, EMMO.Fluid),
    (FLUID,             EMMO.isIconFor, SOLV.benzene),
    (FLUID,             EMMO.isIconFor, SOLV.water),
    (FLUID,             EMMO.isIconFor, SOLV.hexane),
    (FLUID.n,           MAP.mapsTo,  EMMO.CountingUnit),
    (FLUID.temp,        MAP.mapsTo,  EMMO.ThermodynamicTemperature),
    (FLUID.press,       MAP.mapsTo,  EMMO.Pressure),
    (FLUID.density,     MAP.mapsTo,  EMMO.Density),
    (FLUID.volume,      MAP.mapsTo,  EMMO.Volume),
    (FLUID.int_ene,     MAP.mapsTo,  EMMO.InternalEnergy),
    (FLUID.enthalpy,    MAP.mapsTo,  EMMO.Enthalpy),
    (FLUID.cv,          MAP.mapsTo,  EMMO.IsochoricHeatCapacity),
    (FLUID.cp,          MAP.mapsTo,  EMMO.IsobaricHeatCapacity),
    (FLUID.sound_speed, MAP.mapsTo,  SOLV.SpeedOfSound),
    (FLUID.viscosity,   MAP.mapsTo,  EMMO.DynamicViscosity),
    (FLUID.phase,       MAP.mapsTo,  EMMO.StateOfMatter)
]

Question: where are the prefixes EMMO and MAP defined in test_dataset1_save.py?

hothello · 2024-05-29T13:40:09Z

About the new functions.

Suppose a datamodel, mappings, and instance of a particular dataset have already been created and stored in the KB. Which function should be used to add a different dataset having the same datamodel and mappings?

jesper-friis · 2024-05-29T16:10:10Z

I do not understand why we need to specify this string in the mappings:
(FLUID.LJPotential,      MAP.mapsTo,            EMMO.String),
This information is already stated in the datamodel FluidData.json, as is the case, for instance, for FLUID.TemperatureField (float64).

I agree that this additional mapping is probably more confusing than helpful. I just included this relation since it was in the figure you and Emanuele made. Note that String and StringData are two different concepts in EMMO. The fact that FLUID.LJPotential is a string in the data model will be represented with the triple (FLUID.LJPotential, RDFS.subClassOf, EMMO.StringData) in the triplestore.

jesper-friis · 2024-05-29T16:15:32Z

Also, the example test_dataset1_save.py creates an instance of a dataset from some hard-coded data. Would it be possible to show how to instantiate a dataset coming from an external file, as it seems to me a more common scenario. I would suggest using the Fluid dataset from OntoTrans' OTE demonstration: isobaric_liquids_nist.xlsx and its associated datamodel. Note that the spreadsheet has three tabs: Benzene, water, and hexane, but these are not specified in the datamodel's dimensions (please check if this is correct).

Good point. The test_dataset1_save.py file is intended for unit testing of the new dlite.dataset module. But I think that it would be very useful to make a real example including your suggestion. To be done!

jesper-friis · 2024-05-29T16:25:48Z

Question: where are the prefixes EMMO and MAP defined in test_dataset1_save.py?

MAP is pre-defined and imported from tripper. tripper also defines the EMMO namespace, but importing that is not very user-friendly, since it requires you to use the numerical IRIs (like writing EMMO.EMMO_eb77076b_a104_42ac_a065_798b2d2809ad in your code instead of EMMO.Atom). The dlite.dataset module provides a smarter version of the EMMO namespace that downloads EMMO from GitHub Pages and builds a lookup table of all the labels, such that when you write EMMO.Atom, it expands to the correct IRI:

>>> from dlite.dataset import EMMO
>>> EMMO.Atom
'https://w3id.org/emmo#EMMO_eb77076b_a104_42ac_a065_798b2d2809ad'
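
The label lookup described above can be illustrated with a minimal, self-contained sketch. It is not the namespace class shipped with dlite.dataset or tripper: `LabelNamespace` is a hypothetical name, and the lookup table is hard-coded with the single label from the example instead of being built from a downloaded ontology.

```python
class LabelNamespace:
    """Namespace that expands human-readable labels to full IRIs."""

    def __init__(self, base_iri, labels):
        self._base_iri = base_iri
        self._labels = dict(labels)  # label -> local name

    def __getattr__(self, label):
        # Only called when normal attribute lookup fails.
        if label.startswith("_"):
            raise AttributeError(label)
        try:
            return self._base_iri + self._labels[label]
        except KeyError:
            raise AttributeError(f"unknown label: {label}") from None


# Hard-coded table; a real implementation would build it from the ontology
EMMO = LabelNamespace(
    "https://w3id.org/emmo#",
    {"Atom": "EMMO_eb77076b_a104_42ac_a065_798b2d2809ad"},
)
print(EMMO.Atom)
```

Raising AttributeError for unknown labels means a misspelled concept name fails immediately instead of silently producing an invalid IRI.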

jesper-friis · 2024-05-29T16:27:56Z

About the new functions.

Suppose a datamodel, mappings, and instance of a particular dataset have already been created and stored in the KB. Which function should be used to add a different dataset having the same datamodel and mappings?

For that you can use add_data().

hothello · 2024-05-30T11:49:29Z

About the new functions.
Suppose a datamodel, mappings, and instance of a particular dataset have already been created and stored in the KB. Which function should be used to add a different dataset having the same datamodel and mappings?

For that you can use add_data().

Good. Suppose one wants to create new instances of datasets whose mappings are generic, like:

    (FLUID,             EMMO.isDescriptionFor, EMMO.Fluid),
    (FLUID,             EMMO.isIconFor, EMMO.Substance),

A particular instance should have the Substance specified, e.g.

    (FLUID,             EMMO.isIconFor, SOLV.diamond)

But the rest of the mapping should be the same. How can instances be created that share the same mappings but differ for a few triplets?

…/dlite into 652-serialise-data-models-to-tbox

francescalb

Approved, but proper documentation and examples of use should have high priority for this new functionality to have any real value (for others than the core developers).

Since the functionality is needed in the development of SS1 in OpenModel with short due date, I think we can approve the functionality, provided that the documentation has high priority.

jesper-friis · 2024-07-05T14:02:25Z

About the new functions.
Suppose a datamodel, mappings, and instance of a particular dataset have already been created and stored in the KB. Which function should be used to add a different dataset having the same datamodel and mappings?

For that you can use add_data().

Good. Suppose one wants to create new instances of datasets whose mappings are generic, like:
    (FLUID,             EMMO.isDescriptionFor, EMMO.Fluid),
    (FLUID,             EMMO.isIconFor, EMMO.Substance),
A particular instance should have the Substance specified, e.g.
    (FLUID,             EMMO.isIconFor, SOLV.diamond)
But the rest of the mapping should be the same. How can instances be created that share the same mappings but differ for a few triplets?

We describe the datasets at the TBOX level. At this level, the simple mappings like (FLUID, EMMO.isDescriptionFor, EMMO.Fluid) will be represented as restrictions in the knowledge base.

But you are right that at the individual level we can add simple relations. That is definitely useful. Let's discuss it and make a new PR for that.

jesper-friis added 2 commits October 6, 2023 23:31

Fixed issues

ce7f073

Working on dataset representation...

20c34dd

jesper-friis self-assigned this Feb 21, 2024

jesper-friis marked this pull request as draft February 21, 2024 18:05

jesper-friis and others added 26 commits February 21, 2024 19:07

Update figure

120f0ce

Merge branch '652-serialise-data-models-to-tbox' of bifrost:~/prosjek…

e3a1cb9

…ter/software/dlite into 652-serialise-data-models-to-tbox

Avoid failing unnessesary

56d2c95

Minor fixes:

10109f5

- Changed some blank nodes to classes and named literals. - Updated dataset figure.

Updated dataset serialisation

95c7f5d

Merge branch 'master' into 652-serialise-data-models-to-tbox

217b189

Updating dataset

c5ac6de

Handled units.

e461da3

Updates

198123b

Added new figure: dataset-v2.svg

5c19b9e

Merge branch 'master' into 652-serialise-data-models-to-tbox

8524e0d

Updated dataset-v2.svg

9873581

Merge branch '652-serialise-data-models-to-tbox' of github.com:SINTEF…

d1a620c

…/dlite into 652-serialise-data-models-to-tbox

Updated figure

7720767

Merge branch 'master' into 652-serialise-data-models-to-tbox

babcde2

Updated dataset figure with input from Emanuele

e182f95

Updated figure

a898d6e

Improving dataset

94615f2

Merge branch '652-serialise-data-models-to-tbox' of github.com:SINTEF…

e872fa2

…/dlite into 652-serialise-data-models-to-tbox

Updated dataset serialisation

13b43b4

Merge branch 'master' into 652-serialise-data-models-to-tbox

7c3821b

metadata_to_rdf() now works

d122e8d

Implemented loading datamodels from EMMO representation.

a7c7d3f

Cleanup

2c12983

Skip testing dataset if Tripper is not installed.

b42f132

Updated representation of array types in the dlite.dataset module.

c1bfe4f

jesper-friis requested a review from hothello May 29, 2024 08:18

jesper-friis and others added 13 commits June 7, 2024 09:35

Merge branch 'master' into 652-serialise-data-models-to-tbox

d3b9f66

Started to add a dataset example

9d73a70

Merge branch '652-serialise-data-models-to-tbox' of github.com:SINTEF…

32bcc13

…/dlite into 652-serialise-data-models-to-tbox

Updated readme for dataset example

c64ed0a

Merge branch 'master' into 652-serialise-data-models-to-tbox

0d206c9

Merge branch 'master' into 652-serialise-data-models-to-tbox

a733654

Updated requirements

f68659c

Merge branch '652-serialise-data-models-to-tbox' of github.com:SINTEF…

b5a0f01

…/dlite into 652-serialise-data-models-to-tbox

Also updated local requirements for TEM_data example

b45d69b

Merge branch 'master' into 652-serialise-data-models-to-tbox

8afdf52

Merge branch 'master' into 652-serialise-data-models-to-tbox

23a4460

Merge branch 'master' into 652-serialise-data-models-to-tbox

c883593

Fixed EMMO dataset representation

a045e15

jesper-friis mentioned this pull request Jul 4, 2024

Fix test redis #877

Merged

9 tasks

jesper-friis and others added 2 commits July 5, 2024 15:35

Cleaned up readme in dataset example

6e28f25

Merge branch 'master' into 652-serialise-data-models-to-tbox

86f0cca

francescalb approved these changes Jul 5, 2024

jesper-friis merged commit 6608be5 into master Jul 5, 2024
15 checks passed

jesper-friis deleted the 652-serialise-data-models-to-tbox branch July 5, 2024 15:38

jesper-friis mentioned this pull request Jul 5, 2024

Improve the dataset example #879

Open
