Serialise as EMMO datasets #796

Merged
merged 48 commits into from
Jul 5, 2024

jesper-friis
Collaborator

@jesper-friis jesper-friis commented Feb 21, 2024

Description

This PR implements a new dlite.dataset module for serialisation of DLite datamodels and instances to an RDF representation based on EMMO following the representation shown in the figure below.

The main interface is exposed by four new functions:

  • add_dataset(): stores datamodel+mappings to a triplestore
  • add_data(): stores an instance (or datamodel) to a triplestore
  • get_dataset(): loads datamodel+mappings from a triplestore
  • get_data(): loads an instance (or datamodel) from a triplestore

Question: Is the naming of these functions understandable? The term dataset comes from EMMO, but as a user of DLite it may be confusing. Maybe save_datamodel(), save_instance(), load_datamodel(), load_instance() would be more intuitive?

Two tests are added using a datamodel matching what is shown in the figure.

  • test_dataset1_save.py first loads a FluidData datamodel and documents it semantically with the following mappings:

    mappings = [
      (FLUID,                  EMMO.isDescriptionFor, EMMO.Fluid),
      (FLUID.LJPotential,      MAP.mapsTo,            EMMO.String),
      (FLUID.LJPotential,      EMMO.isDescriptionFor, EMMO.MolecularEntity),
      (FLUID.TemperatureField, MAP.mapsTo,            EMMO.ThermodynamicTemperature),
      (FLUID.ntimes,           MAP.mapsTo,            EMMO.Time),
      (FLUID.npositions,       MAP.mapsTo,            EMMO.Position),
    ]

    Note the use of emmo:isDescriptionFor relations in the mappings. They are stored as-is in the triplestore.
    The map:mapsTo are translated to rdfs:subClassOf when serialised to the triplestore.

    Then it uses the add_dataset() function in the new dlite.dataset module and stores it as RDF in a local triplestore. The content of the triplestore corresponds now to the figure below.

    Then it creates two FluidData instances and stores them (using the add_data() function) as RDF in a local triplestore as well. Each instance is represented as an individual with an rdf:JSON data property containing the instance data.

    Finally the triplestore is serialised to a turtle file.

  • test_dataset2_load.py loads the turtle file into a local triplestore and reconstructs the FluidData datamodel as well as the mappings using the get_dataset() function.

    Using the get_hash() method, it checks that the reconstructed FluidData datamodel is exactly equal to the original datamodel.

    Finally it loads the two instances using the get_data() function and checks that they are exactly equal to the two original instances.
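The save/load cycle exercised by the two tests can be sketched without DLite or a triplestore. The snippet below is only an illustration under stated assumptions, not the dlite.dataset implementation: triples are plain tuples of prefixed strings, `translate_mappings` and `content_hash` are hypothetical helpers, the instance values are made up, and the SHA-256 of canonical JSON merely stands in for DLite's get_hash().

```python
import hashlib
import json

# Prefixed names as plain strings; a real triplestore would use full IRIs.
MAPS_TO = "map:mapsTo"
SUBCLASS_OF = "rdfs:subClassOf"
IS_DESCRIPTION_FOR = "emmo:isDescriptionFor"


def translate_mappings(mappings):
    """Rewrite map:mapsTo as rdfs:subClassOf; keep other relations as-is."""
    return [
        (s, SUBCLASS_OF if p == MAPS_TO else p, o)
        for s, p, o in mappings
    ]


def content_hash(data):
    """Hash of canonical JSON (hypothetical stand-in for get_hash())."""
    canonical = json.dumps(data, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Illustrative subset of the mappings from the PR description.
mappings = [
    ("fluid:FluidData", IS_DESCRIPTION_FOR, "emmo:Fluid"),
    ("fluid:TemperatureField", MAPS_TO, "emmo:ThermodynamicTemperature"),
]
triples = translate_mappings(mappings)

# Save: an instance is stored as a JSON literal (values are made up).
instance = {"LJPotential": "12-6", "TemperatureField": [[300.0, 301.5]]}
literal = json.dumps(instance)

# Load: parse the literal back and compare hashes with the original.
reloaded = json.loads(literal)
roundtrip_ok = content_hash(reloaded) == content_hash(instance)
```

Comparing hashes rather than Python objects mirrors the check in test_dataset2_load.py, where equality is established through get_hash().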

Type of change

  • Bug fix & code cleanup
  • New feature
  • Documentation update
  • Test update

Checklist for the reviewer

This checklist should be used as a help for the reviewer.

  • Is the change limited to one issue?
  • Does this PR close the issue?
  • Is the code easy to read and understand?
  • Do all new features have an accompanying new test?
  • Has the documentation been updated as necessary?

@jesper-friis jesper-friis self-assigned this Feb 21, 2024
@jesper-friis jesper-friis marked this pull request as draft February 21, 2024 18:05
jesper-friis and others added 26 commits February 21, 2024 19:07
…ter/software/dlite into 652-serialise-data-models-to-tbox
- Changed some blank nodes to classes and named literals.
- Updated dataset figure.
@jesper-friis
Collaborator Author

jesper-friis commented May 28, 2024

First comments:

  1. The chemistry dataset is not used and commented out. It is better to completely remove it.

OK, removed the chemistry dataset.

  1. This is very complicated; in order to review I need some more documentation in addition to the svg-files.
    The svg files are included in the diff. Here is a direct link to the latest version
    https://raw.githubusercontent.com/SINTEF/dlite/652-serialise-data-models-to-tbox/doc/_static/dataset-v2.svg

Added more info to the PR description.

@hothello
Collaborator

I do not understand why we need to specify this string in the mappings:
(FLUID.LJPotential, MAP.mapsTo, EMMO.String),
This information is already stated in the datamodel FluidData.json, as it is, for instance, for FLUID.TemperatureField (float64).

Also, the example test_dataset1_save.py creates an instance of a dataset from some hard-coded data. Would it be possible to show how to instantiate a dataset coming from an external file, as it seems to me a more common scenario. I would suggest using the Fluid dataset from OntoTrans' OTE demonstration: isobaric_liquids_nist.xlsx and its associated datamodel. Note that the spreadsheet has three tabs: Benzene, water, and hexane, but these are not specified in the datamodel's dimensions (please check if this is correct).
The corresponding mapping could be something like:

FLUID = ts.bind("fluid", "http://ontotrans.eu/meta/1.0/isobaric_exp#")
SOLV = ts.bind("solv", "http://ontotrans.eu/meta/1.0/solvents#")  # IRI of the corresponding ontology

mappings = [
    (FLUID,             EMMO.isDescriptionFor, EMMO.Fluid),
    (FLUID,             EMMO.isIconFor, SOLV.benzene),
    (FLUID,             EMMO.isIconFor, SOLV.water),
    (FLUID,             EMMO.isIconFor, SOLV.hexane),
    (FLUID.n,           MAP.mapsTo,  EMMO.CountingUnit),
    (FLUID.temp,        MAP.mapsTo,  EMMO.ThermodynamicTemperature),
    (FLUID.press,       MAP.mapsTo,  EMMO.Pressure),
    (FLUID.density,     MAP.mapsTo,  EMMO.Density),
    (FLUID.volume,      MAP.mapsTo,  EMMO.Volume),
    (FLUID.int_ene,     MAP.mapsTo,  EMMO.InternalEnergy),
    (FLUID.enthalpy,    MAP.mapsTo,  EMMO.Enthalpy),
    (FLUID.cv,          MAP.mapsTo,  EMMO.IsochoricHeatCapacity),
    (FLUID.cp,          MAP.mapsTo,  EMMO.IsobaricHeatCapacity),
    (FLUID.sound_speed, MAP.mapsTo,  SOLV.SpeedOfSound),
    (FLUID.viscosity,   MAP.mapsTo,  EMMO.DynamicViscosity),
    (FLUID.phase,       MAP.mapsTo,  EMMO.StateOfMatter)
]

Question: where are the prefixes EMMO and MAP defined in test_dataset1_save.py?

@hothello
Collaborator

About the new functions.

Suppose a datamodel, mappings, and instance of a particular dataset have already been created and stored in the KB. Which function should be used to add a different dataset having the same datamodel and mappings?

@jesper-friis
Collaborator Author

I do not understand why we need to specify this string in the mappings:

(FLUID.LJPotential,      MAP.mapsTo,            EMMO.String),

This information is already stated in the datamodel FluidData.json, as it is, for instance, for FLUID.TemperatureField (float64).

I agree that this additional mapping is probably more confusing than helpful. I just included this relation since it was in the figure you and Emanuele made. Note that String and StringData are two different concepts in EMMO. The fact that FLUID.LJPotential is a string in the data model will be represented with the triple (FLUID.LJPotential, RDFS.subClassOf, EMMO.StringData) in the triplestore.

@jesper-friis
Collaborator Author

Also, the example test_dataset1_save.py creates an instance of a dataset from some hard-coded data. Would it be possible to show how to instantiate a dataset coming from an external file, as it seems to me a more common scenario. I would suggest using the Fluid dataset from OntoTrans' OTE demonstration: isobaric_liquids_nist.xlsx and its associated datamodel. Note that the spreadsheet has three tabs: Benzene, water, and hexane, but these are not specified in the datamodel's dimensions (please check if this is correct).

Good point. The test_dataset1_save.py file is intended for unit testing of the new dlite.dataset module. But I think that it would be very useful to make a real example including your suggestion. To be done!

@jesper-friis
Collaborator Author

Question: where are the prefixes EMMO and MAP defined in test_dataset1_save.py?

MAP is pre-defined and imported from tripper. tripper also defines the EMMO namespace, but importing that is not very user friendly, since it requires you to use the numerical IRIs (like writing EMMO.EMMO_eb77076b_a104_42ac_a065_798b2d2809ad in your code instead of EMMO.Atom). The dlite.dataset module provides a smarter version of the EMMO namespace that downloads EMMO from GitHub pages and builds a lookup table of all the labels, such that when you write EMMO.Atom, it expands to the correct IRI:

>>> from dlite.dataset import EMMO
>>> EMMO.Atom
'https://w3id.org/emmo#EMMO_eb77076b_a104_42ac_a065_798b2d2809ad'
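The label lookup idea can be sketched in a few lines. This is a hedged illustration, not the code in dlite.dataset: the class name `LabelNamespace` and the variable `EMMO_SKETCH` are hypothetical, and the lookup table is hard-coded with the single entry from the example above instead of being built from a downloaded copy of EMMO.

```python
class LabelNamespace:
    """Namespace that resolves human-readable labels to opaque IRIs."""

    def __init__(self, base_iri, labels):
        self._base_iri = base_iri
        self._labels = dict(labels)  # label -> IRI fragment

    def __getattr__(self, label):
        # Only called for attributes that normal lookup did not find.
        if label.startswith("_"):
            raise AttributeError(label)
        try:
            return self._base_iri + self._labels[label]
        except KeyError:
            raise AttributeError(f"no such label: {label}") from None


# Hard-coded table; a real implementation would fill it from the ontology.
EMMO_SKETCH = LabelNamespace(
    "https://w3id.org/emmo#",
    {"Atom": "EMMO_eb77076b_a104_42ac_a065_798b2d2809ad"},
)
atom_iri = EMMO_SKETCH.Atom
```

Raising AttributeError for unknown labels keeps hasattr() and similar introspection working as expected on the namespace object.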

@jesper-friis
Collaborator Author

jesper-friis commented May 29, 2024

About the new functions.

Suppose a datamodel, mappings, and instance of a particular dataset have already been created and stored in the KB. Which function should be used to add a different dataset having the same datamodel and mappings?

For that you can use add_data().
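The division of labour between the two kinds of function can be illustrated with a toy in-memory store. Everything here is an assumption made for the sketch: `add_dataset_sketch` and `add_data_sketch` are hypothetical stand-ins, not the dlite.dataset functions or their signatures, and the `rdf:type` link and the `ex:hasJSONData` predicate are placeholders for however the module actually connects an instance to its datamodel and data.

```python
import json

store = set()  # a set of (subject, predicate, object) tuples


def add_dataset_sketch(store, mappings):
    """Store the shared datamodel-level triples once."""
    store.update(mappings)


def add_data_sketch(store, instance_iri, datamodel_iri, data):
    """Store one instance as an individual carrying its data as JSON."""
    store.add((instance_iri, "rdf:type", datamodel_iri))
    store.add(
        (instance_iri, "ex:hasJSONData", json.dumps(data, sort_keys=True))
    )


mappings = [("fluid:FluidData", "emmo:isDescriptionFor", "emmo:Fluid")]
add_dataset_sketch(store, mappings)
add_data_sketch(store, "ex:fluid1", "fluid:FluidData", {"ntimes": 2})
n_after_first = len(store)

# A second instance reuses the datamodel and mappings already in the store.
add_data_sketch(store, "ex:fluid2", "fluid:FluidData", {"ntimes": 5})
n_after_second = len(store)
```

Only the instance-level triples grow with each new instance; the datamodel and its mappings are stored a single time.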

@hothello
Collaborator

About the new functions.
Suppose a datamodel, mappings, and instance of a particular dataset have already been created and stored in the KB. Which function should be used to add a different dataset having the same datamodel and mappings?

For that you can use add_data().

Good. Suppose one wants to create new instances of datasets whose mappings are generic, like:

    (FLUID,             EMMO.isDescriptionFor, EMMO.Fluid),
    (FLUID,             EMMO.isIconFor, EMMO.Substance),

A particular instance should have the Substance specified, e.g.

    (FLUID,             EMMO.isIconFor, SOLV.diamond)

But the rest of the mapping should be the same. How can instances be created that share the same mappings but differ for a few triplets?

@jesper-friis jesper-friis mentioned this pull request Jul 4, 2024
9 tasks
Collaborator

@francescalb francescalb left a comment

Approved, but proper documentation and examples of use should have high priority for this new functionality to have any real value (for others than the core developers).

Since the functionality is needed in the development of SS1 in OpenModel with short due date, I think we can approve the functionality, provided that the documentation has high priority.

@jesper-friis
Collaborator Author

About the new functions.
Suppose a datamodel, mappings, and instance of a particular dataset have already been created and stored in the KB. Which function should be used to add a different dataset having the same datamodel and mappings?

For that you can use add_data().

Good. Suppose one wants to create new instances of datasets whose mappings are generic, like:

    (FLUID,             EMMO.isDescriptionFor, EMMO.Fluid),
    (FLUID,             EMMO.isIconFor, EMMO.Substance),

A particular instance should have the Substance specified, e.g.

    (FLUID,             EMMO.isIconFor, SOLV.diamond)

But the rest of the mapping should be the same. How can instances be created that share the same mappings but differ for a few triplets?

We describe the datasets at the TBOX level. At this level, the simple mappings like (FLUID, EMMO.isDescriptionFor, EMMO.Fluid) will be represented as restrictions in the knowledge base.

But you are right that at the individual level we can add simple relations. That is definitely useful. Let's discuss it and make a new PR for that.

@jesper-friis jesper-friis merged commit 6608be5 into master Jul 5, 2024
15 checks passed
@jesper-friis jesper-friis deleted the 652-serialise-data-models-to-tbox branch July 5, 2024 15:38
3 participants