Improvements to CIM and CGMES RDF Representation

This document describes proposed improvements to the representation of CIM/CGMES instance data.

Folders

  • instances: Sample Instance Data as Trig, from Nordic44, ENTSO-E and "multiplied" (large), see Multiplied Data
  • test: xml, trig and jsonld test instance files (8 of each kind)
  • trials: various trial files

Files

  • cim-context-new.txt: prefix file for JSON-LD context using new namespaces
  • cim-context-old.txt: prefix file for JSON-LD context using old namespaces
  • cim-context-common.txt: common file for JSON-LD context
  • cim-context-new.jsonld: JSON-LD context using new namespaces
  • cim-context-old.jsonld: JSON-LD context using old namespaces
  • cim-context-strings.txt: properties with "@type": "xsd:string". Not added to context since that is the default datatype
  • cim-trig.pl: converts CIM XML (Full or Difference models) to Trig: see Custom CIM XML Parser
  • count.pl: script to clean up files produced by riot --count
  • count-ENTSOE.txt: count of triples in ENTSO-E instance files as produced by riot --count
  • count-ENTSOE1.txt: pure count of triples in ENTSO-E instance files
  • count-Nordic.txt: count of triples in Nordic44 instance files as produced by riot --count
  • count-Nordic1.txt: pure count of triples in Nordic44 instance files
  • fix-datatypes-new.ru: SPARQL Update to add datatypes to instance files using new namespaces
  • fix-datatypes-old.ru: SPARQL Update to add datatypes to instance files using old namespaces
  • fix-datatypes-both.ru: SPARQL Update to add datatypes to instance files using either new or old namespaces
  • props-same-name-different-characteristics.csv: properties with same name (last part of URL) but different characteristics
  • props-same-name-different-range.csv: properties with same name (last part of URL) but different range (the most important characteristic)
  • README.md: this file

Makefile

This folder uses make to automate various tasks, track dependencies, and remake files when needed. The Makefile defines the following targets (printed when make is invoked without a target):

  • context: JSON-LD context for new and old namespaces
  • dirs: all subdirs in instances
  • test: test instance files in trig
  • jsonld: test instance files in jsonld
  • nordic: Nordic44 instance files in trig
  • entsoe: ENTSO-E instance files in trig
  • multiplied: "multiplied" instance files in trig
  • rm-test: remove "test/trig" instance files
  • rm-jsonld: remove "test/jsonld" instance files
  • rm-nordic: remove Nordic44 trig instance files
  • rm-entsoe: remove ENTSO-E trig instance files
  • rm-multiplied: remove "multiplied" trig instance files
  • clean: remove files of size zero

The make manual is very comprehensive, but dense and hard to understand. So if you are not familiar with make, it can be quite a challenge to understand and maintain the Makefile. In the following subsections we explain a few of the trickier aspects.

Makefile Variables

Let's first look at variable assignments. Consider the most complicated group:

nordic_source      = ../../../Nordic44/Instances

States where the Nordic44 source instance files are located, relative to the current folder.

nordic_dirs       != /usr/bin/find $(nordic_source) -type d

Finds all directories (subfolders). Unlike normal assignment, != invokes the shell with an external command. I've given the full name /usr/bin/find to avoid confusion with the DOS find program (an abomination).

nordic_target      = instances/Nordic44

States where the target instance files will go (upon conversion from xml to trig).

nordic_target_dirs = $(subst $(nordic_source),$(nordic_target),$(nordic_dirs)) 

Computes the target subfolders. $(subst) performs a plain text replacement over the whole space-separated list in $(nordic_dirs), so in every subfolder path the source prefix is replaced with the target prefix.

nordic_ignore      = CDPSM_2_0/Nordic44_CPSM_01_MF.xml CDPSM_2_0/Nordic44_03_inc.xml CGMES_2_4/Nordic44_CGM_36f_MF.xml CGMES_2_4/Nordic44_CGM_38_CO.xml

Declares that some Nordic files will be ignored (not converted) for various reasons.

nordic_ignore2     = $(patsubst %, $(nordic_source)/%, $(nordic_ignore))

Expands the ignored files to include the source folder prepended.

nordic_rdf         = $(filter-out $(nordic_ignore2), $(wildcard $(nordic_source)/*_2_*/*.xml))

Finds all relevant source (rdf xml) files by using $(wildcard) (glob pattern). The pattern *_2_* uses only folders CDPSM_2_0, CGMES_2_4 but ignores the folder CGMES_3_0 (since that has only some draft files in ttl, nt, geojson). $(filter-out) further excludes the $(nordic_ignore2) files.

nordic_trig        = $(subst .xml,.trig, $(subst $(nordic_source),$(nordic_target), $(nordic_rdf)))

Computes the target (trig) filenames. $(nordic_rdf) is a space-separated list of filenames; in each one we replace the source folder with the target folder, and the source extension .xml with the target extension .trig.
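
For readers more at home in Python than in make, the chain of variables above can be sketched as follows. This is only an illustration of what $(patsubst), $(filter-out) and $(subst) compute; the function name trig_targets is hypothetical and is not part of the Makefile.

```python
# Values copied from the Makefile variables above.
NORDIC_SOURCE = "../../../Nordic44/Instances"
NORDIC_TARGET = "instances/Nordic44"
NORDIC_IGNORE = [
    "CDPSM_2_0/Nordic44_CPSM_01_MF.xml",
    "CDPSM_2_0/Nordic44_03_inc.xml",
    "CGMES_2_4/Nordic44_CGM_36f_MF.xml",
    "CGMES_2_4/Nordic44_CGM_38_CO.xml",
]


def trig_targets(xml_files, source=NORDIC_SOURCE, target=NORDIC_TARGET,
                 ignore=NORDIC_IGNORE):
    """Hypothetical helper: map source .xml paths to target .trig paths."""
    # nordic_ignore2: prepend the source folder to each ignored file
    ignored = {f"{source}/{name}" for name in ignore}
    # nordic_rdf: drop the ignored files (like $(filter-out))
    kept = [f for f in xml_files if f not in ignored]
    # nordic_trig: $(subst) is a plain text replacement, so str.replace
    # is the closest analogue (folder first, then extension)
    return [f.replace(source, target).replace(".xml", ".trig") for f in kept]
```

Given the list of files matched by the wildcard, the function returns the same names that $(nordic_trig) would hold.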

First Target

The first target in the file (conventionally called all) is executed if you run make without arguments:

all:
	@echo targets: context, dirs, test, jsonld, nordic, entsoe, multiplied, rm-test, rm-jsonld, rm-nordic, rm-entsoe, rm-multiplied, clean
  • It just prints the targets defined in the Makefile.
  • The prefix @ prevents make from printing the command line itself

This is also the place to print out any variable you're unsure about, for debugging purposes. Eg to print $(nordic_trig), add this:

	@echo $(nordic_trig)

Making Dirs

The instances folder has 49 folders nested up to 4 levels deep. If make tries to create a file in a non-existent folder, it will fail. So we want to automate the creation of all these folders. We've already computed the nested subfolders $(nordic_target_dirs) and $(entsoe_target_dirs), so we just call mkdir on the 4 root folders, plus the nested subfolders:

dirs:
	-mkdir instances $(nordic_target) $(entsoe_target) $(multiplied_target) $(nordic_target_dirs) $(entsoe_target_dirs)

The - sign tells make to proceed even if some of these folders already exist (mkdir returns an error in that case, but make ignores the error).

There is one more thing to do. Git ignores empty folders on commit, so we need to make an empty file in each folder. Such files are conventionally called .gitkeep (see What is .gitkeep):

	touch $(patsubst %, %/.gitkeep, $(multiplied_target) $(nordic_target_dirs) $(entsoe_target_dirs))

touch is a convenient command to use here: it updates the timestamp of files to the current time, and makes empty files if needed.
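
The same idea can be sketched in Python, assuming only the standard library. The function name make_dirs_with_gitkeep is hypothetical. Unlike the plain mkdir in the Makefile, this sketch also creates missing parent folders, so the order of the folder list does not matter.

```python
from pathlib import Path


def make_dirs_with_gitkeep(dirs):
    """Hypothetical helper: create each folder and put an empty .gitkeep in it."""
    for d in map(Path, dirs):
        # Like `-mkdir`: an already existing folder is not treated as an error
        d.mkdir(parents=True, exist_ok=True)
        # Like `touch`: creates the file if missing, else updates its timestamp
        (d / ".gitkeep").touch()
```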

Making Zips

Why would we even need these empty folders? Because we don't want to:

  • Commit such a large number of large files (see Final Instance Data) to git.
  • Transfer a large number of files to a semantic database for loading. It's better to transfer just 3 zips.

So we use zip to zip the instance files and move them out of the way:

zips:
	zip -r -m $(patsubst instances/%, $(zip)/%.zip, $(nordic_target))     $(nordic_target)     -x "*/.gitkeep"
	zip -r -m $(patsubst instances/%, $(zip)/%.zip, $(entsoe_target))     $(entsoe_target)	   -x "*/.gitkeep"
	zip -r -m $(patsubst instances/%, $(zip)/%.zip, $(multiplied_target)) $(multiplied_target) -x "*/.gitkeep"
  • Option -m moves the files to the zip
  • Option -x excludes the .gitkeep files
  • The above is a bit dumb since it always considers all files and copy-pastes the same command three times, but it's OK as a start
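
As a rough Python equivalent of one of these zip commands, the sketch below archives every file under a folder except .gitkeep and then deletes the archived originals. The function name zip_and_move is hypothetical, and the sketch assumes the zip file is written outside the folder being archived.

```python
import os
import zipfile


def zip_and_move(folder, zip_path, exclude_name=".gitkeep"):
    """Hypothetical helper: zip files under folder, then remove the originals."""
    archived = []
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for dirpath, _dirnames, filenames in os.walk(folder):
            for name in filenames:
                if name == exclude_name:
                    continue  # like -x "*/.gitkeep": keep the placeholder in place
                path = os.path.join(dirpath, name)
                zf.write(path)
                archived.append(path)
    # Like -m: delete the files only after the archive was written and closed
    for path in archived:
        os.remove(path)
    return archived
```

Deleting only after the archive is closed means a failure while zipping leaves the instance files untouched.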

Represent Models as Named Graphs

  • 3lbits/CIM4NoUtility#321 Converting CIMXML DifferenceModel to CIMJSON-LD
  • #22 md:Statement is problematically defined
  • #86 no connection of instance triples to Model

If you convert a CIM XML model (eg Nordic44_CGM_36d_SSH.xml) to Turtle, you get something like this:

<urn:uuid:1d8b61bc-c7f3-4e9e-a3bd-f4ec24beb586>
  rdf:type                       md:FullModel ;
  md:Model.DependentOn           <urn:uuid:2dd9014f-bdfb-11e5-94fa-c8f73332c8f4> ;
  md:Model.created               "2017-11-24T09:03:09.9446768Z" ;
  md:Model.description           "CGM Test model developed by Statnett SF. Nordic 44 bus system for the Nordic region" ;
  md:Model.modelingAuthoritySet  "http://www.Statnett.no/IGM/Nordic44_CGM" ;
  md:Model.profile               "http://entsoe.eu/CIM/SteadyStateHypothesis/1/1" , "http://entsoe.eu/CIM/SteadyStateHypothesis/1/2" ;
  md:Model.scenarioTime          "2015-03-06T01:30:00.0000000Z" ;
  md:Model.version               "36" ;
  pti:Model.createdBy            "Statnett SF" .

<file:///d:/Onto/proj/electrical/Nordic44/Instances/CGMES_2_4/Nordic44_CGM_36d_SSH.xml#_e2f56599-a78e-494f-8db3-c0b0bdab1d70>
  rdf:type                    cim:Terminal ;
  cim:ACDCTerminal.connected  "true" .

The problem is that there is no relation whatsoever between the model and the CIM triples. In RDF, the fact that triples appear in the same file does not link them to the model URI in that file. If you load this file into a semantic repository, these triples will be mixed with millions of other triples, losing all connection to the model.

Statement sets are modeled in the ontology by using the RDF Reification ontology: rdf:Statement (sometimes misspelled rdf:Statements), with props rdf:subject, rdf:predicate, rdf:object (sometimes misspelled rdf:Statement.subject, rdf:Statement.predicate, rdf:Statement.object). But Reification is a very inefficient way to capture statements. So in instance data, CIM doesn't actually use that construct.

It was agreed that each model will be represented as a Named Graph that contains the model metadata and triples (thus they become quads). The model URN is also used as graph URN (name). We can express this in TriG (Turtle with Graphs) as follows, where we also:

  • Use the rdfg:Graph class to emphasize that the model is a named graph
  • Fix the relative instance URL to an absolute URL
PREFIX rdfg: <http://www.w3.org/2004/03/trix/rdfg-1/>

<urn:uuid:1d8b61bc-c7f3-4e9e-a3bd-f4ec24beb586> {
  <urn:uuid:1d8b61bc-c7f3-4e9e-a3bd-f4ec24beb586>
    rdf:type                       md:FullModel, rdfg:Graph ;
    md:Model.DependentOn           <urn:uuid:2dd9014f-bdfb-11e5-94fa-c8f73332c8f4> ;
    md:Model.created               "2017-11-24T09:03:09.9446768Z" ;
    md:Model.description           "CGM Test model developed by Statnett SF. Nordic 44 bus system for the Nordic region" ;
    md:Model.modelingAuthoritySet  "http://www.Statnett.no/IGM/Nordic44_CGM" ;
    md:Model.profile               "http://entsoe.eu/CIM/SteadyStateHypothesis/1/1" , "http://entsoe.eu/CIM/SteadyStateHypothesis/1/2" ;
    md:Model.scenarioTime          "2015-03-06T01:30:00.0000000Z" ;
    md:Model.version               "36".

  <http://www.Statnett.no/IGM/Nordic44_CGM/_e2f56599-a78e-494f-8db3-c0b0bdab1d70>
    rdf:type                    cim:Terminal ;
    cim:ACDCTerminal.connected  "true" .
}

Representing Difference Models

  • #53 representing difference models
  • #85 problems converting CIM XML files to Turtle

The problem is especially acute for difference models. CGMES-TC/FullGrid_SC_diff.xml is an example of such a model. CIM XML uses its own dialect of RDF/XML with rdf:parseType="Statements". This non-standard addition is only supported in CIM-specific tools and is a major impediment to the use of standard semantic web processing tools. (Eg if you use Jena, the parseType="Statements" payload is captured as a string, not as triples).

CIM Difference Models are important because they record a delta against a base model, thus enabling "What If" analysis and other important scenarios.

In particular, a Difference Model is associated with 4 named graphs:

  • Model metadata in the model graph
  • Refers to the base model using md:Model.Supersedes
  • Checks for the presence of certain statements using dm:preconditions (but this is not used in CIM)
  • Specifies statements to delete using dm:reverseDifferences
  • Specifies statements to insert using dm:forwardDifferences
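
The delete/insert semantics of the last two items can be sketched with plain Python sets. This is an illustration only: triples are modeled as (subject, predicate, object) tuples, the function name apply_difference is hypothetical, and the sample statements are taken from the FullGrid_OP_diff example used in this document (the base set shown is an assumed excerpt of the base model).

```python
def apply_difference(base, reverse, forward):
    """Hypothetical helper: delete the reverse statements, then add the forward ones."""
    return (base - reverse) | forward


S = "http://fullgrid.eu/CGMES/3.0#87478acb-cd1f-40a6-b4a7-59ec99f8b063"
P = "cim:IdentifiedObject.description"

base = {(S, P, "SET_PNT_1")}           # assumed excerpt of the base model
reverse = {(S, P, "SET_PNT_1")}        # dm:reverseDifferences
forward = {(S, P, "SET_PNT_1 test")}   # dm:forwardDifferences

updated = apply_difference(base, reverse, forward)
```

After the call, updated holds only the statement with the new description.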

Naive JSON-LD Graph Representation Attempt

Note: in this and the next subsection we use illustrative graph names (eg base-model, reverse, forward) but these are not valid urn:uuid URNs.

RDF/XML cannot carry named graphs, but JSON-LD and Trig (Turtle with graphs) can.

  • 3lbits/CIM4NoUtility#321 Converting CIMXML DifferenceModel to CIMJSON-LD makes a couple of naive attempts to represent a DifferenceModel using the nesting structure of JSON-LD.

See the trials folder for some attempts. For example, option2.jsonld looks like this:

{
  "@graph": [
    {
      "@id": "urn:uuid:difference-model1",
      "@type": "dm:DifferenceModel",
      "dm:reverseDifferences": [
        {
          "@id": "urn:uuid:9d58e5bb-834c-4faa-928c-7da0bb1497d9",
          "@type": "cim:ACLineSegment",
          "cim:Conductor.length": {"cim:Length.value": 50.0}
        },
        {
          "@id": "urn:uuid:9d58e5bb-834c-4faa-928c-7da0bb1497d5",
          "@type": "cim:Switch",
          "cim:IdentifiedObject.Name": "Switch1"
        }
      ]
    },
    {
      "@id": "urn:uuid:difference-model2",
      "@type": "dm:DifferenceModel",
      "dm:forwardDifferences": [
        {
          "@id": "urn:uuid:9d58e5bb-834c-4faa-928c-7da0bb1497d9",
          "@type": "cim:ACLineSegment",
          "cim:Conductor.length": {"cim:Length.value": 55.0}
        },
        {
          "@id": "urn:uuid:9d58e5bb-834c-4faa-928c-7da0bb1497d5",
          "@type": "cim:Switch",
          "cim:IdentifiedObject.Name": "Switch2"
        }
      ]
    },
    {
      "@id": "urn:uuid:difference-model3",
      "@type": "dm:DifferenceModel",
      "dm:reverseDifferences": [
        {
          "@id": "urn:uuid:9d58e5bb-834c-4faa-928c-7da0bb1497d9",
          "@type": "cim:ACLineSegment",
          "cim:Conductor.length": {"cim:Length.value": 60.0}
        }
      ]
    },
    {
      "@id": "urn:uuid:difference-model4",
      "@type": "dm:DifferenceModel",
      "dm:forwardDifferences": [
        {
          "@id": "urn:uuid:9d58e5bb-834c-4faa-928c-7da0bb1497d9",
          "@type": "cim:ACLineSegment",
          "cim:Conductor.length": {"cim:Length.value": 65.0}
        }
      ]
    }
  ]
}

But if we convert this to Trig using Jena RIOT:

riot --formatted=trig option2.jsonld > option2.trig

We see a mixup:

  • Two of the reverse differences are mixed together at model1
  • Two of the forward differences are mixed together at model2
  • The statements Conductor.length are mixed together
<urn:uuid:difference-model1>
  rdf:type               dm:DifferenceModel ;
  dm:reverseDifferences  <urn:uuid:9d58e5bb-834c-4faa-928c-7da0bb1497d5> , <urn:uuid:9d58e5bb-834c-4faa-928c-7da0bb1497d9> .

<urn:uuid:difference-model2>
  rdf:type               dm:DifferenceModel ;
  dm:forwardDifferences  <urn:uuid:9d58e5bb-834c-4faa-928c-7da0bb1497d5> , <urn:uuid:9d58e5bb-834c-4faa-928c-7da0bb1497d9> .

<urn:uuid:difference-model3>
  rdf:type               dm:DifferenceModel ;
  dm:reverseDifferences  <urn:uuid:9d58e5bb-834c-4faa-928c-7da0bb1497d9> .

<urn:uuid:difference-model4>
  rdf:type               dm:DifferenceModel ;
  dm:forwardDifferences  <urn:uuid:9d58e5bb-834c-4faa-928c-7da0bb1497d9> .

<urn:uuid:9d58e5bb-834c-4faa-928c-7da0bb1497d5>
  rdf:type                   cim:Switch ;
  cim:IdentifiedObject.Name  "Switch2" , "Switch1" .

<urn:uuid:9d58e5bb-834c-4faa-928c-7da0bb1497d9>
  rdf:type              cim:ACLineSegment ;
  cim:Conductor.length  [ cim:Length.value  65 ] ;
  cim:Conductor.length  [ cim:Length.value  60 ] ;
  cim:Conductor.length  [ cim:Length.value  55 ] ;
  cim:Conductor.length  [ cim:Length.value  50 ] .

Nearly Correct JSON-LD Graph Representation

We can correct the representation by adding graph names (URNs). Let's start with Trig (option3.trig).

<urn:uuid:base-model> a dm:Model.

<urn:uuid:base-model> {
  <urn:uuid:9d58e5bb-834c-4faa-928c-7da0bb1497d5>
    rdf:type                   cim:Switch ;
    cim:IdentifiedObject.Name  "Switch1".

  <urn:uuid:9d58e5bb-834c-4faa-928c-7da0bb1497d9>
    rdf:type              cim:ACLineSegment ;
    cim:Conductor.length  [ cim:Length.value  50 ] .
}

<urn:uuid:difference-model1> a dm:DifferenceModel ;
  md:Model.Supersedes <urn:uuid:base-model>;
  dm:forwardDifferences <urn:uuid:difference-model1-forward>;
  dm:reverseDifferences <urn:uuid:difference-model1-reverse>.

<urn:uuid:difference-model1-reverse> {
  <urn:uuid:9d58e5bb-834c-4faa-928c-7da0bb1497d9> cim:Conductor.length  [ cim:Length.value  50 ] .
  <urn:uuid:9d58e5bb-834c-4faa-928c-7da0bb1497d5> cim:IdentifiedObject.Name "Switch1" .
}

<urn:uuid:difference-model1-forward> {
  <urn:uuid:9d58e5bb-834c-4faa-928c-7da0bb1497d9> cim:Conductor.length  [ cim:Length.value  55 ] .
  <urn:uuid:9d58e5bb-834c-4faa-928c-7da0bb1497d5> cim:IdentifiedObject.Name "Switch2" .
}


<urn:uuid:difference-model2> a dm:DifferenceModel ;
  md:Model.Supersedes <urn:uuid:difference-model1>;
  dm:reverseDifferences <urn:uuid:difference-model2-reverse>;
  dm:forwardDifferences <urn:uuid:difference-model2-forward>.

<urn:uuid:difference-model2-reverse> {
  <urn:uuid:9d58e5bb-834c-4faa-928c-7da0bb1497d9> cim:Conductor.length  [ cim:Length.value  60 ]
}

<urn:uuid:difference-model2-forward> {
  <urn:uuid:9d58e5bb-834c-4faa-928c-7da0bb1497d9> cim:Conductor.length  [ cim:Length.value  65 ]
}

Let's convert this to JSON-LD.

  • The crucial difference is that the @graph elements now have @id
  • There are also two levels of @graph: an outer envelope that carries all quads, and inner named graphs
{
  "@graph": [
    {
      "@id": "urn:uuid:base-model",
      "@type": "dm:Model",
      "@graph": [
        {
          "@id": "urn:uuid:9d58e5bb-834c-4faa-928c-7da0bb1497d9",
          "cim:Conductor.length": {"@id": "_:b4"},
          "@type": "cim:ACLineSegment"
        },
        {
          "@id": "_:b4",
          "cim:Length.value": {
            "@value": "50",
            "@type": "xsd:integer"
          }
        },
        {
          "@id": "urn:uuid:9d58e5bb-834c-4faa-928c-7da0bb1497d5",
          "cim:IdentifiedObject.Name": "Switch1",
          "@type": "cim:Switch"
        }
      ]
    },
    {
      "@id": "urn:uuid:difference-model1",
      "dm:reverseDifferences": {"@id": "urn:uuid:difference-model1-reverse"},
      "dm:forwardDifferences": {"@id": "urn:uuid:difference-model1-forward"},
      "md:Model.Supersedes": {"@id": "urn:uuid:base-model"},
      "@type": "dm:DifferenceModel"
    },
    {
      "@id": "urn:uuid:difference-model1-reverse",
      "@graph": [
        {
          "@id": "_:b0",
          "cim:Length.value": {
            "@value": "50",
            "@type": "xsd:integer"
          }
        },
        {
          "@id": "urn:uuid:9d58e5bb-834c-4faa-928c-7da0bb1497d9",
          "cim:Conductor.length": {
            "@id": "_:b0"
          }
        },
        {
          "@id": "urn:uuid:9d58e5bb-834c-4faa-928c-7da0bb1497d5",
          "cim:IdentifiedObject.Name": "Switch1"
        }
      ]
    },
    {
      "@id": "urn:uuid:difference-model1-forward",
      "@graph": [
        {
          "@id": "urn:uuid:9d58e5bb-834c-4faa-928c-7da0bb1497d9",
          "cim:Conductor.length": {
            "@id": "_:b3"
          }
        },
        {
          "@id": "_:b3",
          "cim:Length.value": {
            "@value": "55",
            "@type": "xsd:integer"
          }
        },
        {
          "@id": "urn:uuid:9d58e5bb-834c-4faa-928c-7da0bb1497d5",
          "cim:IdentifiedObject.Name": "Switch2"
        }
      ]
    },
    {
      "@id": "urn:uuid:difference-model2",
      "dm:forwardDifferences": {"@id": "urn:uuid:difference-model2-forward"},
      "dm:reverseDifferences": {"@id": "urn:uuid:difference-model2-reverse"},
      "md:Model.Supersedes": {"@id": "urn:uuid:difference-model1"},
      "@type": "dm:DifferenceModel"
    },
    {
      "@id": "urn:uuid:difference-model2-reverse",
      "@graph": [
        {
          "@id": "_:b1",
          "cim:Length.value": {
            "@value": "60",
            "@type": "xsd:integer"
          }
        },
        {
          "@id": "urn:uuid:9d58e5bb-834c-4faa-928c-7da0bb1497d9",
          "cim:Conductor.length": {
            "@id": "_:b1"
          }
        }
      ]
    },
    {
      "@id": "urn:uuid:difference-model2-forward",
      "@graph": [
        {
          "@id": "urn:uuid:9d58e5bb-834c-4faa-928c-7da0bb1497d9",
          "cim:Conductor.length": {
            "@id": "_:b2"
          }
        },
        {
          "@id": "_:b2",
          "cim:Length.value": {
            "@value": "65",
            "@type": "xsd:integer"
          }
        }
      ]
    }
  ]
}

Note: We'll see later how a richer @context reduces the expanded representation:

"cim:Length.value": {
 "@value": "50",
 "@type": "xsd:integer"
}

To the compact and natural representation:

"cim:Length.value": "50"

But there are still some problems:

  • URNs like urn:uuid:difference-model1-forward are not valid URNs under the urn:uuid: scheme, so we must generate new UUIDs for the reverse and forward graphs.
  • There are blank nodes, represented in Trig as cim:Conductor.length [cim:Length.value 60] and in JSON-LD as _:b4 etc. This is a problem, since we cannot delete a blank node by specifying another blank node in the reverse graph: any two blank nodes are distinct unless they come from the same file and have the same blank node label. So it is good that actual CIM instance data has the simpler representation cim:Conductor.length "60", and we fixed the CIM ontologies to use the simpler representation (#38).
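
The blank node problem can be illustrated with a small Python sketch. It is a simplified model, not an RDF library: the hypothetical BlankNode class is equal only to itself, which mimics blank nodes coming from different files, and triples are plain tuples in a set.

```python
class BlankNode:
    """Hypothetical stand-in for a blank node: equal only to itself."""

    def __init__(self, label):
        self.label = label  # local label, carries no identity across files


LINE = "urn:uuid:9d58e5bb-834c-4faa-928c-7da0bb1497d9"

# Blank node representation: the reverse graph has its own, distinct blank node
b_base, b_reverse = BlankNode("b4"), BlankNode("b0")
base_bnode = {(LINE, "cim:Conductor.length", b_base),
              (b_base, "cim:Length.value", 50)}
reverse_bnode = {(LINE, "cim:Conductor.length", b_reverse),
                 (b_reverse, "cim:Length.value", 50)}
after_bnode = base_bnode - reverse_bnode     # nothing matches, nothing is deleted

# Simple literal representation: the reverse statement matches exactly
base_literal = {(LINE, "cim:Conductor.length", "50")}
reverse_literal = {(LINE, "cim:Conductor.length", "50")}
after_literal = base_literal - reverse_literal   # the statement is deleted
```

With blank nodes the base is left unchanged, while the literal form is deleted as intended.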

Custom CIM XML Parser

#94 make custom CIM XML parser

We need to implement a custom CIM XML parser that handles parseType="Statements" and emits named graphs.

cim-trig.pl is a Perl script that converts a CIM XML file to Trig (Turtle with graphs). It uses simple string manipulation rather than an XML parser, so it relies on a repeatable line-by-line layout of the CIM XML:

  • A file has exactly one model: md:FullModel or dm:DifferenceModel
  • dm:DifferenceModel has exactly two sections dm:reverseDifferences and dm:forwardDifferences, in this order, even if one of them is empty

It uses command-line tools to do the bulk of the work (see sub ttl):

owl.bat write --keepUnusedPrefixes -i rdfxml ...rdf ...ttl
  • For very large files, give option -r to use Jena Riot in streaming mode:
riot.bat --syntax=rdfxml --stream=ttl ...rdf > ...ttl

For a dm:DifferenceModel it invokes the command-line tool 3 times:

  • To convert the model statements
  • To convert the dm:reverseDifferences statements
  • To convert the dm:forwardDifferences statements

It generates new urn:uuid URIs for the reverse and forward models (using UUID v4), and adds named graphs to all model parts. In particular, model metadata is stored in the model graph, so it can be updated or deleted easily (eg by using the SPARQL Graph Protocol).
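
The script itself is written in Perl; purely as an illustration, generating and checking such URNs can be sketched in Python with the standard uuid module. The function names new_graph_urn and is_uuid_urn are hypothetical.

```python
import uuid


def new_graph_urn():
    """Return a fresh graph name such as urn:uuid:<random UUID v4>."""
    return uuid.uuid4().urn


def is_uuid_urn(value):
    """Check that value is urn:uuid: followed by something uuid.UUID can parse."""
    prefix = "urn:uuid:"
    if not value.startswith(prefix):
        return False
    try:
        uuid.UUID(value[len(prefix):])
    except ValueError:
        return False
    return True
```

Note that uuid.UUID is lenient (for example, it accepts a UUID written without hyphens), so is_uuid_urn is a sanity check rather than a strict validator. Illustrative names such as urn:uuid:difference-model1-forward are rejected.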

See test results in test/trig. Let's look at a couple of examples.

test/trig/FullGrid_OP.trig:

<urn:uuid:52a409c9-72d8-4b5f-bf72-9a22ec9353f7> { # model graph

# model metadata
<urn:uuid:52a409c9-72d8-4b5f-bf72-9a22ec9353f7> a md:FullModel ;
  md:Model.DependentOn <urn:uuid:0cd6ada4-b6dc-4a36-a98c-877a39168cd3> ;
  md:Model.created "2020-12-10T00:21:43Z" ;

# statements
<http://fullgrid.eu/CGMES/3.0#_13dacabf-aa4c-4a78-806e-c7c4c6949718> a cim:Discrete ;
  cim:Discrete.ValueAliasSet <http://fullgrid.eu/CGMES/3.0#1a457323-2094-440f-8d30-dc93adf0cdb3> ;
...
}

test/trig/FullGrid_OP_diff.trig:

<urn:uuid:05edbf91-231f-4386-97c0-d4cb498d0afc> { # model graph

# model metadata
<urn:uuid:05edbf91-231f-4386-97c0-d4cb498d0afc> a dm:DifferenceModel ;
  dm:forwardDifferences <urn:uri:63528ef9-48ff-469b-a58e-ba274f2a10bb> ;
  dm:reverseDifferences <urn:uri:27c8a164-c656-4712-994a-0ab7cec4fd34> ;
  md:Model.DependentOn <urn:uuid:0cd6ada4-b6dc-4a36-a98c-877a39168cd3> ;
  md:Model.Supersedes <urn:uuid:52a409c9-72d8-4b5f-bf72-9a22ec9353f7> ; # base model
  md:Model.created "2021-11-19T23:16:27Z" ;
}


<urn:uri:27c8a164-c656-4712-994a-0ab7cec4fd34> { # reverseDifferences
  <http://fullgrid.eu/CGMES/3.0#87478acb-cd1f-40a6-b4a7-59ec99f8b063> cim:IdentifiedObject.description "SET_PNT_1" .
  <http://fullgrid.eu/CGMES/3.0#fc908c16-468f-4a64-ba74-6f57175e0005> cim:AnalogLimit.value "99" .
}

<urn:uri:63528ef9-48ff-469b-a58e-ba274f2a10bb> { # forwardDifferences
  <http://fullgrid.eu/CGMES/3.0#87478acb-cd1f-40a6-b4a7-59ec99f8b063> cim:IdentifiedObject.description "SET_PNT_1 test" .
  <http://fullgrid.eu/CGMES/3.0#fc908c16-468f-4a64-ba74-6f57175e0005> cim:AnalogLimit.value "100" .
}

Instance Data Fixes

Fix Resource URLs

  • #87 bad relative URLs (need BASE or urn:uuid:)
  • #98 URL policy about MAS and BASE

The URLs of CIM power system resources are represented in CIM XML like this:

  • definition:
    • rdf:ID="_f37786d0-b118-4b92-bafb-326eac2a3877"
    • or rdf:about="#_f37786d0-b118-4b92-bafb-326eac2a3877"
  • reference: rdf:resource="#_44e63d79-6b05-4c64-b490-d181863af7da"

They have two problems:

1. These are relative URLs.

  • However, CIM XML files don't specify xml:base (see RDF 1.1 XML Syntax, section 2.14 Abbreviating URIs: rdf:ID and xml:base).
  • This means the URLs are resolved in a tool-dependent way (e.g. by using the file location on local disk).
  • This is a serious problem that undermines the stability of resource URLs.
  • We've resolved it by declaring md:Model.modelingAuthoritySet as BASE.
  • This is fixed by the cim-trig.pl script described above: see URL examples in the previous section.

2. They start with a parasitic _.

  • The reason is that rdf:ID cannot start with a digit, see
  • rdf:about could have been used instead of rdf:ID to avoid that limitation.
  • This is a purely cosmetic problem and we leave it as is.
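
To make the first problem concrete, the sketch below resolves the same relative reference against two different bases using Python's standard urljoin. It only illustrates generic relative URL resolution, where rdf:ID="x" is equivalent to rdf:about="#x"; it does not reproduce the exact joining rule of cim-trig.pl, and the function names are hypothetical.

```python
from urllib.parse import urljoin


def resolve_reference(base, reference):
    """Resolve an rdf:about or rdf:resource value against a base URL."""
    return urljoin(base, reference)


def resolve_rdf_id(base, rdf_id):
    """Resolve an rdf:ID value, which is shorthand for the fragment '#' + ID."""
    return urljoin(base, "#" + rdf_id)


REF = "#_e2f56599-a78e-494f-8db3-c0b0bdab1d70"

# Without xml:base, a tool may fall back to the location of the file on disk
local = resolve_reference(
    "file:///d:/Onto/proj/electrical/Nordic44/Instances/CGMES_2_4/Nordic44_CGM_36d_SSH.xml",
    REF)

# With an explicit base, the result no longer depends on where the file lives
stable = resolve_reference("http://fullgrid.eu/CGMES/3.0", REF)
```

The first result changes whenever the file is moved, which is why a declared base is needed.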

Add Datatypes To Instance Data

#49 Add Datatypes To Instance Data

In CGMES instance data, all literals are strings, but should be marked with the appropriate datatype.

  • E.g. cim:ACDCConverter.baseS should be marked ^^xsd:float
  • Otherwise sort won't work properly and range queries will be slower.
  • This pertains to boolean, dateTime, float, gMonthDay, integer
  • string is the default datatype

Property Datatype Maps and the sibling folder datatypes make a comprehensive analysis. We extract a datatypes map, omitting hijacked namespaces and xsd:string:

grep -E '^(cim|nc|eu|md|eumd)' datatypes-older.tsv | grep -v xsd:string > fix-datatypes.ru

Then we format it as values for use in SPARQL.

We make 3 scripts to account for namespace differences:

  • fix-datatypes-old.ru works with the old namespaces:
prefix cim: <http://iec.ch/TC57/CIM100#>
prefix eu:  <http://iec.ch/TC57/CIM100-European#>
  • fix-datatypes-new.ru works with the new namespaces:
prefix cim:  <https://cim.ucaiug.io/ns#>
prefix eu:   <https://cim.ucaiug.io/ns/eu#>
  • fix-datatypes-both.ru works with either set of namespaces.
  • Note: the NC spec is new, so its prefix is only available in the new namespaces:
prefix nc:   <https://cim4.eu/ns/nc#>

The more complex "both" script works like this:

  • Defines dual prefixes cim, cim1 and eu, eu1:
prefix cim:  <https://cim.ucaiug.io/ns#>
prefix cim1: <http://iec.ch/TC57/CIM100#>
prefix eu:   <https://cim.ucaiug.io/ns/eu#>
prefix eu1:  <http://iec.ch/TC57/CIM100-European#>
prefix nc:   <https://cim4.eu/ns/nc#>
prefix eumd: <https://cim4.eu/ns/Metadata-European#>
prefix md:   <http://iec.ch/TC57/61970-552/ModelDescription/1#>
prefix xsd:  <http://www.w3.org/2001/XMLSchema#>
delete {graph ?g {?x ?p ?old}}
insert {graph ?g {?x ?p ?new}}
  • The where clause includes a pretty huge mapping table from props to datatypes
    • It finds quads where the ?old value is xsd:string
    • Maps it to the appropriate datatype, considering different namespace versions
where {
  values (?prop ?dt) {
    (cim:ACDCConverter.baseS xsd:float)
    # 3000 more rows
  }
  graph ?g {?x ?p ?old}
  filter(datatype(?old)=xsd:string)
  bind(if(strstarts(str(?p),str(cim1:)),uri(concat(str(cim:),strafter(str(?p),str(cim1:)))),?UNDEF) as ?p1)
  bind(if(strstarts(str(?p),str(eu1:)), uri(concat(str(eu:), strafter(str(?p),str(eu1:)))), ?UNDEF) as ?p2)
  filter(?p=?prop || ?p1=?prop || ?p2=?prop)
  bind(strdt(?old,?dt) as ?new)
};
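
The logic of the where clause can be paraphrased in Python as follows. This is a sketch of the mapping idea only, not a replacement for the SPARQL Update: the table contains just the one row shown above, and the function names normalize and typed_literal are hypothetical.

```python
NEW_CIM = "https://cim.ucaiug.io/ns#"
OLD_CIM = "http://iec.ch/TC57/CIM100#"
NEW_EU = "https://cim.ucaiug.io/ns/eu#"
OLD_EU = "http://iec.ch/TC57/CIM100-European#"
XSD = "http://www.w3.org/2001/XMLSchema#"

# Stand-in for the large values table (props -> datatypes), keyed by new namespace
DATATYPES = {NEW_CIM + "ACDCConverter.baseS": XSD + "float"}


def normalize(prop):
    """Rewrite an old-namespace property URL to the new namespace (like ?p1, ?p2)."""
    for old, new in ((OLD_CIM, NEW_CIM), (OLD_EU, NEW_EU)):
        if prop.startswith(old):
            return new + prop[len(old):]
    return prop


def typed_literal(prop, value):
    """Return (value, datatype) if the property is in the table, else None."""
    datatype = DATATYPES.get(normalize(prop))
    return (value, datatype) if datatype else None
```

As in the SPARQL version, the property URL itself is left as found in the data; only the literal gets a datatype.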

These updates can be applied on:

  • One CIM file, using an in-memory SPARQL Update tool like Jena update.bat (but it needs inordinate amounts of RAM for large files)
  • A whole repository of CIM data, eg using GraphDB

We include 3 versions because applying "both" on old data produces cim1, eu1 prefixes. This is harmless, but doesn't look nice.

Sample Instance Data

To work out reasoning, validation and performance issues, we need sample instance data. We can use the following datasets (one of them has minor defects):

  • #134 ENTSO-E_Test_Configurations_v3.0.2 defects
| dataset | xml | zip | files | FullModel | triples | largest | largest file |
|---|---|---|---|---|---|---|---|
| Nordic44 | 2.9M | | 15 | 12 | 35481 | 17420 | CGMES_2_4/Nordic44_CGM_37a_EQ.xml |
| ENTSO-E_Test_Configurations_v3.0.2 | 151M | 19M | 357 | 350 | 1844380 | 947208 | RealGrid/RealGrid-Merged/RealGrid_EQ.xml |
| Multiplied | 11G | 1.9G | 4 | 4 | | 94720800 | RealGrid_EQ100.zip |
| Statnett | 800MB | 30MB | | | | | |

  • "FullModel" are files that have a standard md:FullModel structure. ENTSOE also has 7 dm:DifferenceModel
  • See next section for counting triples
  • See Multiplied Data for "multiplied"
  • "Statnett" describes the actual Statnett grid, which is not public data. It's included only for comparison

Counting Triples

ENTSO-E files are nested 2-3 levels deep in the folder hierarchy:

cd ENTSO-E_Test_Configurations_v3.0.2/v3.0
find . -name '*.xml' | perl -pe 's{[\w-]+}{*}g' | sort | uniq -c
     47 ./*/*/*.*
    310 ./*/*/*/*.*

I want to use riot.bat --count to see how many triples there are in total. But we will exclude DifferenceModel files (*_diff.xml) because riot cannot handle them (they are not in standard RDF/XML format):

find . -name '*.xml' ! -name '*diff*' | wc
    350     350   23847

The total length of all filenames is quite large (24k) so it overflows the command line:

riot.bat --count `find . -name '*.xml' ! -name '*diff*'`
The command line is too long.

In such cases one uses xargs. Since the environment and the command line together are subject to a size limit, I tried to remove some wordy env vars (ORIGINAL_PATH= PSModulePath= INFOPATH=), but the command is still longer than the limit on my shell (Cygwin Bash):

find . -name '*.xml' ! -name '*diff*' | env ORIGINAL_PATH= PSModulePath= INFOPATH= xargs --show-limit riot.bat --count
Your environment variables take up 3940 bytes
POSIX upper limit on argument length (this system): 26012
POSIX smallest allowable upper limit on argument length (all systems): 4096
Maximum length of command we could actually use: 22072
Size of command buffer we are actually using: 26012
Maximum parallelism (--max-procs must be no greater): 2147483647
The command line is too long.

So I have to split the work in several parts: -n 100 passes 100 files at a time, and 2> saves STDERR to a file:

find . -name '*.xml' ! -name '*diff*' | xargs -n 100 riot.bat --count 2> count-ENTSOE.txt
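
What xargs -n 100 does here is split the list of filenames into batches of at most 100 and run the command once per batch. A minimal Python sketch of that batching step (the function name batches is hypothetical):

```python
from itertools import islice


def batches(items, size):
    """Yield consecutive lists of at most `size` items, like `xargs -n size`."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch
```

With the 350 files found above and a batch size of 100, this gives four invocations, the last with 50 files.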

I wrote a small script to massage this file:

perl count.pl count-ENTSOE.txt > count-ENTSOE1.txt

The total is 1844380 (1.8M triples) and the largest file is

947208  ./RealGrid/RealGrid-Merged/RealGrid_EQ.xml

Nordic44 files are a lot smaller:

cd Nordic44/Instances
find . -name '*.xml' | xargs riot.bat --count 2> count-Nordic.txt
perl count.pl count-Nordic.txt > count-Nordic1.txt

The total is 35481 (35k triples) and the largest file is

17420	./CGMES_2_4/Nordic44_CGM_37a_EQ.xml

Multiplied Data

  • #117 multiply instance data

ENTSO-E plus Nordic44 make only about 1.9M triples (1844380 + 35481 = 1879861). This is not very much as semantic databases go, so we decided to multiply it 100 times to obtain bigger examples.

Chavdar Ivanov took 4 files from ENTSO-E_Test_Configurations_v3.0.2 and multiplied the data in 4 variants (10, 20, 50 and 100 times). The results are in this Microsoft Teams Drive.

I got only the largest files: RealGrid_EQ100.zip, RealGrid_SSH100.zip, RealGrid_SV100.zip, RealGrid_TP100.zip. They are 1.9GB zipped, 11GB unzipped.

The files use DOS line endings and may have a byte-order mark (BOM). A BOM doesn't play well with riot, so we remove the BOM and convert to Unix line endings:

d2u *

(This takes about 15 minutes because the files are large.)
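For readers without d2u (dos2unix), a rough Python equivalent is sketched below. It is a minimal sketch, assuming UTF-8 input; the function name strip_bom_and_crlf is hypothetical and this is not how dos2unix itself is implemented. It streams the file in chunks so that multi-gigabyte files are never loaded whole, and it takes care of a CR LF pair that is split across two chunks.

```python
import io

UTF8_BOM = b"\xef\xbb\xbf"


def strip_bom_and_crlf(src, dst, chunk_size: int = 1 << 20) -> None:
    """Copy src to dst, dropping a leading UTF-8 BOM and turning CRLF into LF."""
    if chunk_size < len(UTF8_BOM):
        raise ValueError("chunk_size must be at least 3 bytes")
    first = True
    pending_cr = False  # True when the previous chunk ended with a bare CR
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        if first:
            if chunk.startswith(UTF8_BOM):
                chunk = chunk[len(UTF8_BOM):]
            first = False
        if pending_cr:
            chunk = b"\r" + chunk
            pending_cr = False
        if chunk.endswith(b"\r"):
            # Hold the CR back: the matching LF may start the next chunk.
            chunk = chunk[:-1]
            pending_cr = True
        dst.write(chunk.replace(b"\r\n", b"\n"))
    if pending_cr:
        dst.write(b"\r")  # a lone CR at end of file is kept as-is


# Small in-memory demonstration with a deliberately tiny chunk size.
demo_in = io.BytesIO(UTF8_BOM + b"<a>\r\n<b/>\r\n</a>\r\n")
demo_out = io.BytesIO()
strip_bom_and_crlf(demo_in, demo_out, chunk_size=7)
```

With real files one would open the source and target in binary mode ("rb" and "wb") and pass the two file objects.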

The files also include

xml:base="http://iec.ch/TC57/CIM100"

which doesn't match other instance files, contradicts the decision to use modelAuthoritySet as base, and is inappropriate as a base for instance URLs. So cim-trig removes it.
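To make the removal concrete, here is a minimal Python sketch that drops the first xml:base attribute from a chunk of XML text, such as the opening rdf:RDF tag. It only illustrates the idea; cim-trig.pl is a Perl script and may do this differently. The function name drop_xml_base and the sample header are hypothetical.

```python
import re

# Matches one xml:base attribute with a single- or double-quoted value,
# including the whitespace in front of it.
XML_BASE = re.compile(r"""\s+xml:base\s*=\s*(?:"[^"]*"|'[^']*')""")


def drop_xml_base(header: str) -> str:
    """Return header without its first xml:base attribute (if any)."""
    return XML_BASE.sub("", header, count=1)


# Hypothetical root element, shortened for the example.
sample_header = (
    '<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" '
    'xml:base="http://iec.ch/TC57/CIM100">'
)
cleaned_header = drop_xml_base(sample_header)
```

A regular expression is enough here because the attribute sits on the root element near the top of the file; a full XML parser would be the safer choice if the attribute could appear in unexpected places.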

The largest file is 8GB and takes 8 min to convert from CIM XML to Trig (with about 20GB of RAM for Java and a similar amount for Perl). However, the update fix-datatypes-old.ru cannot be executed in-memory with the Jena update command on my laptop (64GB RAM). With default JVM parameters, it throws:

Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "main"

We then allowed Java to take 60GB, but that causes swapping and slows down the process:

# cmd:
set JVM_ARGS=-Xmx60000M -Dfile.encoding=UTF-8
update.bat --update=fix-datatypes-old.ru --data=temp1.trig --dump > instances/multiplied/RealGrid_EQ100.trig

# bash:
export JVM_ARGS="-Xmx60000M -Dfile.encoding=UTF-8"
time update.bat --update=fix-datatypes-old.ru --data=temp1.trig --dump > instances/multiplied/RealGrid_EQ100.trig

The process was really busy, taking 60-80% of CPU and lots of RAM. I canceled it after 140 min. So we need to run this update against a database (GraphDB), not against the Jena in-memory store.

The update runs successfully only for RealGrid_TP100.trig (9.6M triples).

Final Instance Data

The final instance data for testing consists of the following trig files:

| Instances | folders | files | trig | zip |
|---|---|---|---|---|
| Nordic44 | 3 | 12 | 5M | 340k |
| ENTSO-E_Test_Configurations_v3.0.2 | 43 | 357 | 179M | 23.8M |
| multiplied | 1 | 4 | 9.67G | 2.2G |
| TOTAL | 47 | 373 | 9.85G | 2.2G |

  • Only ENTSOE has 7 dm:DifferenceModel, all others are md:FullModel.
  • DifferenceModels cannot be validated on their own (see shacl-improved for a scenario)

The 3 zipped files are available publicly in the Google Folder instance-zipped.

JSON-LD Serialization

After converting CIM XML to a representation using named graphs (Trig), we can convert it to JSON-LD. E.g. to convert an instance file using the old namespaces, we use this command:

riot.bat --formatted jsonld test/trig/FullGrid_OP.trig | jsonld compact -c https://rawgit2.com/Sveino/Inst4CIM-KG/develop/rdf-improved/cim-context-old.jsonld

The tools used are described in the sibling folder at JSON-LD Serialization.

JSON-LD Context

A good JSON-LD serialization depends on an appropriate context that defines namespaces and property characteristics, i.e. @type (@id for object props, XSD datatype for datatype props).
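The role of @type in a context can be illustrated with a small Python sketch. The cim prefix is the new namespace used in this document; the three term definitions are examples only (cim:exampleObjectProp is hypothetical) and are not copied from the real context files. The coerce function is a much simplified stand-in for the type coercion that a JSON-LD processor performs during expansion.

```python
XSD = "http://www.w3.org/2001/XMLSchema#"

# Illustrative context, not the content of cim-context-new.jsonld.
context = {
    "cim": "https://cim.ucaiug.io/ns#",
    "xsd": XSD,
    "cim:reactance": {"@type": "xsd:float"},          # datatype property
    "cim:normallyInService": {"@type": "xsd:boolean"},  # datatype property
    "cim:exampleObjectProp": {"@type": "@id"},        # hypothetical object property
}


def coerce(term: str, value: str, ctx: dict = context) -> dict:
    """Turn a plain string into a node reference or a typed value object."""
    definition = ctx.get(term)
    declared = definition.get("@type") if isinstance(definition, dict) else None
    if declared == "@id":
        return {"@id": value}            # object property: value is an IRI
    if declared is not None:
        return {"@value": value, "@type": declared}
    return {"@value": value}             # no declared type: plain string


typed_number = coerce("cim:reactance", "0.123")
node_ref = coerce("cim:exampleObjectProp", "urn:uuid:hypothetical-id")
```

This is why the compacted instance data can carry plain strings: the datatype or the "this is a link" information lives in the context, not in every value.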

We want to cater to old and new namespaces, so we use some text (not proper JSON-LD) files to assemble contexts:

  • cim-context-common.txt: a common file that defines the common namespaces, and characteristics for about 5100 props
  • (cim-context-strings.txt: a "spill-over" file that defines 120 xsd:string properties: not added to context since that is the default datatype)
  • cim-context-new.txt: prefix file for JSON-LD context using new namespaces
 {"cim":          "https://cim.ucaiug.io/ns#",
  "eu":           "https://cim.ucaiug.io/ns/eu#",
  • cim-context-old.txt: prefix file for JSON-LD context using old namespaces
 {"cim":          "http://iec.ch/TC57/CIM100#",
  "eu":           "http://iec.ch/TC57/CIM100-European#",

The assembled context files are:

  • cim-context-new.jsonld: JSON-LD context using new namespaces
  • cim-context-old.jsonld: JSON-LD context using old namespaces

#110 deploy JSON-LD contexts on a permanent network location:

Formatting of Numbers and Booleans

#120 number representation in JSONLD

JSON has only a few native literal datatypes: number, boolean, string, null. JSON numbers are imprecise:

  • There is no distinction between integer and floating point
  • JSON doesn't define whether a number should be represented as float or double
  • Exact numbers (xsd:decimal) are not available natively

This is raised as issue json-ld-syntax#387, and is accepted in the JSON-LD errata.

It is therefore better to always use strings rather than native numbers. The JSON-LD context (see previous section) attaches appropriate datatypes.
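The precision problem can be reproduced with the Python standard library. The sketch below rounds 0.123 to a 32-bit float and widens it back to a double, then contrasts three ways of reading the same JSON value. The helper name as_float32 is hypothetical. The widened value is consistent with the extra digits that appear in the Virtuoso rows of the tool comparison in this section, although whether Virtuoso performs exactly this conversion is an assumption.

```python
import json
import struct
from decimal import Decimal


def as_float32(x: float) -> float:
    """Round x to the nearest 32-bit float, then widen it back to a double."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


# xsd:float is 32-bit; widening to a 64-bit double exposes extra digits.
widened = as_float32(0.123)

# Three ways to read the same value from JSON:
native = json.loads('{"reactance": 0.123}')["reactance"]  # binary double
exact = json.loads('{"reactance": 0.123}', parse_float=Decimal)["reactance"]
as_string = json.loads('{"reactance": "0.123"}')["reactance"]  # lexical form kept

print(widened, native, exact, as_string)
```

Only the string form is guaranteed to survive every JSON parser unchanged; parse_float=Decimal is specific to Python and is not available to other consumers of the data.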

To test the output of CIM numbers and booleans, we made test/test.rq that constructs a few triples:

PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX cim: <https://cim.ucaiug.io/ns#>
construct {
  [] cim:reactance "0.123"^^xsd:float; cim:normallyInService true
} where {}

The respective Turtle is test.ttl.

Then we tried with a few tools and saved the results:

  • test-GraphDB.jsonld: GraphDB 10.7.3, save query result as JSON-LD, no context
  • test-Jena-riot.jsonld:
    • Install from Apache Jena Commands
    • Then run: riot --formatted jsonld test.ttl > test-Jena-riot.jsonld
  • test-ttl2jsonld.jsonld, no context:
    • Install with npm install -g @frogcat/ttl2jsonld
    • Then run ttl2jsonld test.ttl > test-ttl2jsonld.jsonld
  • test-Virtuoso-context.jsonld: DBpedia SPARQL endpoint, save query result as JSON-LD with context
  • test-Virtuoso-plain.jsonld: DBpedia SPARQL endpoint, save query result as JSON-LD plain

| tool | reactance | normallyInService |
|---|---|---|
| GraphDB | "0.123" xsd:float | "true" xsd:boolean |
| Jena riot | "0.123" xsd:float | "true" xsd:boolean |
| ttl2jsonld | "0.123" xsd:float | true |
| Virtuoso context | 0.1230000033974648 | true |
| Virtuoso plain | 0.1230000033974648 | true |

  • GraphDB and Jena output @value in quotes and always attach a datatype
  • Virtuoso outputs only @value without quotes (and adds some fake decimal digits due to internal conversions)
  • ttl2jsonld outputs the number as @value in quotes with datatype, but the boolean without quotes

Note1: above we didn't specify a context to use. If we do, then more tools may output values in quotes.

Note2: see digitalbazaar/jsonld.js#558 for a similar problem related to native boolean in JSON-LD.
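If a tool emits native JSON numbers or booleans, its output can be post-processed into the string-plus-datatype form. The following Python sketch shows one possible way and is not part of the repository's tooling; the function names lexical and stringify_natives are hypothetical. The default datatypes (xsd:boolean, xsd:integer, xsd:double) follow the JSON-LD rules for native values. Note that this cannot restore digits that were already distorted, as in the Virtuoso output.

```python
XSD = "http://www.w3.org/2001/XMLSchema#"


def lexical(value):
    """Return (lexical form, default datatype) for a native JSON value, else None."""
    if isinstance(value, bool):  # must come before int: bool is a subclass of int
        return ("true" if value else "false", XSD + "boolean")
    if isinstance(value, int):
        return (str(value), XSD + "integer")
    if isinstance(value, float):
        return (repr(value), XSD + "double")
    return None


def stringify_natives(node):
    """Recursively replace native numbers and booleans with typed string values."""
    if isinstance(node, list):
        return [stringify_natives(item) for item in node]
    if isinstance(node, dict):
        if "@value" in node:
            lex = lexical(node["@value"])
            if lex is None:
                return dict(node)  # already a string value
            text, default_type = lex
            return {**node, "@value": text,
                    "@type": node.get("@type", default_type)}
        # Leave the context untouched, e.g. "@version": 1.1 must stay a number.
        return {key: (value if key == "@context" else stringify_natives(value))
                for key, value in node.items()}
    lex = lexical(node)
    return node if lex is None else {"@value": lex[0], "@type": lex[1]}


sample_doc = {
    "@context": {"@version": 1.1, "cim": "https://cim.ucaiug.io/ns#"},
    "cim:normallyInService": True,
    "cim:reactance": {"@value": "0.123", "@type": XSD + "float"},
}
normalized_doc = stringify_natives(sample_doc)
```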