This document describes proposed improvements to the representation of CIM/CGMES instance data.
Table of Contents
- Improvements to CIM and CGMES RDF Representation
- Represent Models as Named Graphs
- Instance Data Fixes
- Sample Instance Data
- JSON-LD Serialization
- instances: Sample Instance Data as Trig, from Nordic44, ENTSO-E and "multiplied" (large), see Multipled Data
- test: xml, trig and jsonld test instance files (8 of each kind)
- trials: various trial files
- cim-context-new.txt: prefix file for JSON-LD context using new namespaces
- cim-context-old.txt: prefix file for JSON-LD context using old namespaces
- cim-context-common.txt: common file for JSON-LD context
- cim-context-new.jsonld: JSON-LD context using new namespaces
- cim-context-old.jsonld: JSON-LD context using old namespaces
- cim-context-strings.txt: properties with `"@type": "xsd:string"`. Not added to context since that is the default datatype
- cim-trig.pl: converts CIM XML (Full or Difference models) to Trig: see Custom CIM XML Parser
- count.pl: script to clean up files produced by `riot --count`
- count-ENTSOE.txt: count of triples in ENTSO-E instance files as produced by `riot --count`
- count-ENTSOE1.txt: pure count of triples in ENTSO-E instance files
- count-Nordic.txt: count of triples in Nordic44 instance files as produced by `riot --count`
- count-Nordic1.txt: pure count of triples in Nordic44 instance files
- fix-datatypes-new.ru: SPARQL Update to add datatypes to instance files using new namespaces
- fix-datatypes-old.ru: SPARQL Update to add datatypes to instance files using old namespaces
- fix-datatypes-both.ru: SPARQL Update to add datatypes to instance files using either new or old namespaces
- props-same-name-different-characteristics.csv: properties with same name (last part of URL) but different characteristics
- props-same-name-different-range.csv: properties with same name (last part of URL) but different range (the most important characteristic)
- README.md: this file
This folder uses `make` to automate various tasks
and ensure that dependencies are tracked and files are remade when needed.
The Makefile defines the following targets (printed when `make` is invoked without a target):
- context: JSON-LD context for new and old namespaces
- dirs: all subdirs in `instances`
- test: test instance files in trig
- jsonld: test instance files in jsonld
- nordic: Nordic44 instance files in trig
- entsoe: ENTSO-E instance files in trig
- multiplied: "multiplied" instance files in trig
- rm-test: remove "test/trig" instance files
- rm-jsonld: remove "test/jsonld" instance files
- rm-nordic: remove Nordic44 trig instance files
- rm-entsoe: remove ENTSO-E trig instance files
- rm-multiplied: remove "multiplied" trig instance files
- clean: remove files of size zero
The `make` manual is very comprehensive, but dense and hard to understand.
So if you are not familiar with `make`, it can be quite a challenge to understand and maintain the Makefile.
In the following subsections we explain a few of the trickier aspects.
Let's first look at variable assignments. Consider the most complicated group:
nordic_source = ../../../Nordic44/Instances
States where the source of the Nordic44 instance files is, relative to the current folder.
nordic_dirs != /usr/bin/find $(nordic_source) -type d
Finds all directories (subfolders).
Unlike normal assignment, `!=` invokes the shell with an external command.
I've given the full name `/usr/bin/find` to avoid confusion with the DOS `find` program (an abomination).
nordic_target = instances/Nordic44
States where the target instance files will go (upon conversion from xml to trig).
nordic_target_dirs = $(subst $(nordic_source),$(nordic_target),$(nordic_dirs))
Computes the target subfolders.
`$(nordic_dirs)` is interpreted as a space-separated array,
and for each subfolder, the source prefix is substituted with the target prefix.
nordic_ignore = CDPSM_2_0/Nordic44_CPSM_01_MF.xml CDPSM_2_0/Nordic44_03_inc.xml CGMES_2_4/Nordic44_CGM_36f_MF.xml CGMES_2_4/Nordic44_CGM_38_CO.xml
Declares that some Nordic files will be ignored (not converted) for various reasons.
nordic_ignore2 = $(patsubst %, $(nordic_source)/%, $(nordic_ignore))
Expands the ignored files to include the source folder prepended.
nordic_rdf = $(filter-out $(nordic_ignore2), $(wildcard $(nordic_source)/*_2_*/*.xml))
Finds all relevant source (rdf xml) files by using `$(wildcard)` (glob pattern).
The pattern `*_2_*` uses only folders `CDPSM_2_0, CGMES_2_4`
but ignores the folder `CGMES_3_0` (since that has only some draft files in `ttl, nt, geojson`).
`$(filter-out)` further excludes the `$(nordic_ignore2)` files.
nordic_trig = $(subst .xml,.trig, $(subst $(nordic_source),$(nordic_target), $(nordic_rdf)))
Computes the target (trig) filenames.
`$(nordic_rdf)` is treated as a space-delimited array of filenames, and for each one
we replace the source folder with the target folder,
and the source extension `.xml` with the target extension `.trig`.
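The chain of variable assignments above can be mirrored in a few lines of Python, which may help readers who are more at home with list operations than with `make` text functions. This is only an illustrative sketch: the helper names (`subst`, `patsubst_prefix`, `filter_out`) are made up here, the `found` list stands in for whatever `$(wildcard)` would return, and pattern matching with `%` is not emulated.

```python
def subst(old: str, new: str, words: list[str]) -> list[str]:
    """Rough analogue of $(subst old,new,text), applied word by word."""
    return [w.replace(old, new) for w in words]

def patsubst_prefix(prefix: str, words: list[str]) -> list[str]:
    """Rough analogue of $(patsubst %, prefix/%, words): prepend a folder."""
    return [f"{prefix}/{w}" for w in words]

def filter_out(excluded: list[str], words: list[str]) -> list[str]:
    """Rough analogue of $(filter-out excluded, words) for exact words only."""
    drop = set(excluded)
    return [w for w in words if w not in drop]

nordic_source = "../../../Nordic44/Instances"
nordic_target = "instances/Nordic44"
nordic_ignore = ["CGMES_2_4/Nordic44_CGM_38_CO.xml"]

# Hypothetical stand-in for $(wildcard $(nordic_source)/*_2_*/*.xml)
found = [
    f"{nordic_source}/CGMES_2_4/Nordic44_CGM_36d_SSH.xml",
    f"{nordic_source}/CGMES_2_4/Nordic44_CGM_38_CO.xml",
]

nordic_ignore2 = patsubst_prefix(nordic_source, nordic_ignore)
nordic_rdf = filter_out(nordic_ignore2, found)
nordic_trig = subst(".xml", ".trig", subst(nordic_source, nordic_target, nordic_rdf))
print(nordic_trig)
```

Each step takes a list of words and returns a new list, which is also how `make` treats its space-separated values.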
The first target in the file (conventionally called `all`) is executed if you run `make` without arguments:
all:
@echo targets: context, dirs, test, jsonld, nordic, entsoe, multiplied, rm-test, rm-jsonld, rm-nordic, rm-entsoe, rm-multiplied, clean
- It just prints the targets defined in the Makefile.
- The prefix `@` prevents `make` from printing the command line itself
This is also the place to print out any variable you're unsure about, for debugging purposes.
Eg to print `$(nordic_trig)`, add this:
@echo $(nordic_trig)
The `instances` folder has 49 folders going to 4 levels deep.
If `make` tries to create a file in a non-existing folder, it will fail.
So we want to automate the creation of all these folders.
We've already computed the nested subfolders `$(nordic_target_dirs) $(entsoe_target_dirs)`,
so we just call `mkdir` on the 4 root folders, plus the nested subfolders:
dirs:
-mkdir instances $(nordic_target) $(entsoe_target) $(multiplied_target) $(nordic_target_dirs) $(entsoe_target_dirs)
The `-` sign tells it to proceed even if some of these folders already exist
(`mkdir` returns an error in such a case, but `make` ignores the error).
There is one more thing to do.
Git ignores empty folders on commit, so we need to make an empty file in each folder.
Such files are conventionally called `.gitkeep` (see What is .gitkeep):
touch $(patsubst %, %/.gitkeep, $(multiplied_target) $(nordic_target_dirs) $(entsoe_target_dirs))
`touch` is a convenient command to use here:
it updates the timestamp of files to the current time, and makes empty files if needed.
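For readers who prefer to see the `dirs` recipe outside of `make`, here is a minimal Python sketch of the same idea: create every folder, tolerate folders that already exist, and drop an empty `.gitkeep` into each. The function name and the demo folder list are illustrative assumptions, and unlike the plain `mkdir` in the Makefile this sketch also creates missing parent folders.

```python
import tempfile
from pathlib import Path

def make_dirs_with_gitkeep(dirs: list[str], root: str = ".") -> list[Path]:
    """Create each folder under root and touch an empty .gitkeep inside it."""
    keep_files = []
    for d in dirs:
        folder = Path(root) / d
        # exist_ok plays the role of the "-" prefix: existing folders are not an error
        folder.mkdir(parents=True, exist_ok=True)
        keep = folder / ".gitkeep"
        keep.touch()  # creates an empty file, or just updates the timestamp
        keep_files.append(keep)
    return keep_files

# Demo in a throwaway folder; the subfolder names are examples only
with tempfile.TemporaryDirectory() as tmp:
    keeps = make_dirs_with_gitkeep(
        ["instances/Nordic44/CGMES_2_4", "instances/multiplied"], root=tmp
    )
    all_created = all(k.is_file() for k in keeps)
    # Running it a second time must not fail
    make_dirs_with_gitkeep(["instances/multiplied"], root=tmp)
print(all_created)
```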
Why would we even need these empty folders? Because we don't want to:
- Commit such a large number of large files (see Final Instance Data) to git.
- Transfer a large number of files to a semantic database for loading. It's better to transfer just 3 zips.
So we use `zip` to zip the instance files and move them out of the way:
zips:
zip -r -m $(patsubst instances/%, $(zip)/%.zip, $(nordic_target)) $(nordic_target) -x "*/.gitkeep"
zip -r -m $(patsubst instances/%, $(zip)/%.zip, $(entsoe_target)) $(entsoe_target) -x "*/.gitkeep"
zip -r -m $(patsubst instances/%, $(zip)/%.zip, $(multiplied_target)) $(multiplied_target) -x "*/.gitkeep"
- Option `-m` moves the files to the zip
- Option `-x` excludes the `.gitkeep` files
- The above is a bit dumb since it always considers all files and copy-pastes the same command three times, but it's ok for a starter
- 3lbits/CIM4NoUtility#321 Converting CIMXML DifferenceModel to CIMJSON-LD
- #22 md:Statement is problematically defined
- #86 no connection of instance triples to Model
If you convert a CIM XML model (eg `Nordic44_CGM_36d_SSH.xml`) to Turtle, you get something like this:
<urn:uuid:1d8b61bc-c7f3-4e9e-a3bd-f4ec24beb586>
rdf:type md:FullModel ;
md:Model.DependentOn <urn:uuid:2dd9014f-bdfb-11e5-94fa-c8f73332c8f4> ;
md:Model.created "2017-11-24T09:03:09.9446768Z" ;
md:Model.description "CGM Test model developed by Statnett SF. Nordic 44 bus system for the Nordic region" ;
md:Model.modelingAuthoritySet "http://www.Statnett.no/IGM/Nordic44_CGM" ;
md:Model.profile "http://entsoe.eu/CIM/SteadyStateHypothesis/1/1" , "http://entsoe.eu/CIM/SteadyStateHypothesis/1/2" ;
md:Model.scenarioTime "2015-03-06T01:30:00.0000000Z" ;
md:Model.version "36" ;
pti:Model.createdBy "Statnett SF" .
<file:///d:/Onto/proj/electrical/Nordic44/Instances/CGMES_2_4/Nordic44_CGM_36d_SSH.xml#_e2f56599-a78e-494f-8db3-c0b0bdab1d70>
rdf:type cim:Terminal ;
cim:ACDCTerminal.connected "true" .
The problem is that there's no relation between the model and CIM triples whatsoever. The fact that they appear in the same file doesn't matter at all when it comes to the RDF representation. (Just because some triples appear in a file, does not link the triples to the model URI in that file). If you load this file to a semantic repository, these triples will be mixed with millions of other triples, losing all connection to the model.
Statement sets are modeled in the ontology by using the RDF Reification ontology:
`rdf:Statement` (sometimes misspelled `rdf:Statements`),
with props `rdf:subject, rdf:predicate, rdf:object`
(sometimes misspelled `rdf:Statement.subject, rdf:Statement.predicate, rdf:Statement.object`).
But Reification is a very inefficient way to capture statements.
So in instance data, CIM doesn't actually use that construct.
It was agreed that each model will be represented as a Named Graph that contains the model metadata and triples (thus they become quads). The model URN is also used as graph URN (name). We can express this in TriG (Turtle with Graphs) as follows, where we also:
- Use the `rdfg:Graph` class to emphasize that the model is a named graph
- Fix the relative instance URL to an absolute URL
PREFIX rdfg: <http://www.w3.org/2004/03/trix/rdfg-1/>
<urn:uuid:1d8b61bc-c7f3-4e9e-a3bd-f4ec24beb586> {
<urn:uuid:1d8b61bc-c7f3-4e9e-a3bd-f4ec24beb586>
rdf:type md:FullModel, rdfg:Graph ;
md:Model.DependentOn <urn:uuid:2dd9014f-bdfb-11e5-94fa-c8f73332c8f4> ;
md:Model.created "2017-11-24T09:03:09.9446768Z" ;
md:Model.description "CGM Test model developed by Statnett SF. Nordic 44 bus system for the Nordic region" ;
md:Model.modelingAuthoritySet "http://www.Statnett.no/IGM/Nordic44_CGM" ;
md:Model.profile "http://entsoe.eu/CIM/SteadyStateHypothesis/1/1" , "http://entsoe.eu/CIM/SteadyStateHypothesis/1/2" ;
md:Model.scenarioTime "2015-03-06T01:30:00.0000000Z" ;
md:Model.version "36".
<http://www.Statnett.no/IGM/Nordic44_CGM/_e2f56599-a78e-494f-8db3-c0b0bdab1d70>
rdf:type cim:Terminal ;
cim:ACDCTerminal.connected "true" .
}
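To make the triple-versus-quad distinction concrete, the following sketch treats a store as a plain list of 4-tuples. It uses no RDF library; `Quad`, `to_quads` and `statements_of_model` are names invented for this illustration, and the second model URN is a hypothetical placeholder.

```python
from typing import NamedTuple

class Quad(NamedTuple):
    subject: str
    predicate: str
    obj: str
    graph: str  # the model URN doubles as the graph name

def to_quads(model_urn: str, triples: list[tuple[str, str, str]]) -> list[Quad]:
    """Attach the model URN to every triple of that model."""
    return [Quad(s, p, o, model_urn) for (s, p, o) in triples]

def statements_of_model(store: list[Quad], model_urn: str) -> list[Quad]:
    """Recover exactly the statements that belong to one model."""
    return [q for q in store if q.graph == model_urn]

model = "urn:uuid:1d8b61bc-c7f3-4e9e-a3bd-f4ec24beb586"
other_model = "urn:uuid:00000000-0000-4000-8000-000000000000"  # hypothetical
terminal = "http://www.Statnett.no/IGM/Nordic44_CGM/_e2f56599-a78e-494f-8db3-c0b0bdab1d70"

store = to_quads(model, [
    (model, "rdf:type", "md:FullModel"),
    (terminal, "rdf:type", "cim:Terminal"),
    (terminal, "cim:ACDCTerminal.connected", "true"),
]) + to_quads(other_model, [
    (terminal, "cim:ACDCTerminal.connected", "false"),
])

print(len(statements_of_model(store, model)))
```

Without the fourth component, the two `cim:ACDCTerminal.connected` statements would be indistinguishable by origin once loaded into the same repository.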
The problem is especially acute for difference models.
CGMES-TC/FullGrid_SC_diff.xml is an example of such a model.
CIM XML uses its own dialect of RDF/XML with `rdf:parseType="Statements"`.
This non-standard addition is only supported in CIM-specific tools
and is a major impediment to the use of standard semantic web processing tools.
(Eg if you use Jena, the `parseType="Statements"` payload is captured as a string, not as triples).
CIM Difference Models are important because they allow recording a delta against a base model, thus enabling "What If" analysis and other important scenarios.
In particular, a Difference Model is associated with 4 named graphs:
- Model metadata in the model graph
  - Refers to the base model using `md:Model.Supersedes`
- Checks for the presence of certain statements using `dm:preconditions` (but this is not used in CIM)
- Specifies statements to delete using `dm:reverseDifferences`
- Specifies statements to insert using `dm:forwardDifferences`
Note: in this and the next subsection we use illustrative graph names (eg `base-model, reverse, forward`) but these are not valid `urn:uuid` URNs.
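The delete-then-insert semantics of the reverse and forward graphs can be sketched with plain Python sets of statements. This is an illustration under simplifying assumptions: statements are string triples, preconditions are ignored, and the subject names (`line1`, `switch1`) are hypothetical.

```python
Statement = tuple[str, str, str]

def apply_difference(base: set[Statement],
                     reverse: set[Statement],
                     forward: set[Statement]) -> set[Statement]:
    """Remove the reverseDifferences from base, then add the forwardDifferences."""
    return (base - reverse) | forward

base = {
    ("line1", "cim:Conductor.length", "50"),
    ("switch1", "cim:IdentifiedObject.name", "Switch1"),
    ("switch1", "rdf:type", "cim:Switch"),
}
reverse = {
    ("line1", "cim:Conductor.length", "50"),
    ("switch1", "cim:IdentifiedObject.name", "Switch1"),
}
forward = {
    ("line1", "cim:Conductor.length", "55"),
    ("switch1", "cim:IdentifiedObject.name", "Switch2"),
}

updated = apply_difference(base, reverse, forward)
print(sorted(updated))
```

The base set itself is left untouched, which is what makes "What If" comparisons between the base and the updated model possible.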
RDF/XML cannot carry named graphs, but JSON-LD and Trig (Turtle with graphs) can.
- 3lbits/CIM4NoUtility#321 Converting CIMXML DifferenceModel to CIMJSON-LD makes a couple of naive attempts to represent a `DifferenceModel` using the nesting structure of JSON-LD.
See the trials folder for some attempts.
For example, `option2.jsonld` looks like this:
{
"@graph": [
{
"@id": "urn:uuid:difference-model1",
"@type": "dm:DifferenceModel",
"dm:reverseDifferences": [
{
"@id": "urn:uuid:9d58e5bb-834c-4faa-928c-7da0bb1497d9",
"@type": "cim:ACLineSegment",
"cim:Conductor.length": {"cim:Length.value": 50.0}
},
{
"@id": "urn:uuid:9d58e5bb-834c-4faa-928c-7da0bb1497d5",
"@type": "cim:Switch",
"cim:IdentifiedObject.Name": "Switch1"
}
]
},
{
"@id": "urn:uuid:difference-model2",
"@type": "dm:DifferenceModel",
"dm:forwardDifferences": [
{
"@id": "urn:uuid:9d58e5bb-834c-4faa-928c-7da0bb1497d9",
"@type": "cim:ACLineSegment",
"cim:Conductor.length": {"cim:Length.value": 55.0}
},
{
"@id": "urn:uuid:9d58e5bb-834c-4faa-928c-7da0bb1497d5",
"@type": "cim:Switch",
"cim:IdentifiedObject.Name": "Switch2"
}
]
},
{
"@id": "urn:uuid:difference-model3",
"@type": "dm:DifferenceModel",
"dm:reverseDifferences": [
{
"@id": "urn:uuid:9d58e5bb-834c-4faa-928c-7da0bb1497d9",
"@type": "cim:ACLineSegment",
"cim:Conductor.length": {"cim:Length.value": 60.0}
}
]
},
{
"@id": "urn:uuid:difference-model4",
"@type": "dm:DifferenceModel",
"dm:forwardDifferences": [
{
"@id": "urn:uuid:9d58e5bb-834c-4faa-928c-7da0bb1497d9",
"@type": "cim:ACLineSegment",
"cim:Conductor.length": {"cim:Length.value": 65.0}
}
]
}
]
}
But if we convert this to Trig using Jena RIOT:
riot --formatted=trig option2.jsonld > option2.trig
We see a mixup:
- Two of the reverse differences are mixed together at `model1`
- Two of the forward differences are mixed together at `model2`
- The statements `Conductor.length` are mixed together
<urn:uuid:difference-model1>
rdf:type dm:DifferenceModel ;
dm:reverseDifferences <urn:uuid:9d58e5bb-834c-4faa-928c-7da0bb1497d5> , <urn:uuid:9d58e5bb-834c-4faa-928c-7da0bb1497d9> .
<urn:uuid:difference-model2>
rdf:type dm:DifferenceModel ;
dm:forwardDifferences <urn:uuid:9d58e5bb-834c-4faa-928c-7da0bb1497d5> , <urn:uuid:9d58e5bb-834c-4faa-928c-7da0bb1497d9> .
<urn:uuid:difference-model3>
rdf:type dm:DifferenceModel ;
dm:reverseDifferences <urn:uuid:9d58e5bb-834c-4faa-928c-7da0bb1497d9> .
<urn:uuid:difference-model4>
rdf:type dm:DifferenceModel ;
dm:forwardDifferences <urn:uuid:9d58e5bb-834c-4faa-928c-7da0bb1497d9> .
<urn:uuid:9d58e5bb-834c-4faa-928c-7da0bb1497d5>
rdf:type cim:Switch ;
cim:IdentifiedObject.Name "Switch2" , "Switch1" .
<urn:uuid:9d58e5bb-834c-4faa-928c-7da0bb1497d9>
rdf:type cim:ACLineSegment ;
cim:Conductor.length [ cim:Length.value 65 ] ;
cim:Conductor.length [ cim:Length.value 60 ] ;
cim:Conductor.length [ cim:Length.value 55 ] ;
cim:Conductor.length [ cim:Length.value 50 ] .
We can correct the representation by adding graph names (URNs).
Let's start with Trig (`option3.trig`).
<urn:uuid:base-model> a dm:Model.
<urn:uuid:base-model> {
<urn:uuid:9d58e5bb-834c-4faa-928c-7da0bb1497d5>
rdf:type cim:Switch ;
cim:IdentifiedObject.Name "Switch1".
<urn:uuid:9d58e5bb-834c-4faa-928c-7da0bb1497d9>
rdf:type cim:ACLineSegment ;
cim:Conductor.length [ cim:Length.value 50 ] .
}
<urn:uuid:difference-model1> a dm:DifferenceModel ;
md:Model.Supersedes <urn:uuid:base-model>;
dm:forwardDifferences <urn:uuid:difference-model1-forward>;
dm:reverseDifferences <urn:uuid:difference-model1-reverse>.
<urn:uuid:difference-model1-reverse> {
<urn:uuid:9d58e5bb-834c-4faa-928c-7da0bb1497d9> cim:Conductor.length [ cim:Length.value 50 ] .
<urn:uuid:9d58e5bb-834c-4faa-928c-7da0bb1497d5> cim:IdentifiedObject.Name "Switch1" .
}
<urn:uuid:difference-model1-forward> {
<urn:uuid:9d58e5bb-834c-4faa-928c-7da0bb1497d9> cim:Conductor.length [ cim:Length.value 55 ] .
<urn:uuid:9d58e5bb-834c-4faa-928c-7da0bb1497d5> cim:IdentifiedObject.Name "Switch2" .
}
<urn:uuid:difference-model2> a dm:DifferenceModel ;
md:Model.Supersedes <urn:uuid:difference-model1>;
dm:reverseDifferences <urn:uuid:difference-model2-reverse>;
dm:forwardDifferences <urn:uuid:difference-model2-forward>.
<urn:uuid:difference-model2-reverse> {
<urn:uuid:9d58e5bb-834c-4faa-928c-7da0bb1497d9> cim:Conductor.length [ cim:Length.value 60 ]
}
<urn:uuid:difference-model2-forward> {
<urn:uuid:9d58e5bb-834c-4faa-928c-7da0bb1497d9> cim:Conductor.length [ cim:Length.value 65 ]
}
Let's convert this to JSON-LD.
- The crucial difference is that the `@graph` elements now have `@id`
- There are also two levels of `@graph`: an outer envelope that carries all quads, and inner named graphs
{
"@graph": [
{
"@id": "urn:uuid:base-model",
"@type": "dm:Model",
"@graph": [
{
"@id": "urn:uuid:9d58e5bb-834c-4faa-928c-7da0bb1497d9",
"cim:Conductor.length": {"@id": "_:b4"},
"@type": "cim:ACLineSegment"
},
{
"@id": "_:b4",
"cim:Length.value": {
"@value": "50",
"@type": "xsd:integer"
}
},
{
"@id": "urn:uuid:9d58e5bb-834c-4faa-928c-7da0bb1497d5",
"cim:IdentifiedObject.Name": "Switch1",
"@type": "cim:Switch"
}
]
},
{
"@id": "urn:uuid:difference-model1",
"dm:reverseDifferences": {"@id": "urn:uuid:difference-model1-reverse"},
"dm:forwardDifferences": {"@id": "urn:uuid:difference-model1-forward"},
"md:Model.Supersedes": {"@id": "urn:uuid:base-model"},
"@type": "dm:DifferenceModel"
},
{
"@id": "urn:uuid:difference-model1-reverse",
"@graph": [
{
"@id": "_:b0",
"cim:Length.value": {
"@value": "50",
"@type": "xsd:integer"
}
},
{
"@id": "urn:uuid:9d58e5bb-834c-4faa-928c-7da0bb1497d9",
"cim:Conductor.length": {
"@id": "_:b0"
}
},
{
"@id": "urn:uuid:9d58e5bb-834c-4faa-928c-7da0bb1497d5",
"cim:IdentifiedObject.Name": "Switch1"
}
]
},
{
"@id": "urn:uuid:difference-model1-forward",
"@graph": [
{
"@id": "urn:uuid:9d58e5bb-834c-4faa-928c-7da0bb1497d9",
"cim:Conductor.length": {
"@id": "_:b3"
}
},
{
"@id": "_:b3",
"cim:Length.value": {
"@value": "55",
"@type": "xsd:integer"
}
},
{
"@id": "urn:uuid:9d58e5bb-834c-4faa-928c-7da0bb1497d5",
"cim:IdentifiedObject.Name": "Switch2"
}
]
},
{
"@id": "urn:uuid:difference-model2",
"dm:forwardDifferences": {"@id": "urn:uuid:difference-model2-forward"},
"dm:reverseDifferences": {"@id": "urn:uuid:difference-model2-reverse"},
"md:Model.Supersedes": {"@id": "urn:uuid:difference-model1"},
"@type": "dm:DifferenceModel"
},
{
"@id": "urn:uuid:difference-model2-reverse",
"@graph": [
{
"@id": "_:b1",
"cim:Length.value": {
"@value": "60",
"@type": "xsd:integer"
}
},
{
"@id": "urn:uuid:9d58e5bb-834c-4faa-928c-7da0bb1497d9",
"cim:Conductor.length": {
"@id": "_:b1"
}
}
]
},
{
"@id": "urn:uuid:difference-model2-forward",
"@graph": [
{
"@id": "urn:uuid:9d58e5bb-834c-4faa-928c-7da0bb1497d9",
"cim:Conductor.length": {
"@id": "_:b2"
}
},
{
"@id": "_:b2",
"cim:Length.value": {
"@value": "65",
"@type": "xsd:integer"
}
}
]
}
]
}
Note: We'll see later how, by using a richer `@context`, we can reduce the expanded representation:
"cim:Length.value": {
"@value": "50",
"@type": "xsd:integer"
}
To the compact and natural representation:
"cim:Length.value": "50"
But there are still some problems:
- URNs like `urn:uuid:difference-model1-forward` are not valid URNs under the `urn:uuid:` scheme, so we must generate new UUIDs for the `reverse` and `forward` graphs.
- There are blank nodes, represented in Trig as `cim:Conductor.length [cim:Length.value 60]` and in JSON-LD as `_:b4` etc. This is a problem, since we cannot delete a blank node by specifying another blank node in the `reverse` graph. Any two blank nodes are different, unless they come from the same file and have the same blank node name. So it is good that actual CIM instance data has the simpler representation `cim:Conductor.length "60"`, and we fixed the CIM ontologies to use the simpler representation (#38)
#94 make custom CIM XML parser
We need to implement a custom CIM XML parser that handles `parseType="Statements"` and emits named graphs.
cim-trig.pl is a Perl script that converts a CIM XML file to Trig (Turtle with graphs). It uses simple string manipulation rather than an XML parser, so it relies on a repeatable CIM XML layout as lines:
- A file has exactly one model: `md:FullModel` or `dm:DifferenceModel`
- `dm:DifferenceModel` has exactly two sections, `dm:reverseDifferences` and `dm:forwardDifferences`, in this order, even if one of them is empty
It uses command-line tools to do the bulk of the work (see `sub ttl`):
- For prettier formatting, it runs `owl-cli` by @atextor (the Windows version of a batch file) as described at atextor Tools: owl-cli and turtle-formatter:
owl.bat write --keepUnusedPrefixes -i rdfxml ...rdf ...ttl
- For very large files, give option `-r` to use Jena Riot in streaming mode:
riot.bat --syntax=rdfxml --stream=ttl ...rdf > ...ttl
For a `dm:DifferenceModel` it invokes the command-line tool 3 times:
- To convert the model statements
- To convert the `dm:reverseDifferences` statements
- To convert the `dm:forwardDifferences` statements
It generates new `urn:uuid` URIs for the reverse and forward models (using UUID v4),
and adds named graphs to all model parts.
In particular, model metadata is stored in the model graph,
so it can be updated or deleted easily (eg by using the SPARQL Graph Store Protocol).
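Generating a fresh graph name of this kind is a one-liner in most languages. The Python sketch below shows one way to do it; it is not a port of `cim-trig.pl`, and the function name `new_difference_graph_names` is an assumption made for the example.

```python
import uuid

def new_difference_graph_names() -> dict[str, str]:
    """Return fresh urn:uuid names (UUID v4) for the reverse and forward graphs."""
    return {
        "reverseDifferences": uuid.uuid4().urn,
        "forwardDifferences": uuid.uuid4().urn,
    }

names = new_difference_graph_names()
for role, urn in names.items():
    print(role, urn)
```

Because version 4 UUIDs are random, running the conversion twice on the same input gives different graph names for the two difference graphs.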
See test results in test/trig. Let's look at a couple of examples.
<urn:uuid:52a409c9-72d8-4b5f-bf72-9a22ec9353f7> { # model graph
# model metadata
<urn:uuid:52a409c9-72d8-4b5f-bf72-9a22ec9353f7> a md:FullModel ;
md:Model.DependentOn <urn:uuid:0cd6ada4-b6dc-4a36-a98c-877a39168cd3> ;
md:Model.created "2020-12-10T00:21:43Z" ;
# statements
<http://fullgrid.eu/CGMES/3.0#_13dacabf-aa4c-4a78-806e-c7c4c6949718> a cim:Discrete ;
cim:Discrete.ValueAliasSet <http://fullgrid.eu/CGMES/3.0#1a457323-2094-440f-8d30-dc93adf0cdb3> ;
...
}
test/trig/FullGrid_OP_diff.trig:
<urn:uuid:05edbf91-231f-4386-97c0-d4cb498d0afc> { # model graph
# model metadata
<urn:uuid:05edbf91-231f-4386-97c0-d4cb498d0afc> a dm:DifferenceModel ;
dm:forwardDifferences <urn:uri:63528ef9-48ff-469b-a58e-ba274f2a10bb> ;
dm:reverseDifferences <urn:uri:27c8a164-c656-4712-994a-0ab7cec4fd34> ;
md:Model.DependentOn <urn:uuid:0cd6ada4-b6dc-4a36-a98c-877a39168cd3> ;
md:Model.Supersedes <urn:uuid:52a409c9-72d8-4b5f-bf72-9a22ec9353f7> ; # base model
md:Model.created "2021-11-19T23:16:27Z" ;
}
<urn:uri:27c8a164-c656-4712-994a-0ab7cec4fd34> { # reverseDifferences
<http://fullgrid.eu/CGMES/3.0#87478acb-cd1f-40a6-b4a7-59ec99f8b063> cim:IdentifiedObject.description "SET_PNT_1" .
<http://fullgrid.eu/CGMES/3.0#fc908c16-468f-4a64-ba74-6f57175e0005> cim:AnalogLimit.value "99" .
}
<urn:uri:63528ef9-48ff-469b-a58e-ba274f2a10bb> { # forwardDifferences
<http://fullgrid.eu/CGMES/3.0#87478acb-cd1f-40a6-b4a7-59ec99f8b063> cim:IdentifiedObject.description "SET_PNT_1 test" .
<http://fullgrid.eu/CGMES/3.0#fc908c16-468f-4a64-ba74-6f57175e0005> cim:AnalogLimit.value "100" .
}
The URLs of CIM power system resources are represented in CIM XML like this:
- definition: `rdf:ID="_f37786d0-b118-4b92-bafb-326eac2a3877"`
- or `rdf:about="#_f37786d0-b118-4b92-bafb-326eac2a3877"`
- reference: `rdf:resource="#_44e63d79-6b05-4c64-b490-d181863af7da"`
They have two problems:
These are relative URLs.
- However, CIM XML files don't specify `xml:base` (see RDF 1.1 XML Syntax, section 2.14 Abbreviating URIs: rdf:ID and xml:base).
- This means the URLs are resolved in a tool-dependent way (e.g. by using the file location on local disk).
- This is a serious problem that undermines the stability of resource URLs.
- We've resolved it by declaring `md:Model.modelingAuthoritySet` as BASE.
- This is fixed by the `cim-trig.pl` script described above: see URL examples in the previous section.

They start with a parasitic `_`.
- The reason is that `rdf:ID` cannot start with a digit, see
  - RDF 1.1 XML Syntax, section C.1 RELAX NG Compact Schema, `IDsymbol`
  - XML Schema Definition Language (XSD) 1.1 Part 2: Datatypes, section 3.4.4 NMTOKEN
  - Extensible Markup Language (XML) 1.1 (Second Edition) section Nmtoken
- `rdf:about` could have been used instead of `rdf:ID` to avoid that limitation.
- This is a purely cosmetic problem and we leave it as is.
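The tool-dependence of relative URL resolution is easy to demonstrate with standard URL resolution rules. The sketch below uses Python's `urljoin`; the local file path is a hypothetical example, and the exact way `cim-trig.pl` combines the base with the local identifier is not reproduced here.

```python
from urllib.parse import urljoin

def resolve(base: str, reference: str) -> str:
    """Resolve a relative reference such as '#_1234' against a base URL."""
    return urljoin(base, reference)

ref = "#_13dacabf-aa4c-4a78-806e-c7c4c6949718"

# Without xml:base a tool may fall back to the file location (hypothetical path)
file_based = resolve("file:///d:/work/FullGrid_OP.xml", ref)

# With an explicitly declared base the result does not depend on where the file sits
stable = resolve("http://fullgrid.eu/CGMES/3.0", ref)

print(file_based)
print(stable)
```

The same identifier ends up as two unrelated URLs, so data converted on two machines would not link up unless the base is fixed explicitly.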
#49 Add Datatypes To Instance Data
In CGMES instance data, all literals are strings, but should be marked with the appropriate datatype.
- E.g. `cim:ACDCConverter.baseS` should be marked `^^xsd:float`
- Otherwise sort won't work properly and range queries will be slower.
- This pertains to `boolean, dateTime, float, gMonthDay, integer`
- `string` is the default datatype
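The sorting problem with untyped literals is the familiar difference between lexicographic and numeric order. A tiny Python illustration, using made-up limit values:

```python
# Values as they would appear when every literal is a plain string
values = ["99", "100", "9.5"]

as_strings = sorted(values)             # lexicographic comparison
as_numbers = sorted(values, key=float)  # numeric comparison, as xsd:float would allow

print(as_strings)
print(as_numbers)
```

With string comparison `"100"` sorts before `"99"`, and a range filter such as "greater than 50" cannot be evaluated numerically without casting every value first.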
Property Datatype Maps and the sibling folder datatypes make a comprehensive analysis.
We extract a datatypes map, omitting hijacked namespaces and `xsd:string`:
grep -E '^(cim|nc|eu|md|eumd)' datatypes-older.tsv | grep -v xsd:string > fix-datatypes.ru
Then we format it as `values` for use in SPARQL.
We make 3 scripts to account for namespace differences:
- fix-datatypes-old.ru works with the old namespaces:
prefix cim: <http://iec.ch/TC57/CIM100#>
prefix eu: <http://iec.ch/TC57/CIM100-European#>
- fix-datatypes-new.ru works with the new namespaces:
prefix cim: <https://cim.ucaiug.io/ns#>
prefix eu: <https://cim.ucaiug.io/ns/eu#>
- fix-datatypes-both.ru works with either namespace.
- Note: the NC spec is new, so its prefix is only available in the new namespaces:
prefix nc: <https://cim4.eu/ns/nc#>
The more complex "both" script works like this:
- Defines dual prefixes `cim, cim1` and `eu, eu1`:
prefix cim: <https://cim.ucaiug.io/ns#>
prefix cim1: <http://iec.ch/TC57/CIM100#>
prefix eu: <https://cim.ucaiug.io/ns/eu#>
prefix eu1: <http://iec.ch/TC57/CIM100-European#>
prefix nc: <https://cim4.eu/ns/nc#>
prefix eumd: <https://cim4.eu/ns/Metadata-European#>
prefix md: <http://iec.ch/TC57/61970-552/ModelDescription/1#>
prefix xsd: <http://www.w3.org/2001/XMLSchema#>
- After Represent Models as Named Graphs, all CIM triples live in named graphs, so:
delete {graph ?g {?x ?p ?old}}
insert {graph ?g {?x ?p ?new}}
- The `where` clause includes a pretty huge mapping table from props to datatypes
  - It finds quads where the `?old` value is `xsd:string`
  - Maps it to the appropriate datatype, considering different namespace versions
where {
values (?prop ?dt) {
(cim:ACDCConverter.baseS xsd:float)
# 3000 more rows
}
graph ?g {?x ?p ?old}
filter(datatype(?old)=xsd:string)
bind(if(strstarts(str(?p),str(cim1:)),uri(concat(str(cim:),strafter(str(?p),str(cim1:)))),?UNDEF) as ?p1)
bind(if(strstarts(str(?p),str(eu1:)), uri(concat(str(eu:), strafter(str(?p),str(eu1:)))), ?UNDEF) as ?p2)
filter(?p=?prop || ?p1=?prop || ?p2=?prop)
bind(strdt(?old,?dt) as ?new)
};
These updates can be applied on:
- One CIM file, using an in-memory SPARQL Update tool like Jena `update.bat` (but it needs inordinate amounts of RAM for large files)
- A whole repository of CIM data, eg using GraphDB
We include 3 versions because applying "both" on old data produces `cim1, eu1` prefixes.
This is harmless, but doesn't look nice.
To work out reasoning, validation and performance issues, we need sample instance data. We can use the following datasets (one of them has minor defects):
- #134 `ENTSO-E_Test_Configurations_v3.0.2` defects
dataset | xml | zip | files | FullModel | triples | largest | largest file |
---|---|---|---|---|---|---|---|
Nordic44 | 2.9M | | 15 | 12 | 35481 | 17420 | CGMES_2_4/Nordic44_CGM_37a_EQ.xml |
ENTSO-E_Test_Configurations_v3.0.2 | 151M | 19M | 357 | 350 | 1844380 | 947208 | RealGrid/RealGrid-Merged/RealGrid_EQ.xml |
Multiplied | 11G | 1.9G | 4 | 4 | | 94720800 | RealGrid_EQ100.zip |
Statnett | 800MB | 30MB | | | | | |
- "FullModel" are files that have a standard `md:FullModel` structure. ENTSOE also has 7 `dm:DifferenceModel`
- See next section for counting triples
- See Multipled Data for "multiplied"
- "Statnett" describes the actual Statnett grid, which is not public data. It's included only for comparison
ENTSO-E files are nested 2-3 levels deep in the folder hierarchy:
cd ENTSO-E_Test_Configurations_v3.0.2/v3.0
find . -name *.xml |perl -pe 's{[\w-]+}{*}g' | sort | uniq -c
47 ./*/*/*.*
310 ./*/*/*/*.*
I want to use `riot.bat --count` to see how many triples there are in total.
But we will exclude `DifferenceModel` files (`*_diff.xml`) because `riot` cannot handle them (they are not standard RDF XML format):
find . -name *.xml ! -name *diff* | wc
350 350 23847
The total length of all filenames is quite large (24k) so it overflows the command line:
riot.bat --count `find . -name *.xml ! -name *diff*`
The command line is too long.
In such a case one uses `xargs`.
Since the environment and the command line together are subject to a size limit,
I tried to remove some wordy env vars (`ORIGINAL_PATH= PSModulePath= INFOPATH=`),
but it's still greater than the limit on my shell (Cygwin Bash):
find . -name *.xml ! -name *diff* | env ORIGINAL_PATH= PSModulePath= INFOPATH= xargs --show-limit riot.bat --count
Your environment variables take up 3940 bytes
POSIX upper limit on argument length (this system): 26012
POSIX smallest allowable upper limit on argument length (all systems): 4096
Maximum length of command we could actually use: 22072
Size of command buffer we are actually using: 26012
Maximum parallelism (--max-procs must be no greater): 2147483647
The command line is too long.
So I have to split the work in several parts: `-n 100` passes 100 files at a time, and `2>` saves STDERR to a file:
find . -name *.xml ! -name *diff* | xargs -n 100 riot.bat --count 2> count-ENTSOE.txt
I wrote a small script to massage this file:
perl count.pl count-ENTSOE.txt > count-ENTSOE1.txt
The total is 1844380 (1.8M triples) and the largest file is
947208 ./RealGrid/RealGrid-Merged/RealGrid_EQ.xml
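The batching that `xargs -n 100` performs can be pictured as slicing the file list into chunks of at most 100 names, each chunk becoming one command invocation. A small Python sketch of just the chunking step follows; it does not call `riot`, and the file names are generated placeholders.

```python
from itertools import islice
from typing import Iterable, Iterator

def batches(items: Iterable[str], size: int) -> Iterator[list[str]]:
    """Yield consecutive chunks of at most `size` items, like xargs -n size."""
    it = iter(items)
    while chunk := list(islice(it, size)):
        yield chunk

# 350 placeholder names, matching the number of non-diff ENTSO-E files
files = [f"./model_{i:03d}.xml" for i in range(350)]
sizes = [len(chunk) for chunk in batches(files, 100)]
print(sizes)
```

With 350 files this gives three full batches and one batch of 50, so the counting tool is started four times and each command line stays well under the length limit.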
Nordic44 files are a lot smaller:
cd Nordic44/Instances
find . -name *.xml | xargs riot.bat --count 2> count-Nordic.txt
perl count.pl count-Nordic.txt > count-Nordic1.txt
The total is 35481 (35k triples) and the largest file is
17420 ./CGMES_2_4/Nordic44_CGM_37a_EQ.xml
- #117 multiply instance data
ENTSO-E plus Nordic44 make only 1.8M triples. This is not very much as semantic databases go, so we decided to multiply it 100 times to obtain bigger examples.
Chavdar Ivanov took 4 files from ENTSO-E_Test_Configurations_v3.0.2 and multiplied the data in 4 variants (10, 20, 50 and 100 times). The results are in this Microsoft Teams Drive.
I got only the largest files: `RealGrid_EQ100.zip, RealGrid_SSH100.zip, RealGrid_SV100.zip, RealGrid_TP100.zip`.
They are 1.9Gb zipped, 11Gb unzipped.
The files use DOS line endings and may have a byte-order mark (BOM).
BOM doesn't play well with `riot`, so we remove the BOM and convert to Unix line endings:
d2u *
(This takes about 15 minutes because the files are large)
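If `d2u` is not at hand, the same cleanup can be done with a short streaming script. The sketch below is an assumption-laden alternative rather than what was actually used: it assumes the files are UTF-8, processes them in chunks so that multi-gigabyte files do not need to fit in memory, and takes care of a `\r\n` pair that happens to be split across two chunks.

```python
import codecs
import tempfile
from pathlib import Path

def strip_bom_and_crlf(src: str, dst: str, chunk_size: int = 1 << 20) -> None:
    """Copy src to dst, dropping a leading UTF-8 BOM and turning CRLF into LF."""
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        head = fin.read(len(codecs.BOM_UTF8))
        if head != codecs.BOM_UTF8:
            fin.seek(0)  # no BOM: start again from the first byte
        pending_cr = False
        while chunk := fin.read(chunk_size):
            if pending_cr:
                chunk = b"\r" + chunk
            # Hold back a trailing CR: its LF may arrive in the next chunk
            pending_cr = chunk.endswith(b"\r")
            if pending_cr:
                chunk = chunk[:-1]
            fout.write(chunk.replace(b"\r\n", b"\n"))
        if pending_cr:
            fout.write(b"\r")  # a lone CR at end of file is kept as is

# Demo with a tiny file and a chunk size of 1 to exercise the split CRLF case
with tempfile.TemporaryDirectory() as tmp:
    src, dst = Path(tmp) / "in.xml", Path(tmp) / "out.xml"
    src.write_bytes(codecs.BOM_UTF8 + b"<a>\r\n</a>\r\n")
    strip_bom_and_crlf(str(src), str(dst), chunk_size=1)
    cleaned = dst.read_bytes()
print(cleaned)
```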
The files also include `xml:base="http://iec.ch/TC57/CIM100"`,
which doesn't match other instance files, contradicts the decision to use `modelingAuthoritySet` as base, and is inappropriate as a base of instance URLs.
So cim-trig removes it.
The largest file is 8Gb and takes 8 min to convert from CIM XML to Trig (with about 20Gb RAM for Java and similar for Perl).
However, the query `fix-datatypes-old.ru` cannot be executed in-memory with the Jena `update` command on my laptop (64Gb RAM).
With default JVM parameters, it throws:
Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "main"
We allow Java to take 60Gb, but that causes swapping and slows down the process:
# cmd:
set JVM_ARGS=-Xmx60000M -Dfile.encoding=UTF-8
update.bat --update=fix-datatypes-old.ru --data=temp1.trig --dump > instances/multiplied/RealGrid_EQ100.trig
# bash:
export JVM_ARGS="-Xmx60000M -Dfile.encoding=UTF-8"
time update.bat --update=fix-datatypes-old.ru --data=temp1.trig --dump > instances/multiplied/RealGrid_EQ100.trig
The process was really busy, taking 60-80% of CPU and lots of RAM. I canceled it after 140 min. So we need to run this update against a database (GraphDB), not against the Jena in-memory store.
`update` runs successfully only for `RealGrid_TP100.trig` (9.6M triples)
The final instance data for testing consists of the following trig files:
Instances | folders | files | trig | zip |
---|---|---|---|---|
Nordic44 | 3 | 12 | 5M | 340k |
ENTSO-E_Test_Configurations_v3.0.2 | 43 | 357 | 179M | 23.8M |
multiplied | 1 | 4 | 9.67G | 2.2G |
TOTAL | 47 | 373 | 9.85G | 2.2G |
- Only ENTSOE has 7 `dm:DifferenceModel`, all others are `md:FullModel`.
- `DifferenceModels` cannot be validated on their own (see `shacl-improved` for a scenario)
The 3 zipped files are available publicly in the Google Folder instance-zipped.
After converting CIM XML to a representation using named graphs (Trig), we can convert it to JSON-LD. E.g. to convert an instance file using the old namespaces, we use this command:
riot.bat --formatted jsonld test/trig/FullGrid_OP.trig | jsonld compact -c https://rawgit2.com/Sveino/Inst4CIM-KG/develop/rdf-improved/cim-context-old.jsonld
The tools used are described in the sibling folder at JSON-LD Serialization.
A good JSON-LD serialization depends on an appropriate context that defines namespaces and property characteristics, i.e. @type (@id for object props, XSD datatype for datatype props).
We want to cater to old and new namespaces, so we use some text (not proper JSON-LD) files to assemble contexts:
- cim-context-common.txt: a common file that defines the common namespaces, and characteristics for about 5100 props
- (cim-context-strings.txt: a "spill-over" file that defines 120 xsd:string properties: not added to context since that is the default datatype)
- cim-context-new.txt: prefix file for JSON-LD context using new namespaces
{"cim": "https://cim.ucaiug.io/ns#",
"eu": "https://cim.ucaiug.io/ns/eu#",
- cim-context-old.txt: prefix file for JSON-LD context using old namespaces
{"cim": "http://iec.ch/TC57/CIM100#",
"eu": "http://iec.ch/TC57/CIM100-European#",
The assembled context files are:
- cim-context-new.jsonld: JSON-LD context using new namespaces
- cim-context-old.jsonld: JSON-LD context using old namespaces
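How the text fragments could combine into a context can be sketched as follows. This is a minimal Python sketch, not the project's actual build step (which is driven by make). It assumes that each prefix file opens a JSON object and that the common file continues and closes it. The fragment contents, the `Terminal.ConductingEquipment` entry, the `@context` wrapper, and the `assemble_context` helper are illustrative assumptions; only the namespace URLs and the two test properties appear elsewhere in this README.

```python
import json

# Hypothetical stand-ins for the prefix files (the real files may differ in layout).
PREFIX_NEW = '{"cim": "https://cim.ucaiug.io/ns#",\n "eu": "https://cim.ucaiug.io/ns/eu#",\n'
PREFIX_OLD = '{"cim": "http://iec.ch/TC57/CIM100#",\n "eu": "http://iec.ch/TC57/CIM100-European#",\n'

# Hypothetical stand-in for cim-context-common.txt: assumed to continue the
# object opened by the prefix file and to close it. The real file covers
# about 5100 properties; three are shown here.
COMMON = (
    ' "xsd": "http://www.w3.org/2001/XMLSchema#",\n'
    ' "cim:reactance": {"@type": "xsd:float"},\n'                # datatype prop
    ' "cim:normallyInService": {"@type": "xsd:boolean"},\n'      # datatype prop
    ' "cim:Terminal.ConductingEquipment": {"@type": "@id"}}\n'   # object prop
)


def assemble_context(prefix_text: str, common_text: str) -> dict:
    """Concatenate two text fragments, parse the result as JSON,
    and wrap it in a top-level "@context" key (an assumption of this sketch)."""
    terms = json.loads(prefix_text + common_text)
    return {"@context": terms}


context_new = assemble_context(PREFIX_NEW, COMMON)
context_old = assemble_context(PREFIX_OLD, COMMON)
print(json.dumps(context_new, indent=2))
```

Because the property characteristics live only in the common fragment, the two assembled contexts differ solely in the namespace URLs that the `cim` and `eu` prefixes expand to.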
#110 deploy JSON-LD contexts on a permanent network location:
- Currently JSON-LD files use network contexts on "rawgit2.com", which serves them with the appropriate content-type: application/ld+json:
  - https://rawgit2.com/Sveino/Inst4CIM-KG/develop/rdfs-improved/CIM-ontology-context.jsonld for ontologies
  - https://rawgit2.com/Sveino/Inst4CIM-KG/develop/rdf-improved/cim-context-old.jsonld for instance files using old namespaces
  - https://rawgit2.com/Sveino/Inst4CIM-KG/develop/rdf-improved/cim-context-new.jsonld for instance files using new namespaces
- But we need a more permanent CIMug or ENTSOE location.
#120 number representation in JSONLD
JSON has only a few native literal datatypes: number, boolean, string, null.
JSON numbers are imprecise:
- There is no distinction between integer and floating point
- JSON doesn't define whether a number should be represented as float or double
- Exact numbers (xsd:decimal) are not available natively
This is raised as issue json-ld-syntax#387, and is accepted in the JSON-LD errata.
It is therefore better to always use strings rather than native numbers. The JSON-LD context (see previous section) attaches the appropriate datatypes.
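The effect of native numbers can be illustrated with a small Python sketch, using only the standard library. It is not part of the repository's tooling. It parses the same lexical value once as a native JSON number and once as a string, and shows what happens when the native value passes through a 32-bit float, which is one possible kind of internal conversion. The helper `as_float32` is a name introduced here for illustration.

```python
import json
import struct
from decimal import Decimal


def as_float32(x: float) -> float:
    """Round a Python float (a 64-bit double) to the nearest 32-bit float,
    then widen it back to a double."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


# A native JSON number is parsed into a 64-bit double by Python's json module.
native = json.loads('{"reactance": 0.123}')["reactance"]

# Passing the value through a 32-bit float changes it.
widened = as_float32(native)

# A JSON string keeps the lexical form, so it can be read as an exact decimal.
exact = Decimal(json.loads('{"reactance": "0.123"}')["reactance"])

print(format(widened, ".16g"))  # 0.1230000033974648
print(exact)                    # 0.123
print(widened == native)        # False
```

The extra digits appear as soon as a float value is widened to a double and printed with full precision, whereas the string form carries "0.123" unchanged and leaves the choice of datatype to the context.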
To test the output of CIM numbers and booleans, we made test/test.rq that constructs a few triples:
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX cim: <https://cim.ucaiug.io/ns#>
construct {
[] cim:reactance "0.123"^^xsd:float; cim:normallyInService true
} where {}
The respective Turtle is test.ttl.
Then we tried with a few tools and saved the results:
- test-GraphDB.jsonld: GraphDB 10.7.3, save query result as JSON-LD, no context
- test-Jena-riot.jsonld:
  - Install from Apache Jena Commands
  - Then run:
    riot --formatted jsonld test.ttl > test-Jena-riot.jsonld
- test-ttl2jsonld.jsonld, no context:
  - Install with
    npm install -g @frogcat/ttl2jsonld
  - Then run
    ttl2jsonld test.ttl > test-ttl2jsonld.jsonld
- test-Virtuoso-context.jsonld: DBpedia SPARQL endpoint, save query result as JSON-LD with context
- test-Virtuoso-plain.jsonld: DBpedia SPARQL endpoint, save query result as JSON-LD plain
tool | reactance | normallyInService |
---|---|---|
GraphDB | "0.123" xsd:float | "true" xsd:boolean |
Jena riot | "0.123" xsd:float | "true" xsd:boolean |
ttl2jsonld | "0.123" xsd:float | true |
Virtuoso context | 0.1230000033974648 | true |
Virtuoso plain | 0.1230000033974648 | true |
- GraphDB and Jena output @value in quotes and always attach a datatype
- Virtuoso outputs only @value without quotes (and adds some spurious decimal digits due to internal conversions)
- ttl2jsonld outputs the number as @value in quotes with datatype, but the boolean without quotes
Note1: above we didn't specify a context to use. If we do, then more tools may output values in quotes.
Note2: see digitalbazaar/jsonld.js#558 for a similar problem related to native boolean in JSON-LD.
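A comparison like the one in the table above can also be automated. The following Python sketch walks a parsed JSON-LD document and reports, for each literal, whether it is a string or a native JSON number or boolean. It is a simple tree walk, not a JSON-LD processor, and does not interpret contexts. The function names `kind_of` and `find_literals` are introduced here, and the sample document is a hypothetical illustration shaped like the ttl2jsonld row of the table, not a copy of the saved test file.

```python
import json


def kind_of(value):
    """Classify a JSON scalar. bool is checked first because in Python
    bool is a subclass of int."""
    if value is None:
        return "null"
    if isinstance(value, bool):
        return "native boolean"
    if isinstance(value, (int, float)):
        return "native number"
    return "string"


def find_literals(node, prop=None):
    """Yield (property, kind, datatype) for each literal below node."""
    if isinstance(node, dict):
        if "@value" in node:
            # Value object: report its @value and the attached datatype, if any.
            yield prop, kind_of(node["@value"]), node.get("@type")
            return
        for key, value in node.items():
            if key in ("@context", "@id", "@type"):
                continue  # not literals
            # Keywords such as @graph or @list keep the enclosing property name.
            yield from find_literals(value, prop if key.startswith("@") else key)
    elif isinstance(node, list):
        for item in node:
            yield from find_literals(item, prop)
    elif prop is not None:
        # Bare scalar used directly as a property value.
        yield prop, kind_of(node), None


# Hypothetical sample: number as a typed string, boolean as a native value.
SAMPLE = json.loads("""
[{"@id": "_:b0",
  "https://cim.ucaiug.io/ns#reactance":
    [{"@value": "0.123", "@type": "http://www.w3.org/2001/XMLSchema#float"}],
  "https://cim.ucaiug.io/ns#normallyInService": [{"@value": true}]}]
""")

results = list(find_literals(SAMPLE))
for prop, kind, datatype in results:
    print(prop, kind, datatype)
```

Running the same walk over each saved result file would flag any output that relies on native numbers or booleans.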