MappingMasterDSL
MappingMaster is a domain-specific language (DSL) that defines mappings from spreadsheet content to OWL ontologies. This language is based on the Manchester OWL Syntax, which is itself a DSL for describing OWL ontologies.
An introduction to the Manchester Syntax can be found here. A set of example Manchester Syntax expressions can be found in the Quick Reference section of that document.
The Manchester Syntax supports the declarative specification of OWL axioms.
For example, a Manchester Syntax declaration of an OWL named class Gum that is a subclass of a named class called Product can be written using a class declaration clause as:
Class: Gum SubClassOf: Product
The MappingMaster DSL extends the Manchester Syntax to support references to spreadsheet content in these declarations. MappingMaster introduces a new reference clause for referring to spreadsheet content. In this DSL, any clause in a Manchester Syntax expression that indicates an OWL named class, OWL property, OWL individual, data type, or a literal can be substituted with this reference clause. Any declarations containing such references are preprocessed and the relevant spreadsheet content specified by these references is imported. As each declaration is processed, the appropriate spreadsheet content is retrieved for each reference. This content can then be used in four main ways:
- It can be used to directly name OWL entities that are created on demand.
- It can be used to annotate OWL entities that are created on demand.
- The content may reference existing OWL entities, either directly as a URI or through an annotation property.
- Finally, the content may be used as a literal.
References in the MappingMaster DSL are prefixed by the character @. These are generally followed by an Excel-style cell reference. In the standard Excel cell notation, cells extend from A1 in the top left corner of a sheet within a spreadsheet to successively higher columns and rows, with alpha characters referring to columns and numerical values referring to rows.
For example, a reference to cell A5 in a spreadsheet is written as follows:
@A5
Sheets can also be specified by enclosing their name in single quotes and using the "!" character separator between the sheet name and the cell specification:
@'A sheet'!A3
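As an illustration of the notation only, the following Python sketch splits a simple reference into its sheet, column, and row parts. The regular expression and the function names are assumptions made for this example; they are not taken from the MappingMaster implementation, and wildcards and qualifier clauses are deliberately left out.

```python
import re

# Hypothetical pattern for references such as @A5 or @'A sheet'!A3.
# Wildcards, entity types, and other clauses are not handled here.
CELL_REF = re.compile(
    r"@(?:'(?P<sheet>[^']+)'!)?(?P<column>[A-Z]+)(?P<row>[1-9][0-9]*)$"
)


def parse_reference(text):
    """Split a reference into (sheet, column, row); sheet is None if absent."""
    match = CELL_REF.match(text)
    if match is None:
        raise ValueError(f"not a simple cell reference: {text!r}")
    return match.group("sheet"), match.group("column"), int(match.group("row"))


def column_index(column):
    """Convert Excel-style column letters to a 1-based index (A=1, Z=26, AA=27)."""
    index = 0
    for letter in column:
        index = index * 26 + (ord(letter) - ord("A") + 1)
    return index
```

Here `parse_reference("@'A sheet'!A3")` yields the sheet name, the column letters, and the row number as separate values.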
For example, in the following spreadsheet rows 4 to 6 of column B contain product categories; columns D to G of row 2 contain state identifiers, and the grid range D4 to G6 contains sales amounts.
These references can then be used in MappingMaster's DSL to define OWL constructs using spreadsheet content.
For example, a MappingMaster expression to declare that a class FlavouredGum is a subclass of the class named by the contents of cell B4 can be written:
Class: FlavouredGum SubClassOf: @B4
When processed, this expression will create an OWL named class using the contents of cell B4 ("Gum") as the class name and declare FlavouredGum to be its subclass. If the class Gum already exists, the subclass relationship will simply be established.
That is, references can be used both to define new OWL entities and to refer to existing entities.
A similar expression to declare that the class SalesItem is equivalent to the class named by the contents of cell B4 can be written:
Class: SalesItem EquivalentTo: @B4
The Manchester Syntax also supports an individual declaration clause for declaring individuals; property values can be associated with the declared individuals using a facts subclause, which contains a list of property value declarations.
For example, an expression to specify that an individual created from the contents of cell D2 ("CA") has a value of "California" for the data property hasStateName can be written:
Individual: @D2 Facts: hasStateName "California"
Here, an individual CA will be created if necessary and associated with the data property hasStateName, which will be given the string value "California".
Using the standard Manchester Syntax, annotation properties can also be associated with declared entities.
For example, an existing string data type annotation property called hasSource can be used to associate the above declared California individual with the source document as follows:
Individual: @D2 Facts: hasStateName "California" Annotations: hasSource "DMV Spreadsheet 12/12/2010"
Classes or properties can be annotated in the same way. For example, a class can be annotated with the hasSource annotation property as follows:
Class: @D2 Annotations: hasSource "DMV Spreadsheet 12/12/2010"
The Manchester Syntax also supports the use of OWL class expressions. In general, a class expression may occur anywhere a named class can occur.
For example, an expression to define a necessary condition of a class Sale using the contents of cell D4 as the filler of an owl:hasValue restriction on the property hasAmount can be written:
Class: Sale SubClassOf: (hasAmount value @D4)
In general, OWL entities named explicitly in a MappingMaster expression (as opposed to resolved through a reference) must already exist in the target ontology. In these examples, the classes Sale, SalesItem and FlavouredGum, and properties hasAmount, hasStateName and hasSource must already exist.
In the expression
Class: @A5 SubClassOf: Drug
reference @A5 clearly refers to an OWL class. However, the reference type cannot always be inferred unambiguously.
For example, in the expression
Class: Sale SubClassOf: (@A3 value @D4)
the reference @A3 could refer to an object, data, or annotation property, and reference @D4 could be either an OWL individual or a literal.
To deal with this situation, MappingMaster supports explicit entity type specification. Specifically, a reference may be optionally followed by a parenthesis-enclosed entity type specification to explicitly declare the type of the referenced entity. This specification can indicate that the entity is an OWL named class, an OWL object, data, or annotation property, an OWL named individual, or a data type. The MappingMaster keywords to specify the types are the standard Manchester Syntax keywords Class, ObjectProperty, DataProperty, AnnotationProperty and Individual, plus any XSD type name (e.g., xsd:int).
Using this specification, the previous drug declaration, for example, can be written:
Class: @A5(Class) SubClassOf: Drug
A declaration of an individual from cell B5 with an associated property value from cell C5 that is of type float can be specified as follows:
Individual: @B5 Facts: hasSalary @C5(xsd:float)
If the hasSalary data property is already declared to be of type xsd:float then the explicit type qualification is not needed. A global default type can also be specified for literals in the case where the type of the associated data property is either unknown or unspecified or if no explicit type is provided in the reference.
References to OWL properties and individuals can be qualified in the same way.
References may specify OWL entities (i.e., classes, properties, individuals, or datatypes) or literals. When a reference specifies an OWL entity, the reference value may resolve to an existing OWL entity or may be used to name an OWL entity that is created on demand.
A variety of name resolution strategies are supported when creating or referencing OWL entities. The three primary strategies are to:
- Use rdf:IDs to create or resolve OWL entities.
- Use rdfs:label annotations to create or resolve OWL entities.
- Create OWL entities based on the location of a cell, ignoring the resolved reference value.
Using rdfs:label encoding, an OWL entity resolved from a reference is given an automatically generated URI and its rdfs:label annotation value is set to the resolved reference value.
With location encoding, an OWL entity generated from a reference is also given an automatically generated URI but in this case the resolved reference value is unused.
The default naming encoding uses the rdfs:label annotation property. The default may also be changed globally.
A name encoding clause is provided to explicitly specify a desired encoding for a particular reference. As with entity type specifications, this clause is enclosed by parentheses after the cell reference. The keywords to specify the three types of encoding are mm:Location, rdf:ID, and rdfs:label.
Using this clause, a specification of rdf:ID encoding for the previous drug example can be written:
Class: @B4(rdf:ID) SubClassOf: Drug
As mentioned, MappingMaster also supports entity creation where cell values are ignored. In this case, the keyword mm:Location can be used in parentheses following a reference.
For example, an expression to create an individual for cell D4 while ignoring the contents of the cell can be written:
Individual: @D4(mm:Location)
By default, OWL entity names are resolved or generated using the namespace of the currently active ontology. The language includes mm:prefix and mm:namespace clauses to override this default behavior.
For example, an expression to indicate that an individual created or resolved from the contents of cell A2 (assuming rdfs:label resolution) should use the namespace identified by the prefix "clinical", can be written:
Individual: @A2(mm:prefix="clinical")
Similarly, an expression to indicate that it must use the namespace "http://clinical.stanford.edu/Clinical.owl#" can be written:
Individual: @A2(mm:namespace="http://clinical.stanford.edu/Clinical.owl#")
Explicit namespace or prefix qualification in a reference allows disambiguation of duplicate labels in an ontology.
To support direct references to annotation values in expressions, MappingMaster's DSL adopts the Manchester Syntax mechanism of enclosing these references in single quotes.
For example, if the OWL class Product has an rdfs:label annotation value 'A sellable product', it can be referred to as follows:
Class: @B4 SubClassOf: 'A sellable product'
A sellable product will be resolved through an annotation value to the class Product when this expression is processed.
Document the following options:
mm:defaultPrefix, mm:defaultNamespace, mm:defaultLanguage, mm:ResolveIfOWLEntityExists, mm:SkipIfOWLEntityExists, mm:WarningIfOWLEntityExists, mm:ErrorIfOWLEntityExists, mm:CreateIfOWLEntityDoesNotExist, mm:SkipIfOWLEntityDoesNotExist, mm:WarningIfOWLEntityDoesNotExist, mm:ErrorIfOWLEntityDoesNotExist, mm:ProcessIfEmptyLabel, mm:ErrorIfEmptyLabel, mm:WarningIfEmptyLabel, mm:SkipIfEmptyLabel
The default behavior is to directly use the contents of the referenced cell. However, this default can be overridden using an optional value specification clause.
This clause is usually indicated by the '=' character immediately after the encoding specification keyword and is followed by a parenthesis-enclosed, comma-separated list of value specifications, which are appended to each other. These value specifications can be cell references, quoted values, regular expressions containing capturing groups, or inbuilt text processing functions.
For example, an expression that extends a reference to specify that the entity created from cell A5 is to use rdfs:label name encoding and that the name is to be the value of the cell preceded by the string "Sale:" can be written as follows:
Class: @A5(rdfs:label=("Sale:", @A5))
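The way value specifications are appended can be pictured with a small Python sketch. It assumes a hypothetical in-memory `cells` dictionary and treats any part starting with `@` as a plain cell lookup; regular expressions and functions are not handled, and nothing here is taken from the MappingMaster code base.

```python
def build_value(parts, cells):
    """Append value specifications in order to form one resolved value.

    parts: quoted values (plain strings) or simple references such as "@A5".
    cells: hypothetical mapping from cell name to cell content.
    """
    pieces = []
    for part in parts:
        if part.startswith("@"):
            # A reference contributes the content of the cell it names.
            pieces.append(cells[part[1:]])
        else:
            # A quoted value contributes its text unchanged.
            pieces.append(part)
    return "".join(pieces)


# Hypothetical cell content, chosen only for illustration.
cells = {"A5": "Gum"}
label = build_value(["Sale:", "@A5"], cells)
```

With the assumed content "Gum" in cell A5, `label` is the string "Sale:Gum".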
Value specification references are not restricted to the referenced cell itself and may indicate arbitrary cells. More than one encoding can also be specified for a particular reference so, for example, separate identifier and label annotation values can be generated for a particular entity using the contents of different cells.
For example, we can extend the example above to use the contents of cell B5 as the rdf:ID of generated classes as follows:
Class: @A5(rdf:ID=@B5 rdfs:label=("Sale:", @A5))
If the assignment list includes only a single value, as with rdf:ID=@B5 here, then the opening and closing parentheses can be omitted:
Class: @A5(rdf:ID=@B5 rdfs:label=("Sale:", @A5))
The language includes several inbuilt text processing methods that can be used in value specifications. These currently include mm:replace, mm:replaceAll, mm:replaceFirst, mm:prepend, mm:append, mm:toLowerCase, mm:toUpperCase, mm:trim, mm:reverse, mm:printf, and mm:decimalFormat. These methods take zero or more arguments and return a value. Supplied arguments may be any combination of quoted strings or references.
An expression to convert the contents of cell A5 to upper case before label assignment can be written:
Class: @A5(mm:toUpperCase(@A5))
A method can also have an explicit first argument omitted if the argument refers to the current location value. The previous expression can thus also be written:
Class: @A5(mm:toUpperCase)
Value processing functions can also be used outside of value specification clauses, but only if these clauses are not used in the reference, and only a single function can be used.
decimalFormat and printf support formatting of textual and numerical content. Their behavior follows the standard Java specifications for the DecimalFormat class and the String.format method.
mm:decimalFormat can be used as follows:
Individual: Fred Facts: hasSalary @A1(mm:decimalFormat("###,###.00", @A1))
When the value of cell A1 is "23000.2" this will render:
Individual: Fred Facts: hasSalary "23,000.20"
Here is an example of mm:printf:
Class: @A1(mm:printf("A_%s", @A1))
When the value of cell A1 is "Car" this will render:
Class: A_Car
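The two renderings above can be approximated in Python for readers who want to experiment outside Java. This is only a rough analogue: the helper names are hypothetical, Python's format specification is not the Java DecimalFormat pattern language, and the two differ in edge cases (for example, the Java pattern "###,###.00" prints no leading zero for values below 1, while the Python format below does).

```python
def decimal_format_like(value):
    """Approximate the pattern "###,###.00": grouping commas and two decimals."""
    return format(float(value), ",.2f")


def printf_like(template, value):
    """Approximate a single-argument %s template using Python's % operator."""
    return template % value


salary = decimal_format_like("23000.2")
class_name = printf_like("A_%s", "Car")
```

For the cell values used in the examples above, `salary` is "23,000.20" and `class_name` is "A_Car".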
Any parameter can be replaced with a reference clause. These functions will work with explicit rdf:ID and rdfs:label assignment too.
Note that if only one parameter is supplied the second is assumed to be the enclosing reference location.
So
Individual: Fred Facts: hasSalary @A1(mm:decimalFormat("###,###.00"))
is equivalent to:
Individual: Fred Facts: hasSalary @A1(mm:decimalFormat("###,###.00", @A1))
And
Class: @A1(mm:printf("A_%s"))
is equivalent to:
Class: @A1(mm:printf("A_%s", @A1))
Which is also equivalent to:
Class: @A1(rdf:ID=mm:printf("A_%s", @A1))
Note that functions cannot be directly nested inside other functions. For example, we cannot directly convert a value to upper case using the mm:toUpperCase function and then pass the result to an enclosing mm:prepend function:
Class: @A1(mm:prepend("_", mm:toUpperCase)) # THIS IS NOT ALLOWED
However, parameters to functions can include references, which themselves can contain functions, so we can nest indirectly. For example, the previous upper case conversion followed by a prepend operation could be written as follows using this approach:
Class: @A1(mm:prepend("_", @A1(mm:toUpperCase)))
The mm:printf function can also be very useful when performing complex combinations of operations. For example, the previous upper case conversion followed by a prepend operation could be written as follows using mm:printf:
Class: @A1(mm:printf("_%s", @A1(mm:toUpperCase)))
The mm:replace and mm:replaceAll functions follow from the associated methods in the standard Java String class.
For example, to remove all non alphanumeric characters from a cell before assignment, the mm:replaceAll function can be used as follows:
Individual: @A5 Facts: hasItems @B5(mm:replaceAll("[^a-zA-Z0-9]",""))
Similarly, the mm:replace method can be used to replace commas with periods when processing literals:
Individual: @A2 Facts: hasSalary @A3(xsd:float mm:replace(",", "."))
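The difference between the two functions mirrors the Java String methods they follow: replaceAll treats its first argument as a regular expression, while replace substitutes every occurrence of a literal string. A minimal Python sketch of the same distinction is shown below; the helper names and sample values are hypothetical, and Python's `re` module differs from Java in some details, such as the syntax for group references in the replacement string.

```python
import re


def replace_all(value, regex, replacement):
    """Regex-based substitution, analogous to Java's String.replaceAll."""
    return re.sub(regex, replacement, value)


def replace(value, target, replacement):
    """Literal substitution of every occurrence, analogous to Java's String.replace."""
    return value.replace(target, replacement)


# Hypothetical cell contents, chosen only for illustration.
items = replace_all("3 x Gum-Pack!", "[^a-zA-Z0-9]", "")
salary = replace("1234,5", ",", ".")
```

With these sample inputs, `items` is "3xGumPack" and `salary` is "1234.5".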
The mm:prepend method can be used as follows to simplify the earlier example that prefixes a label with the string "Sale:":
Class: @A5(rdfs:label=mm:prepend("Sale:"))
The expression can be further simplified by omitting the explicit rdfs:label qualification if it is the default:
Class: @A5(mm:prepend("Sale:"))
The append method works similarly.
For example, assuming default rdfs:label encoding, the string "_MM" can be appended to a generated label as follows using the mm:append function:
Individual: @A2(mm:append("_MM"))
A similar approach can be used to selectively extract values from referenced cells. A regular expression capturing groups clause is provided and can be used in any position in a value specification clause. This clause is contained in a quoted string enclosed by square brackets. For example, if cell A5 in a spreadsheet contains the string "Pfizer:Zyvox" but only the text following the ':' character is to be used in the label encoding, an appropriate capture expression could be written as:
Class: @A5(rdfs:label=[":(\S+)"])
Note that parentheses around the sub-expressions in a regular expression clause specify capture groups and indicate that the matched strings are to be extracted. In some cases, more than one group may be matched for a cell value, in which case the matched strings are extracted in the order that they are matched and are appended to each other.
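The extraction rule described here can be sketched in Python as follows. The sketch assumes that the pattern is searched for anywhere in the cell value (rather than having to match the whole value) and that a non-matching pattern yields an empty string; both are assumptions, as is the function name. Python's `re` syntax agrees with the Java Pattern class for simple patterns like these, but not in every detail.

```python
import re


def capture(pattern, value):
    """Return the text of all capture groups, appended in group order."""
    match = re.search(pattern, value)
    if match is None:
        # Assumption: no match produces an empty value.
        return ""
    # Groups that did not participate in the match contribute nothing.
    return "".join(group or "" for group in match.groups())


drug = capture(r":(\S+)", "Pfizer:Zyvox")
both = capture(r"(\S+):(\S+)", "Pfizer:Zyvox")
```

Here `drug` is "Zyvox", and `both` is "PfizerZyvox" because the two matched groups are appended to each other.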
Capturing groups can also be used to generate literals. For example, if cell A2 in a spreadsheet has a person's forename, middle initial, and surname separated by a single space, three capturing expressions can be used to selectively extract each name portion and separately assign them to different properties as follows:
Individual: @A2 Types: Person Facts: hasForename @A2(["(\S+)"]), hasInitial @A2(["\S+\s(\S+)"]), hasSurname @A2(["\S+\s\S+\s(\S+)"])
A similar example to separately extract two space-separated integers from a cell can be written as:
Individual: @A2 Types: Person Facts: hasMin @A2(xsd:int ["(\d+)\s+"]), hasMax @A2(xsd:int ["\s+(\d+)"])
If the hasMin and hasMax properties are of type xsd:int then the explicit qualification is not required here.
Capturing expressions can also be invoked via the mm:capturing function:
Individual: @A2 Types: Person Facts: hasForename @A2(mm:capturing("(\S+)"))
The syntax of capturing expressions follows that supported by the Java Pattern class.
MappingMaster currently supports the following datatypes:
xsd:string, xsd:boolean, xsd:byte, xsd:short, xsd:int, xsd:long, xsd:float, xsd:double, xsd:integer, xsd:decimal, xsd:dateTime, xsd:date, xsd:time, xsd:Duration, rdf:PlainLiteral, rdf:XMLLiteral
MappingMaster has several directives to customize the IRI creation process.
Directive | Explanation |
---|---|
mm:iri | Use the resolved reference value to generate an IRI. An error will be thrown if the generated value does not represent a valid IRI. |
mm:camelCaseEncode | |
mm:snakeCaseEncode | |
mm:uuidEncode | |
mm:hashEncode | |
To deal with missing cell values, default values can also be specified in references. A default value clause is provided to assign these values. This clause is indicated by the keywords mm:DefaultLocationValue, mm:DefaultLiteral, mm:DefaultLabel, and mm:DefaultID followed by an assignment to a string. For example, the following expression uses this clause to indicate that the value "Unknown" should be used as the created class label if cell A5 is empty:
Class: @A5(rdfs:label mm:DefaultLabel="Unknown")
Additional behaviors are also supported to deal with missing cell values. The default behavior is to skip an entire expression if it contains any references with empty cells. Four keywords are supplied to modify this behavior. These keywords indicate that:
- An error should be thrown if a cell value is missing and the mapping process should be stopped (mm:ErrorIfEmptyLocation)
- Expressions containing references with empty cells should be skipped (mm:SkipIfEmptyLocation)
- Expressions containing references with empty cells should generate a warning in addition to being skipped (mm:WarningIfEmptyLocation)
- Expressions containing such empty cells should be processed (mm:ProcessIfEmptyLocation).
Consider, for example, the following expression declaring an individual from cell A5 of a spreadsheet and associating a property hasAge with it using the value in cell A6:
Individual: @A5 Facts: hasAge @A6(mm:ProcessIfEmptyLocation)
Here, using the default skip behavior, a missing value in cell A5 will cause the expression to be skipped. However, the process directive for the hasAge property value in cell A6 will instead drop only the sub-expression containing it if that cell is empty. So, if cell A5 contains a value and cell A6 is empty, the resulting expression will still declare an individual.
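The interaction of the four empty-location behaviors can be sketched as a small decision function. This is a reading of the description above rather than the actual MappingMaster logic: the function name, the tuple layout, the exception class, and the warning text are all hypothetical, and the default is taken to be skipping, as stated in this section.

```python
class EmptyLocationError(Exception):
    """Hypothetical error raised for mm:ErrorIfEmptyLocation."""


def plan_expression(references, default="mm:SkipIfEmptyLocation"):
    """Decide how to treat one expression with possibly empty cells.

    references: list of (name, cell_value, setting_or_None) tuples.
    Returns (keep_expression, dropped_references, warnings).
    """
    dropped, warnings = [], []
    for name, value, setting in references:
        if value != "":
            continue
        setting = setting or default
        if setting == "mm:ErrorIfEmptyLocation":
            # Stop the whole mapping process.
            raise EmptyLocationError(name)
        if setting == "mm:ProcessIfEmptyLocation":
            # Keep the expression, drop only this sub-expression.
            dropped.append(name)
            continue
        if setting == "mm:WarningIfEmptyLocation":
            warnings.append(f"empty cell for {name}")
        # Skip and Warning both abandon the entire expression.
        return False, dropped, warnings
    return True, dropped, warnings


# Cell A5 has a value, cell A6 is empty but marked as "process".
result = plan_expression([
    ("@A5", "Fred", None),
    ("@A6", "", "mm:ProcessIfEmptyLocation"),
])
```

In this case `result` is `(True, ["@A6"], [])`: the individual is still declared and only the hasAge fact is dropped.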
Using a similar approach, finer-grained empty value handling is also supported to specify different behaviors for mm:Literal, rdf:ID and rdfs:label values. Here, the label directives are mm:ErrorIfEmptyLabel, mm:SkipIfEmptyLabel, mm:WarningIfEmptyLabel, and mm:ProcessIfEmptyLabel, with equivalent keywords for RDF identifier and literal handling. These are mm:ErrorIfEmptyID, mm:SkipIfEmptyID, mm:WarningIfEmptyID, mm:ProcessIfEmptyID and mm:ErrorIfEmptyLiteral, mm:SkipIfEmptyLiteral, mm:WarningIfEmptyLiteral, mm:ProcessIfEmptyLiteral.
One additional option is provided to deal with empty cell values. This option targets the common case in many spreadsheets where a particular cell is supplied with a value and all empty cells below it are implied to have the same value. In this case, when an empty cell is processed, its location must be shifted to the nearest cell above it that contains a value. For example, the following expression uses the mm:ShiftUp keyword to indicate that if cell A5 does not contain a value for the name of the declared class then the row number must be shifted upwards until a value is found:
Class: @A5(mm:ShiftUp)
If no value is found, normal empty value handling processing is applied. Similar directives provide for shifting down (mm:ShiftDown), and to allow shifting to the left (mm:ShiftLeft), or to the right (mm:ShiftRight).
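The shift-up lookup can be sketched as a simple upward scan. The grid representation (a dictionary keyed by column letter and row number) and the function name are assumptions made for this example; returning an empty string stands in for falling back to normal empty value handling.

```python
def shift_up(cells, column, row):
    """Return the nearest non-empty value at or above (column, row).

    cells: hypothetical mapping from (column, row) to cell content.
    Returns "" when no value is found, leaving empty-value handling to the caller.
    """
    for current in range(row, 0, -1):
        value = cells.get((column, current), "")
        if value != "":
            return value
    return ""


# Hypothetical sheet: A3 holds a value, A4 and A5 are empty.
cells = {("A", 3): "Gum", ("A", 4): "", ("A", 5): ""}
name = shift_up(cells, "A", 5)
```

Here `name` is "Gum", taken from cell A3. Shifting down, left, or right would follow the same pattern with a different scan direction.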
Most mappings will not just reference individual cells but will instead iterate over a range of columns or rows in a spreadsheet. The wildcard character '*' can then be used in references to refer to the current column and/or row in an iteration. MappingMaster provides a graphical interface to specify these ranges. (They will soon be supported in the DSL.)
Example references using this wildcard notation include:
- @A3
- @A*
- @**
For example, an expression to declare an individual of class Sale for each cell in the iterated range can be written:
Individual: @** Types: Sale
This expression can be extended to assign property values to these individuals:
Individual: @** Types: Sale Facts: hasAmount @**, hasProduct @B*, hasState @*2
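To make the wildcard substitution concrete, the following Python sketch rewrites an expression for one step of an iteration by replacing each '*' with the current column or row. The regular expression, the function names, and the row-by-row iteration order are assumptions for illustration; sheet-qualified references and qualifier clauses are not handled.

```python
import re

# Hypothetical pattern: a column (letters or *) followed by a row (digits or *).
WILDCARD_REF = re.compile(r"@(\*|[A-Z]+)(\*|[0-9]+)")


def expand_expression(expression, column, row):
    """Replace wildcards in every reference with the current column and row."""
    def substitute(match):
        col_part, row_part = match.groups()
        col = column if col_part == "*" else col_part
        r = str(row) if row_part == "*" else row_part
        return "@" + col + r
    return WILDCARD_REF.sub(substitute, expression)


def iterate(columns, rows):
    """Yield (column, row) pairs for a range, row by row (order is an assumption)."""
    for row in rows:
        for column in columns:
            yield column, row


expression = ("Individual: @** Types: Sale "
              "Facts: hasAmount @**, hasProduct @B*, hasState @*2")
first = expand_expression(expression, "D", 4)
```

For the first cell of the range D4 to G6, `first` reads "Individual: @D4 Types: Sale Facts: hasAmount @D4, hasProduct @B4, hasState @D2", so each sale picks up its product from column B and its state from row 2.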
The DSL does not support the entire Manchester Syntax. The following clauses are not currently supported:
- OWL object property declarations
- OWL data property declarations
- OWL annotation property declarations
- OWL datatype declarations
- OWL literal type qualification
- OWL disjoint classes
- OWL equivalent and disjoint properties
- OWL negative property assertions
- OWL has key
A set of global defaults can be specified for reference directives. The language has a number of clauses to specify these defaults.
The following examples illustrate the use of these clauses together with the current defaults.
- mm:DefaultReferenceType Current default is Class. Other possible values include NamedIndividual, ObjectProperty, DataProperty, AnnotationProperty, and any XSD datatype.
- mm:DefaultPropertyType Current default is ObjectProperty. Other possible values are DataProperty and AnnotationProperty.
- mm:DefaultPropertyValueType Current default is xsd:string. This type is used when a (data or annotation) property value is expected.
- mm:DefaultDataPropertyValueType Current default is xsd:string. Other possible values include any XSD datatype.
- mm:DefaultValueEncoding Current default is rdf:ID. Other possible values are rdfs:label, mm:Literal and mm:Location.
- mm:DefaultIRIEncoding Current default is mm:CamelCaseEncoding. Other possible values are mm:NoEncode, mm:NoSnakeCaseEncode, mm:UUIDEncode and mm:HashEncode.
- mm:DefaultShiftSetting Current default is mm:NoShift. Other possible values are mm:ShiftUp, mm:ShiftDown, mm:ShiftLeft, and mm:ShiftRight.
- mm:DefaultEmptyLocationSetting Current default is mm:WarningIfEmptyLocation.
- mm:DefaultEmptyLiteralSetting Current default is mm:WarningIfEmptyLiteral.
- mm:DefaultEmptyRDFIDSetting Current default is mm:WarningIfEmptyRDFID.
- mm:DefaultEmptyRDFSLabelSetting Current default is mm:WarningIfEmptyRDFSLabel.
- mm:DefaultIfOWLEntityExistsSetting Current default is mm:ResolveIfOWLEntityExists.
- mm:DefaultIfOWLEntityDoesNotExistSetting Current default is mm:CreateIfOWLEntityDoesNotExist.
- mm:DefaultLocationValue Current default is "".
- mm:DefaultLiteralValue Current default is "".
- mm:DefaultRDFID Current default is "".
- mm:DefaultRDFSLabel Current default is "".
- mm:DefaultLanguage Current default is "".
- mm:DefaultPrefix Current default is "".
- mm:DefaultNamespace Current default is "".
The MappingMaster DSL allows OWL axioms and entities to be created from spreadsheet content. The use of the Manchester syntax allows these OWL entities to be related to each other in complex ways.
Declaratively specifying mappings in this way has several advantages. Writing these mappings does not require any programming or scripting expertise. The mappings can be shared easily using the MappingMaster GUI, which can save and load them. The mappings can also easily be executed repeatedly on different spreadsheets with the same structure.