Ontologizing CIF data #16
@jesper-friis and @emanueleghedini I am not completely sure I implemented the hierarchy correctly in the example branch, please update it as you see fit. |
I have a few suggestions:
|
By the way, Protege behaved very strangely for me when I was moving around within the CIF branch. Not sure why. Could be the |
Here are some questions and comments, bearing in mind that I'm not that familiar with the terminology here:
As a comment: when describing CIF data you are describing a relational model. Some options for relational models include: a row-based description as here; a column-based description, in which case the values attached to the data names are always equal-length arrays of values; and a functional description. The functional description is particularly interesting as it maps simply to mathematical category theory. In a functional description, each of the data names in a CIF category is the name of a function mapping from the key data names (which are not identified explicitly as such in DDL1, but the |
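To make the three descriptions of a relational model concrete, here is a minimal Python sketch that converts a row-based table into the column-based and functional views. The two example rows, the choice of `_space_group_symop.id` as key, and the helper names `to_columns` and `to_functions` are assumptions made purely for illustration; they are not taken from any CIF dictionary or library.

```python
# Hypothetical example data: two rows of one CIF category (row-based view).
rows = [
    {"_space_group_symop.id": 1, "_space_group_symop.operation_xyz": "x,y,z"},
    {"_space_group_symop.id": 2, "_space_group_symop.operation_xyz": "-x,-y,-z"},
]
KEY = "_space_group_symop.id"  # assumed key data name for this illustration


def to_columns(rows):
    """Column-based view: each data name maps to an equal-length list of values."""
    names = list(rows[0])
    return {name: [row[name] for row in rows] for name in names}


def to_functions(rows, key):
    """Functional view: each non-key data name becomes a function of the key value."""
    functions = {}
    for name in rows[0]:
        if name == key:
            continue
        lookup = {row[key]: row[name] for row in rows}
        functions[name] = lookup.__getitem__
    return functions


columns = to_columns(rows)
functions = to_functions(rows, KEY)
```

In the functional view, `functions["_space_group_symop.operation_xyz"](2)` looks up the operation belonging to the row whose key value is 2, which is the sense in which a data name acts as a function of the key data names.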
It is clear that we have a lot to learn about CIF. I was not aware of the column and functional descriptions. I think it should be possible to describe the data schema with EMMO, but we need your help to come up with a suggestion. Regarding your points:
|
Units are specified using the One query about the updated schema, which looks cleaner: why is there any need for |
Thank you for the clarification about I don't see any reference to Regarding |
One thing I think started to come out in the meeting was the distinction between syntax and underlying semantics. So I think the original EMMO goal might have been to capture the contents of a CIF syntax data file, in which case The dictionaries aim to be syntax-agnostic, instead modelling data as relational tables. As I said, all that you need from a format for this to be possible is that a vector of values can be associated with some object or abstract location in data files, and an association between that location and a CIF dataname. Collection of these vectors into tables is specified by the CIF dictionaries, so in this sense CIF loops are redundant. But back to the question at hand, a 'category' is a CIF semantic concept and a 'loop' is a CIF syntactic concept, so it seems strange to mix them in a single scheme. Maybe what we could do is clearly delineate syntax from semantics, perhaps by providing two ontologies: this description of CIF syntax in the 'language' sub-area of EMMO, and a separate branch of the emmo ontology that transcribes the contents of CIF dictionaries. So for the syntax, you'd say that a CIF data file contains data blocks, which contain key-value pairs and loops. Loops contain rows of values each associated with a data name. At this point all values are just strings, and data names are generic. Maybe it would then be possible (don't know how EMMO works yet) to state that a data name in the syntax is a data name found in the dictionary ontology, thus making the link between syntax and semantics. On the other hand, it looks like it might be possible to express the two separate ontologies (syntax and semantics) in a single ontology, but then I think we should distinguish carefully between syntactical types and semantic types. 
So a data name isa "data name appearing in a loop in a file" and also isa "thing defined in a dictionary" belonging to a "dictionary category" and its value has syntactic type "string" and semantic type "whatever the dictionary says". |
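The split between syntax and semantics described above can be sketched in Python as follows. This is only an illustration of the idea, not an EMMO or PyCIFRW structure: the classes `Loop` and `DataBlock`, the `DICTIONARY` stand-in and the function `interpret` are all hypothetical names. The `Integer` type for `_space_group_symop.id` is the value reported later in this thread; the `Text` type for the other data name is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Loop:
    """Syntactic loop: data names plus rows of raw string values."""
    names: list
    rows: list


@dataclass
class DataBlock:
    """Syntactic data block: key-value pairs and loops; every value is a string."""
    name: str
    pairs: dict = field(default_factory=dict)
    loops: list = field(default_factory=list)


# Hypothetical stand-in for dictionary content: data name -> (category, semantic type).
# The "Text" entry is an assumption made for this sketch.
DICTIONARY = {
    "_space_group_symop.id": ("space_group_symop", "Integer"),
    "_space_group_symop.operation_xyz": ("space_group_symop", "Text"),
}


def interpret(loop, dictionary):
    """Link each syntactic data name in a loop to its dictionary definition."""
    result = {}
    for name in loop.names:
        category, semantic_type = dictionary.get(name, (None, None))
        result[name] = {
            "syntactic_type": "string",      # what the file syntax provides
            "category": category,            # what the dictionary says
            "semantic_type": semantic_type,  # what the dictionary says
        }
    return result


block = DataBlock(
    name="example",
    loops=[Loop(names=["_space_group_symop.id", "_space_group_symop.operation_xyz"],
                rows=[["1", "x,y,z"], ["2", "-x,-y,-z"]])],
)
annotated = interpret(block.loops[0], DICTIONARY)
```

The syntactic classes know nothing about categories or types; only `interpret` connects a data name found in a file to the dictionary, which mirrors the proposed link between the two ontologies.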
Some thoughts ahead of tonight's meeting - I think the simplest way of incorporating CIF dictionary information is to choose which of the attributes attached to a data name are relevant, and then they each become a box with an arrow pointing to them from the data name. In the simplest version, you care only about the type of the value, which is about what we currently have. |
So I think v3 looks neater and more flexible in terms of adding additional attributes to data names when/if they become important. Some comments:
|
Here is v4 after a discussion with Emanuele and looking back on the comments by James: I have also created a new github branch cif-data-v4 with the following:
Some notes:
|
@jamesrhester, it seems that you might know PyCIFRW pretty well. How do we access information about type and unit for data names like
>>> from CifFile import ReadCif
>>> cf = ReadCif('cif_core.dic')
>>> cf.get_children('core_dic')['cell.length_a'].items()
[('_definition.id', '_cell.length_a'),
('_alias.definition_id', ['_cell_length_a']),
('_import.get', [{'save': 'cell_length', 'file': 'templ_attr.cif'}]),
('_name.category_id', 'cell'),
('_name.object_id', 'length_a')]
it seems that we somehow are supposed to import templ_attr.cif and obtain the missing information from there. Is that something PyCIFRW can do for us? |
Indeed I do know it pretty well! See the original paper. The
In [1]: from CifFile import CifDic
In [2]: p = CifDic("/home/jrh/COMCIFS/cif_core/cif_core.dic",do_dREL=False)
# lots of output edited out...
In [3]: p["_space_group_symop.id"]["_type.contents"]
Out[3]: 'Integer'
Attributes of CIF categories can also be found in the same way:
In [4]: p["space_group_symop"]["_category_key.name"]
Out[4]: ['_space_group_symop.id']
Be sure to use the Note that if an attribute appears in a loop (even if it only has one row) the return value will be an array as in the last example. Most (all?) current small molecule CIF data files use the old-style data names that don't have a period character
In [5]: p["_space_group_symop.operation_xyz"]["_alias.definition_id"]
Out[5]:
['_space_group_symop_operation_xyz',
'_symmetry_equiv.pos_as_xyz',
'_symmetry_equiv_pos_as_xyz'] |
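Since most data files use the old-style names, one practical use of `_alias.definition_id` is a lookup table from each alias back to its dictionary definition. The sketch below is a hedged illustration: `build_alias_map` is a hypothetical helper, and it only assumes that the dictionary object supports the `dic[name][attribute]` access shown in the session above and raises `KeyError` for a missing attribute. It is demonstrated on a plain `dict` stand-in rather than a real `CifDic`, so behaviour with PyCIFRW itself is not verified here.

```python
def build_alias_map(dictionary, names):
    """Map every alias (e.g. an old-style data name) to its definition id."""
    alias_map = {}
    for name in names:
        try:
            aliases = dictionary[name]["_alias.definition_id"]
        except KeyError:
            continue  # no aliases recorded for this data name
        if isinstance(aliases, str):
            aliases = [aliases]  # attribute not in a loop: single value
        for alias in aliases:
            # CIF data names are case-insensitive, so normalise to lower case.
            alias_map[alias.lower()] = name
    return alias_map


# Plain-dict stand-in for a CifDic, filled with the values shown above.
fake_dic = {
    "_space_group_symop.operation_xyz": {
        "_alias.definition_id": [
            "_space_group_symop_operation_xyz",
            "_symmetry_equiv.pos_as_xyz",
            "_symmetry_equiv_pos_as_xyz",
        ],
    },
    "_space_group_symop.id": {"_type.contents": "Integer"},
}
alias_map = build_alias_map(fake_dic, list(fake_dic))
```

With this table, a name such as `_symmetry_equiv_pos_as_xyz` read from a data file resolves to `_space_group_symop.operation_xyz`, whose attributes can then be looked up in the dictionary.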
Thank you James. CifDic seems to be exactly what I was looking for. Wonderful to work with the main developer! |
Do you have any comments about v4 above: emmo-repo/domain-crystallography#16 (comment)? |
v4 looks pretty good. Some thoughts:
|
Regarding your comment about aliases. They are already included as skos:altLabel in the generated cif_core.ttl file. |
By the way, here is the generated cif_core turtle file if you want to explore it in Protege. |
|
Thank you @jamesrhester. I made yet another update where I introduced type contents and containers from DDLm. I have also updated the script for generating the cif_core ontology. Francesca has released version 1.0.0 of EMMO Python, so generating the ontology can now be done with the following steps:
$ pip install PyCifRW EMMO
$ git fetch origin cif-data-v4
$ git checkout cif-data-v4
$ python generate_cif.py
Some notes and questions:
|
The latest ontology looks good and I don't have any specific comments beyond what is below.
That makes sense
In practice there are not that many distinct matrix types in the core dictionary so you could generate specific types. Composite and incommensurate structures, which eventually you'll want to capture, can have a bit more variety in dimensions. I don't think there is a right answer here. I do like the way the ontology looks with separate types for each type of matrix.
All of the compound data types are "new" and very rarely if ever appear in data files currently, as they require use of the new CIF syntax that supports them. Syntactically, an array/matrix/list looks like: A table is written
Definitely not. A table constructed as
Yes it is. This is almost never used in domain dictionaries, and where it does occur it means we haven't cleaned up properly. It is mainly used in the dictionary describing the attributes themselves where the type of an attribute depends on the type of another attribute.
I think the general idea behind TABLE, DATA_ITEMS, ROW, COLUMN is correct. In terms of naming I don't think we have any particular concept, as we are table-oriented. Being table-oriented means that we have column headings (the data names) and columns that are lists of values, or alternatively we have rows, and for each row we can associate a data name with a data value. You may wish to rename
COLUMN as a concept is useful if you consider that the value attached to a data name in a file is actually a whole column of values. This is efficient for programming but perhaps sets of rows makes more ontological sense. The dictionary does not and cannot make use of COLUMN anywhere so you can leave it out if you want.
Yes, that would be OK, as the data names listed under |
Thank you James for useful comments. I have now implemented the specific data types in point 2 above. I have to return to the other points. Ideally, we should also link the array elements to their corresponding arrays, like explicitly stating that I have also started to connect the generated cif_core ontology with our original EMMO crystallography domain ontology. |
That's right, there is no attribute for explicitly linking the elements of a matrix and the matrix itself. The precise relationship is expressed in the dREL code for constructing the matrix from the elements (e.g. here) |
Latest working CIF ontology based on the work done in #16. It implements a software tool to convert CIF -> OWL (TTL) and includes the latest use of the tool to generate the CIF Core dictionary as an OWL (TTL) ontology. It also encompasses the latest changes to the CIF top ontology.
While this issue has been closed, it can be used as reference for further changes, but it would be better to open specific GitHub issues pertaining to the subject of the suggested change. |
In a discussion between myself, @jesper-friis and @emanueleghedini, we tried to tackle the hurdles for getting development on this ontology started in a practical way. In other words, we tried setting up a basic taxonomy and parthood graph for CIF data.
By CIF data we mean the semantics of the actual data (the values) not the semantics of the associated keywords. However, since the values are represented by their keywords/data names, these have been used in the mock up.
The resulting graph is shown below.
The graph has essentially two important parts. One pertains to the hierarchy of the data, the other relates to the semantics of the data types.
For the hierarchy, we see that CIF_DATA has a part loop_. This is not to be taken as the syntactical loop_, but rather the concept of the CIF data expressed as loops.
A loop has a part ROW, which is our attempt to define a collection of a single row of data within a loop. Note that we do not care here whether the CIF file syntactically defines a ROW using key+value lines or as part of a syntactic loop_. ROW encompasses both as the same concept semantically.
Now we come to how one may practically extend the ontology. Here we have added the concept of _space_group_symop_[], which isA ROW and hasParts _space_group_symop_id, _space_group_symop_operation_xyz, and _space_group_symop_sg_id. There is a restriction of how many times _space_group_symop_[] can have each of these parts (max 1).
Now, all of these are SPACE_GROUP_SYMOP, i.e., they are of the CIF category SPACE_GROUP_SYMOP.
For the data types, you can see that we have given each of the CIF data names types according to the type definitions of the CIF dictionary. In CIFv1 (which we are currently only concerned with) there are only three types: null, char, and numb (REF).
Each of the three data types has been defined and also related to the general xsd types via the types defined in EMMO.
This creates a data type relationship for all CIF data to that of EMMO.
Now if one wants to extend this, you would simply add another null type/category_overview CIF data as a sub-class of both ROW, cif:null, and its associated category, and afterwards add all its containing data keys/names as parts of it, sub-classing both the category (again) and the related type.
Finally, this can be automated by going through the actual .dic file, which defines all the relevant metadata for each data key/name (link to core CIF.dic-file).
This is not meant to be the absolute way of ontologizing CIF data, but rather, it is our currently suggested way of doing it. This issue is meant to be a discussion of its validity and one can suggest or ask questions freely.
As an added bonus, I have created a branch where one can see the implementation of the graph above into a Turtle file in the current repository (cif-data). If you checkout this branch and open the Turtle file cif-data.ttl in Protégé, you should see the suggested implementation, which could act as a template for adding more CIF data keys/names (the added concepts are marked in bold).
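The extension recipe above (one ROW sub-class per category, with each data name as a part occurring at most once) could be automated roughly as in the following Python sketch. It builds plain dictionaries rather than OWL, and everything in it is an assumption made for illustration: the `ENTRIES` list stands in for metadata read from a .dic file, the types assigned to the three data names are guesses, the labels `cif:char` and `cif:numb` are formed by analogy with `cif:null`, and `build_row_classes` is a hypothetical helper.

```python
# Hypothetical metadata, as it might be extracted from a .dic file:
# (data name, CIF category, CIF type). The types are assumed for this sketch.
ENTRIES = [
    ("_space_group_symop_id", "SPACE_GROUP_SYMOP", "numb"),
    ("_space_group_symop_operation_xyz", "SPACE_GROUP_SYMOP", "char"),
    ("_space_group_symop_sg_id", "SPACE_GROUP_SYMOP", "numb"),
]


def build_row_classes(entries):
    """Group data names into one ROW sub-class per category."""
    rows = {}
    for data_name, category, cif_type in entries:
        row_name = f"_{category.lower()}_[]"
        row = rows.setdefault(
            row_name,
            # The row concept sub-classes ROW, cif:null and its category.
            {"subClassOf": ["ROW", "cif:null", category], "hasPart": {}},
        )
        # Each data name sub-classes the category and its type, and may
        # occur at most once in a row (the "max 1" restriction).
        row["hasPart"][data_name] = {
            "subClassOf": [category, f"cif:{cif_type}"],
            "max": 1,
        }
    return rows


row_classes = build_row_classes(ENTRIES)
```

The resulting structure mirrors the graph: `row_classes["_space_group_symop_[]"]` lists its super-classes and its three parts, and could serve as input to whatever tool writes the actual Turtle file.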