Provide default values for author identifiers #359

jamesrhester · 2023-03-20T05:05:52Z

Recent changes to the author categories introduced author identifiers, which were key data names. However, legacy data files will not have this data name and thus fail validation. The current changes introduce dREL methods that provide default values. Closes #103 .

Note that the dREL is implemented using a Definition method rather than an Evaluation method, as the Evaluation method would imply a single correct value, and validation would therefore fail if the file contained any values for the identifiers that were not the row numbers. Evaluation methods were appropriate for _space_group_symop.id because legacy CIFs used the space_group_symop category row number to refer to symops.

The Definition method creates a new value for the default value for each row. This is an original use of Definition methods, which up until now have been used in Set categories only, that is, _enumeration.default might be interpreted as creating a category-wide default. Now that Set categories are simply seen as Loop categories that have been split over several data blocks, there is no fundamental difference between a per-row default and a per-Set-category-row default. But please think if there is any reason not to allow this behaviour.

Recent changes to the author categories introduced author identifiers, which were key data names. However, legacy data files will not have this data name and thus fail validation. The current changes introduce dREL methods that provide default values.

vaitkus · 2023-03-20T14:41:26Z

Thank you for pointing out the not so obvious distinction between Definition and Evaluation methods. However, shouldn't we rely on the Validation methods to carry out the correct type of comparison rather than require for the provided and evaluated values to always match? This way we could use Evaluation rules to determine the id values and, if needed, explicitly specify that they should not be validated (i.e. by defining the Validation methods as '.')?

I quite like the current assumption that it is sufficient to run Definition methods once per file while the Evaluation methods can/should be run for each value. What do you think?

Also, as mentioned in #103, wouldn't it be better to use a method that does not tie the generated id to the row number? Using something like a uuid would help to prevent accidental id clashes and would be more in line with the idea that loop order should/can be generally ignored when dealing with DDLm ontologies.

Finally, note that according to the interpretation that you provided, the Evaluation method is also not fully suitable for _space_group_symop.id data item as it could lead to similar validation issues. This is due to the fact that symop ids are only recommended to match the row numbers, but not formally required to do so (except for x,y,z which is required to have the id of 1):

It is normally the sequence number of the entry in that
list, and should be identified with the code 'n' in the
geometry symmetry codes of the form 'n_pqr'.

jamesrhester · 2023-03-21T04:44:26Z

Thank you for pointing out the not so obvious distinction between Definition and Evaluation methods. However, shouldn't we rely on the Validation methods to carry out the correct type of comparison rather than require for the provided and evaluated values to always match? This way we could use Evaluation rules to determine the id values and, if needed, explicitly specify that they should not be validated (i.e. by defining the Validation methods as '.')?

The meaning of Evaluation has always been that it calculates the correct value of a missing data item, not that it calculates one possible value, so the results of Evaluation must be usable as a way to check values. Validation makes very little sense as a separate method purpose attached to a single data name, as it either fulfills the same role as Evaluation, by finding incorrect values, or it checks collections of values, e.g. that cell parameters match the space group symmetry, and is therefore not associated with any single data name. That's why I've always considered Validation methods to belong to a special dictionary that defines "validation" data names, like "_valid.cell_symmetry_consistent", and even in that case Evaluation would work just as well. A further issue is that currently we can only have a single method attached to a definition. We do need to define a _method.id to fix that, but I've been putting that off.

I quite like the current assumption that it is sufficient to run Definition methods once per file while the Evaluation methods can/should be run for each value. What do you think?

In the dREL context, there are no data blocks or data files, only collections of tables. So the simplest option for Definition methods is that they do evaluate for every row. The alternative might be to say something like "once for every distinct value of a Set category key", which is more complex to implement and means that Set categories acquire special status in dREL, which they haven't had before.

Also, as mentioned in #103, wouldn't it be better to use a method that does not tie the generated id to the row number? Using something like a uuid would help to prevent accidental id clashes and would be more in line with the idea that loop order should/can be generally ignored when dealing with DDLm ontologies.

Yes, CurrentRow is fairly appalling in many senses. I agree that it would be great to introduce a new dREL function "unique_id()", to replace CurrentRow. I'll raise an issue in the dREL repository where we can discuss it further.

Finally, note that according to the interpretation that you provided, the Evaluation method is also not fully suitable for _space_group_symop.id data item as it could lead to similar validation issues. This is due to the fact that symop ids are only recommended to match the row numbers, but not formally required to do so (except for x,y,z which is required to have the id of 1):

It is normally the sequence number of the entry in that
list, and should be identified with the code 'n' in the
geometry symmetry codes of the form 'n_pqr'.

Unfortunately 3 decades of practice have made that "normally" into "always". This is hard-coded into too much software to change now. We should change that definition to move away from sequence number and simply say that it is the value of _space_group_symop.id, and the row number when this is missing.

But, given my comments above, CurrentRow is not guaranteed to work for merged data blocks, where there might be several space groups and therefore several lists of symmetry operators. Essentially we want to change its definition so that we are talking about the row number where all key data names except for one are constant.

vaitkus · 2023-03-21T14:18:38Z

The meaning of Evaluation has always been that it calculates the correct value of a missing data item, not that it calculates one possible value, so the results of Evaluation must be usable as a way to check values. Validation makes very little sense as a separate method purpose attached to a single data name, as it either fulfills the same role as Evaluation, by finding incorrect values, or it checks collections of values, e.g. that cell parameters match the space group symmetry, and is therefore not associated with any single data name. That's why I've always considered Validation methods to belong to a special dictionary that defines "validation" data names, like "_valid.cell_symmetry_consistent", and even in that case Evaluation would work just as well.

Understood, so the main idea is that Evaluation methods base their results on other data items and should generally always provide the same output between different runs (thus not fit for the current id use case).

A further issue is that currently we can only have a single method attached to a definition. We do need to define a _method.id to fix that, but I've been putting that off.

This is just a nitpick, but a definition can have several attached methods, but just not several attached methods of the same type (_method.purpose serves as the key). However, the _method.id item would be useful since, in theory, one could provide several alternative Evaluation methods that calculate the value based on the availability of other items or several Validation methods which check different aspects of the validated data item (e.g. as suggested in #262).

In the dREL context, there are no data blocks or data files, only collections of tables. So the simplest option for Definition methods is that they do evaluate for every row. The alternative might be to say something like "once for every distinct value of a Set category key", which is more complex to implement and means that Set categories acquire special status in dREL, which they haven't had before.

But doesn't dREL still operate on a finite dataset which in the regular use case matches a single data block? Probably, not all of the currently defined methods could not handle merged multiblock data files (e.g. method provided with _refln.f_complex uses data from several categories without trying to determine if the data came from the same data block).

Unfortunately 3 decades of practice have made that "normally" into "always". This is hard-coded into too much software to change now. We should change that definition to move away from sequence number and simply say that it is the value of _space_group_symop.id, and the row number when this is missing.

But, given my comments above, CurrentRow is not guaranteed to work for merged data blocks, where there might be several space groups and therefore several lists of symmetry operators. Essentially we want to change its definition so that we are talking about the row number where all key data names except for one are constant.

Ok, these changes can be made in a separate PR once the discussions on the id generation functions are settled.

As for the current PR, the Evaluation method was slightly easier to process as one could automatically assume it returns the missing value without analysing the actual code. On the other hand, the Definition method may modify any of the definition attributes, thus a more detailed analysis of the code is needed to figure out that the value is being assigned by modifying the default value. Currently, the CIF validator from the cod-tools package generally is not capable of interpreting dREL, but I can implement a hackish solution for now to handled just such cases.

jamesrhester · 2023-03-22T01:49:01Z

But doesn't dREL still operate on a finite dataset which in the regular use case matches a single data block? Probably, not all of the currently defined methods could not handle merged multiblock data files (e.g. method provided with _refln.f_complex uses data from several categories without trying to determine if the data came from the same data block).

That is the reason for a specific dREL behaviour: when indexing into a category in a dREL method, if the values of any key data names are not provided explicitly, their values can be assumed from the values any related (child-parent links) key data names take in the currently-processed row. Otherwise the values of the key data names should be provided (cat[a=b].obj) in the dREL method. Certainly, once we have supplied all the key data names to Set categories to allow multi-blocks we will need to check that everything works for _refln.f_complex (the most complicated dREL method we have).

As for the current PR, the Evaluation method was slightly easier to process as one could automatically assume it returns the missing value without analysing the actual code. On the other hand, the Definition method may modify any of the definition attributes, thus a more detailed analysis of the code is needed to figure out that the value is being assigned by modifying the default value. Currently, the CIF validator from the cod-tools package generally is not capable of interpreting dREL, but I can implement a hackish solution for now to handled just such cases.

In practice a Definition method can only change single-valued attributes, as there is no way to loop over a Loop-type DDLm category. The only such categories relevant for value creation are _enumeration.default and _units.code, I think, so a simple text search for enumeration.default = .... would probably suffice.

jamesrhester · 2023-03-22T02:08:35Z

I believe there might be a further issue: even if _publ_author.id passes validation, _publ_contact_author.id will still be missing and therefore validation will fail. To fix this we should add dREL code for _publ_contact_author.id that finds the correct index in _publ_author.id based on the old key data names.

Although we can generate arbitrary ids for parent categories publ_author and audit_author where they are missing, we cannot do the same for their child categories. This commit adds dREL that searches for matching names in the parent categories and determines the missing identifier accordingly.

vaitkus · 2023-03-22T03:43:12Z

I believe there might be a further issue: even if _publ_author.id passes validation, _publ_contact_author.id will still be missing and therefore validation will fail. To fix this we should add dREL code for _publ_contact_author.id that finds the correct index in _publ_author.id based on the old key data names.

Yes, the same missing key problem would technically occur with the PUBL_CONTACT_AUTHOR category, but I was slightly less worried about it since it is not a prevalent in the existing CIF files. However, I would like to note, that originally in DDL1 the PUBL_AUTHOR and PUBL_CONTACT_AUTHOR categories were never explicitly linked together. What is more, _publ_contact_author_* items were unlooped as they belonged to the PUBL category so the _publ_contact_author_name item never served as a proper category key either. The logic behind the proposed dREL code is clearly correct (contact author must be one of the publication authors), but I just wanted to point out that it might expect a bit too much from the _publ_contact_author_name values (often they do not exactly match the _publ_author_name values due to a difference in formatting, etc.)

jamesrhester · 2023-03-22T04:34:06Z

Understood. I think by making this a default value, which might be missing if everything is not perfect, we are at least making an attempt at supporting legacy files without enforcing values (as an Evaluation method would).

vaitkus

After the discussion the provided methods look great. I suggested a few minor tweaks for the _audit_contact_author.id method, if you are ok with them, we should also propagate them to the _publ_contact_author method.

Also, I guess issue COMCIFS/dREL#6 should first be resolved before merging this?

cif_core.dic

Co-authored-by: Antanas Vaitkus <antanas.vaitkus90@gmail.com>

jamesrhester · 2023-03-23T02:12:17Z

Those changes are fine. Yes, we should resolve COMCIFS/dREL#6 first.

vaitkus · 2023-03-28T09:15:19Z

cif_core.dic

+    _method.purpose               Definition
+    _method.expression
+;
+    _enumeration.default = Current_Row(audit_author) + 1


The Unique_id function (COMCIFS/dREL#10) might be more suitable here.

If not, the Current_row() call should be updated to reflect the changes implemented in COMCIFS/dREL#9.

vaitkus · 2023-03-28T09:16:08Z

cif_core.dic

+    _method.purpose               Definition
+    _method.expression
+;
+    _enumeration.default = Current_Row(publ_author) + 1


The Unique_id function (COMCIFS/dREL#10) might be more suitable here.

If not, the Current_row() call should be updated to reflect the changes implemented in COMCIFS/dREL#9.

Missing identifiers can now be generated by dREL using the Unique_id function.

cif_core.dic

vaitkus · 2023-03-30T20:53:19Z

As all of the related issues seem to be resolved, I will merge this PR. In the future, if we get any more relations similar to those between AUDIT_CONTACT and AUDIT_CONTACT_AUTHOR it might be a good idea to parameterise and refactor the dREL code into a separate reusable function.

jamesrhester mentioned this pull request Mar 21, 2023

Need a unique_id function COMCIFS/dREL#5

Closed

vaitkus reviewed Mar 23, 2023

View reviewed changes

cif_core.dic Outdated Show resolved Hide resolved

cif_core.dic Outdated Show resolved Hide resolved

jamesrhester and others added 2 commits March 23, 2023 13:11

Improve dREL for _audit_contact_author.id

63c651f

Co-authored-by: Antanas Vaitkus <antanas.vaitkus90@gmail.com>

Minor dREL improvement

81a56ff

Co-authored-by: Antanas Vaitkus <antanas.vaitkus90@gmail.com>

Update dREL code of _publ_contact_author.id.

0d0537b

vaitkus reviewed Mar 28, 2023

View reviewed changes

Use new Unique_id function for creating missing author ids

abb2de2

Missing identifiers can now be generated by dREL using the Unique_id function.

jamesrhester mentioned this pull request Mar 30, 2023

Update the Current_row() function call. #360

Closed

vaitkus reviewed Mar 30, 2023

View reviewed changes

cif_core.dic Outdated Show resolved Hide resolved

Add the object identifier part in the Current_row() subroutine call.

d05288f

vaitkus approved these changes Mar 30, 2023

View reviewed changes

vaitkus merged commit 3e0f84d into COMCIFS:master Mar 30, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Provide default values for author identifiers #359

Provide default values for author identifiers #359

jamesrhester commented Mar 20, 2023

vaitkus commented Mar 20, 2023

jamesrhester commented Mar 21, 2023

vaitkus commented Mar 21, 2023

jamesrhester commented Mar 22, 2023

jamesrhester commented Mar 22, 2023

vaitkus commented Mar 22, 2023

jamesrhester commented Mar 22, 2023

vaitkus left a comment

jamesrhester commented Mar 23, 2023

vaitkus Mar 28, 2023

vaitkus Mar 28, 2023

vaitkus commented Mar 30, 2023

Provide default values for author identifiers #359

Provide default values for author identifiers #359

Conversation

jamesrhester commented Mar 20, 2023

vaitkus commented Mar 20, 2023

jamesrhester commented Mar 21, 2023

vaitkus commented Mar 21, 2023

jamesrhester commented Mar 22, 2023

jamesrhester commented Mar 22, 2023

vaitkus commented Mar 22, 2023

jamesrhester commented Mar 22, 2023

vaitkus left a comment

Choose a reason for hiding this comment

jamesrhester commented Mar 23, 2023

vaitkus Mar 28, 2023

Choose a reason for hiding this comment

vaitkus Mar 28, 2023

Choose a reason for hiding this comment

vaitkus commented Mar 30, 2023