Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Provide default values for author identifiers #359

Merged
merged 7 commits into from
Mar 30, 2023

Conversation

jamesrhester
Copy link
Contributor

Recent changes to the author categories introduced author identifiers, which were key data names. However, legacy data files will not have this data name and thus fail validation. The current changes introduce dREL methods that provide default values. Closes #103 .

Note that the dREL is implemented using a Definition method rather than an Evaluation method, as the Evaluation method would imply a single correct value, and validation would therefore fail if the file contained any values for the identifiers that were not the row numbers. Evaluation methods were appropriate for _space_group_symop.id because legacy CIFs used the space_group_symop category row number to refer to symops.

The Definition method creates a new value for the default value for each row. This is an original use of Definition methods, which up until now have been used in Set categories only, that is, _enumeration.default might be interpreted as creating a category-wide default. Now that Set categories are simply seen as Loop categories that have been split over several data blocks, there is no fundamental difference between a per-row default and a per-Set-category-row default. But please think if there is any reason not to allow this behaviour.

Recent changes to the author categories introduced author identifiers, which were key data names. However, legacy data files will not have this data name and thus fail validation. The current changes introduce dREL methods that provide default values.
@vaitkus
Copy link
Collaborator

vaitkus commented Mar 20, 2023

Thank you for pointing out the not so obvious distinction between Definition and Evaluation methods. However, shouldn't we rely on the Validation methods to carry out the correct type of comparison rather than require for the provided and evaluated values to always match? This way we could use Evaluation rules to determine the id values and, if needed, explicitly specify that they should not be validated (i.e. by defining the Validation methods as '.')?

I quite like the current assumption that it is sufficient to run Definition methods once per file while the Evaluation methods can/should be run for each value. What do you think?

Also, as mentioned in #103, wouldn't it be better to use a method that does not tie the generated id to the row number? Using something like a uuid would help to prevent accidental id clashes and would be more in line with the idea that loop order should/can be generally ignored when dealing with DDLm ontologies.

Finally, note that according to the interpretation that you provided, the Evaluation method is also not fully suitable for _space_group_symop.id data item as it could lead to similar validation issues. This is due to the fact that symop ids are only recommended to match the row numbers, but not formally required to do so (except for x,y,z which is required to have the id of 1):

It is normally the sequence number of the entry in that
list, and should be identified with the code 'n' in the
geometry symmetry codes of the form 'n_pqr'.

@jamesrhester
Copy link
Contributor Author

Thank you for pointing out the not so obvious distinction between Definition and Evaluation methods. However, shouldn't we rely on the Validation methods to carry out the correct type of comparison rather than require for the provided and evaluated values to always match? This way we could use Evaluation rules to determine the id values and, if needed, explicitly specify that they should not be validated (i.e. by defining the Validation methods as '.')?

The meaning of Evaluation has always been that it calculates the correct value of a missing data item, not that it calculates one possible value, so the results of Evaluation must be usable as a way to check values. Validation makes very little sense as a separate method purpose attached to a single data name, as it either fulfills the same role as Evaluation, by finding incorrect values, or it checks collections of values, e.g. that cell parameters match the space group symmetry, and is therefore not associated with any single data name. That's why I've always considered Validation methods to belong to a special dictionary that defines "validation" data names, like "_valid.cell_symmetry_consistent", and even in that case Evaluation would work just as well. A further issue is that currently we can only have a single method attached to a definition. We do need to define a _method.id to fix that, but I've been putting that off.

I quite like the current assumption that it is sufficient to run Definition methods once per file while the Evaluation methods can/should be run for each value. What do you think?

In the dREL context, there are no data blocks or data files, only collections of tables. So the simplest option for Definition methods is that they do evaluate for every row. The alternative might be to say something like "once for every distinct value of a Set category key", which is more complex to implement and means that Set categories acquire special status in dREL, which they haven't had before.

Also, as mentioned in #103, wouldn't it be better to use a method that does not tie the generated id to the row number? Using something like a uuid would help to prevent accidental id clashes and would be more in line with the idea that loop order should/can be generally ignored when dealing with DDLm ontologies.

Yes, CurrentRow is fairly appalling in many senses. I agree that it would be great to introduce a new dREL function "unique_id()", to replace CurrentRow. I'll raise an issue in the dREL repository where we can discuss it further.

Finally, note that according to the interpretation that you provided, the Evaluation method is also not fully suitable for _space_group_symop.id data item as it could lead to similar validation issues. This is due to the fact that symop ids are only recommended to match the row numbers, but not formally required to do so (except for x,y,z which is required to have the id of 1):

It is normally the sequence number of the entry in that
list, and should be identified with the code 'n' in the
geometry symmetry codes of the form 'n_pqr'.

Unfortunately 3 decades of practice have made that "normally" into "always". This is hard-coded into too much software to change now. We should change that definition to move away from sequence number and simply say that it is the value of _space_group_symop.id, and the row number when this is missing.

But, given my comments above, CurrentRow is not guaranteed to work for merged data blocks, where there might be several space groups and therefore several lists of symmetry operators. Essentially we want to change its definition so that we are talking about the row number where all key data names except for one are constant.

@vaitkus
Copy link
Collaborator

vaitkus commented Mar 21, 2023

The meaning of Evaluation has always been that it calculates the correct value of a missing data item, not that it calculates one possible value, so the results of Evaluation must be usable as a way to check values. Validation makes very little sense as a separate method purpose attached to a single data name, as it either fulfills the same role as Evaluation, by finding incorrect values, or it checks collections of values, e.g. that cell parameters match the space group symmetry, and is therefore not associated with any single data name. That's why I've always considered Validation methods to belong to a special dictionary that defines "validation" data names, like "_valid.cell_symmetry_consistent", and even in that case Evaluation would work just as well.

Understood, so the main idea is that Evaluation methods base their results on other data items and should generally always provide the same output between different runs (thus not fit for the current id use case).

A further issue is that currently we can only have a single method attached to a definition. We do need to define a _method.id to fix that, but I've been putting that off.

This is just a nitpick, but a definition can have several attached methods, but just not several attached methods of the same type (_method.purpose serves as the key). However, the _method.id item would be useful since, in theory, one could provide several alternative Evaluation methods that calculate the value based on the availability of other items or several Validation methods which check different aspects of the validated data item (e.g. as suggested in #262).

In the dREL context, there are no data blocks or data files, only collections of tables. So the simplest option for Definition methods is that they do evaluate for every row. The alternative might be to say something like "once for every distinct value of a Set category key", which is more complex to implement and means that Set categories acquire special status in dREL, which they haven't had before.

But doesn't dREL still operate on a finite dataset which in the regular use case matches a single data block? Probably, not all of the currently defined methods could not handle merged multiblock data files (e.g. method provided with _refln.f_complex uses data from several categories without trying to determine if the data came from the same data block).

Unfortunately 3 decades of practice have made that "normally" into "always". This is hard-coded into too much software to change now. We should change that definition to move away from sequence number and simply say that it is the value of _space_group_symop.id, and the row number when this is missing.

But, given my comments above, CurrentRow is not guaranteed to work for merged data blocks, where there might be several space groups and therefore several lists of symmetry operators. Essentially we want to change its definition so that we are talking about the row number where all key data names except for one are constant.

Ok, these changes can be made in a separate PR once the discussions on the id generation functions are settled.

As for the current PR, the Evaluation method was slightly easier to process as one could automatically assume it returns the missing value without analysing the actual code. On the other hand, the Definition method may modify any of the definition attributes, thus a more detailed analysis of the code is needed to figure out that the value is being assigned by modifying the default value. Currently, the CIF validator from the cod-tools package generally is not capable of interpreting dREL, but I can implement a hackish solution for now to handled just such cases.

@jamesrhester
Copy link
Contributor Author

But doesn't dREL still operate on a finite dataset which in the regular use case matches a single data block? Probably, not all of the currently defined methods could not handle merged multiblock data files (e.g. method provided with _refln.f_complex uses data from several categories without trying to determine if the data came from the same data block).

That is the reason for a specific dREL behaviour: when indexing into a category in a dREL method, if the values of any key data names are not provided explicitly, their values can be assumed from the values any related (child-parent links) key data names take in the currently-processed row. Otherwise the values of the key data names should be provided (cat[a=b].obj) in the dREL method. Certainly, once we have supplied all the key data names to Set categories to allow multi-blocks we will need to check that everything works for _refln.f_complex (the most complicated dREL method we have).

As for the current PR, the Evaluation method was slightly easier to process as one could automatically assume it returns the missing value without analysing the actual code. On the other hand, the Definition method may modify any of the definition attributes, thus a more detailed analysis of the code is needed to figure out that the value is being assigned by modifying the default value. Currently, the CIF validator from the cod-tools package generally is not capable of interpreting dREL, but I can implement a hackish solution for now to handled just such cases.

In practice a Definition method can only change single-valued attributes, as there is no way to loop over a Loop-type DDLm category. The only such categories relevant for value creation are _enumeration.default and _units.code, I think, so a simple text search for enumeration.default = .... would probably suffice.

@jamesrhester
Copy link
Contributor Author

I believe there might be a further issue: even if _publ_author.id passes validation, _publ_contact_author.id will still be missing and therefore validation will fail. To fix this we should add dREL code for _publ_contact_author.id that finds the correct index in _publ_author.id based on the old key data names.

Although we can generate arbitrary ids for parent categories publ_author and audit_author where they are missing, we cannot do the same for their child categories. This commit adds dREL that searches for matching names in the parent categories and determines the missing identifier accordingly.
@vaitkus
Copy link
Collaborator

vaitkus commented Mar 22, 2023

I believe there might be a further issue: even if _publ_author.id passes validation, _publ_contact_author.id will still be missing and therefore validation will fail. To fix this we should add dREL code for _publ_contact_author.id that finds the correct index in _publ_author.id based on the old key data names.

Yes, the same missing key problem would technically occur with the PUBL_CONTACT_AUTHOR category, but I was slightly less worried about it since it is not a prevalent in the existing CIF files. However, I would like to note, that originally in DDL1 the PUBL_AUTHOR and PUBL_CONTACT_AUTHOR categories were never explicitly linked together. What is more, _publ_contact_author_* items were unlooped as they belonged to the PUBL category so the _publ_contact_author_name item never served as a proper category key either. The logic behind the proposed dREL code is clearly correct (contact author must be one of the publication authors), but I just wanted to point out that it might expect a bit too much from the _publ_contact_author_name values (often they do not exactly match the _publ_author_name values due to a difference in formatting, etc.)

@jamesrhester
Copy link
Contributor Author

Understood. I think by making this a default value, which might be missing if everything is not perfect, we are at least making an attempt at supporting legacy files without enforcing values (as an Evaluation method would).

Copy link
Collaborator

@vaitkus vaitkus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After the discussion the provided methods look great. I suggested a few minor tweaks for the _audit_contact_author.id method, if you are ok with them, we should also propagate them to the _publ_contact_author method.

Also, I guess issue COMCIFS/dREL#6 should first be resolved before merging this?

cif_core.dic Outdated Show resolved Hide resolved
cif_core.dic Outdated Show resolved Hide resolved
jamesrhester and others added 2 commits March 23, 2023 13:11
Co-authored-by: Antanas Vaitkus <antanas.vaitkus90@gmail.com>
Co-authored-by: Antanas Vaitkus <antanas.vaitkus90@gmail.com>
@jamesrhester
Copy link
Contributor Author

Those changes are fine. Yes, we should resolve COMCIFS/dREL#6 first.

cif_core.dic Outdated
_method.purpose Definition
_method.expression
;
_enumeration.default = Current_Row(audit_author) + 1
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Unique_id function (COMCIFS/dREL#10) might be more suitable here.

If not, the Current_row() call should be updated to reflect the changes implemented in COMCIFS/dREL#9.

cif_core.dic Outdated
_method.purpose Definition
_method.expression
;
_enumeration.default = Current_Row(publ_author) + 1
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Unique_id function (COMCIFS/dREL#10) might be more suitable here.

If not, the Current_row() call should be updated to reflect the changes implemented in COMCIFS/dREL#9.

Missing identifiers can now be generated by dREL using the Unique_id function.
cif_core.dic Outdated Show resolved Hide resolved
@vaitkus
Copy link
Collaborator

vaitkus commented Mar 30, 2023

As all of the related issues seem to be resolved, I will merge this PR. In the future, if we get any more relations similar to those between AUDIT_CONTACT and AUDIT_CONTACT_AUTHOR it might be a good idea to parameterise and refactor the dREL code into a separate reusable function.

@vaitkus vaitkus merged commit 3e0f84d into COMCIFS:master Mar 30, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

cif_core: _publ_author.name data item is insufficient as a unique key to the PUBL_AUTHOR category
2 participants