Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cif_core: _publ_author.name data item is insufficient as a unique key to the PUBL_AUTHOR category #103

Closed
Tracked by #317
vaitkus opened this issue Oct 9, 2018 · 8 comments · Fixed by #359
Closed
Tracked by #317

Comments

@vaitkus
Copy link
Collaborator

vaitkus commented Oct 9, 2018

The PUBL_AUTHOR category is defined as having the _publ_author.name as its primary key. Clearly, situations can arise where several authors having the same name co-write a paper (the COD currently contains about ~40 such entries);. What is more, no combination of the data items currently defined in the category seem to satisfy the conditions of the primary key -- all items seem to describe properties that can match for both authors (i.e. the address) or be simply be undefined (i.e. ORCID).

One of the possible solutions would be to create a new data item (i.e. _publ_author.ordinal) that takes the author order into account as is done in the CITATION_AUTHOR category (using the _citation_author.ordinal data item). The author order by itself could act as a sufficient primary key, however, this change would render most of the CIFs produced so far invalid. Using both the author order and the name as a composite key might be a more reasonable change since it would require the addition of ordinance only to those loops that contain the duplicate author names.

@jamesrhester
Copy link
Contributor

I agree with the proposal to add _publ_author.ordinal as a new data name and to make this part of a composite key. I will flag this to the core CIF group.

@jamesrhester
Copy link
Contributor

Update - cif core had no opinion, tried cif-developers.

@jamesrhester
Copy link
Contributor

jamesrhester commented Mar 13, 2020

Update - cif-developers didn't seem much perturbed either way. So here are some updated definitions which I plan to run by the Core DMG. I have written them to replace _publ_author.name as the category key, as it wasn't really working as a key anyway. The only possible problem is that dictionary-aware software will complain about the missing key.

save_PUBL_AUTHOR

_definition.id                          PUBL_AUTHOR
_definition.scope                       Category
_definition.class                       Loop
_definition.update                      2020-03-xx
_description.text                       
;
     Category of data items recording the author information.
;
_name.category_id                       PUBL
_name.object_id                         PUBL_AUTHOR
_category.key_id                        '_publ_author.id'
loop_
  _category_key.name
         '_publ_author.id' 

save_
save_publ_author.id
    _definition.id              '_publ_author.id'
    _definition.update          2020-03-xx
    _description.text
;              Arbitrary identifier for this author
;
    _name.category_id                       publ_author
    _name.object_id                         id
    _type.purpose                           Key
    _type.source                             Assigned
    _type.container                         Single
    _type.contents                          Code
save_

@jamesrhester
Copy link
Contributor

Note that audit_author would also be updated in the same way.

@jamesrhester
Copy link
Contributor

Identifiers and links added in c70b1e8.

@vaitkus
Copy link
Collaborator Author

vaitkus commented Mar 16, 2023

I tried validating the existing CIF files against the new version of the dictionary and ran into a bit of a problem. Since the newly added _publ_author.id data item is now the primary key of the PUBL_AUTHOR category, all older CIF files with the PUBL_AUTHOR loop now cause validation issues related to the missing category key. Note, that these key values are not considered as automatically derivable (at least under our current approach).

Some possible solutions:

  1. Implemented the primary key solution that I originally suggested:

One of the possible solutions would be to create a new data item (i.e. _publ_author.ordinal) that takes the author order into account as is done in the CITATION_AUTHOR category (using the _citation_author.ordinal data item). The author order by itself could act as a sufficient primary key, however, this change would render most of the CIFs produced so far invalid. Using both the author order and the name as a composite key might be a more reasonable change since it would require the addition of ordinance only to those loops that contain the duplicate author names.

The ordinal data item would be useful either way, so I will file a separate issue for its inclusion if this approach is not accepted.

  1. Make a composite key of _publ_author.id and _publ_author.name. This would make the _publ_author.id mandatory only when the _publ_author.name are not unique.

However, both of these approaches have the drawback of no longer requiring the _publ_author.id to be unique and thus no longer fit to serve as an foreign key to the _publ_contact_author.id key data item.

@jamesrhester , what is your opinion on this? It would be nice to resolve this prior to the release of the new version of the dictionary which formally introduces the _publ_author.id data item.

@vaitkus vaitkus reopened this Mar 16, 2023
@jamesrhester
Copy link
Contributor

This looks to be the identical situation to _space_group_symop.id, which was also introduced as a new key data name, and is missing from older CIF files. In that case, dREL was introduced which simply assigned a value based on the current row. _space_group_symop.id was a particularly delicate case, as the the symop number is widely used in other data values, and so the number really had to be the row number, as that had been assumed from the very beginning of CIF. In our case, we would use current row simply as a convenient way to obtain a unique identifier.

In terms of validation, I think there is value in raising a validation issue (perhaps in some sort of strict mode?) as eventually we would like software to move to using the new data names. Nevertheless, if dREL were added to the _publ_author.id would that satisfy the validation software?

Ideally, dREL would have a function, named something like unique_id(), which would return an identifier guaranteed to be unique within the category, so that we maintain the idea that row order is irrelevant. Of course, under the hood this function could just return the current row number.

@vaitkus
Copy link
Collaborator Author

vaitkus commented Mar 19, 2023

You seem to have identified the best approach. Adding a dREL evaluation code would silence the validator in a semantically sound way.

A unique_id() function would indeed be useful, but I guess it would be better for it to return an id that is unique not only within the category, but a globally unique one (for example, a uuid). This way we could be reasonably sure that the generation of a new id does not introduce unexpected relations to other categories (e.g. the file may contain an item from another category that is linked to _publ_author.id).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants