cif_core: _publ_author.name data item is insufficient as a unique key to the PUBL_AUTHOR category #103

vaitkus · 2018-10-09T02:17:29Z

The PUBL_AUTHOR category is defined as having the _publ_author.name as its primary key. Clearly, situations can arise where several authors having the same name co-write a paper (the COD currently contains about ~40 such entries);. What is more, no combination of the data items currently defined in the category seem to satisfy the conditions of the primary key -- all items seem to describe properties that can match for both authors (i.e. the address) or be simply be undefined (i.e. ORCID).

One of the possible solutions would be to create a new data item (i.e. _publ_author.ordinal) that takes the author order into account as is done in the CITATION_AUTHOR category (using the _citation_author.ordinal data item). The author order by itself could act as a sufficient primary key, however, this change would render most of the CIFs produced so far invalid. Using both the author order and the name as a composite key might be a more reasonable change since it would require the addition of ordinance only to those loops that contain the duplicate author names.

The text was updated successfully, but these errors were encountered:

jamesrhester · 2019-03-26T05:30:44Z

I agree with the proposal to add _publ_author.ordinal as a new data name and to make this part of a composite key. I will flag this to the core CIF group.

jamesrhester · 2019-06-28T04:38:18Z

Update - cif core had no opinion, tried cif-developers.

jamesrhester · 2020-03-13T03:55:55Z

Update - cif-developers didn't seem much perturbed either way. So here are some updated definitions which I plan to run by the Core DMG. I have written them to replace _publ_author.name as the category key, as it wasn't really working as a key anyway. The only possible problem is that dictionary-aware software will complain about the missing key.

save_PUBL_AUTHOR

_definition.id                          PUBL_AUTHOR
_definition.scope                       Category
_definition.class                       Loop
_definition.update                      2020-03-xx
_description.text                       
;
     Category of data items recording the author information.
;
_name.category_id                       PUBL
_name.object_id                         PUBL_AUTHOR
_category.key_id                        '_publ_author.id'
loop_
  _category_key.name
         '_publ_author.id' 

save_

save_publ_author.id
    _definition.id              '_publ_author.id'
    _definition.update          2020-03-xx
    _description.text
;              Arbitrary identifier for this author
;
    _name.category_id                       publ_author
    _name.object_id                         id
    _type.purpose                           Key
    _type.source                             Assigned
    _type.container                         Single
    _type.contents                          Code
save_

jamesrhester · 2020-06-23T05:18:48Z

Note that audit_author would also be updated in the same way.

jamesrhester · 2020-08-13T06:19:21Z

Identifiers and links added in c70b1e8.

vaitkus · 2023-03-16T14:05:09Z

I tried validating the existing CIF files against the new version of the dictionary and ran into a bit of a problem. Since the newly added _publ_author.id data item is now the primary key of the PUBL_AUTHOR category, all older CIF files with the PUBL_AUTHOR loop now cause validation issues related to the missing category key. Note, that these key values are not considered as automatically derivable (at least under our current approach).

Some possible solutions:

Implemented the primary key solution that I originally suggested:

One of the possible solutions would be to create a new data item (i.e. _publ_author.ordinal) that takes the author order into account as is done in the CITATION_AUTHOR category (using the _citation_author.ordinal data item). The author order by itself could act as a sufficient primary key, however, this change would render most of the CIFs produced so far invalid. Using both the author order and the name as a composite key might be a more reasonable change since it would require the addition of ordinance only to those loops that contain the duplicate author names.

The ordinal data item would be useful either way, so I will file a separate issue for its inclusion if this approach is not accepted.

Make a composite key of _publ_author.id and _publ_author.name. This would make the _publ_author.id mandatory only when the _publ_author.name are not unique.

However, both of these approaches have the drawback of no longer requiring the _publ_author.id to be unique and thus no longer fit to serve as an foreign key to the _publ_contact_author.id key data item.

@jamesrhester , what is your opinion on this? It would be nice to resolve this prior to the release of the new version of the dictionary which formally introduces the _publ_author.id data item.

jamesrhester · 2023-03-17T00:52:21Z

This looks to be the identical situation to _space_group_symop.id, which was also introduced as a new key data name, and is missing from older CIF files. In that case, dREL was introduced which simply assigned a value based on the current row. _space_group_symop.id was a particularly delicate case, as the the symop number is widely used in other data values, and so the number really had to be the row number, as that had been assumed from the very beginning of CIF. In our case, we would use current row simply as a convenient way to obtain a unique identifier.

In terms of validation, I think there is value in raising a validation issue (perhaps in some sort of strict mode?) as eventually we would like software to move to using the new data names. Nevertheless, if dREL were added to the _publ_author.id would that satisfy the validation software?

Ideally, dREL would have a function, named something like unique_id(), which would return an identifier guaranteed to be unique within the category, so that we maintain the idea that row order is irrelevant. Of course, under the hood this function could just return the current row number.

vaitkus · 2023-03-19T22:39:06Z

You seem to have identified the best approach. Adding a dREL evaluation code would silence the validator in a semantically sound way.

A unique_id() function would indeed be useful, but I guess it would be better for it to return an id that is unique not only within the category, but a globally unique one (for example, a uuid). This way we could be reasonably sure that the generation of a new id does not introduce unexpected relations to other categories (e.g. the file may contain an item from another category that is linked to _publ_author.id).

jamesrhester added the In consultation label Mar 26, 2019

jamesrhester mentioned this issue Sep 30, 2019

Revamp author categories #158

Open

jamesrhester closed this as completed Aug 13, 2020

vaitkus reopened this Mar 16, 2023

jamesrhester mentioned this issue Mar 17, 2023

[RELEASE ISSUE]: Checklist for next release of cif_core #317

Closed

15 tasks

jamesrhester mentioned this issue Mar 20, 2023

Provide default values for author identifiers #359

Merged

jamesrhester added Change approved dREL and removed In consultation labels Mar 22, 2023

vaitkus closed this as completed in #359 Mar 30, 2023

jamesrhester mentioned this issue Aug 26, 2024

Need for _refln.id items? #499

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

cif_core: _publ_author.name data item is insufficient as a unique key to the PUBL_AUTHOR category #103

cif_core: _publ_author.name data item is insufficient as a unique key to the PUBL_AUTHOR category #103

vaitkus commented Oct 9, 2018

jamesrhester commented Mar 26, 2019

jamesrhester commented Jun 28, 2019

jamesrhester commented Mar 13, 2020 •

edited

Loading

jamesrhester commented Jun 23, 2020

jamesrhester commented Aug 13, 2020

vaitkus commented Mar 16, 2023

jamesrhester commented Mar 17, 2023

vaitkus commented Mar 19, 2023

cif_core: _publ_author.name data item is insufficient as a unique key to the PUBL_AUTHOR category #103

cif_core: _publ_author.name data item is insufficient as a unique key to the PUBL_AUTHOR category #103

Comments

vaitkus commented Oct 9, 2018

jamesrhester commented Mar 26, 2019

jamesrhester commented Jun 28, 2019

jamesrhester commented Mar 13, 2020 • edited Loading

jamesrhester commented Jun 23, 2020

jamesrhester commented Aug 13, 2020

vaitkus commented Mar 16, 2023

jamesrhester commented Mar 17, 2023

vaitkus commented Mar 19, 2023

jamesrhester commented Mar 13, 2020 •

edited

Loading