Usage of datestamp from Record during harvest using oai_dc format #8693

Open
tcoupin opened this issue May 12, 2022 · 4 comments
@tcoupin
Member

tcoupin commented May 12, 2022

When I set up a harvesting client using the oai_dc format, the datestamp from the record header is used as the release time and is displayed above the citation:

<header>
    <identifier>https://doi.org/10.15454/1TPZGG</identifier>
    <datestamp>2020-10-08T13:01:11Z</datestamp>
    <setSpec>Doi2Pmh</setSpec>
    <setSpec>test-dduni</setSpec>
</header>

and the result:
image

I think this is the correct behavior, given the definition of the datestamp tag: "the date of creation, modification or deletion of the record for the purpose of selective harvesting". It is also consistent with hosted datasets, where this date is used as the release time of the DatasetVersion object and therefore for sorting datasets on display.

But when I harvest another Dataverse installation with this format, the datestamp is the export date of the record, and this date can be updated without a new version of the dataset.

So I have 2 questions:

1- Why can the datestamp be updated without a new version of the dataset?
2- Can I modify the code so that the datestamp is the release time? A full harvest would then be necessary for the client to roll back to a release time older than the current record datestamp.

Which version of Dataverse are you using?

5.5 or 5.10.1

Any related open or closed issues to this bug report?

@poikilotherm
Contributor

poikilotherm commented May 12, 2022

Hi @tcoupin,

as I have my hands dirty with XOAI and harvesting code right now, I dug around the Dataverse OAI-PMH client code to see whether this might be a problem with the xoai-service-provider module. Turns out it's not even used: the complete parsing of OAI responses is done with a custom parser.

Digging deeper, I tried to find places where the <header><datestamp> field is used. What I found does not support your issue description: it seems to be ignored!

    private void processHeader(XMLStreamReader xmlr) throws XMLStreamException {
        for (int event = xmlr.next(); event != XMLStreamConstants.END_DOCUMENT; event = xmlr.next()) {
            if (event == XMLStreamConstants.START_ELEMENT) {
                if (xmlr.getLocalName().equals("identifier")) {/*do nothing*/}
                else if (xmlr.getLocalName().equals("datestamp")) {/*do nothing -- ?*/}
                else if (xmlr.getLocalName().equals("setSpec")) {/*do nothing*/}
            } else if (event == XMLStreamConstants.END_ELEMENT) {
                if (xmlr.getLocalName().equals("header")) return;
            }
        }
    }
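
For contrast, here is a minimal sketch of what actually reading the value would look like with the same StAX API. This is not the Dataverse parser: the class name HeaderDatestampSketch and the helper readDatestamp are made up for illustration, and the input is just the header from the issue description.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class HeaderDatestampSketch {

    // Hypothetical helper: returns the text of the first <datestamp> found
    // before the closing </header>, or null if the header has none.
    static String readDatestamp(String xml) {
        try {
            XMLStreamReader xmlr = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            try {
                while (xmlr.hasNext()) {
                    int event = xmlr.next();
                    if (event == XMLStreamConstants.START_ELEMENT
                            && xmlr.getLocalName().equals("datestamp")) {
                        // getElementText() reads the text content up to the end tag.
                        return xmlr.getElementText().trim();
                    }
                    if (event == XMLStreamConstants.END_ELEMENT
                            && xmlr.getLocalName().equals("header")) {
                        return null;
                    }
                }
                return null;
            } finally {
                xmlr.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        String header = "<header>"
                + "<identifier>https://doi.org/10.15454/1TPZGG</identifier>"
                + "<datestamp>2020-10-08T13:01:11Z</datestamp>"
                + "<setSpec>Doi2Pmh</setSpec>"
                + "</header>";
        System.out.println(readDatestamp(header));
    }
}
```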

The import of the Dublin Core metadata written out as a file by FastGetRecord.harvestRecord() is quite complex. Maybe you can dig through it to see how the dates are mapped?

@poikilotherm
Contributor

Digging through the server code, I find that <datestamp> is created via harvest.server.xoai.Xitem.getDatestamp(), which requests OAIRecord.getLastUpdateTime(). This is a database entity, updated via the OAIRecordServiceBean.updateOaiRecordForDataset() with the current Date of method execution.

Possible ways to get there:

  1. API trigger to /api/exportOAI/<setname>:
    Metadata.exportOaiSet() -> ... -> OAISetServiceBean.exportOaiSet() -> OAIRecordServiceBean.updateOaiRecords() -> OAIRecordServiceBean.updateOaiRecordForDataset()
  2. UI interaction on the Sets page (creating, editing, ...):
    HarvestingSetsPage.runSetExport() -> OAISetServiceBean.exportOaiSet() -> OAIRecordServiceBean.updateOaiRecords() -> OAIRecordServiceBean.updateOaiRecordForDataset()
  3. Export Timer Service:
    DataverseTimerServiceBean.handleTimeout() -> OAISetServiceBean.exportAllSets() -> OAISetServiceBean.exportOaiSet() -> OAIRecordServiceBean.updateOaiRecords() -> OAIRecordServiceBean.updateOaiRecordForDataset()

There is a check in place comparing the last export time of the record and of the dataset, so the entry will only be updated if the last metadata export is newer than the OAI record's last creation.

Of course this might be prone to a race condition or other error, where a dataset keeps getting exported etc. Someone would need to debug this more closely...
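
To make the check described above concrete, here is a rough sketch based only on that description, not on the actual OAIRecordServiceBean code. The class OaiRecordUpdateSketch, the nested OaiRecord stand-in and the method updateIfExportIsNewer are all hypothetical names. It shows why a regenerated metadata export alone could move the datestamp, assuming the record is stamped with the time of method execution.

```java
import java.util.Date;

class OaiRecordUpdateSketch {

    // Hypothetical stand-in for the OAIRecord entity; only the field discussed here.
    static class OaiRecord {
        Date lastUpdateTime;

        OaiRecord(Date lastUpdateTime) {
            this.lastUpdateTime = lastUpdateTime;
        }
    }

    // Assumed rule: touch the record only if the dataset's last metadata export
    // is newer than the record's last update. The record is then stamped with
    // "now" (the time of method execution), which is what ends up in <datestamp>.
    static boolean updateIfExportIsNewer(OaiRecord record, Date lastExportTime, Date now) {
        if (lastExportTime != null && lastExportTime.after(record.lastUpdateTime)) {
            record.lastUpdateTime = now;
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        OaiRecord record = new OaiRecord(new Date(1_000L));
        // Same dataset version, but its metadata export was regenerated at t=2000.
        boolean changed = updateIfExportIsNewer(record, new Date(2_000L), new Date(3_000L));
        System.out.println(changed + " " + record.lastUpdateTime.getTime());
    }
}
```

Under these assumptions, nothing in the check looks at the dataset version or its release time, so any re-export is enough to produce a new datestamp.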

@tcoupin
Member Author

tcoupin commented May 12, 2022

Hi @poikilotherm

On the client side:
the datestamp is extracted from the ListIdentifiers response, so it is not necessary to parse it again in the GetRecord response:

The release date is finally set in the import service bean:

ds.getVersions().get(0).setReleaseTime(oaiDateStamp);
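
As an illustration of what ends up in oaiDateStamp, here is a small sketch of turning a header datestamp into a java.util.Date. OAI-PMH datestamps are in UTC and come in day or seconds granularity; the header in this issue uses seconds. How Dataverse itself parses the value is not shown in this thread, so the class OaiDatestampSketch and the helper parseOaiDatestamp are assumptions for illustration only.

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneOffset;
import java.util.Date;

class OaiDatestampSketch {

    // Hypothetical helper: accepts "YYYY-MM-DD" or "YYYY-MM-DDThh:mm:ssZ".
    static Date parseOaiDatestamp(String value) {
        String v = value.trim();
        if (v.length() == 10) {
            // Day granularity: interpret as midnight UTC.
            return Date.from(LocalDate.parse(v).atStartOfDay(ZoneOffset.UTC).toInstant());
        }
        // Seconds granularity: an ISO-8601 instant in UTC.
        return Date.from(Instant.parse(v));
    }

    public static void main(String[] args) {
        Date releaseTime = parseOaiDatestamp("2020-10-08T13:01:11Z");
        System.out.println(releaseTime.toInstant());
    }
}
```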

On the server side:
I found many datasets where the datestamp does not match the last update date. Example:

With the oai_dc format, there are only 2 dates: one in the header and one in the date tag (not set in Dataverse OAI server responses). The date tag refers to the creation or publication date, so it is not appropriate to use it to store the last modification date. Currently, the date is mapped to the publication date field of the dataset (https://github.com/IQSS/dataverse/blob/develop/src/main/resources/db/migration/afterMigrate__1-7256-upsert-referenceData.sql#L20)

I still don't understand why the datestamp is the date of the export and not the date of the last modification.

@poikilotherm
Contributor

poikilotherm commented May 13, 2022

You, Sir @tcoupin, are absolutely right about the client code using the header <datestamp>. Thanks for pointing that out! This part of the code is often confusing to read (which is why @landreev and I are trying to refactor it).

About the date missing from the oai_dc: it looks like this is a bug no one actually noticed for a LONG time.

Looking at DublinCoreExportUtil.createOAIDC() and the test class DublinCoreExportUtilTest, the whole method was never covered by tests, so no one noticed the missing metadata. There are several dates that should be written, but this fails silently: empty elements are simply skipped. They just should not be empty in the first place. xmlunit to the rescue... 🙈

And with regard to the <datestamp> getting updated, I bet there is a tricky-to-find race condition. @landreev, it looks like this whole metadata export thing needs love and tests. 🤯
