Problems with Harvested Records #9261

yikangfengnie · 2023-01-05T08:50:58Z

Hello from NIE,

We have an arrangement where NTU Dataverse (https://researchdata.ntu.edu.sg/) harvests records from NIE Dataverse (https://researchdata.nie.edu.sg/).

NIE is using Dataverse v5.8. It was upgraded from v5.4 in June 2022.

In November 2022, NTU reported that the records they harvested from NIE were missing some information; e.g. "versionnumber" was null even though "versionstate" was "RELEASED" in the table "datasetversion".

However, when NIE checked the same set of records in the NIE Dataverse instance via SQL, "versionnumber" was not null where "versionstate" was "RELEASED" in the table "datasetversion".

Apparently, the harvesting via OAI did not capture the "versionnumber" correctly.

To check whether the above issue is specific to the NIE Dataverse instance, we used the NIE Dataverse Test Server (https://researchdatatest.nie.edu.sg/) to harvest records from NTU Dataverse. In our Test Server, the records harvested from NTU Dataverse have the same problem: "versionnumber" was null even though "versionstate" was "RELEASED" in the table "datasetversion".
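
For anyone who wants to repeat the check, here is a minimal sketch of the query involved. It is an illustration only: the in-memory SQLite table is a toy stand-in for the real "datasetversion" table (which lives in PostgreSQL and has many more columns), and the three rows are made up.

```python
import sqlite3

# Toy stand-in for the Dataverse "datasetversion" table; the real table is in
# PostgreSQL and has many more columns. The rows below are invented examples.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE datasetversion ("
    "id INTEGER PRIMARY KEY, dataset_id INTEGER, "
    "versionnumber INTEGER, versionstate TEXT)"
)
conn.executemany(
    "INSERT INTO datasetversion VALUES (?, ?, ?, ?)",
    [
        (1, 10, 1, "RELEASED"),     # released, version number present
        (2, 11, None, "RELEASED"),  # released, version number missing
        (3, 12, None, "DRAFT"),     # not released, so not of interest here
    ],
)

# The check described above: released versions whose versionnumber is null.
QUERY = (
    "SELECT id, dataset_id FROM datasetversion "
    "WHERE versionstate = 'RELEASED' AND versionnumber IS NULL "
    "ORDER BY id"
)
rows_missing_version = conn.execute(QUERY).fetchall()
print(rows_missing_version)
```

On a real installation the same SELECT statement would be run directly against the Dataverse database rather than through SQLite.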

Below is the client config (as at 5 Jan 2023) of NTU Dataverse that harvests records from NIE Dataverse. Many thanks!

{
  "status": "OK",
  "data": {
    "harvestingClients": [
      {
        "nickName": "harvested-nie",
        "dataverseAlias": "harvested-nie",
        "type": "oai",
        "harvestUrl": "https://researchdata.nie.edu.sg/oai",
        "archiveUrl": "https://researchdata.nie.edu.sg",
        "archiveDescription": "This Dataset is harvested from our partners. Clicking the link will take you directly to the archival source of the data.",
        "metadataFormat": "oai_dc",
        "set": "N/A",
        "schedule": "none",
        "status": "inActive",
        "lastHarvest": "Sun Nov 13 00:00:00 SGT 2022",
        "lastResult": "SUCCESS",
        "lastSuccessful": "Sun Nov 13 00:00:00 SGT 2022",
        "lastNonEmpty": "Sun Oct 02 00:00:00 SGT 2022",
        "lastDatasetsHarvested": "2",
        "lastDatasetsDeleted": "0",
        "lastDatasetsFailed": "0"
      },
      {
        "nickName": "NIE_dataverse_ddi",
        "dataverseAlias": "harvested-nie",
        "type": "oai",
        "harvestUrl": "https://researchdata.nie.edu.sg/oai",
        "archiveUrl": "https://researchdata.nie.edu.sg",
        "archiveDescription": "This Dataset is harvested from our partners. Clicking the link will take you directly to the archival source of the data.",
        "metadataFormat": "oai_ddi",
        "set": "NIE_Test_Set2",
        "schedule": "none",
        "status": "inActive",
        "lastHarvest": "Fri Nov 18 12:13:05 SGT 2022",
        "lastResult": "SUCCESS",
        "lastSuccessful": "Fri Nov 18 12:13:05 SGT 2022",
        "lastNonEmpty": "Fri Nov 18 12:13:05 SGT 2022",
        "lastDatasetsHarvested": "1",
        "lastDatasetsDeleted": "0",
        "lastDatasetsFailed": "0"
      }
    ]
  }
}

Best wishes,
Yikang
NIE Library


landreev · 2023-01-20T23:13:06Z

Hello,
Sorry for the delay with this. There's a lot going on here...

Just looking at the NTU harvesting clients: why do they have TWO harvesting clients both configured for your server? (Presumably, your set "NIE_Test_Set2" is a subset of "everything", i.e. the default set; generally, you never want to harvest overlapping sets from the same server.)
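
The overlap can be checked mechanically from the client list posted in the issue. The sketch below is only an illustration: the function name find_overlapping_clients is hypothetical, and it assumes that a "set" value of "N/A" means the client harvests the default set (everything the server exposes), so it overlaps any other client pointed at the same OAI URL.

```python
from collections import defaultdict

# Trimmed copy of the client list from the issue (only the fields used here).
clients = [
    {"nickName": "harvested-nie",
     "harvestUrl": "https://researchdata.nie.edu.sg/oai",
     "metadataFormat": "oai_dc", "set": "N/A"},
    {"nickName": "NIE_dataverse_ddi",
     "harvestUrl": "https://researchdata.nie.edu.sg/oai",
     "metadataFormat": "oai_ddi", "set": "NIE_Test_Set2"},
]


def find_overlapping_clients(clients):
    """Return (whole_server_client, other_client) nickname pairs.

    Assumption: "set" == "N/A" means the client harvests the default set,
    so it overlaps every other client configured for the same harvestUrl.
    """
    by_url = defaultdict(list)
    for client in clients:
        by_url[client["harvestUrl"]].append(client)

    overlaps = []
    for group in by_url.values():
        for whole in (c for c in group if c["set"] == "N/A"):
            for other in group:
                if other is not whole:
                    overlaps.append((whole["nickName"], other["nickName"]))
    return overlaps


print(find_overlapping_clients(clients))
```

For the two clients above this reports a single overlapping pair, the whole-server oai_dc client together with the oai_ddi client.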

Also, note that the 2 client configs use 2 different metadata formats. The one that harvests only the set "NIE_Test_Set2" uses oai_ddi, but the one that harvests everything uses oai_dc. oai_dc ("Dublin Core") is a fairly primitive format; oai_ddi encodes much more information and is a much better format for harvesting between Dataverses. But do note that NEITHER oai_dc nor oai_ddi will preserve the versionnumber. I don't think this has anything to do with either of the 2 Dataverses involved upgrading from one version to another. I'm pretty sure that this has always been the case. Also, please note that with harvesting there is really no expectation that the harvested datasets will have all the database values replicated 1:1 from the source server. The harvester obtains whatever metadata can be encoded and re-imported with a given format and indexes it in the local search engine; it does not really try to replicate these datasets locally.

The 3rd harvestable format, "dataverse_json", encodes the most metadata, and does actually replicate the version number. It's a less reliable harvesting format (it's not going to work if, for example, the 2 installations are using different custom metadata blocks). If it were really, absolutely necessary for them to harvest your datasets with the version numbers, they could try to re-harvest everything in that format... But I strongly suspect that they don't really need them for any practical purposes. OK, this is already quite long, so let me explain in another comment.
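
To compare what each format actually carries, the records can be fetched directly from the OAI server and inspected. Below is a minimal sketch that only builds the standard OAI-PMH ListRecords request URLs; the helper name list_records_url is hypothetical, the base URL and set name are taken from the client config in the issue, and whether a given server answers for every format is not something this sketch checks.

```python
from urllib.parse import urlencode

# harvestUrl taken from the client config posted in the issue.
OAI_BASE = "https://researchdata.nie.edu.sg/oai"


def list_records_url(metadata_prefix, oai_set=None):
    """Build an OAI-PMH ListRecords request URL for one metadata format."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if oai_set:  # leave the argument out to request the whole repository
        params["set"] = oai_set
    return f"{OAI_BASE}?{urlencode(params)}"


# One URL per format discussed in this thread, restricted to the named set.
for fmt in ("oai_dc", "oai_ddi", "dataverse_json"):
    print(list_records_url(fmt, "NIE_Test_Set2"))
```

Opening the printed URLs in a browser (or with curl) shows the records as the harvester would receive them in each format.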

landreev · 2023-01-20T23:25:09Z

In addition to the Google group thread you linked above, is this the same issue as what was discussed here (also from NTU): https://groups.google.com/g/dataverse-community/c/O_-lDsRTLeM/m/UjR_HxVMBQAJ?utm_medium=email&utm_source=footer?

Do I understand correctly that the only real problem caused by the null versionnumber in these harvested datasets was the NullPointer error in the Make Data Count reports? Based on my conversations with another developer more familiar with MDC, that should no longer be an issue. (It was only due to a problem in the old, legacy counter-processor script, and it should no longer happen once they upgrade to the newer version.)

Are you aware of any other practical reason why they may need to have the version numbers of harvested datasets in the database? I honestly can't think of any. I don't think Dataverse ever uses it for anything. The metadata of a harvested dataset are used only for indexing in the search engine, so that the users of this Dataverse can find these remote datasets when they search locally. But the harvesting Dataverse never attempts to display these datasets; the users are instead redirected back to the Dataverse from which the dataset was harvested when they click on it in the search results.

Sorry for such a long reply. I didn't know how to explain this using fewer words. I hope this makes sense.

yikangfengnie · 2023-01-30T08:03:06Z

Hi @landreev, many thanks for your inputs! My replies are in point-form below.

> Just looking at the NTU harvesting clients: why do they have TWO harvesting clients both configured for your server? (Presumably, your set "NIE_Test_Set2" is a subset of "everything", i.e. the default set; generally, you never want to harvest overlapping sets from the same server.)

NTU created "harvested-nie" (oai_dc format) first. However, there was an error message and no records were harvested. NTU subsequently created "NIE_dataverse_ddi" (oai_ddi format) which managed to harvest records.

> Also, note that the 2 client configs use 2 different metadata formats. The one that harvests only the set "NIE_Test_Set2" uses oai_ddi, but the one that harvests everything uses oai_dc. oai_dc ("Dublin Core") is a fairly primitive format; oai_ddi encodes much more information and is a much better format for harvesting between Dataverses. But do note that NEITHER oai_dc nor oai_ddi will preserve the versionnumber. I don't think this has anything to do with either of the 2 Dataverses involved upgrading from one version to another. I'm pretty sure that this has always been the case. Also, please note that with harvesting there is really no expectation that the harvested datasets will have all the database values replicated 1:1 from the source server. The harvester obtains whatever metadata can be encoded and re-imported with a given format and indexes it in the local search engine; it does not really try to replicate these datasets locally.

> The 3rd harvestable format, "dataverse_json", encodes the most metadata, and does actually replicate the version number. It's a less reliable harvesting format (it's not going to work if, for example, the 2 installations are using different custom metadata blocks). If it were really, absolutely necessary for them to harvest your datasets with the version numbers, they could try to re-harvest everything in that format... But I strongly suspect that they don't really need them for any practical purposes. OK, this is already quite long, so let me explain in another comment.

When NTU tried "dataverse_json" format, there was an error message and no records were harvested.

> In addition to the Google group thread you linked above, is this the same issue as what was discussed here (also from NTU): https://groups.google.com/g/dataverse-community/c/O_-lDsRTLeM/m/UjR_HxVMBQAJ?utm_medium=email&utm_source=footer?

Yes.

> Do I understand correctly that the only real problem caused by the null versionnumber in these harvested datasets was the NullPointer error in the Make Data Count reports?

Yes.

> Based on my conversations with another developer more familiar with MDC, that should no longer be an issue. (It was only due to a problem in the old, legacy counter-processor script, and it should no longer happen once they upgrade to the newer version.)

Noted. We have informed NTU about this.

> Are you aware of any other practical reason why they may need to have the version numbers of harvested datasets in the database? I honestly can't think of any. I don't think Dataverse ever uses it for anything. The metadata of a harvested dataset are used only for indexing in the search engine, so that the users of this Dataverse can find these remote datasets when they search locally. But the harvesting Dataverse never attempts to display these datasets; the users are instead redirected back to the Dataverse from which the dataset was harvested when they click on it in the search results.

According to NTU, there is no other practical reason.

> Sorry for such a long reply. I didn't know how to explain this using fewer words. I hope this makes sense.

Thank you very much for your informative and clear response!

pdurbin added the Feature: Harvesting label Jan 19, 2023

pdurbin added the Type: Bug and User Role: API User labels Oct 9, 2023

cmbz mentioned this issue Mar 12, 2024

GREI 3: HDV Task - Improve OAI-PMH Harvesting IQSS/dataverse-pm#171

