Problems with Harvested Records #9261

Open
yikangfengnie opened this issue Jan 5, 2023 · 3 comments
@yikangfengnie

yikangfengnie commented Jan 5, 2023

Hello from NIE,

We have an arrangement where NTU Dataverse (https://researchdata.ntu.edu.sg/) harvests records from NIE Dataverse (https://researchdata.nie.edu.sg/).

NIE is using Dataverse v5.8. It was upgraded from v5.4 in June 2022.

In November 2022, NTU reported that the records they harvested from NIE were missing some information. For example, in the table "datasetversion", "versionnumber" was null even though "versionstate" was "RELEASED".

However, when NIE checked the same set of records in the NIE Dataverse instance via SQL, "versionnumber" was not null where "versionstate" was "RELEASED" in the table "datasetversion".

Apparently, harvesting via OAI did not capture "versionnumber" correctly.

  • To check whether the above issue is specific to the NIE Dataverse instance, we used the NIE Dataverse Test Server (https://researchdatatest.nie.edu.sg/) to harvest records from NTU Dataverse. On our Test Server, the records harvested from NTU Dataverse have the same problem: "versionnumber" was null even though "versionstate" was "RELEASED" in the table "datasetversion".
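For anyone who wants to repeat the check, here is a minimal sketch of the kind of query we mean. It is only an illustration: it runs against an in-memory SQLite table standing in for the real Dataverse database, the sample rows are made up, and only the table and column names ("datasetversion", "versionnumber", "versionstate") come from this report.

```python
import sqlite3

# Stand-in for the Dataverse database; only the columns discussed here are modelled.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE datasetversion ("
    "id INTEGER PRIMARY KEY, versionnumber INTEGER, versionstate TEXT)"
)

# Hypothetical sample rows, not real data from either installation.
conn.executemany(
    "INSERT INTO datasetversion (id, versionnumber, versionstate) VALUES (?, ?, ?)",
    [
        (1, 1, "RELEASED"),     # released version with a version number
        (2, None, "RELEASED"),  # released version with a null version number
        (3, 2, "RELEASED"),
    ],
)

CHECK_QUERY = """
SELECT id
FROM datasetversion
WHERE versionstate = 'RELEASED'
  AND versionnumber IS NULL
ORDER BY id
"""


def released_without_version(connection):
    """Return ids of released dataset versions whose versionnumber is null."""
    return [row[0] for row in connection.execute(CHECK_QUERY)]


print(released_without_version(conn))
```

On the source server we would expect such a query to return nothing, while on the harvesting server it returns the harvested records.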

Below is the client config (as of 5 Jan 2023) of NTU Dataverse that harvests records from NIE Dataverse. Many thanks!

{
  "status": "OK",
  "data": {
    "harvestingClients": [
      {
        "nickName": "harvested-nie",
        "dataverseAlias": "harvested-nie",
        "type": "oai",
        "harvestUrl": "https://researchdata.nie.edu.sg/oai",
        "archiveUrl": "https://researchdata.nie.edu.sg",
        "archiveDescription": "This Dataset is harvested from our partners. Clicking the link will take you directly to the archival source of the data.",
        "metadataFormat": "oai_dc",
        "set": "N/A",
        "schedule": "none",
        "status": "inActive",
        "lastHarvest": "Sun Nov 13 00:00:00 SGT 2022",
        "lastResult": "SUCCESS",
        "lastSuccessful": "Sun Nov 13 00:00:00 SGT 2022",
        "lastNonEmpty": "Sun Oct 02 00:00:00 SGT 2022",
        "lastDatasetsHarvested": "2",
        "lastDatasetsDeleted": "0",
        "lastDatasetsFailed": "0"
      },
      {
        "nickName": "NIE_dataverse_ddi",
        "dataverseAlias": "harvested-nie",
        "type": "oai",
        "harvestUrl": "https://researchdata.nie.edu.sg/oai",
        "archiveUrl": "https://researchdata.nie.edu.sg",
        "archiveDescription": "This Dataset is harvested from our partners. Clicking the link will take you directly to the archival source of the data.",
        "metadataFormat": "oai_ddi",
        "set": "NIE_Test_Set2",
        "schedule": "none",
        "status": "inActive",
        "lastHarvest": "Fri Nov 18 12:13:05 SGT 2022",
        "lastResult": "SUCCESS",
        "lastSuccessful": "Fri Nov 18 12:13:05 SGT 2022",
        "lastNonEmpty": "Fri Nov 18 12:13:05 SGT 2022",
        "lastDatasetsHarvested": "1",
        "lastDatasetsDeleted": "0",
        "lastDatasetsFailed": "0"
      }
    ]
  }
}

Best wishes,
Yikang
NIE Library

@landreev
Contributor

Hello,
Sorry for the delay with this. There's a lot going on here...

Just looking at the NTU harvesting clients: why do they have TWO harvesting clients both configured for your server? (Presumably, your set "NIE_Test_Set2" is a subset of "everything", i.e. the default set; generally, you never want to harvest overlapping sets from the same server.)
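A rough way to spot this kind of overlap in a client listing like the one posted above is sketched below. It assumes that a "set" value of "N/A" means the client harvests the default (whole) set, which is my reading of the listing rather than something it states. The function name and the trimmed-down listing are only for illustration.

```python
import json
from collections import defaultdict

NO_SET = "N/A"  # assumption: how the listing shows a client with no set restriction


def overlapping_clients(listing):
    """Group clients by harvestUrl and flag servers whose clients may overlap."""
    by_server = defaultdict(list)
    for client in listing["data"]["harvestingClients"]:
        by_server[client["harvestUrl"]].append(client)

    flagged = {}
    for url, clients in by_server.items():
        sets = [c["set"] for c in clients]
        harvests_everything = NO_SET in sets
        duplicate_set = len(set(sets)) < len(sets)
        # More than one client on a server overlaps if one of them takes
        # everything, or if two of them name the same set.
        if len(clients) > 1 and (harvests_everything or duplicate_set):
            flagged[url] = [
                (c["nickName"], c["metadataFormat"], c["set"]) for c in clients
            ]
    return flagged


# Trimmed copy of the listing from this issue (only the fields used here).
listing = json.loads("""
{
  "status": "OK",
  "data": {
    "harvestingClients": [
      {"nickName": "harvested-nie", "harvestUrl": "https://researchdata.nie.edu.sg/oai",
       "metadataFormat": "oai_dc", "set": "N/A"},
      {"nickName": "NIE_dataverse_ddi", "harvestUrl": "https://researchdata.nie.edu.sg/oai",
       "metadataFormat": "oai_ddi", "set": "NIE_Test_Set2"}
    ]
  }
}
""")

for url, clients in overlapping_clients(listing).items():
    print(url)
    for nick, fmt, set_name in clients:
        print(f"  {nick}: format={fmt}, set={set_name}")
```

Note that this only catches the obvious cases; two differently named sets can still contain the same datasets, and the listing alone cannot tell you that.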

Also, note that the 2 client configs use 2 different metadata formats. The one that harvests only "NIE_Test_Set2" uses oai_ddi, but the one that harvests everything uses oai_dc. oai_dc ("Dublin Core") is a fairly primitive format; oai_ddi encodes much more information and is a much better format for harvesting between Dataverses. But do note that NEITHER oai_dc nor oai_ddi will preserve the versionnumber. I don't think this has anything to do with either of the 2 Dataverses involved upgrading from one version to another. I'm pretty sure that this has always been the case. Also, please note that with harvesting there is really no expectation that the harvested datasets will have all the database values replicated 1:1 from the source server. The harvester obtains whatever metadata can be encoded and re-imported with a given format and indexes it in the local search engine; it does not really try to replicate these datasets locally.

The 3rd harvestable format, "dataverse_json", encodes the most metadata, and does actually replicate the version number. It's a less reliable harvesting format (it's not going to work if, for example, the 2 installations are using different custom metadata blocks). If it were absolutely necessary for them to harvest your datasets with the version numbers, they could try to re-harvest everything in that format... But I strongly suspect that they don't really need them for any practical purposes. OK, this is already quite long, so let me explain in another comment.
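If you want to see for yourself what each format carries, you can look at the raw OAI-PMH responses from the source server. The sketch below only builds the request URLs (it makes no network calls). The verbs and the "metadataPrefix" and "set" parameters are standard OAI-PMH; the base URL and set name are taken from the client listing above, and I'm assuming the three format names are what the server accepts as metadataPrefix values. The helper name is made up for this example.

```python
from urllib.parse import urlencode


def oai_request_url(base_url, verb, **params):
    """Build an OAI-PMH request URL, skipping parameters that are None."""
    query = {"verb": verb}
    query.update({key: value for key, value in params.items() if value is not None})
    return f"{base_url}?{urlencode(query)}"


BASE_URL = "https://researchdata.nie.edu.sg/oai"  # harvestUrl from the listing

# Ask the server which metadata formats it offers.
print(oai_request_url(BASE_URL, "ListMetadataFormats"))

# Request the same set in each of the three formats discussed here,
# so the returned records can be compared side by side.
for prefix in ("oai_dc", "oai_ddi", "dataverse_json"):
    print(oai_request_url(BASE_URL, "ListRecords",
                          metadataPrefix=prefix, set="NIE_Test_Set2"))
```

Opening these URLs in a browser (or fetching them with any HTTP client) and searching the output for a version number is a quick way to compare the formats without touching the database.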

@landreev
Contributor

In addition to the Google group thread you linked above, is this the same issue as what was discussed here (also from NTU): https://groups.google.com/g/dataverse-community/c/O_-lDsRTLeM/m/UjR_HxVMBQAJ?utm_medium=email&utm_source=footer?

Do I understand correctly that the only real problem caused by the null versionnumber in these harvested datasets was the NullPointer error in the Make Data Count reports? Based on my conversations with another developer more familiar with MDC, that should no longer be an issue. (It was only due to a problem in the legacy counter-processor script, and it should no longer happen once they upgrade to the newer version.)

Are you aware of any other practical reason why they may need to have the version numbers of harvested datasets in the database? I honestly can't think of any. I don't think Dataverse ever uses them for anything. The metadata of a harvested dataset is only used for indexing in the search engine, so that the users of this Dataverse can find these remote datasets when they search locally. But the harvesting Dataverse never attempts to display these datasets; when users click on one in the search results, they are redirected back to the Dataverse from which the dataset was harvested.

Sorry for such a long reply. I didn't know how to explain this using fewer words. I hope this makes sense.

@yikangfengnie
Author

Hi @landreev, many thanks for your input! My replies are in point form below.

Just looking at the NTU harvesting clients: why do they have TWO harvesting clients both configured for your server? (Presumably, your set "NIE_Test_Set2" is a subset of "everything", i.e. the default set; generally, you never want to harvest overlapping sets from the same server.)

  • NTU created "harvested-nie" (oai_dc format) first. However, there was an error message and no records were harvested. NTU subsequently created "NIE_dataverse_ddi" (oai_ddi format), which managed to harvest records.

Also, note that the 2 client configs use 2 different metadata formats. The one that harvests only "NIE_Test_Set2" uses oai_ddi, but the one that harvests everything uses oai_dc. oai_dc ("Dublin Core") is a fairly primitive format; oai_ddi encodes much more information and is a much better format for harvesting between Dataverses. But do note that NEITHER oai_dc nor oai_ddi will preserve the versionnumber. I don't think this has anything to do with either of the 2 Dataverses involved upgrading from one version to another. I'm pretty sure that this has always been the case. Also, please note that with harvesting there is really no expectation that the harvested datasets will have all the database values replicated 1:1 from the source server. The harvester obtains whatever metadata can be encoded and re-imported with a given format and indexes it in the local search engine; it does not really try to replicate these datasets locally.

The 3rd harvestable format, "dataverse_json", encodes the most metadata, and does actually replicate the version number. It's a less reliable harvesting format (it's not going to work if, for example, the 2 installations are using different custom metadata blocks). If it were absolutely necessary for them to harvest your datasets with the version numbers, they could try to re-harvest everything in that format... But I strongly suspect that they don't really need them for any practical purposes. OK, this is already quite long, so let me explain in another comment.

  • When NTU tried "dataverse_json" format, there was an error message and no records were harvested.

In addition to the Google group thread you linked above, is this the same issue as what was discussed here (also from NTU): https://groups.google.com/g/dataverse-community/c/O_-lDsRTLeM/m/UjR_HxVMBQAJ?utm_medium=email&utm_source=footer?

  • Yes.

Do I understand correctly that the only real problem caused by the null versionnumber in these harvested datasets was the NullPointer error in the Make Data Count reports?

  • Yes.

Based on my conversations with another developer more familiar with MDC, that should no longer be an issue. (It was only due to a problem in the legacy counter-processor script, and it should no longer happen once they upgrade to the newer version.)

  • Noted. We have informed NTU about this.

Are you aware of any other practical reason why they may need to have the version numbers of harvested datasets in the database? I honestly can't think of any. I don't think Dataverse ever uses them for anything. The metadata of a harvested dataset is only used for indexing in the search engine, so that the users of this Dataverse can find these remote datasets when they search locally. But the harvesting Dataverse never attempts to display these datasets; when users click on one in the search results, they are redirected back to the Dataverse from which the dataset was harvested.

  • According to NTU, there is no other practical reason.

Sorry for such a long reply. I didn't know how to explain this using fewer words. I hope this makes sense.

  • Thank you very much for your informative and clear response!
