Problems with Harvested Records #9261
Comments
Hello,
Just looking at the NTU harvesting clients: why do they have TWO harvesting clients both configured for your server? (Presumably, your set "harvested-nie" is a subset of "everything", or the default set; generally, you never want to harvest overlapping sets from the same server.)
Also, note that the 2 client configs use 2 different metadata formats. The one that harvests only "harvested-nie" uses oai_ddi, but the one that harvests everything uses oai_dc. oai_dc ("Dublin Core") is a fairly primitive format; oai_ddi encodes much more information and is a much better format for harvesting between Dataverses.
But do note that NEITHER oai_dc nor oai_ddi will preserve the versionnumber. I don't think this has anything to do with either of the 2 Dataverses involved upgrading from one version to another. I'm pretty sure that this has always been the case.
Also, please note that with harvesting there is really no expectation that the harvested datasets will have all the database values replicated 1:1 from the source server. The harvester obtains whatever metadata can be encoded and re-imported with a given format and indexes it in the local search engine; it does not really try to replicate these datasets locally.
The 3rd harvestable format, "dataverse_json", encodes the most metadata, and does actually replicate the version number. It's a less reliable harvesting format, though (it's not going to work if, for example, the 2 installations are using different custom metadata blocks). If it were really, absolutely necessary for them to harvest your datasets with the version numbers, they could try to re-harvest everything in that format... But I strongly suspect that they don't really need them for any practical purposes.
OK, this is already quite long, so let me explain in another comment.
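For readers less familiar with OAI-PMH, here is a minimal sketch of the requests that differently configured harvesting clients would send. It only builds the request URLs and does not contact any server. The `/oai` endpoint path and the helper name `list_records_url` are assumptions made for illustration; the set name and the three metadata formats are the ones mentioned above.

```python
from urllib.parse import urlencode

# Assumption: the source installation serves OAI-PMH at /oai.
# The thread itself does not state the endpoint path.
OAI_BASE_URL = "https://researchdata.nie.edu.sg/oai"


def list_records_url(metadata_prefix, set_name=None, base_url=OAI_BASE_URL):
    """Build an OAI-PMH ListRecords URL (hypothetical helper, for illustration)."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_name:
        # Leaving out "set" asks the server for its default set.
        params["set"] = set_name
    return f"{base_url}?{urlencode(params)}"


# Client 1: only the "harvested-nie" set, in DDI.
subset_url = list_records_url("oai_ddi", set_name="harvested-nie")
# Client 2: the default set, in Dublin Core. It overlaps with client 1.
everything_url = list_records_url("oai_dc")
# A single client using the format that carries the version number.
json_url = list_records_url("dataverse_json", set_name="harvested-nie")

for url in (subset_url, everything_url, json_url):
    print(url)
```

The first two URLs correspond to the overlapping pair of clients described above; the third shows what a re-harvest in "dataverse_json" would request instead.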
In addition to the Google group thread you linked above, is this the same issue as what was discussed here (also from NTU): https://groups.google.com/g/dataverse-community/c/O_-lDsRTLeM/m/UjR_HxVMBQAJ?utm_medium=email&utm_source=footer ?
Do I understand correctly that the only real problem the null versionnumber in these harvested datasets was causing was the NullPointer error in the Make Data Count reports?
Based on my conversations with another developer more familiar with MDC, that should no longer be an issue. (It was only due to a problem in the old, legacy counter-processor script, and it should no longer happen once they upgrade to the newer version.)
Are you aware of any other practical reason why they may need to have the version numbers of harvested datasets in the database? I honestly can't think of any. I don't think Dataverse ever uses it for anything. The only way the metadata of a harvested dataset is used at all is for indexing in the search engine, so that the users of this Dataverse can find these remote datasets when they search locally. But the harvesting Dataverse never attempts to display these datasets; when users click on one in the search results, they are instead redirected back to the Dataverse from which the dataset was harvested.
Sorry for such a long reply. I didn't know how to explain this using fewer words. I hope this makes sense.
Hi @landreev, many thanks for your inputs! My replies are in point-form below.
Hello from NIE,
We have an arrangement where NTU Dataverse (https://researchdata.ntu.edu.sg/) harvests records from NIE Dataverse (https://researchdata.nie.edu.sg/).
NIE is using Dataverse v5.8. It was upgraded from v5.4 in June 2022.
In November 2022, NTU reported that their records harvested from NIE were missing some information; for example, "versionnumber" was null even though "versionstate" was "RELEASED" in the table "datasetversion".
However, when NIE checked the same set of records in the NIE Dataverse instance via SQL, "versionnumber" was not null where "versionstate" was "RELEASED" in the table "datasetversion".
Apparently, the harvesting via OAI did not capture the "versionnumber" correctly.
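For anyone wanting to reproduce the check, the following is a rough sketch of the kind of query involved. It runs against an in-memory SQLite table as a stand-in for the real PostgreSQL database. The table and the "versionnumber" and "versionstate" columns are the ones named above; the "id" column and the sample rows are invented for illustration.

```python
import sqlite3

# In-memory stand-in for the "datasetversion" table.
# Assumption: the "id" column and all rows below are made up for this sketch.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE datasetversion ("
    "id INTEGER PRIMARY KEY, versionnumber INTEGER, versionstate TEXT)"
)
conn.executemany(
    "INSERT INTO datasetversion (id, versionnumber, versionstate) VALUES (?, ?, ?)",
    [
        (1, 1, "RELEASED"),     # released, version number present
        (2, None, "RELEASED"),  # released, version number missing
        (3, None, "DRAFT"),     # draft rows are not of interest here
    ],
)

# Released versions whose version number is missing.
CHECK_SQL = """
SELECT id
FROM datasetversion
WHERE versionstate = 'RELEASED'
  AND versionnumber IS NULL
ORDER BY id
"""

released_without_number = [row[0] for row in conn.execute(CHECK_SQL)]
print(released_without_number)
```

On the harvesting side this kind of query would list the affected rows; on the source side it would be expected to return nothing.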
Below is the client config (as at 5 Jan 2023) of NTU Dataverse that harvests records from NIE Dataverse. Many thanks!
Best wishes,
Yikang
NIE Library