Feature Request/Idea: Make OAI-PMH harvesting more configurable #10677

Open
philippconzett opened this issue Jul 10, 2024 · 11 comments
Labels
Type: Feature a feature request

Comments

@philippconzett
Contributor

philippconzett commented Jul 10, 2024

Overview of the Feature Request
The idea is to make OAI-PMH metadata harvesting more configurable, so that 1) the metadata about the datasets in a given harvesting set can come from any selection of fields from any metadata schema defined in a Dataverse installation, and 2) the metadata can be based on standards other than Dublin Core (DC). See the discussion in the Dataverse Users Community Google Group.

What kind of user is the feature intended for?
API User, Superuser, Sysadmin

What inspired the request?
DataverseNO would like to implement interoperability support so that data can be made searchable and reusable through the Svalbard Integrated Arctic Earth Observing System (SIOS), an international observing system for long-term measurements in and around the Norwegian archipelago of Svalbard that addresses Earth System Science questions. There is a growing community in Europe and beyond that makes, or is interested in making, its data reusable through SIOS. Currently, SIOS only supports harvesting of discovery metadata using OAI-PMH.

What existing behavior do you want changed?
Currently, Dataverse supports OAI-PMH harvesting using a DC representation of (some of?) the metadata in the Citation Metadata block.
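
For context, a harvester asks for this DC representation by passing metadataPrefix=oai_dc in a standard OAI-PMH request. A minimal Python sketch of building such a request URL is given here. The host name and the set name "sios" are hypothetical, and the /oai path is assumed to be where the installation exposes its OAI-PMH endpoint. Only the verb and parameter names come from the OAI-PMH specification.

```python
from urllib.parse import urlencode

# Hypothetical installation URL, used only for illustration.
BASE_URL = "https://dataverse.example.org/oai"


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build an OAI-PMH ListRecords request URL.

    metadata_prefix selects the metadata format (oai_dc is Dublin Core);
    set_spec optionally restricts the request to one harvesting set.
    """
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"


# All records in Dublin Core.
print(list_records_url(BASE_URL))
# Records from a hypothetical set named "sios".
print(list_records_url(BASE_URL, set_spec="sios"))
```

With the requested feature, the same request could carry a different metadataPrefix value for a non-DC format, provided the installation offers one.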

Any brand new behavior you want to add to Dataverse?
Yes, the requested feature would extend the options for configuring OAI-PMH metadata harvesting.

Any open or closed issues related to this feature request?
Some of the issues below might be related:

IQSS/dataverse:

IQSS/dataverse-pm:

@philippconzett philippconzett added the Type: Feature a feature request label Jul 10, 2024
@philippconzett philippconzett changed the title Feature Request/Idea: Make OAI-PMH havesting sets more configurable Feature Request/Idea: Make OAI-PMH havesting more configurable Jul 10, 2024
@philippconzett
Contributor Author

@pdurbin and I were having a chat about this issue on Zulip, but realized it would be good to share the conversation in a public channel, so I've pasted it below:

Philip Durbin:

Is it important to be able to configure which fields can be harvested, or is simply "all fields" (from all metadata blocks) sufficient?

Philipp Conzett:

I'm not sure. I think SIOS would need us to provide an OAI-PMH harvesting set that exposes metadata about all relevant datasets in DataverseNO according to the XML Schema of the GCMD DIF standard. But maybe it's possible to do some sort of selection and/or mapping of relevant fields based on what Dataverse exposes through OAI-PMH?

Philipp Conzett:

I've uploaded an example XML file about one dataset in SIOS (the file extension of the original file is .xml, but GitHub wouldn't allow me to upload .xml files):
NPI_4e28fed2-cf18-52e8-8370-744ca8a4c7cf_dif10.txt

Philip Durbin:

Does SIOS use OAI-PMH to harvest this GCMD DIF format from any other data repositories? Or would Dataverse be the first?

Philipp Conzett:

SIOS supports OAI-PMH harvesting based on two metadata standards, GCMD DIF and ISO 19115. On the SIOS Data Portal page, I see in the right filter section that there are about 20 data centers being harvested. I don't know how many of these use GCMD DIF, but let's say half of them do.

Philip Durbin:

Interesting. I don't know what ISO 19115 is but is that also an option? I found https://en.wikipedia.org/wiki/Geospatial_metadata#ISO_19115:_Geographic_information_%E2%80%93_Metadata
I bet Amber and others who are into geospatial data would like this.

Philipp Conzett:

Based on what I've found out about the two standards, GCMD DIF seems easier to implement.

Philip Durbin:

"Easier to implement" sounds good.

@johannes-darms
Contributor

@philippconzett we are also interested in this feature.

Could we implement something similar to the Exporter SPI, i.e. add custom modules (Importers) responsible for transforming a harvested metadata format into the corresponding metadata blocks?

cc:@vera @julian-schneider

@qqmyers
Member

qqmyers commented Jul 10, 2024

Per this check in the code:

if (exporter != null && (exporter instanceof XMLExporter) && exporter.isHarvestable()) {

if an exporter returns true from isHarvestable() and it is an XML format, I think it is made available as an option for harvesting. I'm not sure whether XML is a requirement of the spec or just a Dataverse choice.

We don't yet have the equivalent of the exporter spi to make importers, but if the idea here is just to let non-Dataverse catalogs harvest DV content, and it's XML, I think you just have to create/install the exporter you want.
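
One way to check from the outside which formats an installation offers for harvesting is the standard OAI-PMH ListMetadataFormats verb. The Python sketch below parses such a response and lists the available metadataPrefix values. The sample response is illustrative and not captured from a real server: the "dif" prefix and its example.org schema and namespace URLs are hypothetical stand-ins for whatever a custom exporter would announce. Only the element names and the OAI-PMH namespace come from the specification.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the OAI-PMH 2.0 specification.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Illustrative ListMetadataFormats response. The "dif" entry is hypothetical.
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListMetadataFormats>
    <metadataFormat>
      <metadataPrefix>oai_dc</metadataPrefix>
      <schema>http://www.openarchives.org/OAI/2.0/oai_dc.xsd</schema>
      <metadataNamespace>http://www.openarchives.org/OAI/2.0/oai_dc/</metadataNamespace>
    </metadataFormat>
    <metadataFormat>
      <metadataPrefix>dif</metadataPrefix>
      <schema>https://example.org/hypothetical/dif.xsd</schema>
      <metadataNamespace>https://example.org/hypothetical/dif/</metadataNamespace>
    </metadataFormat>
  </ListMetadataFormats>
</OAI-PMH>
""".strip()


def metadata_prefixes(xml_text):
    """Return the metadataPrefix values announced in a ListMetadataFormats response."""
    root = ET.fromstring(xml_text)
    path = ".//oai:metadataFormat/oai:metadataPrefix"
    return [element.text for element in root.findall(path, OAI_NS)]


print(metadata_prefixes(SAMPLE_RESPONSE))
```

If a newly installed exporter is picked up as described above, its format would be expected to show up in this list next to oai_dc, though the exact prefix it appears under is an assumption here.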

@pdurbin pdurbin changed the title Feature Request/Idea: Make OAI-PMH havesting more configurable Feature Request/Idea: Make OAI-PMH harvesting more configurable Jul 10, 2024
@philippconzett
Contributor Author

Thanks for the feedback! I'm not sure if I understand the technical details. Could we schedule a call with Jim and/or Phil and those interested?

@johannes-darms
Contributor

@qqmyers: That's great, I wasn't aware of this feature! I thought we were talking about the other way round, collecting more metadata from other repositories...

@philippconzett That would be nice, we or at least one of us (@vera, @julian-schneider, @johannes-darms ) would like to join.

@philippconzett
Contributor Author

Great! I've created a when2meet calendar to help us schedule a call. I'll be on and off in vacation mode from today, but maybe Thursday or Friday next week could work for most of us?

It would be good if someone who knows the details of metadata export could join. I see that @poikilotherm, @qqmyers, and @pdurbin have contributed to the GDCC dataverse-exporters GitHub repo.

@philippconzett
Contributor Author

Just to make sure I'm on the right track: is the functionality @qqmyers refers to above the one described in the Metadata Export Formats section of the Developer Guide?

@poikilotherm
Contributor

Yes indeed!

@philippconzett
Contributor Author

@qqmyers Thanks for filling in the when2meet calendar!

Pinging @vera, @julian-schneider, @johannes-darms, @DS-INRA, @gwendoux

I've created a collaborative notes doc. It currently contains a brief description of the DataverseNO-SIOS use case and how we could approach it to make the requested feature useful for other, similar use cases in the Dataverse community. Please feel free to contribute! Thanks!

@poikilotherm
Contributor

poikilotherm commented Jul 14, 2024

Leaving a note here that I shamelessly made use of my admin rights and added #2721 to the initial description.

Another note: I've been talking about creating an XML-RDF exporter for a long time now. That's the way to go when you want to expose all metadata in XML without much need for configuration.

Not sure if we'd prefer some standalone thing specialised in XML stuff or if we want to look into using something like https://github.com/gdcc/exporter-transformer.

Also, not sure if these issues are related with regards to technical implementation: #10042, #9344, #10000

@philippconzett
Contributor Author

Thanks all for indicating your availability. I've sent you a calendar invite. Please let me know if you haven't got it.
