Dataverse should return machine-readable metadata to requesting clients/servers (content negotiation) #3699

jggautier · 2017-03-16T14:40:43Z

Upon request from other machine clients and servers (e.g. other archives) accessing datasets through their persistent identifiers, Dataverse should be able to provide dataset metadata in available formats (JSON, DDI, etc.).

This is number 10 of the 11 recommendations made in A Data Citation Roadmap for Scholarly Data Repositories (https://doi.org/10.1101/097196).

pdurbin · 2017-03-17T12:28:07Z

@landreev doesn't our Harvesting (OAI-PMH) implementation already do some content negotiation?

The SWORD spec talks about content negotiation, but in practice our implementation of SWORD is very simple: for files, the only content we accept is the one format SWORD requires, which is a zip file. We require SWORD clients uploading files to Dataverse to send this header: "Packaging: http://purl.org/net/sword/package/SimpleZip", as mentioned at http://guides.dataverse.org/en/4.6.1/api/sword.html

pdurbin · 2017-06-25T15:05:04Z

@jggautier does Harvesting count?

jggautier · 2017-06-26T21:37:56Z

@landreev very helpfully provided context when I was trying to understand the difference between this and the harvesting Dataverse does now, so he's letting me post his comments on it :)

Content negotiation is a mechanism that allows clients and servers (as in, non-human, machine clients) to agree on the communication format/protocol that they both understand. In this context, the server would assume by default that the client is a web browser, with a human user, and send them to the default landing page. A machine client from another archive would send an additional flag in the request saying "I'm a dataverse harvester, I understand the following metadata formats, ordered by preference: JSON-LD, DDI, Dublin Core"; and the server will either output the metadata in one of those formats, if available, or reply "sorry, this content is not available in any of the formats you requested". This may be possible to implement using the already existing, standard "Accept:" HTTP header; or maybe a special flag would need to be designed just for this purpose... that's probably more technical than you need at this point/than I can talk about comfortably without reading up on it some more.

Dataverses already do something of this nature when they harvest from each other. And we've been thinking about extending this content negotiation mechanism further. (At this point a dataverse client says to the dataverse server "I understand the Dataverse Astronomical Sci metadata block"; it should really say "I understand the ... block, version NNN", because we've realized that the blocks are going to keep getting modified as people use them.)

This type of content negotiation seems a lot more flexible.
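The Accept-header variant described above can be sketched in a few lines. This is only an illustration under stated assumptions, not how Dataverse works: the function names are hypothetical, the media types standing in for JSON-LD, JSON and DDI/Dublin Core XML are assumptions, and the matching is simplified (exact media types and `*/*` only, no `type/*` ranges).

```python
# Hypothetical set of representations a landing page could serve (assumption).
AVAILABLE = {"text/html", "application/ld+json", "application/json", "application/xml"}


def parse_accept(header):
    """Parse an Accept header into (media_type, q) pairs, highest q first."""
    prefs = []
    for part in header.split(","):
        fields = [f.strip() for f in part.split(";")]
        media_type = fields[0].lower()
        if not media_type:
            continue
        q = 1.0  # a missing q parameter means full preference
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0
        prefs.append((media_type, q))
    # sorted() is stable, so equal q-values keep the order the client listed.
    return sorted(prefs, key=lambda p: p[1], reverse=True)


def negotiate(accept_header, available=AVAILABLE, default="text/html"):
    """Return the media type to serve, or None if nothing acceptable exists."""
    if not accept_header:
        return default  # assume a browser and send the landing page
    for media_type, q in parse_accept(accept_header):
        if q <= 0:
            continue  # q=0 means "explicitly not acceptable"
        if media_type == "*/*":
            return default
        if media_type in available:
            return media_type
    return None  # the server would answer 406 Not Acceptable
```

A harvester sending `Accept: application/ld+json, application/xml;q=0.8` would get JSON-LD here, and a client asking only for a format that is not in the set would get `None`, which corresponds to the "sorry, not available" reply.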

pameyer · 2017-06-26T21:42:18Z

One potential difference between this and harvesting is that there may be an assumption that the content-type negotiation is happening at the dataset landing page, instead of a separate harvesting endpoint.

pdurbin · 2017-06-26T23:12:41Z

@jggautier so what would you consider "definition of done" to be for this issue? I think we could easily argue that Dataverse already meets the recommendation. We could write it up in the User Guide if you want. In addition to Harvesting, we have Export in various formats that are machine readable. The standards-based ones are DDI and Dublin Core: http://guides.dataverse.org/en/4.7/admin/metadataexport.html

jggautier · 2020-09-09T12:16:40Z

Sorry for this very late reply. Guess I didn't understand enough back then, and still have some questions.

A Data Citation Roadmap for Scholarly Data Repositories recommends that "data repositories and identifier service providers such as identifiers.org or DataCite in addition may implement content negotiation for the persistent identifier expressed as HTTP URI, returning machine readable metadata in various formats." The article uses DataCite's implementation as an example:

curl -LH "Accept: application/ld+json" http://doi.org/10.5061/DRYAD.8290N returns DataCite's Schema.org JSON-LD metadata for that dataset, which is published in the Dryad repository. (See DataCite's page on content negotiation for more info.)

This already works for Dataverse-based repositories that publish datasets with DataCite DOIs. So systems can use this content negotiation to get metadata about datasets with DataCite DOIs published in Dataverse-based repositories. But what's returned is the metadata that DataCite publishes. This doesn't work for getting the metadata that the Dataverse repository publishes. For example:

Running curl -LH "Accept: application/ld+json" https://doi.org/10.7910/DVN/1XVYU3 in your terminal gets you DataCite's Schema.org JSON-LD metadata, not the Schema.org JSON-LD metadata that the Dataverse-based repository publishes at https://dataverse.harvard.edu/api/datasets/export?exporter=schema.org&persistentId=doi:10.7910/DVN/1XVYU3 (which includes information about the dataset that DataCite does not have, like license metadata and file download URLs).
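To make the two routes easier to compare, here is a rough Python equivalent of the requests above. The helper names are made up for illustration; the DOI resolver request and the export URL pattern are the ones shown in this comment, and nothing is sent over the network until `fetch` is called.

```python
import urllib.parse
import urllib.request


def doi_request(doi, accept="application/ld+json"):
    """Build a content-negotiation request against the DOI resolver.

    The response is whatever the registration agency (here DataCite)
    publishes, not the repository's own export.
    """
    return urllib.request.Request(
        "https://doi.org/" + doi, headers={"Accept": accept}
    )


def dataverse_export_url(server, persistent_id, exporter="schema.org"):
    """Build the repository-side export URL for the same dataset."""
    query = urllib.parse.urlencode(
        {"exporter": exporter, "persistentId": persistent_id},
        safe=":/",  # keep "doi:10.7910/..." readable, as in the URL above
    )
    return server.rstrip("/") + "/api/datasets/export?" + query


def fetch(request_or_url):
    """Send the request; urllib follows redirects, like curl -L."""
    with urllib.request.urlopen(request_or_url) as response:
        return response.read().decode("utf-8")
```

Calling `fetch(doi_request("10.7910/DVN/1XVYU3"))` and `fetch(dataverse_export_url("https://dataverse.harvard.edu", "doi:10.7910/DVN/1XVYU3"))` should return the two JSON-LD documents being compared, assuming both services still respond as described.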

Systems could use Dataverse's API or OAI-PMH, but in general, isn't the value of the kind of content negotiation that the article recommends that it's standardized and more stable, while systems' APIs might be organized differently from each other and could change over time? And OAI-PMH supports metadata only in XML, while this type of content negotiation allows for metadata in any format, like the JSON in the Schema.org examples above.

These are the questions I'd ask to help define the "definition of done" for this issue:

Are there other systems that could benefit more from getting dataset metadata from Dataverse-based repositories using this sort of content negotiation (versus using Dataverse's APIs or OAI-PMH)?
The article's recommendation is to use PIDs with this content negotiation. If a Dataverse-based repository is registering DataCite DOIs for its datasets, then you'll only get the metadata that DataCite publishes. Does that mean it's not possible for such Dataverse-based repositories to follow these recommendations? The DataCite page on content negotiation mentions that custom content types are no longer supported. Did custom content types let repositories somehow pass their own metadata?
What about Dataverse-based repositories that have datasets with PIDs that aren't DataCite DOIs?

jggautier · 2020-09-22T15:41:06Z

I've been emailing the article's corresponding author Tim Clark, who's looking into the questions in the last comment.

This has also been discussed in the context of tools for assessing the "FAIR"ness of datasets, as part of the FAIRsFAIR project.

hvdsomp · 2021-06-08T08:20:18Z

One can't implement content negotiation for URIs that are not under one's control. So, Dataverse cannot implement content negotiation for a DOI HTTP-URI because it doesn't control those DOI URIs. DataCite and CrossRef can (and do), and in doing so allow access to the metadata they have about a DOI-identified object.

Signposting offers (among others) a way to get to metadata about the object that is available at the end of a (Dataverse) repository:

A client follows the redirect from a DOI of an object to end up at a (Dataverse) landing page for it
At the landing page, the client looks for typed link(s) with the "describedby" relationship (in the HTTP Link header, HTML <link> elements, or a Link Set)
To access metadata, the client dereferences/follows a link with the "describedby" relationship, possibly on the basis of a media type that was expressed for the link by means of a "type" attribute
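The second and third steps can be sketched for the HTTP Link header case as below. This is a simplified illustration, not a complete Web Linking parser: the function names and the example URLs are hypothetical, and the regular expressions assume no "<" characters inside parameter values.

```python
import re

# One link-value: <target> followed by its ";name=value" parameters.
LINK_RE = re.compile(r'<([^>]*)>([^<]*)')
PARAM_RE = re.compile(r';\s*([\w*-]+)\s*=\s*(?:"([^"]*)"|([^;,\s]+))')


def parse_link_header(value):
    """Split a Link header value into (target, params) pairs."""
    links = []
    for target, raw_params in LINK_RE.findall(value):
        params = {}
        for name, quoted, bare in PARAM_RE.findall(raw_params):
            # Keep the first occurrence of a repeated parameter.
            params.setdefault(name.lower(), quoted or bare)
        links.append((target, params))
    return links


def describedby_links(link_header, wanted_type=None):
    """Return targets whose rel includes "describedby".

    If wanted_type is given, keep only links whose "type" attribute matches.
    """
    results = []
    for target, params in parse_link_header(link_header):
        rels = params.get("rel", "").lower().split()
        if "describedby" not in rels:
            continue
        if wanted_type and params.get("type") != wanted_type:
            continue
        results.append(target)
    return results


# Hypothetical Link header from a landing page (all URLs are made up).
EXAMPLE = (
    '<https://doi.org/10.1234/EXAMPLE>; rel="cite-as", '
    '<https://repo.example.org/metadata.jsonld>; rel="describedby"; '
    'type="application/ld+json", '
    '<https://repo.example.org/metadata.xml>; rel="describedby"; '
    'type="application/xml"'
)
```

With this input, `describedby_links(EXAMPLE, "application/ld+json")` selects only the JSON-LD link, which the client would then dereference to get the repository's own metadata.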

jggautier · 2021-07-07T18:41:45Z

Thanks @hvdsomp. You wrote that "One can't implement content negotiation for URIs that are not under one's control." That's an incredibly helpful way to put it. It doesn't seem like this is a recommendation that data repositories can actually implement then, right?

We can encourage the people who do control those URIs but haven't implemented content negotiation to implement content negotiation. I'm not sure how Handles work differently than DOIs, but there are at least 7 Dataverse repositories using them, and I'm not sure if content negotiation works for their Handle URIs. curl -LH "Accept: application/ld+json" https://hdl.handle.net/11529/10548581 doesn't seem to work.

Does anyone keeping an eye on this GitHub issue know more about Handles or know someone who knows more? I'll wait a week before asking in other channels (Dataverse Google Group, Code4Lib mailing list, emailing admins of repositories using Handles).

jggautier · 2021-07-27T19:37:42Z

I asked in the Dataverse Google Group but haven't had any replies yet.

At @pdurbin's suggestion I also posted questions in the PID Forum, where I also referenced an older post in that forum that makes me question my understanding of this tenth recommendation and of content negotiation in general.

hvdsomp · 2021-07-29T06:32:27Z

The technology underlying handles and DOIs is the same, or, to put it differently, DOIs are handles. But organizations like CrossRef and DataCite have implemented a lot of functionality on top of DOIs, including content negotiation with the DOI-HTTP-URI as a means to obtain metadata in various formats, see e.g. https://www.crossref.org/documentation/retrieve-metadata/content-negotiation/.

cmbz · 2024-08-20T15:25:15Z

To focus on the most important features and bugs, we are closing issues created before 2020 (version 5.0) that are not new feature requests with the label 'Type: Feature'.

If you created this issue and you feel the team should revisit this decision, please reopen the issue and leave a comment.

jggautier added the Feature: Metadata label Mar 27, 2017

jggautier added the User Role: Sysadmin label Jul 3, 2017

jggautier mentioned this issue Jun 3, 2021

Signposting #7919

Closed

pdurbin added the Feature: Harvesting label Apr 12, 2022

pdurbin mentioned this issue Apr 13, 2022

Spike: Inventory and prioritize all existing Harvesting related issues IQSS/dataverse-pm#24

Closed

3 tasks

pdurbin added the Type: Suggestion label Nov 14, 2023

This was referenced Mar 12, 2024

GREI 3: HDV Task - Improve OAI-PMH Harvesting IQSS/dataverse-pm#171

Open

Epic: GREI 4 - Analytics and Reporting IQSS/dataverse-pm#118

Open

cmbz added the GREI 3 Search and Browse label Jul 1, 2024

philippconzett mentioned this issue Jul 10, 2024

Feature Request/Idea: Make OAI-PMH harvesting more configurable #10677

Open

cmbz closed this as completed Aug 20, 2024
