Dataverse should return machine-readable metadata to requesting clients/servers (content negotiation) #3699

Closed
jggautier opened this issue Mar 16, 2017 · 12 comments
Labels
Feature: Harvesting Feature: Metadata GREI 3 Search and Browse Type: Suggestion an idea User Role: Sysadmin Installs, upgrades, and configures the system, connects via ssh

Comments

@jggautier
Contributor

jggautier commented Mar 16, 2017

Upon request from other machine clients and servers (e.g. other archives) accessing datasets through their persistent identifiers, Dataverse should be able to provide dataset metadata in available formats (JSON, DDI, etc.).

This is number 10 of the 11 recommendations made in A Data Citation Roadmap for Scholarly Data Repositories (https://doi.org/10.1101/097196).

@pdurbin
Member

pdurbin commented Mar 17, 2017

@landreev doesn't our Harvesting (OAI-PMH) implementation already do some content negotiation?

The SWORD spec talks about content negotiation, but in practice our implementation of SWORD is very simple: for files, the only content we accept is the one SWORD requires, which is a zip file. We require SWORD clients uploading files to Dataverse to send this header: "Packaging: http://purl.org/net/sword/package/SimpleZip", as mentioned at http://guides.dataverse.org/en/4.6.1/api/sword.html

@pdurbin
Member

pdurbin commented Jun 25, 2017

@jggautier does Harvesting count?

@jggautier
Contributor Author

@landreev very helpfully provided context when I was trying to understand the difference between this and the harvesting Dataverse does now, so he's letting me post his comments on it :)

Content negotiation is a mechanism that allows clients and servers (as in, non-human, machine clients) to agree on the communication format/protocol that they both understand. In this context, the server would assume by default that the client is a web browser, with a human user, and send them to the default landing page. A machine client from another archive would send an additional flag in the request saying "I'm a dataverse harvester, I understand the following metadata formats, ordered by preference: JSON-LD, DDI, Dublin Core"; and the server will output the metadata in one of the formats, if available, or reply "sorry, this content is not available in any of the formats you requested". This may be possible to implement using the already existing, standard "Accept:" HTTP header; or maybe a special flag would need to be designed just for this purpose... that's probably more technical than you need at this point/than I can talk about comfortably without reading up on it some more.
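To make the exchange described above concrete, here is a minimal sketch of how a server might pick a format from a standard "Accept:" header. It is not Dataverse's implementation: the function names, the `AVAILABLE` table, and the choice of `application/xml` for DDI are assumptions made for illustration, and the parsing is deliberately simplified (no wildcard subtypes such as `application/*`).

```python
from typing import Optional


def parse_accept(header: str) -> list[tuple[str, float]]:
    """Parse an Accept header into (media_type, q) pairs, highest q first."""
    prefs = []
    for part in header.split(","):
        part = part.strip()
        if not part:
            continue
        media_type, *params = [p.strip() for p in part.split(";")]
        q = 1.0  # a missing q parameter means q=1
        for param in params:
            if param.startswith("q="):
                try:
                    q = float(param[2:])
                except ValueError:
                    q = 0.0  # treat an unparsable weight as "not acceptable"
        prefs.append((media_type.lower(), q))
    # sorted() is stable, so ties keep the order the client listed them in
    return sorted(prefs, key=lambda pair: pair[1], reverse=True)


# Hypothetical table of what the server can produce (an assumption, not
# Dataverse's actual list of exporters or media types).
AVAILABLE = {
    "application/ld+json": "Schema.org JSON-LD",
    "application/xml": "DDI",
    "text/html": "landing page",
}


def negotiate(accept_header: Optional[str]) -> tuple[int, str]:
    """Return (HTTP status, chosen media type) for a request."""
    if not accept_header:
        return 200, "text/html"  # no preference stated: assume a browser
    for media_type, q in parse_accept(accept_header):
        if q <= 0:
            continue  # q=0 means the client explicitly refuses this type
        if media_type == "*/*":
            return 200, "text/html"
        if media_type in AVAILABLE:
            return 200, media_type
    # "sorry, this content is not available in any of the formats you requested"
    return 406, ""


print(negotiate("application/ld+json, application/xml;q=0.8"))
# → (200, 'application/ld+json')
```

A client that only understands a format the server lacks (for example `negotiate("text/turtle")`) gets status 406 in this sketch, which is the standard "Not Acceptable" response.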

Dataverses already do something of this nature when they harvest from each other. And we've been thinking about extending this content negotiation mechanism further. (At this point a dataverse client says to the dataverse server "I understand the Dataverse Astronomical Sci metadata block"; it should really say "I understand the ... block, version NNN", because we've realized that the blocks are going to keep getting modified as people use them.)

This type of content negotiation seems a lot more flexible.

@pameyer
Contributor

pameyer commented Jun 26, 2017

One potential difference between this and harvesting is that there may be an assumption that the content-type negotiation is happening at the dataset landing page, instead of a separate harvesting endpoint.

@pdurbin
Member

pdurbin commented Jun 26, 2017

@jggautier so what would you consider "definition of done" to be for this issue? I think we could easily argue that Dataverse already meets the recommendation. We could write it up in the User Guide if you want. In addition to Harvesting, we have Export in various formats that are machine readable. The standards-based ones are DDI and Dublin Core: http://guides.dataverse.org/en/4.7/admin/metadataexport.html

@jggautier jggautier added the User Role: Sysadmin Installs, upgrades, and configures the system, connects via ssh label Jul 3, 2017
@jggautier
Contributor Author

jggautier commented Sep 9, 2020

Sorry for this very late reply. Guess I didn't understand enough back then, and still have some questions.

A Data Citation Roadmap for Scholarly Data Repositories recommends that "data repositories and identifier service providers such as identifiers.org or DataCite in addition may implement content negotiation for the persistent identifier expressed as HTTP URI, returning machine readable metadata in various formats." The article uses DataCite's implementation as an example:

curl -LH "Accept: application/ld+json" http://doi.org/10.5061/DRYAD.8290N returns DataCite's Schema.org JSON-LD metadata for that dataset, which is published in the Dryad repository. (See DataCite's page on content negotiation for more info.)
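For clients that are not shell scripts, the same request can be sketched in Python with only the standard library. The helper name `build_metadata_request` is made up for this illustration. The block only builds the request; the network call is left commented out.

```python
import urllib.request


def build_metadata_request(
    doi_url: str, media_type: str = "application/ld+json"
) -> urllib.request.Request:
    """Build (but do not send) a GET request asking for `media_type`."""
    return urllib.request.Request(doi_url, headers={"Accept": media_type})


req = build_metadata_request("http://doi.org/10.5061/DRYAD.8290N")

# To actually fetch the metadata (needs network access). urlopen follows
# redirects by default, much like curl's -L flag:
# with urllib.request.urlopen(req) as resp:
#     print(resp.read().decode("utf-8"))
```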

This already works for Dataverse-based repositories that publish datasets with DataCite DOIs. So systems can use this content negotiation to get metadata about datasets with DataCite DOIs published in Dataverse-based repositories. But what's returned is the metadata that DataCite publishes. This doesn't work for getting the metadata that the Dataverse repository publishes. For example:

Systems could use Dataverse's API or OAI-PMH, but in general the value of the kind of content negotiation that the article recommends is that it's standardized and more stable, right? Systems' APIs might be organized differently from each other and could change over time. And OAI-PMH supports metadata only in XML, while this type of content negotiation allows for metadata in any format, like the JSON in the Schema.org examples above.

These are the questions I'd ask to help define the "definition of done" for this issue:

  • Are there other systems that could benefit more from getting dataset metadata from Dataverse-based repositories using this sort of content negotiation (versus using Dataverse's APIs or OAI-PMH)?
  • The article's recommendation is to use PIDs with this content negotiation. If a Dataverse-based repository is registering DataCite DOIs for its datasets, then you'll only get the metadata that DataCite publishes. Does that mean it's not possible for such Dataverse-based repositories to follow these recommendations? The DataCite page on content negotiation mentions that custom content types are no longer supported. Did custom content types let repositories somehow pass their own metadata?
  • What about Dataverse-based repositories that have datasets with PIDs that aren't DataCite DOIs?

@jggautier
Contributor Author

jggautier commented Sep 22, 2020

I've been emailing the article's corresponding author Tim Clark, who's looking into the questions in the last comment.

This has also been discussed in the context of tools for assessing the "FAIR"ness of datasets, as part of the FAIRsFAIR project.

@jggautier jggautier mentioned this issue Jun 3, 2021
@hvdsomp

hvdsomp commented Jun 8, 2021

One can't implement content negotiation for URIs that are not under one's control. So, Dataverse cannot implement content negotiation for a DOI HTTP-URI because it doesn't control those DOI URIs. DataCite and CrossRef can (and do), and in doing so allow access to the metadata they have about a DOI-identified object.

Signposting offers (among other things) a way to get to metadata about the object that is available at the (Dataverse) repository's end:

  • A client follows the redirect from a DOI of an object to end up at a (Dataverse) landing page for it
  • At the landing page, the client looks for typed link(s) with the "describedby" relationship (in HTTP Link header, HTML <link> element, or Link Set)
  • To access metadata, the client dereferences/follows a link with the "describedby" relationship, possibly on the basis of a media type that was expressed for the link by means of a "type" attribute
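The second and third steps above can be sketched as a small parser for an HTTP Link header value. This is an illustration rather than a complete implementation of the Link header grammar: the function name and the example.org URLs are hypothetical, and the regular expressions assume parameter values contain no commas or semicolons.

```python
import re


def describedby_links(link_header: str) -> list[dict]:
    """Extract links whose rel includes "describedby" from a Link header value."""
    results = []
    # Each link looks like: <target-uri>; rel="..."; type="..."
    for match in re.finditer(r'<([^>]*)>((?:\s*;\s*[^;,]+)*)', link_header):
        target, raw_params = match.group(1), match.group(2)
        params = {}
        for key, value in re.findall(r';\s*([\w-]+)\s*=\s*"?([^";,]*)"?', raw_params):
            params[key.lower()] = value
        # rel may hold several space-separated relation types
        if "describedby" in params.get("rel", "").split():
            results.append({"href": target, "type": params.get("type")})
    return results


# Hypothetical header a landing page might send (URLs are placeholders).
example_header = (
    '<https://example.org/dataset/1/metadata.jsonld>; rel="describedby"; '
    'type="application/ld+json", '
    '<https://example.org/dataset/1/ddi.xml>; rel="describedby"; '
    'type="application/xml", '
    '<https://doi.org/10.5072/EXAMPLE>; rel="cite-as"'
)

links = describedby_links(example_header)
# A client preferring JSON-LD would pick the entry whose "type" matches.
jsonld = [link["href"] for link in links if link["type"] == "application/ld+json"]
```

A client would then dereference the chosen `href` with an ordinary GET request to obtain the repository's own metadata.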

@jggautier
Contributor Author

Thanks @hvdsomp. You wrote that "One can't implement content negotiation for URIs that are not under one's control." That's an incredibly helpful way to put it. It doesn't seem like this is a recommendation that data repositories can actually implement then, right?

We can encourage the people who do control those URIs but haven't implemented content negotiation to implement content negotiation. I'm not sure how Handles work differently than DOIs, but there are at least 7 Dataverse repositories using them, and I'm not sure if content negotiation works for their Handle URIs. curl -LH "Accept: application/ld+json" https://hdl.handle.net/11529/10548581 doesn't seem to work.

Does anyone keeping an eye on this GitHub issue know more about Handles or know someone who knows more? I'll wait a week before asking in other channels (Dataverse Google Group, Code4Lib mailing list, emailing admins of repositories using Handles).

@jggautier
Contributor Author

I asked in the Dataverse Google Group but haven't had any replies yet.

At @pdurbin's suggestion I also posted questions in the PID Forum, where I also referenced an older post in that forum that makes me question my understanding of this tenth recommendation and of content negotiation in general.

@hvdsomp

hvdsomp commented Jul 29, 2021

The technology underlying handles and DOIs is the same, or, to put it differently, DOIs are handles. But organizations like CrossRef and DataCite have implemented a lot of functionality on top of DOIs, including content negotiation with the DOI-HTTP-URI as a means to obtain metadata in various formats, see e.g. https://www.crossref.org/documentation/retrieve-metadata/content-negotiation/.

@cmbz

cmbz commented Aug 20, 2024

To focus on the most important features and bugs, we are closing issues created before 2020 (version 5.0) that are not new feature requests with the label 'Type: Feature'.

If you created this issue and you feel the team should revisit this decision, please reopen the issue and leave a comment.

@cmbz cmbz closed this as completed Aug 20, 2024