As a researcher, I want more dataset metadata in schema.org exports so that my data is more discoverable #4371

jggautier · 2017-12-08T21:01:45Z

In issue #2243, some metadata fields important for dataset discovery were excluded from mapping to Schema.org. We said we'd include them in a later issue. This is that issue, and these are those fields (dun dun):

Creator types (person or organization) for dataset authors (a separate ticket has been opened for this, Improving Dataverse's Schema.org JSON-LD schema to enable author names display in Google Dataset Search's #5029, since the missing creator types are making Google's Dataset Search engine not display creator names and there are UI implications)
Dataset identifier (DOI or HDL as a URL) using the @id property
Dataset identifier (DOI or HDL as a URL) using the url property
Name of funding source (that's all schema.org supports for now; how to include other funding source details, like grant numbers, is discussed in this GitHub issue in schema.org's repo)
Author identifiers
Geographic coverage
Multiple dataset descriptions
File metadata (see schema.org from Zenodo and from ICPSR for an example of how these fields are used)
- File PID
- File download URL (when there is one - excludes restricted files and files in datasets with guestbooks)
- File name
- File description
- File format

We'll also need to fix:

The property used for dataset authors. Dataverse is using author, but I think Google Dataset Search is ignoring author and prefers creator. (See this comment on Improving Dataverse's Schema.org JSON-LD schema to enable author names display in Google Dataset Search's #5029)
Keywords and Topic Classifications (Dataverse concatenates each keyword into one value, and each topic classification into one value; see this dataset's schema.org metadata)
Provider: Change the value to use the installation name (instead of hardcoding "Dataverse")
- For the provider property we hardcode "Dataverse" and put the installation name in the DataCatalog name property, but Dataset Search is displaying a "Data provided by" field and is using what's in the provider property.
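The three fixes listed above can be sketched roughly as follows. This is only an illustration in Python, not Dataverse's exporter code: the function name `build_dataset_jsonld`, its parameters, and the sample values are all hypothetical.

```python
import json


def build_dataset_jsonld(title, persistent_url, authors, keywords, installation_name):
    """Build a minimal schema.org Dataset description as a plain dict.

    All names here are hypothetical; this is not the Dataverse exporter.
    """
    return {
        "@context": "http://schema.org",
        "@type": "Dataset",
        "@id": persistent_url,
        "identifier": persistent_url,
        "name": title,
        # Use "creator" rather than "author", with an explicit type per entry.
        "creator": [{"@type": a["type"], "name": a["name"]} for a in authors],
        # One array element per keyword instead of one concatenated string.
        "keywords": list(keywords),
        # Installation name instead of a hardcoded "Dataverse".
        "provider": {"@type": "Organization", "name": installation_name},
    }


if __name__ == "__main__":
    doc = build_dataset_jsonld(
        title="Example dataset",
        persistent_url="https://doi.org/10.5072/FK2/EXAMPLE",  # made-up test DOI
        authors=[{"type": "Person", "name": "Doe, Jane"}],
        keywords=["surveys", "elections"],
        installation_name="Example Dataverse",
    )
    print(json.dumps(doc, indent=2))
```

Keeping each keyword as its own array element lets a consumer treat them as separate terms, which is the point of the keyword fix.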

Which fields are added to the Schema.org metadata template (draft) and how they're mapped will probably be adjusted after community discussion (within Dataverse community and hopefully with a proposed RDA group focused on ways to make data more discoverable by search engines).

@scolapasta asked me to add to the definition of done that we should make sure that the methods used to pull metadata values from different fields into different exports (DDI, DC, DataCite, Schema.org, native JSON (?)) are consistent.


jggautier · 2018-07-03T18:20:52Z

After a discussion in today's Community Call about sending DataCite file-level metadata that includes the file checksum, @mercecrosas added in this Google Groups comment a table recommending ways to map more dataset metadata to schema.org.

The table includes more metadata fields that have been added.

An open question, which might deserve its own GitHub issue, is whether Dataverse should produce schema.org metadata at the file level.

djbrooke · 2018-10-03T18:34:01Z

Let's exclude File Download URL for now. It can follow on in a separate issue.

pameyer · 2018-10-10T18:47:03Z

Thanks to @jggautier I was able to track down some tools for validating schema.org JSON-LD. https://github.com/jessedc/ajv-cli can be used from the command line to validate the JSON against a schema, after using https://github.com/scrapinghub/extruct to retrieve the JSON-LD from the generated html/xhtml.
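For anyone who wants to try the same two-step idea (pull the JSON-LD out of the page, then check it) without installing anything, here is a rough standard-library sketch. It is a stand-in for extruct and ajv-cli, not a use of either tool. `extract_jsonld`, `missing_properties`, and the list of required properties are assumptions made for illustration, and the check is far weaker than real JSON Schema validation.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the parsed contents of <script type="application/ld+json"> tags."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._chunks = []
        self.documents = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._chunks = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.documents.append(json.loads("".join(self._chunks)))
            self._in_jsonld = False


def extract_jsonld(html):
    """Return every JSON-LD document embedded in the given HTML string."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return parser.documents


def missing_properties(doc, required=("@context", "@type", "name")):
    """Return the required top-level properties that are absent from doc.

    The default 'required' tuple is an illustrative assumption.
    """
    return [key for key in required if key not in doc]
```

Feeding it the HTML of a dataset page and calling `missing_properties` on each returned document gives a quick smoke test before running a full schema validator.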


pdurbin · 2018-11-05T18:58:22Z

I just made the changes @jggautier @scolapasta @djbrooke @kcondon and I agreed to after standup:

3a95edf remove "schemaVersion" from output #4371
50bc8ca remove "url" from output #4371
882cbfb clarify publisher vs. provider comment #4371
7cd5622 single "isValidAuthorIdentifier" method, pass in regex #4371

kcondon · 2018-11-05T22:30:12Z

So, aside from internal code restructuring, this PR:
- adds new fields to schema.org
(create/export dataset, verify against list Julian provides)
- changes the structure of some fields in schema.org (multiple, object type)
(same as above, add multiple where appropriate, paste into Google validation tool)
- adds optional hide files JVM option to block download URLs in export
(verify on/off behavior and public/restricted file behavior)
- publisher and provider will be the instance name (root dv)
(verify against export)

kcondon · 2018-11-06T19:38:22Z

Issues/questions:

files/distribution section contains additional, unspecified info: file name, pid.
author id is missing if the value is entered in a nonconforming format, but there is no indication of what the conforming format is.

Discussed above with Julian and he will complete review. Will discuss with Julian and Phil to see what needs to be addressed.
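To illustrate the author identifier behavior described above, here is a minimal Python sketch of a single validation function that takes the pattern as an argument, in the spirit of the "single isValidAuthorIdentifier method, pass in regex" commit. The regex Dataverse actually uses is not shown in this thread, so `ORCID_PATTERN` below is an assumption based on the usual ORCID iD layout (four groups of four characters, the last of which may be `X`), and `build_creator` is a hypothetical helper.

```python
import re

# Assumed pattern for a bare ORCID iD such as 0000-0002-1825-0097.
ORCID_PATTERN = r"^\d{4}-\d{4}-\d{4}-\d{3}[\dX]$"


def is_valid_author_identifier(value, pattern):
    """Return True when value matches the given identifier pattern."""
    return re.match(pattern, value) is not None


def build_creator(name, identifier=None, pattern=ORCID_PATTERN):
    """Build a creator entry, leaving out an identifier that does not conform."""
    creator = {"@type": "Person", "name": name}
    if identifier and is_valid_author_identifier(identifier, pattern):
        creator["identifier"] = identifier
    return creator
```

With this shape, a value such as `https://orcid.org/0000-0002-1825-0097` is silently dropped when the pattern expects the bare iD, which matches the symptom reported above and shows why the expected format should be surfaced to the person entering it.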

jggautier · 2018-11-06T20:26:57Z

Another issue:

The URLs for the related publications show up wrapped in html:

"citation": [
    {
      "@type": "CreativeWork",
      "text": "Related pub citation 1",
      "@id": "<a href=\"https://doi.org/10.7910/DVN/P7EVGF\" target=\"_blank\">https://doi.org/10.7910/DVN/P7EVGF</a>",
      "identifier": "<a href=\"https://doi.org/10.7910/DVN/P7EVGF\" target=\"_blank\">https://doi.org/10.7910/DVN/P7EVGF</a>"
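One way to get a bare URL out of a value like the ones above is to unwrap the anchor before it is written to the export. The Python sketch below only illustrates the idea. `unwrap_anchor` is a hypothetical helper, and the regular expression assumes the value is a single `<a href="...">` element as in this example.

```python
import re

# Matches a value that consists of exactly one anchor element.
_ANCHOR = re.compile(
    r'^\s*<a\s[^>]*href="([^"]+)"[^>]*>.*?</a>\s*$',
    re.IGNORECASE | re.DOTALL,
)


def unwrap_anchor(value):
    """Return the href of a lone <a> element, or the value unchanged."""
    match = _ANCHOR.match(value)
    return match.group(1) if match else value
```

Values that are already plain URLs pass through untouched, so the helper can be applied to every `@id` and `identifier` without checking first.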


jggautier · 2018-11-06T22:26:22Z

@id is missing from the files/distribution section @kcondon mentioned in his comment (@pdurbin and I agreed to keep the extra file info.) We're always using @id whenever identifier is used. I discussed with @pdurbin and he'll update.

"distribution": [
    {
      "@type": "DataDownload",
      "name": "file1.txt",
      "fileFormat": "text/plain",
      "contentSize": 26,
      "description": "File description 1",
      "identifier": "https://hdl.handle.net/20.500.12050/FK2/TWFVRE/222222",
      "@id": "https://hdl.handle.net/20.500.12050/FK2/TWFVRE/222222",
      "contentUrl": "https://demo.dataverse.org/api/access/datafile/:persistentId?persistentId=doi:10.5072/FK2/CFWNSH/ZEHFD0"

(contentUrl should appear only when the installation indicates that they want download URLs appearing in their schema.org exports.)
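As a rough illustration of the two points above (`@id` alongside `identifier`, and `contentUrl` only when the installation wants download URLs exported), a file entry could be assembled like this. The function and the `include_download_urls` flag are hypothetical names; the actual setting is the JVM option mentioned earlier in the thread, whose name is not given here.

```python
def build_distribution_entry(name, file_format, size_bytes, description,
                             pid_url, download_url, include_download_urls):
    """Build one schema.org DataDownload entry for a file (illustrative only)."""
    entry = {
        "@type": "DataDownload",
        "name": name,
        "fileFormat": file_format,
        "contentSize": size_bytes,
        "description": description,
        # @id always mirrors identifier.
        "@id": pid_url,
        "identifier": pid_url,
    }
    # Only expose the download URL when the installation has opted in
    # and a URL is actually available for this file.
    if include_download_urls and download_url:
        entry["contentUrl"] = download_url
    return entry
```

Leaving the key out, rather than emitting it with an empty value, keeps the export free of properties a consumer would have to special-case.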

pdurbin · 2018-11-06T22:28:01Z

Yep, I got rid of the "href" stuff in fcae94e and added @id at the file level in 0e0b55d.

jggautier · 2018-11-07T15:53:58Z

I looked at the schema.org export and all four issues are resolved!

@kcondon noticed that contentUrl isn't showing up in the schema.org export of a test dataset, although we expect it to. (It's the dataset titled "Test Schema Org Julian 5 Schema" on the "internal" test instance.)

pdurbin · 2018-11-07T20:17:49Z

For the record, as discussed with @kcondon and @jggautier, contentUrl wasn't being shown because the FileUtil.isPubliclyDownloadable logic is used and the dataset had terms of use. That logic also checks for guestbooks. Both of these require a popup to agree to or fill out in the UI.
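The rule described here can be summarized in a few lines. This Python sketch is a paraphrase of the behavior discussed in this thread, not the Java implementation of `FileUtil.isPubliclyDownloadable`. The `FileAccess` class and its field names are made up for the example.

```python
from dataclasses import dataclass


@dataclass
class FileAccess:
    """Hypothetical summary of what gates a file download."""
    restricted: bool
    dataset_has_terms_of_use: bool
    dataset_has_guestbook: bool


def is_publicly_downloadable(access):
    """A file counts as public only if nothing requires a popup or permission."""
    return not (
        access.restricted
        or access.dataset_has_terms_of_use
        or access.dataset_has_guestbook
    )
```

Under this rule, the test dataset with terms of use yields no `contentUrl` even though its files are not restricted, which is the behavior @kcondon observed.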

jggautier added the Feature: Metadata label Dec 8, 2017

jggautier changed the title ~~As a researcher, I want more metadata in schema.org exports so that my data is more discoverable~~ As a researcher, I want more dataset metadata in schema.org exports so that my data is more discoverable Jul 17, 2018

jggautier mentioned this issue Sep 18, 2018

Spike: Investigate how Dataverse stakeholders and users need to collect and use funder metadata #4859

Closed

djbrooke added Status: Backlog and removed Status: Backlog labels Sep 25, 2018

djbrooke assigned jggautier Oct 1, 2018

djbrooke removed the ready for estimation label Oct 3, 2018

djbrooke unassigned jggautier Oct 3, 2018

pdurbin mentioned this issue Oct 4, 2018

Improving Dataverse's Schema.org JSON-LD schema to enable author names display in Google Dataset Search's #5029

Closed

sekmiller added Status: Development and removed Status: This/Next Sprint labels Oct 4, 2018

sekmiller self-assigned this Oct 4, 2018

This was referenced Oct 5, 2018

Direct access to content associated with a DOI datacite/freya#2

Closed

Binder integration whole-tale/whole-tale#35

Open

resourceType for dataset files #5086

Open

pdurbin assigned pdurbin and unassigned sekmiller Oct 10, 2018

pdurbin added a commit that referenced this issue Oct 11, 2018

change "author" to "creator" #4371

a6915d7

"I think Google Dataset Search is ignoring author and prefers creator"

pdurbin mentioned this issue Oct 11, 2018

improved Schema.org JSON-LD output #5169

Merged

pdurbin added a commit that referenced this issue Oct 11, 2018

assert how "keywords" works, increase code coverage #4371

fe01e80

from 41.16% to 42.94% for DatasetVersion

pdurbin added a commit that referenced this issue Oct 11, 2018

stop hard-coding "Dataverse" as the provider #4371

101bd81

Use the installation brand name instead.

pdurbin added a commit that referenced this issue Oct 11, 2018

add @id and url as persistent URL #4371

4ca8f38

pdurbin added a commit that referenced this issue Oct 11, 2018

add funder #4371

043080d

pdurbin added a commit that referenced this issue Oct 11, 2018

add author identifier #4371

0d260fa

pdurbin added a commit that referenced this issue Oct 11, 2018

add spatialCoverage #4371

a84f1d1

pdurbin added a commit that referenced this issue Nov 5, 2018

remove "schemaVersion" from output #4371

3a95edf

pdurbin added a commit that referenced this issue Nov 5, 2018

remove "url" from output #4371

50bc8ca

pdurbin added a commit that referenced this issue Nov 5, 2018

clarify publisher vs. provider comment #4371

882cbfb

jggautier mentioned this issue Nov 5, 2018

As a curator, I want to more easily add metadata about related resources so that it's more discoverable #5277

Open

pdurbin added a commit that referenced this issue Nov 5, 2018

single "isValidAuthorIdentifier" method, pass in regex #4371

7cd5622

pdurbin added a commit that referenced this issue Nov 5, 2018

Merge branch 'develop' into 4371-schemaorg #4371

55fbb93

pdurbin added Status: QA and removed Status: Development labels Nov 5, 2018

pdurbin unassigned pdurbin and jggautier Nov 5, 2018

kcondon self-assigned this Nov 5, 2018

pdurbin added a commit that referenced this issue Nov 5, 2018

switch to NullSafeJsonBuilder #4371

4ed14e2

kcondon assigned jggautier Nov 6, 2018

pdurbin added a commit that referenced this issue Nov 6, 2018

Prevent href, target=_blank from getting into Schema.org JSON-LD output

fcae94e

#4371

pdurbin added a commit that referenced this issue Nov 6, 2018

add @id along side identifier at file level #4371

0e0b55d

kcondon closed this as completed Nov 7, 2018

kcondon removed the Status: QA label Nov 7, 2018

pdurbin added this to the 4.10 - Additional Data Transfer Options milestone Dec 19, 2018

jggautier mentioned this issue Jan 11, 2019

Schema.org dataset metadata doesn't include file PIDs on Harvard Dataverse #5458

Closed

pdurbin mentioned this issue Nov 12, 2019

Author/User Identifiers: Suggested Additions #1375

Closed

jggautier mentioned this issue Oct 22, 2020

Improve/update Schema.org JSON-LD export #7349

Closed

jggautier mentioned this issue Feb 5, 2024

Feature Request/Idea: Include Grant ID in "Schema.org JSON-LD" metadata export format #10296

Open
