Add schema.org markup to Dataset pages #2243

Closed
borsna opened this issue Jun 5, 2015 · 60 comments

Comments

@borsna

borsna commented Jun 5, 2015

This would make it easier for search engines to parse information about the title, author, time period, etc.

Relevant types to do markup for:
http://schema.org/Dataset and http://schema.org/DataCatalog

Validation and testing of markup can be done on this page:
https://developers.google.com/structured-data/testing-tool/

The markup can be done directly in the HTML template.
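
To illustrate the kind of template markup proposed here, the following is a minimal sketch in Python that renders an HTML fragment annotated with schema.org microdata. The function name `render_dataset_microdata` and the sample values are hypothetical and are not taken from the Dataverse code base; `itemscope`, `itemtype`, and `itemprop` are the standard microdata attributes.

```python
from html import escape


def render_dataset_microdata(title, authors, temporal_coverage):
    """Render a small HTML fragment annotated with schema.org microdata.

    Hypothetical helper for illustration; a real template would carry
    many more properties.
    """
    author_items = "\n".join(
        f'  <span itemprop="author">{escape(name)}</span>' for name in authors
    )
    return (
        '<div itemscope itemtype="http://schema.org/Dataset">\n'
        f'  <h1 itemprop="name">{escape(title)}</h1>\n'
        f"{author_items}\n"
        f'  <span itemprop="temporalCoverage">{escape(temporal_coverage)}</span>\n'
        "</div>"
    )


# Sample values are made up for the example.
print(render_dataset_microdata("Example Survey", ["Doe, Jane"], "2010/2014"))
```

The resulting fragment can be pasted into a structured data testing tool to check that the properties are detected.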

@posixeleni
Contributor

Related to: #1393

@bencomp
Contributor

bencomp commented Jul 15, 2015

I don't know whether having this, or using the meta tags as in #1393 would have a greater impact. In theory this isn't a big task and it is invisible to users in the browser (unless some browser plugin detects the markup and acts on it, of course).

In terms of semantics: this is one vocabulary for expressing metadata. For the citation metadata (block), the fields map very well to either Schema.org or DC Terms/Elements. This mapping would be implicit business logic if you were to just go ahead and make changes to the UI without making the connection between metadata fields in Dataverse and ontology properties like the ones in DC Terms. The idea I'm trying to get at is similar to what I mentioned in comments on #947, but for the field names instead of field values.

Let me create a new issue for this. I can't believe I didn't do so yet :)

@borsna
Author

borsna commented Jul 15, 2015

@bencomp I think both the meta tags and tagging in the markup should be implemented.
If more portals provide this kind of tagging, search engines will do a better job of finding public datasets through schema.org tagging; it's already widely used for events, movies, etc.

ICPSR and other data archives are already adding the markup for the basic set of fields.
Test this URL in the validator: http://www.icpsr.umich.edu/icpsrweb/ICPSR/studies/36057

If I understand your comment, the idea is to provide a mapping for custom fields added to a dataset?

@bencomp
Contributor

bencomp commented Jul 15, 2015

Just came across this article about SEO for libraries, including adding Schema.org to pages: https://journal.lib.uoguelph.ca/index.php/perj/article/view/3328/0

@borsna the ideas outlined in #2357 concern "custom" fields added to an installation of Dataverse. I'm actually not sure you can add fields to a single dataset only. These Dataverse-wide fields actually come from existing ontologies, like DDI and ISA-Tab, but the only way to read the definitions of the fields is to parse the text files in the source code.

@borsna
Author

borsna commented Jul 16, 2015

Interesting article, thanks for the link :)

@bencomp okay, was not thinking about unique custom fields for a single dataset, rather configured fields for a dataverse installation or similar.

@pdurbin
Member

pdurbin commented Nov 5, 2016

@borsna #2717 is related and there's been some discussion there in the past week or two.

@pdurbin
Member

pdurbin commented Jan 26, 2017

This was posted two days ago: https://research.googleblog.com/2017/01/facilitating-discovery-of-public.html . Thanks for pointing it out, @eugene-barsky

@dlmurphy
Contributor

dlmurphy commented Feb 2, 2017

Related to @pdurbin's previous comment: Google has recently published new guidelines for describing scientific datasets using Schema.org vocabulary: https://developers.google.com/search/docs/data-types/datasets

These guidelines refer to Schema.org's list of markup properties related to datasets: http://schema.org/Dataset

Schema.org represents a collaboration between all major search engine companies (Google, Microsoft, Yahoo, and Yandex) and has been developed to support each of these search engines. As such, marking up dataset pages using Google's recommended Schema.org metadata fields would likely improve each of these search engines' ability to display relevant results from Dataverse. It would also likely increase the pagerank of our dataset pages, meaning Dataverse datasets would appear more frequently and more visibly in search results. This would help our datasets be more discoverable to the public.

Libraries and data repositories frequently make use of Schema.org markup for these reasons. Viewing the page source of a Mendeley Data dataset page provides a solid example of how Schema.org markup can be implemented by a data repository.

It's also worth noting that Schema.org can incorporate Dublin Core (AKA DC) terms by using a "dc" prefix, though this does not conform to Google's recommendations for marking up datasets.

djbrooke changed the title from "Add schema.org markup on landing pages" to "Add schema.org markup to Dataset pages" Feb 15, 2017
@jggautier
Contributor

jggautier commented Feb 23, 2017

This is also one of the 11 recommendations made in A Data Citation Roadmap for Scholarly Data Repositories (https://doi.org/10.1101/097196).

I started mapping the Schema.org elements Google recommends to elements in Dataverse's citation metadata block. (Each tab on that spreadsheet is a different metadata block.)

We can also expose file and variable level metadata with Schema.org. I'm thinking those mappings (when Schema.org terms exist for it) can be recorded in that spreadsheet's other tabs and possibly here, where ingested tabular file metadata is listed and mapped to other standards. I'm not sure how accurate this last spreadsheet is.

@jggautier
Contributor

Right now Google's recommended Schema.org properties don't include dataset persistent IDs, but persistent IDs should be embedded in dataset landing pages in JSON-LD as well.
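
As a rough sketch of what embedding a persistent ID could look like, the Python below builds a small JSON-LD document and wraps it in a script element for the landing page. The helper name `dataset_json_ld_script` and the DOI used in the example are hypothetical; placing the persistent ID in the `identifier` property is an assumption made for illustration, not something Google's guidelines prescribed at the time.

```python
import json


def dataset_json_ld_script(name, persistent_url):
    """Return a <script> element carrying schema.org JSON-LD for a dataset.

    Hypothetical helper: the property selection is deliberately minimal.
    """
    doc = {
        "@context": "http://schema.org",
        "@type": "Dataset",
        "name": name,
        # Assumption: the persistent ID is exposed as a resolvable URL.
        "identifier": persistent_url,
    }
    # Escape "</" so the JSON cannot close the surrounding script element early.
    payload = json.dumps(doc, indent=2).replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{payload}\n</script>'


# The DOI below uses a test prefix and is made up for the example.
print(dataset_json_ld_script("Example Survey", "https://doi.org/10.5072/FK2/EXAMPLE"))
```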

@pdurbin
Member

pdurbin commented Apr 18, 2017

> persistent IDs should be embedded in dataset landing pages in json-ld as well

@jggautier (and others) you might be interested in @csarven saying, "Virtually nothing in particular is consuming granular citations in Linked Data." More at https://gitter.im/linkedresearch/chat?at=58f5f3acad849bcf42962e56

@jggautier
Contributor

Interesting conversation. Thanks @pdurbin!

@pdurbin
Member

pdurbin commented Jun 23, 2017

@borsna heads up that #1393 is shipping with Dataverse 4.7.

Are you or others still interested in this schema.org feature?

@pdurbin
Member

pdurbin commented Jun 25, 2017

Related: #3700

pdurbin added the "User Role: Superuser" label and removed the "zTriaged" label Jun 30, 2017
@landreev
Contributor

OK, I checked in the code to address the items from the latest checklist;
the JSON fragment now passes validation by the Google test tool
(that is, my test dataset passes validation).

@landreev
Contributor

The only thing I had to add to what's specified above: the "author" entry needs an additional "@type": "Person" attribute for the whole thing to be valid. (I've updated the Google doc to reflect this.)

landreev added a commit that referenced this issue Nov 17, 2017
@landreev
Contributor

I've checked in the last (I hope) change, which makes the JSON-LD fragment appear in the LATEST published version ONLY.
OK to drag it directly into QA? It looks like it's been reviewed to death already, right?
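
The "latest published version only" rule can be sketched as a small guard. This is an illustrative Python model, not the Dataverse implementation (which lives in Java); the `Version` class and both function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Version:
    """Hypothetical stand-in for a dataset version."""
    number: int
    released: bool  # True once published, False for drafts


def latest_published(versions: List[Version]) -> Optional[Version]:
    """Return the highest-numbered published version, or None if none exist."""
    published = [v for v in versions if v.released]
    return max(published, key=lambda v: v.number) if published else None


def should_embed_json_ld(viewed: Version, versions: List[Version]) -> bool:
    """Embed the fragment only when the viewed version is the latest published one."""
    latest = latest_published(versions)
    return latest is not None and viewed == latest


history = [Version(1, True), Version(2, True), Version(3, False)]
print(should_embed_json_ld(history[1], history))
```

With this rule, drafts and older published versions render without the fragment, so search engines only see metadata for the current public version.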

@jggautier
Contributor

jggautier commented Nov 17, 2017

> The only thing I had to add, to what's specified above - the "author" entry needs to have an additional "@type: Person" attribute for the whole thing to be valid. (I've updated the Googledoc to reflect this)

Thanks for checking @landreev. I spoke with Natasha about this issue, and we agreed it's okay to ignore the warning that Google's tool gives when @type is "Thing" (which it defaults to when there's no @type) and an affiliation is included. The less-preferred alternatives are (1) saying that every author is a person, which isn't true, and Dataverse has no way of knowing which author is a person and which is an organization (the other @type), or (2) not including an affiliation.
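
The decision described above can be sketched as follows: each author entry carries a name and, when available, an affiliation, but no "@type". This is a minimal Python illustration with a hypothetical helper name and made-up sample values; it is not the project's code.

```python
import json


def author_entry(name, affiliation=None):
    """Build one "author" item without asserting Person or Organization.

    Hypothetical helper. Leaving out "@type" means consumers treat the item
    as a "Thing", which is what triggers the validator warning when an
    affiliation is present.
    """
    entry = {"name": name}
    if affiliation:  # skip empty or missing affiliations
        entry["affiliation"] = affiliation
    return entry


authors = [
    author_entry("Doe, Jane", "Example University"),
    author_entry("Example Research Group"),
]
print(json.dumps({"author": authors}, indent=2))
```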

@landreev
Contributor

OK, I reversed the type=person change.

pdurbin added a commit that referenced this issue Nov 17, 2017
Conflicts (just imports:
src/main/java/edu/harvard/iq/dataverse/DatasetPage.java
kcondon self-assigned this Nov 20, 2017
kcondon closed this as completed Nov 20, 2017
djbrooke added this to the 4.8.4 schema.org support milestone Nov 21, 2017
xibriz added a commit to uit-no-old/dataverse that referenced this issue Dec 4, 2017
commit e19a346
Author: Ruben Andreassen <rubean85@gmail.com>
Date:   Mon Dec 4 12:20:54 2017 +0100

    Forgot username

commit 0d478a7
Merge: 45288aa 8aa4150
Author: Ruben Andreassen <rubean85@gmail.com>
Date:   Mon Dec 4 10:56:10 2017 +0100

    Merge dataporten into 4334-oauth-dataporten

commit 45288aa
Merge: caf6371 4648b6a
Author: Ruben <rubean85@gmail.com>
Date:   Fri Dec 1 14:45:44 2017 +0100

    Merge pull request #1 from IQSS/develop

    test

commit 4648b6a
Merge: 0f36aa0 fff836c
Author: kcondon <kcondon@hmdc.harvard.edu>
Date:   Thu Nov 30 18:44:35 2017 -0500

    Merge pull request IQSS#4331 from IQSS/4330-no-affiliation

    add null check for datasetAuthor.getAffiliation() IQSS#4330

commit fff836c
Author: Philip Durbin <philip_durbin@harvard.edu>
Date:   Thu Nov 30 16:39:26 2017 -0500

    add null check for datasetAuthor.getAffiliation() IQSS#4330

commit 0f36aa0
Merge: e2878ce fad8669
Author: kcondon <kcondon@hmdc.harvard.edu>
Date:   Thu Nov 30 15:07:54 2017 -0500

    Merge pull request IQSS#4325 from IQSS/4324-header-padding

    Fixed padding layout issue with dataverse name text link in header IQSS#4324

commit fad8669
Author: Michael Heppler <mheppler@hmdc.harvard.edu>
Date:   Thu Nov 30 10:14:53 2017 -0500

    Fixed padding layout issue with dataverse name text link in header. [ref IQSS#4324]

commit e2878ce
Merge: d785c5c cb9647f
Author: kcondon <kcondon@hmdc.harvard.edu>
Date:   Wed Nov 29 18:22:53 2017 -0500

    Merge pull request IQSS#4305 from IQSS/4304-navbar-search

    use "?" (`&IQSS#63;`) rather than "&" (`&IQSS#38;`) before "q" IQSS#4304

commit d785c5c
Merge: a881f36 3cc02d0
Author: kcondon <kcondon@hmdc.harvard.edu>
Date:   Wed Nov 29 18:19:25 2017 -0500

    Merge pull request IQSS#4302 from IQSS/3700-export-schema.org

    implement export of schema.org JSON-LD IQSS#3700

commit 3cc02d0
Author: Philip Durbin <philip_durbin@harvard.edu>
Date:   Wed Nov 29 12:53:04 2017 -0500

    have dataset page get cached JSON-LD, if available IQSS#3700

commit 84224bd
Author: Philip Durbin <philip_durbin@harvard.edu>
Date:   Wed Nov 29 12:45:53 2017 -0500

    guard against null terms.getTermsOfUse() IQSS#3700

commit ba9c6bd
Author: Philip Durbin <philip_durbin@harvard.edu>
Date:   Wed Nov 29 12:28:16 2017 -0500

    API: document "schema.org" as a supported export format IQSS#3700

commit e5c2528
Author: Philip Durbin <philip_durbin@harvard.edu>
Date:   Wed Nov 29 12:11:17 2017 -0500

    capitalize Schema.org in guides IQSS#3700

commit 086824d
Author: Philip Durbin <philip_durbin@harvard.edu>
Date:   Wed Nov 29 10:57:32 2017 -0500

    note that we know "affliation" throws a warning IQSS#3700

commit a881f36
Merge: b20ab14 23b865c
Author: kcondon <kcondon@hmdc.harvard.edu>
Date:   Tue Nov 28 16:28:04 2017 -0500

    Merge pull request IQSS#4312 from IQSS/4197-bundle-error

    Fixed bundle reference to "parent" dataverse for Theme + Widget pg IQSS#4197

commit 34859e7
Merge: 2f278cc b20ab14
Author: Philip Durbin <philip_durbin@harvard.edu>
Date:   Tue Nov 28 16:24:56 2017 -0500

    Merge branch 'develop' into 3700-export-schema.org IQSS#3700

commit 23b865c
Author: Michael Heppler <mheppler@hmdc.harvard.edu>
Date:   Tue Nov 28 14:42:12 2017 -0500

    Fixed bundle reference to "parent" dataverse for Theme + Widget pg. [ref IQSS#4197]

commit b20ab14
Merge: caf6371 8e6354a
Author: kcondon <kcondon@hmdc.harvard.edu>
Date:   Tue Nov 28 14:01:39 2017 -0500

    Merge pull request IQSS#4277 from IQSS/4197-dv-header

    4197 dv header

commit 8e6354a
Author: Michael Heppler <mheppler@hmdc.harvard.edu>
Date:   Tue Nov 28 13:23:15 2017 -0500

    Changed references from "customization" to "theme" in Theme + Widgets pg. [ref IQSS#4197]

commit c312a85
Author: Derek Murphy <dlmurphy@g.harvard.edu>
Date:   Tue Nov 28 13:05:39 2017 -0500

    Doc rewrites [IQSS#4197]

    Rewrote some text on the config page for clarity, changed terminology
    usage in dataverse management page to make it more consistent

commit f68b81d
Author: Michael Heppler <mheppler@hmdc.harvard.edu>
Date:   Tue Nov 28 12:15:40 2017 -0500

    Removed commented out theme logic found in QA. [ref IQSS#4197]

commit 624922f
Author: Philip Durbin <philip_durbin@harvard.edu>
Date:   Tue Nov 28 11:09:26 2017 -0500

    when adding row to dataversetheme, use white instead of gray IQSS#4197

commit cb9647f
Author: Philip Durbin <philip_durbin@harvard.edu>
Date:   Mon Nov 27 10:27:30 2017 -0500

    use "?" (&IQSS#63;) rather than "&" (&IQSS#38;) before "q" IQSS#4304

commit d8028f1
Merge: 36d9228 caf6371
Author: Philip Durbin <philip_durbin@harvard.edu>
Date:   Mon Nov 27 09:33:03 2017 -0500

    Merge branch 'develop' into 4197-dv-header IQSS#4197

commit 2f278cc
Author: Philip Durbin <philip_durbin@harvard.edu>
Date:   Wed Nov 22 12:33:56 2017 -0500

    cleanup IQSS#3700

commit b00d4d6
Author: Philip Durbin <philip_durbin@harvard.edu>
Date:   Wed Nov 22 12:28:25 2017 -0500

    capitalize "Schema.org" IQSS#3700

commit 8f52663
Author: Philip Durbin <philip_durbin@harvard.edu>
Date:   Wed Nov 22 11:06:41 2017 -0500

    implement export of schema.org JSON-LD IQSS#3700

commit caf6371
Merge: c67a39f d80b9d1
Author: kcondon <kcondon@hmdc.harvard.edu>
Date:   Tue Nov 21 16:29:07 2017 -0500

    Merge pull request IQSS#4297 from IQSS/orcid_v21

    orcid v2.1 changes (mainly https for profile page link)

commit c67a39f
Merge: 0918fae a756751
Author: kcondon <kcondon@hmdc.harvard.edu>
Date:   Mon Nov 20 15:48:37 2017 -0500

    Merge pull request IQSS#4252 from IQSS/2243-schema.org-json-ld

    2243 schema.org json ld

commit d80b9d1
Author: Pete Meyer <pameyer@crystal.harvard.edu>
Date:   Mon Nov 20 14:32:09 2017 -0500

    orcid v2.1 changes (mainly https for profile page link)

commit 0918fae
Merge: 3013c0d dcfcbaf
Author: kcondon <kcondon@hmdc.harvard.edu>
Date:   Mon Nov 20 14:31:41 2017 -0500

    Merge pull request IQSS#4276 from IQSS/4250-ingest-failed

    make it clear that file upload is complete IQSS#4250

commit 3013c0d
Merge: b4cea62 3f0f7e8
Author: kcondon <kcondon@hmdc.harvard.edu>
Date:   Mon Nov 20 14:21:37 2017 -0500

    Merge pull request IQSS#4275 from IQSS/4262-describe-method

    move `describe` from EjbDataverseEngine to Command interface IQSS#4262

commit 36d9228
Merge: d612189 b4cea62
Author: Philip Durbin <philip_durbin@harvard.edu>
Date:   Fri Nov 17 16:38:34 2017 -0500

    Merge branch 'develop' into 4197-dv-header IQSS#4197

commit dcfcbaf
Merge: 268c3dc b4cea62
Author: Philip Durbin <philip_durbin@harvard.edu>
Date:   Fri Nov 17 16:36:21 2017 -0500

    Merge branch 'develop' into 4250-ingest-failed IQSS#4250

commit 3f0f7e8
Merge: 633a19d b4cea62
Author: Philip Durbin <philip_durbin@harvard.edu>
Date:   Fri Nov 17 16:33:37 2017 -0500

    Merge branch 'develop' into 4262-describe-method IQSS#4262

commit a756751
Merge: eec1163 b4cea62
Author: Philip Durbin <philip_durbin@harvard.edu>
Date:   Fri Nov 17 16:32:43 2017 -0500

    Merge branch 'develop' into 2243-schema.org-json-ld IQSS#2243

    Conflicts (just imports:
    src/main/java/edu/harvard/iq/dataverse/DatasetPage.java

commit eec1163
Author: Leonid Andreev <leonid@hmdc.harvard.edu>
Date:   Fri Nov 17 15:58:38 2017 -0500

    Per conversation with jgautier stipped the '@type="person"' attribute in the author fragment;
    since it can be a person or an organization; this results in a warning from google validation tool
    (because "Thing" is not supposed to have an affiliation) but it appears to be ok to live with it.

commit 0801d56
Author: Leonid Andreev <leonid@hmdc.harvard.edu>
Date:   Fri Nov 17 15:36:04 2017 -0500

    ldjson should will only be embedded into the page if this is the LATEST PUBLISHED version (IQSS#2243)

commit a2742c5
Author: Leonid Andreev <leonid@hmdc.harvard.edu>
Date:   Fri Nov 17 15:08:40 2017 -0500

    latest changest to ld json formatting, making the fragment pass the google validation tool test. (IQSS#2243)

commit d612189
Author: Derek Murphy <dlmurphy@g.harvard.edu>
Date:   Fri Nov 17 13:01:55 2017 -0500

    Docs: extremely nitpicky word change [IQSS#4197]

    Changed a couple words in the config page.

commit d277669
Author: Michael Heppler <mheppler@hmdc.harvard.edu>
Date:   Thu Nov 16 16:21:29 2017 -0500

    Added tip to Installation Guide > Configuration > Custom Header related to disable root theme. [ref IQSS#4197]

commit 80219c5
Author: Derek Murphy <dlmurphy@g.harvard.edu>
Date:   Thu Nov 16 11:43:59 2017 -0500

    Syntax + typo fix

    Small edit, fixed a typo and a syntax error in (ironically) a header in
    the docs

commit e0399c1
Author: Leonid Andreev <leonid@hmdc.harvard.edu>
Date:   Wed Nov 15 19:50:54 2017 -0500

    ...and a quick fix for the "temporalCoverage" entry (IQSS#2243)

commit 67882ff
Author: Leonid Andreev <leonid@hmdc.harvard.edu>
Date:   Wed Nov 15 19:41:05 2017 -0500

    the ld json fragment should now be structured as specified in the issue IQSS#2243.

commit 8b8391f
Author: Leonid Andreev <leonid@hmdc.harvard.edu>
Date:   Wed Nov 15 13:24:22 2017 -0500

    added topicClassifications and kewords to JSONLD. (IQSS#2243)

commit 28f705c
Author: Philip Durbin <philip_durbin@harvard.edu>
Date:   Wed Nov 15 12:58:11 2017 -0500

    implement :DisableRootDataverseTheme db setting IQSS#4197

commit 268c3dc
Author: Michael Heppler <mheppler@hmdc.harvard.edu>
Date:   Wed Nov 15 12:54:50 2017 -0500

    Revised ingest error popover message text. Fixed icon spacing issue. [ref IQSS#4250]

commit 7cd2fea
Author: Philip Durbin <philip_durbin@harvard.edu>
Date:   Wed Nov 15 12:01:57 2017 -0500

    Revert "stub out UI for disabling root dataverse theme IQSS#4197 "

    This reverts commit b9c3c56.

    We're going to use a database setting instead.

commit b9c3c56
Author: Philip Durbin <philip_durbin@harvard.edu>
Date:   Wed Nov 15 08:53:36 2017 -0500

    stub out UI for disabling root dataverse theme IQSS#4197

commit 1f938e9
Author: Philip Durbin <philip_durbin@harvard.edu>
Date:   Wed Nov 15 08:18:25 2017 -0500

    Revert "only show header for non-root dataverses IQSS#4197 "

    This reverts commit 8eccacd.

commit 633a19d
Author: Philip Durbin <philip_durbin@harvard.edu>
Date:   Tue Nov 14 19:02:10 2017 -0500

    affectedDvObjects is a better name for this field IQSS#4262

commit 9a3f4a3
Author: Philip Durbin <philip_durbin@harvard.edu>
Date:   Tue Nov 14 17:10:06 2017 -0500

    add the role to the message IQSS#4262

commit 7cfc8ba
Author: Philip Durbin <philip_durbin@harvard.edu>
Date:   Tue Nov 14 10:09:18 2017 -0500

    override `describe` in AssignRoleCommand IQSS#4262

commit 023cb8f
Author: Philip Durbin <philip_durbin@harvard.edu>
Date:   Mon Nov 13 16:09:43 2017 -0500

    remove parameters since the Command has them IQSS#4262

commit 8eccacd
Author: Philip Durbin <philip_durbin@harvard.edu>
Date:   Mon Nov 13 15:52:37 2017 -0500

    only show header for non-root dataverses IQSS#4197

commit 7795e70
Author: Philip Durbin <philip_durbin@harvard.edu>
Date:   Mon Nov 13 15:22:08 2017 -0500

    change header background from gray to white IQSS#4197

commit e434dd0
Author: Philip Durbin <philip_durbin@harvard.edu>
Date:   Mon Nov 13 14:28:23 2017 -0500

    make it clear that file upload is complete IQSS#4250

commit 26eb11d
Author: Philip Durbin <philip_durbin@harvard.edu>
Date:   Mon Nov 13 14:18:57 2017 -0500

    move `describe` from EjbDataverseEngine to Command interface IQSS#4262

commit 7d03e70
Author: Philip Durbin <philip_durbin@harvard.edu>
Date:   Tue Nov 7 16:21:37 2017 -0500

    consistency between DC.subject and JSON-LD keywords IQSS#2243

commit 9f1d057
Author: Leonid Andreev <leonid@hmdc.harvard.edu>
Date:   Mon Nov 6 21:58:32 2017 -0500

    one more addition for IQSS#2243 - added temporalCoverage.

commit 8c74e37
Author: Leonid Andreev <leonid@hmdc.harvard.edu>
Date:   Mon Nov 6 21:28:06 2017 -0500

    A few quick fixes for getJsonLd() (and the corresponding test in DatasetVersionTest());
    (ref IQSS#2243)

commit c941781
Author: Philip Durbin <philip_durbin@harvard.edu>
Date:   Fri Nov 3 12:21:12 2017 -0400

    explain why ui:insert lines are in the template IQSS#2243

commit 1aa323a
Author: Philip Durbin <philip_durbin@harvard.edu>
Date:   Fri Nov 3 12:20:52 2017 -0400

    remove unused imports used in this branch IQSS#2243

commit f8ca59f
Author: Philip Durbin <philip_durbin@harvard.edu>
Date:   Fri Nov 3 12:13:05 2017 -0400

    add tests for getJsonLd and getPublicationDateAsString IQSS#2243

commit b1db8ee
Author: Philip Durbin <philip_durbin@harvard.edu>
Date:   Fri Nov 3 11:26:37 2017 -0400

    rename to publicationDateAsString and improve javadoc IQSS#2243

commit 8f3083c
Author: Philip Durbin <philip_durbin@harvard.edu>
Date:   Fri Nov 3 11:14:13 2017 -0400

    delete cruft (unused method) IQSS#2243

commit 6c5f044
Author: Philip Durbin <philip_durbin@harvard.edu>
Date:   Thu Nov 2 15:41:12 2017 -0400

    use dateModified and proper schemaVersion URL IQSS#2243

commit 171c8f3
Author: Philip Durbin <philip_durbin@harvard.edu>
Date:   Thu Nov 2 15:29:35 2017 -0400

    move getJsonLd method to DatasetVersion entity IQSS#2243

commit 485a5ca
Author: Philip Durbin <philip_durbin@harvard.edu>
Date:   Thu Nov 2 15:25:37 2017 -0400

    don't even try to figure out if the author is a person or not IQSS#2243

commit 80b5a88
Author: Philip Durbin <philip_durbin@harvard.edu>
Date:   Thu Nov 2 15:19:49 2017 -0400

    limit to non-published, not just non-drafts IQSS#2243

    Also add helper method.

commit ad71c6a
Author: Philip Durbin <philip_durbin@harvard.edu>
Date:   Thu Nov 2 15:17:32 2017 -0400

    use same date format as meta name="DC.date" IQSS#2243

commit 2cc958d
Author: Philip Durbin <philip_durbin@harvard.edu>
Date:   Wed Nov 1 13:30:15 2017 -0400

    fix a number of issues (listed below) IQSS#3793 IQSS#2243

    - only show published versions
    - show URL to DOI dynamically (was hard coded)
    - show publication date
    - show correct publisher
    - show correct provider

commit 5ad88fc
Author: Philip Durbin <philip_durbin@harvard.edu>
Date:   Wed Nov 1 13:15:00 2017 -0400

    better author name parsing (could be an org!) IQSS#3793 IQSS#2243

commit 1b62596
Author: Philip Durbin <philip_durbin@harvard.edu>
Date:   Tue Oct 31 14:57:01 2017 -0400

    stub out dataset in json-ld format IQSS#3793
@mfenner

mfenner commented Jan 18, 2018

Regarding how DataCite tries to determine whether an author is a person or an organization: we spent a lot of effort on this, and have gone through multiple iterations. The code is here: https://github.com/datacite/bolognese/blob/master/lib/bolognese/author_utils.rb#L72-L87.

We assume the author is a person if

  • we have familyName metadata
  • we have an ORCID ID
  • the author is in the format familyName, givenName
  • the author is in the format givenName familyName and the givenName is in a dictionary of first names.

The above gives us greater than 90% accuracy. The reason we need this is not so much that it is required for schema.org, but that we need it for proper citation formatting and BibTeX export.
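
The four rules listed above can be approximated in a few lines. The Python below is a simplified sketch and not a port of the linked Ruby code: the function name, the parameters, and the tiny `GIVEN_NAMES` set are all assumptions made for illustration, and a real dictionary of first names would be far larger.

```python
# Tiny stand-in for a dictionary of given names (assumption for the sketch).
GIVEN_NAMES = {"jane", "john", "maria", "alfred"}


def looks_like_person(name, family_name=None, orcid=None):
    """Guess whether an author is a person, following the four rules above."""
    # Rule 1: explicit familyName metadata is present.
    if family_name:
        return True
    # Rule 2: an ORCID iD is present.
    if orcid:
        return True
    # Rule 3: "familyName, givenName" format (simplified to exactly one comma).
    if name.count(",") == 1:
        family, given = (part.strip() for part in name.split(","))
        if family and given:
            return True
    # Rule 4: "givenName familyName" where the first token is a known given name.
    parts = name.split()
    return len(parts) >= 2 and parts[0].lower() in GIVEN_NAMES


for candidate in ["Doe, Jane", "Jane Doe", "Harvard Dataverse"]:
    print(candidate, looks_like_person(candidate))
```

Because rule 4 only inspects the first token, an organization whose name begins with a given name is classified as a person by this sketch.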

@mfenner

mfenner commented Jan 18, 2018

Also, while @id is a complex topic, I think it maps well to the DOI (expressed as a URL). For practical reasons I like to use the DOI in both @id and identifier.
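
A minimal sketch of that convention in Python: normalize the DOI to its URL form and set it as both "@id" and "identifier". The function names and the list of accepted prefixes are assumptions for illustration, and the DOI in the example is made up.

```python
import json

# Prefixes that this sketch strips before rebuilding the URL (assumption).
DOI_PREFIXES = (
    "doi:",
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
)


def doi_as_url(doi):
    """Express a DOI as an https://doi.org/ URL."""
    doi = doi.strip()
    for prefix in DOI_PREFIXES:
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
            break
    return "https://doi.org/" + doi


def with_doi(doc, doi):
    """Return a copy of a JSON-LD document with the DOI in @id and identifier."""
    url = doi_as_url(doi)
    return {**doc, "@id": url, "identifier": url}


base = {"@context": "http://schema.org", "@type": "Dataset", "name": "Example Survey"}
print(json.dumps(with_doi(base, "doi:10.5072/FK2/EXAMPLE"), indent=2))
```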

@pdurbin
Member

pdurbin commented Jan 18, 2018

@mfenner thanks. This is helpful and interesting.

@mfenner

mfenner commented Jan 18, 2018

Using a dictionary of given names worked really well for us. False negatives were mainly names from China and India, false positives the rare organization where the name starts with a given name, e.g. Alfred P. Sloan Foundation.

Because this is so painful, the DataCite Schema 4.1 released in September 2017 added an attribute to creator and contributor: nameType (controlled list of either personal or organizational).

The simplest solution is obviously to use givenName and familyName from the start.
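
When the metadata already says what kind of name it is, no guessing is needed. The Python sketch below maps a DataCite-style nameType, or explicit givenName/familyName values, to a schema.org "@type". The function name and the mapping table are hypothetical, and the comparison is case-insensitive on the assumption that the controlled values are spelled "Personal" and "Organizational".

```python
# Hypothetical mapping from nameType values to schema.org types.
NAME_TYPE_TO_SCHEMA_ORG = {
    "personal": "Person",
    "organizational": "Organization",
}


def creator_to_author(name, name_type=None, given_name=None, family_name=None):
    """Build a schema.org author item from explicit creator metadata."""
    entry = {"name": name}
    schema_type = NAME_TYPE_TO_SCHEMA_ORG.get((name_type or "").lower())
    # Explicit name parts imply a person even without a nameType.
    if schema_type is None and (given_name or family_name):
        schema_type = "Person"
    if schema_type:
        entry["@type"] = schema_type
    if schema_type == "Person":
        if given_name:
            entry["givenName"] = given_name
        if family_name:
            entry["familyName"] = family_name
    return entry


print(creator_to_author("Alfred P. Sloan Foundation", name_type="Organizational"))
print(creator_to_author("Doe, Jane", given_name="Jane", family_name="Doe"))
```

If neither a nameType nor name parts are available, the entry is left without "@type" rather than guessing.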

@pdurbin
Member

pdurbin commented Jan 18, 2018

Thanks, that nameType attribute sounds useful. I just mentioned it over at #4318 (comment)
