
Harvesting DOI metadata from non-OAI-PMH sources #5402

Closed
RightInTwo opened this issue Dec 14, 2018 · 26 comments

@RightInTwo
Contributor

RightInTwo commented Dec 14, 2018

See https://github.com/IQSS/doi2pmh-server for the continuation!

I would like to harvest heterogeneous sources that don't necessarily present the datasets I need through OAI-PMH or in the form I need them. The issues I see with OAI-PMH:

  • OAI-PMH interfaces don't necessarily exist for every source of datasets I want to use
  • OAI-PMH sets need to be defined at the source
  • Harvested data will go into one dataverse, with no ability to map specific datasets to dataverses
  • Supplied metadata is sometimes insufficient, and harvested metadata cannot be augmented (e.g. with the information that our institute has an existing DUA for that data)
  • The granularity of the original data is not necessarily the granularity wanted in the repository (e.g. for a longitudinal study, we want to group all years into one dataset that describes the study as a whole; this is only a reference for scientists, to increase discoverability of data that is maintained at the external source)

These datasets would be described and updated using the metadata for the DOIs supplied by Datacite and Crossref through the import API (which is currently not its purpose!). One solution would also be to set up our own harvesting server, but that would limit the available metadata fields to those supplied by OAI-PMH and create quite a lot of overhead.

@RightInTwo
Contributor Author

RightInTwo commented Dec 14, 2018

What we are currently doing to prepare our repository:

  • Collect existing DOIs for relevant objects
  • Get basic metadata from Datacite (next step: also query Crossref and as a last resort: query the repository directly)
  • Categorize the objects (A. data we use from external sources and would like to reference in our repository, B. data from our institute published in external repositories)
  • Enrich metadata: research unit within our institute

Now that is the stuff I want to get into our institutional dataverse. This is only about metadata! The data would reside at its original source.
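The preparation steps above could be sketched roughly as follows (a hedged sketch: `fetch_datacite`, `categorize`, `enrich`, and the `OUR_INSTITUTE` marker string are hypothetical names I'm introducing for illustration; the DataCite REST endpoint `https://api.datacite.org/dois/<doi>` is public):

```python
import requests

OUR_INSTITUTE = "WZB"  # assumption: a substring that identifies our own affiliation

def fetch_datacite(doi):
    """Step 2: basic metadata for one DOI from the DataCite REST API."""
    r = requests.get(f"https://api.datacite.org/dois/{doi}", timeout=30)
    r.raise_for_status()
    return r.json()["data"]["attributes"]

def categorize(affiliations):
    """Step 3: 'B' = our data published in external repositories,
    'A' = external data we only want to reference."""
    return "B" if any(OUR_INSTITUTE in a for a in affiliations) else "A"

def enrich(attributes, research_unit):
    """Step 4: attach institute-internal metadata (the responsible research unit)."""
    return {**attributes, "researchUnit": research_unit}
```

Only the metadata flows through this pipeline; the data itself stays at the source, as described above.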

@RightInTwo
Contributor Author

#5104 seems to be closely related

@RightInTwo
Contributor Author

@donsizemore You mentioned Python code in the chat. What does it do exactly?

@RightInTwo
Contributor Author

RightInTwo commented Jul 18, 2019

@djbrooke @pdurbin Hey Guys! Would it make sense to break this down in some way? Or is an issue consolidation in progress/to be expected for this as well?

@pdurbin
Member

pdurbin commented Jul 18, 2019

@RightInTwo to help us keep this on our radar I think you should consider creating a project for your installation at https://github.com/orgs/IQSS/projects . If you're interested, please let me know and I can add you to a "read only" group. Beware that this also means we can assign issues to you. 😄 For more context on boards for installations, please see https://scholar.harvard.edu/pdurbin/blog/2019/jupyter-notebooks-and-crazy-ideas-for-dataverse

Also, breaking down issues is almost always good. It makes them easier to estimate. 👍

@RightInTwo
Contributor Author

RightInTwo commented Jul 18, 2019

> @RightInTwo to help us keep this on our radar I think you should consider creating a project for your installation at https://github.com/orgs/IQSS/projects . If you're interested, please let me know and I can add you to a "read only" group. Beware that this also means we can assign issues to you. 😄

Very nice. Sign me up! Don't think that your threat will stop me 😆

> For more context on boards for installations, please see https://scholar.harvard.edu/pdurbin/blog/2019/jupyter-notebooks-and-crazy-ideas-for-dataverse

I feel honored to be mentioned :D

> Also, breaking down issues is almost always good. It makes them easier to estimate. 👍

I'd be glad to. It would be great if some other people with general interest in harvesting features joined on this issue to make it easier to smash it into digestible pieces and prioritize them. Maybe there are also already good solutions to (some of) the problems in existence... @pdurbin, could you help me out with some more of your community magic?

@RightInTwo
Contributor Author

How about re-framing this as "Harvest metadata from a list of DOIs"?

@pdurbin
Member

pdurbin commented Jul 25, 2019

> How about re-framing this as "Harvest metadata from a list of DOIs"?

Maybe. Maybe we should try to tell a user story. How's this?

"As a user, I'd like to collect datasets in Dataverse based on metadata available in DataCite. These datasets would behave somewhat like harvested datasets in that they are read only and would clearly indicate that they did not originate in Dataverse."

I worry that I'm not understanding the "why" though. Are you saying that the researchers need a tool to collect related datasets together and that Dataverse could be that tool? What do they do now? Do they just have a bunch of bookmarks in their browser?

@RightInTwo
Contributor Author

> I worry that I'm not understanding the "why" though. Are you saying that the researchers need a tool to collect related datasets together and that Dataverse could be that tool? What do they do now? Do they just have a bunch of bookmarks in their browser?

We don't publish any data ourselves. Therefore, we need to collect references (DOIs) from the diverse places where the data has been published.

For example, our unit DD is responsible for Components 1 and 6 of the German Longitudinal Election Study. The data resides at GESIS, but we would like to reference it in our institutional repository (based on Dataverse). So we would like to add the following DOIs from that page to our catalogue and map (link) them to the dataverse of unit DD and to those of individual researchers (if they want to). Also, we want to use that information to feed the CRIS.


https://doi.org/10.4232/1.13089
https://doi.org/10.4232/1.12722
https://doi.org/10.4232/1.12808
https://doi.org/10.4232/1.12809
https://doi.org/10.4232/1.13168
https://doi.org/10.4232/1.13137
https://doi.org/10.4232/1.13138
https://doi.org/10.4232/1.13139
https://doi.org/10.4232/1.12804
https://doi.org/10.4232/1.12805
https://doi.org/10.4232/1.12806
https://doi.org/10.4232/1.12443
https://doi.org/10.4232/1.12043
https://doi.org/10.4232/1.11443
https://doi.org/10.4232/1.11444

@RightInTwo
Contributor Author

Metadata should be retrieved from the best source available:

  • By content negotiation on the landing page (to get rich metadata through DDI, OAI-ORE, ...; for a first implementation this should be skipped, because there are a lot of dependencies on the repository software used)
  • From Datacite
  • From Crossref
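The fallback chain above could be sketched via DOI content negotiation (a sketch, not a definitive implementation: `resolve_metadata` is a hypothetical helper; doi.org honors `Accept` headers, and the two content types below are the ones DataCite and Crossref serve):

```python
import requests

# Order mirrors the list above: try DataCite's format first, then Crossref's.
ACCEPT_CHAIN = [
    "application/vnd.datacite.datacite+json",   # Datacite
    "application/vnd.citationstyles.csl+json",  # Crossref (CSL JSON)
]

def resolve_metadata(doi):
    """Try each content type in turn; return the first one that resolves."""
    for accept in ACCEPT_CHAIN:
        r = requests.get(f"https://doi.org/{doi}",
                         headers={"Accept": accept}, timeout=30)
        if r.ok:
            return accept, r.json()
    raise LookupError(f"no metadata source succeeded for {doi}")
```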

@pdurbin
Member

pdurbin commented Jul 25, 2019

@RightInTwo thanks, this is helping. For now could you use the "Related Datasets" field to collect those DOIs? I just tried this at https://demo.dataverse.org/dataset.xhtml?persistentId=doi:10.70122/FK2/24U2VG and here's a screenshot:

[Screenshot: the "Related Datasets" field rendered on demo.dataverse.org]

The "Related Datasets" field is multivalued, which is nice, and it supports HTML, so I was able to link to the DOIs, but there isn't much structure to it. It all just goes in a single text area. What do you think? What does @jggautier think? 😄

@RightInTwo
Contributor Author

For now I would just use the "Other ID" field, but it would be best to have the DOI in the actual "Dataset Persistent ID" field.

We are currently collecting them in a database outside of Dataverse, but at some point it would be great to get them in there together with the metadata. Until we manage that, we don't really want to make our Dataverse public (not even within the institute).

(Improving on #5998 would be appreciated anyways...)

@RightInTwo RightInTwo changed the title Harvesting from non-OAI-PMH sources Harvesting DOI metadata from non-OAI-PMH sources Dec 5, 2019
@pdurbin
Member

pdurbin commented Dec 6, 2019

> One solution would also be to set up our own harvesting server, but that would limit the abilities (metadata fields) to those supplied by OAI-PMH

I have no idea if this factoid is helpful or not but Dataverse can harvest its own native JSON format over OAI-PMH. This means that every single metadata field is available, even custom metadata blocks. (That's my understanding anyway.) The downside, of course, is that you'd have to implement our crazy native JSON format in the harvesting server you create. 😄
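For reference, the native JSON export can also be fetched directly per dataset via the Dataverse export API (the `/api/datasets/export` endpoint and the `dataverse_json` exporter are part of the Dataverse API; the base URL and PID in the usage below are placeholders):

```python
import requests

def export_request(base_url, pid):
    """Build the URL and query parameters for a native JSON export of one dataset."""
    return (f"{base_url}/api/datasets/export",
            {"exporter": "dataverse_json", "persistentId": pid})

def export_native_json(base_url, pid):
    """Fetch the dataset's metadata in Dataverse's native JSON format."""
    url, params = export_request(base_url, pid)
    r = requests.get(url, params=params, timeout=30)
    r.raise_for_status()
    return r.json()
```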

@poikilotherm
Contributor

Wouldn't it be easier to implement a separate service for this?

I'm also thinking in the direction of slicing up Dataverse a bit and moving the complete harvesting functionality into a separate module. It could run on its own, offer easier scaling, and use the Dataverse API to load new stuff into the database. (No microservice, but a modulith.)

It could either use Quarkus/Spring (stay'n in Java) or Python (excellent pyDataverse) 😉

@RightInTwo
Contributor Author

RightInTwo commented Dec 6, 2019

@poikilotherm Hi Oliver, that is kind of what I'm building now, except that I don't use any of the libraries, but rather try to build it without constraints or validation in JS with jQuery. Not because that is so great, but because the colleagues who will take over from me (my contract ends in March) don't have any programming background except for some jQuery runtime manipulation in the browser...

Now, that all could be totally different if...
a) ...Dataverse would support and maintain that functionality. That would of course be a lot of additional work, and I understand that it may (currently) be out of scope.
b) ...we would develop something together! :D I'm not deep into Python, but pyDataverse seems very promising. And if there were a community effort on this, I'd be glad to be part of it and dump all that messy JS for good. @poikilotherm I could do the grunt work if you do the code structure and the QA :D

> One solution would also be to set up our own harvesting server, but that would limit the abilities (metadata fields) to those supplied by OAI-PMH

> Dataverse can harvest its own native JSON format over OAI-PMH

Good point! Maybe a small OAI-PMH server could be part of the solution then.

@donsizemore We once had a chat about this topic - are you still interested?

@pdurbin
Member

pdurbin commented Dec 6, 2019

@RightInTwo since you're using Javascript you should definitely check out the new kid on the block when it comes to Dataverse API client libraries: dataverse-client-javascript! 🎉

Developed primarily by @tainguyenbui it may be new but it's moving fast! And it's on npm.

@RightInTwo
Contributor Author

RightInTwo commented Dec 6, 2019

> @RightInTwo since you're using Javascript you should definitely check out the new kid on the block when it comes to Dataverse API client libraries: dataverse-client-javascript! 🎉

Yes, I just discovered that yesterday! Let's see what the other people on here think about the choices regarding language and architecture. Maybe @skasberger could also contribute with his opinion?

@RightInTwo
Contributor Author

RightInTwo commented Dec 16, 2019

Here is some code as an example of how a quick-and-dirty import of Datacite metadata via DDI-XML works. After some experiments with mapping from one Python dict to another, trying to create a dict in the Dataverse JSON format, I ended up with a solution that really earns the "quick and dirty" tag: just insert everything into a string in the DDI-XML format accepted by /datasets/:importddi.

Click here for the Python code (just as an example - should I put it in a repo of its own?)

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

import json
from jsonpath_ng.ext import parse as parseJsonPath
from requests import get, post

# pyDataverse doesn't provide an API to import DDI-XML yet, so we just use requests
#from pyDataverse.api import Api
#from pyDataverse.models import Dataverse

#%%
######################################################
# Setup

## Define API URLs
apimethods = {
    'datacite_get_datacitejson': {
        'usage': "get the datacite+json representation of the DOI metadata. you need to append a DOI (just the ID!) to the url",
        'url': 'https://data.datacite.org/application/vnd.datacite.datacite+json/'
    },
    'datacite_get_xbibliography': {
        'usage': "get the x-bibliography representation of the DOI metadata. you need to append a DOI (just the ID!) to the url",
        'url': 'https://data.datacite.org/text/x-bibliography/'
    }
}
    
## Provide API key
apikey = {
    'wzbdataverse': '{insert API key here}'
}

## Provide base URL
baseurl = {
    'wzbdataverse': 'https://dataverse.wzb.eu'
}
    
#%%
######################################################    
# QUICK AND DIRTY function to map from Datacite+json (as a python dict) to DDI-XML (as a string)

def ddiXmlFromDoi(doi):
    md = get(apimethods['datacite_get_datacitejson']['url'] + doi).json()
    # .text (str), not .content (bytes), so the citation renders cleanly in the f-string below
    citation = get(apimethods['datacite_get_xbibliography']['url'] + doi).text
    
    issueDate =     parseJsonPath('$.dates[?dateType="Issued"].date').find(md)[0].value
    pubYear =       parseJsonPath('$.publicationYear').find(md)[0].value
    title =         parseJsonPath('$.titles[0].title').find(md)[0].value
    creators =      parseJsonPath('$.creators').find(md)[0].value
    keywords =      parseJsonPath('$.subjects').find(md)[0].value
    descriptions =  parseJsonPath('$.descriptions').find(md)[0].value
    version =       1
    subTitle =      ''
    
    ddixml = f"""<?xml version=\'1.0\' encoding=\'UTF-8\'?>
<codeBook
	xmlns="ddi:codebook:2_5"
	xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="ddi:codebook:2_5 http://www.ddialliance.org/Specification/DDI-Codebook/2.5/XMLSchema/codebook.xsd" version="2.5">
	<docDscr>
		<citation>
			<titlStmt>
				<titl>{title}</titl>
				<IDNo agency="DOI">doi:{doi}</IDNo>
			</titlStmt>
			<distStmt>
				<distrbtr>{md['publisher']}</distrbtr>
				<distDate>{issueDate}</distDate>
			</distStmt>
			<verStmt source="Datacite">
				<version date="{issueDate}" type="RELEASED">{version}</version>
			</verStmt>
			<biblCit>{citation}</biblCit>
		</citation>
	</docDscr>
	<stdyDscr>
		<citation>
			<titlStmt>
				<titl>{title}</titl>
				<subTitl>{subTitle}</subTitl>
				<IDNo agency="DOI">doi:{doi}</IDNo>
			</titlStmt>
			<rspStmt>
"""
  
    ### creators          
    for creator in creators:
        affiliation = ''
        if('affiliation' in creator):
            affiliation = creator['affiliation']
        name = creator['name']
        ddixml += f'<AuthEnty affiliation="{affiliation}">{name}</AuthEnty>'
    
    
    ddixml += f"""
        </rspStmt>
			<prodStmt>
				<prodDate>{pubYear}</prodDate>
			</prodStmt>
			<distStmt>
				<distrbtr>{md['publisher']}</distrbtr>
				<distDate>{issueDate}</distDate>
			</distStmt>
		</citation>
		<stdyInfo>
			<subject>
"""

    ### subjects
    for keyword in keywords:
        word = keyword['subject']
        scheme = ''
        if 'subjectScheme' in keyword:
            scheme = keyword['subjectScheme']
        ddixml += f'<keyword subjectScheme="{scheme}">{word}</keyword>'
    
    
    ddixml += f"""
			</subject>
"""
    
    ### abstracts / descriptions
    for desc in descriptions:
        descText = desc['description']
        ddixml += f'<abstract>{descText}</abstract>'
        
    # f-string, so {md['publisher']} is actually interpolated
    ddixml += f"""
			<distrbtr>
				<distrbtr>{md['publisher']}</distrbtr>
			</distrbtr>
		</stdyInfo>
	</stdyDscr>
</codeBook>"""

    return ddixml
    
#%%
###################################################### 
# Loop through the list of DOIs, get the DDI-XML and import it into Dataverse    

dois = ['doi1', 'doi2', 'doi...', 'doin', ] 

for doi in dois:    
    ddiXml = ddiXmlFromDoi(doi).encode(encoding='UTF-8') 
    params = {}
    params['key'] = apikey['wzbdataverse']
    import_api = f'/api/dataverses/open/datasets/:importddi?pid=doi:{doi}&release=yes'
    # verify=False disables TLS certificate checks - acceptable for local testing only
    response = post(baseurl['wzbdataverse'] + import_api, data=ddiXml, params=params, verify=False)

@RightInTwo
Contributor Author

RightInTwo commented Jan 21, 2020

With a custom OAI-PMH server (which holds the metadata for a specified list of DOIs, also see #6425 ), the solution could be achieved with a harvesting client in Dataverse. Steps 1 & 2 would be run regularly (daily?).
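On the Dataverse side, the harvesting client would talk to the custom server through the standard OAI-PMH verbs. A minimal sketch of that exchange (the endpoint URL and the helper names are hypothetical, and I'm assuming the server offers an `oai_ddi` metadata prefix; the `ListIdentifiers` verb and the XML namespace are from the OAI-PMH 2.0 spec):

```python
import requests
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"

def parse_identifiers(xml_text):
    """Pull the <identifier> values out of a ListIdentifiers response."""
    root = ET.fromstring(xml_text)
    return [h.findtext(f"{OAI_NS}identifier")
            for h in root.iter(f"{OAI_NS}header")]

def list_identifiers(endpoint, metadata_prefix="oai_ddi"):
    """What a harvesting client would ask the doi2pmh server."""
    r = requests.get(endpoint, params={"verb": "ListIdentifiers",
                                       "metadataPrefix": metadata_prefix},
                     timeout=30)
    r.raise_for_status()
    return parse_identifiers(r.text)
```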

[Diagram: harvesting pipeline (green: exists, red: todo)]

@RightInTwo
Contributor Author

RightInTwo commented Jan 21, 2020

@tcoupin @pdurbin @djbrooke Since this is not going to be a core feature, where should this project reside and under what name? In the IQSS github, named something like "doi2pmh-server"?

@pdurbin
Member

pdurbin commented Jan 21, 2020

@RightInTwo I could create an empty repo for you if you want. You'd want to mention prominently in the README that it's community supported. Nice diagram! (And nice code earlier. 😄 )

@RightInTwo
Contributor Author

> @RightInTwo I could create an empty repo for you if you want.

@tcoupin Would you agree to administering this? Since my contract ends at the end of March, I cannot commit to that, but I will gladly take part in developing it until then.

@tcoupin
Member

tcoupin commented Jan 22, 2020 via email

@pdurbin
Member

pdurbin commented Jan 22, 2020

@RightInTwo @tcoupin ok I just created https://github.com/IQSS/doi2pmh-server and made you admins of it. Again, please make sure you indicate that this is a community-supported project. Have fun, you two. 😄

@djbrooke
Contributor

Thanks @pdurbin for setting up that repo.

@tcoupin @RightInTwo Thanks for working on this. I think a solution that allows institutions to easily set up their collections in an OAI-PMH server and then have the metadata reflected in Dataverse for discoverability purposes is great.

@RightInTwo
Contributor Author

@pdurbin @djbrooke Thanks for making this happen!

@poikilotherm @tcoupin See you on the other side!
