
[MRG] Dataverse content provider #739

Merged: 7 commits, Sep 18, 2019

Conversation

Xarthisius
Contributor

@Xarthisius Xarthisius commented Jul 10, 2019

This PR adds support for Dataverse datasets as a content provider for repo2docker, e.g.:

repo2docker doi:10.7910/DVN/6ZXAGT
repo2docker https://dataverse.harvard.edu/file.xhtml?persistentId=doi:10.7910/DVN/6ZXAGT/3YRRYJ

It's very much a WIP, but I'd like to solicit feedback at this point on whether this is a good way to go. Related to IQSS/dataverse#4714

Since it's very similar to the Zenodo provider, I abstracted parts of it into a base DOIHandler class. Additional refactoring is probably still possible, as I duplicated a lot of code.

TODO:

  • add tests
  • refactor fetch to remove code duplicated with the Zenodo content provider.

@betatim
Member

betatim commented Jul 11, 2019

Do we want to allow repo2docker doi:10.7910/DVN/6ZXAGT? It looks like a URL but isn't. I don't think we can make it doi:// can we/should we? Do we even need a prefix or can we keep it prefix-less? The only time we end up with a problem is when there is a directory with the same name as the DOI and the user didn't mean to use it. I'd be Ok with saying that is a rare edge case.

Is it "standard" to do something like doi:...? I have no idea :-/

@Xarthisius
Contributor Author

Xarthisius commented Jul 11, 2019

Do we want to allow repo2docker doi:10.7910/DVN/6ZXAGT? It looks like a URL but isn't. I don't think we can make it doi:// can we/should we? Do we even need a prefix or can we keep it prefix-less? The only time we end up with a problem is when there is a directory with the same name as the DOI and the user didn't mean to use it. I'd be Ok with saying that is a rare edge case.

Is it "standard" to do something like doi:...? I have no idea :-/

I'm certainly not an expert in this field, but there are other types of persistent digital object identifiers (DOI is kinda a loaded term...); one example is handle.net (e.g. hdl:1902.1/12379). They have a similar structure, as you can see (some_number_prefix/some_other_number), and they resolve the same way. Check out:

curl -IL https://doi.org/hdl:1902.1/12379

With Zenodo you just got lucky ;)

I don't know if that's the best example, because doi:1902.1/12379 and 1902.1/12379 work too, but I'm sure @pdurbin can give us some examples.

@betatim
Member

betatim commented Jul 11, 2019

I am now thinking that I misunderstood what you are proposing :) I thought you were suggesting that we should make repo2docker doi:1902.1/12379 the required way of specifying a DOI and remove repo2docker 1902.1/12379 as a legal option. Now I think you are not suggesting that, and instead pointing out that DOIs come in many flavours and that we can treat them all the same way. If that isn't right then please help me ;)

curl -IL https://doi.org/hdl:1902.1/12379
curl -IL https://doi.org/doi:1902.1/12379
curl -IL https://doi.org/1902.1/12379
curl -IL https://doi.org/doi:10.5281/zenodo.3208700
curl -IL https://doi.org/10.5281/zenodo.3208700

all seem to be legal and "do the right thing". Which is pretty cool actually :) Now I'm thinking that we should allow both forms and encourage people to use the one with a prefix when there could be ambiguity (is it true that no prefix means "use the default", and the default is "doi:"?).

I will try to take a look at the code tomorrow.

@pdurbin
Contributor

pdurbin commented Jul 11, 2019

DOIs come in many flavours

Yes. The craziest DOI I'm aware of is https://doi.org/10.1002/(sici)1099-1409(199908/10)3:6/7<672::aid-jpp192>3.0.co;2-8

That said, in practice, I don't think you'll see DOIs in the wild like this very often, thankfully. 😄

@betatim mostly I dropped by to say thanks for taking a look at this pull request!! 🎉 🎉 🎉 (And thank you to @Xarthisius for making it, of course!! 🎉 🎉 🎉 ) I have some thoughts on URLs to pass to mybinder.org (what's possible today from "external tools" in Dataverse) but I decided to throw it into a comment in our "Binderverse" issue for now: IQSS/dataverse#4714 (comment) . I'm wondering where to have the conversation about Binder URLs. I assume the BinderHub repo but wherever is fine with me. Please let me know. 😄

@Xarthisius is aware that Dataverse supports an alternative to DOIs called Handles. Collectively, people seem to refer to DOIs, Handles, and other identifiers as Persistent IDs (PIDs). There's even a conference called https://pidapalooza.org that a colleague of mine attended earlier this year, and a new forum at https://www.pidforum.org where we can ask questions about PIDs.

http://hdl.handle.net/11529/10016 is an example of what a Handle might look like. When you click it, it should resolve to a landing page on an installation of Dataverse that looks like the screenshot below.

[Screenshot: dataset landing page on a Dataverse installation, resolved from the Handle above]

I don't have the numbers on the 46 installations of Dataverse but I'd say most of them use DOIs rather than Handles.

I can't emphasize enough how thrilled I am about this pull request. A few weeks ago I gave a demo of launching a Jupyter Notebook from a dataset in Dataverse using Whole Tale, and while I did touch on Binder in my talk, I'd love to do a similar demo with Binder in the future. You'll find a Binder screenshot and a transcript in the write-up of my talk: https://scholar.harvard.edu/pdurbin/blog/2019/jupyter-notebooks-and-crazy-ideas-for-dataverse

@betatim
Member

betatim commented Jul 12, 2019

After a quick browse on the train to EuroPython: LGTM 🎉!

A few questions for my education, maybe we can turn some into comments in the code so people from the future find them.

  • the dataverse.json is needed to detect if the link we get from resolving a DOI points to a Dataverse instance, right? We should add instructions on how to get and update this file. I assume it'll need updating every few months? Is there a way to make a request to a "potential Dataverse" host to find out if it is indeed one? I checked the Server response header but that doesn't contain anything. Maybe there is a well-known endpoint like /versions or /status or some such that we could query instead of having a list that we will need to update?
  • when fetching files we build a list of file IDs we want to fetch and then receive a zip file containing all of them? I left a comment in fetch where maybe we can add a comment. I read the code a few times to convince myself that it did do that (I think), so a comment would help

For the code formatting pedant in me: can we stick to snake_case as much as possible please?

I look forward to tests and refactoring. (Will the is_doi() function keep working with doi:1234.2/3425?)
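
(For illustration, a generic matcher that tolerates an optional doi:/hdl: prefix could look like the sketch below; this is not repo2docker's actual is_doi implementation.)

import re

# Hypothetical sketch: accept "prefix/suffix" identifiers, optionally preceded
# by a doi:/hdl: scheme or a doi.org URL.
PID_PATTERN = re.compile(
    r"^(doi:|hdl:|https?://(dx\.)?doi\.org/)?\d+(\.\d+)*/\S+$", re.IGNORECASE
)

def is_pid(text):
    return PID_PATTERN.match(text) is not None

print(is_pid("doi:10.7910/DVN/6ZXAGT"))  # True
print(is_pid("1902.1/12379"))            # True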

@pdurbin
Contributor

pdurbin commented Jul 12, 2019

  • I checked the Server response header but that doesn't contain anything. Maybe there is a well-known endpoint like /versions or /status or some such that we could query instead of having a list that we will need to update?

Yes! Once you know the hostname of the potential Dataverse installation (due to the DOI resolving to it), you can check if the host indeed is running Dataverse by going to /api/info/version like this:

https://demo.dataverse.org/api/info/version

The expected response is something like this:

{
  "status": "OK",
  "data": {
    "version": "4.15.1",
    "build": "1377-701b56b"
  }
}

Obviously, this could be some other software but I suspect it's enough of a fingerprint to be pretty sure it's Dataverse. 😄
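
For reference, a minimal probe of that endpoint could look like the sketch below (looks_like_dataverse is a hypothetical helper name, not code from this PR):

import json
from urllib.request import urlopen

def looks_like_dataverse(host):
    # Hypothetical sketch: True if the host answers /api/info/version with the
    # JSON shape shown above.
    try:
        with urlopen("https://%s/api/info/version" % host, timeout=10) as resp:
            data = json.load(resp)
    except (OSError, ValueError):
        return False
    return data.get("status") == "OK" and "version" in data.get("data", {})

print(looks_like_dataverse("demo.dataverse.org"))  # expected: True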

  • when fetching files we build a list of file IDs we want to fetch and then receive a zip file containing all of them?

One of the things to look out for here is that Dataverse installations can limit the size of the zip file that is constructed on the fly based on this (potentially long) list of file IDs. The default is 100 MB as mentioned at http://guides.dataverse.org/en/4.15.1/installation/config.html#zipdownloadlimit . If you can't get all the files at once, you could switch to downloading them individually.
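
For reference, the on-the-fly zip that this limit applies to is served by the multiple-file download API, e.g. a request like https://demo.dataverse.org/api/access/datafiles/101,102,103 (the file IDs here are placeholders).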

@nuest
Contributor

nuest commented Sep 9, 2019

Needs an update if #788 is merged; it has overlapping refactoring of an abstract DOI-provider class.

@Xarthisius
Contributor Author

Needs an update if #788 is merged; it has overlapping refactoring of an abstract DOI-provider class.

Thanks for the heads up @nuest ! I'll rebase when it happens. I'd really love to finish this off, but I'm constantly swamped with other work :(

@pdurbin
Contributor

pdurbin commented Sep 12, 2019

@nuest hi! Now that your pull request has been merged, are you able to help me and @Xarthisius resolve merge conflicts in this one? Any help would be much appreciated!! 🎉 😄

@Xarthisius Xarthisius force-pushed the dataverse branch 2 times, most recently from 5c68444 to bf1a5a1 on September 12, 2019 13:54
@pdurbin
Contributor

pdurbin commented Sep 13, 2019

@Xarthisius thanks for resolving those merge conflicts!! 🎉 🎉 🎉 One thing I noticed over at pull request #788 is that @nuest had "WIP" in the title of his pull request initially, but @betatim encouraged him to switch it to "MRG" at #788 (comment), saying, "We also change the title from WIP to MRG when it is ready for reviewing (and then maybe merge)."

So should we switch the title to MRG? Are you ready for review? If not, is there any help you need?

It looks like the entire process is documented at https://repo2docker.readthedocs.io/en/latest/contributing/contributing.html#guidelines-to-getting-a-pull-request-merged and here's a screenshot with the MRG bit highlighted in blue:

[Screenshot: repo2docker contributing guidelines with the MRG naming convention highlighted in blue]

Thanks!!!

@Xarthisius
Contributor Author

Thank you @pdurbin, I'm aware of the r2d contributing process. This PR is still a WIP because it's literally a work in progress. If you read this PR's description, you'll notice a TODO list that I'd like to get done before switching the status to MRG.

However, if you want to push it across the finish line faster, by all means, just take the code and continue working on it.

@pdurbin
Contributor

pdurbin commented Sep 13, 2019

@Xarthisius perfect! Thanks for explaining! I was completely unaware of the repo2docker contribution process before this week but it makes sense that because you build on top of this project you know all about it. Probably the best way for me to help is with advocating for the "add config to test Dataverse support in repo2docker" issue over at IQSS/dataverse-jenkins#13

Additionally, I could put a call out on the dataverse-community mailing list to get you some help on the code itself. Just say the word and I'll put out the call!

Thanks!!!

@Xarthisius Xarthisius changed the title from [WIP] Dataverse content provider to [MRG] Dataverse content provider on Sep 13, 2019
@Xarthisius
Contributor Author

Xarthisius commented Sep 13, 2019

I believe it's now in a "reviewable" state. I significantly reduced the complexity of detect(). Right now it doesn't handle all the possible cases, but I focused on the two that are, IMHO, going to be most common: a DOI pointing to a Dataverse resource and a URL created by Dataverse's External Tools.

Contributor

@nuest nuest left a comment


Just a few questions and comments from my side.

Thanks for merging the upstream changes. It's great to see more data sources coming :-).

repo2docker/contentproviders/dataverse.py (resolved review comments)
search_url = urlunparse(
parsed_url._replace(path="/api/search", query=search_query)
)
resp = self.urlopen(search_url).read()
Contributor

I suggest calling log.debug with the URL before requesting it; it could be important debugging information for users.
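
For example (a sketch, following the self.log usage visible in the snippet below):

self.log.debug("Dataverse search URL: %s", search_url)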

self.log.debug(" - resp = " + json.dumps(data))
return

self.record_id = deep_get(data, "items.0.dataset_persistent_id")
Contributor

Could there be a case where more than 1 item will be returned? Should you warn about that here?


self.record_id = deep_get(data, "items.0.dataset_persistent_id")
elif is_doi(doi):
self.record_id = "doi:" + normalize_doi(doi)
Contributor

In the other DOI-supporting content providers we only store the repository-specific identifier here, not the full DOI. Can you briefly confirm that this does not work here because multiple Dataverse installations can have different DOI schemes?

Contributor

@nuest Dataverse supports both DOIs and Handles and collectively we refer to these as "persistent identifiers" or PIDs.

Does this help?

Contributor

Ah OK, I see. Thanks!

# and then receive a zip file containing all of them.
# TODO: Dataverse has a limit for the zipfile size (see
# https://github.com/jupyter/repo2docker/pull/739#issuecomment-510834729)
# If the size of the dataset is greater than 100 MB, individual files should be downloaded.
Contributor

I suggest opening an issue for that in repo2docker and removing the TODO here.

On the other hand, why not implement only individual file downloads?

Contributor Author

I might be wrong here, but I got the impression that a scenario with a lot of small files is much more common in Dataverse repositories than the few-big-files case. Going with that hunch, I'd rather see one API call than a bazillion calls.

Contributor

Let me try to explain the pros and cons of downloading files individually from a Dataverse dataset vs. downloading them as a zip.

Pros of downloading files individually

  • Easier on Dataverse servers that store files on S3 because a "redirect" feature can be enabled such that files are downloaded directly from S3 rather than passing through Dataverse servers.
  • Easier on Dataverse servers because Dataverse servers do not have to create a zip file on the fly

Cons of downloading files individually

  • In order to know the file hierarchy, you have to parse metadata about the files and look for directoryLabel to get path/to/file information. (There can be other potentially interesting metadata about files in there, by the way, such as descriptions.)
  • Lots of requests. Bazillion calls.

Pros of downloading zip files

  • Single request.
  • Easier to deal with a single file. Just unpack it. The file hierarchy is in the zip file.

Cons of downloading zip files

  • May crash Dataverse servers. Dataverse servers have to construct the zip on the fly. Also when not using the S3 redirect option mentioned above, files pass through Glassfish and Apache.

I hope this helps! I might be forgetting a point or two.


Contributor

Thanks for the elaborate description! That's very helpful.

Re. the directoryLabel: that is what is done now in the Zenodo and Figshare providers, although I don't think directories are even possible in those repositories; see how host["download"] and host["filename"] are used in https://github.com/jupyter/repo2docker/blob/master/repo2docker/contentproviders/doi.py#L53

@betatim and @minrk should comment on what is preferable for BinderHub operators.

From a user perspective, I tend towards downloading files individually, because I want things not to break (e.g. if the ZIP would be larger than 1 GB) and don't care about load on the servers. The comment says

"Dataverse has a limit for the zipfile size"

so I think there is no way around implementing individual downloads (maybe as a second step if the zip file download fails), unless you'd be fine with not supporting large repos.

Contributor

I mean, if it's easier to implement zip download first just to get something working, that's fine with me. It should work fine with datasets with small files. 😄 As you see, I'm just trying to explain the pitfalls of creating zip files on the fly from the Dataverse server perspective. (And it's a bad experience for users if the zip file isn't complete, if it only has 5 of the 17 files the user wants because they exceeded the zip size limit.) So yes, downloading individual files, even if it's a bazillion requests, is probably safer.

Contributor Author

I reimplemented fetch to download files individually.
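
Roughly, per-file download against the Dataverse native API looks like the standalone sketch below (fetch_dataset_files is a hypothetical helper, not the exact code in this PR):

import json
import os
import shutil
from urllib.parse import quote
from urllib.request import urlopen

def fetch_dataset_files(host, persistent_id, target_dir):
    # Hypothetical sketch: list the dataset's files via the native API, then
    # stream each one through /api/access/datafile/<id>, recreating the
    # directoryLabel hierarchy on disk.
    meta_url = "https://%s/api/datasets/:persistentId/?persistentId=%s" % (
        host, quote(persistent_id, safe=":/"))
    with urlopen(meta_url) as resp:
        files = json.load(resp)["data"]["latestVersion"]["files"]
    for entry in files:
        datafile = entry["dataFile"]
        subdir = entry.get("directoryLabel", "")  # absent for top-level files
        dest = os.path.join(target_dir, subdir, datafile["filename"])
        os.makedirs(os.path.dirname(dest), exist_ok=True)
        file_url = "https://%s/api/access/datafile/%s" % (host, datafile["id"])
        with urlopen(file_url) as src, open(dest, "wb") as dst:
            shutil.copyfileobj(src, dst)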


@property
def content_id(self):
"""The Dataverse persistent identifier (could use internal dataset_id too)."""
Contributor

If you stick to the doi: prefix, I suggest mentioning that here. Also, remove the "could use internal dataset_id" if that is not used eventually (see comment above).

setup.py (resolved review comments)
tests/unit/contentproviders/test_dataverse.py (resolved review comments)
@pdurbin
Contributor

pdurbin commented Sep 16, 2019

@Xarthisius @nuest I left a couple comments just now. I hope they're helpful. Please let me know if you have any questions! Thanks!

None,
)
if host is None:
return
Member

To check that I parsed the above lines properly (the use of next() to handle the "didn't work" case made me read it a few times), is the following code equivalent (I think yes):

for host_ in self.hosts:
    if urlparse(host_["url"]).netloc == parsed_url.netloc:
        host = host_
        break
else:
    return

@betatim
Member

betatim commented Sep 18, 2019

LGTM! Not quite sure why codecov is (that) unhappy. Merging this now. Thanks for the work and for coming back to finish it off :)

Two more todos for a quick PR:

@betatim betatim merged commit c98e6ac into jupyterhub:master Sep 18, 2019
@nuest
Contributor

nuest commented Sep 18, 2019

Working on the docs update PR...

Labels: workflow:wip (Work in Progress)
5 participants