Dataverse -> DCM APIs #3725

Closed
pameyer opened this issue Mar 27, 2017 · 26 comments
@pameyer
Contributor

pameyer commented Mar 27, 2017

Distinct from #3353 (roughly DCM -> DV APIs, in this context). This is a dependency for DCM UI implementation, but doesn't include UI.

Create dataset function/command needs to:

  • send a message to the DCM requesting setup for a data deposition (for dcm/rsync+ssh, creation of either a transfer script or auth payload)
  • asynchronously check if the setup has been done, and if so return/present the script/auth payload to the depositor

Some configuration (DCM url, etc) will be needed.
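The "asynchronously check if the setup has been done" step could look roughly like the polling loop below. This is only a sketch: `wait_for_script`, `fetch_script`, and the retry parameters are hypothetical names, and nothing here reflects how Dataverse or the DCM actually implement the check.

```python
import time
from typing import Callable, Optional


def wait_for_script(
    fetch_script: Callable[[], Optional[str]],
    attempts: int = 5,
    delay_seconds: float = 1.0,
) -> Optional[str]:
    """Poll until a script is available, or give up after `attempts` tries.

    `fetch_script` is a hypothetical callable that asks the DCM for the
    transfer script and returns None (or "") while setup is still pending.
    """
    for attempt in range(attempts):
        script = fetch_script()
        if script:  # None and "" both mean "not ready yet"
            return script
        if attempt < attempts - 1:
            time.sleep(delay_seconds)  # wait before asking again
    return None  # caller decides how to report "DCM never answered"


if __name__ == "__main__":
    # Fake DCM that becomes ready on the third request.
    replies = iter([None, None, "#!/bin/sh\necho placeholder script\n"])
    print(wait_for_script(lambda: next(replies), attempts=3, delay_seconds=0))
```

Passing the fetch step in as a callable keeps the retry policy separate from the DCM URL and other configuration, which would live wherever that callable is built.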

@pdurbin
Member

pdurbin commented Mar 28, 2017

See #3352 for links to code that's already been written. I'm still partial to https://github.com/pdurbin/dataverse/tree/3145-dcm which has RequestRsyncScriptCommand and other goodies.

@pameyer
Contributor Author

pameyer commented Apr 25, 2017

failure modes / open questions:

  • DCM down (create dataset command fail?)
  • upload script expires (script tells depositor to contact support via email)
  • possible interactions with workflows (RSAL)
  • DV may not need to store script persistently; can hand users off to DCM.

@pdurbin
Member

pdurbin commented May 15, 2017

@sekmiller and I are talking about this. I'm sure the screenshots in "2017-05-02 Review rsync prototype (Bill's UI changes)" at https://docs.google.com/a/harvard.edu/document/d/1Mi7I2w2FVYbN1Qb9oWLik3UsU2oPY3Wn9Z6_5JXrEAY/edit?usp=sharing will be helpful. They're oriented toward building a UI some day which is not in scope for this issue but will help give some background. I believe this meeting with @pameyer was even recorded.

@pdurbin
Member

pdurbin commented May 15, 2017

Here's some whiteboarding I just did for @sekmiller .

I tried to focus on what the end user will see, even without the fancy UI work that will come in a future issue (if anyone knows the issue number please advise):

  • create dataset
  • download rsync script
  • run script
  • look for single "data package" file

On the back end, here's roughly what we want to happen:

  • On dataset creation, Dataverse asks DCM for an rsync script.
  • DCM replies with the rsync script (original prototype) or a URL to the script (proposed, but be careful of security since the script contains credentials).
  • User downloads the rsync script.
  • User runs the rsync script, which creates a manifest file of checksums and then transfers the manifest and the files themselves to the DCM.
  • The DCM verifies the checksums and notifies Dataverse that they're fine, passing JSON with the dataset ID and "validation passed".
  • Dataverse kicks off the importer/crawler developed in #3353, "File Import Batch job in support of rsync" (pull request #3497, "3353 batch job import"), which will create a single DataPackage file.

That's the happy path. If the DCM tells Dataverse that the checksums failed, Dataverse sends a notification to the user via the normal Dataverse notification system.
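To make the manifest and validation steps above concrete, here is a minimal sketch. It assumes SHA-1 checksums and a JSON payload with `datasetId` and `status` keys; the thread does not specify the hash algorithm or the exact JSON shape, so all of these names are hypothetical.

```python
import hashlib
from pathlib import Path
from typing import Dict


def _checksum(path: Path) -> str:
    # ASSUMPTION: SHA-1 is used here only for illustration.
    return hashlib.sha1(path.read_bytes()).hexdigest()


def build_manifest(directory: Path) -> Dict[str, str]:
    """Depositor side: map each file's relative path to its checksum."""
    return {
        path.relative_to(directory).as_posix(): _checksum(path)
        for path in sorted(directory.rglob("*"))
        if path.is_file()
    }


def verify_manifest(directory: Path, manifest: Dict[str, str]) -> bool:
    """Receiving side: True only if every listed file exists and matches."""
    for relative_path, expected in manifest.items():
        path = directory / relative_path
        if not path.is_file() or _checksum(path) != expected:
            return False
    return True


def validation_message(dataset_id: int, passed: bool) -> dict:
    """Hypothetical shape of the notification sent back to Dataverse."""
    status = "validation passed" if passed else "validation failed"
    return {"datasetId": dataset_id, "status": status}
```

On "validation failed", the receiving application would route the result to its user notification system rather than starting the import.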

Here's the (fugly) whiteboarding of the above:

[image: issue3525]

@pdurbin
Member

pdurbin commented May 15, 2017

@raprasad @sekmiller to get ready for development, please install and run the Data Capture Module (DCM) on your laptops. Instructions are at https://github.com/sbgrid/data-capture-module and @pameyer can help field questions since he wrote the code! 😄

@pdurbin
Member

pdurbin commented May 16, 2017

It just hit my radar that there are many additional resources that are sure to be helpful to developers that are being gathered by the design team in a folder called "rsync" https://drive.google.com/open?id=0B3A1TxMQgvUVa2ltQjc4cmliTTg including:

There's also a wealth of information at https://trello.com/c/Nbte37k1/9-rsync-file-upload-download-4-8

@pdurbin
Member

pdurbin commented May 16, 2017

I took code I wrote a year ago at https://github.com/pdurbin/dataverse/tree/3145-dcm and pushed it into a new branch after getting it up to date with the "develop" branch: https://github.com/IQSS/dataverse/tree/3725-dcm-apis

I'm realizing that some of my unanswered questions will be resolved once #3724 has passed through Code Review. I'll keep an eye on that issue as well as its pull request at #3830 to make sure I'm being consistent with whatever decisions are made there especially with regard to the rules for when Dataverse should ask the DCM for an rsync script. @pameyer made the pull request and should be able to fill me in.

@raprasad
Contributor

removed assignment after ticket scope changed

@pdurbin
Member

pdurbin commented May 18, 2017

@raprasad thanks. Yes, complete scope change as of yesterday afternoon's sprint planning meeting. In the morning I was sketching diagrams like this...

[image: img_20170517_120629]

... which give a more complete picture of what's called "Large Data Upload Integration" at http://dataverse.org/goals-roadmap-and-releases and hints at the work @michbarsinai is doing in #3561 but this current issue has been clarified to be much smaller. Above I had stated that the definition of done is seeing a data file with a MIME Type of application/vnd.dataverse.file-package in the UI like this on the dataset page:

[image: screen shot 2017-05-17 at 5 08 25 pm]

The clarified definition of done is this:

Assuming vagrant up has been run on https://github.com/sbgrid/data-capture-module and Dataverse has been configured to use that host as the Data Capture Module (DCM) and configured site-wide with rsync as a supported upload method, a user who has sufficient permission on a dataset (Permission.AddDataset as of this writing) should be able to download an rsync script from the DCM via Dataverse using an API Token, as verified by API tests and documented. Additionally, since #3724 has been closed, at least enough configuration options to support the story above will be implemented and documented. Finally, even though the :RepositoryStorageAbstractionLayerUrl and :DownloadMethods config options are necessary for this issue, they were mentioned in #3724 so we'll add them as well, but perhaps not document them until they are meaningful. It would be more meaningful to document them as part of #3561, I believe.
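For illustration, a client-side request for the rsync script might be built as below. The endpoint path and the `X-Dataverse-key` header are not stated anywhere in this thread; they are assumptions here and should be checked against the API documentation that ships with the pull request. `build_rsync_script_request` is a hypothetical helper.

```python
import urllib.request


def build_rsync_script_request(
    base_url: str, dataset_id: int, api_token: str
) -> urllib.request.Request:
    """Build (but do not send) a GET request for a dataset's rsync script."""
    # ASSUMPTION: this path is illustrative and not confirmed by the thread.
    url = f"{base_url.rstrip('/')}/api/datasets/{dataset_id}/dataCaptureModule/rsync"
    # ASSUMPTION: the API token travels in the X-Dataverse-key header.
    return urllib.request.Request(url, headers={"X-Dataverse-key": api_token})


if __name__ == "__main__":
    request = build_rsync_script_request("http://localhost:8080", 452, "my-token")
    print(request.full_url)
    # To actually send it: urllib.request.urlopen(request).read()
```

Keeping request construction separate from sending makes it easy to cover the URL and header logic in tests without a running DCM.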

@pdurbin
Member

pdurbin commented May 22, 2017

I'm still working away on my 3725-dcm-apis branch but I'm thinking I'll make a fresh one with a single commit once I'm done. Good progress today. Thanks to @pameyer and @kcondon for talking through the issues.

Here are some requirements for testing we talked about:

  • Shared filesystem between Dataverse and Data Capture Module (DCM)
  • The server that the DCM is running on may need to be wiped periodically because it creates Unix accounts.

@pdurbin
Member

pdurbin commented May 23, 2017

I just made pull request #3851 and put this issue in the Code Review column at https://waffle.io/IQSS/dataverse

@pdurbin
Member

pdurbin commented May 30, 2017

@pameyer it's a good point and as luck would have it my installation of DCM is hosed right now so it was easy to test the case of when it doesn't return an rsync script to Dataverse. 😄

I just pushed 4126ad1 which gives the API user a better clue as to what went wrong:

{
  "status": "ERROR",
  "message": "Something went wrong attempting to download rsync script: User id 305 had a problem retrieving rsync script for dataset id 452 from Data Capture Module. The script was null or empty."
}
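The check behind that message can be sketched as follows. The error text is copied from the response above; the function name and the shape of the success response are hypothetical, and the real implementation is in the Dataverse code, not here.

```python
from typing import Optional


def rsync_script_response(user_id: int, dataset_id: int, script: Optional[str]) -> dict:
    """Wrap a DCM reply in an API-style response, flagging a missing script."""
    if script is None or not script.strip():
        return {
            "status": "ERROR",
            "message": (
                "Something went wrong attempting to download rsync script: "
                f"User id {user_id} had a problem retrieving rsync script for "
                f"dataset id {dataset_id} from Data Capture Module. "
                "The script was null or empty."
            ),
        }
    # ASSUMPTION: the success shape below is illustrative only.
    return {"status": "OK", "data": {"script": script}}
```

Treating a whitespace-only script the same as a null one avoids handing the depositor a file that does nothing.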

@pdurbin pdurbin removed their assignment May 30, 2017
@pameyer
Contributor Author

pameyer commented May 30, 2017

@pdurbin thanks

@pdurbin pdurbin self-assigned this May 30, 2017
pdurbin added a commit that referenced this issue May 30, 2017
@pdurbin pdurbin removed their assignment May 30, 2017
@pdurbin
Member

pdurbin commented May 30, 2017

@pameyer I made a few more improvements, ending with 087f876, to increase code coverage etc. I'm done. Ready for QA.

@pameyer
Contributor Author

pameyer commented Jun 6, 2017

Just noticed the intersection with DV sending dataset.id to ur.py; and expecting dataset.identifier in batch-import.

And to slightly clarify; the batch-import API (and DCM in general) was expecting the datasetIdentifier key in the call to ur.py to correspond to the values in dataset.identifier column of Dataverse's database; they currently correspond to the dataset.id column.

This is something that can be worked around; but doing so increases global system complexity, and so it's probably better to not need to work around it.
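The mismatch can be pictured with a small sketch. Only the `datasetIdentifier` key and the two column names come from this thread; the `Dataset` class, the helper, and the sample values are hypothetical, and any other fields in the real ur.py payload are omitted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dataset:
    id: int          # dataset.id column (database primary key)
    identifier: str  # dataset.identifier column


def upload_request_payload(dataset: Dataset, use_identifier_column: bool) -> dict:
    """Build the datasetIdentifier part of a ur.py request.

    use_identifier_column=False mirrors the behaviour described above
    (sending dataset.id); True is what batch-import was expecting.
    """
    value = dataset.identifier if use_identifier_column else str(dataset.id)
    return {"datasetIdentifier": value}


if __name__ == "__main__":
    dataset = Dataset(id=452, identifier="EXAMPLE1")  # made-up values
    print(upload_request_payload(dataset, use_identifier_column=False))
    print(upload_request_payload(dataset, use_identifier_column=True))
```

If both sides agree on one column, no translation table between the two values is needed downstream.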

kcondon added a commit that referenced this issue Jun 6, 2017
Allow download of rsync scripts from Data Capture Module (DCM) #3725
@pdurbin
Member

pdurbin commented Jun 6, 2017

@pameyer to me, an identifier is a DOI and an id is a database id.

@pameyer
Contributor Author

pameyer commented Jun 6, 2017

@pdurbin - the database and native API both have something called "identifier"; what should I be calling it?

@djbrooke djbrooke closed this as completed Jun 7, 2017
@kcondon kcondon self-assigned this Jun 7, 2017
@pdurbin pdurbin added this to the 4.7 - Dashboard and Customization milestone Jun 9, 2017