OAI-ORE and BagIT development #4706
After discussions at the 2018 Dataverse meeting (thanks!), I've tried to identify a list of things to get to a minimum viable solution. Comments welcome.
ORE updates:
Bag updates:
DPN updates:
An update w.r.t. code: the URI and namespace for metadata blocks and support for SHA-256/512, including an API call to verify the existing hash and replace it, if the file is still valid, with one from the new algorithm, are both done/in the branch. I'm currently working on default URIs for the citation metadata block, and adapting the bag. John Crabtree and I had a good call with Dave Pcolar of DPN today and it appears that either sending individual files or an ~RDA bag, either of which would be wrapped in a DPN bag for preservation, should be doable today. Sending a DPN bag directly is something that DPN is willing to work on, but it is not currently supported. From the discussion, it appears that sending files directly could have higher performance (due to parallel transfer), but the idea of using an ~RDA bag as a general export, a common intermediate/standard across possible preservation systems, sounds compelling, and I think sending a bag is currently the consensus option. We had some discussion of potential next steps w.r.t. versioning (perhaps just publishing the version changes, given Dataverse's ability to identify them) and how to assure that variable-level metadata is included (by including the DDI metadata file and/or adding to the ORE map).
After a second discussion with DPN and Odum, I've gone ahead with a consensus plan to enable optional submission to DPN as a post-publication Dataverse workflow as a v1 effort. Based on how DPN works, including the fact that initial submission is synchronous and reversible while creating a 'snapshot' to archive a space can have a delay and is irreversible (except for manual removal), the workflow creates a space named after the dataset globalId and uploads a BagIt bag, named after the globalId + version, and a datacite.xml file to it. The success of this step is reported in a new column on the versions tab, visible only to admins, that reports failure or provides a link to find the data in the DPN DuraCloud admin GUI. A curator would click the button to create a snapshot and monitor progress from there. Once the snapshot exists, the space is automatically emptied and can be deleted. Publishing a new version of a dataset will recreate the space, and the process can be repeated with a snapshot of the new Bag and datacite.xml file. (Versions are therefore stored as different snapshots of the same space.) I've been testing this in a 4.9.2-based QDR branch and it works reliably, though I did hit a DPN bug at one point.
As a side effect of the main effort, the datacite.xml file can be made available as a metadata export format (and it may be worth looking at adding more fields to it, as we just did with the citation formats). I've removed the Bag generation from the metadata export menu, where I initially tested it, for several reasons: it's not just metadata, it includes restricted files and access to it should be restricted, and it's better to stream it to DPN/generate it on demand rather than caching it (as it's similar in size to the whole dataset). I have a few things to finish up before this is ready for review/QA:
If anyone would like to see it early, I'd be happy to demo/discuss.
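For context on the datacite.xml export mentioned above: once available, it is served through Dataverse's standard metadata export API. A hedged example (the exporter name and the DOI below are illustrative and may differ by version):
# Fetch DataCite-format metadata for a published dataset (exporter name is an assumption)
curl "http://localhost:8080/api/datasets/export?exporter=Datacite&persistentId=doi:10.5072/FK2/EXAMPLE"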
@qqmyers what would the label on the button be? If it's easy to provide a screenshot of what you have so far, can you please add it here? Today in a design meeting we were thinking about UI impact on the dataset page, and I mentioned that at one point you were planning to put a button under "Export", even though we might want to consider a different place and name for it.
@pdurbin - the button you reference is one in the DuraCloud admin webapp, not Dataverse. I was originally thinking an 'export button' on the dataset page would be good, but since it could contain restricted files and is version-specific, I've gone for the admin-only column in the version table, which is currently something like below. Non-admins would just see the normal table.
@qqmyers thanks, I've been talking to @mheppler a bit about your screenshot. I know you wrote extensive documentation about what you're up to in "Data and Metadata Packaging for Archiving", which is ultimately linked from the Google Group post in the description of this issue, but here's a direct link: https://github.com/QualitativeDataRepository/dataverse/wiki/Data-and-Metadata-Packaging-for-Archiving People should also check out the discussion about the plan in the Google Group thread: https://groups.google.com/d/msg/dataverse-community/NZydpK_zXO0/vuvhnHL7AQAJ Another resource is @qqmyers's talk at the 2018 Dataverse Community Meeting, "Dataverse and Preservation in the Qualitative Data Repository", at https://drive.google.com/open?id=1fVhtw-R3Jf7wO4tgkNxpk3Mm93bUIjXP
@pdurbin - FYI - I've extended the workflow mechanism as discussed on the community call to allow system settings and an apiKey for the current user to be sent to a workflow step and, after some EJB fun, I think I have DPN submission working as a workflow, along with the ability to submit past versions via the GUI. I have some cleanup to do, but I'm about ready to submit a PR (or PRs) and would like to ask: there are a few things, like the workflow changes and making the export mechanism use streaming data, that were needed for DPN submission but could be submitted and reviewed as separate PRs. Would it be helpful to do that? That could be a little extra work for me, but I don't think it's that much since I have to compare between QDR's 4.9.2-based branch and develop anyway. It may help with review, but there would be dependencies between the PRs too. Let me know what you all think. Thanks!
@scolapasta any thoughts/guidance re: the approach in the above comment from @qqmyers? hooray workflows!
@qqmyers yes, generally, separate, smaller PRs are easier for us to review, QA, and merge. So since it isn't too much work on your side, we would prefer that approach.
@qqmyers I saw you made pull request #5049 and I assume it's the main one for this issue, so I dragged it to code review at https://waffle.io/IQSS/dataverse
Hi @qqmyers - thanks for talking about this earlier this week. The other PRs are being reviewed. The workflow-based integration here will be extremely useful and fulfills a long-standing community need. I have some concerns about the UI piece here. We’ll have a lot of moving pieces on the dataset and file pages as part of the redesign effort, so we don’t want to add any additional UI elements to the page right now, even if it’s only for superusers. It doesn’t appear there are API endpoints for the archiving that’s done via the UI shown in the screenshot above. If these endpoints could be added, I think it would allow the desired functionality while not adding additional challenge to the design team's work in flight. Let me know if you have any thoughts on the above. Thanks for all the PRs!
@djbrooke - there is an API, I just missed it in the merge (and just added it). For QDR, I think the ability for curators to see the status and be able to send things via the GUI will be important, but I can pull that part from the PR.
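For reference, a hedged sketch of calling that archiving API as a superuser; the exact path, HTTP method, and the dataset id/version used here are assumptions, so check the admin docs added in the PR for the authoritative form:
# Ask Dataverse to generate and submit the Bag for dataset 1234, version 1.0 (illustrative ids)
export API_TOKEN=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
curl -X POST -H "X-Dataverse-key: $API_TOKEN" "http://localhost:8080/api/admin/submitDatasetVersionToArchive/1234/1.0"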
Thanks @qqmyers. I saw some commits come in on the associated PR over the weekend. Is this ready for code review? Let me know if you'd like us to take another look.
@qqmyers API for existing dataset still failing. I can hold off testing if you are still working on it; just wanted to give the new account one last try. Saw this in the server log:
[2019-01-16T14:37:22.284-0500] [glassfish 4.1] [INFO] [] [edu.harvard.iq.dataverse.util.bagit.BagGenerator] [tid: _ThreadID=836 _ThreadName=Thread-50] [timeMillis: 1547667442284] [levelValue: 800] [[
[2019-01-16T14:37:22.369-0500] [glassfish 4.1] [INFO] [] [] [tid: _ThreadID=836 _ThreadName=Thread-8] [timeMillis: 1547667442369] [levelValue: 800] [[
[2019-01-16T14:37:22.369-0500] [glassfish 4.1] [INFO] [] [] [tid: _ThreadID=836 _ThreadName=Thread-8] [timeMillis: 1547667442369] [levelValue: 800] [[
[2019-01-16T14:46:22.632-0500] [glassfish 4.1] [WARNING] [] [edu.harvard.iq.dataverse.util.bagit.BagGenerator] [tid: _ThreadID=837 _ThreadName=pool-63-thread-1] [timeMillis: 1547667982632] [levelValue: 900] [[
[2019-01-16T14:46:22.634-0500] [glassfish 4.1] [SEVERE] [] [edu.harvard.iq.dataverse.util.bagit.BagGenerator] [tid: _ThreadID=837 _ThreadName=pool-63-thread-1] [timeMillis: 1547667982634] [levelValue: 1000] [[
[2019-01-16T14:46:22.635-0500] [glassfish 4.1] [SEVERE] [] [] [tid: _ThreadID=837 _ThreadName=Thread-9] [timeMillis: 1547667982635] [levelValue: 1000] [[
[2019-01-16T14:46:22.635-0500] [glassfish 4.1] [SEVERE] [] [edu.harvard.iq.dataverse.util.bagit.BagGenerator] [tid: _ThreadID=837 _ThreadName=pool-63-thread-1] [timeMillis: 1547667982635] [levelValue: 1000] [[
[2019-01-16T14:46:22.636-0500] [glassfish 4.1] [SEVERE] [] [edu.harvard.iq.dataverse.engine.command.impl.DuraCloudSubmitToArchiveCommand] [tid: _ThreadID=836 _ThreadName=Thread-50] [timeMillis: 1547667982636] [levelValue: 1000] [[
[2019-01-16T14:46:22.637-0500] [glassfish 4.1] [SEVERE] [] [] [tid: _ThreadID=836 _ThreadName=Thread-9] [timeMillis: 1547667982637] [levelValue: 1000] [[
[2019-01-16T14:46:25.362-0500] [glassfish 4.1] [WARNING] [] [edu.harvard.iq.dataverse.engine.command.impl.DuraCloudSubmitToArchiveCommand] [tid: _ThreadID=31 _ThreadName=http-listener-1(4)] [timeMillis: 1547667985362] [levelValue: 900] [[
[2019-01-16T14:46:25.365-0500] [glassfish 4.1] [SEVERE] [] [] [tid: _ThreadID=31 _ThreadName=Thread-9] [timeMillis: 1547667985365] [levelValue: 1000] [[
[2019-01-16T14:46:25.365-0500] [glassfish 4.1] [SEVERE] [] [] [tid: _ThreadID=31 _ThreadName=Thread-9] [timeMillis: 1547667985365] [levelValue: 1000] [[
@kcondon - thanks for sending it back. I'm also seeing a problem with the API, but not the workflow at the moment, so some debugging is needed. I'll let you know when I figure out what's changed.
@kcondon - just uploaded a fix for the API. It looks like at some point the fact that the API is a call to the server, which then triggers file retrieval calls via HTTP, caused some deadlock. I was mostly testing from our GUI, which was the same code except for the initial HTTP call, so I missed the issue. In any case, the new async mechanism that works like indexing - the API call just starts the process and returns - works for me. (The workflow part should have been working all along...) So - I think you can look at this again. I'll go back tomorrow to look at your comments on the docs and see if I can make those clearer, but I won't touch the code unless you find issues.
@kcondon - made some clarifications/corrections in the docs, including 1) the :ArchiverClassName doesn't need to be listed in :ArchiverSettings or the workflow definition, and 2) the DuraCloud port and context are optional since they have defaults, but setting them only works if they are also listed in :ArchiverSettings. (FWIW: this split is so things can stay generic - :ArchiverSettings tells the generic code which properties to send to the archive-specific class, and then the archive-specific class uses those settings.) Hopefully it's all good at this point...
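To make that split concrete, a hedged sketch of the settings as they might be configured via the admin settings API; the class name matches what appears in this thread's logs, while the host value and the exact :ArchiverSettings list are illustrative:
# Which archiver implementation the generic workflow step should instantiate
curl -X PUT -d "edu.harvard.iq.dataverse.engine.command.impl.DuraCloudSubmitToArchiveCommand" http://localhost:8080/api/admin/settings/:ArchiverClassName
# Which settings the generic code is allowed to pass through to that class
curl -X PUT -d ":DuraCloudHost, :DuraCloudPort, :DuraCloudContext" http://localhost:8080/api/admin/settings/:ArchiverSettings
# The archiver-specific settings themselves (port and context are optional; they have defaults)
curl -X PUT -d "example.duracloud.org" http://localhost:8080/api/admin/settings/:DuraCloudHost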
@qqmyers API is working now, thanks. Am having trouble with the workflow, but it's likely a simple config issue: I've added a workflow using the sample file, replacing "string" with the values that were mentioned in the API section. However, when I publish a dataset it fails with this log error:
[2019-01-22T15:01:25.073-0500] [glassfish 4.1] [WARNING] [] [edu.harvard.iq.dataverse.util.ArchiverUtil] [tid: _ThreadID=142 _ThreadName=__ejb-thread-pool1] [timeMillis: 1548187285073] [levelValue: 900] [[
[2019-01-22T15:01:25.074-0500] [glassfish 4.1] [SEVERE] [] [edu.harvard.iq.dataverse.workflow.internalspi.ArchivalSubmissionWorkflowStep] [tid: _ThreadID=142 _ThreadName=__ejb-thread-pool1] [timeMillis: 1548187285074] [levelValue: 1000] [[
[2019-01-22T15:01:25.075-0500] [glassfish 4.1] [WARNING] [] [edu.harvard.iq.dataverse.workflow.WorkflowServiceBean] [tid: _ThreadID=142 _ThreadName=__ejb-thread-pool1] [timeMillis: 1548187285075] [levelValue: 900] [[
[2019-01-22T15:01:25.074-0500] [glassfish 4.1] [SEVERE] [] [] [tid: _ThreadID=142 _ThreadName=Thread-9] [timeMillis: 1548187285074] [levelValue: 1000] [[
[2019-01-22T15:01:25.080-0500] [glassfish 4.1] [INFO] [] [edu.harvard.iq.dataverse.workflow.WorkflowServiceBean] [tid: _ThreadID=142 _ThreadName=__ejb-thread-pool1] [timeMillis: 1548187285080] [levelValue: 800] [[
@kcondon - The issue may be that the "string" entries in the requiredSettings part of the JSON file aren't meant to be substituted. They just specify the data type for that setting so it can be passed appropriately. The actual values will come from the named settings you've already set up for the API.
@qqmyers No luck, still seeing this error:
[2019-01-22T16:38:03.900-0500] [glassfish 4.1] [SEVERE] [] [] [tid: _ThreadID=143 _ThreadName=Thread-9] [timeMillis: 1548193083900] [levelValue: 1000] [[
[2019-01-22T16:38:03.901-0500] [glassfish 4.1] [SEVERE] [] [edu.harvard.iq.dataverse.workflow.internalspi.ArchivalSubmissionWorkflowStep] [tid: _ThreadID=143 _ThreadName=__ejb-thread-pool1] [timeMillis: 1548193083901] [levelValue: 1000] [[
[2019-01-22T16:38:03.902-0500] [glassfish 4.1] [WARNING] [] [edu.harvard.iq.dataverse.workflow.WorkflowServiceBean] [tid: _ThreadID=143 _ThreadName=__ejb-thread-pool1] [timeMillis: 1548193083902] [levelValue: 900] [[
[2019-01-22T16:38:03.903-0500] [glassfish 4.1] [INFO] [] [edu.harvard.iq.dataverse.workflow.WorkflowServiceBean] [tid: _ThreadID=143 _ThreadName=__ejb-thread-pool1] [timeMillis: 1548193083903] [levelValue: 800] [[
@kcondon - hmmm. One other thought - did you update the PostPublishDataset workflow: My workflow json looks as follows: If you have these two, I think you should get the class created OK. (You'll need the :ArchiverSettings and :DuraCloudHost set as well to make it all go, but those should be good already with the API working.) Could there be a typo somewhere? The way this works is that the requiredSettings listed in the json are the only settings the workflow step gets to see (so steps can't read settings outside the set an admin has allowed through the workflow definition). If the setting exists, and the workflow definition lists it, you shouldn't be getting a null. (I did a quick check between my 'good' branch and this one and don't see any differences in how the settings are read/passed, so I don't think it's a code issue. If nothing here helps, I'll look again.)
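As an illustration of the shape being discussed, a minimal sketch of such a workflow definition; the name, provider, and step type are assumptions, and only the colon-prefixed requiredSettings keys and their "string" type values reflect what is described in this thread:
# Illustrative workflow definition (not the exact file from this thread)
cat > archiver-workflow.json <<'EOF'
{
  "name": "Archival submission workflow (example)",
  "steps": [
    {
      "provider": ":internal",
      "stepType": "archiver",
      "parameters": { "stepName": "archive submission" },
      "requiredSettings": {
        ":ArchiverClassName": "string",
        ":ArchiverSettings": "string",
        ":DuraCloudHost": "string",
        ":DuraCloudPort": "string",
        ":DuraCloudContext": "string"
      }
    }
  ]
}
EOF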
@qqmyers Thanks for the specifics. It looks like for some reason the workflow was not created the same way as yours. I believe I did a wget on the raw sample file from github and then ran the add-workflow endpoint. This is what I get:
curl http://localhost:8080/api/admin/workflows/4 | jq .
I'm wondering whether, since I am defaulting on port and context, that is causing the create workflow to skip part of the file? I'll try deleting and re-adding, both as is and with settings explicitly set. Other setting:
@kcondon - got it. It looks like the code in JsonParser to parse the requiredSettings didn't make it into one of the earlier PRs (i.e. the general workflow update, since this isn't specific to the archiver). Hopefully that's the last issue. To get this to work, I think you'll have to rebuild and then resubmit the workflow and set the PostPublishDataset workflow to the new one. Thanks again for persevering!
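A hedged sketch of those re-registration steps, using the workflows admin API that the curl commands in this thread already exercise; the file name and the workflow id are illustrative (the id is whatever the list call reports for the newly added workflow):
# Register the rebuilt workflow definition
curl -X POST -H "Content-type: application/json" -d @archiver-workflow.json http://localhost:8080/api/admin/workflows
# Find the new workflow's id, then set it as the PostPublishDataset default
curl http://localhost:8080/api/admin/workflows | jq .
curl -X PUT -d 5 http://localhost:8080/api/admin/workflows/default/PostPublishDataset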
@qqmyers OK, rebuilt, re-added the workflow, set it as default. Still fails with a null archiver.
[2019-01-22T18:14:17.503-0500] [glassfish 4.1] [WARNING] [] [edu.harvard.iq.dataverse.util.ArchiverUtil] [tid: _ThreadID=150 _ThreadName=__ejb-thread-pool12] [timeMillis: 1548198857503] [levelValue: 900] [[
curl http://localhost:8080/api/admin/workflows/5 | jq .
Ran a few quick tests on 581e5dc in this branch. Existing RSAL workflows (RSAL 0.1) continue to work as expected (at least for the success path).
@kcondon - Argghh - I think you need colons in front of the requiredSettings keys, which is not what the example in the branch has... but it is what's working for me (see earlier comment). I'll commit an update...
@qqmyers That did the trick!
@qqmyers Have found some weirdness in the API, need to narrow it down: I'm heading out now, so I will look at it again tomorrow. Thanks for the help and fixes.
@kcondon - not sure I get the full picture, so here's some possibly useful info:
@qqmyers Thanks for the detail. It was a combination of the archivalCopyLocation value and the enabled workflow. I was not fully clearing out the prior entry. It's working, merged.
@kcondon - Great - thanks for all the work on this - it was interesting breaking this into multiple PRs and trying to keep them all in sync. Unfortunately, I think you ended up being the one who found problems when I didn't keep things straight. With the merge, can I delete the qdr.duracloud account we created for testing? I don't think there's any issue with recreating it if/when needed, but since it needs admin privs to create spaces, I'd like to close it down if you're not planning further testing/demoing with it.
@qqmyers Sure, feel free to delete the account.
This is an issue to track feedback related to developing a way to archive published datasets in DPN (http://dpn.org). I've done some proof-of-concept work to generate an OAI-ORE map file and BagIt bag (which uses and includes the ORE map file) for published datasets that I hope can form the basis for a DPN submission.
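A hedged example of retrieving the ORE map for a published dataset once this work is in place, via the same metadata export API used for the other formats; the exporter name and DOI are illustrative:
# Fetch the OAI-ORE map (JSON-LD) describing the dataset version and its files
curl "http://localhost:8080/api/datasets/export?exporter=OAI_ORE&persistentId=doi:10.5072/FK2/EXAMPLE"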
From https://groups.google.com/forum/#!topic/dataverse-community/NZydpK_zXO0 :