Client-side multifile zip download #9245

Draft
wants to merge 6 commits into
base: develop
from

Conversation

qqmyers
Member

@qqmyers qqmyers commented Dec 22, 2022

What this PR does / why we need it: A possible addition to/replacement for zipping on the server. In this PR, the multi-file download button invokes JavaScript that will download files individually (using direct download if enabled) and create a zip locally, using file names/directoryPaths from the specific datasetVersion being downloaded.
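As a rough illustration of the naming step described above, here is a minimal sketch of how a zip entry path might be derived from a file's name and directory path. The function name `zipEntryPath` and the object shape (`name`, `directoryPath`) are assumptions for illustration, not code from this PR.

```javascript
// Hypothetical helper: build the path a file gets inside the zip.
// Assumes each file record carries a `name` and an optional `directoryPath`
// taken from the dataset version being downloaded.
function zipEntryPath(file) {
  // Strip leading and trailing slashes so entries are always relative paths.
  const dir = (file.directoryPath || "").replace(/^\/+|\/+$/g, "");
  return dir ? `${dir}/${file.name}` : file.name;
}
```

Files without a directory path end up at the root of the zip, and files with one keep their folder structure.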

Current issues/limitations:

  • It isn't clear that this will work on all browsers
  • There's no error handling - should be possible, for example, to default to using the server side zip if things go wrong or if the browser type/version doesn't support what's needed.
  • It should be more efficient, but I've done minimal testing on scalability so far. Nominally, one could allow users to download all files and not have a size limit. The underlying zip code is using Blobs and Promises and is supposed to scale, but I'm not sure I'm doing everything to configure it to scale, etc.
  • The logic that blocks zip downloads when you're over the size limit has not been changed, so this method is still subject to any configured limit.
  • The download file is always named dataverse_files.zip as before - could potentially use the dataset PID/version to create a unique name (with full or partial to indicate some/all files)
  • There is not currently any manifest file in the zip - should be possible to add one if desired (or to someday make a Bag)
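Two of the items above (falling back to the server-side zip, and a unique file name) could be sketched roughly as follows. This is only a sketch under stated assumptions: `zipFileName` and `downloadWithFallback` are hypothetical names, and the two callbacks stand in for whatever the client-side and server-side download paths actually are.

```javascript
// Hypothetical helper: derive a zip name from the dataset PID and version,
// with "full" or "partial" marking whether all files were selected.
function zipFileName(pid, version, allFiles) {
  // Replace characters that are awkward in file names (e.g. ":" and "/" in a DOI).
  const safePid = pid.replace(/[^A-Za-z0-9._-]+/g, "_");
  const scope = allFiles ? "full" : "partial";
  return `${safePid}_v${version}_${scope}.zip`;
}

// Hypothetical wrapper: try the client-side zip first and fall back to the
// server-side zip if it throws or rejects. Both arguments are placeholder
// functions returning promises, not functions defined in this PR.
async function downloadWithFallback(clientSideZip, serverSideZip) {
  try {
    return await clientSideZip();
  } catch (err) {
    console.warn("Client-side zip failed, falling back to server-side zip:", err);
    return serverSideZip();
  }
}
```

A browser capability check could be handled the same way: have the client-side callback throw early when a required feature is missing, so the fallback path takes over.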

Which issue(s) this PR closes:

Closes #5864

Special notes for your reviewer:
To enable this, I needed to know the datasetVersion in the download code, which required trying to fix #5864 - the multifile button doesn't set the datasetVersion in the guestbook by default. If the rest gets delayed, it may be worth pulling out this one-line fix (it's a separate commit).

Suggestions on how to test this:

Does this PR introduce a user interface change? If mockups are available, please link/include them here:

Is there a release notes update needed for this change?:

Additional documentation:

@coveralls

coveralls commented Dec 22, 2022

Coverage: 20.005% (-0.004%) from 20.009% when pulling bd13967 on GlobalDataverseCommunityConsortium:clientsidezip into 1aabf69 on IQSS:develop.

@donsizemore
Contributor

this is wonderful! does it default to original file format, or does it send surrogate copies?

@qqmyers
Member Author

qqmyers commented Dec 23, 2022

Right now it is adding ?format=original when it retrieves each file.
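For illustration, appending the parameter could look like the sketch below. The helper name `withFormat` and the example host are hypothetical, and the `/api/access/datafile/{id}` path is assumed here to be the file access endpoint; the PR may build its URLs differently, especially when direct download is enabled.

```javascript
// Hypothetical helper: add (or overwrite) the `format` query parameter on a
// file download URL while keeping any parameters that are already present.
function withFormat(fileUrl, format = "original") {
  const url = new URL(fileUrl);
  url.searchParams.set("format", format);
  return url.toString();
}

// Example with a placeholder host and file id:
// withFormat("https://demo.example.org/api/access/datafile/42")
```

Using `URL` and `searchParams` avoids having to decide by hand between `?` and `&` when the URL already carries other parameters.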

@donsizemore
Contributor

@qqmyers @jdmar3 corrects me: From archival theory there is a case to be made either way, but I do argue that prioritizing original file formats over plaintext tabular data incurs tech debt and may make data unusable. Would perhaps an "original format" check-box or some such require much additional work?

@jdmar3

jdmar3 commented Jan 10, 2023

@qqmyers Would it be possible to present an option to add ?format=archival in addition to/instead of ?format=original? Also, we're working on some automated user testing for different browsers, so I would be happy to help with testing on different browsers if needed.

EDIT: @donsizemore beat me to the punch!

@qqmyers
Member Author

qqmyers commented Jan 10, 2023

The Access Dataset menu at the top of the page allows getting either original or archival format. Currently I have not changed those buttons to use the client-side zipping, but that's a useful addition once we know that it works well for most browsers and sizable datasets.

Both forms are also available at the individual file level, so it is mostly a limitation of the bulk 'Download' button for selected files, regardless of whether the existing server-side zipping or this client-side method is used. I don't want to change that as part of this PR for client-side zipping, but I think both the client-side and the existing server-side algorithm could handle both cases if the user interface work is done to allow it. FWIW: I think the API call to download all files allows you to specify either form as well.

W.r.t. archiving, I would argue that the Bag exports are better than the zip available from the front end (the Bag has fixity info, all the metadata for the dataset, etc.) and since it is privileged, it doesn't run the risk of files being excluded if you don't have permissions (if I recall the zip options in the UI include a manifest that lists files that weren't included due to permissions or size limits). There has also been discussion of the archival Bag exports w.r.t. whether including the ingested formats would be better than the original, but there are issues that have slowed that work, e.g. the fact that Dataverse isn't storing the fixity info for the ingested versions. It would definitely be useful to have some discussion/review of the Bags to decide requirements and priorities.

W.r.t. testing - thanks! The draft PR should work as is, so if we can get a test server (or servers) set up somewhere, it could be tested with different browsers, larger data, etc. I think DataverseNO was going to try to fire one up; Don could probably do that at Odum as well. Assuming that looks promising, I can look into updating the download all buttons - that shouldn't involve any new risks - if it works for the one format, it will work for the other.

@qqmyers
Member Author

qqmyers commented Feb 7, 2023

This is not ready for Review/QA (hence draft). Testing has shown that the local browser uses significant amounts of memory with large files and can fail with an out-of-memory error. I'm still investigating how to handle this. Perhaps it should not be on the board yet?

@pdurbin
Member

pdurbin commented Mar 9, 2023

We're excited about it. Let's let Jim size it.

@mreekie added the labels Size: Queued (PM has called this issue out specifically for sizing) and bklog: NeedsDiscussion on Mar 14, 2023
@mreekie

mreekie commented Mar 14, 2023

Sizing:

  • Slid this back to Jim's column as not ready for sizing.

@pdurbin added the label Type: Feature (a feature request) on Oct 9, 2024
Labels
GDCC: DataverseNO, Size: Queued (PM has called this issue out specifically for sizing), Type: Feature (a feature request)
Projects
Status: No status
6 participants