
Refactor file upload from web UI and temporary storage #6656

Closed
poikilotherm opened this issue Feb 17, 2020 · 20 comments · Fixed by #8983
Labels
Component: Code Infrastructure (formerly "Feature: Code Infrastructure") · Feature: File Upload & Handling · Feature: Installation Guide · Feature: Performance & Stability · Type: Bug (a defect) · User Role: Depositor (creates datasets, uploads data, etc.) · User Role: Sysadmin (installs, upgrades, and configures the system, connects via ssh)
Comments

@poikilotherm
Contributor

Intro

This issue arises from https://groups.google.com/d/topic/dataverse-community/IB1wYpoU_H8

A rough summary: when you upload data via the web UI, the data is stored in a few different places before being committed to real storage, as it needs to be processed, analyzed, and possibly worked on.

Nonetheless, this could be tweaked, especially when starting to deal with larger data. This is somewhat related to #6489, but since not every big-data installation will want to expose its S3 storage to the internet, or may not use that new functionality at all, this should still be addressed. From a UI and UX perspective, which is touched on here, issues #6604 and #6605 seem related.

Mentions: @pdurbin @djbrooke @TaniaSchlatter @scolapasta @qqmyers @landreev

Discovery

I tested this on my local Kubernetes cluster and created a little video of what happens when. After that I dug a bit through the code to see if there are potential problems ahead for our installation.

Please enjoy the following demo (click on the image):
upload-test-1g-2g-5g

Things to note:

  1. Data is uploaded and stored in the glassfish/domains/domain1/generated/jsp/dataverse folder, as already stated in the Google Group posting.
  2. After the upload is done, but before it is marked completed, the data is copied to the temp folder, which lives at the location configured via the JVM option dataverse.files. Until this copy is done, the user experiences a progress bar hanging at 100%.
  3. The uploaded data is in most cases cloned, eating up more space. Even after the upload has been marked finished to the user, the temporary data in the generated folder is often not deleted.
  4. Upload to real file storage starts when the user clicks "Save". The user sees a spinning circle, which might appear to hang for large data.

Analysis

Now let's see what happens in the background from an implementation point of view. The following diagram is a bit shortened and leaves out edge cases to make it easier to understand.

Click on the image to access the source.
plantuml

Things to note here:

  1. PrimeFaces uses Apache Commons FileUpload, so the temporary files created by it are deleted once the InputStream is closed. This, however, does not happen in the current codebase, which seems to be at least one cause of the dangling files noticed in the discovery above.
  2. The GC eventually closes the dangling streams once editing is done, but there is a real danger of leftovers. This should be avoided (see the sketch below).
  3. As noticed in the Google Group and during discovery, the place where PrimeFaces uploads its data is not the same as what dataverse.files has been configured to. The docs lack that information, and there is a real danger of disks filling up and breaking things.
  4. The same data is copied around a few times, and from what I saw in the code it is even copied when moved to real storage on a local filesystem. That sounds very much like it introduces lag for users.
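
To illustrate points 1 and 2: a minimal sketch of deterministic stream handling, assuming a hypothetical helper (this is not the current Dataverse code). With try-with-resources, the upload stream is closed as soon as the copy finishes, which lets the upload library delete its backing temp file immediately instead of waiting for the GC:

    import java.io.IOException;
    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.StandardCopyOption;

    public final class UploadStreamHandling {
        // Hypothetical helper: copy an upload stream into our temp area and
        // close the stream deterministically so the uploader can clean up.
        static Path saveToTemp(InputStream uploadStream, Path tempDir) throws IOException {
            Path target = Files.createTempFile(tempDir, "upload-", ".tmp");
            try (InputStream in = uploadStream) {
                Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
            } // the stream is closed here, not whenever the GC runs
            return target;
        }
    }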

Ideas and TODOs

I feel like this is a medium sized issue, as it needs quite a bit of refactoring.
A collection of ideas what to do next:

  • Think about optimizing upload, avoiding multiple copies. Things like this or this and this seem to be good starting points.
  • Think about the UX - is this acceptable for large data? The upload itself is quite fast, as seen, even for bigger files (this was from my workstation to a remote on-premises location over 1 Gbit LAN).
  • Which other issues is this connected to?
@poikilotherm added the labels Feature: File Upload & Handling, Type: Bug, Component: Code Infrastructure, Feature: Performance & Stability, Feature: Installation Guide, User Role: Sysadmin, User Role: Depositor, and Medium on Feb 17, 2020
@PaulBoon
Contributor

When doing upload tests for the S3 storage option, I noticed that the upload to storage (step 4 in your things-to-note list), which only shows a spinner, mostly took longer than the user's upload to Dataverse.
One of my colleagues suggested that it might not be necessary to make the user wait for this spinner to end; we could just let the user resume other UI actions while the storing is silently handled by Dataverse in the background.

@TaniaSchlatter
Member

@poikilotherm, thanks for this detailed issue. I'd like to separate out the UX/UI/ system behaviors a bit.

From the UX standpoint, it sounds like there are inefficiencies in how upload works that can cause lags.

From the UI standpoint, feedback is not clear/does not represent what the system is doing:

  • After upload, but while the data are being copied to the temp folder, the user sees a hanging progress bar at 100% – is this until copying is complete?
    note to check: what is the feedback when complete?
  • When the user clicks "Save" the UI shows a moving circle, which might appear as hanging if saving takes a long time.
    • progress is not indicated visually to the user
    • does the user really need to wait until all is complete? (@PaulBoon) Question: if the user can proceed without saving completing, as @PaulBoon suggests, when/how would they get feedback that saving is complete?

I've clarified that issues #6604 and #6605 are primarily about addressing scalability in the UI.

@qqmyers
Member

qqmyers commented Feb 17, 2020

@poikilotherm - good stuff! In putting together the s3 direct uploads, I've been looking at the file workflow and thinking about optimizations as well. It seems like there are a couple of things that could be picked off to start:

  • Documenting that java.io.tmpdir can be set in glassfish to move the first of the temp files you mention to another dir.
  • Adding a filter (some of your links) to directly create the temp file Dataverse uses, avoiding the initial copy created by primefaces.
  • Possibly adding an optional optimization to mv files when possible (temp and persistent file stores on the same file system) rather than using streams and copying as is required for the general case (see the sketch after this list).
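
A minimal sketch of that third bullet, assuming a hypothetical helper (not existing Dataverse code): try a cheap rename first, and only fall back to a byte-for-byte copy when temp and permanent storage are on different file systems:

    import java.io.IOException;
    import java.nio.file.AtomicMoveNotSupportedException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.StandardCopyOption;

    public final class TempFilePromotion {
        static void moveOrCopy(Path tempFile, Path finalLocation) throws IOException {
            try {
                // Near-instant rename when both paths share a file system
                Files.move(tempFile, finalLocation, StandardCopyOption.ATOMIC_MOVE);
            } catch (AtomicMoveNotSupportedException e) {
                // Different file systems: stream the bytes, then drop the temp file
                Files.copy(tempFile, finalLocation, StandardCopyOption.REPLACE_EXISTING);
                Files.delete(tempFile);
            }
        }
    }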

Beyond that, I think there are multiple ways one could optimize, at least for some situations (Apologies for this part rambling a bit):

  • If the temp files and permanent files are in the same file system, a mv could be done instead of a copy.
  • One could write files directly to their final locations (as with direct s3 upload).
  • Transfers could be parallelized. Some aspects of this already occur - I think Primefaces uploads files in parallel (which would potentially speed up your video), but it won't parallelize the transfer of a single file as protocols like Globus FTP do. I don't think internal transfers are handled in parallel, though. (There's also a limitation in the API where one can only upload one file at a time/each upload causes a Dataset save.)

There are a few considerations there that might inform design:

  • Keeping temp files makes it possible to put them on fast storage (e.g. an SSD) for processing.
  • Zip files aren't permanent and would need to be removed once the internal files are created, but, at present, the original file is stored for all other types (potentially with a translation (ingest) or additional metadata)
  • The user can delete files after upload and before saving, and can simply abandon a session. My sense is that most of the ways temp files become abandoned have been fixed in the latest versions of Dataverse (incremental fixes starting back 10+ versions), with abandoned sessions being the biggest remaining issue. It is definitely much improved since ~4.9. However, if files are stored directly to their final location, it becomes more important to close all holes, since it is harder to find abandoned files without a dedicated API/tool (i.e. you can't just rely on location or naming conventions).
  • The current storage scheme requires knowing the dataset DOI to start (since it is in the file path/s3 name)
  • Functionality like ingest needs access to the whole file, but this only applies to some mimetypes, and it may be that very few large files will undergo such processing.
  • Some functionality, like viewing file content to assign a mime type, currently gets access to the whole file, but none of it actually uses more than the first ~1K bytes (some only look at the first few bytes). Other processing, like pulling metadata from FITS files, may need more bytes, but could potentially still retrieve just the required parts of the file. (For direct s3 uploads, providing ranged queries, which the protocol itself supports, could eliminate having to retrieve the whole file to the dataverse machine for such processing, essentially creating a temp copy after the direct download.)
  • Truly remote stores (like TRSA) might require less custom functionality if such functionality didn't require making a temporary copy of the whole file at Dataverse.
  • Some stores (like rsync and I think the alternate s3 direct upload design discussed at the European meeting) don't make the user wait for file uploads and internal transfers to occur, but they require locking the dataset until that's complete and that definitely opens up questions of what you should do if something in that fails (i.e. after the user has gone away thinking things were OK)

My general sense for these types of considerations is that some refactoring of the StorageIO interface could help here and might enable experimentation/stores optimized for specific content over time, along with helping the general case. I don't have a specific design in mind yet, but things I've thought about include (a rough interface sketch follows this list):

  • having the storageIO classes mint file identifiers and control the path
  • providing a ranged query to get file bytes
  • having storageIO provide temp storage/manage cleanup of files that are not eventually saved as part of a dataset.
  • making storageIO classes responsible for transferring data from temp to permanent (e.g. so a file store could do a mv instead of a copy, and other stores could do things like just use metadata to tag a file as temp/permanent, etc.)
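
A rough, hypothetical sketch of what such an interface could look like - the method names are invented here for illustration, and this is not a proposed final design:

    import java.io.IOException;
    import java.io.InputStream;

    public interface RefactoredStorageIO {
        // the store mints the file identifier and controls the path
        String mintFileIdentifier() throws IOException;

        // ranged read, so callers (mime detection, FITS metadata, ...)
        // can fetch only the bytes they need instead of the whole file
        InputStream readRange(String identifier, long offset, long length) throws IOException;

        // the store owns temp space and the cleanup of never-saved files
        String storeTempFile(InputStream data) throws IOException;
        void cleanupAbandonedTempFiles() throws IOException;

        // the store decides how temp -> permanent happens: a file store
        // can mv, an object store might just flip a metadata tag
        void promoteToPermanent(String tempIdentifier) throws IOException;
    }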

Overall, I think that changes of this nature should be done in concert with thinking through the implications of remote stores, sensitive data, large data, data tags, reproducibility (where data location near compute may be important), and anything else on the roadmap that might affect requirements for storage and its functionality (provenance? multiple datasets referencing the same files, e.g. shared reference data or controls?). I don't want to make things too complicated, but I agree that refactoring storage beyond a couple of simple steps (as at the beginning of my comment) is probably big enough that we won't want to do it multiple times.

@PaulBoon
Contributor

To answer @TaniaSchlatter's question about how the user gets feedback:
If the save succeeds, there is no problem; only downloads attempted while saving might fail, which is only a real problem if saving takes considerable time.
When saving fails, the system has failed, and we want the administrator (manager) of Dataverse to look at the problem; we might want the user to try again once it has been fixed.

Maybe (for large saves) we should give the user the option to wait or to resume with other things.
To find out later whether saving is done, the user could reload the dataset page and check the progress, which we should then show somewhere on the page as "saving in progress" if not yet done.

@poikilotherm
Contributor Author

poikilotherm commented Feb 17, 2020

Thank you @TaniaSchlatter, @qqmyers and @PaulBoon for all your input on this! On a side note, my comment on #6505 might be relevant here; it mentions making things more modular in different places to speed things up.

From all your input, one might get the impression that this is not medium, but big, and needs better planning and architecture. @scolapasta, how do you feel about this?

@djbrooke how might this fit into the roadmap? This might even become something that needs funding. Someone at Tromsø mentioned some folks granting funds for refactoring.

@djbrooke
Contributor

@poikilotherm thanks for the tag and for the summary above. We have two UI/UX-intensive efforts underway right now (Computational Reproducibility and the Dataset/File Redesign), so I'm hesitant to add additional work that has front end impact. I'd suggest that we focus any refactoring as much as possible on improvements that don't have front end impact and revisit any front end improvements and changes when we have a bit more bandwidth on the UI/UX side. Generally, small changes that can be implemented independently are preferred over a larger rewrite; this allows us to review things and get them into releases more quickly. If there's something you, @qqmyers, @scolapasta, and others can come up with, we'll be happy to review and QA it through the regular processes.

If there is a larger rewrite needed, we can discuss the best plan for that as well. It's also good to hear about the possibility of funding. If there's something specific, we can talk about how to best support it, whether from an IQSS perspective or from a GDCC perspective.

@TaniaSchlatter
Member

To follow up on @PaulBoon's comment on feedback: the feedback that a file has been uploaded successfully is the population of the file table. If we allowed users to do other tasks while they wait for large files to upload, we would want to present a success or failure message where they would see it. This could be in a popup, or on the page they are actively on.

@poikilotherm
Contributor Author

poikilotherm commented Apr 23, 2020

Just recently @pdurbin linked on IRC to a related issue: #2848

@landreev
Contributor

Thank you for opening this issue, and to everybody who participated in the discussion.
We've been aware of the inefficiencies in the file upload as implemented (specifically, the multi-level, repeated temp file creation). We kept delaying addressing this since we were also talking about making file upload a modular service that could be separated from the main application. So there was an understanding that it would need to be redesigned. But we just had some internal discussions of this and the consensus appears to be that we do want to prioritize this issue and start addressing it in this dev. cycle - and possibly in one of the next dev. sprints.

@landreev
Contributor

To me personally, the biggest issue is the 2 temp copies created repeatedly - one by Glassfish/Primefaces, the second one by us. This is definitely a waste that should be eliminated.
Note that in addition to the Web UI upload described in detail above, the API upload is just as bad: Jersey first creates a temp copy of the file from the incoming POST/multipart-form stream, then passes an InputStream pointing to that temp file to our code. The difference is that instead of the domains/domain1/generated/jsp/dataverse folder, Jersey uses whatever java.io.tmpdir points to. So the important thing here is that there is more than one temp directory that can potentially be filled up (even DOSed) by active file uploads.

Yet another thing that may not be mentioned above: while Dataverse has a configurable file upload size limit, as implemented it can only be enforced in our code - i.e. only once the entire system temp file has been saved. You can set that limit to 8Kb in your Dataverse, but if a user attempts to upload a 100GB file, Jersey will dutifully try to save the entire stream in your /tmp before passing it to the application.
(A quick solution for this is to configure the POST size limit at the http listener level in Glassfish/Payara, but we are looking for a more programmatic solution that can be enabled from within the application; see the sketch below.)
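
As an illustration of what such a programmatic guard could look like - a hypothetical servlet filter, not existing Dataverse code, relying on the Servlet 4.0 default init/destroy methods (as in Payara 5), and only effective when the client declares a Content-Length (chunked uploads bypass it):

    import java.io.IOException;
    import javax.servlet.Filter;
    import javax.servlet.FilterChain;
    import javax.servlet.ServletException;
    import javax.servlet.ServletRequest;
    import javax.servlet.ServletResponse;
    import javax.servlet.annotation.WebFilter;
    import javax.servlet.http.HttpServletResponse;

    // Reject oversized uploads before the container buffers the whole
    // request body into a temp file. The mapping and limit are illustrative.
    @WebFilter(urlPatterns = {"/api/*"})
    public class UploadSizeLimitFilter implements Filter {
        private static final long MAX_UPLOAD_BYTES = 8L * 1024; // the 8Kb example above

        @Override
        public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
                throws IOException, ServletException {
            long declared = req.getContentLengthLong(); // -1 if not declared
            if (declared > MAX_UPLOAD_BYTES) {
                ((HttpServletResponse) res).sendError(
                        HttpServletResponse.SC_REQUEST_ENTITY_TOO_LARGE);
                return; // never hand the oversized body to Jersey
            }
            chain.doFilter(req, res);
        }
    }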

I would be really happy to eliminate at least one of these temp file stages. Saving the file directly into its final location right away, thus eliminating temp storage completely, as discussed above, should also be doable. I personally like the idea of keeping the file in temp storage, at least for the UI user, until they decide to click save. If the files go to the destination folder right away, then, just as junk files from incomplete uploads pile up in the temp directories now, they will pile up in the "real" storage folders - and it will still be the responsibility of the admin to go through the files there and erase the orphaned ones. (But I'm open to either way.)

@TaniaSchlatter
Member

TaniaSchlatter commented Apr 30, 2020

Adding some context from a recent internal discussion, so we don't lose it:

  • The current state of the upload page was intended to capture the most useful aspects of a new dataset: some metadata and files.

  • There have been more recent discussions and mockups aimed at reducing the metadata on that page, to required metadata and file upload only.

  • From the technical standpoint, it is tricky to have users add data before the dataset is created.

  • From the user standpoint, we want to make it as easy as possible to create a dataset, but we want datasets to have data, and useful metadata (which ultimately benefits depositors). There are sort of 3 steps, even in the new flow being considered: 1) create draft with required metadata (option to upload files) 2) publish (option to add files) 3) option to fill in additional metadata, and add files.

@landreev
Contributor

landreev commented May 1, 2020

@TaniaSchlatter I'm really happy to hear that you are thinking about how to improve the file upload workflow, in the context of creating a new dataset and otherwise. That should probably be a whole separate issue, though - right? My reading of this one was that it was created for something fairly narrowly defined - specifically, eliminating the inefficiencies in how we use temp storage in the process of adding files.

@poikilotherm
Contributor Author

poikilotherm commented Mar 9, 2021

Just recently, I once again looked into this, trying to find the root cause for the temporary storage during uploads.

Since I opened this issue, we moved to Payara 5 and now definitely work with JSF 2.3 etc. Looking at the PrimeFaces FileUpload docs, my initial statement about using the Apache Commons uploader is no longer true (and might have been a false claim from the start). It is all about using the native JSF way of doing things.

Still, I wondered why data is stored as parts under glassfish/domains/domain1/generated/jsp/dataverse, and I finally found the answer: this directory is the "Servlet Context Temporary Path", jakarta.servlet.context.tempdir (formerly known as javax.servlet.context.tempdir), which does not come from a system property.

This attribute is used by the PrimeFaces file uploader with priority over java.io.tmpdir:

    default String getUploadDirectory(T request) {
        // Servlet 2.5 compatibility, equivalent to ServletContext.TMP_DIR
        File tmpDir = (File) request.getServletContext().getAttribute("javax.servlet.context.tempdir");
        if (tmpDir == null) {
            // note: java.io.tmpdir is a system property, not an environment
            // variable, so this getenv() lookup in the PrimeFaces source
            // will usually return null
            tmpDir = new File(System.getenv("java.io.tmpdir"));
        }
        return tmpDir.getAbsolutePath();
    }

So it looks like what we need to do here is make this attribute configurable from our codebase, so that it points to a location of our choosing. This should be a fairly small thing to do - the research was the tough part.
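
One possible shape for that, sketched and untested - a context listener that overrides the attribute from a system property; the property name dataverse.upload.tempdir is invented here for illustration (the fix eventually merged took a different route, via glassfish-web.xml, as discussed below):

    import java.io.File;
    import javax.servlet.ServletContextEvent;
    import javax.servlet.ServletContextListener;
    import javax.servlet.annotation.WebListener;

    @WebListener
    public class UploadTempDirConfigListener implements ServletContextListener {
        @Override
        public void contextInitialized(ServletContextEvent sce) {
            // hypothetical property name, for illustration only
            String configured = System.getProperty("dataverse.upload.tempdir");
            if (configured != null) {
                // PrimeFaces reads this attribute first (see above), so
                // overriding it redirects where upload parts are written
                sce.getServletContext().setAttribute(
                        "javax.servlet.context.tempdir", new File(configured));
            }
        }
    }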

@mheppler
Contributor

mheppler commented Mar 9, 2021

Two things I do best... stirring the pot... expanding the scope... I'll just leave this here... for now...

@landreev
Contributor

landreev commented Mar 3, 2022

This directory is the "Servlet Context Temporary Path", jakarta.servlet.context.tempdir (formerly known as javax.servlet.context.tempdir), which does not come from a system property.

Yeah, this github issue is potentially HUGE/all-consuming, but we should have addressed this one small thing - making this PrimeFaces temp directory (more easily) configurable.
Just to make it clear, it IS already configurable. The location is stored in a servlet context attribute, as shown in the code fragment above. But the attribute is initialized from a property - indeed not a system property, but one that can be defined in the glassfish-web.xml file, like this:

<property name="tempdir" value="/some/alt/dir"/>

This is what I've been doing on IQSS's own prod servers to keep that web upload temp space on a separate partition. It is obviously an awkward configuration method. So yes, it would be great to have this value stored in a normal system property and then added to the context in the application code. The comment I'm quoting is almost a year old; maybe it's time... I'd be happy to make a PR just for this feature. I just need to read up on where this code needs to go.
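
For context, a minimal sketch of where that property goes - it sits directly inside the glassfish-web-app element of WEB-INF/glassfish-web.xml (the doctype shown is the common GlassFish 3.1 one; verify against your Payara version):

    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE glassfish-web-app PUBLIC
        "-//GlassFish.org//DTD GlassFish Application Server 3.1 Servlet 3.0//EN"
        "http://glassfish.org/dtds/glassfish-web-app_3_0-1.dtd">
    <glassfish-web-app>
        <!-- redirect the servlet context temp dir (and with it the
             PrimeFaces upload scratch space) to a separate partition -->
        <property name="tempdir" value="/some/alt/dir"/>
    </glassfish-web-app>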

@landreev
Contributor

landreev commented Mar 3, 2022

P.S. Thinking about it, we should have documented the hack above in the guides somewhere, too.

@poikilotherm
Contributor Author

poikilotherm commented Mar 3, 2022

@landreev if this is coming from glassfish-web.xml, it might be worth trying Payara Variable Substitution with it.

Something like <property name="tempdir" value="${MPCONFIG=dataverse.upload.tempdir:/some/sane/default}"/>

(MPCONFIG would allow using a system property, an env var, or another source for this.)

@pdurbin
Member

pdurbin commented Mar 3, 2022

we should've documented the hack above too in the guides somewhere

Just a reminder that an issue about documenting temp directories is mentioned above a couple times: #2848

Obviously, we could just document the one hack. Even that would probably help a lot.

@landreev
Contributor

landreev commented Mar 3, 2022

I'm trying to remember whether I had tried variable substitution and concluded it wasn't working, or not. Will try again.

@landreev
Contributor

landreev commented Mar 3, 2022

I did recall there being an issue for documenting the other temp spaces (the problem with this particular one is that it's not even on the radar for many installations). Surprised and embarrassed it hasn't been addressed yet.

poikilotherm added a commit to poikilotherm/dataverse that referenced this issue Sep 20, 2022
As outlined in IQSS#6656, files will be stored in
`domaindir/generated/jsp/dataverse` during upload before being moved to
our temporary ingest file space at `$dataverse.files.directory/temp`.

With this commit, we make it possible to configure a different location
for this kind of generated temporary file, using MPCONFIG variable
substitution inside glassfish-web.xml.

Also sorts the content of glassfish-web.xml into the order specified by
the XSD. Documentation of the setting is provided.
poikilotherm added a commit to poikilotherm/dataverse that referenced this issue Nov 11, 2022
pdurbin added a commit to poikilotherm/dataverse that referenced this issue Nov 22, 2022
pdurbin added a commit to poikilotherm/dataverse that referenced this issue Nov 22, 2022
pdurbin added a commit to poikilotherm/dataverse that referenced this issue Nov 22, 2022
kcondon added a commit that referenced this issue Dec 5, 2022
feat(upload): make upload file storage path configurable #6656
@pdurbin added this to the 5.13 milestone Dec 7, 2022
T-Haeussermann pushed a commit to HyperSpec-FDM/dataverse-kubernetes that referenced this issue Nov 16, 2023
…image and Kubernetes deployment. Will be removed when gdcc#178 gets solved.
T-Haeussermann pushed a commit to HyperSpec-FDM/dataverse-kubernetes that referenced this issue Apr 25, 2024
…image and Kubernetes deployment. Will be removed when gdcc#178 gets solved.