
Refactor file upload from web UI and temporary storage #6656

Closed
poikilotherm opened this issue Feb 17, 2020 · 20 comments · Fixed by #8983
Labels
Component: Code Infrastructure (formerly "Feature: Code Infrastructure") · Feature: File Upload & Handling · Feature: Installation Guide · Feature: Performance & Stability · Type: Bug (a defect) · User Role: Depositor (creates datasets, uploads data, etc.) · User Role: Sysadmin (installs, upgrades, and configures the system, connects via ssh)
Comments

@poikilotherm
Contributor

Intro

This issue arises from https://groups.google.com/d/topic/dataverse-community/IB1wYpoU_H8

A rough summary: when you upload data via the web UI, the data is stored in a few different places before being committed to real storage, as it needs to be processed, analyzed, and possibly worked on.

Nonetheless, this could be tweaked, especially when starting to deal with larger data. This is somewhat related to #6489, but since not every big-data installation will want to expose its S3 storage to the internet, or may not use that new functionality at all, this should still be addressed. From a UI and UX perspective, which is touched on here, issues #6604 and #6605 seem related.

Mentions: @pdurbin @djbrooke @TaniaSchlatter @scolapasta @qqmyers @landreev

Discovery

I tested this on my local Kubernetes cluster and created a little video of what happens when. After that I dug a bit through the code to see if there are potential problems ahead for our installation.

Please enjoy the following demo (click on the image):
upload-test-1g-2g-5g

Things to note:

  1. Data is uploaded and stored in the glassfish/domains/domain1/generated/jsp/dataverse folder, as already stated in the Google Group posting.
  2. After the upload is done, but before it is marked completed, the data is copied to the temp folder, which lives at the location configured via the JVM option dataverse.files. Until this copy is done, the user experiences a progress bar hanging at 100%.
  3. The uploaded data is in most cases cloned, eating up more space. Even after the upload has been marked finished to the user, the temporary data in the generated folder is often not deleted.
  4. Upload to real file storage starts when the user clicks "Save". The user sees a spinning circle, which might appear to hang for large data.

Analysis

Now let's see what happens in the background from an implementation point of view. The following diagram is a bit shortened and leaves out edge cases to make it easier to understand.

Click on the image to access the source.
plantuml

Things to note here:

  1. PrimeFaces uses Apache Commons FileUpload, so the temporary files created by it are deleted once the InputStream is closed. This, however, does not happen in the current codebase, which seems to be at least one cause of the dangling files noticed in the discovery above.
  2. The GC eventually closes the dangling streams once editing is done, but there is a real danger of leftovers. This should be avoided (see the sketch below).
  3. As noticed in the Google Group and during discovery, the place where PrimeFaces uploads its data is not the same as what dataverse.files has been configured to. The docs lack that information, and there is a real danger of disks filling up and breaking things.
  4. The same data is copied around a few times, and from what I saw in the code it is even copied when moved to real storage on a local filesystem. That sounds very much like it introduces lag for users.
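
To illustrate points 1 and 2: a minimal sketch of deterministic stream handling, assuming a hypothetical helper (this is not the current Dataverse code). With try-with-resources, the upload stream is closed as soon as the copy finishes, which lets the upload library delete its backing temp file immediately instead of waiting for the GC:

    import java.io.IOException;
    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.StandardCopyOption;

    public final class UploadStreamHandling {
        // Hypothetical helper: copy an upload stream into our temp area and
        // close the stream deterministically so the uploader can clean up.
        static Path saveToTemp(InputStream uploadStream, Path tempDir) throws IOException {
            Path target = Files.createTempFile(tempDir, "upload-", ".tmp");
            try (InputStream in = uploadStream) {
                Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
            } // the stream is closed here, not whenever the GC runs
            return target;
        }
    }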

Ideas and TODOs

I feel like this is a medium sized issue, as it needs quite a bit of refactoring.
A collection of ideas what to do next:

  • Think about optimizing upload, avoiding multiple copies. Things like this or this and this seem to be good starting points.
  • Think about the UX - is this acceptable for large data? The upload itself is quite fast, as seen, even for bigger files (this was from my workstation to a remote on-premises location over 1 Gbit LAN).
  • Which other issues is this connected to?
@poikilotherm added the labels Feature: File Upload & Handling, Type: Bug, Component: Code Infrastructure, Feature: Performance & Stability, Feature: Installation Guide, User Role: Sysadmin, User Role: Depositor, and Medium on Feb 17, 2020
@PaulBoon
Contributor

When doing upload tests for the S3 storage option, I noticed that the upload to storage (step 4 in your things-to-note list), which only shows a spinner, mostly took longer than the user's upload to Dataverse.
One of my colleagues suggested that it might not be necessary to make the user wait for this spinner to end; we could just let the user resume other UI actions while the storing is silently handled by Dataverse in the background.

@TaniaSchlatter
Member

@poikilotherm, thanks for this detailed issue. I'd like to separate out the UX/UI/ system behaviors a bit.

From the UX standpoint, it sounds like there are inefficiencies in how upload works that can cause lags.

From the UI standpoint, feedback is not clear/does not represent what the system is doing:

  • After upload, but while the data are being copied to the temp folder, the user sees a hanging progress bar at 100% – is this until copying is complete?
    note to check: what is the feedback when complete?
  • When the user clicks "Save" the UI shows a moving circle, which might appear as hanging if saving takes a long time.
    • progress is not indicated visually to the user
    • does the user really need to wait until all is complete? (@PaulBoon) Question: if the user can proceed without saving completing, as @PaulBoon suggests, when/how would they get feedback that saving is complete?

I've clarified that issues #6604 and #6605 are primarily about addressing scalability in the UI.

@qqmyers
Member

qqmyers commented Feb 17, 2020

@poikilotherm - good stuff! In putting together the s3 direct uploads, I've been looking at the file workflow and thinking about optimizations as well. It seems like there are a couple of things that could be picked off to start:

  • Documenting that java.io.tmpdir can be set in glassfish to move the first of the temp files you mention to another dir.
  • Adding a filter (some of your links) to directly create the temp file Dataverse uses, avoiding the initial copy created by primefaces.
  • Possibly adding an optional optimization to mv files when possible (temp and persistent file stores on the same file system) rather than using streams and copying as is required for the general case (see the sketch after this list).
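
A minimal sketch of that third bullet, assuming a hypothetical helper (not existing Dataverse code): try a cheap rename first, and only fall back to a byte-for-byte copy when temp and permanent storage are on different file systems:

    import java.io.IOException;
    import java.nio.file.AtomicMoveNotSupportedException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.StandardCopyOption;

    public final class TempFilePromotion {
        static void moveOrCopy(Path tempFile, Path finalLocation) throws IOException {
            try {
                // Near-instant rename when both paths share a file system
                Files.move(tempFile, finalLocation, StandardCopyOption.ATOMIC_MOVE);
            } catch (AtomicMoveNotSupportedException e) {
                // Different file systems: stream the bytes, then drop the temp file
                Files.copy(tempFile, finalLocation, StandardCopyOption.REPLACE_EXISTING);
                Files.delete(tempFile);
            }
        }
    }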

Beyond that, I think there are multiple ways one could optimize, at least for some situations (Apologies for this part rambling a bit):

  • If the temp files and permanent files are in the same file system, a mv could be done instead of a copy.
  • One could write files directly to their final locations (as with direct s3 upload).
  • Transfers could be parallelized. Some aspects of this already occur - I think Primefaces uploads files in parallel (which would potentially speed up your video), but it won't parallelize the transfer of a single file as protocols like Globus FTP do. I don't think internal transfers are handled in parallel, though. (There's also a limitation in the API where one can only upload one file at a time/each upload causes a Dataset save.)

There are a few considerations there that might inform design:

  • Keeping temp files makes it possible to put them on fast storage (e.g. an SSD) for processing.
  • Zip files aren't permanent and would need to be removed once the internal files are created, but, at present, the original file is stored for all other types (potentially with a translation (ingest) or additional metadata)
  • The user can delete files after upload and before saving, and can simply abandon a session. My sense is that most of the ways temp files become abandoned have been fixed in the latest versions of Dataverse (incremental fixes starting back 10+ versions), with abandoned sessions being the biggest remaining issue. It is definitely much improved since ~4.9. However, if files are stored directly to their final location, it becomes more important to close all holes, since it is harder to find abandoned files without a dedicated API/tool (i.e. you can't just rely on location or naming conventions).
  • The current storage scheme requires knowing the dataset DOI to start (since it is in the file path/s3 name)
  • Functionality like ingest needs access to the whole file, but this only applies to some mimetypes, and it may be that very few large files will undergo such processing.
  • Some functionality, like viewing file content to assign a mime type, currently gets access to the whole file, but none of it actually uses more than the first ~1K bytes (some only look at the first few bytes). Other processing, like pulling metadata from FITS files, may need more bytes, but could potentially still retrieve just the required parts of the file. (For direct s3 uploads, providing ranged queries, which the protocol itself supports, could eliminate having to retrieve the whole file to the dataverse machine for such processing, essentially creating a temp copy after the direct download.)
  • Truly remote stores (like TRSA) might require less custom functionality if such functionality didn't require making a temporary copy of the whole file at Dataverse.
  • Some stores (like rsync and I think the alternate s3 direct upload design discussed at the European meeting) don't make the user wait for file uploads and internal transfers to occur, but they require locking the dataset until that's complete and that definitely opens up questions of what you should do if something in that fails (i.e. after the user has gone away thinking things were OK)

My general sense for these types of considerations is that some refactoring of the StorageIO interface could help here and might enable experimentation/stores optimized for specific content over time, along with helping the general case. I don't have a specific design in mind yet, but things I've thought about include (a rough interface sketch follows this list):

  • having the storageIO classes mint file identifiers and control the path
  • providing a ranged query to get file bytes
  • having storageIO provide temp storage/manage cleanup of files that are not eventually saved as part of a dataset.
  • making storageIO classes responsible for transferring data from temp to permanent (e.g. so a file store could do a mv instead of a copy, and other stores could do things like just use metadata to tag a file as temp/permanent, etc.)
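
A rough, hypothetical sketch of what such an interface could look like - the method names are invented here for illustration, and this is not a proposed final design:

    import java.io.IOException;
    import java.io.InputStream;

    public interface RefactoredStorageIO {
        // the store mints the file identifier and controls the path
        String mintFileIdentifier() throws IOException;

        // ranged read, so callers (mime detection, FITS metadata, ...)
        // can fetch only the bytes they need instead of the whole file
        InputStream readRange(String identifier, long offset, long length) throws IOException;

        // the store owns temp space and the cleanup of never-saved files
        String storeTempFile(InputStream data) throws IOException;
        void cleanupAbandonedTempFiles() throws IOException;

        // the store decides how temp -> permanent happens: a file store
        // can mv, an object store might just flip a metadata tag
        void promoteToPermanent(String tempIdentifier) throws IOException;
    }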

Overall, I think that changes of this nature should be done in concert with thinking through the implications of remote stores, sensitive data, large data, data tags, reproducibility (where data location near compute may be important), and anything else on the roadmap that might affect requirements for storage and its functionality (provenance? multiple datasets referencing the same files, e.g. shared reference data or controls?). I don't want to make things too complicated, but I agree that refactoring storage beyond a couple of simple steps (as at the beginning of my comment) is probably big enough that we won't want to do it multiple times.

@PaulBoon
Contributor

To answer @TaniaSchlatter's question about how the user gets feedback:
If the save succeeds, there is no problem; only downloads attempted while saving might fail, which is only a real problem if saving takes considerable time.
When saving fails, the system has failed, and we want the administrator (manager) of Dataverse to look at the problem; we might want the user to try again once it has been fixed.

Maybe (for large saves) we should give the user the option to wait or to resume with other things.
To find out later whether saving is done, the user could reload the dataset page and check the progress, which we should then show somewhere on the page as "saving in progress" if not yet done.

@poikilotherm
Contributor Author

poikilotherm commented Feb 17, 2020

Thank you @TaniaSchlatter, @qqmyers and @PaulBoon for all your input on this! On a side note, my comment on #6505 might be relevant here; it mentions making things more modular in different places to speed things up.

From all your input, one might get the impression that this is not medium, but big, and needs better planning and architecture. @scolapasta, how do you feel about this?

@djbrooke how might this fit into the roadmap? This might even become something that needs funding. Someone at Tromsø mentioned some folks granting funds for refactoring.

@djbrooke
Contributor

@poikilotherm thanks for the tag and for the summary above. We have two UI/UX-intensive efforts underway right now (Computational Reproducibility and the Dataset/File Redesign), so I'm hesitant to add additional work that has front end impact. I'd suggest that we focus any refactoring as much as possible on improvements that don't have front end impact and revisit any front end improvements and changes when we have a bit more bandwidth on the UI/UX side. Generally, small changes that can be implemented independently are preferred over a larger rewrite; this allows us to review things and get them into releases more quickly. If there's something you, @qqmyers, @scolapasta, and others can come up with, we'll be happy to review and QA it through the regular processes.

If there is a larger rewrite needed, we can discuss the best plan for that as well. It's also good to hear about the possibility of funding. If there's something specific, we can talk about how to best support it, whether from an IQSS perspective or from a GDCC perspective.

@TaniaSchlatter
Member

To follow up on @PaulBoon's comment on feedback: the feedback that a file has been uploaded successfully is the population of the file table. If we allowed users to do other tasks while they wait for large files to upload, we would want to present a success or failure message where they would see it. This could be in a popup, or on the page they are actively on.

@poikilotherm
Contributor Author

poikilotherm commented Apr 23, 2020

Just recently @pdurbin linked on IRC to a related issue: #2848

@landreev
Contributor

Thank you for opening this issue, and to everybody who participated in the discussion.
We've been aware of the inefficiencies in the file upload as implemented (specifically, the multi-level, repeated temp file creation). We kept delaying addressing this since we were also talking about making file upload a modular service that could be separated from the main application. So there was an understanding that it would need to be redesigned. But we just had some internal discussions of this and the consensus appears to be that we do want to prioritize this issue and start addressing it in this dev. cycle - and possibly in one of the next dev. sprints.

@landreev
Contributor

To me personally, the biggest issue is the 2 temp copies created repeatedly - one by Glassfish/Primefaces, the second one by us. This is definitely a waste that should be eliminated.
Note that in addition to the Web UI upload described in detail above, the API upload is just as bad: Jersey first creates a temp copy of the file from the incoming POST/multipart-form stream, then passes an InputStream pointing to that temp file to our code. The difference is that instead of the domains/domain1/generated/jsp/dataverse folder, Jersey uses whatever java.io.tmpdir points to. So the important thing here is that there is more than one temp directory that can potentially be filled up (even DOSed) by active file uploads.

Yet another thing that may not be mentioned above: while Dataverse has a configurable file upload size limit, as implemented it can only be enforced in our code - i.e. only once the entire system temp file has been saved. You can set that limit to 8Kb in your Dataverse, but if a user attempts to upload a 100GB file, Jersey will dutifully try to save the entire stream in your /tmp before passing it to the application.
(A quick solution for this is to configure the POST size limit at the http listener level in Glassfish/Payara, but we are looking for a more programmatic solution that can be enabled from within the application; see the sketch below.)
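
As an illustration of what such a programmatic guard could look like - a hypothetical servlet filter, not existing Dataverse code, relying on the Servlet 4.0 default init/destroy methods (as in Payara 5), and only effective when the client declares a Content-Length (chunked uploads bypass it):

    import java.io.IOException;
    import javax.servlet.Filter;
    import javax.servlet.FilterChain;
    import javax.servlet.ServletException;
    import javax.servlet.ServletRequest;
    import javax.servlet.ServletResponse;
    import javax.servlet.annotation.WebFilter;
    import javax.servlet.http.HttpServletResponse;

    // Reject oversized uploads before the container buffers the whole
    // request body into a temp file. The mapping and limit are illustrative.
    @WebFilter(urlPatterns = {"/api/*"})
    public class UploadSizeLimitFilter implements Filter {
        private static final long MAX_UPLOAD_BYTES = 8L * 1024; // the 8Kb example above

        @Override
        public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
                throws IOException, ServletException {
            long declared = req.getContentLengthLong(); // -1 if not declared
            if (declared > MAX_UPLOAD_BYTES) {
                ((HttpServletResponse) res).sendError(
                        HttpServletResponse.SC_REQUEST_ENTITY_TOO_LARGE);
                return; // never hand the oversized body to Jersey
            }
            chain.doFilter(req, res);
        }
    }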

I would be really happy to eliminate at least one of these temp file stages. Saving the file directly into its final location right away, thus eliminating temp storage completely, as discussed above, should also be doable. I personally like the idea of keeping the file in temp storage, at least for the UI user, until they decide to click save. If the files go to the destination folder right away, then, just as junk files from incomplete uploads pile up in the temp directories now, they will pile up in the "real" storage folders - and it will still be the responsibility of the admin to go through the files there and erase the orphaned ones. (But I'm open to either way.)

@TaniaSchlatter
Member

TaniaSchlatter commented Apr 30, 2020

Adding some context from a recent internal discussion, so we don't lose it:

  • The current state of the upload page was intended to capture the most useful aspects of a new dataset: some metadata and files.

  • There have been more recent discussions and mockups aimed at reducing the metadata on that page, to required metadata and file upload only.

  • From the technical standpoint, it is tricky to have users add data before the dataset is created.

  • From the user standpoint, we want to make it as easy as possible to create a dataset, but we want datasets to have data, and useful metadata (which ultimately benefits depositors). There are sort of 3 steps, even in the new flow being considered: 1) create draft with required metadata (option to upload files) 2) publish (option to add files) 3) option to fill in additional metadata, and add files.

@landreev
Contributor

landreev commented May 1, 2020

@TaniaSchlatter I'm really happy to hear that you are thinking about how to improve the file upload workflow, in the context of creating a new dataset and otherwise. That should probably be a whole separate issue, though - right? My reading of this one was that it was created for something fairly narrowly defined - specifically, eliminating the inefficiencies in how we use temp storage in the process of adding files.

@poikilotherm
Contributor Author

poikilotherm commented Mar 9, 2021

Just recently, I once again looked into this, trying to find the root cause for the temporary storage during uploads.

Since I opened this issue, we moved to Payara 5 and now definitely work with JSF 2.3 etc. Looking at the PrimeFaces FileUpload docs, my initial statement about using the Apache Commons uploader is no longer true (and might have been a false claim from the start). It is all about using the native JSF way of doing things.

Still, I wondered why data is stored as parts under glassfish/domains/domain1/generated/jsp/dataverse, and I finally found the answer: this directory is the "Servlet Context Temporary Path", jakarta.servlet.context.tempdir (formerly known as javax.servlet.context.tempdir), which does not come from a system property.

This attribute is used by the PrimeFaces file uploader with priority over java.io.tmpdir:

    default String getUploadDirectory(T request) {
        // Servlet 2.5 compatibility, equivalent to ServletContext.TMP_DIR
        File tmpDir = (File) request.getServletContext().getAttribute("javax.servlet.context.tempdir");
        if (tmpDir == null) {
            // note: java.io.tmpdir is a system property, not an environment
            // variable, so this getenv() lookup in the PrimeFaces source
            // will usually return null
            tmpDir = new File(System.getenv("java.io.tmpdir"));
        }
        return tmpDir.getAbsolutePath();
    }

So it looks like what we need to do here is make this attribute configurable from our codebase, so that it points to a location of our choosing. This should be a fairly small thing to do - the research was the tough part.
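
One possible shape for that, sketched and untested - a context listener that overrides the attribute from a system property; the property name dataverse.upload.tempdir is invented here for illustration (the fix eventually merged took a different route, via glassfish-web.xml, as discussed below):

    import java.io.File;
    import javax.servlet.ServletContextEvent;
    import javax.servlet.ServletContextListener;
    import javax.servlet.annotation.WebListener;

    @WebListener
    public class UploadTempDirConfigListener implements ServletContextListener {
        @Override
        public void contextInitialized(ServletContextEvent sce) {
            // hypothetical property name, for illustration only
            String configured = System.getProperty("dataverse.upload.tempdir");
            if (configured != null) {
                // PrimeFaces reads this attribute first (see above), so
                // overriding it redirects where upload parts are written
                sce.getServletContext().setAttribute(
                        "javax.servlet.context.tempdir", new File(configured));
            }
        }
    }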

@mheppler
Contributor

mheppler commented Mar 9, 2021

Two things I do best... stirring the pot... expanding the scope... I'll just leave this here... for now...

@landreev
Contributor

landreev commented Mar 3, 2022

This directory is the "Servlet Context Temporary Path", jakarta.servlet.context.tempdir (formerly known as javax.servlet.context.tempdir), which does not come from a system property.

Yeah, this github issue is potentially HUGE/all-consuming, but we should have addressed this one small thing - making this PrimeFaces temp directory (more easily) configurable.
Just to make it clear, it IS already configurable. The location is stored in a servlet context attribute, as shown in the code fragment above. But the attribute is initialized from a property - indeed not a system property, but one that can be defined in the glassfish-web.xml file, like this:

<property name="tempdir" value="/some/alt/dir"/>

This is what I've been doing on IQSS's own prod servers to keep that web upload temp space on a separate partition. It is obviously an awkward configuration method. So yes, it would be great to have this value stored in a normal system property and then added to the context in the application code. The comment I'm quoting is almost a year old; maybe it's time... I'd be happy to make a PR just for this feature. I just need to read up on where this code needs to go.
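
For context, a minimal sketch of where that property goes - it sits directly inside the glassfish-web-app element of WEB-INF/glassfish-web.xml (the doctype shown is the common GlassFish 3.1 one; verify against your Payara version):

    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE glassfish-web-app PUBLIC
        "-//GlassFish.org//DTD GlassFish Application Server 3.1 Servlet 3.0//EN"
        "http://glassfish.org/dtds/glassfish-web-app_3_0-1.dtd">
    <glassfish-web-app>
        <!-- redirect the servlet context temp dir (and with it the
             PrimeFaces upload scratch space) to a separate partition -->
        <property name="tempdir" value="/some/alt/dir"/>
    </glassfish-web-app>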

@landreev
Contributor

landreev commented Mar 3, 2022

P.S. Thinking about it, we should have documented the hack above in the guides somewhere, too.

@poikilotherm
Contributor Author

poikilotherm commented Mar 3, 2022

@landreev if this is coming from glassfish-web.xml, it might be worth trying Payara Variable Substitution with it.

Something like <property name="tempdir" value="${MPCONFIG=dataverse.upload.tempdir:/some/sane/default}"/>

(MPCONFIG would allow using a system property, an env var, or another source for this.)

@pdurbin
Member

pdurbin commented Mar 3, 2022

we should've documented the hack above too in the guides somewhere

Just a reminder that an issue about documenting temp directories is mentioned above a couple times: #2848

Obviously, we could just document the one hack. Even that would probably help a lot.

@landreev
Contributor

landreev commented Mar 3, 2022

I'm trying to remember whether I had tried variable substitution and concluded it wasn't working, or not. Will try again.

@landreev
Contributor

landreev commented Mar 3, 2022

I did recall there being an issue for documenting the other temp spaces (the problem with this particular one is that it's not even on the radar for many installations). Surprised and embarrassed it hasn't been addressed yet.

poikilotherm added a commit to poikilotherm/dataverse that referenced this issue Sep 20, 2022
As outlined in IQSS#6656, files will be stored in
`domaindir/generated/jsp/dataverse` during upload before being moved to
our temporary ingest file space at `$dataverse.files.directory/temp`.

With this commit, we make it possible to configure a different location
for this kind of generated temporary file, using MPCONFIG variable
substitution inside glassfish-web.xml.

Also sorts the content of glassfish-web.xml into the order specified by
the XSD. Documentation of the setting is provided.
poikilotherm added a commit to poikilotherm/dataverse that referenced this issue Nov 11, 2022
pdurbin added a commit to poikilotherm/dataverse that referenced this issue Nov 22, 2022
pdurbin added a commit to poikilotherm/dataverse that referenced this issue Nov 22, 2022
pdurbin added a commit to poikilotherm/dataverse that referenced this issue Nov 22, 2022
kcondon added a commit that referenced this issue Dec 5, 2022
feat(upload): make upload file storage path configurable #6656
@pdurbin added this to the 5.13 milestone Dec 7, 2022
T-Haeussermann pushed a commit to HyperSpec-FDM/dataverse-kubernetes that referenced this issue Nov 16, 2023
…image and Kubernetes deployment. Will be removed when gdcc#178 gets solved.
T-Haeussermann pushed a commit to HyperSpec-FDM/dataverse-kubernetes that referenced this issue Apr 25, 2024
…image and Kubernetes deployment. Will be removed when gdcc#178 gets solved.