
Upload large files directly from disk, without loading into memory #19049

Closed

huonw opened this issue May 19, 2023 · 1 comment · Fixed by #19711

Comments


huonw commented May 19, 2023

Is your feature request related to a problem? Please describe.

With #18153, large files are now cached as standalone files on disk, rather than in the core LMDB structures. Uploading to a remote cache should be able to pipe these directly from that location on disk.

Currently they're pulled into memory, written to a temporary file (with an additional bug: this is sync IO in async code), and then that temporary file is uploaded via mmap. Breadcrumbs, in src/rust/engine/fs/store:

  1. store::Store::ensure_remote_has_recursive calls store::Store::store_large_blob_remote
  2. store_large_blob_remote calls store::remote::ByteStore::store_buffered
  3. store_buffered creates temporary files...
  4. ...then calls the closure that store_large_blob_remote provided with a std::fs::File for that temporary file
  5. that closure reads the whole large file into memory and splats it into the temporary file, synchronously (see the sketch below)
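
A minimal sketch of that last step's problematic shape (the types and load_bytes here are illustrative stand-ins, not the exact Pants API):

use std::io::Write;

// Illustrative stand-ins for the real store types.
struct Digest;
struct LocalStore;
impl LocalStore {
  fn load_bytes(&self, _digest: &Digest) -> std::io::Result<Vec<u8>> {
    Ok(vec![0u8; 1024]) // placeholder for reading the blob out of the local store
  }
}

// Step 5's closure, roughly: the whole blob is materialised in memory and
// then written with *sync* IO, even though the caller is an async task.
fn fill_temp_file(local: &LocalStore, digest: &Digest, tmp: &mut std::fs::File) -> std::io::Result<()> {
  let all_bytes = local.load_bytes(digest)?; // entire large file now in memory
  tmp.write_all(&all_bytes) // blocking write on an async runtime thread
}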

Describe the solution you'd like

store::Store::ensure_remote_has_recursive should pass a file handle into the remote cache code, which can then manipulate/upload it directly, without going through memory.

The following code could be adjusted to just decide whether the blob is on disk (if so, go via the disk) or in LMDB (if so, just pull it into memory), resolving the TODO by removing the link to the wire chunk size:

// TODO(John Sirois): Consider allowing configuration of when to buffer large blobs
// to disk to be independent of the remote store wire chunk size.
if digest.size_bytes > remote_store.store.chunk_size_bytes() {
  Self::store_large_blob_remote(local, remote_store.store, entry_type, digest).await
} else {
  Self::store_small_blob_remote(local, remote_store.store, entry_type, digest).await
}

Describe alternatives you've considered

N/A

Additional context

N/A

@huonw huonw self-assigned this May 19, 2023
@huonw huonw changed the title Upload large files directly from disk, without serialising into memory Upload large files directly from disk, without loading into memory May 19, 2023

huonw commented May 21, 2023

Hmmm, I wonder if the use of an mmap'd file in async code is also an async hazard: I imagine memory accesses to pages of the file that aren't yet in memory will effectively trigger a sync read (i.e. the page fault will deschedule the whole OS thread, including the Rust-level async runtime, rather than letting the runtime work on another coroutine...).
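
For contrast, a sketch of reading via tokio's async IO, where each read yields to the runtime while the OS fetches data rather than blocking the whole thread (upload_chunk is a hypothetical stand-in for a real per-chunk upload):

use tokio::io::AsyncReadExt;

async fn stream_file(path: &std::path::Path, chunk_size: usize) -> std::io::Result<()> {
  let mut file = tokio::fs::File::open(path).await?;
  let mut buf = vec![0u8; chunk_size];
  loop {
    let n = file.read(&mut buf).await?;
    if n == 0 {
      break; // EOF
    }
    // Each .await is a yield point: the runtime can run other tasks while
    // IO is pending, unlike a page fault on an mmap'd slice, which blocks
    // the whole OS thread.
    upload_chunk(&buf[..n]).await?;
  }
  Ok(())
}

async fn upload_chunk(_chunk: &[u8]) -> std::io::Result<()> {
  Ok(()) // hypothetical stand-in for uploading one chunk
}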

huonw added a commit that referenced this issue Jul 12, 2023
This does preparatory refactoring towards #11149 (and probably also
#19049), by adjusting `store::remote::ByteStore` in a few ways to make
it easier to plop in new 'providers':

- package the various options for creating a provider into a struct, and a bunch of mechanical refactoring to use that struct
- improved tests:
  - explicit tests for the REAPI provider
  - more extensive exercising of the `ByteStore` interface, including extra tests for combinations like "`load_file` when digest is missing", and (to me) more logical test names
  - testing the 'normal' `ByteStore` interface via a fully in-memory simple `Provider` instead of needing to run the `StubCAS`
  - testing some more things at lower levels, like the REAPI provider doing hash verification, the `LoadDestination::reset` methods doing the right thing, and `ByteStore::store_buffered`

The commits are mostly individually reviewable, although I'd recommend
having a sense of the big picture as described above before going
through them one-by-one.

After this, the next steps towards #11149 will be:

1. do something similar for the action cache
2. implement new providers, with some sort of factory function for going from `RemoteOptions` to an appropriate `Arc<dyn ByteStoreProvider + 'static>` (one possible shape is sketched below)
3. expose settings to select and configure those new providers
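
A possible shape for that factory, with illustrative stand-in types (the real `RemoteOptions` and `ByteStoreProvider` live in `store::remote` and are richer than this):

use std::sync::Arc;

// Illustrative stand-ins for the real types.
struct RemoteOptions {
  store_address: String,
}
trait ByteStoreProvider: Send + Sync {}
struct ReapiProvider;
impl ByteStoreProvider for ReapiProvider {}

// Pick a provider implementation based on the configured address/options.
fn choose_provider(options: &RemoteOptions) -> Result<Arc<dyn ByteStoreProvider + 'static>, String> {
  if options.store_address.starts_with("grpc") {
    Ok(Arc::new(ReapiProvider))
  } else {
    Err(format!("unsupported remote store address: {}", options.store_address))
  }
}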
huonw added a commit that referenced this issue Sep 12, 2023
…9711)

This (hopefully) optimises storing large blobs to a remote cache, by
streaming them directly from the file stored on disk in the "FSDB".

This builds on the FSDB local store work (#18153), relying on large
objects being stored as an immutable file on disk, in the cache managed
by Pants.

This is an optimisation in several ways:

- Cutting out an extra temporary file:
  - Previously `Store::store_large_blob_remote` would load the whole blob from the local store and then write that to a temporary file. This was appropriate with LMDB-backed blobs.
  - With the new FSDB, there's already a file that can be used, so there's no need for that temporary, and the file creation and writing overhead can be eliminated.
- Reducing sync IO in async tasks, due to mmap:
  - Previously `ByteStore::store_buffered` would take that temporary file and mmap it, to be able to slice into `Bytes` more efficiently... except this is secretly blocking/sync IO, happening within async tasks (AIUI: when accessing a mmap'd byte that's only on disk, not yet in memory, the whole OS thread is blocked/descheduled while the OS pulls the relevant part of the file into memory, i.e. `tokio` can't run another task on that thread).
  - This new approach uses normal `tokio` async IO mechanisms to read the file, and thus hopefully has higher concurrency (one possible shape is sketched below).
  - (This also eliminates the unmaintained `memmap` dependency.)
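
One possible shape for that streaming read (not necessarily the PR's exact code; `send_chunk` is a hypothetical stand-in for the real per-chunk gRPC write):

use futures::StreamExt;
use tokio_util::io::ReaderStream;

// Turn the on-disk FSDB file into a stream of `Bytes` chunks for upload,
// so the blob never has to be resident in memory all at once.
async fn upload_from_disk(path: &std::path::Path, chunk_size: usize) -> std::io::Result<()> {
  let file = tokio::fs::File::open(path).await?;
  let mut chunks = ReaderStream::with_capacity(file, chunk_size);
  while let Some(chunk) = chunks.next().await {
    send_chunk(chunk?).await?; // each chunk arrives as a `bytes::Bytes`
  }
  Ok(())
}

async fn send_chunk(_bytes: bytes::Bytes) -> std::io::Result<()> {
  Ok(()) // hypothetical stand-in for the real upload call
}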

I haven't benchmarked this though.

My main motivation for this is firming up the provider API before adding
new byte store providers, for #11149. This also resolves some TODOs and
even eliminates some `unsafe`, yay!

The commits are individually reviewable.

Fixes #19049, fixes #14341 (`memmap` removed), closes #17234 (solves the
same problem but with an approach that wasn't possible at the time).