Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Streaming remote uploads read a chunk at a time #17234

Conversation

stuhood
Copy link
Member

@stuhood stuhood commented Oct 15, 2022

To reduce memory usage, #12087 moved to streaming the entire content of a large Digest out into a mmap'd temporary file, and then executing a streaming upload from there.

To remove that extra copy (and drop the dependency on memmap, which has been replaced: see #14341), this change switches to producing a Stream of chunks from the LMDB store, which can be uploaded without further copying.

Fixes #14341.

[ci skip-build-wheels]

[ci skip-build-wheels]
@stuhood stuhood added the category:internal CI, fixes for not-yet-released features, etc. label Oct 15, 2022
Ok(mapping) as Result<memmap::Mmap, String>
}?);

retry_call(
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mm. Looks like I removed retry from the large blob case. Will re-add that.

@stuhood stuhood marked this pull request as draft October 15, 2022 05:22
@huonw huonw closed this in #19711 Sep 12, 2023
huonw added a commit that referenced this pull request Sep 12, 2023
…9711)

This (hopefully) optimises storing large blobs to a remote cache, by
streaming them directly from the file stored on disk in the "FSDB".

This builds on the FSDB local store work (#18153), relying on large
objects being stored as an immutable file on disk, in the cache managed
by Pants.

This is an optimisation in several ways:

- Cutting out an extra temporary file:
- Previously `Store::store_large_blob_remote` would load the whole blob
from the local store and then write that to a temporary file. This was
appropriate with LMBD-backed blobs.
- With new FSDB, there's already a file that can be used, no need for
that temporary, and so the file creation and writing overhead can be
eliminated .
- Reducing sync IO in async tasks, due to mmap:
- Previously `ByteStore::store_buffered` would take that temporary file
and mmap it, to be able to slice into `Bytes` more efficiently... except
this is secretly blocking/sync IO, happening within async tasks (AIUI:
when accessing a mmap'd byte that's only on disk, not yet in memory, the
whole OS thread is blocked/descheduled while the OS pulls the relevant
part of the file into memory, i.e. `tokio` can't run another task on
that thread).
- This new approach uses normal `tokio` async IO mechanisms to read the
file, and thus hopefully this has higher concurrency.
  - (This also eliminates the unmaintained `memmap` dependency.)

I haven't benchmarked this though.

My main motivation for this is firming up the provider API before adding
new byte store providers, for #11149. This also resolves some TODOs and
even eliminates some `unsafe`, yay!

The commits are individually reviewable.

Fixes #19049, fixes #14341 (`memmap` removed), closes #17234 (solves the
same problem but with an approach that wasn't possible at the time).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
category:internal CI, fixes for not-yet-released features, etc.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[internal, rust] memmap is unmaintained
2 participants