
feature request: blob level compression #147

Closed
mostynb opened this issue Jul 13, 2020 · 24 comments
@mostynb
Contributor

mostynb commented Jul 13, 2020

In many cases, it is desirable for blobs to be sent in compressed form to and from the cache. While gRPC supports channel-level compression, the generated bindings APIs for the languages I checked require implementors to provide data in unserialized and uncompressed form.

This has two (related) practical downsides for REAPI caches:

  1. Cache servers who want to support compression will end up recompressing the same data many times.
  2. Cache servers who choose to store blobs in compressed form will also end up re-decompressing the same data many times.

Contrast this with Bazel's more primitive HTTP remote cache protocol, where it is trivial for cache servers to store and serve compressed blobs by inspecting the request's Accept-Encoding header.

I think it might be worth investigating adding optional blob level compression to REAPIv3, to avoid these downsides.

This might involve:

  1. Adding a new optional metadata field to the byte stream API's resource name to specify the compression format for uploads
  2. Adding a new optional metadata field to the byte stream API's resource name to specify a set of acceptable compression formats for downloads (see the resource-name sketch after this list).
  3. Updating the Capabilities service for servers to announce supported compression formats.
  4. Updating the ActionCache and ContentAddressableStorage services to allow clients to selectively upload inlined blobs in particular compression formats, and to request that inlined blobs be returned in those formats.
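
To make the resource-name ideas in points 1 and 2 concrete, here is a rough sketch of how the extra metadata could be carried as additional path segments. The segment names ("compressed-blobs", the compressor id) and the overall layout are purely illustrative, not a settled spec:

```go
package main

import "fmt"

// Hypothetical resource-name layout for compressed transfers. The digest
// still refers to the *uncompressed* payload, so addressing and existence
// checks are unchanged.
func uploadResourceName(instance, uuid, compressor, hash string, sizeBytes int64) string {
	// e.g. "my-instance/uploads/<uuid>/compressed-blobs/zstd/<hash>/<uncompressed size>"
	return fmt.Sprintf("%s/uploads/%s/compressed-blobs/%s/%s/%d",
		instance, uuid, compressor, hash, sizeBytes)
}

func downloadResourceName(instance, compressor, hash string, sizeBytes int64) string {
	return fmt.Sprintf("%s/compressed-blobs/%s/%s/%d",
		instance, compressor, hash, sizeBytes)
}

func main() {
	fmt.Println(downloadResourceName("my-instance", "zstd", "0a1b2c...", 4096))
}
```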

If we decide to implement encryption and authenticity checks (#133), then we may need to consider these two features at the same time during the design phase.

@edbaunton
Collaborator

edbaunton commented Jul 13, 2020

Huge +1 from me. I was talking about this just today! We want to compress data in our in-memory CAS implementation. We have been experimenting with using compression at the gRPC layer but that is, as you said, completely undone when you put it in the CAS.

My immediate thought was to model these the same way we model the different digesting algorithms: via capabilities. As I understand things today, you cannot switch between digesting algorithms without restarting everything from scratch (including the CAS, although that can support multiple digests simultaneously per capabilities), so, if we want to support it in v2, we could just add a new capability called Compression. That could be dictated by the server just like digesting function is today. Apart from the inelegance, what would the downsides be there that I have missed?

@EricBurnett
Collaborator

Convenient timing - we've just been working through trying to leverage gRPC transport-level compression ourselves, and finding it very difficult to capitalize on in practice, to the point we're giving up on it for now. So +1 to exploring API-level compression instead.

What we've gotten to in our exploration thus far is:

  • The ideal design (from our perspective) keeps blobs addressed by the digest of their uncompressed payload, even when transported with compression. (I think that's consistent with Mostyn's suggestion, and not Ed's?) This stems from a few motivations:
    • Not having to pay the cost of compression when making existence checks. (Right now, digest the file or get the digest from some other layer (e.g. fuse). Undesired: have to compress and then digest the file, as even the fastest compression is slower than SHA-256 today, and SHA-256 has optimized instructions already landing in the most recent hardware.)
    • Keeping interop with clients who don't understand compression.
    • Not wanting the CAS to be strongly coupled with one exact compression format (what we use for storage is not exactly what we want to use for transport, and I could imagine wanting to support other compression algorithms later as they're invented).
  • zstd-1 is fast and cheap enough to compress and decompress data on the fly. Much better than all the other options we evaluated; gzip in particular is materially worse.
  • There are a few wrinkles w.r.t. ByteStream, in particular related to offset reads and writes. (If you don't support offsets, ignore this.) I generally think offset and read_limit should represent offsets into the uncompressed bytes, which means you have to be willing to recompress on the fly. Otherwise offsets are only usable for resuming broken reads/writes with a continuation of the exact byte sequence, but not for any other use-cases (pagination, appending chunks to a log, etc); read_limit is semantically nonsense (truncated compressed stream?) unless referring to the uncompressed bytes. (Some of this doesn't apply to the CAS as usually used, but e.g. we use ByteStream for log streams (see other bugs/threads), and I'd like consistent semantics across these.)
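
As a concrete illustration of "offsets refer to uncompressed bytes", here is a minimal server-side sketch that skips the requested uncompressed offset and recompresses the remainder on the fly. It assumes the github.com/klauspost/compress/zstd package and elides all of the actual ByteStream plumbing:

```go
package cas

import (
	"io"

	"github.com/klauspost/compress/zstd"
)

// streamCompressedRead writes a zstd-compressed copy of the requested range of
// blob to w, where offset and limit are measured in *uncompressed* bytes.
// A limit of 0 means "read to the end", mirroring ByteStream's read_limit.
func streamCompressedRead(w io.Writer, blob io.Reader, offset, limit int64) error {
	// Skip the uncompressed prefix. A real server would seek in its backing
	// store rather than discard bytes.
	if offset > 0 {
		if _, err := io.CopyN(io.Discard, blob, offset); err != nil {
			return err
		}
	}
	if limit > 0 {
		blob = io.LimitReader(blob, limit)
	}
	enc, err := zstd.NewWriter(w) // pick a fast level (e.g. SpeedFastest) in practice
	if err != nil {
		return err
	}
	if _, err := io.Copy(enc, blob); err != nil {
		enc.Close()
		return err
	}
	return enc.Close()
}
```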

Mostyn, to your ideas specifically:

  1. Agreed
  2. Allowing multiple optional compression algorithms for response has a wrinkle that's messy - how does the server communicate which compression the response is compressed with for a ByteStream response? And it's mostly obviated by #3, where you can learn the supported protocols directly. So I'd prefer to keep this one simple and just say "specify the compression format for downloads", with a format the server has declared support for in Capabilities.
  3. Agreed.
  4. Note that this one will have diminishing returns - ByteStream will very likely remain the majority of bytes moved and so represents the majority of potential impact, and for small blobs there's a threshold below which it's not actually worth compressing the bytes anyways (takes too long vs the bytes saved). I suspect for ContentAddressableStorage it might be worth compressing the blobs as a group but not individually, or compressing some blobs but not others (and yes, doing so on the fly either way); for ActionCache probably not worth it at all for inlined blobs? But I'd actually suggest we start with neither, for simplicity, and just focus on ByteStream initially.

@mostynb
Contributor Author

mostynb commented Jul 13, 2020

My immediate thought was to model these the same way we model the different digesting algorithms: via capabilities. As I understand things today, you cannot switch between digesting algorithms without restarting everything from scratch (including the CAS, although that can support multiple digests simultaneously per capabilities), so, if we want to support it in v2, we could just add a new capability called Compression. That could be dictated by the server just like digesting function is today. Apart from the inelegance, what would the downsides be there that I have missed?

One problem with the Capabilities API is that it only describes the server side - the server can't use it to check the capabilities of the client. The client would need to specify its preferences/capabilities separately, eg in the resource_name for the byte stream API, or maybe via new fields in the other protos.

(This isn't a problem for digests because they have mostly non-overlapping key sizes, but I would prefer a nicer fix for this in v3 too, see eg #136.)

@mostynb
Contributor Author

mostynb commented Jul 13, 2020

Convenient timing - we've just been working through trying to leverage gRPC transport-level compression ourselves, and finding it very difficult to capitalize on in practice, to the point we're giving up on it for now. So +1 to exploring API-level compression instead.

What we've gotten to in our exploration thus far is:

  • The ideal design (from our perspective) keeps blobs addressed by the digest of their uncompressed payload, even when transported with compression. (I think that's consistent with Mostyn's suggestion, and not Ed's?) This stems from a few motivations:

+1, I think we should always offer uncompressed blobs, since the cache is oblivious to what is being stored. Only clients have the information about which blobs are likely to benefit from compression, so IMO it should be opt-in from the client side.

  • There are a few wrinkles w.r.t. ByteStream, in particular related to offset reads and writes. (If you don't support offsets, ignore this.) I generally think offset and read_limit should represent offsets into the uncompressed bytes, which means you have to be willing to recompress on the fly. Otherwise offsets are only usable for resuming broken reads/writes with a continuation of the exact byte sequence, but not for any other use-cases (pagination, appending chunks to a log, etc); read_limit is semantically nonsense (truncated compressed stream?) unless referring to the uncompressed bytes. (Some of this doesn't apply to the CAS as usually used, but e.g. we use ByteStream for log streams (see other bugs/threads), and I'd like consistent semantics across these.)

I'm also interested in exploring other potential solutions, eg: introduce a new Blob proto which contains a list of one or more Digests with corresponding encodings. That might make seekable compressed byte streams easy, since we could address compressed blobs directly.

Mostyn, to your ideas specifically:

  1. Agreed
  2. Allowing multiple optional compression algorithms for response has a wrinkle that's messy - how does the server communicate which compression the response is compressed with for a ByteStream response? And it's mostly obviated by #3, where you can learn the supported protocols directly. So I'd prefer to keep this one simple and just say "specify the compression format for downloads", with a format the server has declared support for in Capabilities.

Hmm, right. It would be good to avoid needing to look for magic bytes in the response.

  3. Agreed.
  4. Note that this one will have diminishing returns - ByteStream will very likely remain the majority of bytes moved and so represents the majority of potential impact, and for small blobs there's a threshold below which it's not actually worth compressing the bytes anyways (takes too long vs the bytes saved). I suspect for ContentAddressableStorage it might be worth compressing the blobs as a group but not individually, or compressing some blobs but not others (and yes, doing so on the fly either way); for ActionCache probably not worth it at all for inlined blobs? But I'd actually suggest we start with neither, for simplicity, and just focus on ByteStream initially.

+1 to focusing on ByteStream first, and worrying about the other services later.

@erikmav
Contributor

erikmav commented Jul 15, 2020

A note for implementers: With this change you're likely to need to keep a map from:

(uncompressed digest) -> list-of (compressed digest reference in service CAS, compression algo)

The reason being, you're likely sooner or later to want to cache the results of re-compressing files into the algorithm(s) that (a) provide the best compression ratio, and (b) appear most often in the accepted content-format lists from clients.

Kind of like streaming online video to both a 4K TV and a low-res phone: the back-end might dynamically recode for the small screen on the first call to put fewer bits on the wire, but keep the result (or kick off a parallel job to create one) and store a low-res version of the high-res content.
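
A minimal sketch of that bookkeeping (type and field names here are hypothetical): the key is the uncompressed digest that clients address, and the values describe whichever compressed copies the server has materialized so far:

```go
package cas

// Digest mirrors the REAPI Digest message: hash plus uncompressed size.
type Digest struct {
	Hash      string
	SizeBytes int64
}

// storedVariant records one compressed copy of a blob held in backing storage.
type storedVariant struct {
	Compressor     string // e.g. "zstd", "identity"
	StorageKey     string // where the compressed bytes actually live
	CompressedSize int64
}

// variants maps the client-visible (uncompressed) digest to every compressed
// form the server has cached, so it can pick the best match for a request or
// recompress lazily and append a new entry.
var variants = map[Digest][]storedVariant{}
```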

@johnterickson

johnterickson commented Jul 15, 2020

Hi! @Erikma pointed me to this thread.

I'm the one responsible for the weird VSO0 hash. The reason the VSO0 hash is weird is that S3/Azure Blob support the concept of uploading large objects in blocks. The VSO0 hash allows the server to verify the hash of each of these pieces independently. I.e., if you have a 10GB file, the server never needs to stream the whole thing to a temporary location while hashing it and then copy it to its final location only once the hash is verified.

This gets tricky with compression. Some options:

  1. Compress the whole file first, then upload blocks of the compressed file. As @EricBurnett points out, this means you have to compress to do an existence check - yuck.
  2. Compress each block individually. This could be ok, but it means the downloader would need to understand that the content is not an uncompressed stream, nor a single compressed stream, but a concatenation of multiple compressed streams (see the gzip sketch after this list).
  3. Split files into pieces and store the pieces in the CAS. Each piece is small enough that it can be hash-validated in memory. This is what we do for Pipeline Artifacts.
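
On option 2: both gzip and (typically) zstd define their formats so that concatenated members/frames decode to the concatenation of their payloads, so a stock decoder can usually cope; the receiver only needs extra framing if it wants to validate per-block hashes. A quick illustration with Go's standard gzip, whose reader handles multistream input by default (errors elided for brevity):

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
)

// gzipBlock compresses one block as an independent gzip member.
func gzipBlock(p []byte) []byte {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	zw.Write(p)
	zw.Close()
	return buf.Bytes()
}

func main() {
	// Compress two "blocks" independently and concatenate the results.
	joined := append(gzipBlock([]byte("hello ")), gzipBlock([]byte("world"))...)

	// A multistream-aware reader (the default) yields the original payload.
	zr, _ := gzip.NewReader(bytes.NewReader(joined))
	out, _ := io.ReadAll(zr)
	fmt.Println(string(out)) // "hello world"
}
```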

To be clear, I'm not saying "don't do compression because of VSO0" 😊 - just that it wouldn't make sense for that particular hashing/upload mechanism.

@erikmav
Contributor

erikmav commented Jul 15, 2020

(VSO0 == VsoHash in the digest algorithm list in the RE protobuf)

@EricBurnett
Collaborator

EricBurnett commented Jul 15, 2020 via email

@erikmav
Contributor

erikmav commented Jul 15, 2020

For single-file parallelism: We should consider a ByteStream change that specifies start offset and size within an uploading file. I have a note in our implementation of ByteStream to go experiment with custom headers for that, as we're often uploading one large file, and overlapping chunks already work well for saturating the pipe for downloads (which do support reading from offsets).

@mostynb
Contributor Author

mostynb commented Jul 15, 2020

A note for implementers: With this change you're likely to need to keep a map from:

(uncompressed digest) -> list-of (compressed digest reference in service CAS, compression algo)

If cache implementations were to expose that to clients somehow, then compressed blobs would be directly addressable and we would not need any changes to the download part of the bytestream API, and I suspect VSO0 would not pose any trouble. If so, then I wonder if we can find a reasonable way to do this in v2?

A new rpc in the ContentAddressableStorage service could do this at the cost of an extra roundtrip. Otherwise we could add new fields to existing messages like OutputFile.

@johnterickson

Assuming you can rename a file in the destination system

Unfortunately neither Azure Blob nor S3 offers object rename 😕

They do offer partial uploads that handle concurrency well: "When you upload a block to a blob in your storage account, it is associated with the specified block blob, but it does not become part of the blob until you commit a list of blocks that includes the new block's ID"

https://docs.microsoft.com/en-us/rest/api/storageservices/understanding-block-blobs--append-blobs--and-page-blobs

@EricBurnett
Collaborator

EricBurnett commented Jul 16, 2020 via email

@johnterickson

multi-part uploads for a blob, upload bytes as they come hashing along the way, and then when complete and valid one writer would 'commit' the blob making it visible.

@EricBurnett This is exactly what my service (https://azure.microsoft.com/en-us/services/devops/artifacts/) does 👍

independent uploaders uploading independent chunks of the same file would be tricky, as the server couldn't validate anything about them till all chunks were seen, and it might have seen multiple different candidates for any given byte range

This works in our service as Azure Blob lets you name each of the chunks - so we name each chunk by the hash of the chunk. The overall flow (a rough code sketch follows the list):

  1. Client breaks file into 2MB blocks and hashes each block.
  2. Client takes a hash of the block hashes - this is the content id.
  3. Client tells server "please add a reference to this content id". If the service says "yes" then it's done!
  4. Otherwise, for each block:
    a. The client does a parallel POST of the block to /[content id]/[hash of block].
    b. The server validates that [hash of block] from the URL matches the hash of the body and then writes the block with name [hash of block] to Azure Storage to the blob with the name [content id].
    c. Concurrent uploads from multiple clients can safely race because the server is always verifying the content of each block, and thus a block of a given name (its hash) will - at worst - be overwritten with the same content.
  5. The client "seals" the blob by doing a PUT of the list of block hashes to /[content id]. The server validates that the hash of the block hashes in the body equals [content id]. If so, it seals the Azure Blob with that list of blocks. Again, this is safe to race on because everyone is performing the same operation.
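
A rough sketch of the content-id construction described above, using SHA-256 over fixed 2 MiB blocks purely for illustration - this is not the actual VsoHash definition, just the shape of the scheme:

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

const blockSize = 2 << 20 // 2 MiB blocks, as in the flow above

// contentID hashes each fixed-size block, then hashes the concatenated block
// hashes. Any party holding the block list can recompute and verify the id.
func contentID(data []byte) (blockHashes [][]byte, id [32]byte) {
	for off := 0; off < len(data); off += blockSize {
		end := off + blockSize
		if end > len(data) {
			end = len(data)
		}
		h := sha256.Sum256(data[off:end])
		blockHashes = append(blockHashes, h[:])
	}
	all := []byte{}
	for _, h := range blockHashes {
		all = append(all, h...)
	}
	return blockHashes, sha256.Sum256(all)
}

func main() {
	blocks, id := contentID(make([]byte, 5<<20)) // a 5 MiB example payload
	fmt.Printf("%d blocks, content id %x\n", len(blocks), id)
}
```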

@mostynb
Contributor Author

mostynb commented Aug 11, 2020

Here's an initial proposal, which makes it possible to address blobs by their compressed digests, so read/write offsets still work:
https://docs.google.com/document/d/1diqigT1gJyIkiVjvayvCU2GDbJq3nvDZFEfO7EZPPwQ/edit?usp=sharing

mostynb changed the title from "V3 idea: blob level compression" to "feature request: blob level compression" on Aug 11, 2020
@EricBurnett
Collaborator

Thanks Mostyn! I'm quite happy with how the proposal is shaping up, personally - does anyone else have comments on Mostyn's current proposal? (Note that it has been materially revised since first linked 2w ago).

I'll note the current proposal does allow offset reads still (a request of @Erikma ), but otherwise doesn't specifically consider block algorithms, and doesn't do anything specific around block uploads or concurrent partial uploads (as @johnterickson floated). I'm personally fine with this - it's consistent with the existing upload API, and (as I noted above) my suggested approach to block uploads would be to upload independent chunks as independent blobs and provide an additional API to stitch them. In that sense I think chunk uploads are out of scope for this Compression discussion, except to the extent anyone feels this proposal precludes features they want to propose.

@peterebden
Contributor

I think the proposal sounds good in regard to integration with the bytestream APIs, and I like the idea overall and think it can be useful.

I'm a little concerned about the interaction with other CAS RPCs which would presumably require the server to decompress before serving them (although presumably this is the same for a client that requests a blob via the uncompressed bytestream paths, or a compressed one that isn't actually stored compressed). In general (as I noted on the doc) it gives the decision-making power to the client, but the server has better info to make that decision.

I also have a bit of a concern on focusing just on bytestream; our artifacts consist mostly of Go objects (which are nearly all compressible and mostly small enough to be batched) or Python and Java (which are often incompressible and large enough to require streaming). I worry a bit that we'd not get a lot of benefit given that profile. Maybe others are in a different situation though?

@EricBurnett
Collaborator

Thanks for reviewing Peter.

it gives the decision-making power to the client, but the server has better info to make that decision.

I hope we can distill the decision making to roughly "prefer compressed", making that a no-op. E.g. if the server is going to store data internally compressed anyways, it's a choice between "server decompresses and then sends" vs "server sends, client decompresses", where the latter is ~always better. Mostyn and I also discussed incompressible data, but we felt it was sufficient to use a faster compressor (e.g. one of the zstd negative levels) and/or a framing-only "compression" (e.g. gzip level 0 if you support gzip, which just wraps the uncompressed data in gzip headers) rather than to try to push knowledge of compressibility into the client.
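
To illustrate the framing-only fallback for incompressible data: gzip level 0 only wraps the bytes in gzip framing (stored deflate blocks), so the client can always answer in the negotiated format at negligible cost. A sketch with Go's standard library; a zstd encoder at a very fast level would play the same role:

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"log"
)

// frameOnly wraps already-incompressible bytes in gzip framing without
// attempting real compression, so the transfer still "speaks" the agreed format.
func frameOnly(p []byte) []byte {
	var buf bytes.Buffer
	zw, err := gzip.NewWriterLevel(&buf, gzip.NoCompression) // level 0: stored blocks
	if err != nil {
		log.Fatal(err)
	}
	zw.Write(p)
	zw.Close()
	return buf.Bytes()
}

func main() {
	payload := make([]byte, 1<<20) // pretend this is an already-compressed artifact
	framed := frameOnly(payload)
	fmt.Printf("payload %d bytes, framed %d bytes (only framing overhead)\n",
		len(payload), len(framed))
}
```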

I also have a bit of a concern on focusing just on bytestream

That's a fair point - if your use-case is biased more towards smaller batch reads than bytestream reads, you won't benefit as much right away. We're definitely not closing the door to adding a compressed version of the batch APIs, simply not starting there. Could I ask you to gather stats on how many bytes you move by ByteStream vs batch? I'll be interested to know how much of a concern it is for you in practice. On my server it's ~14:1 for reads, but you may have materially different numbers.

@peterebden
Contributor

Sure, I can have a look at that.

And good point on the fast or framing-only compression levels, that makes sense. Thanks.

@mostynb
Contributor Author

mostynb commented Aug 25, 2020

In general (as I noted on the doc) it gives the decision-making power to the client, but the server has better info to make that decision.

IMO the client has more context to decide, eg:

  • Action/rule type.
  • Filename/extension.
  • Local configuration, such as marking particular actions or files as "do not compress".
  • (For uploads) the uncompressed data itself, plus the compressed data (if the client is willing to compress before sending).

Whereas cache servers have, at most, the compressed and uncompressed data sizes.

I also have a bit of a concern on focusing just on bytestream

That's a fair point - if your use-case is biased more towards smaller batch reads than bytestream reads, you won't benefit as much right away. We're definitely not closing the door to adding a compressed version of the batch APIs, simply not starting there.

If it looks like it's worth the effort, we could add a new compression_type field to the OutputFile message which refers to the contents field, and an acceptable_compression field to GetActionResultRequest.

santigl pushed a commit to santigl/remote-apis that referenced this issue Aug 26, 2020
@peterebden
Contributor

OK, so I have a couple of days' worth of metrics now, and it's approximately 3.5x in favour of streams for reads (more for writes). I don't have a lot of insight into compressibility of them, but that definitely supports the idea that streams are indeed more valuable, as you say.

Thanks for the discussion - this SGTM then!

@EricBurnett
Collaborator

Great, thanks for the data Peter!

I don't have a lot of insight into compressibility of them

I don't know what they'll be for you, but on my side I observe compression ratios around 0.4 for writes and 0.333 for reads. If that held ballpark true for your data, and if your streams and batch reads were equally compressible, you'd expect an ideal reduction of about 67% bytes read, and an effective reduction of about 52% with streams alone.

That suggests it's definitely worth starting with streams to get the most impact with the least effort, but that you'll probably still desire to follow on with batch APIs too, as you'd expect an incremental 30% reduction going from streams-only compression to streams-and-batch compression.
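
For reference, the arithmetic behind those figures, assuming Peter's ~3.5:1 stream-to-batch split for reads and a ~0.333 read compression ratio:

```latex
% Split of read bytes between streams and batch (3.5 : 1)
f_s = \tfrac{3.5}{4.5} \approx 0.78, \qquad f_b = \tfrac{1}{4.5} \approx 0.22

% Ideal: compress everything at ratio r = 0.333
1 - r \approx 0.67

% Streams only: batch bytes stay uncompressed
1 - (f_s \cdot r + f_b) \approx 0.52

% Incremental step from streams-only to streams-and-batch,
% relative to the streams-only byte count
\frac{0.48 - r}{0.48} \approx 0.30
```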

@bergsieker
Collaborator

It sounds like we're generally in agreement on Mostyn's proposal as written. @mostynb, can you turn your doc into a PR?

@mostynb
Contributor Author

mostynb commented Sep 3, 2020

Will do.

mostynb added a commit to mostynb/remote-apis that referenced this issue Sep 3, 2020
mostynb added a commit to mostynb/remote-apis that referenced this issue Sep 3, 2020
mostynb added a commit to mostynb/remote-apis that referenced this issue Sep 4, 2020
peterebden pushed a commit to peterebden/remote-apis that referenced this issue Sep 17, 2020
mostynb added a commit to mostynb/remote-apis that referenced this issue Nov 24, 2020
mostynb added a commit to mostynb/remote-apis that referenced this issue Mar 2, 2021
bergsieker pushed a commit that referenced this issue Mar 9, 2021
In many cases, it is desirable for blobs to be sent in compressed form
to and from the cache. While gRPC supports channel-level compression,
the generated bindings APIs require that implementers provide data in
unserialized and uncompressed form.

By allowing compressed data at the REAPI level instead, we can avoid
re-compressing the same data on each request. The ByteStream API stands
to benefit the most from this, with the least amount of effort.

Thanks to Eric Burnett and Grzegorz Lukasik for helping with this.

Implements #147.
@mostynb
Contributor Author

mostynb commented Jul 12, 2021

This feature was added some time ago now.
