Add support for transferring compressed blobs via ByteStream

In many cases, it is desirable for blobs to be sent in compressed form to and from the cache. While gRPC supports channel-level compression, the generated bindings APIs require that implementers provide data in unserialized and uncompressed form. By allowing compressed data at the REAPI level instead, we can avoid re-compressing the same data on each request.

The ByteStream API stands to benefit the most from this, with the least amount of effort.

Thanks to Eric Burnett and Grzegorz Lukasik for helping with this.

Implements #147.
bazelbuild · Nov 24, 2020 · 50148f6
1 parent 1e9ccef
Showing 1 changed file with 102 additions and 28 deletions.
diff --git a/build/bazel/remote/execution/v2/remote_execution.proto b/build/bazel/remote/execution/v2/remote_execution.proto
@@ -199,31 +199,53 @@ service ActionCache {
 //
 // For small file uploads the client should group them together and call
 // [BatchUpdateBlobs][build.bazel.remote.execution.v2.ContentAddressableStorage.BatchUpdateBlobs].
+//
 // For large uploads, the client must use the
-// [Write method][google.bytestream.ByteStream.Write] of the ByteStream API. The
-// `resource_name` is `{instance_name}/uploads/{uuid}/blobs/{hash}/{size}`,
-// where `instance_name` is as described in the next paragraph, `uuid` is a
-// version 4 UUID generated by the client, and `hash` and `size` are the
-// [Digest][build.bazel.remote.execution.v2.Digest] of the blob. The
-// `uuid` is used only to avoid collisions when multiple clients try to upload
-// the same file (or the same client tries to upload the file multiple times at
-// once on different threads), so the client MAY reuse the `uuid` for uploading
-// different blobs. The `resource_name` may optionally have a trailing filename
-// (or other metadata) for a client to use if it is storing URLs, as in
-// `{instance}/uploads/{uuid}/blobs/{hash}/{size}/foo/bar/baz.cc`. Anything
-// after the `size` is ignored.
+// [Write method][google.bytestream.ByteStream.Write] of the ByteStream API.
+//
+// For uncompressed data, the `WriteRequest.resource_name` is of the following form:
+// `{instance_name}/uploads/{uuid}/blobs/{hash}/{size}{/optional_metadata}`
+//
+// Where:
+// * `instance_name` is an identifier, possibly containing multiple path
+//   segments, used to distinguish between the various instances on the server,
+//   in a manner defined by the server. If it is the empty path, the leading
+//   slash is omitted, so that the `resource_name` becomes
+//   `uploads/{uuid}/blobs/{hash}/{size}{/optional_metadata}`.
+//   To simplify parsing, a path segment cannot equal any of the following
+//   keywords: `blobs`, `uploads`, `actions`, `actionResults`, `operations`,
+//   `capabilities` or `compressed-blobs`.
+// * `uuid` is a version 4 UUID generated by the client, used to avoid
+//   collisions between concurrent uploads of the same data. Clients MAY
+//   reuse the same `uuid` for uploading different blobs.
+// * `hash` and `size` refer to the [Digest][build.bazel.remote.execution.v2.Digest]
+//   of the data being uploaded.
+// * `optional_metadata` is implementation-specific data, which clients MAY omit.
+//   Servers MAY ignore this metadata.
+//
+// Data can alternatively be uploaded in compressed form, with the following
+// `WriteRequest.resource_name` form:
+// `{instance_name}/uploads/{uuid}/compressed-blobs/{compressor}/{uncompressed_hash}/{uncompressed_size}{/optional_metadata}`
+//
+// Where:
+// * `instance_name`, `uuid` and `optional_metadata` are defined as above.
+// * `compressor` is a lowercase string form of a `Compressor.Value` enum
+//   other than `identity`, which is supported by the server and advertised in
+//   [CacheCapabilities.supported_compressor][build.bazel.remote.execution.v2.CacheCapabilities.supported_compressor].
+// * `uncompressed_hash` and `uncompressed_size` refer to the
+//   [Digest][build.bazel.remote.execution.v2.Digest] of the data being
+//   uploaded, once uncompressed. Servers MUST verify that these match
+//   the uploaded data once uncompressed, and MUST return an
+//   `INVALID_ARGUMENT` error in the case of mismatch.
+//
+// Note that when writing compressed blobs, the `WriteRequest.write_offset`
+// refers to the offset in the uncompressed form of the blob.
 //
-// A single server MAY support multiple instances of the execution system, each
-// with their own workers, storage, cache, etc. The exact relationship between
-// instances is up to the server. If the server does, then the `instance_name`
-// is an identifier, possibly containing multiple path segments, used to
-// distinguish between the various instances on the server, in a manner defined
-// by the server. For servers which do not support multiple instances, then the
-// `instance_name` is the empty path and the leading slash is omitted, so that
-// the `resource_name` becomes `uploads/{uuid}/blobs/{hash}/{size}`.
-// To simplify parsing, a path segment cannot equal any of the following
-// keywords: `blobs`, `uploads`, `actions`, `actionResults`, `operations` and
-// `capabilities`.
+// Uploads of the same data MAY occur concurrently in any form, compressed or
+// uncompressed.
+//
+// Clients SHOULD NOT use gRPC-level compression for ByteStream API `Write`
+// calls of compressed blobs, since this would compress already-compressed data.
 //
 // When attempting an upload, if another client has already completed the upload
 // (which may occur in the middle of a single upload if another client uploads
@@ -235,11 +257,43 @@ service ActionCache {
 // `INVALID_ARGUMENT` error will be returned. In either case, the client should
 // not attempt to retry the upload.
 //
-// For downloading blobs, the client must use the
-// [Read method][google.bytestream.ByteStream.Read] of the ByteStream API, with
-// a `resource_name` of `"{instance_name}/blobs/{hash}/{size}"`, where
-// `instance_name` is the instance name (see above), and `hash` and `size` are
-// the [Digest][build.bazel.remote.execution.v2.Digest] of the blob.
+// Small downloads can be grouped and requested in a batch via
+// [BatchReadBlobs][build.bazel.remote.execution.v2.ContentAddressableStorage.BatchReadBlobs].
+//
+// For large downloads, the client must use the
+// [Read method][google.bytestream.ByteStream.Read] of the ByteStream API.
+//
+// For uncompressed data, the `ReadRequest.resource_name` is of the following form:
+// `{instance_name}/blobs/{hash}/{size}`
+// Where `instance_name`, `hash` and `size` are defined as for uploads.
+//
+// Data can alternatively be downloaded in compressed form, with the following
+// `ReadRequest.resource_name` form:
+// `{instance_name}/compressed-blobs/{compressor}/{uncompressed_hash}/{uncompressed_size}`
+//
+// Where:
+// * `instance_name` and `compressor` are defined as for uploads.
+// * `uncompressed_hash` and `uncompressed_size` refer to the
+//   [Digest][build.bazel.remote.execution.v2.Digest] of the data being
+//   downloaded, once uncompressed. Clients MUST verify that these match
+//   the downloaded data once uncompressed, and take appropriate steps in
+//   the case of failure such as retrying a limited number of times or
+//   surfacing an error to the user.
+//
+// Note that when reading compressed blobs, the `ReadRequest.read_offset`
+// refers to the offset in the uncompressed form of the blob.
+//
+// Servers MAY use any compression level they choose, including different
+// levels for different blobs (e.g. choosing a level designed for maximum
+// speed for data known to be incompressible).
+//
+// Servers MUST be able to provide data for all recently advertised blobs in
+// each of the compression formats that the server supports, as well as in
+// uncompressed form.
+//
+// Clients SHOULD NOT use gRPC-level compression on ByteStream API `Read`
+// requests for compressed blobs, since this would compress already-compressed
+// data.
 //
 // The lifetime of entries in the CAS is implementation specific, but it SHOULD
 // be long enough to allow for newly-added and recently looked-up entries to be
@@ -1616,6 +1670,18 @@ message SymlinkAbsolutePathStrategy {
   }
 }
 
+// Compression formats which may be supported.
+message Compressor {
+  enum Value {
+    // No compression. Servers and clients MUST always support this, and do
+    // not need to advertise it.
+    IDENTITY = 0;
+
+    // Zstandard compression.
+    ZSTD = 1;
+  }
+}
+
 // Capabilities of the remote cache system.
 message CacheCapabilities {
   // All the digest functions supported by the remote cache.
@@ -1636,6 +1702,14 @@ message CacheCapabilities {
 
   // Whether absolute symlink targets are supported.
   SymlinkAbsolutePathStrategy.Value symlink_absolute_path_strategy = 5;
+
+  // Compressors supported by the "compressed-blobs" bytestream resources.
+  // Servers MUST support identity/no-compression, even if it is not listed
+  // here.
+  //
+  // Note that this does not imply which if any compressors are supported by
+  // the server at the gRPC level.
+  repeated Compressor.Value supported_compressor = 6;
 }
 
 // Capabilities of the remote execution system.
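
The resource name formats introduced in the comments above can be illustrated with a minimal Python sketch. It is not part of this commit or of any REAPI client library: the function names (`upload_resource_name`, `download_resource_name`) and the `upload_id` parameter are hypothetical, and the sketch assumes the caller has already checked that the chosen compressor is advertised in `CacheCapabilities.supported_compressor`. Optional trailing metadata is omitted for brevity.

```python
import uuid

# Keywords that, per the comments in this diff, may not appear as a path
# segment of instance_name.
RESERVED_SEGMENTS = {
    "blobs", "uploads", "actions", "actionResults",
    "operations", "capabilities", "compressed-blobs",
}


def _instance_prefix(instance_name):
    """Return 'instance_name/' or '' when the instance name is empty."""
    for segment in instance_name.split("/"):
        if segment in RESERVED_SEGMENTS:
            raise ValueError(f"reserved path segment in instance_name: {segment!r}")
    return instance_name + "/" if instance_name else ""


def _blob_path(digest_hash, size, compressor):
    """Build the blobs/... or compressed-blobs/... part of a resource name.

    digest_hash and size always describe the uncompressed data.
    compressor is a lowercase Compressor.Value name such as "zstd",
    or None for an uncompressed transfer.
    """
    if compressor is None:
        return f"blobs/{digest_hash}/{size}"
    if compressor == "identity":
        raise ValueError("identity is not valid for compressed-blobs resources")
    return f"compressed-blobs/{compressor}/{digest_hash}/{size}"


def upload_resource_name(instance_name, digest_hash, size,
                         compressor=None, upload_id=None):
    """Resource name for ByteStream.Write (hypothetical helper)."""
    # A fresh version 4 UUID avoids collisions between concurrent uploads.
    upload_id = upload_id or str(uuid.uuid4())
    return (f"{_instance_prefix(instance_name)}uploads/{upload_id}/"
            f"{_blob_path(digest_hash, size, compressor)}")


def download_resource_name(instance_name, digest_hash, size, compressor=None):
    """Resource name for ByteStream.Read (hypothetical helper)."""
    return f"{_instance_prefix(instance_name)}{_blob_path(digest_hash, size, compressor)}"
```

For example, `download_resource_name("main", "abc", 3, "zstd")` would yield `main/compressed-blobs/zstd/abc/3`, while an empty instance name drops the leading slash, giving `compressed-blobs/zstd/abc/3`.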