Add support for chunking of blobs, using a variant of BLAKE3
Buildbarn has invested heavily in virtual file systems. Both on the worker and the client side it is possible to lazily fault in data from the CAS. Because Buildbarn performs checksum verification where needed, random access to large files may be slow: verifying a checksum requires fetching and hashing the object in its entirety.

To address this, this change adds support for composing and decomposing CAS objects through the newly added ConcatenateBlobs() and SplitBlobs() operations. Implemented naively (e.g., on top of SHA-256), these operations would not be verifiable: given only the checksums of the smaller objects, there is no way to compute the checksum of their concatenation. This is why this change also adds a new digest function that closely resembles BLAKE3. BLAKE3 is based on a binary Merkle tree, which makes it possible to efficiently concatenate and split objects at 2^k byte boundaries (where k >= 10; BLAKE3's chunk size is 2^10 = 1024 bytes).

With these operations in place, there is no longer a real need for the ByteStream protocol. Writes can be performed by uploading smaller parts through BatchUpdateBlobs(), followed by a call to ConcatenateBlobs(). Conversely, large objects can be read by calling SplitBlobs() and downloading the individual parts through BatchReadBlobs(). Integrity is never compromised, as callers of SplitBlobs() can validate the resulting tree nodes against the original digests.

One feature of BLAKE3 is that its hashes are variable length. While this is generally nice to have (it allows users to make size/security trade-offs), we do not want to make use of it here: the first 256 bits of output are identical to the chaining value, which we need for concatenation and splitting. Requiring the use of 256-bit hashes is problematic in its own right, as SHA-256 hashes have the same length, meaning the digest function cannot be derived from the hash length. This has already become an issue for MD5 vs. MURMUR3. To solve that, we extend all operations that work with digests to take a digest function explicitly. For compatibility, we allow it to be set to UNKNOWN for all preexisting digest functions.
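The verifiability claim hinges on how BLAKE3 builds its Merkle tree: the chaining value of a parent node is derived from nothing but the chaining values of its two children. Since this change only describes its digest function as closely resembling BLAKE3, the exact construction (in particular, how the ROOT flag is handled so that a stored hash can double as a chaining value) is an assumption here; the sketch below shows the standard BLAKE3 parent-node computation.

```go
package blake3tree

import (
	"encoding/binary"
	"math/bits"
)

// iv is the standard BLAKE3 initialization vector (shared with SHA-256).
var iv = [8]uint32{
	0x6A09E667, 0xBB67AE85, 0x3C6EF372, 0xA54FF53A,
	0x510E527F, 0x9B05688C, 0x1F83D9AB, 0x5BE0CD19,
}

// flagParent is BLAKE3's domain separation flag for parent nodes.
const flagParent = 4

// g is the BLAKE3 quarter-round mixing function.
func g(v *[16]uint32, a, b, c, d int, mx, my uint32) {
	v[a] = v[a] + v[b] + mx
	v[d] = bits.RotateLeft32(v[d]^v[a], -16)
	v[c] = v[c] + v[d]
	v[b] = bits.RotateLeft32(v[b]^v[c], -12)
	v[a] = v[a] + v[b] + my
	v[d] = bits.RotateLeft32(v[d]^v[a], -8)
	v[c] = v[c] + v[d]
	v[b] = bits.RotateLeft32(v[b]^v[c], -7)
}

// compress runs the 7-round BLAKE3 compression function and returns the
// truncated 256-bit chaining value.
func compress(cv [8]uint32, block [16]uint32, counter uint64, blockLen, flags uint32) [8]uint32 {
	v := [16]uint32{
		cv[0], cv[1], cv[2], cv[3], cv[4], cv[5], cv[6], cv[7],
		iv[0], iv[1], iv[2], iv[3],
		uint32(counter), uint32(counter >> 32), blockLen, flags,
	}
	m := block
	perm := [16]int{2, 6, 3, 10, 7, 0, 4, 13, 1, 11, 12, 5, 9, 14, 15, 8}
	for round := 0; round < 7; round++ {
		g(&v, 0, 4, 8, 12, m[0], m[1])
		g(&v, 1, 5, 9, 13, m[2], m[3])
		g(&v, 2, 6, 10, 14, m[4], m[5])
		g(&v, 3, 7, 11, 15, m[6], m[7])
		g(&v, 0, 5, 10, 15, m[8], m[9])
		g(&v, 1, 6, 11, 12, m[10], m[11])
		g(&v, 2, 7, 8, 13, m[12], m[13])
		g(&v, 3, 4, 9, 14, m[14], m[15])
		// Permute the message words for the next round.
		var next [16]uint32
		for i, p := range perm {
			next[i] = m[p]
		}
		m = next
	}
	var out [8]uint32
	for i := 0; i < 8; i++ {
		out[i] = v[i] ^ v[i+8]
	}
	return out
}

// ParentChainingValue combines the 32-byte chaining values of a left and a
// right subtree into the chaining value of their parent node. This is what
// makes ConcatenateBlobs() verifiable: the digest of a concatenated object
// can be recomputed from the digests of its parts alone, without access to
// the underlying data.
func ParentChainingValue(left, right [32]byte) [32]byte {
	var block [16]uint32
	for i := 0; i < 8; i++ {
		block[i] = binary.LittleEndian.Uint32(left[4*i:])
		block[8+i] = binary.LittleEndian.Uint32(right[4*i:])
	}
	cv := compress(iv, block, 0, 64, flagParent)
	var out [32]byte
	for i, w := range cv {
		binary.LittleEndian.PutUint32(out[4*i:], w)
	}
	return out
}
```

SplitBlobs() verification is the inverse: given the chaining values of the two subtrees of an object, a caller can recombine them with ParentChainingValue() and check that the result matches the digest that was originally requested.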
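As a concrete illustration of the write path, here is a minimal client-side sketch. BatchUpdateBlobs() and its message shapes come from the existing REAPI; the ConcatenateBlobs() request and response types and their field names are hypothetical stand-ins for whatever this change's .proto files actually define.

```go
package example

import (
	"context"

	remoteexecution "github.com/bazelbuild/remote-apis/build/bazel/remote/execution/v2"
)

// ConcatenateBlobsRequest and ConcatenateBlobsResponse are hypothetical
// stand-ins for the messages added by this change.
type ConcatenateBlobsRequest struct {
	InstanceName  string
	SourceDigests []*remoteexecution.Digest
	TargetDigest  *remoteexecution.Digest
}
type ConcatenateBlobsResponse struct{}

// casClient is the subset of the extended CAS service used below.
// BatchUpdateBlobs() is part of the existing REAPI; ConcatenateBlobs() is
// the operation introduced by this change.
type casClient interface {
	BatchUpdateBlobs(ctx context.Context, in *remoteexecution.BatchUpdateBlobsRequest) (*remoteexecution.BatchUpdateBlobsResponse, error)
	ConcatenateBlobs(ctx context.Context, in *ConcatenateBlobsRequest) (*ConcatenateBlobsResponse, error)
}

// uploadLargeBlob uploads a large object without using ByteStream: the
// parts (split at 2^k byte boundaries) are written through
// BatchUpdateBlobs(), and the server is then asked to combine them through
// ConcatenateBlobs().
func uploadLargeBlob(ctx context.Context, cas casClient, instanceName string, parts [][]byte, partDigests []*remoteexecution.Digest, combined *remoteexecution.Digest) error {
	requests := make([]*remoteexecution.BatchUpdateBlobsRequest_Request, 0, len(parts))
	for i, part := range parts {
		requests = append(requests, &remoteexecution.BatchUpdateBlobsRequest_Request{
			Digest: partDigests[i],
			Data:   part,
		})
	}
	// Real code would check the per-blob statuses in the response and
	// respect the server's maximum batch size.
	if _, err := cas.BatchUpdateBlobs(ctx, &remoteexecution.BatchUpdateBlobsRequest{
		InstanceName: instanceName,
		Requests:     requests,
	}); err != nil {
		return err
	}
	// Because the digest function is BLAKE3-like, the server can verify
	// that the parts combine into `combined` using only the part digests,
	// without rehashing all of the data.
	_, err := cas.ConcatenateBlobs(ctx, &ConcatenateBlobsRequest{
		InstanceName:  instanceName,
		SourceDigests: partDigests,
		TargetDigest:  combined,
	})
	return err
}
```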
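Finally, a hypothetical server-side helper showing why the explicit digest function parameter is needed. The DigestFunction enum values used below exist in the REAPI; the fallback logic for UNKNOWN is illustrative.

```go
package example

import (
	"fmt"

	remoteexecution "github.com/bazelbuild/remote-apis/build/bazel/remote/execution/v2"
)

// resolveDigestFunction picks the digest function for an incoming request.
// New-style clients set it explicitly, which is the only way to tell the
// BLAKE3 variant apart from SHA-256.
func resolveDigestFunction(explicit remoteexecution.DigestFunction_Value, hexHash string) (remoteexecution.DigestFunction_Value, error) {
	if explicit != remoteexecution.DigestFunction_UNKNOWN {
		return explicit, nil
	}
	// Old-style clients leave the field at UNKNOWN, so the function has to
	// be inferred from the hash length. This only works for the preexisting
	// digest functions, and even there it is ambiguous: both MD5 and
	// MURMUR3 produce 128-bit hashes.
	switch len(hexHash) {
	case 32:
		return remoteexecution.DigestFunction_MD5, nil // indistinguishable from MURMUR3
	case 40:
		return remoteexecution.DigestFunction_SHA1, nil
	case 64:
		return remoteexecution.DigestFunction_SHA256, nil // must NOT map to the BLAKE3 variant
	case 96:
		return remoteexecution.DigestFunction_SHA384, nil
	case 128:
		return remoteexecution.DigestFunction_SHA512, nil
	}
	return remoteexecution.DigestFunction_UNKNOWN, fmt.Errorf("unsupported hash length %d", len(hexHash))
}
```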