Output download failed: digest mismatch #12927
Comments
Can you elaborate with more details?
@coeuvre I think I have a reliable repro and also a root cause.

Repro case

1. Start a grpc cache server

To make the repro consistent, we hack the byte stream server to artificially delay sending each chunk by 1s.

#!/bin/bash
rm -rf bb-storage
git clone https://github.com/buildbarn/bb-storage.git
cd bb-storage
git checkout e2bed5b267e52baa66cf45512aa369df8ab821b5 # fixed commit
patch -p1 <<EOF
--- a/pkg/blobstore/grpcservers/byte_stream_server.go
+++ b/pkg/blobstore/grpcservers/byte_stream_server.go
@@ -3,6 +3,7 @@ package grpcservers
 import (
 	"context"
 	"io"
+	"time"
 
 	"github.com/buildbarn/bb-storage/pkg/blobstore"
 	"github.com/buildbarn/bb-storage/pkg/blobstore/buffer"
@@ -48,6 +49,8 @@ func (s *byteStreamServer) Read(in *bytestream.ReadRequest, out bytestream.ByteS
 		if readErr != nil {
 			return readErr
 		}
+		time.Sleep(1 * time.Second) // XXX
+
 		if writeErr := out.Send(&bytestream.ReadResponse{Data: readBuf}); writeErr != nil {
 			return writeErr
 		}
EOF
cat > config.jsonnet <<EOF
{
  blobstore: {
    contentAddressableStorage: {
      'local': {
        keyLocationMapInMemory: {
          entries: 16 * 1024 * 1024,
        },
        keyLocationMapMaximumGetAttempts: 8,
        keyLocationMapMaximumPutAttempts: 32,
        oldBlocks: 8,
        currentBlocks: 24,
        newBlocks: 3,
        blocksInMemory: {
          blockSizeBytes: ((8 * 1024 * 1024 * 1024) / (8 + 24 + 3)) | 0,
        },
      },
    },
    actionCache: {
      'local': {
        keyLocationMapInMemory: {
          entries: 16 * 1024 * 1024,
        },
        keyLocationMapMaximumGetAttempts: 8,
        keyLocationMapMaximumPutAttempts: 32,
        oldBlocks: 8,
        currentBlocks: 24,
        newBlocks: 3,
        blocksInMemory: {
          blockSizeBytes: ((4 * 1024 * 1024 * 1024) / (8 + 24 + 3)) | 0,
        },
      },
    },
  },
  global: { diagnosticsHttpListenAddress: ':9980' },
  grpcServers: [{
    listenAddresses: [':8980'],
    authenticationPolicy: { allow: {} },
  }],
  schedulers: {
    bar: { endpoint: { address: 'bar-scheduler:8981' } },
  },
  allowAcUpdatesForInstanceNamePrefixes: ['default'],
  maximumMessageSizeBytes: 16 * 1024 * 1024,
}
EOF
bazel run //cmd/bb_storage -- $PWD/config.jsonnet
2. Run bazel build

We force bazel to download a 256K CAS blob.

#!/bin/bash
set -ex
rm -rf bazel_test
mkdir bazel_test
cd bazel_test
echo > WORKSPACE
cat > BUILD <<EOF
genrule(
    name = "bigfile",
    outs = ["bigfile.img"],
    cmd = "fallocate -l 256K \$@",
)
genrule(
    name = "foo",
    srcs = [
        ":bigfile",
        "volatile.txt",
    ],
    outs = ["foo.txt"],
    cmd = "ls -lh \$(location :bigfile) > \$@",
)
EOF
touch volatile.txt
bazel version
# Make sure bigfile.img is written to remote cache.
bazel clean
bazel build :foo --remote_cache=grpc://localhost:8980 --remote_instance_name=default
# Now run the "foo" action with --remote_download_minimal so bazel needs to download bigfile.img from remote cache.
bazel clean
echo $(date) >> volatile.txt
set +e
bazel build :foo --remote_cache=grpc://localhost:8980 --remote_instance_name=default --remote_download_minimal --remote_timeout=2 --remote_retries=0 --experimental_remote_grpc_log=grpc.log
EXIT_CODE=$?
set -e
echo "Exit code: $EXIT_CODE"
cp $(bazel info server_log) ./java.log
remote_client --grpc_log grpc.log printlog > grpc.log.txt || true
Now, we see the following error reliably:
and in java.log:
Root cause analysis

This issue is also present at bazel HEAD. I bisected bazel, and the issue started happening when grpc-java was upgraded from 1.26.0 to 1.31.1. (Bazel HEAD and 4.0 are currently using grpc-java 1.32.2.) The issue is fixed if I upgrade grpc-java to 1.33.1 or 1.34.1. I then bisected grpc-java, and the issue was fixed by grpc/grpc-java@7ca6c02. In the related grpc/grpc-java#7105 there was a comment stating:
And indeed this is what is happening in bazel too! I added prints to src/main/java/com/google/devtools/build/lib/remote/GrpcCacheClient.java#requestRead and observed the expected and received digests.
Note how the received digest is larger in size than the expected digest!
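To illustrate the failure mode, here is a simplified, hypothetical sketch (not Bazel's actual GrpcCacheClient code; class and field names are made up) of a ByteStream read observer that appends each received chunk and tracks an offset used for retries. With grpc-java 1.27 through 1.32, a stray onNext delivered after onError can both write an extra chunk and advance the offset, so the retried download ends up larger than the expected blob:

```java
// Simplified, hypothetical sketch of the failure mode -- not Bazel's actual
// GrpcCacheClient code. A ByteStream Read observer appends each chunk to an
// output stream and tracks the byte offset so that a failed read can be
// retried from where it left off.
import com.google.bytestream.ByteStreamProto.ReadResponse;
import io.grpc.stub.StreamObserver;
import java.io.IOException;
import java.io.OutputStream;
import java.util.concurrent.atomic.AtomicLong;

final class ReadObserverSketch implements StreamObserver<ReadResponse> {
  private final OutputStream out;
  private final AtomicLong offset; // bytes written so far, reused by the retry logic

  ReadObserverSketch(OutputStream out, AtomicLong offset) {
    this.out = out;
    this.offset = offset;
  }

  @Override
  public void onNext(ReadResponse response) {
    // With grpc-java 1.27-1.32, this could still be invoked *after* onError
    // had already reported the call as closed (e.g. after a DEADLINE_EXCEEDED
    // triggered by the artificial 1s-per-chunk delay on the server).
    try {
      out.write(response.getData().toByteArray());
      offset.addAndGet(response.getData().size());
    } catch (IOException e) {
      // Error handling elided in this sketch.
    }
  }

  @Override
  public void onError(Throwable t) {
    // The caller retries the read starting at `offset`. If a stray onNext
    // arrives after this point, it both writes an extra chunk and bumps
    // `offset`, so the retried stream no longer lines up with what was
    // already written and the blob comes out larger than expected
    // (e.g. 327680 bytes received for an expected 262144-byte output).
  }

  @Override
  public void onCompleted() {}
}
```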
Before this change, BulkTransferExceptions would be reported like this:

```
Executing genrule //:foo failed: Exec failed due to IOException: null
```

By implementing getMessage(), we get a more informative message in the case when there is only one suppressed exception:

```
Executing genrule //:foo failed: Exec failed due to IOException: Output download failed: Expected digest '8a39d2abd3999ab73c34db2476849cddf303ce389b35826850f9a700589b4a90/262144' does not match received digest 'dae4d3dce1bb0a9414f61a65ee07622fa225ec01e6efe6df6e78f9ad5c58480d/327680'.
```

and in the case when there are multiple suppressed exceptions:

```
Executing genrule //:foo failed: Exec failed due to IOException: 2 errors during bulk transfer
```

This helps diagnose issues like bazelbuild#12927. Closes #13061. PiperOrigin-RevId: 359250576
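The gist of that change is roughly the following (a hedged sketch only; the real class is Bazel's BulkTransferException in the remote module, and this is not its exact code): the exception aggregates per-file transfer failures as suppressed exceptions and derives its message from them instead of returning null.

```java
// Rough sketch of the idea -- not the exact code of Bazel's
// BulkTransferException. Per-file transfer failures are attached as
// suppressed exceptions, and getMessage() is derived from them.
import java.io.IOException;

class BulkTransferExceptionSketch extends IOException {
  void add(IOException e) {
    addSuppressed(e);
  }

  @Override
  public String getMessage() {
    Throwable[] suppressed = getSuppressed();
    if (suppressed.length == 1) {
      // A single failure: surface its message directly, e.g. the
      // "Output download failed: Expected digest ... does not match ..." text.
      return suppressed[0].getMessage();
    }
    if (suppressed.length > 1) {
      return suppressed.length + " errors during bulk transfer";
    }
    return super.getMessage();
  }
}
```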
Part 1: add v1.34.1 version to third_party/grpc. Note: partly switches to v1.34.1 too, as not all bits are versioned and some of the unversioned bits are used from other third_party targets. grpc-java versions 1.27 through 1.32 had a bug where messages could arrive after the call was reported closed. In the case of bazel, this meant that in GrpcCacheClient, onNext could be called after onError. This leads to offset bookkeeping getting out of sync, and corrupts the CAS blob download. bazelbuild#12927
Part 2: Switch to grpc 1.34.1. grpc-java versions 1.27 through 1.32 had a bug where messages could arrive after the call was reported closed. In the case of bazel, this meant that in GrpcCacheClient, onNext could be called after onError. This leads to offset bookkeeping getting out of sync, and corrupts the CAS blob download. bazelbuild#12927
Part 3: remove 1.32.x from third_party/grpc. grpc-java versions 1.27 through 1.32 had a bug where messages could arrive after the call was reported closed. In the case of bazel, this meant that in GrpcCacheClient, onNext could be called after onError. This leads to offset bookkeeping getting out of sync, and corrupts the CAS blob download. bazelbuild#12927
Part 1: add v1.33.1 version to third_party/grpc. Note: partly switches to v1.33.1 too, as not all bits are versioned and some of the unversioned bits are used from other third_party targets. grpc-java versions 1.27 through 1.32 had a bug where messages could arrive after the call was reported closed. In the case of bazel, this meant that in GrpcCacheClient, onNext could be called after onError. This leads to offset bookkeeping getting out of sync, and corrupts the CAS blob download. #12927
Part 2: Switch to grpc 1.33.1. grpc-java versions 1.27 through 1.32 had a bug where messages could arrive after the call was reported closed. In the case of bazel, this meant that in GrpcCacheClient, onNext could be called after onError. This leads to offset bookkeeping getting out of sync, and corrupts the CAS blob download. #12927 Closes #13105. PiperOrigin-RevId: 359732452
Part 3: remove 1.32.x from third_party/grpc. grpc-java versions 1.27 through 1.32 had a bug where messages could arrive after the call was reported closed. In the case of bazel, this meant that in GrpcCacheClient, onNext could be called after onError. This leads to offset bookkeeping getting out of sync, and corrupts the CAS blob download. #12927 Closes: #13106
Fixed by #13104, #13105, #13106; hopefully this can be closed now, @federicohml? I have asked for the fix to be included in the 4.1 release; that's tracked in #13099. (I've also rebased the fix on top of 4.0 here: https://github.com/scele/bazel/commits/issue_12927_bazel4)
When using the dynamic scheduler, local actions may get interrupted or cancelled when the remote strategy is faster (e.g., remote cache hit). Ordinarily this isn't a problem, except when the local action is sharing a file download future with another local action. The interrupted thread of the local action cancels the future, and causes a CancellationException when the other local action thread tries to retrieve it. This resolves that problem by not letting threads/callers share the same future instance. The shared download future is only cancelled if all callers have requested cancellation. Fixes #12927. PiperOrigin-RevId: 362009791
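The idea of that fix can be sketched roughly as follows (illustrative only, using plain CompletableFuture rather than Bazel's actual classes; all names here are invented): each caller gets its own view of the shared download, and the underlying transfer is cancelled only once every caller has cancelled its view.

```java
// Illustrative sketch only -- not Bazel's actual implementation. Each caller
// gets its own future for a shared download; cancelling one caller's view
// never tears down the download for the others.
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicInteger;

final class SharedDownloadSketch<T> {
  private final CompletableFuture<T> download; // the single underlying transfer
  private final AtomicInteger interestedCallers = new AtomicInteger();

  SharedDownloadSketch(CompletableFuture<T> download) {
    this.download = download;
  }

  /** Returns a per-caller future; cancelling it never affects other callers. */
  CompletableFuture<T> newCallerFuture() {
    interestedCallers.incrementAndGet();
    CompletableFuture<T> callerView = new CompletableFuture<>();
    // Propagate the shared result (or failure) to this caller's view.
    download.whenComplete((value, error) -> {
      if (error != null) {
        callerView.completeExceptionally(error);
      } else {
        callerView.complete(value);
      }
    });
    // Cancel the real download only when the last interested caller cancels.
    callerView.whenComplete((value, error) -> {
      if (callerView.isCancelled() && interestedCallers.decrementAndGet() == 0) {
        download.cancel(/* mayInterruptIfRunning= */ true);
      }
    });
    return callerView;
  }
}
```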
Description of the problem / feature request:
After upgrading to Bazel 4.0.0, I get the following error:
Feature requests: what underlying problem are you trying to solve with this feature?
Bugs: what's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.
What operating system are you running Bazel on?
Ubuntu 20.04
What's the output of bazel info release?
release 4.0.0
If bazel info release returns "development version" or "(@non-git)", tell us how you built Bazel.
What's the output of git remote get-url origin ; git rev-parse master ; git rev-parse HEAD?
Have you found anything relevant by searching the web?
Any other information, logs, or outputs that you want to share?
The exact same command that produced the output above was tested with Bazel 3.4.1 and 3.7.2, and both work fine.