Output download failed: digest mismatch #12927
Comments
Can you elaborate with more details?
@coeuvre I think I have a reliable repro and also a root cause.

Repro case

1. Start a grpc cache server

To make the repro consistent, we hack the byte stream server to artificially delay sending each chunk by 1s.

#!/bin/bash
rm -rf bb-storage
git clone https://github.com/buildbarn/bb-storage.git
cd bb-storage
git checkout e2bed5b267e52baa66cf45512aa369df8ab821b5 # fixed commit
patch -p1 <<EOF
--- a/pkg/blobstore/grpcservers/byte_stream_server.go
+++ b/pkg/blobstore/grpcservers/byte_stream_server.go
@@ -3,6 +3,7 @@ package grpcservers
 import (
 	"context"
 	"io"
+	"time"
 
 	"github.com/buildbarn/bb-storage/pkg/blobstore"
 	"github.com/buildbarn/bb-storage/pkg/blobstore/buffer"
@@ -48,6 +49,8 @@ func (s *byteStreamServer) Read(in *bytestream.ReadRequest, out bytestream.ByteS
 		if readErr != nil {
 			return readErr
 		}
+		time.Sleep(1 * time.Second) // XXX
+
 		if writeErr := out.Send(&bytestream.ReadResponse{Data: readBuf}); writeErr != nil {
 			return writeErr
 		}
EOF
cat > config.jsonnet <<EOF
{
  blobstore: {
    contentAddressableStorage: {
      'local': {
        keyLocationMapInMemory: {
          entries: 16 * 1024 * 1024,
        },
        keyLocationMapMaximumGetAttempts: 8,
        keyLocationMapMaximumPutAttempts: 32,
        oldBlocks: 8,
        currentBlocks: 24,
        newBlocks: 3,
        blocksInMemory: {
          blockSizeBytes: ((8 * 1024 * 1024 * 1024) / (8 + 24 + 3)) | 0,
        },
      },
    },
    actionCache: {
      'local': {
        keyLocationMapInMemory: {
          entries: 16 * 1024 * 1024,
        },
        keyLocationMapMaximumGetAttempts: 8,
        keyLocationMapMaximumPutAttempts: 32,
        oldBlocks: 8,
        currentBlocks: 24,
        newBlocks: 3,
        blocksInMemory: {
          blockSizeBytes: ((4 * 1024 * 1024 * 1024) / (8 + 24 + 3)) | 0,
        },
      },
    },
  },
  global: { diagnosticsHttpListenAddress: ':9980' },
  grpcServers: [{
    listenAddresses: [':8980'],
    authenticationPolicy: { allow: {} },
  }],
  schedulers: {
    bar: { endpoint: { address: 'bar-scheduler:8981' } },
  },
  allowAcUpdatesForInstanceNamePrefixes: ['default'],
  maximumMessageSizeBytes: 16 * 1024 * 1024,
}
EOF
bazel run //cmd/bb_storage -- $PWD/config.jsonnet
2. Run bazel build

We force bazel to download a 256K CAS blob.

#!/bin/bash
set -ex
rm -rf bazel_test
mkdir bazel_test
cd bazel_test
echo > WORKSPACE
cat > BUILD <<EOF
genrule(
    name = "bigfile",
    outs = ["bigfile.img"],
    cmd = "fallocate -l 256K \$@",
)
genrule(
    name = "foo",
    srcs = [
        ":bigfile",
        "volatile.txt",
    ],
    outs = ["foo.txt"],
    cmd = "ls -lh \$(location :bigfile) > \$@",
)
EOF
touch volatile.txt
bazel version
# Make sure bigfile.img is written to remote cache.
bazel clean
bazel build :foo --remote_cache=grpc://localhost:8980 --remote_instance_name=default
# Now run the "foo" action with --remote_download_minimal so bazel needs to download bigfile.img from remote cache.
bazel clean
echo $(date) >> volatile.txt
set +e
bazel build :foo --remote_cache=grpc://localhost:8980 --remote_instance_name=default --remote_download_minimal --remote_timeout=2 --remote_retries=0 --experimental_remote_grpc_log=grpc.log
EXIT_CODE=$?
set -e
echo "Exit code: $EXIT_CODE"
cp $(bazel info server_log) ./java.log
remote_client --grpc_log grpc.log printlog > grpc.log.txt || true
Now, we see the following error reliably:
and in java.log:
Root cause analysis

This issue is also present at bazel HEAD. I bisected bazel, and the issue started happening when grpc-java was upgraded from 1.26.0 to 1.31.1. (Bazel HEAD and 4.0 are currently using grpc-java 1.32.2.) The issue is fixed if I upgrade grpc-java to 1.33.1 or 1.34.1. I then bisected grpc-java, and the issue was fixed by grpc/grpc-java@7ca6c02. In the related grpc/grpc-java#7105 there was a comment stating:
And indeed this is what is happening in bazel too! I added prints to src/main/java/com/google/devtools/build/lib/remote/GrpcCacheClient.java#requestRead and observed the expected and received digests.
Note how the received digest is larger in size than the expected digest!
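To illustrate the failure mode, here is a simplified, hypothetical sketch (not Bazel's actual GrpcCacheClient code; class and field names are made up) of a ByteStream read observer that appends each received chunk and tracks an offset used for retries. With grpc-java 1.27 through 1.32, a stray onNext delivered after onError can both write an extra chunk and advance the offset, so the retried download ends up larger than the expected blob:

```java
// Simplified, hypothetical sketch of the failure mode -- not Bazel's actual
// GrpcCacheClient code. A ByteStream Read observer appends each chunk to an
// output stream and tracks the byte offset so that a failed read can be
// retried from where it left off.
import com.google.bytestream.ByteStreamProto.ReadResponse;
import io.grpc.stub.StreamObserver;
import java.io.IOException;
import java.io.OutputStream;
import java.util.concurrent.atomic.AtomicLong;

final class ReadObserverSketch implements StreamObserver<ReadResponse> {
  private final OutputStream out;
  private final AtomicLong offset; // bytes written so far, reused by the retry logic

  ReadObserverSketch(OutputStream out, AtomicLong offset) {
    this.out = out;
    this.offset = offset;
  }

  @Override
  public void onNext(ReadResponse response) {
    // With grpc-java 1.27-1.32, this could still be invoked *after* onError
    // had already reported the call as closed (e.g. after a DEADLINE_EXCEEDED
    // triggered by the artificial 1s-per-chunk delay on the server).
    try {
      out.write(response.getData().toByteArray());
      offset.addAndGet(response.getData().size());
    } catch (IOException e) {
      // Error handling elided in this sketch.
    }
  }

  @Override
  public void onError(Throwable t) {
    // The caller retries the read starting at `offset`. If a stray onNext
    // arrives after this point, it both writes an extra chunk and bumps
    // `offset`, so the retried stream no longer lines up with what was
    // already written and the blob comes out larger than expected
    // (e.g. 327680 bytes received for an expected 262144-byte output).
  }

  @Override
  public void onCompleted() {}
}
```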
Before this change, BulkTransferExceptions would be reported like this:

```
Executing genrule //:foo failed: Exec failed due to IOException: null
```

By implementing getMessage(), we get a more informative message in the case when there is only one suppressed exception:

```
Executing genrule //:foo failed: Exec failed due to IOException: Output download failed: Expected digest '8a39d2abd3999ab73c34db2476849cddf303ce389b35826850f9a700589b4a90/262144' does not match received digest 'dae4d3dce1bb0a9414f61a65ee07622fa225ec01e6efe6df6e78f9ad5c58480d/327680'.
```

and in the case when there are multiple suppressed exceptions:

```
Executing genrule //:foo failed: Exec failed due to IOException: 2 errors during bulk transfer
```

This helps diagnose issues like bazelbuild#12927. Closes #13061. PiperOrigin-RevId: 359250576
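The gist of that change is roughly the following (a hedged sketch only; the real class is Bazel's BulkTransferException in the remote module, and this is not its exact code): the exception aggregates per-file transfer failures as suppressed exceptions and derives its message from them instead of returning null.

```java
// Rough sketch of the idea -- not the exact code of Bazel's
// BulkTransferException. Per-file transfer failures are attached as
// suppressed exceptions, and getMessage() is derived from them.
import java.io.IOException;

class BulkTransferExceptionSketch extends IOException {
  void add(IOException e) {
    addSuppressed(e);
  }

  @Override
  public String getMessage() {
    Throwable[] suppressed = getSuppressed();
    if (suppressed.length == 1) {
      // A single failure: surface its message directly, e.g. the
      // "Output download failed: Expected digest ... does not match ..." text.
      return suppressed[0].getMessage();
    }
    if (suppressed.length > 1) {
      return suppressed.length + " errors during bulk transfer";
    }
    return super.getMessage();
  }
}
```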
Part 1: add v1.34.1 version to third_party/grpc. Note: partly switches to v1.34.1 too, as not all bits are versioned and some of the unversioned bits are used from other third_party targets. grpc-java versions 1.27 through 1.32 had a bug where messages could arrive after the call was reported closed. In the case of bazel, this meant that in GrpcCacheClient, onNext could be called after onError. This leads to offset bookkeeping getting out of sync, and corrupts the CAS blob download. bazelbuild#12927
Part 2: Switch to grpc 1.34.1. grpc-java versions 1.27 through 1.32 had a bug where messages could arrive after the call was reported closed. In the case of bazel, this meant that in GrpcCacheClient, onNext could be called after onError. This leads to offset bookkeeping getting out of sync, and corrupts the CAS blob download. bazelbuild#12927
Part 3: remove 1.32.x from third_party/grpc. grpc-java versions 1.27 through 1.32 had a bug where messages could arrive after the call was reported closed. In the case of bazel, this meant that in GrpcCacheClient, onNext could be called after onError. This leads to offset bookkeeping getting out of sync, and corrupts the CAS blob download. bazelbuild#12927
Part 1: add v1.33.1 version to third_party/grpc. Note: partly switches to v1.33.1 too, as not all bits are versioned and some of the unversioned bits are used from other third_party targets. grpc-java versions 1.27 through 1.32 had a bug where messages could arrive after the call was reported closed. In the case of bazel, this meant that in GrpcCacheClient, onNext could be called after onError. This leads to offset bookkeeping getting out of sync, and corrupts the CAS blob download. #12927
Part 2: Switch to grpc 1.33.1. grpc-java versions 1.27 through 1.32 had a bug where messages could arrive after the call was reported closed. In the case of bazel, this meant that in GrpcCacheClient, onNext could be called after onError. This leads to offset bookkeeping getting out of sync, and corrupts the CAS blob download. #12927 Closes #13105. PiperOrigin-RevId: 359732452
Part 3: remove 1.32.x from third_party/grpc. grpc-java versions 1.27 through 1.32 had a bug where messages could arrive after the call was reported closed. In the case of bazel, this meant that in GrpcCacheClient, onNext could be called after onError. This leads to offset bookkeeping getting out of sync, and corrupts the CAS blob download. #12927 Closes: #13106
Fixed by #13104, #13105, #13106; hopefully this can be closed now, @federicohml? I have asked for the fix to be included in the 4.1 release; that's tracked in #13099. (I've also rebased the fix on top of 4.0 here: https://github.com/scele/bazel/commits/issue_12927_bazel4)
When using the dynamic scheduler, local actions may get interrupted or cancelled when the remote strategy is faster (e.g., remote cache hit). Ordinarily this isn't a problem, except when the local action is sharing a file download future with another local action. The interrupted thread of the local action cancels the future, and causes a CancellationException when the other local action thread tries to retrieve it. This resolves that problem by not letting threads/callers share the same future instance. The shared download future is only cancelled if all callers have requested cancellation. Fixes #12927. PiperOrigin-RevId: 362009791
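The idea of that fix can be sketched roughly as follows (illustrative only, using plain CompletableFuture rather than Bazel's actual classes; all names here are invented): each caller gets its own view of the shared download, and the underlying transfer is cancelled only once every caller has cancelled its view.

```java
// Illustrative sketch only -- not Bazel's actual implementation. Each caller
// gets its own future for a shared download; cancelling one caller's view
// never tears down the download for the others.
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicInteger;

final class SharedDownloadSketch<T> {
  private final CompletableFuture<T> download; // the single underlying transfer
  private final AtomicInteger interestedCallers = new AtomicInteger();

  SharedDownloadSketch(CompletableFuture<T> download) {
    this.download = download;
  }

  /** Returns a per-caller future; cancelling it never affects other callers. */
  CompletableFuture<T> newCallerFuture() {
    interestedCallers.incrementAndGet();
    CompletableFuture<T> callerView = new CompletableFuture<>();
    // Propagate the shared result (or failure) to this caller's view.
    download.whenComplete((value, error) -> {
      if (error != null) {
        callerView.completeExceptionally(error);
      } else {
        callerView.complete(value);
      }
    });
    // Cancel the real download only when the last interested caller cancels.
    callerView.whenComplete((value, error) -> {
      if (callerView.isCancelled() && interestedCallers.decrementAndGet() == 0) {
        download.cancel(/* mayInterruptIfRunning= */ true);
      }
    });
    return callerView;
  }
}
```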
Description of the problem / feature request:
After upgrading to Bazel 4.0.0, I get the following error:
Feature requests: what underlying problem are you trying to solve with this feature?
Bugs: what's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.
What operating system are you running Bazel on?
Ubuntu 20.04
What's the output of bazel info release?
release 4.0.0
If bazel info release returns "development version" or "(@non-git)", tell us how you built Bazel.
What's the output of git remote get-url origin ; git rev-parse master ; git rev-parse HEAD?
Have you found anything relevant by searching the web?
Any other information, logs, or outputs that you want to share?
The exact same command that produced the output above was tested with Bazel 3.4.1 and 3.7.2, and both work fine.