
Remote-cached build fails with javax.net.ssl.SSLException: handshake timed out #10159

Closed
scele opened this issue Nov 4, 2019 · 9 comments
Labels
team-Remote-Exec (Issues and PRs for the Execution (Remote) team), untriaged

Comments

@scele
Contributor

scele commented Nov 4, 2019

Description of the problem / feature request:

We are running bazel with:

--remote_cache=https://our-aws-cloudfront-distribution.com/backed/by/s3/bucket
--disk_cache=~/path/on/local/disk
--remote_download_minimal
--remote_timeout=300
--remote_retries=2

and the build/test parallelism is high: --jobs=auto (nproc == 56), --local_test_jobs=40.

We are seeing intermittent errors that seem to be related to the use of https as the remote cache protocol.

Example 1:

(06:43:31) [3,407 / 3,521] 19 / 872 tests; Testing //some:target; 73s local ... (87 actions running)
(06:43:35) ERROR: /ssd/home/gitlab-runner/builds/ZQ9e7vGx/0/some/path/BUILD:40:1:  failed due to unexpected I/O exception: handshake timed out
javax.net.ssl.SSLException: handshake timed out
	at io.netty.handler.ssl.SslHandler.handshake(...)(Unknown Source)
(06:43:35) INFO: Elapsed time: 82.648s, Critical Path: 77.62s
(06:43:35) INFO: 1313 processes: 1216 remote cache hit, 97 local.
(06:43:35) FAILED: Build did NOT complete successfully

Example 2:

(06:21:37) WARNING: Reading from Remote Cache:
Symlinks in action outputs are not yet supported by --experimental_remote_download_outputs=minimal
(06:21:52) [2,129 / 3,349] 3 / 759 tests; Executing genrule @local_config_cuda//cuda:cuda-include; 19s local, remote-cache ... (78 actions running)
(06:21:58) WARNING: Writing to Remote Cache:
ClosedChannelException
(06:22:16) [2,600 / 3,768] 72 / 759 tests; Executing genrule @local_config_cuda//cuda:cuda-include; 43s local, remote-cache ... (60 actions, 51 running)
(06:22:22) WARNING: Writing to Remote Cache:
ExtendedClosedChannelException
(06:22:44) [2,615 / 3,782] 77 / 759 tests; Executing genrule @local_config_cuda//cuda:cuda-include; 70s local, remote-cache ... (70 actions, 68 running)
(06:23:15) [2,920 / 4,087] 144 / 759 tests; Executing genrule @local_config_cuda//cuda:cuda-include; 101s local, remote-cache ... (85 actions running)
(06:23:51) [3,323 / 4,490] 244 / 759 tests; Executing genrule @local_config_cuda//cuda:cuda-include; 137s local, remote-cache ... (65 actions, 60 running)
(06:24:07) ERROR: /ssd/home/gitlab-runner/builds/VFqgTLVf/0/some/path/BUILD:7:1:  failed due to unexpected I/O exception: handshake timed out
javax.net.ssl.SSLException: handshake timed out
	at io.netty.handler.ssl.SslHandler.handshake(...)(Unknown Source)
(06:24:43) [3,519 / 4,662] 261 / 759 tests; Executing genrule @local_config_cuda//cuda:cuda-include; 190s local, remote-cache ... (6 actions running)
(06:25:08) WARNING: Writing to Remote Cache:
SSLEngine closed already
(06:25:45) INFO: Elapsed time: 391.995s, Critical Path: 250.27s
(06:25:45) INFO: 1689 processes: 1152 remote cache hit, 537 local.

Example 3:

(11:46:29) WARNING: Writing to Remote Cache:
400 Bad Request
<?xml version="1.0" encoding="UTF-8"?>
<Error><Code>RequestTimeout</Code><Message>Your socket connection to the server was not read from or written to within the timeout period. Idle connections will be closed.</Message><RequestId>DC396580BA6E5186</RequestId><HostId>PFE6JZxpAXa2yYxxIbM/1KVHqHii3XPH67ql3wF3ZhUslJUQTw1OMfg7j8krnc8H9SEXS+VotiE=</HostId></Error>
(11:46:29) WARNING: Writing to Remote Cache:
400 Bad Request
<?xml version="1.0" encoding="UTF-8"?>
<Error><Code>RequestTimeout</Code><Message>Your socket connection to the server was not read from or written to within the timeout period. Idle connections will be closed.</Message><RequestId>211CBD741B01AC68</RequestId><HostId>GZrAWg9+c8EZtaWCwq4/z3O9qNIPHc5wsT//ROFGeTNtGQo2NM/gu5zfR5rtexaWtQ+6Tr0ag8c=</HostId></Error>
(11:46:32) WARNING: Reading from Remote Cache:
Symlinks in action outputs are not yet supported by --experimental_remote_download_outputs=minimal
(11:46:44) WARNING: Writing to Remote Cache:
Broken pipe
(11:47:41) WARNING: Writing to Remote Cache:
SSLEngine closed already
(11:47:41) WARNING: Writing to Remote Cache:
ClosedChannelException
(11:51:19) ERROR: /ssd/home/gitlab-runner/builds/RBdZnUE1/0/some/path/BUILD:35:1: C++ compilation of rule '//some/target.so' failed due to unexpected I/O exception: handshake timed out
javax.net.ssl.SSLException: handshake timed out
	at io.netty.handler.ssl.SslHandler.handshake(...)(Unknown Source)
(11:51:21) INFO: Elapsed time: 596.349s, Critical Path: 366.45s
(11:51:21) INFO: 5841 processes: 5681 remote cache hit, 160 local.
(11:51:21) FAILED: Build did NOT complete successfully

I have also seen a curious "null" error once, which might or might not be related:

(06:27:30) ERROR: /ssd/home/gitlab-runner/builds/XgQhJyq7/0/some/path/BUILD:33:1:  failed due to unexpected I/O exception: null
io.netty.channel.ExtendedClosedChannelException
	at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source)

Brainstorming:

  • Any remote cache connectivity issue should not fail the build; at worst, Bazel should fall back to local execution, right?
  • Note that while Example 3 warns about the 400 error code returned by AWS CloudFront, the ERROR that causes the build to fail is the same handshake timeout.
  • Perhaps some TLS errors are not being retried correctly by bazel's HTTP remote cache backend? (See the sketch after this list.)
  • Does bazel support HTTP/2 and multiplexing multiple requests over a single connection? Perhaps bazel is creating too many connections to the remote cache, which makes things unreliable.
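
As a minimal sketch of the retry idea in the third bullet (illustrative only, not Bazel's actual remote cache code; the class and method names are hypothetical), assuming that TLS handshake timeouts and closed channels should be treated as transient and retried a bounded number of times:

import java.io.IOException;
import java.util.concurrent.Callable;

final class TransientRetry {
  // Retries an operation whose failure looks transient, with a simple linear backoff.
  // SSLException ("handshake timed out") and ClosedChannelException both extend
  // IOException, so catching IOException covers the errors seen in the logs above.
  static <T> T call(Callable<T> op, int maxAttempts) throws Exception {
    Exception last = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        return op.call();
      } catch (IOException e) {
        last = e;
        Thread.sleep(100L * attempt); // back off a little before the next attempt
      }
    }
    throw last;
  }
}

Whether Bazel's HTTP cache client does something equivalent for these exception types is exactly the open question in that bullet.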

Bugs: what's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

Unfortunately the errors are intermittent and only happen when there is a lot of traffic going in and out of the remote cache, so I don't have an easy reproducer.

What operating system are you running Bazel on?

Ubuntu 16.04.6 LTS
Linux lpeltonen-dtx 4.15.0-66-generic #75-Ubuntu SMP Tue Oct 1 05:24:09 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

What's the output of bazel info release?

release 1.1.0-patches1

If bazel info release returns "development version" or "(@non-git)", tell us how you built Bazel.

We have applied two patches on top of the 1.1.0 release, which should be unrelated to this issue. We built Bazel from this commit: https://github.com/scele/bazel/commits/1.1.0-patches1 using

bazel build --stamp --embed_label 1.1.0-patches1 //scripts/packages:with-jdk/install.sh

Have you found anything relevant by searching the web?

These issues seem related, but both should be fixed in 1.1.0, and neither of them mentions the "handshake timed out" error:

@dslomov added the team-Remote-Exec and untriaged labels Nov 25, 2019
@ulfjack
Contributor

ulfjack commented Sep 5, 2020

The HTTP client does not retry failed downloads or uploads (I only checked downloads). If it prints a WARNING, it should fall back to local execution; if it prints an ERROR, probably not. The HTTP client is HTTP 1.1 only, with a maximum number of concurrent connections.
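
To make the WARNING vs. ERROR distinction concrete, here is a hedged sketch of the fallback behavior described above (not Bazel's actual code; getCachedResult and executeLocally are hypothetical stand-ins):

import java.io.IOException;
import java.util.Optional;
import java.util.concurrent.Callable;

final class CacheFallbackSketch {
  // WARNING path: a cache read failure is logged and the action runs locally.
  // An exception that escaped this method entirely would be the ERROR path
  // that fails the build.
  static <T> T runWithCache(Callable<Optional<T>> getCachedResult, Callable<T> executeLocally)
      throws Exception {
    try {
      Optional<T> hit = getCachedResult.call();
      if (hit.isPresent()) {
        return hit.get(); // remote cache hit
      }
    } catch (IOException e) {
      System.err.println("WARNING: Reading from Remote Cache: " + e.getMessage());
    }
    return executeLocally.call(); // cache miss or transient failure: run locally
  }
}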

@ulfjack
Contributor

ulfjack commented Sep 5, 2020

I have a fairly reliable repro of at least one subclass of these issues against a local nginx over HTTP (not TLS).

My BUILD file:

[genrule(
  name = "foo-%s" % i,
  outs = ["foo-%s.txt" % i],
  cmd = "echo 'hello %s' > $@" % i) for i in range(0, 100)]

My nginx config:

server {
  listen 80;
  server_name localhost;
  location /cache/ {
    # The path to the directory where nginx should store the cache contents.
    root /tmp/nginx-cache/;
    # Allow PUT
    dav_methods PUT;
    # Allow nginx to create the /ac and /cas subdirectories.
    create_full_put_path on;
    # The maximum size of a single file.
    client_max_body_size 1G;
    allow all;
    autoindex on;
    # Serve at most one request per keep-alive connection, so nginx sends
    # 'Connection: close' on every response.
    keepalive_requests 1;
  }
}

My Bazel command line:

bazel clean && bazel build --remote_cache=http://localhost/cache/ //foo:all --verbose_failures --jobs=200

@ulfjack
Contributor

ulfjack commented Sep 5, 2020

My hypothesis is that it does not handle 'Connection: close' headers correctly, or at least that there's a race condition. I think this explains at least the following failure modes:

(11:47:41) WARNING: Writing to Remote Cache:
SSLEngine closed already
(11:47:41) WARNING: Writing to Remote Cache:
ClosedChannelException

It may also explain some of the other problems reported here.
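
To illustrate the suspected race (a simplified sketch with hypothetical names, not the actual HttpDownloadHandler code): the user promise has a callback that returns the channel to the connection pool, so if that promise completes before the channel is closed in response to 'Connection: close', another request can be handed the doomed channel and then fail with ClosedChannelException or 'SSLEngine closed already'.

import io.netty.channel.Channel;
import io.netty.channel.ChannelFutureListener;
import io.netty.util.concurrent.Promise;

final class ConnectionCloseOrdering {
  // Racy ordering: the pool may hand the channel to another request
  // before close() has taken effect.
  static void racy(Channel ch, Promise<Void> userPromise) {
    userPromise.setSuccess(null); // callback returns 'ch' to the pool
    ch.close();                   // too late: 'ch' may already be in use again
  }

  // Safer ordering (the idea behind the patch below): close first, then
  // complete the promise, so the pool only ever sees a fully closed channel.
  static void fixed(Channel ch, Promise<Void> userPromise) {
    ch.close().addListener((ChannelFutureListener) f -> userPromise.setSuccess(null));
  }
}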

@ulfjack
Contributor

ulfjack commented Sep 5, 2020

If it doesn't repro, increasing the number of genrules helps. Decreasing the number of jobs seems to make it less likely to trigger.

@ulfjack
Contributor

ulfjack commented Sep 5, 2020

Here's a patch for HttpDownloadHandler.java that makes the problem go away in my testing:

  private void succeedAndReset(ChannelHandlerContext ctx) {
    try {
      // Reset (and close, if the server sent 'Connection: close') *before*
      // completing the user promise, so the connection is never returned to
      // the pool while it is about to be closed.
      reset(ctx);
      succeedAndResetUserPromise();
    } finally {
//      reset(ctx);  // previously the reset ran here, after the promise had completed
    }
  }

ulfjack added a commit to ulfjack/bazel that referenced this issue Sep 5, 2020
The user promise has a callback that returns the connection to the pool.
If the server returns a 'connection: close' HTTP header, then this can
currently happen before the connection is closed, in which case the client
attempts to reuse the connection, which - of course - fails.

This changes the ordering to close the connection *before* completing the
user promise.

This is at least a partial fix for the linked issue. It is unclear if this
is the root cause for all the reported failure modes.

Progress on bazelbuild#10159.

Change-Id: I2897e55c6edda592a6fb5755ddcccd1a89cde528
bazel-io pushed a commit that referenced this issue Sep 8, 2020
The user promise has a callback that returns the connection to the pool.
If the server returns a 'connection: close' HTTP header, then this can
currently happen before the connection is closed, in which case the client
attempts to reuse the connection, which - of course - fails.

This changes the ordering to close the connection *before* completing the
user promise.

This is at least a partial fix for the linked issue. It is unclear if this
is the root cause for all the reported failure modes.

Progress on #10159.

Change-Id: I2897e55c6edda592a6fb5755ddcccd1a89cde528

Closes #12055.

Change-Id: I2897e55c6edda592a6fb5755ddcccd1a89cde528
PiperOrigin-RevId: 330496714
@ulfjack
Contributor

ulfjack commented Sep 25, 2020

I'm not sure if my change fixes the 'ssl handshake' issue. Please let me know if you still see issues with my commit (not released yet, should be in 3.6.0).

@scele
Contributor Author

scele commented Sep 28, 2020

Thanks @ulfjack for the fixes!!

Unfortunately it's not easy for me to verify the fix, because the issue only reproduced in high-volume CI builds, and we are currently using gRPC in CI.

But if you believe the issue is fixed, or at least significantly improved, I'd be perfectly fine with closing this bug - we can always re-open or file a new ticket if we still see issues after 3.6.0.

Thanks!!

@ulfjack
Contributor

ulfjack commented Oct 5, 2020

Adding @coeuvre to decide whether to close this. I don't have access to do it myself anyway.

@coeuvre
Member

coeuvre commented Oct 9, 2020

Closing. Please feel free to re-open if this is still an issue.
