Suppress DEADLINE_EXCEEDED on download progress #5230
Conversation
@@ -174,34 +174,96 @@ public void ensureInputsPresent(
    uploader.uploadBlobs(toUpload);
  }

  class PositionalOutputStream extends OutputStream {
    private final OutputStream delegate;
    private transient long offset = 0;
why transient?
Unnecessary, dropping it
Overall a good idea. I am somewhat worried about slow downloads, where one would just want to fall back to local execution. Also, this code is used in remote caching, where slow downloads are more of an issue. However, these concerns are likely unfounded, and so I think we should merge this.
Also, can you please add some tests?
Agreed about the fallback behavior - what this is really lacking is a global heuristic/traffic-shaping mechanism that fully saturates the available bandwidth and locally executes anything after that point, up until the CPU/disk/RAM capacity. But realistically we're just talking about bin-packing, and I've not seen any indication that races are the norm and that falling back to local is not exceptional behavior.
Force-pushed from 754a2dc to 71b5b0e
        ReadRequest.newBuilder()
            .setResourceName(resourceName)
            .setReadOffset(offset)
            .build(),
Do I understand correctly that this will not work as is, because you're requesting a different offset every time but disregarding this offset when writing to out, so you would need something like the code that you commented out below to do it right?
You're completely right. The commented code is coming out; the scope of Offset (and Hashing) has to move to creation outside of the retrier.
Force-pushed from d717f8d to 1cdbf6f
I like your idea with the progress and I think we should also apply this to uploads. I think we should go even further: up/downloads should continue as long as they make progress, and the timeout should apply to all other calls, but mostly execute really, which one would expect to always take the longest. Further down the road we should probably remove the […]. WDYT?
I think the timeout is necessary in the case where you don't make progress, but that clocking progress needs to be made into an intercepted behavior on the channel, if such a thing is available in gRPC; we are essentially monitoring starvation without a centralized timer reset. Agreed on the upload addition; that actually brings up a point that the ByteStream documentation is not clear on: will (other) clients react poorly if they see a WriteResponse, for instance, for every WriteRequest with the current committed_size? Pertinent to execute, we would need to be talking about the v2 streaming API, which I am anxiously awaiting seeing client code for.
What's the status of this change?
Force-pushed from 57347de to 5fb574d
Resolved the merge conflicts. I would like to see a separate commit or issue to push out incremental uploads and the timeout stuff. I think this is a good stopgap to prevent concurrent downloads from failing due to link saturation, and if we want fundamental changes like a 'give up and fall back to local', we need a real state machine designed so that my 500-wide workerset doesn't suddenly swing its hammer down to local and smash my RAM to bits.
Ping? @buchgr, can we defer some of the other magical robustness changes to later additions?
I apologize for being absent, George! Would you mind rebasing this PR in case you are still interested in getting this merged? Thanks!
Master is merged and all CI tests passed. Let's get this one done.
A gRPC downloadBlob can be distributed over multiple sequential ByteStream::read requests, all of which will fail prior to ultimate completion or failure of the download, represented by a single ListenableFuture. To provide this abstraction, GrpcRemoteCache::downloadBlob supplies the requests with an offset that transits committed size between each request, a Retrier.Backoff that is reset each time partial content is received, and an optional hash supplier attached to a digesting filter bound to the lifetime of the download. Each request retains the count of bytes that has been committed to the output stream since it began.

When a request error is encountered during a gRPC blob download, the client proceeds as follows:

- If the error is not retriable as determined by the GrpcActionCache's retrier, the blob download future fails.
- If the error is retriable and any positive size of partial content has been committed to the output stream since the request began, a new request is initiated with a read from the current request offset plus the committed size.
- If the error is retriable and no partial content has been committed to the output stream, a delay is acquired from its backoff. If the delay is negative, the download fails immediately with the error. Otherwise a new request is initiated with a read from the current request offset.

All retry and gRPC Context logic has been removed from the AbstractRemoteActionCache, making the individual ActionCache/store implementations responsible for their own retry mechanisms.
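For readers following the thread, the decision described above boils down to roughly the sketch below. This is illustrative only, not the actual diff in this PR: onReadError, startRead, retryScheduler, committedSinceStart, and the retrier/backoff method names are assumptions, with Retrier.Backoff assumed to hand back a negative delay once its attempts are exhausted, as the description says.

  // Illustrative sketch of the per-request retry decision described above.
  // `offset` is the absolute read position shared across attempts,
  // `committedSinceStart` is the byte count the just-failed request wrote
  // to the output stream, and `outerF` represents the whole download.
  private void onReadError(
      String resourceName,
      Throwable t,
      AtomicLong offset,
      long committedSinceStart,
      Retrier.Backoff backoff,
      SettableFuture<Void> outerF) {
    if (!retrier.isRetriable(t)) {
      // Not retriable: the single download future fails.
      outerF.setException(t);
      return;
    }
    if (committedSinceStart > 0) {
      // Partial content arrived: advance the offset, take a fresh backoff,
      // and immediately issue a new read from the new position.
      offset.addAndGet(committedSinceStart);
      startRead(resourceName, offset, retrier.newBackoff(), outerF);
      return;
    }
    // No progress at all: consume a backoff delay. A negative delay means
    // the backoff is exhausted and the download fails with the error.
    long delayMillis = backoff.nextDelayMillis();
    if (delayMillis < 0) {
      outerF.setException(t);
      return;
    }
    retryScheduler.schedule(
        () -> startRead(resourceName, offset, backoff, outerF),
        delayMillis,
        TimeUnit.MILLISECONDS);
  }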
Force-pushed from f03fa20 to 1ac9d1f
The stalled backoff, updated since last progress, should be the only backoff used in the observer object.
@philwo he and I spoke offline; I will do so as soon as he's back on his feet.
Avoid considering non-DEADLINE_EXCEEDED results when deciding whether to use a retry attempt/backoff delay.
src/main/java/com/google/devtools/build/lib/remote/GrpcRemoteCache.java
  private ListenableFuture<Void> requestRead(
      String resourceName,
      AtomicLong offset,
While I appreciate your ingenuity, I'd argue that this is quite the hack. I am wondering if this whole logic couldn't be implemented in the Retrier itself, i.e.:
class ProgressiveRetrier extends Retrier {
  public ListenableFuture<Void> executeAsync(AsyncCallable<Long> call, long totalBytes) {
    return executeAsync(new AsyncCallable<Void>() {
      // implement progressive retry logic in here. in case of deadline exceeded and having made progress
      // immediately retry.
    }, newBackoff());
  }
}
I haven't fully thought it through, but I think this could be made generic enough that it could also be used in the ByteStreamUploader and the HttpBlobStore. WDYT?
Whatever gets me to an accept at this point.
Assuming that this comment addresses the progressive offset holder, I don't see how totalBytes can be used in the retrier to effect this: call presents an offset-worthy result only in the case of non-exception, but DEADLINE_EXCEEDED will be an exception state for the future. Do you expect the call to swallow the DEADLINE_EXCEEDED and respond with a committedSize, expecting the retrier to recall until it matches totalBytes? If so, the call still needs to maintain the current offset (per the AtomicLong), since there's no way to pass it back to the subsequent one.
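To make that last point concrete, here is a minimal sketch of the wiring, assuming an executeAsync(AsyncCallable, Backoff) entry point along the lines sketched above and abbreviating requestRead's argument list (out stands in for the destination stream):

  // The AtomicLong is the only state that survives between retry attempts:
  // the retrier just re-invokes the same AsyncCallable, so each attempt has
  // to pick up its starting position from the shared holder.
  AtomicLong offset = new AtomicLong(0);
  ListenableFuture<Void> download =
      retrier.executeAsync(
          () -> requestRead(resourceName, offset, out),
          retrier.newBackoff());
  // requestRead advances `offset` as chunks are committed to the output
  // stream, so a retried attempt resumes from offset.get() rather than zero.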
              outerF.setException(t);
            }
          },
          Context.current().fixedContextExecutor(MoreExecutors.directExecutor()));
currentContextExecutor(...)?
Negative. currentContextExecutor will take the thread's context from the caller of execute, not from the call of the method. In this case that is a scheduled executor thread for the retrier. I spent hours trying to figure out how currentContextExecutor was doing me wrong before I realized how. It has all the problems of directExecutor with a confusing name.
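For anyone else tripped up by this, a small sketch of the difference, assuming the io.grpc.Context API and Guava's MoreExecutors; the variable names are illustrative:

  // fixedContextExecutor pins the gRPC Context that is current *here*, at
  // the point where the executor is created (i.e. the download caller):
  Executor pinned =
      Context.current().fixedContextExecutor(MoreExecutors.directExecutor());

  // currentContextExecutor instead propagates whatever Context is current on
  // the thread that later calls execute() - in this code path that is the
  // retrier's scheduled executor thread, not the original caller:
  Executor ambient =
      Context.currentContextExecutor(MoreExecutors.directExecutor());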
Prevent DEADLINE_EXCEEDED from contributing to the retry counter when it
is making progress, and preserve progress in between retry attempts on a
single file.