fix(storage): add backoff to gRPC write retries #11200
Conversation
BrennaEpp commented on Nov 27, 2024 (edited)
- Integration tests pass
- Emulated/conformance tests pass except for the following, which I assume is unrelated:
A few minor comments, but overall the approach looks good.
storage/grpc_client.go
Outdated
@@ -2370,3 +2382,78 @@ func checkCanceled(err error) error {
	return err
}

func (w *gRPCWriter) initializeRetryConfig() {
	if w.attempts == 0 {
So, are we tracking attempts across the entire upload? I would think we should track this per-chunk as we do for JSON.
Changed this
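For reference, a minimal sketch of what per-chunk attempt tracking could look like; chunkRetryState and uploadChunk are hypothetical names, not the PR's actual code, and a real implementation would also pause with backoff between attempts:

package storagesketch

import (
    "errors"
    "fmt"
)

// chunkRetryState holds retry bookkeeping that is reset for every chunk,
// mirroring how the JSON path counts attempts per chunk rather than per upload.
type chunkRetryState struct {
    attempts    int
    maxAttempts int
}

// recordAttempt increments the counter and reports whether the budget is spent.
func (s *chunkRetryState) recordAttempt() error {
    s.attempts++
    if s.attempts > s.maxAttempts {
        return fmt.Errorf("retry failed after %d attempts", s.attempts)
    }
    return nil
}

// uploadChunk sends one chunk with its own, freshly initialized retry budget.
func uploadChunk(data []byte, send func([]byte) error) error {
    state := &chunkRetryState{maxAttempts: 3} // fresh counter per chunk
    for {
        err := send(data)
        if err == nil {
            return nil
        }
        if attemptErr := state.recordAttempt(); attemptErr != nil {
            return errors.Join(attemptErr, err)
        }
        // A real implementation would wait with exponential backoff here.
    }
}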
storage/grpc_client.go
Outdated
// shouldRetry determines if a retry is necessary and if so waits the appropriate
// amount of time. It returns true if the error is retryable or the error to be
// surfaced to the user if not.
func (w *gRPCWriter) shouldRetry(ctx context.Context, err error) (bool, error) {
Can we name this something else? Just confusing to conflate this with the ShouldRetry that only takes an err and returns a bool, especially since the backoff pause happens inside this func.
Yeah, that's fair. I'll think of something better.
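For illustration only, a rough sketch of one way to keep the names distinct: a method that only classifies the error, with the context-aware backoff pause as a separately named step. backoffPauser, classifyRetry, and waitBackoff are assumed names for this sketch, not what the PR ends up using.

package storagesketch

import (
    "context"
    "time"
)

// backoffPauser is the minimal surface needed from the backoff for this sketch.
type backoffPauser interface {
    Pause() time.Duration
}

type retryState struct {
    backoff     backoffPauser
    shouldRetry func(error) bool // the existing err -> bool predicate
}

// classifyRetry only answers "is this error retryable?"; it never sleeps.
func (r *retryState) classifyRetry(err error) bool {
    return r.shouldRetry(err)
}

// waitBackoff performs the pause, respecting context cancellation.
func (r *retryState) waitBackoff(ctx context.Context) error {
    select {
    case <-ctx.Done():
        return ctx.Err()
    case <-time.After(r.backoff.Pause()):
        return nil
    }
}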
storage/grpc_client.go
Outdated
} | ||
|
||
if retryable { | ||
p := w.backoff.Pause() |
As discussed, maybe we can add a mock test in this PR to ensure that backoff.Pause is actually called?
I made a mock for the backoff and had to change more than anticipated to be able to inject it, PTAL.
Also, are you sure your testbench is up to date? That would be a reason that the ReadStallTimeout test might not work locally. It looks like it runs fine in Kokoro, so I am not concerned.
It is up to date. It's possible it just doesn't run fast enough or something like that locally; I'm also not concerned if they are passing in Kokoro.
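As an aside, a hedged sketch of the kind of test such a mock enables once the backoff is injectable behind a small interface; countingBackoff and retryOnce are illustrative names, not the PR's actual test code.

package storagesketch

import (
    "testing"
    "time"
)

// pauser is the injected seam standing in for the real backoff.
type pauser interface {
    Pause() time.Duration
}

// countingBackoff records how many times Pause is invoked and returns a zero
// duration so the test does not actually sleep.
type countingBackoff struct {
    calls int
}

func (b *countingBackoff) Pause() time.Duration {
    b.calls++
    return 0
}

// retryOnce models the write loop pausing before retrying a retryable error.
func retryOnce(b pauser, retryable bool) {
    if retryable {
        time.Sleep(b.Pause())
    }
}

func TestBackoffPauseIsCalled(t *testing.T) {
    b := &countingBackoff{}
    retryOnce(b, true)
    if b.calls != 1 {
        t.Fatalf("Pause called %d times, want 1", b.calls)
    }
}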
Can we run an e2e test to verify it fixes b/379925581?
Looks good; one minor suggestion, but maybe just for future reference.
func (w *gRPCWriter) shouldRetry(ctx context.Context, err error) (bool, error) {
// retriable determines if a retry is necessary and if so returns a nil error;
// otherwise it returns the error to be surfaced to the user.
func (retry *uploadBufferRetryConfig) retriable(ctx context.Context, err error) error {
It seems a little odd to have this return the error to be resurfaced. Maybe just have it return a bool, and you can use retry.lastErr to set the error to return if need be?
Also not a big deal if you want to leave this as-is for now, since we'll be overwriting with the refactored version anyway.
That's fair; I think I may have been too deep in this. Although, this returns a formatted error for when max attempts are reached, which you wouldn't know outside this method unless you returned two bools... but I guess adding the number of attempts to the error may be useful for all cases anyway. I'll leave as-is and consider this feedback if relevant to the other version.
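For future reference, a rough sketch of the bool-returning shape under discussion: retriable reports whether to retry and stashes the error to surface (annotated with the attempt count) on the config. The fields and body here are only an assumption of how that alternative could look, not the PR's code.

package storagesketch

import (
    "context"
    "fmt"
)

type uploadBufferRetryConfig struct {
    attempts    int
    maxAttempts int
    lastErr     error            // error to surface when retriable returns false
    shouldRetry func(error) bool // the existing err -> bool predicate
}

// retriable reports whether the caller should retry. When it returns false,
// lastErr holds the error to surface, including the number of attempts made.
func (retry *uploadBufferRetryConfig) retriable(ctx context.Context, err error) bool {
    retry.attempts++
    retry.lastErr = err

    if ctx.Err() != nil {
        retry.lastErr = ctx.Err()
        return false
    }
    if retry.attempts >= retry.maxAttempts {
        retry.lastErr = fmt.Errorf("retry failed after %d attempts; last error: %w", retry.attempts, err)
        return false
    }
    return retry.shouldRetry(err)
}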
}
}

type MockWriteStream struct {
This looks useful for testing other writer behavior; we should file a ticket to add it.
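For that ticket, a simplified sketch of the idea: a mock stream that records Send calls and returns scripted errors, so writer retry behavior can be exercised without a server. This is only an illustration under assumed types and does not implement the real generated gRPC stream interface.

package storagesketch

import "sync"

// MockWriteStream records payloads passed to Send and returns scripted errors,
// letting tests drive the writer's retry paths without a real server.
type MockWriteStream struct {
    mu       sync.Mutex
    sent     [][]byte // payloads passed to Send, in order
    sendErrs []error  // errors returned by successive Send calls
}

// Send records the payload and pops the next scripted error, if any.
func (m *MockWriteStream) Send(payload []byte) error {
    m.mu.Lock()
    defer m.mu.Unlock()
    m.sent = append(m.sent, payload)
    if len(m.sendErrs) > 0 {
        err := m.sendErrs[0]
        m.sendErrs = m.sendErrs[1:]
        return err
    }
    return nil
}

// SentCount reports how many Send calls were made, e.g. to assert that a retry
// re-sent the buffered data.
func (m *MockWriteStream) SentCount() int {
    m.mu.Lock()
    defer m.mu.Unlock()
    return len(m.sent)
}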