
fix(storage): add backoff to gRPC write retries #11200

Merged: 5 commits merged into googleapis:main on Dec 4, 2024

Conversation

@BrennaEpp (Contributor) commented Nov 27, 2024

  • Integration tests pass
  • Emulated/conformance tests pass, except for the following, which I assume is unrelated:
--- FAIL: TestRetryReadStallEmulated (5.52s)
    client_test.go:1496: NewReader: context deadline exceeded

@BrennaEpp requested review from a team as code owners on November 27, 2024 08:00
product-auto-label bot added the api: storage label (Issues related to the Cloud Storage API) on Nov 27, 2024
@tritone (Contributor) left a comment

A few minor comments, but overall the approach looks good.

@@ -2370,3 +2382,78 @@ func checkCanceled(err error) error {

	return err
}

func (w *gRPCWriter) initializeRetryConfig() {
	if w.attempts == 0 {
Contributor
So, are we tracking attempts across the entire upload? I would think we should track this per-chunk as we do for JSON.

Contributor Author

Changed this to track attempts per chunk.
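
For readers following along, here is a minimal sketch of what per-chunk attempt tracking could look like. The names (uploadChunk, isRetryable, maxAttemptsPerChunk) and the backoff settings are illustrative assumptions, not the PR's actual code; the real writer uses its configured retry policy. The point is only that each chunk starts with a fresh attempt counter and backoff.

package chunkretry

import (
	"context"
	"errors"
	"time"

	"github.com/googleapis/gax-go/v2"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

const maxAttemptsPerChunk = 5 // illustrative limit, not the client's default

// isRetryable is a stand-in for the storage client's retry predicate.
func isRetryable(err error) bool {
	if s, ok := status.FromError(err); ok {
		switch s.Code() {
		case codes.Unavailable, codes.ResourceExhausted, codes.Internal:
			return true
		}
	}
	return false
}

// uploadChunk retries a single chunk with its own attempt counter and
// backoff, so the retry budget resets for every chunk instead of being
// shared across the whole upload.
func uploadChunk(ctx context.Context, send func(context.Context) error) error {
	bo := gax.Backoff{Initial: time.Second, Max: 30 * time.Second, Multiplier: 2}
	for attempts := 1; ; attempts++ {
		err := send(ctx)
		if err == nil {
			return nil
		}
		if attempts >= maxAttemptsPerChunk || !isRetryable(err) {
			return err
		}
		// Wait for the next backoff interval; bail out if ctx is canceled.
		if serr := gax.Sleep(ctx, bo.Pause()); serr != nil {
			return errors.Join(err, serr)
		}
	}
}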

// shouldRetry determines if a retry is necessary and if so waits the appropriate
// amount of time. It returns true if the error is retryable or the error to be
// surfaced to the user if not.
func (w *gRPCWriter) shouldRetry(ctx context.Context, err error) (bool, error) {
Contributor

Can we name this something else? Just confusing to conflate this with the ShouldRetry that only takes an err and returns a bool, especially since the backoff pause happens inside this func.

Contributor Author

Yeah, that's fair. I'll think of something better.
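
To illustrate the naming concern: the helper both classifies the error and performs the backoff pause, unlike a ShouldRetry-style predicate that only classifies. A rough sketch of splitting the two concerns, with illustrative names and a stand-in struct rather than the PR's actual gRPCWriter fields:

package retrysplit

import (
	"context"

	"github.com/googleapis/gax-go/v2"
)

// retryState is an illustrative stand-in for the writer's retry fields;
// it is not the actual struct used in this PR.
type retryState struct {
	backoff   gax.Backoff
	retryable func(error) bool // ShouldRetry-style predicate: error in, bool out
}

// classify only answers "is this error retryable?"; it never sleeps.
func (r *retryState) classify(err error) bool {
	return r.retryable(err)
}

// pause sleeps for the next backoff interval, honoring context
// cancellation, so the caller decides when the wait actually happens.
func (r *retryState) pause(ctx context.Context) error {
	return gax.Sleep(ctx, r.backoff.Pause())
}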

	}

	if retryable {
		p := w.backoff.Pause()
Contributor

As discussed, maybe we can add a mock test in this PR to ensure that backoff.Pause is actually called?

Contributor Author

I made a mock for the backoff and had to change more than anticipated to be able to inject it, PTAL
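
A rough sketch of the kind of injection being described; the interface name and fake type are assumptions, not the PR's actual code. The writer would depend on a small seam that *gax.Backoff already satisfies, and the test swaps in a recording fake to assert that Pause is actually called.

package backoffmock

import (
	"time"

	"github.com/googleapis/gax-go/v2"
)

// pauser is a hypothetical seam: anything with Pause() time.Duration.
// *gax.Backoff satisfies it, so production code can keep using gax.Backoff
// while tests inject a fake.
type pauser interface {
	Pause() time.Duration
}

// fakeBackoff records how many times Pause was called and returns zero
// pauses so tests run instantly.
type fakeBackoff struct {
	calls int
}

func (f *fakeBackoff) Pause() time.Duration {
	f.calls++
	return 0
}

// Compile-time checks that both the real and fake types fit the seam.
var (
	_ pauser = (*gax.Backoff)(nil)
	_ pauser = (*fakeBackoff)(nil)
)

A test would then wire fakeBackoff into the writer, force a retryable error through the write path, and assert that calls > 0.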

@tritone (Contributor) commented Nov 27, 2024

Also, are you sure your testbench is up to date? That could be a reason the ReadStallTimeout test doesn't work locally. It looks like it runs fine in Kokoro, so I am not concerned.

@BrennaEpp (Contributor Author)

Also, are you sure your testbench is up to date? That could be a reason the ReadStallTimeout test doesn't work locally. It looks like it runs fine in Kokoro, so I am not concerned.

It is up to date. It's possible it just doesn't run fast enough locally, or something like that; I'm also not concerned since it's passing in Kokoro.

@danielduhh (Contributor) left a comment

Can we run an e2e test to verify it fixes b/379925581?

@tritone (Contributor) left a comment

Looks good; one minor suggestion, but I think maybe just for future reference.

-func (w *gRPCWriter) shouldRetry(ctx context.Context, err error) (bool, error) {
+// retriable determines if a retry is necessary and if so returns a nil error;
+// otherwise it returns the error to be surfaced to the user.
+func (retry *uploadBufferRetryConfig) retriable(ctx context.Context, err error) error {
Contributor

It seems a little odd to have this return the error to be resurfaced. Maybe just have it return a bool, and you can use retry.lastErr to set the error to return if need be?

Also not a big deal if you want to leave this as-is for now, since we'll be overwriting with the refactored version anyway.

Contributor Author

That's fair; I think I may have been too deep in this. This does return a formatted error when max attempts are reached, though, which you wouldn't know outside this method unless you returned two bools... but I guess adding the number of attempts to the error may be useful in all cases anyway. I'll leave it as-is and consider this feedback where relevant to the other version.
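
For concreteness, a sketch of the bool-returning alternative being suggested, under the assumption that uploadBufferRetryConfig keeps an attempt count, a max, and a lastErr field (field names here are partly hypothetical):

package bufretry

import "fmt"

// Sketch of the bool-returning alternative: the error to surface
// (including the attempt count) lives on the config as lastErr.
type uploadBufferRetryConfig struct {
	attempts    int
	maxAttempts int
	lastErr     error
	shouldRetry func(error) bool
}

// retriable reports whether the caller should retry; when it returns
// false, the caller surfaces retry.lastErr.
func (retry *uploadBufferRetryConfig) retriable(err error) bool {
	retry.attempts++
	if !retry.shouldRetry(err) {
		retry.lastErr = err
		return false
	}
	if retry.attempts >= retry.maxAttempts {
		retry.lastErr = fmt.Errorf("retry failed after %d attempts; last error: %w", retry.attempts, err)
		return false
	}
	retry.lastErr = err
	return true
}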

	}
}

type MockWriteStream struct {
Contributor

This looks useful for testing other writer behavior; we should file a ticket to add it.
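
As a rough illustration of the pattern (the types below are simplified stand-ins, not the generated gRPC stream client or the MockWriteStream added in this PR): a fake stream records what the writer sends and serves scripted errors, which is what makes retry and backoff behavior testable without a server.

package streammock

// sendReq and sendResp stand in for the generated request/response
// message types; they are not the real storagepb types.
type sendReq struct{ data []byte }
type sendResp struct{ persistedSize int64 }

// mockWriteStream records every request the writer pushes and returns
// scripted errors from Recv before succeeding.
type mockWriteStream struct {
	sent     []sendReq // every request the writer sent
	recvErrs []error   // scripted errors returned from Recv, in order
	final    sendResp  // response returned once the error script is used up
}

func (m *mockWriteStream) Send(r sendReq) error {
	m.sent = append(m.sent, r)
	return nil
}

func (m *mockWriteStream) Recv() (sendResp, error) {
	if len(m.recvErrs) > 0 {
		err := m.recvErrs[0]
		m.recvErrs = m.recvErrs[1:]
		return sendResp{}, err
	}
	return m.final, nil
}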

@BrennaEpp enabled auto-merge (squash) on December 4, 2024 07:34
@BrennaEpp merged commit a7db927 into googleapis:main on Dec 4, 2024
8 checks passed