Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Limit minimal time poll.Wait* functions can wait between attempts #2543

Merged
merged 2 commits into from
Dec 19, 2023

Conversation

hairyhum
Copy link
Contributor

Change Overview

Currently calculation of sleep time can make it negative, which makes t.C channel return right away and does not give ctx time to finish on deadline.
This means that even if context is expired, retries still happen until there are no more retries.

Currently it's causing test failures in snapshot_test.go

The fix is to cap minimal sleep time at 5ms to make sure that context can perform expiration.

NOTE: we could make the minimal sleep smaller, but it's wouldn't make sense getting less than OS scheduling frequency.

Pull request type

Please check the type of change your PR introduces:

  • 🚧 Work in Progress
  • 🌈 Refactoring (no functional changes, no api changes)
  • 🐹 Trivial/Minor
  • 🐛 Bugfix
  • 🌻 Feature
  • 🗺️ Documentation
  • 🤖 Test

Test Plan

Added new test to poll_test.go, but it's not always catching the race condition with context.

  • 💪 Manual
  • ⚡ Unit test
  • 💚 E2E

@infraq infraq added this to In Progress in Kanister Dec 14, 2023
@hairyhum
Copy link
Contributor Author

Seems like this one needs an update linter from #2539

@@ -91,7 +91,11 @@ func WaitWithBackoffWithRetries(ctx context.Context, b backoff.Backoff, numRetri
sleep := b.Duration()
if deadline, ok := ctx.Deadline(); ok {
ctxSleep := time.Until(deadline)
sleep = minDuration(sleep, ctxSleep)
// We want to wait for smaller of backoff sleep and context sleep
Copy link
Contributor

@julio-lopez julio-lopez Dec 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: IIUC from the PR description the <-t.C option is (often? always?) selected in the select block below when sleep is negative, is that the case?

If the deadline has passed, it seems that a sensible behavior would be to respect the context deadline by returning an error right away and thus not executing the operation.

It seems that with this change, ctx.Done(): is going to be selected below when the context has expired, independent of how large sleep is, so in that case, why not make sleep positive (strictly greater than 0)? Would that be sufficient?
Or even simpler, why not return with an error when ctxSleep <= 0 ?

Copy link
Contributor Author

@hairyhum hairyhum Dec 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Essentially this is a race condition. If ctx already expired but did not perfom the cancel yet, so there's nothing in the ctx.Done() channel. At the same time if we have negative timer, t.C will always have a value in the channel.
Just means that positive sleep (in nanoseconds) doesn't really lets the context to finish in my tests.
We could just return a context.DeadlineExceeded error when ctxSleep <= 0, but then execution will continue in race with termination of the context. While technically correct (deadline did exceed), I don't know which consequences this might have.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the other hand additional 10ms of wait doesn't really add that much considering default backoff starts with 100ms and goes up from there (so context deadline in 10s of ms doesn't make much sense)

Kanister automation moved this from In Progress to Review Required Dec 15, 2023
pkg/poll/poll.go Outdated
Comment on lines 95 to 96
// but it has to be over os scheduling interval
// otherwise context doesn't have time to finish
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// but it has to be over os scheduling interval
// otherwise context doesn't have time to finish
// but it has to be > 0 to give ctx.Done() a chance
// to return below

Copy link
Contributor

@julio-lopez julio-lopez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please update the comment. See inline suggestion.

Kanister automation moved this from Review Required to Reviewer approved Dec 18, 2023
Currently calculation of sleep time can make it negative, which
makes `t.C` channel return right away and does not give ctx time
to finish on deadline.
This means that even if context is expired, retries still happen until
there are no more retries.

The fix is to cap minimal sleep time at 5ms to make sure that context
can perform expiration.

NOTE: we could make the minimal sleep smaller, but it's wouldn't make sense
getting less than OS scheduling frequency.
@mergify mergify bot merged commit 7629583 into master Dec 19, 2023
15 checks passed
Kanister automation moved this from Reviewer approved to Done Dec 19, 2023
@mergify mergify bot deleted the poll-min-sleep branch December 19, 2023 19:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
Development

Successfully merging this pull request may close these issues.

None yet

2 participants