sqlliveness: add timeouts to heartbeats #87533
Conversation
Force-pushed from 8eeebb5 to d305a16
```go
@@ -21,7 +22,8 @@ import (

// FakeStorage implements the sqlliveness.Storage interface.
type FakeStorage struct {
	mu struct {
		InsertSleep time.Duration
```
Sleeps are generally an anti-pattern in our repo. Especially sleeps that are intended to wait for something to happen. Anything like this would need to be well motivated. I have a hard time understanding how these sleeps fit into the test.
I think I'd prefer something more explicit, like having the FakeStorage send on a channel here. One approach is to use a chan chan struct{}: the Insert call sends a channel on that channel, then blocks until it receives from the sent channel. That way, we can be sure Insert does not make progress, and we don't need to rely on timing.
Thanks for pointing it out. Made changes and introduced the pattern you suggested.
```go
func (l *Instance) createSessionWithTimeout(
	ctx context.Context, timeout time.Duration,
) (*session, error) {
	var s *session
	var err error
	var createChan = make(chan struct{})
	go func() {
		s, err = l.createSession(ctx)
		close(createChan)
	}()

	t := timeutil.NewTimer()
	t.Reset(timeout)
	select {
	case <-t.C:
		return nil, &TimeoutError{}
	case <-createChan:
		return s, err
	}
}
```
- this seems complex
- This fails to actually cancel the underlying work, which, I think, is bad

Can we instead use context.WithTimeout and then detect the deadline exceeded error?
Changed to use context.WithTimeout
Force-pushed from d305a16 to d13363b
Force-pushed from 4822910 to 934e219
```go
@@ -103,6 +104,7 @@ func (s *session) invokeSessionExpiryCallbacks(ctx context.Context) {
	for _, callback := range s.mu.sessionExpiryCallbacks {
		callback(ctx)
	}
	log.Fatal(ctx, "sqlliveness session expired")
```
I don't think this is at all reasonable. The kv node expects to be robust to this failure and the sql pod installs a callback for exactly this purpose. Please remove this from this PR and look at those callbacks. If you want to change the behavior which happens in those callbacks, please do, but in a different PR.
Makes sense, I will add this to the callback and put it in a separate PR.
```go
@@ -237,6 +239,48 @@ func (l *Instance) extendSession(ctx context.Context, s *session) (bool, error)
	return true, nil
}

func (l *Instance) createSessionWithTimeout(
```
I'm still not quite getting it. The context cancellation should just flow through the underlying calls. Why are you creating these separate goroutines? It feels like all you want to do is construct the timeout context in the existing createSession and extendSession methods.
Please correct me if my understanding is wrong. From #85541, my understanding was that the underlying storage could hang for whatever reason. When this happens, it would hang the current goroutine (which is running the heartbeat), not really solving the initial problem. By doing this in a separate goroutine, the caller can exit when the context is canceled (after exceeding the deadline) even if the child is stuck on a call. Whenever the child recovers, it will see that the context is canceled and exit.
Force-pushed from 934e219 to 34bf5ef
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aadityasondhi, @abarganier, @ajwerner, and @joshimhoff)
pkg/sql/sqlliveness/slinstance/slinstance.go
line 242 at r3 (raw file):
Previously, aadityasondhi (Aaditya Sondhi) wrote…
Please correct me if my understanding is wrong. From #85541, my understanding was that the underlying storage could hang for whatever reason. When this happens, it would hang the current goroutine (which is running the heartbeat), not really solving the initial problem. By doing this in a separate goroutine, the caller can exit when the context is canceled (after exceeding the deadline) even if the child is stuck on a call. Whenever the child recovers, it will see that the context is canceled and exit.
It's a good deal of complexity. I think if we're going to cover that case we need to understand it and have tests which exercise the scenario. In general, my feeling is that we should rely on context cancellation achieving our goals down through the RPC stack. Just adding the timeout to the context would be a 99+% win in my book, whereas this feels like an unclear additional step.
Force-pushed from 0e443d6 to f356350
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @abarganier, @ajwerner, and @joshimhoff)
pkg/sql/sqlliveness/slinstance/slinstance.go
line 242 at r3 (raw file):
Previously, ajwerner wrote…
It's a good deal of complexity. I think if we're going to cover that case we need to understand it and have tests which exercise the scenario. In general, my feeling is that we should rely on context cancellation achieving our goals down through the RPC stack. Just adding the timeout to the context would be a 99+% win in my book, whereas this feels like an unclear additional step.
I see, that makes sense. I updated the code and tests to rely on context cancellation for exiting early.
Reviewed 2 of 5 files at r5, 3 of 3 files at r6, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aadityasondhi, @abarganier, @ajwerner, and @joshimhoff)
pkg/sql/sqlliveness/slinstance/slinstance.go
line 288 at r6 (raw file):
```go
// Unable to extend session due to unknown error.
// Clear and stop heartbeat loop.
log.Errorf(ctx, "sqlliveness failed to extend session: %v", err)
```
I wonder if this is the right policy. I know this isn't changed code. Really just asking the question.
pkg/sql/sqlliveness/slinstance/slinstance_test.go
line 155 at r6 (raw file):
```go
slinstance.DefaultTTL.Override(ctx, &settings.SV, 2*time.Millisecond)
// Must be shorter than the storage sleep amount below
slinstance.DefaultHeartBeat.Override(ctx, &settings.SV, 1*time.Millisecond)
```
In my experience, 1ms timers can cause trouble under race. You may want to move them up to something on the order of 10ms if the race detector is enabled.
pkg/sql/sqlliveness/slinstance/slinstance_test.go
line 162 at r6 (raw file):
```go
// verify that eventually session is created successfully
require.Eventually(
```
I think testutils.SucceedsSoon will both be easier to use here and will behave better under stress because it backs off its loop.
pkg/sql/sqlliveness/slstorage/test_helpers.go
line 24 at r6 (raw file):
```go
// FakeStorage implements the sqlliveness.Storage interface.
type FakeStorage struct {
	BlockCh chan struct{}
```
My guess is that you're going to need to put this BlockCh under the mutex and add a setter for it. It feels to me like the test is not synchronizing setting it with its use.
Force-pushed from a89e27f to 32c64d7
Force-pushed from 32c64d7 to 47c41e4
Force-pushed from a05af60 to 664f2c3
Force-pushed from 664f2c3 to b0add23
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @abarganier, @ajwerner, and @joshimhoff)
pkg/sql/sqlliveness/slinstance/slinstance.go
line 288 at r6 (raw file):
Previously, ajwerner wrote…
I wonder if this is the right policy. I know this isn't changed code. Really just asking the question.
I am not sure personally. Who would be the right team/person to ask?
pkg/sql/sqlliveness/slstorage/test_helpers.go
line 24 at r6 (raw file):
Previously, ajwerner wrote…
My guess is that you're going to need to put this BlockCh under the mutex and add a setter for it. It feels to me like the test is not synchronizing setting it with its use.
Good call, it did have a race condition. Thanks! Made the change in the latest revision.
pkg/sql/sqlliveness/slinstance/slinstance_test.go
line 155 at r6 (raw file):
Previously, ajwerner wrote…
In my experience, 1ms timers can cause trouble under race. You may want to move them up to something on the order of 10ms if the race detector is enabled.
Thanks for the pointer. Changed.
pkg/sql/sqlliveness/slinstance/slinstance_test.go
line 162 at r6 (raw file):
Previously, ajwerner wrote…
I think testutils.SucceedsSoon will both be easier to use here and will behave better under stress because it backs off its loop.
Thanks for the pointer. Changed.
Reviewed 2 of 4 files at r7.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aadityasondhi, @abarganier, @ajwerner, and @joshimhoff)
pkg/sql/sqlliveness/slstorage/test_helpers.go
line 52 at r7 (raw file):
```go
ctx context.Context, sid sqlliveness.SessionID, expiration hlc.Timestamp,
) error {
	s.mu.Lock()
	defer s.mu.Unlock()
```
is there an issue here if you block while holding the mutex? should you pull the channel out under the mutex and then block? I feel like this is a classic case where mutexes and channels don't compose well.
Force-pushed from b0add23 to 5bbf619
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @abarganier, @ajwerner, and @joshimhoff)
pkg/sql/sqlliveness/slstorage/test_helpers.go
line 52 at r7 (raw file):
Previously, ajwerner wrote…
is there an issue here if you block while holding the mutex? should you pull the channel out under the mutex and then block? I feel like this is a classic case where mutexes and channels don't compose well.
Thanks, fixed in the latest revision
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aadityasondhi, @abarganier, @ajwerner, and @joshimhoff)
pkg/sql/sqlliveness/slinstance/slinstance.go
line 285 at r6 (raw file):
```go
found, err = l.extendSession(ctx, s)
return err
}); err != nil && !errors.HasType(err, (*contextutil.TimeoutError)(nil)) {
```
It seems to me like if you hit the timeout, you don't know whether the session exists or not, so you probably should not clear the session -- you should keep retrying to extend the current session, no?
pkg/sql/sqlliveness/slinstance/slinstance.go
line 288 at r6 (raw file):
Previously, aadityasondhi (Aaditya Sondhi) wrote…
I am not sure personally. Who would be the right team/person to ask?
I think you also want to avoid logging if the ctx.Err() != nil because we shouldn't log during clean server shutdown.
pkg/sql/sqlliveness/slinstance/slinstance.go
line 285 at r10 (raw file):
```go
found, err = l.extendSession(ctx, s)
return err
}); err != nil && !errors.HasType(err, (*contextutil.TimeoutError)(nil)) {
```
I think you should also check if the context is canceled, which it will be during shutdown.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @abarganier, @ajwerner, and @joshimhoff)
pkg/sql/sqlliveness/slinstance/slinstance.go
line 285 at r6 (raw file):
Previously, ajwerner wrote…
It seems to me like if you hit the timeout, you don't know whether the session exists or not, so you probably should not clear the session -- you should keep retrying to extend the current session, no?
Yes, you are correct. Fixed the behavior and tests to match it.
pkg/sql/sqlliveness/slinstance/slinstance.go
line 288 at r6 (raw file):
Previously, ajwerner wrote…
I think you also want to avoid logging if the ctx.Err() != nil because we shouldn't log during clean server shutdown.
That makes sense. Changed.
pkg/sql/sqlliveness/slinstance/slinstance.go
line 285 at r10 (raw file):
Previously, ajwerner wrote…
I think you should also check if the context is canceled, which it will be during shutdown.
Thanks for pointing it out, fixed.
Force-pushed from 5bbf619 to 3b43176
Very close. My refactor is a nit. Feel free to take it or leave it and merge.
Reviewed 1 of 3 files at r9, 2 of 4 files at r11, 2 of 2 files at r12, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aadityasondhi, @abarganier, and @joshimhoff)
pkg/sql/sqlliveness/slinstance/slinstance.go
line 292 at r12 (raw file):
```go
return err
})
if err != nil {
```
This logic is getting a bit messy and hard to read. How about:
```go
switch {
case errors.HasType(err, (*contextutil.TimeoutError)(nil)):
	// Retry without clearing the session because we don't know the current status.
	t.Reset(0)
	continue
case err != nil && ctx.Err() == nil:
	log.Errorf(ctx, "sqlliveness failed to extend session: %v", err)
	fallthrough
case err != nil:
	// TODO(ajwerner): Decide whether we actually should exit the heartbeat
	// loop here if the context is not canceled. Consider the case of an
	// ambiguous result error: shouldn't we try again?
	l.clearSession(ctx)
	return
case !found:
	// No existing session found, immediately create one.
	l.clearSession(ctx)
	// Start next loop iteration immediately to insert a new session.
	t.Reset(0)
default:
	if log.V(2) {
		log.Infof(ctx, "extended SQL liveness session %s", s.ID())
	}
	t.Reset(l.hb())
}
```
Force-pushed from 3b43176 to 59a70ba
bors r=ajwerner
Build succeeded.
Thank you! Can we get a 22.1 backport? CC serverless is running 22.1. Major upgrades can take some time in practice. The lack of these HBs has led to non-actionable alerts hitting the pagers of our SREs.
Why do major upgrades lead to unavailability? I feel like I was missing some context on the motivation here.
Ah sorry for confusion! Major upgrades don't lead to unavailability. The issue fixed by this PR has nothing to do with major upgrades. CC serverless is simply running 22.1, and I expect it'll take some time to move to 22.2, based on past experience with serverless major upgrades. As a result, I'd love it if we backport this change into 22.1, not just 22.2, so as to avoid reoccurrence of the issue fixed by this PR.
Reviewable status: complete! 0 of 0 LGTMs obtained
pkg/sql/sqlliveness/slinstance/BUILD.bazel
line 40 at r14 (raw file):
```
"//pkg/sql/sqlliveness",
"//pkg/sql/sqlliveness/slstorage",
"//pkg/testutils",
```
for future reference, you definitely want to squash a diff like this into the prior commit. All commits must build.
Previously, sqlliveness heartbeat operations could block on the transactions that were involved. This change introduces some timeouts of the length of the heartbeat during the create and refresh operations.
Resolves #85541
Release note: None
Release justification: low-risk bugfix to existing functionality