Logs aren't clear about when SQL pod exits due to a sqlliveness session expiring #85540
Labels
A-observability-inf
C-bug
Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
O-sre
For issues SRE opened or otherwise cares about tracking.
Comments
joshimhoff added the C-bug and O-sre labels on Aug 3, 2022
aadityasondhi pushed a commit to aadityasondhi/cockroach that referenced this issue on Sep 7, 2022
Previously, sqlliveness heartbeat operations could block on the transactions involved. This change introduces timeouts equal to the heartbeat duration for the create and refresh operations. It also adds a `log.Fatal()` if the session expires while we are trying to extend it.

Resolves cockroachdb#85541
Resolves cockroachdb#85540
Release note: None
Release justification: low-risk bugfix to existing functionality
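A minimal sketch (standard library only) of the pattern this commit describes: bound each heartbeat attempt by the heartbeat interval, and crash loudly if the session expires while we are trying to extend it. All names here (`extendSession`, `hbInterval`, `expiration`) are hypothetical, not CockroachDB's actual sqlliveness API.

```go
package main

import (
	"context"
	"log"
	"time"
)

// extendSession stands in for the "extend" transaction; it respects the
// deadline carried by ctx.
func extendSession(ctx context.Context) error {
	select {
	case <-time.After(50 * time.Millisecond): // simulated KV write
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

func heartbeatLoop(ctx context.Context, hbInterval time.Duration, expiration time.Time) {
	ticker := time.NewTicker(hbInterval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			// Bound the attempt so a blocked transaction cannot stall
			// the loop past the session's expiration.
			attemptCtx, cancel := context.WithTimeout(ctx, hbInterval)
			err := extendSession(attemptCtx)
			cancel()
			if err != nil && time.Now().After(expiration) {
				// The session expired while we were extending it; there is
				// no safe way to keep serving, so fail fatally and loudly.
				log.Fatalf("sqlliveness session expired during extend: %v", err)
			}
		case <-ctx.Done():
			return
		}
	}
}

func main() {
	// Demo: a session expiring after 100ms with a 30ms heartbeat whose
	// extend attempts time out, so the loop ends in log.Fatalf.
	ctx, cancel := context.WithTimeout(context.Background(), 200*time.Millisecond)
	defer cancel()
	heartbeatLoop(ctx, 30*time.Millisecond, time.Now().Add(100*time.Millisecond))
}
```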
aadityasondhi pushed a commit to aadityasondhi/cockroach that referenced this issue on Sep 8, 2022
aadityasondhi pushed a commit to aadityasondhi/cockroach that referenced this issue on Sep 12, 2022
aadityasondhi added a commit to aadityasondhi/cockroach that referenced this issue on Sep 27, 2022
This is being added to improve observability for when a sqlliveness session expires. Previously, it was difficult to know whether the session ended due to expiry.

Resolves cockroachdb#85540
Release note: None
aadityasondhi added a commit to aadityasondhi/cockroach that referenced this issue on Sep 28, 2022
Previously, it was difficult to know whether the session ended due to expiry. This change passes an error to the shutdown logic through the `errChan`, so the process still exits gracefully, but with an error.

Resolves cockroachdb#85540
Release note: None
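A minimal sketch of that pattern, with hypothetical names (`runSQLPod`, `errChan`): instead of stopping silently, the liveness code sends the cause of death on a channel, and the top level exits gracefully but with a logged reason and a non-zero status.

```go
package main

import (
	"errors"
	"log"
	"os"
)

func runSQLPod(errChan chan<- error) {
	// ... the server runs; when the sqlliveness session expires, report
	// why we are shutting down instead of just stopping.
	errChan <- errors.New("sqlliveness session expired")
}

func main() {
	errChan := make(chan error, 1)
	go runSQLPod(errChan)

	// The shutdown logic waits for a reason; nil would mean a clean stop.
	if err := <-errChan; err != nil {
		log.Printf("shutting down: %v", err)
		os.Exit(1) // graceful, but visibly not a normal exit
	}
}
```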
aadityasondhi added a commit to aadityasondhi/cockroach that referenced this issue on Sep 28, 2022
aadityasondhi added a commit to aadityasondhi/cockroach that referenced this issue on Sep 29, 2022
aadityasondhi added a commit to aadityasondhi/cockroach that referenced this issue on Oct 17, 2022
In a previous patch, timeouts and retries were added to the sqlliveness heartbeat loop. This introduced cases where the session could move between valid, invalid, and valid states again. This patch introduces a tenant flag for the sqlliveness instance. When the flag is enabled, the heartbeat loop is terminated if there are any errors in the session create or extend calls. This patch also introduces an error channel, which is used to propagate sqlliveness errors to the server and cli level for better visibility into the reason for process termination.

Resolves cockroachdb#85540
Release note: None
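A sketch of the flag-gated behavior this commit describes, using illustrative names only (`shutdownOnSessionError`, `errChan`): with the flag on, any create/extend error ends the heartbeat loop and is propagated, instead of letting the session oscillate between valid and invalid.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// shutdownOnSessionError stands in for the per-tenant flag.
var shutdownOnSessionError = true

func heartbeatLoop(ctx context.Context, extend func(context.Context) error, errChan chan<- error) {
	for {
		if err := extend(ctx); err != nil {
			if shutdownOnSessionError {
				// Propagate the cause to the server/cli layer and stop.
				errChan <- fmt.Errorf("sqlliveness heartbeat failed: %w", err)
				return
			}
			continue // old behavior: retry, session may flap valid/invalid
		}
		// A real loop would sleep for the heartbeat interval here; one
		// successful extend is enough for this sketch.
		return
	}
}

func main() {
	errChan := make(chan error, 1)
	failing := func(context.Context) error { return errors.New("session expired") }
	go heartbeatLoop(context.Background(), failing, errChan)
	fmt.Println(<-errChan) // sqlliveness heartbeat failed: session expired
}
```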
andreimatei added a commit to andreimatei/cockroach that referenced this issue on Oct 18, 2022
server,sqlliveness: rework shutting down SQL pods on sqlliveness problems

This patch changes how stopping the stopper, and in turn terminating the process, happens for tenant SQL pods when the sqlliveness session expires. Before this patch, the liveness code was calling Stopper.Stop() directly. This patch replaces that with bubbling an event up to the cli layer, which has more complex shutdown logic: it prints progress as the stopper stops (or at least it will in cockroachdb#90143), and it also prints a message with the cause of the exit.

Fixes cockroachdb#85540
Release note: None
Epic: None
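A sketch of that "bubble an event up" design, with illustrative types (`shutdownRequest`, `requestShutdown`): modules ask for termination by sending a reason on a channel, and only the top-level cli code performs the actual shutdown and prints the cause.

```go
package main

import (
	"fmt"
	"os"
)

type shutdownRequest struct {
	reason string // surfaced to operators at exit time
}

type server struct {
	shutdownCh chan shutdownRequest
}

// requestShutdown is what the sqlliveness code would call instead of
// Stopper.Stop(); it never blocks and never tears anything down itself.
func (s *server) requestShutdown(reason string) {
	select {
	case s.shutdownCh <- shutdownRequest{reason: reason}:
	default: // a shutdown is already pending; the first reason wins
	}
}

func main() {
	s := &server{shutdownCh: make(chan shutdownRequest, 1)}
	go s.requestShutdown("sqlliveness session expired")

	// The cli layer owns the shutdown sequence: it could drain first,
	// then it logs the cause and exits non-zero.
	req := <-s.shutdownCh
	fmt.Fprintf(os.Stderr, "server terminating: %s\n", req.reason)
	os.Exit(1)
}
```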
andreimatei added a commit to andreimatei/cockroach that referenced this issue on Oct 19, 2022
server,sqlliveness: rework shutting down SQL pods on sqlliveness problems

This patch changes how stopping the stopper, and in turn terminating the process, happens for tenant SQL pods when the sqlliveness session expires. Before this patch, the liveness code was calling Stopper.Stop() directly. This patch replaces that with bubbling an event up to the cli layer, which is where the logic around terminating the process lives. That layer has more complex shutdown logic: it has the opportunity to perform draining if it wants (or at least it will in cockroachdb#90143), and it also prints a message with the cause of the exit.

An alternative considered was to extend the Stopper.Stop() interface somehow with a termination message, and to continue using the stopper as the mechanism by which arbitrary modules signal the desire to terminate the server. I've decided against that, though, in favor of introducing another mechanism for signaling requests to terminate up to the cli layer, on the argument that the stopper is passed around so widely, and used in so many non-production libraries, that it's hard to trace which production modules call Stop() on it and under which circumstances. Thus, the goal of this patch is to keep the number of callers to Stopper.Stop() small: right now there are two of them, one in the cli and one in the drainServer. In a followup patch, I might remove the drainServer call, and then we'd be left with exactly one caller to Stopper.Stop().

Fixes cockroachdb#85540
Release note: None
Epic: None
andreimatei added a commit to andreimatei/cockroach that referenced this issue on Oct 21, 2022
server,sqlliveness: rework shutting down SQL pods on sqlliveness problems

This patch changes how stopping the stopper, and in turn terminating the process, happens for tenant SQL pods when the sqlliveness session expires. Before this patch, the liveness code was calling Stopper.Stop() directly. This patch replaces that with bubbling an event up to the cli layer, which is where the logic around terminating the process lives. That layer has more complex shutdown logic: it has the opportunity to perform draining if it wants (or at least it will in cockroachdb#90143), and it also prints a message with the cause of the exit.

An alternative considered was to extend the Stopper.Stop() interface somehow with a termination message, and to continue using the stopper as the mechanism by which arbitrary modules signal the desire to terminate the server. I've decided against that, though, in favor of introducing another mechanism for signaling requests to terminate up to the cli layer, on the argument that the stopper is passed around so widely, and used in so many non-production libraries, that it's hard to trace which production modules call Stop() on it and under which circumstances. Also, stopper.Stop() deadlocks if called from inside a stopper task (and our code generally runs inside stopper tasks), so we'd also need to add a new async stop interface to the stopper. Thus, the goal of this patch is to keep the number of callers to Stopper.Stop() small: right now there are two of them, one in the cli and one in the drainServer. In a followup patch, I might remove the drainServer call, and then we'd be left with exactly one caller to Stopper.Stop().

Fixes cockroachdb#85540
Release note: None
Epic: None
andreimatei added a commit to andreimatei/cockroach that referenced this issue on Oct 24, 2022
andreimatei added a commit to andreimatei/cockroach that referenced this issue on Oct 24, 2022
andreimatei added a commit to andreimatei/cockroach that referenced this issue on Oct 25, 2022
andreimatei added a commit to andreimatei/cockroach that referenced this issue on Oct 26, 2022
craig bot pushed a commit that referenced this issue on Oct 27, 2022
90208: server,sqlliveness: rework shutting down SQL pods on sqlliveness problems r=andreimatei a=andreimatei

This patch changes how stopping the stopper, and in turn terminating the process, happens for tenant SQL pods when the sqlliveness session expires. Before this patch, the liveness code was calling Stopper.Stop() directly. This patch replaces that with bubbling an event up to the cli layer, which is where the logic around terminating the process lives. That layer has more complex shutdown logic: it has the opportunity to perform draining if it wants (or at least it will in #90143), and it also prints a message with the cause of the exit.

An alternative considered was to extend the Stopper.Stop() interface somehow with a termination message, and to continue using the stopper as the mechanism by which arbitrary modules signal the desire to terminate the server. I've decided against that, though, in favor of introducing another mechanism for signaling requests to terminate up to the cli layer, on the argument that the stopper is passed around so widely, and used in so many non-production libraries, that it's hard to trace which production modules call Stop() on it and under which circumstances. Also, stopper.Stop() deadlocks if called from inside a stopper task (and our code generally runs inside stopper tasks), so we'd also need to add a new async stop interface to the stopper. Thus, the goal of this patch is to keep the number of callers to Stopper.Stop() small: right now there are two of them, one in the cli and one in the drainServer. In a followup patch, I might remove the drainServer call, and then we'd be left with exactly one caller to Stopper.Stop().

Fixes #85540
Release note: None
Epic: None

Co-authored-by: Andrei Matei <andrei@cockroachlabs.com>
Describe the problem
The logs aren't clear about when a SQL pod exits due to a sqlliveness session expiring. See the code here:
cockroach/pkg/sql/sqlinstance/instanceprovider/instanceprovider.go, lines 174 to 180 at 71e1af4
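A hedged paraphrase of the referenced logic, not the actual CockroachDB code (all names here are illustrative stand-ins): on session expiry the instance provider stops the stopper, and nothing records why the process is going away.

```go
package main

import "context"

type stopper struct{}

func (s *stopper) Stop(ctx context.Context) {} // stand-in for stop.Stopper.Stop

type instanceProvider struct{ stopper *stopper }

func (p *instanceProvider) onSessionExpired(ctx context.Context) {
	// Problem: no log line explains the shutdown, and the process can
	// wind down with a zero exit status as if nothing went wrong.
	p.stopper.Stop(ctx)
}

func main() {
	(&instanceProvider{stopper: &stopper{}}).onSessionExpired(context.Background())
}
```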
To Reproduce
N/A
Expected behavior
CRDB should log the shutdown reason at fatal severity. CRDB should also probably return a non-zero exit code, since sqlliveness sessions are not expected to expire.
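What logging at fatal severity plus a non-zero exit looks like with the Go standard library; CockroachDB's own util/log package provides a similar `Fatalf` (which additionally takes a context), but the effect, a clear log line plus a failure status, is the same idea.

```go
package main

import "log"

func main() {
	// log.Fatalf writes the message and then calls os.Exit(1), so an
	// orchestrator sees both the reason and a failure status.
	log.Fatalf("sqlliveness session expired; terminating SQL pod")
}
```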
Additional data / screenshots
Link to a related internal support issue: https://github.com/cockroachlabs/support/issues/1736
Jira issue: CRDB-18301