sqlliveness: no deadlines on heartbeats #85541
Comments
Previously, sqlliveness heartbeat operations could block indefinitely on the transactions involved. This change introduces timeouts, equal in length to the heartbeat interval, on the create and refresh operations. TODO: tests TODO: is this the right way and place to introduce the timeouts? Resolves cockroachdb#85541 Release note: None
Previously, sqlliveness heartbeat operations could block indefinitely on the transactions involved. This change introduces timeouts, equal in length to the heartbeat interval, on the create and refresh operations. Resolves cockroachdb#85541 Release note: None Release justification: low-risk bugfix to existing functionality
Previously, sqlliveness heartbeat operations could block indefinitely on the transactions involved. This change introduces timeouts, equal in length to the heartbeat interval, on the create and refresh operations. It also adds a `log.Fatal()` if the session expires while trying to extend it. Resolves cockroachdb#85541 Resolves cockroachdb#85540 Release note: None Release justification: low-risk bugfix to existing functionality
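As an illustration only, here is a minimal Go sketch of the kind of per-operation timeout the commits above describe, using just the standard library. `heartbeatDuration`, `extendSession`, and `expiration` are hypothetical stand-ins, not the actual sqlliveness API:

```go
package heartbeat

import (
	"context"
	"fmt"
	"log"
	"time"
)

// heartbeatDuration stands in for the configured heartbeat interval.
const heartbeatDuration = 5 * time.Second

// withHeartbeatTimeout bounds a single create/refresh attempt by the
// heartbeat interval so a hung transaction cannot block the loop forever.
func withHeartbeatTimeout(ctx context.Context, op func(context.Context) error) error {
	ctx, cancel := context.WithTimeout(ctx, heartbeatDuration)
	defer cancel()
	return op(ctx)
}

// heartbeatLoop periodically extends the session, bounding each attempt
// and fatally exiting if the session has expired anyway, mirroring the
// commit's log.Fatal().
func heartbeatLoop(
	ctx context.Context,
	expiration func() time.Time,
	extendSession func(context.Context) error,
) {
	ticker := time.NewTicker(heartbeatDuration)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			if err := withHeartbeatTimeout(ctx, extendSession); err != nil {
				// A timed-out attempt is abandoned; the next tick retries.
				fmt.Printf("heartbeat attempt failed: %v\n", err)
			}
			if time.Now().After(expiration()) {
				log.Fatal("sqlliveness session expired while trying to extend it")
			}
		}
	}
}
```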
87533: sqlliveness: add timeouts to heartbeats r=ajwerner a=aadityasondhi

Previously, sqlliveness heartbeat operations could block indefinitely on the transactions involved. This change introduces timeouts, equal in length to the heartbeat interval, on the create and refresh operations. Resolves #85541 Release note: None Release justification: low-risk bugfix to existing functionality

88293: backupccl: elide expensive ShowCreate call in SHOW BACKUP r=stevendanna a=adityamaru

In #88376 we see the call to `ShowCreate` taking ~all the time on a cluster with 2.5K empty tables. In all cases except `SHOW BACKUP SCHEMAS` we do not need to construct the SQL representation of the table's schema. This results in a marked improvement in the performance of `SHOW BACKUP`, as can be seen in #88376 (comment). Fixes: #88376 Release note (performance improvement): `SHOW BACKUP` on a backup containing several table descriptors is now more performant

88471: sql/schemachanger: plumb context, check for cancelation sometimes r=ajwerner a=ajwerner

Fixes #87246 This will also improve tracing. Release note: None

88557: testserver: add ShareMostTestingKnobsWithTenant option r=msbutler a=stevendanna

The new ShareMostTestingKnobs copies nearly all of the testing knobs specified for a TestServer to any tenant started for that server. The goal here is to make it easier to write tests that depend on testing hooks that work under probabilistic tenant testing. Release justification: non-production code change Release note: None

88562: upgrade grpc to v1.49.0 r=erikgrinaker a=pavelkalinnikov

Fixes #81881 Touches #72083 Release note: upgraded grpc to v1.49.0 to fix a few panics that the old version caused

88568: sql: fix panic due to missing schema r=ajwerner a=ajwerner

A schema might not exist because it has been dropped. We need to mark the lookup as required. Fixes #87895 Release note (bug fix): Fixed a bug in pg_catalog tables which could result in an internal error if a schema is concurrently dropped.

Co-authored-by: David Hartunian <davidh@cockroachlabs.com>
Co-authored-by: Aaditya Sondhi <aadityas@cockroachlabs.com>
Co-authored-by: adityamaru <adityamaru@gmail.com>
Co-authored-by: Andrew Werner <awerner32@gmail.com>
Co-authored-by: Steven Danna <danna@cockroachlabs.com>
Co-authored-by: Pavel Kalinnikov <pavel@cockroachlabs.com>
@joshimhoff, the timeout that was introduced as a result of this issue being filed has since been removed because its implementation was buggy. Further, we couldn't convince ourselves that timeouts there serve any purpose - why would one want to fail the query that's trying to heartbeat the session record, only for it to be retried, as opposed to letting it hang? If you have thoughts, feel free to reopen the issue.
I don't think it's about o11y. I think it's about reliability. The ticket description re: this is admittedly a bit vague. I include it down below to see if we can make the argument more concrete.
I think we can rephrase ^^ as a question: Are there situations that happen in production where, without deadlines & retries, a hung heartbeat will stay hung forever, leading to a SQL pod crash? Unless we are very sure the answer is no, I think setting a deadline plus doing retries is a good pattern, in particular since the consequence of session expiration is currently a SQL pod crashing. CC @ajwerner in case he has thoughts.
A heartbeat query can hang when faced with various outages, but in those situations the retries will hang as well - so I don't think they serve any purpose other than to muddy the code. You can imagine bugs that would cause a retry to succeed where the original query is hanging, but I don't think we want to protect against such things.
By "bug" do you mean a bug in the CRDB code, or do you mean certain poor infra conditions, e.g. blips in the network? I think we certainly want to be robust to the latter. Re: the former, I think we do too, but I can see that being a more contentious view.
I mean a bug in CRDB code. I don't think sprinkling timeouts in more or less random places to protect against hypothetical bugs is a good general policy. To go to pedantry, in my opinion, as far as most code is concerned, the only thing that matters when a bug is present is how easy it is to find the bug - so things that help debugging are good; things that hide bugs are bad; and things that transparently recover from bugs are generally not worth it. In this case, the timeout we're talking about, at least in the straightforward way to implement it, would destroy the state of the stuck operations (i.e. the stack traces involved are gone).
We don't have the same view about the utility of higher-level controls reducing the impact of bugs in lower-level code, but the main reason I want a timeout in place is NOT to protect against bugs in the CRDB code. I want it to keep poor infra conditions, e.g. in the network, from leading to SQL pod crashes. I guess the basic idea here is that there are infra conditions that happen in practice that lead to a hung heartbeat that would succeed on retry. I don't know if that is actually a real thing (e.g. TCP is already doing retries). As I think more about it, I realize I hadn't introspected deeply on why I want timeouts until this discussion, so thanks for having it with me. At the same time, there is a more basic argument, though, which is that we should not crash the server unless that actually benefits the user in some way. In the case of a stuck SQL liveness heartbeat, crashing the SQL pod does nothing useful for the user, and it does reduce reliability, at the very least by slowing down recovery (restart takes time, esp. in some environments) and breaking a bunch of stateful SQL connections. I am interested in hearing what some other people think about this, as I realize my view is a bit uncertain.
I don't think it's a real thing. I agree that killing the process on session expiration is a bad idea. The only tiny bit written about it that I can find is here. I have failed to convince others to rethink this, and I didn't see it within my power to change when I was touching this code (and I was touching it, btw, to fix bugs partially introduced by this timeout having been added). Killing the process is unfortunately tied to other decisions about how the sql instance IDs are allocated - other instances are quick to steal the ID corresponding to an expired session, and having multiple servers use the same ID is not good for various reasons.
We can definitely do more to avoid killing the process. I don't think that work is particularly related to the discussion on timeouts. |
I think this is possible if the network is acting up and one TCP connection is horked but the packets are still flowing and new connections can be created. This is why we have RPC-level heartbeats on our inter-node connections. Are those being used for SQL pods? I think they should be. Without them we're reliant on TCP keepalives, which IIRC have limited configurability and long timeouts. Other than network-level issues, which are better addressed with network-level heartbeats and timeouts, I think most things that would cause one heartbeat to hang would also cause the next attempt to hang, so there's little value in using a timeout to retry the exact same request. However, AIUI, it's not the exact same request - the update includes an expiration time which was in the future when the attempt started. If that time is now in the past, it's better to cancel the previous attempt and start over. I don't think that's going to actually make much difference, because the new attempt will still hang, but at least it means that things have a chance of working as soon as the blockage is cleared, instead of leaving the now-useless original request in flight. We implemented timeouts for this reason in node liveness, and I think it's turned out to be important there. So it seems like it's probably going to be worthwhile to do something similar for sqlliveness.
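To make that point concrete, here is a rough sketch (not the actual node liveness or sqlliveness code) of a retry loop that cancels the stale attempt and recomputes the expiration before each new attempt; `writeExpiration` and `heartbeatInterval` are assumed, illustrative names:

```go
package heartbeat

import (
	"context"
	"time"
)

// extendWithFreshExpiration retries the expiration update, giving each
// attempt its own deadline and recomputing the target expiration before
// each attempt, so a blocked request is abandoned rather than left in
// flight carrying an expiration that may already be in the past.
func extendWithFreshExpiration(
	ctx context.Context,
	heartbeatInterval time.Duration,
	writeExpiration func(ctx context.Context, exp time.Time) error,
) error {
	for {
		// Recompute the expiration so that, if this attempt succeeds, the
		// value written is still in the future.
		exp := time.Now().Add(3 * heartbeatInterval)

		attemptCtx, cancel := context.WithTimeout(ctx, heartbeatInterval)
		err := writeExpiration(attemptCtx, exp)
		cancel() // drop the stale attempt instead of leaving it in flight
		if err == nil {
			return nil
		}
		if ctx.Err() != nil {
			// The caller's context is done; stop retrying.
			return ctx.Err()
		}
		// Otherwise retry with a fresh expiration on the next iteration.
	}
}
```

The new attempt may well hang for the same reason the old one did, but because each request carries a freshly computed expiration, it has a chance of being useful as soon as the blockage clears.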
Yes, the connections with the SQL pods use the same heartbeat logic.
The context is different between this sql liveness session that's being heartbeated here, and node liveness. For node liveness, recovering as soon as possible is pretty important: if the liveness range is only able to process one hb every, say, 10s, that one hb that it eventually processes better be one that's not already expired. (Although flapping between alive and expired is probably worse than being consistently expired, so I question even that timeout).
How I feel about this timeout is also influenced by the fact that, when we added this timeout, we also introduced a fat bug. I take that as proof that this kind of complication to the code should not be added gratuitously.
It seems to me like the order of operations here should be to make it unlikely that the process needs to shut down upon expiration first, and then to talk about adding a timeout. |
Appreciate the color from everyone; it's clarifying.
Seems reasonable, esp. given (IIUC) the point Andrei is making about there being little time between expiration and server shutdown.
Alternatively, that code is too complex, with the proof being that we introduced a bug by adding timeouts ;)
Yes. We'd need to make session expiration non-fatal before adding a timeout would make sense.
Thanks all. Closing...
Describe the problem
From what I can see, there are no deadlines on sqlliveness heartbeats. This means a single hung heartbeat will expire a sqlliveness session. A hung heartbeat can happen for various reasons, e.g. temporary disk overload. Right now, multi-tenant SQL pods crash if a sqlliveness session expires. So a single hung heartbeat can actually crash a multi-tenant SQL pod. I believe we have observed this in practice, though I'm only medium confident about that claim (link to internal support issue: https://github.com/cockroachlabs/support/issues/1736).
To Reproduce
N/A
Expected behavior
There should be deadlines on sqlliveness heartbeats.
Additional data / screenshots
Link to internal support issue: https://github.com/cockroachlabs/support/issues/1736
Jira issue: CRDB-18302