sqlliveness: no deadlines on heartbeats #85541
Comments
Previously, sqlliveness heartbeat operations could block indefinitely on the transactions involved. This change introduces timeouts, equal in length to the heartbeat interval, on the create and refresh operations. TODO: tests TODO: is this the right way and place to introduce the timeouts? Resolves cockroachdb#85541 Release note: None
Previously, sqlliveness heartbeat operations could block indefinitely on the transactions involved. This change introduces timeouts, equal in length to the heartbeat interval, on the create and refresh operations. Resolves cockroachdb#85541 Release note: None Release justification: low-risk bugfix to existing functionality
Previously, sqlliveness heartbeat operations could block indefinitely on the transactions involved. This change introduces timeouts, equal in length to the heartbeat interval, on the create and refresh operations. It also adds a `log.Fatal()` if the session expires while trying to extend it. Resolves cockroachdb#85541 Resolves cockroachdb#85540 Release note: None Release justification: low-risk bugfix to existing functionality
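As an illustration only, here is a minimal Go sketch of the kind of per-operation timeout the commits above describe, using just the standard library. `heartbeatDuration`, `extendSession`, and `expiration` are hypothetical stand-ins, not the actual sqlliveness API:

```go
package heartbeat

import (
	"context"
	"fmt"
	"log"
	"time"
)

// heartbeatDuration stands in for the configured heartbeat interval.
const heartbeatDuration = 5 * time.Second

// withHeartbeatTimeout bounds a single create/refresh attempt by the
// heartbeat interval so a hung transaction cannot block the loop forever.
func withHeartbeatTimeout(ctx context.Context, op func(context.Context) error) error {
	ctx, cancel := context.WithTimeout(ctx, heartbeatDuration)
	defer cancel()
	return op(ctx)
}

// heartbeatLoop periodically extends the session, bounding each attempt
// and fatally exiting if the session has expired anyway, mirroring the
// commit's log.Fatal().
func heartbeatLoop(
	ctx context.Context,
	expiration func() time.Time,
	extendSession func(context.Context) error,
) {
	ticker := time.NewTicker(heartbeatDuration)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			if err := withHeartbeatTimeout(ctx, extendSession); err != nil {
				// A timed-out attempt is abandoned; the next tick retries.
				fmt.Printf("heartbeat attempt failed: %v\n", err)
			}
			if time.Now().After(expiration()) {
				log.Fatal("sqlliveness session expired while trying to extend it")
			}
		}
	}
}
```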
87533: sqlliveness: add timeouts to heartbeats r=ajwerner a=aadityasondhi

Previously, sqlliveness heartbeat operations could block indefinitely on the transactions involved. This change introduces timeouts, equal in length to the heartbeat interval, on the create and refresh operations. Resolves #85541 Release note: None Release justification: low-risk bugfix to existing functionality

88293: backupccl: elide expensive ShowCreate call in SHOW BACKUP r=stevendanna a=adityamaru

In #88376 we see the call to `ShowCreate` taking ~all the time on a cluster with 2.5K empty tables. In all cases except `SHOW BACKUP SCHEMAS` we do not need to construct the SQL representation of the table's schema. This results in a marked improvement in the performance of `SHOW BACKUP`, as can be seen in #88376 (comment). Fixes: #88376 Release note (performance improvement): `SHOW BACKUP` on a backup containing several table descriptors is now more performant

88471: sql/schemachanger: plumb context, check for cancelation sometimes r=ajwerner a=ajwerner

Fixes #87246 This will also improve tracing. Release note: None

88557: testserver: add ShareMostTestingKnobsWithTenant option r=msbutler a=stevendanna

The new ShareMostTestingKnobs copies nearly all of the testing knobs specified for a TestServer to any tenant started for that server. The goal here is to make it easier to write tests that depend on testing hooks that work under probabilistic tenant testing. Release justification: non-production code change Release note: None

88562: upgrade grpc to v1.49.0 r=erikgrinaker a=pavelkalinnikov

Fixes #81881 Touches #72083 Release note: upgraded grpc to v1.49.0 to fix a few panics that the old version caused

88568: sql: fix panic due to missing schema r=ajwerner a=ajwerner

A schema might not exist because it has been dropped. We need to mark the lookup as required. Fixes #87895 Release note (bug fix): Fixed a bug in pg_catalog tables which could result in an internal error if a schema is concurrently dropped.

Co-authored-by: David Hartunian <davidh@cockroachlabs.com>
Co-authored-by: Aaditya Sondhi <aadityas@cockroachlabs.com>
Co-authored-by: adityamaru <adityamaru@gmail.com>
Co-authored-by: Andrew Werner <awerner32@gmail.com>
Co-authored-by: Steven Danna <danna@cockroachlabs.com>
Co-authored-by: Pavel Kalinnikov <pavel@cockroachlabs.com>
@joshimhoff, the timeout that was introduced as a result of this issue being filed has since been removed because its implementation was buggy. Further, we couldn't convince ourselves that timeouts there serve any purpose - why would one want to fail the query that's trying to heartbeat the session record, only for it to be retried, as opposed to letting it hang? If you have thoughts, feel free to reopen the issue.
I don't think it's about o11y. I think it's about reliability. The ticket description re: this is admittedly a bit vague. I include it down below to see if we can make the argument more concrete.
I think we can rephrase ^^ as a question: Are there situations that happen in production where, without deadlines & retries, a hung heartbeat will stay hung forever, leading to a SQL pod crash? Unless we are very sure the answer is no, I think setting a deadline plus doing retries is a good pattern, in particular since the consequence of session expiration is currently a SQL pod crashing. CC @ajwerner in case he has thoughts.
A heartbeat query can hang when faced with various outages, but in those situations the retries will hang as well - so I don't think they serve any purpose other than to muddy the code. You can imagine bugs that would cause a retry to succeed where the original query is hanging, but I don't think we want to protect against such things.
By "bug" do you mean a bug in the CRDB code, or do you mean certain poor infra conditions, e.g. blips in the network? I think we certainly want to be robust to the latter. Re: the former, I think we do too, but I can see that being a more contentious view.
I mean a bug in CRDB code. I don't think sprinkling timeouts in more or less random places to protect against hypothetical bugs is a good general policy. To go to pedantry, in my opinion, as far as most code is concerned, the only thing that matters when a bug is present is how easy it is to find the bug - so things that help debugging are good; things that hide bugs are bad; and things that transparently recover from bugs are generally not worth it. In this case, the timeout we're talking about, at least in the straightforward way to implement it, would destroy the state of the stuck operations (i.e. the stack traces involved are gone).
We don't have the same view about the utility of higher-level controls reducing the impact of bugs in lower-level code, but the main reason I want a timeout in place is NOT to protect against bugs in the CRDB code. I want it to keep poor infra conditions, e.g. in the network, from leading to SQL pod crashes. I guess the basic idea here is that there are infra conditions that happen in practice that lead to a hung heartbeat that would succeed on retry. I don't know if that is actually a real thing (e.g. TCP is already doing retries). As I think more about it, I realize I hadn't introspected deeply on why I want timeouts until this discussion, so thanks for having it with me. At the same time, there is a more basic argument, though, which is that we should not crash the server unless that actually benefits the user in some way. In the case of a stuck SQL liveness heartbeat, crashing the SQL pod does nothing useful for the user, and it does reduce reliability, at the very least by slowing down recovery (restart takes time, esp. in some environments) and breaking a bunch of stateful SQL connections. I am interested in hearing what some other people think about this, as I realize my view is a bit uncertain.
I don't think it's a real thing. I agree that killing the process on session expiration is a bad idea. The only tiny bit written about it that I can find is here. I have failed to convince others to rethink this, and I didn't see it within my power to change when I was touching this code (and I was touching it, btw, to fix bugs partially introduced by this timeout having been added). Killing the process is unfortunately tied to other decisions about how the sql instance IDs are allocated - other instances are quick to steal the ID corresponding to an expired session, and having multiple servers use the same ID is not good for various reasons.
We can definitely do more to avoid killing the process. I don't think that work is particularly related to the discussion on timeouts. |
I think this is possible if the network is acting up and one TCP connection is horked but the packets are still flowing and new connections can be created. This is why we have RPC-level heartbeats on our inter-node connections. Are those being used for SQL pods? I think they should be. Without them we're reliant on TCP keepalives, which IIRC have limited configurability and long timeouts. Other than network-level issues, which are better addressed with network-level heartbeats and timeouts, I think most things that would cause one heartbeat to hang would also cause the next attempt to hang, so there's little value in using a timeout to retry the exact same request. However, AIUI, it's not the exact same request - the update includes an expiration time which was in the future when the attempt started. If that time is now in the past, it's better to cancel the previous attempt and start over. I don't think that's going to actually make much difference, because the new attempt will still hang, but at least it means that things have a chance of working as soon as the blockage is cleared, instead of leaving the now-useless original request in flight. We implemented timeouts for this reason in node liveness, and I think it's turned out to be important there. So it seems like it's probably going to be worthwhile to do something similar for sqlliveness.
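To make that point concrete, here is a rough sketch (not the actual node liveness or sqlliveness code) of a retry loop that cancels the stale attempt and recomputes the expiration before each new attempt; `writeExpiration` and `heartbeatInterval` are assumed, illustrative names:

```go
package heartbeat

import (
	"context"
	"time"
)

// extendWithFreshExpiration retries the expiration update, giving each
// attempt its own deadline and recomputing the target expiration before
// each attempt, so a blocked request is abandoned rather than left in
// flight carrying an expiration that may already be in the past.
func extendWithFreshExpiration(
	ctx context.Context,
	heartbeatInterval time.Duration,
	writeExpiration func(ctx context.Context, exp time.Time) error,
) error {
	for {
		// Recompute the expiration so that, if this attempt succeeds, the
		// value written is still in the future.
		exp := time.Now().Add(3 * heartbeatInterval)

		attemptCtx, cancel := context.WithTimeout(ctx, heartbeatInterval)
		err := writeExpiration(attemptCtx, exp)
		cancel() // drop the stale attempt instead of leaving it in flight
		if err == nil {
			return nil
		}
		if ctx.Err() != nil {
			// The caller's context is done; stop retrying.
			return ctx.Err()
		}
		// Otherwise retry with a fresh expiration on the next iteration.
	}
}
```

The new attempt may well hang for the same reason the old one did, but because each request carries a freshly computed expiration, it has a chance of being useful as soon as the blockage clears.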
Yes, the connections with the SQL pods use the same heartbeat logic.
The context is different between this sql liveness session that's being heartbeated here, and node liveness. For node liveness, recovering as soon as possible is pretty important: if the liveness range is only able to process one hb every, say, 10s, that one hb that it eventually processes better be one that's not already expired. (Although flapping between alive and expired is probably worse than being consistently expired, so I question even that timeout).
How I feel about this timeout is also influenced by the fact that, when we added this timeout, we also introduced a fat bug. I take that as proof that this kind of complication to the code should not be added gratuitously.
It seems to me like the order of operations here should be to make it unlikely that the process needs to shut down upon expiration first, and then to talk about adding a timeout. |
Appreciate the color from everyone; it's clarifying.
Seems reasonable, esp. given (IIUC) the point Andrei is making about there being little time between expiration and server shutdown.
Alternatively, that code is too complex, with the proof being that we introduced a bug by adding timeouts ;)
Yes. We'd need to make session expiration non-fatal before adding a timeout would make sense.
Thanks all. Closing...
Describe the problem
From what I can see, there are no deadlines on sqlliveness heartbeats. This means a single hung heartbeat will expire a sqlliveness session. A hung heartbeat can happen for various reasons, e.g. temporary disk overload. Right now, multi-tenant SQL pods crash if a sqlliveness session expires. So a single hung heartbeat can actually crash a multi-tenant SQL pod. I believe we have observed this in practice, though I'm only medium confident about that claim (link to internal support issue: https://github.com/cockroachlabs/support/issues/1736).
To Reproduce
N/A
Expected behavior
There should be deadlines on sqlliveness heartbeats.
Additional data / screenshots
Link to internal support issue: https://github.com/cockroachlabs/support/issues/1736
Jira issue: CRDB-18302