-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
roachtest/mixedversion: WaitForReplication
must ensure SQL is ready
#138109
base: master
Are you sure you want to change the base?
Conversation
10 runs using the SEED (
|
0ee60c8
to
fdbbbb0
Compare
// WaitForReady waits until the given nodes report ready via health checks. | ||
// This implies that the node has completed server startup, is heartbeating its | ||
// liveness record, and can serve SQL clients. | ||
// FIXME(srosenberg): This function is a bit of a misnomer. It doesn't actually ensure that SQL is ready to serve, only |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@andrewbaptist Just a heads-up since I believe this helper is used only by the tests you authored. While the Admin endpoint may be up, it doesn't guarantee that SQL is ready. The current implementation may be "ok" for these tests, but it's more pronounced in shared-process, multi-tenant configurations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Less of an issue now after the last iteration, which unconditionally calls WaitForSQLReady
, for every started db node.
0d880fb
to
d89b8e1
Compare
|
Previously, mixedversion roachtests would occassionally fail while trying to execute SQL against a freshly started node; specifically, `get-user-session` would time out. The precise root cause is described in [1]. As of [2], `WaitForReplication` is the default used by all mixedversion roachtests. Thus, it suffices to ensure that "SQL is ready" before `WaitForReplication` is invoked. However, a number of roachtests don't opt into `StartOpts.WaitForReplication`, which makes them susceptible to the same failure mode, e.g., [3]. Thus, we unconditionally `WaitForSQLReady`, for the system tenant, on every started DB node. [1] cockroachdb#137988 [2] cockroachdb#136607 [3] cockroachdb#137382 Fixes: cockroachdb#137988 Informs: cockroachdb#137332 Epic: none Release note: None
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
if !startOpts.RoachprodOpts.SkipInit { | ||
// Wait for SQL to be ready on all nodes, for 'system' tenant, only. | ||
for _, n := range nodes { | ||
conn, err := c.ConnE(ctx, l, nodes[0], option.VirtualClusterName("system")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: install.SystemInterfaceName
instead of "system"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, I'll update.
// N.B. We must explicitly pass the virtual cluster name to `ConnE`, otherwise the default may turn out to be a | ||
// secondary tenant, in which case we would only check the tenant's key range, not the whole system's. | ||
// See "Unhidden Bug" in https://github.com/cockroachdb/cockroach/issues/137988 | ||
conn, err := c.ConnE(ctx, l, nodes[0], option.VirtualClusterName("system")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: same as above
nodes := selectedNodesOrDefault(opts, c.CRDBNodes()) | ||
// N.B. If `SkipInit` is set, we don't wait for SQL since node(s) may not join the cluster in any definite time. | ||
if !startOpts.RoachprodOpts.SkipInit { | ||
// Wait for SQL to be ready on all nodes, for 'system' tenant, only. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should something similar be added for the virtual cluster start counterpart: StartServiceForVirtualClusterE
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I guess it's not needed because we're mainly waiting for the privileges cache to be ready which is at the system level?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Technically, yes, we do need to wait for shared-process tenants. However, there could be any number of them, e.g.,
I241212 11:19:11.321291 53045 sql/syntheticprivilegecache/cache.go:198 â‹® [T3,Vmixed-version-tenant-mbqzz,n3] 145 warmed privileges for virtual tables in 10.007996435s
I thought it's probably best to tackle them in the mixedversion framework, since the above code already feels heavy wrt additional waiting. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
However, there could be any number of them, e.g.
But don't you just need to wait for the tenant you are starting? Which it should already know from startOpts.RoachprodOpts.VirtualClusterName
?
since the above code already feels heavy wrt additional waiting
Yeah I was thinking the same thing, especially since the first SQL check is redundant if WaitForReplicationFactor > 0
since WaitForReplication
is also checking if SQL is ready. No arguments from me against doing it in the mixed version framework.
Previously, mixedversion roachtests would occassionally fail while trying to execute SQL against a freshly started node; specifically,
get-user-session
would time out. The precise root cause is described in [1].As of [2],
WaitForReplication
is the default used by all mixedversion roachtests. Thus, it sufficesto ensure that "SQL is ready" before
WaitForReplication
is invoked.However, a number of roachtests don't opt into
StartOpts.WaitForReplication
, which makes themsusceptible to the same failure mode, e.g., [3].
Thus, we unconditionally
WaitForSQLReady
, forthe system tenant, on every started DB node.
[1] #137988
[2] #136607
[3] #137382
Fixes: #137988
Informs: #137332
Epic: none
Release note: None