Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

roachtest/mixedversion: WaitForReplication must ensure SQL is ready #138109

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

srosenberg
Copy link
Member

@srosenberg srosenberg commented Dec 31, 2024

Previously, mixedversion roachtests would occassionally fail while trying to execute SQL against a freshly started node; specifically, get-user-session would time out. The precise root cause is described in [1].

As of [2], WaitForReplication is the default used by all mixedversion roachtests. Thus, it suffices
to ensure that "SQL is ready" before
WaitForReplication is invoked.

However, a number of roachtests don't opt into
StartOpts.WaitForReplication, which makes them
susceptible to the same failure mode, e.g., [3].
Thus, we unconditionally WaitForSQLReady, for
the system tenant, on every started DB node.

[1] #137988
[2] #136607
[3] #137382

Fixes: #137988
Informs: #137332

Epic: none
Release note: None

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@srosenberg
Copy link
Member Author

10 runs using the SEED (COCKROACH_RANDOM_SEED=-7845636561506696193) from [1],

artifacts-137332/tpcc/mixed-headroom/n5cpu16/run_1/test.log:2024/12/31 10:02:18 test_runner.go:1205: --- PASS: tpcc/mixed-headroom/n5cpu16#1 (17286.21s)
artifacts-137332/tpcc/mixed-headroom/n5cpu16/run_3/test.log:2024/12/31 10:01:44 test_runner.go:1205: --- PASS: tpcc/mixed-headroom/n5cpu16#3 (17265.54s)
artifacts-137332/tpcc/mixed-headroom/n5cpu16/run_4/test.log:2024/12/31 10:01:57 test_runner.go:1205: --- PASS: tpcc/mixed-headroom/n5cpu16#4 (17272.15s)
artifacts-137332/tpcc/mixed-headroom/n5cpu16/run_5/test.log:2024/12/31 11:00:15 test_runner.go:1205: --- PASS: tpcc/mixed-headroom/n5cpu16#5 (17533.08s)
artifacts-137332/tpcc/mixed-headroom/n5cpu16/run_6/test.log:2024/12/31 14:50:13 test_runner.go:1205: --- PASS: tpcc/mixed-headroom/n5cpu16#6 (17271.72s)
artifacts-137332/tpcc/mixed-headroom/n5cpu16/run_7/test.log:2024/12/31 14:50:56 test_runner.go:1205: --- PASS: tpcc/mixed-headroom/n5cpu16#7 (17303.44s)
artifacts-137332/tpcc/mixed-headroom/n5cpu16/run_8/test.log:2024/12/31 14:51:06 test_runner.go:1205: --- PASS: tpcc/mixed-headroom/n5cpu16#8 (17291.17s)
artifacts-137332/tpcc/mixed-headroom/n5cpu16/run_9/test.log:2024/12/31 15:48:04 test_runner.go:1205: --- PASS: tpcc/mixed-headroom/n5cpu16#9 (17231.92s)

[1] #137332 (comment)

@srosenberg srosenberg marked this pull request as ready for review December 31, 2024 23:00
@srosenberg srosenberg requested a review from a team as a code owner December 31, 2024 23:00
@srosenberg srosenberg requested review from nameisbhaskar, vidit-bhat and DarrylWong and removed request for a team December 31, 2024 23:00
// WaitForReady waits until the given nodes report ready via health checks.
// This implies that the node has completed server startup, is heartbeating its
// liveness record, and can serve SQL clients.
// FIXME(srosenberg): This function is a bit of a misnomer. It doesn't actually ensure that SQL is ready to serve, only
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@andrewbaptist Just a heads-up since I believe this helper is used only by the tests you authored. While the Admin endpoint may be up, it doesn't guarantee that SQL is ready. The current implementation may be "ok" for these tests, but it's more pronounced in shared-process, multi-tenant configurations.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Less of an issue now after the last iteration, which unconditionally calls WaitForSQLReady, for every started db node.

@srosenberg
Copy link
Member Author

SELECT_PROBABILITY=0.6

@srosenberg srosenberg force-pushed the sr/137988 branch 4 times, most recently from 0d880fb to d89b8e1 Compare January 1, 2025 22:55
@srosenberg
Copy link
Member Author

srosenberg commented Jan 1, 2025

SELECT_PROBABILITY=0.6

decommission/drains failed (because sql on n4 wasn't ready), which made me reconsider the initial approach of tying WaitForSQLReady to WaitForReplication. (Even in that case, it only does it for n1.) Now, each started db node has to pass WaitForSQLReady, regardless of whether WaitForReplication is enabled. In light of that, let's try take two,

SELECT_PROBABILITY=0.8

Previously, mixedversion roachtests would occassionally
fail while trying to execute SQL against a freshly
started node; specifically, `get-user-session` would
time out. The precise root cause is described in [1].

As of [2], `WaitForReplication` is the default used
by all mixedversion roachtests. Thus, it suffices
to ensure that "SQL is ready" before
`WaitForReplication` is invoked.

However, a number of roachtests don't opt into
`StartOpts.WaitForReplication`, which makes them
susceptible to the same failure mode, e.g., [3].
Thus, we unconditionally `WaitForSQLReady`, for
the system tenant, on every started DB node.

[1] cockroachdb#137988
[2] cockroachdb#136607
[3] cockroachdb#137382

Fixes: cockroachdb#137988
Informs: cockroachdb#137332

Epic: none
Release note: None
Copy link
Contributor

@DarrylWong DarrylWong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

if !startOpts.RoachprodOpts.SkipInit {
// Wait for SQL to be ready on all nodes, for 'system' tenant, only.
for _, n := range nodes {
conn, err := c.ConnE(ctx, l, nodes[0], option.VirtualClusterName("system"))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: install.SystemInterfaceName instead of "system"

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I'll update.

// N.B. We must explicitly pass the virtual cluster name to `ConnE`, otherwise the default may turn out to be a
// secondary tenant, in which case we would only check the tenant's key range, not the whole system's.
// See "Unhidden Bug" in https://github.com/cockroachdb/cockroach/issues/137988
conn, err := c.ConnE(ctx, l, nodes[0], option.VirtualClusterName("system"))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: same as above

nodes := selectedNodesOrDefault(opts, c.CRDBNodes())
// N.B. If `SkipInit` is set, we don't wait for SQL since node(s) may not join the cluster in any definite time.
if !startOpts.RoachprodOpts.SkipInit {
// Wait for SQL to be ready on all nodes, for 'system' tenant, only.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should something similar be added for the virtual cluster start counterpart: StartServiceForVirtualClusterE?

Copy link
Contributor

@DarrylWong DarrylWong Jan 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I guess it's not needed because we're mainly waiting for the privileges cache to be ready which is at the system level?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically, yes, we do need to wait for shared-process tenants. However, there could be any number of them, e.g.,

I241212 11:19:11.321291 53045 sql/syntheticprivilegecache/cache.go:198 â‹® [T3,Vmixed-version-tenant-mbqzz,n3] 145  warmed privileges for virtual tables in 10.007996435s

I thought it's probably best to tackle them in the mixedversion framework, since the above code already feels heavy wrt additional waiting. What do you think?

Copy link
Contributor

@DarrylWong DarrylWong Jan 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

However, there could be any number of them, e.g.

But don't you just need to wait for the tenant you are starting? Which it should already know from startOpts.RoachprodOpts.VirtualClusterName?

since the above code already feels heavy wrt additional waiting

Yeah I was thinking the same thing, especially since the first SQL check is redundant if WaitForReplicationFactor > 0 since WaitForReplication is also checking if SQL is ready. No arguments from me against doing it in the mixed version framework.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

roachtest/mixedversion: WaitForReplication (temporally) depends on lazily initialized SyntheticPrivilegeCache
3 participants