
kv: ensure secondary tenants route follower reads to the closest replica #85853

Merged: 5 commits merged into cockroachdb:master from multi-tenant-follower-reads on Aug 14, 2022

Conversation

arulajmani
Collaborator

The dist sender uses node locality information to rank replicas of a
range by latency. Previously, this node locality information was read
off a node descriptor available in Gossip. Unfortunately, secondary
tenants do not have access to Gossip, and as such, would end up
randomizing this list of replicas. This manifested itself through
unpredictable latencies when running follower reads.

This patch removes that hazard by eschewing the need for a node
descriptor from Gossip in the DistSender; instead, we now instantiate
the DistSender with locality information directly.

However, we do still use Gossip to get the current node's
ID when ranking replicas. This is done to ascertain if there is a local
replica, and if there is, to always route to it. Unfortunately, because
secondary tenants don't have access to Gossip, they can't conform to
these semantics. They're susceptible to a hazard where a request may
be routed to another replica in the same locality tier as the client
even though the client has a local replica as well. This shouldn't be
a concern in practice given the diversity heuristic.
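
For illustration, here's a minimal, self-contained Go sketch of the kind of ordering that locality information alone makes possible. The types and helpers below (`Tier`, `replica`, `rankByLocality`) are made up for this example; they are not the actual DistSender code or `OptimizeReplicaOrder`.

```go
package main

import (
	"fmt"
	"sort"
)

// Tier is a single key=value locality component, e.g. region=us-east1.
// Tiers are ordered from most general (region) to most specific (zone).
type Tier struct{ Key, Value string }

// replica pairs a replica's node ID with that node's locality tiers.
type replica struct {
	NodeID   int
	Locality []Tier
}

// matchingTiers counts how many leading locality tiers two localities share;
// a longer shared prefix means the two nodes are "closer" to each other.
func matchingTiers(a, b []Tier) int {
	n := 0
	for i := 0; i < len(a) && i < len(b) && a[i] == b[i]; i++ {
		n++
	}
	return n
}

// rankByLocality orders replicas so that those sharing more locality tiers
// with the client come first. Crucially, the client only needs its own
// locality to compute this; no gossiped node descriptor is required.
func rankByLocality(client []Tier, replicas []replica) {
	sort.SliceStable(replicas, func(i, j int) bool {
		return matchingTiers(client, replicas[i].Locality) >
			matchingTiers(client, replicas[j].Locality)
	})
}

func main() {
	client := []Tier{{"region", "us-east1"}, {"zone", "us-east1-b"}}
	replicas := []replica{
		{NodeID: 1, Locality: []Tier{{"region", "us-west1"}, {"zone", "us-west1-a"}}},
		{NodeID: 2, Locality: []Tier{{"region", "us-east1"}, {"zone", "us-east1-b"}}},
		{NodeID: 3, Locality: []Tier{{"region", "us-east1"}, {"zone", "us-east1-c"}}},
	}
	rankByLocality(client, replicas)
	fmt.Println(replicas) // node 2 (same zone), then node 3 (same region), then node 1
}
```

When a latency function is available it can refine this ordering further, but locality alone already avoids the random shuffling described above.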

Resolves #81000

Release note (bug fix): fix an issue where secondary tenants could
route follower reads to a random, far away replica instead of a closer
one.

@arulajmani arulajmani requested review from a team as code owners August 9, 2022 21:48
@arulajmani arulajmani requested a review from a team August 9, 2022 21:48
@arulajmani arulajmani requested review from smg260 and removed request for a team August 9, 2022 21:48
@cockroach-teamcity
Member

This change is Reviewable

@arulajmani
Collaborator Author

@nvanbenschoten, I'm planning on tacking on a third commit that removes the use of getNodeDescriptor here, but other than that this should be good for a look.

Member

@nvanbenschoten nvanbenschoten left a comment

Reviewed 1 of 1 files at r1, 13 of 13 files at r2.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @arulajmani, @nvanbenschoten, and @smg260)


-- commits line 36 at r2:
This also shouldn't be a concern in practice because tenant SQL pods don't run in process with KV nodes.


pkg/kv/kvserver/replica_follower_read.go line 105 at r2 (raw file):

	// TODO(tschottdorf): once a read for a timestamp T has been served, the replica may
	// serve reads for that and smaller timestamps forever.
	log.Eventf(ctx, "%s; query timestamp below closed timestamp by %s", safedetails.Safe(kvbase.FollowerReadServingMsg), -tsDiff)

I think you want redact.Safe. This function is deprecated and not used elsewhere.
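
For reference, the suggested change would look roughly like this (a sketch only; it assumes the `redact` package already used elsewhere in the codebase):

```go
// Same event as above, but marking the static message as safe via
// redact.Safe instead of the deprecated safedetails.Safe.
log.Eventf(ctx, "%s; query timestamp below closed timestamp by %s",
	redact.Safe(kvbase.FollowerReadServingMsg), -tsDiff)
```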


pkg/server/node.go line 155 at r1 (raw file):

		settings.NonNegativeDurationWithMaximum(maxGraphiteInterval),
	).WithPublic()
	redactServerTracesForSecondaryTenants = settings.RegisterBoolSetting(

It's probably worthwhile to have @knz take a look at this commit as well.


pkg/server/node.go line 1168 at r1 (raw file):

	ctx context.Context, tenID roachpb.TenantID, ba *roachpb.BatchRequest,
) (context.Context, spanForRequest) {
	return setupSpanForIncomingRPC(ctx, tenID, ba, n.storeCfg.AmbientCtx.Tracer, n.settings)

Do we need to add the new field to Node? Isn't this the same as n.storeCfg.Settings?


pkg/server/server.go line 358 at r2 (raw file):

	}
	retryOpts.Closer = stopper.ShouldQuiesce()
	distSenderCfg := kvcoord.DistSenderConfig{

nit here and in tenant.go: keep the fields in the same order as in the struct definition.


pkg/sql/physicalplan/replicaoracle/oracle.go line 178 at r2 (raw file):

		return roachpb.ReplicaDescriptor{}, err
	}
	replicas.OptimizeReplicaOrder(o.nodeDesc.NodeID, o.latencyFunc, o.nodeDesc.Locality)

How does this work in a multi-tenant cluster? Is o.nodeDesc empty? Do we need to plumb the locality in here so that DistSQL planning is handled properly as well? I think that might mean doing the same NodeDescriptor -> (NodeID, Locality) switch in DistSQLPlanner.SetSQLInstanceInfo and physicalplan.NewSpanResolver.

Collaborator Author

@arulajmani arulajmani left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajstorm, @arulajmani, @knz, @nvanbenschoten, and @smg260)


pkg/kv/kvserver/replica_follower_read.go line 105 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

I think you want redact.Safe. This function is deprecated and not used elsewhere.

Faster than CI haha


pkg/server/node.go line 155 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

It's probably worthwhile to have @knz take a look at this commit as well.

I was initially going with a testing knob, but @ajstorm mentioned on Slack that we might want a cluster setting here instead. Rafa, lemme know what you think as well!

@arulajmani arulajmani force-pushed the multi-tenant-follower-reads branch 2 times, most recently from 4845806 to 47f2d6e on August 10, 2022 17:51
Collaborator Author

@arulajmani arulajmani left a comment

Thanks for the quick look, addressed your comments, should be good for another pass.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajstorm, @knz, @nvanbenschoten, and @smg260)


-- commits line 36 at r2:

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

This also shouldn't be a concern in practice because tenant SQL pods don't run in process with KV nodes.

Good point, amended the commit message.


pkg/server/node.go line 1168 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Do we need to add the new field to Node? Isn't this the same as n.storeCfg.Settings?

Done.


pkg/sql/physicalplan/replicaoracle/oracle.go line 178 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

How does this work in a multi-tenant cluster? Is o.nodeDesc empty? Do we need to plumb the locality in here so that DistSQL planning is handled properly as well? I think that might mean doing the same NodeDescriptor -> (NodeID, Locality) switch in DistSQLPlanner.SetSQLInstanceInfo and physicalplan.NewSpanResolver.

You're right, made the change in the last commit. It might be worth syncing with someone from queries/@ajstorm about the effects of this change and whether it lets us unblock some tests for secondary tenants.

Member

@nvanbenschoten nvanbenschoten left a comment

Reviewed 1 of 3 files at r3, 16 of 16 files at r4, 2 of 2 files at r5, 13 of 13 files at r6, 2 of 2 files at r7, 8 of 8 files at r8, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajstorm, @arulajmani, @knz, and @smg260)


pkg/server/server.go line 1395 at r8 (raw file):

		func(desc roachpb.NodeDescriptor) {
			s.sqlServer.execCfg.DistSQLPlanner.SetGatewaySQLInstanceID(base.SQLInstanceID(desc.NodeID))
			s.sqlServer.execCfg.DistSQLPlanner.MaybeConstructAndSetSpanResolver(desc.NodeID, desc.Locality)

Should we use s.cfg.Locality so we maintain consistency with the locality information used by DistSender, instead of passing through the NodeDescriptor?

EDIT: actually, this nodeDescriptorCallback handling seems awkward and unnecessary. I think we can get rid of it now that we're operating on NodeID and Locality, both of which are passed in to s.node.start.

Collaborator Author

@arulajmani arulajmani left a comment

I also modified one of the oracle tests to actually make use of the locality information.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajstorm, @knz, @nvanbenschoten, and @smg260)


pkg/server/server.go line 1395 at r8 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Should we use s.cfg.Locality so we maintain consistency with the locality information used by DistSender, instead of passing through the NodeDescriptor?

EDIT: actually, this nodeDescriptorCallback handling seems awkward and unnecessary. I think we can get rid of it now that we're operating on NodeID and Locality, both of which are passed in to s.node.start.

Done.

Member

@nvanbenschoten nvanbenschoten left a comment

:lgtm:

Reviewed 1 of 11 files at r9, 2 of 2 files at r10, 10 of 10 files at r11, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @ajstorm, @knz, and @smg260)

Contributor

@knz knz left a comment

Reviewed 1 of 3 files at r3, 3 of 16 files at r4, 1 of 2 files at r5, 7 of 13 files at r6, 1 of 11 files at r9, 2 of 2 files at r10, 10 of 10 files at r11, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @ajstorm, @arulajmani, and @smg260)


pkg/ccl/kvccl/kvfollowerreadsccl/boundedstaleness_test.go line 307 at r11 (raw file):

		// down in kvserver we disable their redaction.
		systemTenantSQL := sqlutils.MakeSQLRunner(tc.StorageClusterConn())
		systemTenantSQL.Exec(t, `SET CLUSTER SETTING server.secondary_tenants.redact_trace.enabled = 'false'`)

This might cause the test to flake. Have you thought about a setting override instead?


pkg/ccl/kvccl/kvfollowerreadsccl/followerreads_test.go line 792 at r11 (raw file):

		// use follower_read_timestamp(). Note that we need to override the setting
		// for the tenant as well, because the builtin is run in the tenant's sql pod.
		systemSQL := sqlutils.MakeSQLRunner(tc.Conns[0])

Have you thought about setting overrides, instead of this and the wait loop below?


pkg/kv/kvclient/kvcoord/dist_sender.go line 562 at r11 (raw file):

	// replica (if one exists). Not knowing the node ID, and thus not being able
	// to take advantage of this optimization is okay, given secondary tenants
	// run in separate processes to KV nodes (so there's no optimization to take

See my other comments: secondary tenants can definitely run in-process with KV instances.

The comment is incorrect, even though the logic is fine. What matters is not secondary vs system tenant; it's in-process vs separate-process.

The comment should explain "the node ID optimization is only available to tenants running in-process; if they are not, it doesn't matter anyway because the connection will be networked in the end."
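
To make that concrete, here is a tiny hypothetical sketch of the fast path being described, reusing the illustrative `replica` type from the sketch in the PR description (this is not the actual DistSender logic):

```go
// moveLocalFirst moves a replica located on the local KV node (if any) to the
// front of the candidate list. A tenant running in a separate process has no
// local node ID (represented as zero here), so this is a no-op and the
// locality-based ordering is used as-is; the request goes over the network
// either way.
func moveLocalFirst(localNodeID int, replicas []replica) {
	if localNodeID == 0 {
		return
	}
	for i, r := range replicas {
		if r.NodeID == localNodeID {
			replicas[0], replicas[i] = replicas[i], replicas[0]
			return
		}
	}
}
```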


pkg/server/node.go line 155 at r1 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

I was initially going with a testing knob, but @ajstorm mentioned on Slack that we might want a cluster setting here instead. Rafa, lemme know what you think as well!

A setting looks fine.

Why TenantReadOnly though? Is this not SystemOnly? (the redaction occurs on the kv side, right?)


pkg/server/node.go line 1123 at r11 (raw file):

		// (and tenID might be set), but it will go to the tenant only indirectly
		// through the response of a parent RPC. In that case, the parent RPC is
		// responsible for the redaction.

See my other comments about same-process vs different-process: secondary tenants may run inside the same process, and may use the local connector. I'm not sure this paragraph is correct.

Does it matter, though? It doesn't look like the condition you've implemented is sensitive to the deployment style?


pkg/server/server.go line 1304 at r11 (raw file):

	// Ensure components in the DistSQLPlanner that rely on the node ID are
	// initialized before store startup continues.
	s.sqlServer.execCfg.DistSQLPlanner.SetGatewaySQLInstanceID(base.SQLInstanceID(state.nodeID))

Why is this done here and not in server_sql.go, newSQLServer?


pkg/sql/distsql_physical_planner.go line 228 at r11 (raw file):

// MaybeConstructAndSetSpanResolver constructs and sets the planner's
// SpanResolver if it is unset. It's a no-op otherwise.
// TODO(arul): Should we get rid of this maybe and just fatal if it's set

Yes that sounds right. Or panic/return an assertion error.


pkg/sql/physicalplan/replicaoracle/oracle.go line 156 at r11 (raw file):

	// NodeID may be 0 (and is expected to be in case of secondary tenants), in
	// which case the current node will not be given any preference. This is sane,
	// given secondary tenant SQL pods are never run in process with KV nodes.

This comment is incorrect. Tenant SQL pods will run in-process with KV nodes sooner rather than later.

I think the semantics are OK, but you need to explain them differently here (and possibly in the commit message):

// NodeID may be zero, to indicate there is no KV instance available inside the same process.
// If there is a same-process KV instance, we prefer that if it contains a replica.

pkg/sql/physicalplan/replicaoracle/oracle.go line 210 at r11 (raw file):

	// current node and others "close" to it.
	//
	// NodeID may be 0 (and is expected to be in case of secondary tenants), in

ditto

Previously, all operations run on the server on behalf of a secondary
tenant would have their logs redacted. This patch introduces a new
cluster setting that controls whether logs should be redacted or
not. This is only settable by the system tenant and maintains the
previous behaviour by default.

Release note (ops change): introduce a new cluster setting
(`server.secondary_tenants.redact_trace`) which controls if traces
should be redacted for ops run on behalf of secondary tenants.
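
As a rough sketch of what such a registration might look like (the key, description, and default here are assumptions based on the diff and review comments above, not necessarily the merged code):

```go
// Hypothetical sketch; the merged code may use a different key, description,
// default, or setting class.
var redactServerTracesForSecondaryTenants = settings.RegisterBoolSetting(
	settings.SystemOnly, // per the review, only the system tenant may change it
	"server.secondary_tenants.redact_trace.enabled",
	"if enabled, server-side traces are redacted for requests issued on behalf of secondary tenants",
	true, // redact by default, preserving the previous behaviour
)
```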

We probabilistically run SQL using secondary tenant connections
in tests. This breaks TestBoundedStalenessDataDriven and
TestFollowerReadsWithStaleDescriptor.

Both these tests configure SQL testing knobs to trace statements and
then make assertions on the collected traces. As these testing knobs
aren't shared with the secondary tenant that the `TestServer` starts, these
tests simply don't work. For now, we simply eschew running them with
secondary tenants.

Release note: None

The dist sender uses node locality information to rank replicas of a
range by latency. Previously, this node locality information was read
off a node descriptor available in Gossip. Unfortunately, secondary
tenants do not have access to Gossip, and as such, would end up
randomizing this list of replicas. This manifested itself through
unpredictable latencies when running follower reads.

This patch removes that hazard by eschewing the need for a node
descriptor from Gossip in the DistSender; instead, we now instantiate
the DistSender with locality information directly.

However, we do still use Gossip to get the current node's
ID when ranking replicas. This is done to ascertain if there is a local
replica, and if there is, to always route to it. Unfortunately, because
secondary tenants don't have access to Gossip, they can't conform to
these semantics. They're susceptible to a hazard where a request may
be routed to another replica in the same locality tier as the client
even though the client has a local replica as well. This shouldn't be
a concern in practice given the diversity heuristic. It also shouldn't
be a concern given tenant SQL pods don't run in process with KV nodes
today.

Resolves cockroachdb#81000

Release note (bug fix): fix an issue where secondary tenants could
route follower reads to a random, far away replica instead of a closer
one.

This function is no longer used by the DistSender to optimize replica
order.

Release note: None

Previously, we used a node descriptor for locality information about
the current node when doing DistSQL planning. This only worked for
the system tenant, as secondary tenants used a fake node descriptor.
Given the changes to `OptimizeReplicaOrder` in the prior commit, this
patch switches to threading in the tenant pod's locality to allow
secondary tenants to make use of this information in DistSQL
planning.

Release note: None
Collaborator Author

@arulajmani arulajmani left a comment

Dismissed @knz from 5 discussions.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @ajstorm, @arulajmani, @knz, @nvanbenschoten, and @smg260)


pkg/ccl/kvccl/kvfollowerreadsccl/boundedstaleness_test.go line 307 at r11 (raw file):

Previously, knz (kena) wrote…

This might cause the test to flake. Have you thought about a setting override instead?

I realized that getting this test to work with secondary tenants is a bit harder than I'd initially thought, given it makes use of SQL testing knobs that aren't passed into the TestTenant. Ended up reverting this.


pkg/kv/kvclient/kvcoord/dist_sender.go line 562 at r11 (raw file):

Previously, knz (kena) wrote…

See my other comments: secondary tenants can definitely run in-process with KV instances.

The comment is incorrect, even though the logic is fine. What matters is not secondary vs system tenant; it's in-process vs separate-process.

The comment should explain "the node ID optimization is only available to tenants running in-process; if they are not, it doesn't matter anyway because the connection will be networked in the end."

Done.


pkg/server/node.go line 155 at r1 (raw file):

Previously, knz (kena) wrote…

A setting looks fine.

Why TenantReadOnly though? Is this not SystemOnly? (the redaction occurs on the kv side, right?)

D'oh, switched to SystemOnly.


pkg/server/node.go line 1123 at r11 (raw file):

Does it matter, though? It doesn't look like the condition you've implemented is sensitive to the deployment style?

Yeah, I'm not quite sure what this paragraph was trying to say and much of this was here before I got here. Removed the stuff around deployment styles.


pkg/server/server.go line 1304 at r11 (raw file):

Previously, knz (kena) wrote…

Why is this done here and not in server_sql.go, newSQLServer?

I'm definitely not intimate with this code, but from my reading we only get to know our NodeID when trying to Start the server (in PreStart), so this is where we bind it.


pkg/sql/distsql_physical_planner.go line 228 at r11 (raw file):

Previously, knz (kena) wrote…

Yes that sounds right. Or panic/return an assertion error.

Done.


pkg/sql/physicalplan/replicaoracle/oracle.go line 156 at r11 (raw file):

Previously, knz (kena) wrote…

This comment is incorrect. Tenant SQL pods will run in-process with KV nodes sooner rather than later.

I think the semantics are OK, but you need to explain them differently here (and possibly in the commit message):

// NodeID may be zero, to indicate there is no KV instance available inside the same process.
// If there is a same-process KV instance, we prefer that if it contains a replica.

Done.


pkg/sql/physicalplan/replicaoracle/oracle.go line 210 at r11 (raw file):

Previously, knz (kena) wrote…

ditto

Done.

@arulajmani
Collaborator Author

TFTRs!

bors r+

@craig
Contributor

craig bot commented Aug 14, 2022

Build succeeded:

@craig craig bot merged commit b6d1689 into cockroachdb:master Aug 14, 2022
@arulajmani arulajmani deleted the multi-tenant-follower-reads branch August 14, 2022 23:10
Successfully merging this pull request may close these issues.

kvcoord: secondary tenants do not take network latency into account when routing batch requests