
kv: ensure secondary tenants route follower reads to the closest replica #85853

Merged: 5 commits merged into cockroachdb:master from multi-tenant-follower-reads on Aug 14, 2022

Conversation

arulajmani
Collaborator

The dist sender uses node locality information to rank replicas of a
range by latency. Previously, this node locality information was read
off a node descriptor available in Gossip. Unfortunately, secondary
tenants do not have access to Gossip, and as such, would end up
randomizing this list of replicas. This manifested itself through
unpredictable latencies when running follower reads.

This patch removes that hazard by eschewing the need for a node
descriptor from Gossip in the DistSender; instead, we now instantiate
the DistSender with locality information directly.

However, we do still use Gossip to get the current node's
ID when ranking replicas. This is done to ascertain if there is a local
replica, and if there is, to always route to it. Unfortunately, because
secondary tenants don't have access to Gossip, they can't conform to
these semantics. They're susceptible to a hazard where a request may
be routed to another replica in the same locality tier as the client
even though the client has a local replica as well. This shouldn't be
a concern in practice given the diversity heuristic.
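
For illustration, here's a minimal, self-contained Go sketch of the kind of ordering that locality information alone makes possible. The types and helpers below (`Tier`, `replica`, `rankByLocality`) are made up for this example; they are not the actual DistSender code or `OptimizeReplicaOrder`.

```go
package main

import (
	"fmt"
	"sort"
)

// Tier is a single key=value locality component, e.g. region=us-east1.
// Tiers are ordered from most general (region) to most specific (zone).
type Tier struct{ Key, Value string }

// replica pairs a replica's node ID with that node's locality tiers.
type replica struct {
	NodeID   int
	Locality []Tier
}

// matchingTiers counts how many leading locality tiers two localities share;
// a longer shared prefix means the two nodes are "closer" to each other.
func matchingTiers(a, b []Tier) int {
	n := 0
	for i := 0; i < len(a) && i < len(b) && a[i] == b[i]; i++ {
		n++
	}
	return n
}

// rankByLocality orders replicas so that those sharing more locality tiers
// with the client come first. Crucially, the client only needs its own
// locality to compute this; no gossiped node descriptor is required.
func rankByLocality(client []Tier, replicas []replica) {
	sort.SliceStable(replicas, func(i, j int) bool {
		return matchingTiers(client, replicas[i].Locality) >
			matchingTiers(client, replicas[j].Locality)
	})
}

func main() {
	client := []Tier{{"region", "us-east1"}, {"zone", "us-east1-b"}}
	replicas := []replica{
		{NodeID: 1, Locality: []Tier{{"region", "us-west1"}, {"zone", "us-west1-a"}}},
		{NodeID: 2, Locality: []Tier{{"region", "us-east1"}, {"zone", "us-east1-b"}}},
		{NodeID: 3, Locality: []Tier{{"region", "us-east1"}, {"zone", "us-east1-c"}}},
	}
	rankByLocality(client, replicas)
	fmt.Println(replicas) // node 2 (same zone), then node 3 (same region), then node 1
}
```

When a latency function is available it can refine this ordering further, but locality alone already avoids the random shuffling described above.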

Resolves #81000

Release note (bug fix): fix an issue where secondary tenants could
route follower reads to a random, far away replica instead of a closer
one.

@arulajmani arulajmani requested review from a team as code owners August 9, 2022 21:48
@arulajmani arulajmani requested a review from a team August 9, 2022 21:48
@arulajmani arulajmani requested review from smg260 and removed request for a team August 9, 2022 21:48
@cockroach-teamcity
Member

This change is Reviewable

@arulajmani
Collaborator Author

@nvanbenschoten, I'm planning on tacking on a third commit that removes the use of getNodeDescriptor here, but other than that this should be good for a look.

Member

@nvanbenschoten nvanbenschoten left a comment

Reviewed 1 of 1 files at r1, 13 of 13 files at r2.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @arulajmani, @nvanbenschoten, and @smg260)


-- commits line 36 at r2:
This also shouldn't be a concern in practice because tenant SQL pods don't run in process with KV nodes.


pkg/kv/kvserver/replica_follower_read.go line 105 at r2 (raw file):

	// TODO(tschottdorf): once a read for a timestamp T has been served, the replica may
	// serve reads for that and smaller timestamps forever.
	log.Eventf(ctx, "%s; query timestamp below closed timestamp by %s", safedetails.Safe(kvbase.FollowerReadServingMsg), -tsDiff)

I think you want redact.Safe. This function is deprecated and not used elsewhere.
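
For reference, the suggested change would look roughly like this (a sketch only; it assumes the `redact` package already used elsewhere in the codebase):

```go
// Same event as above, but marking the static message as safe via
// redact.Safe instead of the deprecated safedetails.Safe.
log.Eventf(ctx, "%s; query timestamp below closed timestamp by %s",
	redact.Safe(kvbase.FollowerReadServingMsg), -tsDiff)
```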


pkg/server/node.go line 155 at r1 (raw file):

		settings.NonNegativeDurationWithMaximum(maxGraphiteInterval),
	).WithPublic()
	redactServerTracesForSecondaryTenants = settings.RegisterBoolSetting(

It's probably worthwhile to have @knz take a look at this commit as well.


pkg/server/node.go line 1168 at r1 (raw file):

	ctx context.Context, tenID roachpb.TenantID, ba *roachpb.BatchRequest,
) (context.Context, spanForRequest) {
	return setupSpanForIncomingRPC(ctx, tenID, ba, n.storeCfg.AmbientCtx.Tracer, n.settings)

Do we need to add the new field to Node? Isn't this the same as n.storeCfg.Settings?


pkg/server/server.go line 358 at r2 (raw file):

	}
	retryOpts.Closer = stopper.ShouldQuiesce()
	distSenderCfg := kvcoord.DistSenderConfig{

nit here and in tenant.go: keep the fields in the same order as in the struct definition.


pkg/sql/physicalplan/replicaoracle/oracle.go line 178 at r2 (raw file):

		return roachpb.ReplicaDescriptor{}, err
	}
	replicas.OptimizeReplicaOrder(o.nodeDesc.NodeID, o.latencyFunc, o.nodeDesc.Locality)

How does this work in a multi-tenant cluster? Is o.nodeDesc empty? Do we need to plumb the locality in here so that DistSQL planning is handled properly as well? I think that might mean doing the same NodeDescriptor -> (NodeID, Locality) switch in DistSQLPlanner.SetSQLInstanceInfo and physicalplan.NewSpanResolver.

Collaborator Author

@arulajmani arulajmani left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajstorm, @arulajmani, @knz, @nvanbenschoten, and @smg260)


pkg/kv/kvserver/replica_follower_read.go line 105 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

I think you want redact.Safe. This function is deprecated and not used elsewhere.

Faster than CI haha


pkg/server/node.go line 155 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

It's probably worthwhile to have @knz take a look at this commit as well.

I was initially going with a testing knob, but @ajstorm mentioned on Slack that we might want a cluster setting here instead. Rafa, lemme know what you think as well!

@arulajmani arulajmani force-pushed the multi-tenant-follower-reads branch 2 times, most recently from 4845806 to 47f2d6e on August 10, 2022 17:51
Collaborator Author

@arulajmani arulajmani left a comment

Thanks for the quick look, addressed your comments, should be good for another pass.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajstorm, @knz, @nvanbenschoten, and @smg260)


-- commits line 36 at r2:

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

This also shouldn't be a concern in practice because tenant SQL pods don't run in process with KV nodes.

Good point, amended the commit message.


pkg/server/node.go line 1168 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Do we need to add the new field to Node? Isn't this the same as n.storeCfg.Settings?

Done.


pkg/sql/physicalplan/replicaoracle/oracle.go line 178 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

How does this work in a multi-tenant cluster? Is o.nodeDesc empty? Do we need to plumb the locality in here so that DistSQL planning is handled properly as well? I think that might mean doing the same NodeDescriptor -> (NodeID, Locality) switch in DistSQLPlanner.SetSQLInstanceInfo and physicalplan.NewSpanResolver.

You're right, made the change in the last commit. It might be worth syncing with someone from queries/@ajstorm about the effects of this change and whether it lets us unblock some tests for secondary tenants.

Member

@nvanbenschoten nvanbenschoten left a comment

Reviewed 1 of 3 files at r3, 16 of 16 files at r4, 2 of 2 files at r5, 13 of 13 files at r6, 2 of 2 files at r7, 8 of 8 files at r8, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajstorm, @arulajmani, @knz, and @smg260)


pkg/server/server.go line 1395 at r8 (raw file):

		func(desc roachpb.NodeDescriptor) {
			s.sqlServer.execCfg.DistSQLPlanner.SetGatewaySQLInstanceID(base.SQLInstanceID(desc.NodeID))
			s.sqlServer.execCfg.DistSQLPlanner.MaybeConstructAndSetSpanResolver(desc.NodeID, desc.Locality)

Should we use s.cfg.Locality so we maintain consistency with the locality information used by DistSender, instead of passing through the NodeDescriptor?

EDIT: actually, this nodeDescriptorCallback handling seems awkward and unnecessary. I think we can get rid of it now that we're operating on NodeID and Locality, both of which are passed in to s.node.start.

Collaborator Author

@arulajmani arulajmani left a comment

I also modified one of the oracle tests to actually make use of the locality information.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajstorm, @knz, @nvanbenschoten, and @smg260)


pkg/server/server.go line 1395 at r8 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Should we use s.cfg.Locality so we maintain consistency with the locality information used by DistSender, instead of passing through the NodeDescriptor?

EDIT: actually, this nodeDescriptorCallback handling seems awkward and unnecessary. I think we can get rid of it now that we're operating on NodeID and Locality, both of which are passed in to s.node.start.

Done.

Member

@nvanbenschoten nvanbenschoten left a comment

:lgtm:

Reviewed 1 of 11 files at r9, 2 of 2 files at r10, 10 of 10 files at r11, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @ajstorm, @knz, and @smg260)

Contributor

@knz knz left a comment

Reviewed 1 of 3 files at r3, 3 of 16 files at r4, 1 of 2 files at r5, 7 of 13 files at r6, 1 of 11 files at r9, 2 of 2 files at r10, 10 of 10 files at r11, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @ajstorm, @arulajmani, and @smg260)


pkg/ccl/kvccl/kvfollowerreadsccl/boundedstaleness_test.go line 307 at r11 (raw file):

		// down in kvserver we disable their redaction.
		systemTenantSQL := sqlutils.MakeSQLRunner(tc.StorageClusterConn())
		systemTenantSQL.Exec(t, `SET CLUSTER SETTING server.secondary_tenants.redact_trace.enabled = 'false'`)

This might cause the test to flake. Have you thought about a setting override instead?


pkg/ccl/kvccl/kvfollowerreadsccl/followerreads_test.go line 792 at r11 (raw file):

		// use follower_read_timestamp(). Note that we need to override the setting
		// for the tenant as well, because the builtin is run in the tenant's sql pod.
		systemSQL := sqlutils.MakeSQLRunner(tc.Conns[0])

Have you thought about setting overrides, instead of this and the wait loop below?


pkg/kv/kvclient/kvcoord/dist_sender.go line 562 at r11 (raw file):

	// replica (if one exists). Not knowing the node ID, and thus not being able
	// to take advantage of this optimization is okay, given secondary tenants
	// run in separate processes to KV nodes (so there's no optimization to take

See my other comments: secondary tenants can definitely run in-process with KV instances.

The comment is incorrect, even though the logic is fine. What matters is not secondary vs system tenant; it's in-process vs separate-process.

The comment should explain "the node ID optimization is only available to tenants running in-process; if they are not, it doesn't matter anyway because the connection will be networked in the end."
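
To make that concrete, here is a tiny hypothetical sketch of the fast path being described, reusing the illustrative `replica` type from the sketch in the PR description (this is not the actual DistSender logic):

```go
// moveLocalFirst moves a replica located on the local KV node (if any) to the
// front of the candidate list. A tenant running in a separate process has no
// local node ID (represented as zero here), so this is a no-op and the
// locality-based ordering is used as-is; the request goes over the network
// either way.
func moveLocalFirst(localNodeID int, replicas []replica) {
	if localNodeID == 0 {
		return
	}
	for i, r := range replicas {
		if r.NodeID == localNodeID {
			replicas[0], replicas[i] = replicas[i], replicas[0]
			return
		}
	}
}
```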


pkg/server/node.go line 155 at r1 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

I was initially going with a testing knob, but @ajstorm mentioned on Slack that we might want a cluster setting here instead. Rafa, lemme know what you think as well!

A setting looks fine.

Why TenantReadOnly though? Is this not SystemOnly? (the redaction occurs on the kv side, right?)


pkg/server/node.go line 1123 at r11 (raw file):

		// (and tenID might be set), but it will go to the tenant only indirectly
		// through the response of a parent RPC. In that case, the parent RPC is
		// responsible for the redaction.

See my other comments about same-process vs different-process: secondary tenants may run inside the same process, and may use the local connector. I'm not sure this paragraph is correct.

Does it matter, though? It doesn't look like the condition you've implemented is sensitive to the deployment style?


pkg/server/server.go line 1304 at r11 (raw file):

	// Ensure components in the DistSQLPlanner that rely on the node ID are
	// initialized before store startup continues.
	s.sqlServer.execCfg.DistSQLPlanner.SetGatewaySQLInstanceID(base.SQLInstanceID(state.nodeID))

Why is this done here and not in server_sql.go, newSQLServer?


pkg/sql/distsql_physical_planner.go line 228 at r11 (raw file):

// MaybeConstructAndSetSpanResolver constructs and sets the planner's
// SpanResolver if it is unset. It's a no-op otherwise.
// TODO(arul): Should we get rid of this maybe and just fatal if it's set

Yes that sounds right. Or panic/return an assertion error.


pkg/sql/physicalplan/replicaoracle/oracle.go line 156 at r11 (raw file):

	// NodeID may be 0 (and is expected to be in case of secondary tenants), in
	// which case the current node will not be given any preference. This is sane,
	// given secondary tenant SQL pods are never run in process with KV nodes.

This comment is incorrect. Tenant SQL pods will run in-process with KV nodes sooner rather than later.

I think the semantics are OK, but you need to explain them differently here (and possibly in the commit message):

// NodeID may be zero, to indicate there is no KV instance available inside the same process.
// If there is a same-process KV instance, we prefer that if it contains a replica.

pkg/sql/physicalplan/replicaoracle/oracle.go line 210 at r11 (raw file):

	// current node and others "close" to it.
	//
	// NodeID may be 0 (and is expected to be in case of secondary tenants), in

ditto

Previously, all operations run on the server on behalf of a secondary
tenant would have their logs redacted. This patch introduces a new
cluster setting that controls whether logs should be redacted or
not. This is only settable by the system tenant and maintains the
previous behaviour by default.

Release note (ops change): introduce a new cluster setting
(`server.secondary_tenants.redact_trace`) which controls if traces
should be redacted for ops run on behalf of secondary tenants.
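
As a rough sketch of what such a registration might look like (the key, description, and default here are assumptions based on the diff and review comments above, not necessarily the merged code):

```go
// Hypothetical sketch; the merged code may use a different key, description,
// default, or setting class.
var redactServerTracesForSecondaryTenants = settings.RegisterBoolSetting(
	settings.SystemOnly, // per the review, only the system tenant may change it
	"server.secondary_tenants.redact_trace.enabled",
	"if enabled, server-side traces are redacted for requests issued on behalf of secondary tenants",
	true, // redact by default, preserving the previous behaviour
)
```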

We probabilistically run SQL using secondary tenant connections
in tests. This breaks TestBoundedStalenessDataDriven and
TestFollowerReadsWithStaleDescriptor.

Both these tests configure SQL testing knobs to trace statements and
then make assertions on the collected traces. As these testing knobs
aren't shared with the secondary tenant that the `TestServer` starts, these
tests simply don't work. For now, we simply eschew running them with
secondary tenants.

Release note: None

The dist sender uses node locality information to rank replicas of a
range by latency. Previously, this node locality information was read
off a node descriptor available in Gossip. Unfortunately, secondary
tenants do not have access to Gossip, and as such, would end up
randomizing this list of replicas. This manifested itself through
unpredictable latencies when running follower reads.

This patch removes that hazard by eschewing the need for a node
descriptor from Gossip in the DistSender; instead, we now instantiate
the DistSender with locality information directly.

However, we do still use Gossip to get the current node's
ID when ranking replicas. This is done to ascertain if there is a local
replica, and if there is, to always route to it. Unfortunately, because
secondary tenants don't have access to Gossip, they can't conform to
these semantics. They're susceptible to a hazard where a request may
be routed to another replica in the same locality tier as the client
even though the client has a local replica as well. This shouldn't be
a concern in practice given the diversity heuristic. It also shouldn't
be a concern given tenant SQL pods don't run in process with KV nodes
today.

Resolves cockroachdb#81000

Release note (bug fix): fix an issue where secondary tenants could
route follower reads to a random, far away replica instead of a closer
one.

This function is no longer used by the DistSender to optimize replica
order.

Release note: None

Previously, we used a node descriptor for locality information about
the current node when doing DistSQL planning. This only worked for
the system tenant, as secondary tenants used a fake node descriptor.
Given the changes to `OptimizeReplicaOrder` in the prior commit, this
patch switches to threading in the tenant pod's locality to allow
secondary tenants to make use of this information in DistSQL
planning.

Release note: None
Collaborator Author

@arulajmani arulajmani left a comment

Dismissed @knz from 5 discussions.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @ajstorm, @arulajmani, @knz, @nvanbenschoten, and @smg260)


pkg/ccl/kvccl/kvfollowerreadsccl/boundedstaleness_test.go line 307 at r11 (raw file):

Previously, knz (kena) wrote…

This might cause the test to flake. Have you thought about a setting override instead?

I realized that getting this test to work with secondary tenants is a bit harder than I'd initially thought, given it makes use of SQL testing knobs that aren't passed into the TestTenant. Ended up reverting this.


pkg/kv/kvclient/kvcoord/dist_sender.go line 562 at r11 (raw file):

Previously, knz (kena) wrote…

See my other comments: secondary tenants can definitely run in-process with KV instances.

The comment is incorrect, even though the logic is fine. What matters is not secondary vs system tenant; it's in-process vs separate-process.

The comment should explain "the node ID optimization is only available to tenants running in-process; if they are not, it doesn't matter anyway because the connection will be networked in the end."

Done.


pkg/server/node.go line 155 at r1 (raw file):

Previously, knz (kena) wrote…

A setting looks fine.

Why TenantReadOnly though? Is this not SystemOnly? (the redaction occurs on the kv side, right?)

D'oh, switched to SystemOnly.


pkg/server/node.go line 1123 at r11 (raw file):

Does it matter, though? It doesn't look like the condition you've implemented is sensitive to the deployment style?

Yeah, I'm not quite sure what this paragraph was trying to say and much of this was here before I got here. Removed the stuff around deployment styles.


pkg/server/server.go line 1304 at r11 (raw file):

Previously, knz (kena) wrote…

Why is this done here and not in server_sql.go, newSQLServer?

I'm definitely not intimate with this code, but from my reading we only get to know our NodeID when trying to Start the server (in PreStart), so this is where we bind it.


pkg/sql/distsql_physical_planner.go line 228 at r11 (raw file):

Previously, knz (kena) wrote…

Yes that sounds right. Or panic/return an assertion error.

Done.


pkg/sql/physicalplan/replicaoracle/oracle.go line 156 at r11 (raw file):

Previously, knz (kena) wrote…

This comment is incorrect. Tenant SQL pods will run in-process with KV nodes sooner rather than later.

I think the semantics are OK, but you need to explain them differently here (and possibly in the commit message):

// NodeID may be zero, to indicate there is no KV instance available inside the same process.
// If there is a same-process KV instance, we prefer that if it contains a replica.

Done.


pkg/sql/physicalplan/replicaoracle/oracle.go line 210 at r11 (raw file):

Previously, knz (kena) wrote…

ditto

Done.

@arulajmani
Collaborator Author

TFTRs!

bors r+

@craig
Contributor

craig bot commented Aug 14, 2022

Build succeeded:

@craig craig bot merged commit b6d1689 into cockroachdb:master Aug 14, 2022
@arulajmani arulajmani deleted the multi-tenant-follower-reads branch August 14, 2022 23:10
Successfully merging this pull request may close these issues.

kvcoord: secondary tenants do not take network latency into account when routing batch requests