kv: ensure secondary tenants route follower reads to the closest replica #85853
Conversation
@nvanbenschoten, I'm planning on tacking on a third commit that removes the use of …
Force-pushed from ac27425 to a1fca9e
Reviewed 1 of 1 files at r1, 13 of 13 files at r2.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @arulajmani, @nvanbenschoten, and @smg260)
-- commits
line 36 at r2:
This also shouldn't be a concern in practice because tenant SQL pods don't run in process with KV nodes.
pkg/kv/kvserver/replica_follower_read.go
line 105 at r2 (raw file):
```go
// TODO(tschottdorf): once a read for a timestamp T has been served, the replica may
// serve reads for that and smaller timestamps forever.
log.Eventf(ctx, "%s; query timestamp below closed timestamp by %s",
	safedetails.Safe(kvbase.FollowerReadServingMsg), -tsDiff)
```
I think you want `redact.Safe`. This function (`safedetails.Safe`) is deprecated and not used elsewhere.
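For context, `redact.Safe` marks a value as safe to retain when trace and log output is redacted; everything else gets masked. A toy, self-contained sketch of that idea (the `safeWrapper` type and `redactf` here are illustrative inventions, not the actual `cockroachdb/redact` API):

```go
package main

import "fmt"

// safeWrapper marks a value as safe to keep in redacted output. This is
// a toy model of the concept behind redact.Safe, not the real API.
type safeWrapper struct{ v interface{} }

// Safe wraps a value so the redactor leaves it intact.
func Safe(v interface{}) safeWrapper { return safeWrapper{v} }

// redactf formats args, replacing anything not marked safe with ‹×›.
func redactf(format string, args ...interface{}) string {
	out := make([]interface{}, len(args))
	for i, a := range args {
		if s, ok := a.(safeWrapper); ok {
			out[i] = s.v // safe values survive redaction
		} else {
			out[i] = "‹×›" // unsafe values are masked
		}
	}
	return fmt.Sprintf(format, out...)
}

func main() {
	// The constant message is safe; the timestamp delta is treated as
	// potentially sensitive and masked.
	fmt.Println(redactf("%s; query timestamp below closed timestamp by %s",
		Safe("follower reads enabled"), "1.5s"))
}
```

The point of the review comment is simply that the same marker should come from the maintained `redact` package rather than the deprecated `safedetails` one.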
pkg/server/node.go
line 155 at r1 (raw file):
```go
	settings.NonNegativeDurationWithMaximum(maxGraphiteInterval),
).WithPublic()

redactServerTracesForSecondaryTenants = settings.RegisterBoolSetting(
```
It's probably worthwhile to have @knz take a look at this commit as well.
pkg/server/node.go
line 1168 at r1 (raw file):
```go
	ctx context.Context, tenID roachpb.TenantID, ba *roachpb.BatchRequest,
) (context.Context, spanForRequest) {
	return setupSpanForIncomingRPC(ctx, tenID, ba, n.storeCfg.AmbientCtx.Tracer, n.settings)
```
Do we need to add the new field to `Node`? Isn't this the same as `n.storeCfg.Settings`?
pkg/server/server.go
line 358 at r2 (raw file):
```go
}
retryOpts.Closer = stopper.ShouldQuiesce()
distSenderCfg := kvcoord.DistSenderConfig{
```
nit here and in `tenant.go`: keep the fields in the same order as in the struct definition.
pkg/sql/physicalplan/replicaoracle/oracle.go
line 178 at r2 (raw file):
```go
	return roachpb.ReplicaDescriptor{}, err
}
replicas.OptimizeReplicaOrder(o.nodeDesc.NodeID, o.latencyFunc, o.nodeDesc.Locality)
```
How does this work in a multi-tenant cluster? Is `o.nodeDesc` empty? Do we need to plumb in the locality here as well, so that DistSQL planning is handled properly too? I think that might mean doing the same `NodeDescriptor` -> `(NodeID, Locality)` switch in `DistSQLPlanner.SetSQLInstanceInfo` and `physicalplan.NewSpanResolver`.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajstorm, @arulajmani, @knz, @nvanbenschoten, and @smg260)
pkg/kv/kvserver/replica_follower_read.go
line 105 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
I think you want `redact.Safe`. This function is deprecated and not used elsewhere.
Faster than CI haha
pkg/server/node.go
line 155 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
It's probably worthwhile to have @knz take a look at this commit as well.
I was initially going with a testing knob, but @ajstorm mentioned on Slack that we might want a cluster setting here instead. Rafa, lemme know what you think as well!
Force-pushed from 4845806 to 47f2d6e
Thanks for the quick look, addressed your comments, should be good for another pass.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajstorm, @knz, @nvanbenschoten, and @smg260)
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
This also shouldn't be a concern in practice because tenant SQL pods don't run in process with KV nodes.
Good point, amended the commit message.
pkg/server/node.go
line 1168 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Do we need to add the new field to `Node`? Isn't this the same as `n.storeCfg.Settings`?
Done.
pkg/sql/physicalplan/replicaoracle/oracle.go
line 178 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
How does this work in a multi-tenant cluster? Is `o.nodeDesc` empty? Do we need to plumb in the locality here as well, so that DistSQL planning is handled properly too? I think that might mean doing the same `NodeDescriptor` -> `(NodeID, Locality)` switch in `DistSQLPlanner.SetSQLInstanceInfo` and `physicalplan.NewSpanResolver`.
You're right, made the change in the last commit. It might be worth syncing with someone from queries/@ajstorm about the effects of this change and if we can unblock some tests for secondary tenants given this change.
Reviewed 1 of 3 files at r3, 16 of 16 files at r4, 2 of 2 files at r5, 13 of 13 files at r6, 2 of 2 files at r7, 8 of 8 files at r8, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajstorm, @arulajmani, @knz, and @smg260)
pkg/server/server.go
line 1395 at r8 (raw file):
```go
func(desc roachpb.NodeDescriptor) {
	s.sqlServer.execCfg.DistSQLPlanner.SetGatewaySQLInstanceID(base.SQLInstanceID(desc.NodeID))
	s.sqlServer.execCfg.DistSQLPlanner.MaybeConstructAndSetSpanResolver(desc.NodeID, desc.Locality)
```
Should we use `s.cfg.Locality` so we maintain consistency with the locality information used by `DistSender`, instead of passing through the `NodeDescriptor`?

EDIT: actually, this `nodeDescriptorCallback` handling seems awkward and unnecessary. I think we can get rid of it now that we're operating on NodeID and Locality, both of which are passed in to `s.node.start`.
Force-pushed from 47f2d6e to 6de875f
I also modified one of the oracle tests to actually make use of the locality information.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajstorm, @knz, @nvanbenschoten, and @smg260)
pkg/server/server.go
line 1395 at r8 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Should we use `s.cfg.Locality` so we maintain consistency with the locality information used by `DistSender`, instead of passing through the `NodeDescriptor`?

EDIT: actually, this `nodeDescriptorCallback` handling seems awkward and unnecessary. I think we can get rid of it now that we're operating on NodeID and Locality, both of which are passed in to `s.node.start`.
Done.
Reviewed 1 of 11 files at r9, 2 of 2 files at r10, 10 of 10 files at r11, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @ajstorm, @knz, and @smg260)
Reviewed 1 of 3 files at r3, 3 of 16 files at r4, 1 of 2 files at r5, 7 of 13 files at r6, 1 of 11 files at r9, 2 of 2 files at r10, 10 of 10 files at r11, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @ajstorm, @arulajmani, and @smg260)
pkg/ccl/kvccl/kvfollowerreadsccl/boundedstaleness_test.go
line 307 at r11 (raw file):
```go
// down in kvserver we disable their redaction.
systemTenantSQL := sqlutils.MakeSQLRunner(tc.StorageClusterConn())
systemTenantSQL.Exec(t, `SET CLUSTER SETTING server.secondary_tenants.redact_trace.enabled = 'false'`)
```
This might cause the test to flake. Have you thought about a setting override instead?
pkg/ccl/kvccl/kvfollowerreadsccl/followerreads_test.go
line 792 at r11 (raw file):
```go
// use follower_read_timestamp(). Note that we need to override the setting
// for the tenant as well, because the builtin is run in the tenant's sql pod.
systemSQL := sqlutils.MakeSQLRunner(tc.Conns[0])
```
Have you thought about setting overrides, instead of this and the wait loop below?
pkg/kv/kvclient/kvcoord/dist_sender.go
line 562 at r11 (raw file):
```go
// replica (if one exists). Not knowing the node ID, and thus not being able
// to take advantage of this optimization is okay, given secondary tenants
// run in separate processes to KV nodes (so there's no optimization to take
```
See my other comments: secondary tenants can definitely run in-process with KV instances.
The comment is incorrect, even though the logic is fine. What matters is not secondary vs. system tenant; it is in-process vs. separate-process.
The comment should explain: "the node ID optimization is only available to tenants running in-process; if they are not, it doesn't matter anyway, because the connection will be networked in the end."
pkg/server/node.go
line 155 at r1 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
I was initially going with a testing knob, but @ajstorm mentioned on Slack that we might want a cluster setting here instead. Rafa, lemme know what you think as well!
A setting looks fine.
Why TenantReadOnly though? Is this not SystemOnly? (the redaction occurs on the kv side, right?)
pkg/server/node.go
line 1123 at r11 (raw file):
```go
// (and tenID might be set), but it will go to the tenant only indirectly
// through the response of a parent RPC. In that case, the parent RPC is
// responsible for the redaction.
```
See my other comments about same-process vs different-process: secondary tenants may run inside the same process, and may use the local connector. I'm not sure this paragraph is correct.
However, does it matter though? It doesn't look like the condition you've implemented is sensitive to the deployment style?
pkg/server/server.go
line 1304 at r11 (raw file):
```go
// Ensure components in the DistSQLPlanner that rely on the node ID are
// initialized before store startup continues.
s.sqlServer.execCfg.DistSQLPlanner.SetGatewaySQLInstanceID(base.SQLInstanceID(state.nodeID))
```
Why is this done here and not in `server_sql.go`, `newSQLServer`?
pkg/sql/distsql_physical_planner.go
line 228 at r11 (raw file):
```go
// MaybeConstructAndSetSpanResolver constructs and sets the planner's
// SpanResolver if it is unset. It's a no-op otherwise.
// TODO(arul): Should we get rid of this maybe and just fatal if it's set
```
Yes that sounds right. Or panic/return an assertion error.
pkg/sql/physicalplan/replicaoracle/oracle.go
line 156 at r11 (raw file):
```go
// NodeID may be 0 (and is expected to be in case of secondary tenants), in
// which case the current node will not be given any preference. This is sane,
// given secondary tenant SQL pods are never run in process with KV nodes.
```
This comment is incorrect. tenant SQL pods will run in-process with KV nodes, sooner than later.
I think the semantics are OK, but you need to explain them differently here (and possibly in the commit message):
```go
// NodeID may be zero, to indicate there is no KV instance available inside the same process.
// If there is a same-process KV instance, we prefer that if it contains a replica.
```
pkg/sql/physicalplan/replicaoracle/oracle.go
line 210 at r11 (raw file):
```go
// current node and others "close" to it.
//
// NodeID may be 0 (and is expected to be in case of secondary tenants), in
```
ditto
Force-pushed from 6de875f to f710993
Previously, all operations run on the server on behalf of a secondary tenant would have their logs redacted. This patch introduces a new cluster setting that controls whether logs should be redacted or not. This is only settable by the system tenant and maintains the previous behaviour by default.

Release note (ops change): introduce a new cluster setting (`server.secondary_tenants.redact_trace`) which controls if traces should be redacted for ops run on behalf of secondary tenants.
We probabilistically run SQL using secondary tenant connections in tests. This breaks TestBoundedStalenessDataDriven and TestFollowerReadsWithStaleDescriptor. Both these tests configure SQL testing knobs to trace statements and then make assertions on the collected traces. As these testing knobs aren't shared with the secondary tenant that the `TestServer` starts, the tests simply don't work. For now, we simply eschew running these tests with secondary tenants.

Release note: None
The dist sender uses node locality information to rank replicas of a range by latency. Previously, this node locality information was read off a node descriptor available in Gossip. Unfortunately, secondary tenants do not have access to Gossip, and as such, would end up randomizing this list of replicas. This manifested itself through unpredictable latencies when running follower reads.

We're no longer susceptible to this hazard with this patch. This is done by eschewing the need of a node descriptor from gossip in the DistSender; instead, we now instantiate the DistSender with locality information.

However, we do still use Gossip to get the current node's ID when ranking replicas. This is done to ascertain if there is a local replica, and if there is, to always route to it. Unfortunately, because secondary tenants don't have access to Gossip, they can't conform to these semantics. They're susceptible to a hazard where a request may be routed to another replica in the same locality tier as the client even though the client has a local replica as well. This shouldn't be a concern in practice given the diversity heuristic. It also shouldn't be a concern given tenant SQL pods don't run in process with KV nodes today.

Resolves cockroachdb#81000

Release note (bug fix): fix an issue where secondary tenants could route follower reads to a random, far away replica instead of one closer.
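The locality-tier ranking this commit relies on can be sketched as follows. This is an illustrative, self-contained model; `Tier`, `matchDepth`, and `rankByLocality` are hypothetical names, not the actual `OptimizeReplicaOrder` code, which also considers the local node ID and measured latencies:

```go
package main

import (
	"fmt"
	"sort"
)

// Tier is one key=value component of a locality, e.g. region=us-east.
type Tier struct{ Key, Value string }

// matchDepth counts how many leading locality tiers two localities
// share; a deeper match means the replica is "closer" to the client.
func matchDepth(a, b []Tier) int {
	n := 0
	for i := 0; i < len(a) && i < len(b); i++ {
		if a[i] != b[i] {
			break
		}
		n++
	}
	return n
}

// rankByLocality sorts replica localities so that those sharing the
// most leading tiers with the client come first. Without the client's
// locality (as happened for secondary tenants before this patch), every
// replica ties at depth 0 and the order is effectively arbitrary.
func rankByLocality(client []Tier, replicas [][]Tier) {
	sort.SliceStable(replicas, func(i, j int) bool {
		return matchDepth(client, replicas[i]) > matchDepth(client, replicas[j])
	})
}

func main() {
	client := []Tier{{"region", "us-east"}, {"zone", "us-east-1a"}}
	replicas := [][]Tier{
		{{"region", "eu-west"}, {"zone", "eu-west-1b"}},
		{{"region", "us-east"}, {"zone", "us-east-1c"}},
		{{"region", "us-east"}, {"zone", "us-east-1a"}},
	}
	rankByLocality(client, replicas)
	fmt.Println(replicas[0]) // the same-zone replica ranks first
}
```

The stable sort preserves the incoming order among replicas with equal match depth, which is where a latency function would break ties in a fuller model.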
This function is no longer used by the DistSender to optimize replica order.

Release note: None
Previously, we used a node descriptor for locality information about the current node when doing dist SQL planning. This only worked for the system tenant, as secondary tenants used a fake node descriptor. Given the changes to `OptimizeReplicaOrder` in the prior commit, this patch switches to threading in the tenant pod's locality to allow secondary tenants to make use of this information in dist SQL planning.

Release note: None
Force-pushed from f710993 to 0e08303
Dismissed @knz from 5 discussions.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @ajstorm, @arulajmani, @knz, @nvanbenschoten, and @smg260)
pkg/ccl/kvccl/kvfollowerreadsccl/boundedstaleness_test.go
line 307 at r11 (raw file):
Previously, knz (kena) wrote…
This might cause the test to flake. Have you thought about a setting override instead?
I realized getting this test to work with secondary tenants is a bit harder than what I'd initially thought, given it makes use of SQL testing knobs that aren't passed into the `TestTenant`. Ended up reverting this.
pkg/kv/kvclient/kvcoord/dist_sender.go
line 562 at r11 (raw file):
Previously, knz (kena) wrote…
See my other comments: secondary tenants can definitely run in-process with KV instances.
The comment is incorrect, even though the logic is fine. What matters is not secondary vs system tenant; is in-process vs separate-process.
The comment should explain "the node ID optimization is only available to tenants running in-process; if they are not, it doesn't matter anyway because the connection will be networked in the end."
Done.
pkg/server/node.go
line 155 at r1 (raw file):
Previously, knz (kena) wrote…
A setting looks fine.
Why TenantReadOnly though? Is this not SystemOnly? (the redaction occurs on the kv side, right?)
D'op, switched to SystemOnly.
pkg/server/node.go
line 1123 at r11 (raw file):
Previously, knz (kena) wrote…
However, does it matter though? It doesn't look like the condition you've implemented is sensitive to the deployment style?
Yeah, I'm not quite sure what this paragraph was trying to say and much of this was here before I got here. Removed the stuff around deployment styles.
pkg/server/server.go
line 1304 at r11 (raw file):
Previously, knz (kena) wrote…
Why is this done here and not in `server_sql.go`, `newSQLServer`?
I'm definitely not intimate with this code, but from my reading we only get to know our `NodeID` when trying to start the server (in `PreStart`), so this is where we bind it.
pkg/sql/distsql_physical_planner.go
line 228 at r11 (raw file):
Previously, knz (kena) wrote…
Yes that sounds right. Or panic/return an assertion error.
Done.
pkg/sql/physicalplan/replicaoracle/oracle.go
line 156 at r11 (raw file):
Previously, knz (kena) wrote…
This comment is incorrect. tenant SQL pods will run in-process with KV nodes, sooner than later.
I think the semantics are OK, but you need to explain them differently here (and possibly in the commit message):
```go
// NodeID may be zero, to indicate there is no KV instance available inside the same process.
// If there is a same-process KV instance, we prefer that if it contains a replica.
```
Done.
pkg/sql/physicalplan/replicaoracle/oracle.go
line 210 at r11 (raw file):
Previously, knz (kena) wrote…
ditto
Done.
TFTRs! bors r+
Build succeeded.
The dist sender uses node locality information to rank replicas of a
range by latency. Previously, this node locality information was read
off a node descriptor available in Gossip. Unfortunately, secondary
tenants do not have access to Gossip, and as such, would end up
randomizing this list of replicas. This manifested itself through
unpredictable latencies when running follower reads.
We're no longer susceptible to this hazard with this patch. This is done
by eschewing the need of a node descriptor from gossip in the
DistSender; instead, we now instantiate the DistSender with locality
information.
However, we do still use Gossip to get the current node's
ID when ranking replicas. This is done to ascertain if there is a local
replica, and if there is, to always route to it. Unfortunately, because
secondary tenants don't have access to Gossip, they can't conform to
these semantics. They're susceptible to a hazard where a request may
be routed to another replica in the same locality tier as the client
even though the client has a local replica as well. This shouldn't be
a concern in practice given the diversity heuristic.
Resolves #81000
Release note (bug fix): fix an issue where secondary tenants could
route follower reads to a random, far away replica instead of one
closer.