kvcoord: secondary tenants do not take network latency into account when routing batch requests #81000

Closed

arulajmani opened this issue May 4, 2022 · 5 comments · Fixed by #85853

Labels: A-multitenancy (Related to multi-tenancy), C-bug (Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.), T-kv (KV Team)

@arulajmani (Collaborator) commented May 4, 2022

Describe the problem

The DistSender sends batch requests that interact with a range to that range's replicas. In general, it orders the replicas by the latency between the requesting node and the node each replica lives on. If the request must be sent to the leaseholder (and the leaseholder is known), the leaseholder is moved to the front of the replica list. See:

```go
switch ba.RoutingPolicy {
case roachpb.RoutingPolicy_LEASEHOLDER:
	// First order by latency, then move the leaseholder to the front of the
	// list, if it is known.
	if !ds.dontReorderReplicas {
		replicas.OptimizeReplicaOrder(ds.getNodeDescriptor(), ds.latencyFunc)
	}
	idx := -1
	if leaseholder != nil {
		idx = replicas.Find(leaseholder.ReplicaID)
	}
	if idx != -1 {
		replicas.MoveToFront(idx)
		leaseholderFirst = true
	} else {
		// The leaseholder node's info must have been missing from gossip when we
		// created replicas.
		log.VEvent(ctx, 2, "routing to nearest replica; leaseholder not known")
	}

case roachpb.RoutingPolicy_NEAREST:
	// Order by latency.
	log.VEvent(ctx, 2, "routing to nearest replica; leaseholder not required")
	replicas.OptimizeReplicaOrder(ds.getNodeDescriptor(), ds.latencyFunc)
```

For follower-read requests, this has the effect that they are always routed to the nearest replica. Unfortunately, this doesn't work as intended for secondary tenants. A secondary tenant's DistSender is unaware of which node it is running on, so it ends up randomly ordering the replica slice here:

```go
if nodeDesc == nil {
	shuffle.Shuffle(rs)
	return
}
```

The reason a secondary tenant's DistSender is unaware of which node it is running on is that this cast fails:

```go
// TODO(nvanbenschoten): open an issue about the effect of this.
g, ok := ds.nodeDescs.(*gossip.Gossip)
if !ok {
	return nil
}
```
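
To illustrate the failure, here is a minimal, self-contained sketch. All names in it are hypothetical stand-ins rather than the real kvcoord or gossip types: the DistSender holds only an interface, and a secondary tenant's DistSender is (presumably) backed by a tenant-side connector rather than a *gossip.Gossip, so the type assertion fails, the method returns nil, and the shuffle shown above kicks in.

```go
package main

import "fmt"

// nodeDescStore is a stand-in for the interface the DistSender holds.
type nodeDescStore interface {
	getNodeDescriptor(nodeID int) (string, error)
}

// gossipImpl stands in for *gossip.Gossip, available on KV nodes.
type gossipImpl struct{}

func (*gossipImpl) getNodeDescriptor(nodeID int) (string, error) { return "desc", nil }

// tenantConnector stands in for the tenant-side implementation: it
// satisfies the interface, but it is not the gossip type.
type tenantConnector struct{}

func (*tenantConnector) getNodeDescriptor(nodeID int) (string, error) { return "desc", nil }

// localNodeDescriptor mirrors the shape of the failing cast above: it
// only succeeds when the store is the concrete gossip implementation.
func localNodeDescriptor(store nodeDescStore) *string {
	g, ok := store.(*gossipImpl)
	if !ok {
		return nil // secondary tenant: no local node descriptor
	}
	desc, _ := g.getNodeDescriptor(1)
	return &desc
}

func main() {
	fmt.Println(localNodeDescriptor(&gossipImpl{}) != nil)      // true on a KV node
	fmt.Println(localNodeDescriptor(&tenantConnector{}) != nil) // false for a tenant
}
```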

@nvanbenschoten was this the hazard you had in mind in the TODO above, or is there more here?

To Reproduce

I discovered this when trying to convert our follower_reads roachtest to run as secondary tenants. TODO(arul): link a WIP PR.

Expected behavior

Secondary tenants should take network latency into account when routing requests. Consequently, follower reads issued by secondary tenants should be served from the nearest replica (instead of a random one).

Additional context

We'd want to solve this to harness many of the benefits listed in #72593 for secondary tenants.

cc @cockroachdb/kv

Jira issue: CRDB-15402

Epic: CRDB-14202

@arulajmani arulajmani added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label May 4, 2022
@blathers-crl blathers-crl bot added the T-kv KV Team label May 4, 2022
@knz (Contributor) commented May 5, 2022

A request when investigating this: be careful to highlight that the problem exists with secondary tenants running as separate SQL-only servers.

For dedicated/self-hosted deployments we will have secondary tenants running in-process with KV, with a 1:1 relationship to the KV node. In that case, we will be able to use the base algorithm.

@knz knz added the A-multitenancy Related to multi-tenancy label May 5, 2022
@nvanbenschoten (Member) commented

> For dedicated/self-hosted deployments we will have secondary tenants running in-process with KV, with a 1:1 relationship to the KV node. In that case, we will be able to use the base algorithm.

By this do you mean that these secondary tenants will be aware of a local NodeDescriptor?

@knz (Contributor) commented Jun 22, 2022

> By this do you mean that these secondary tenants will be aware of a local NodeDescriptor?

We don't need the entire node descriptor though, just the ID and locality attributes? We're already planning to include those in the sqlliveness table.

@andy-kimball (Contributor) commented

We'll definitely need this fixed before we can make MR Serverless a reality. It's a "ship-stopper" issue.

@nvanbenschoten (Member) commented Jul 27, 2022

This looks straightforward. The fix will involve a bit of plumbing and a generalization of ReplicaSlice.OptimizeReplicaOrder. My suggestion is to change the signature of that method from:

```go
func (rs ReplicaSlice) OptimizeReplicaOrder(nodeDesc *roachpb.NodeDescriptor, latencyFn LatencyFunc)
```

to

```go
// nodeID can be 0, in which case it is ignored.
// latencyFn can be nil, in which case it will not be used.
func (rs ReplicaSlice) OptimizeReplicaOrder(nodeID roachpb.NodeID, latencyFn LatencyFunc, locality roachpb.Locality)
```

Once we have that:

  1. use Gossip if available to grab the client's NodeID (DistSender.getNodeID)
  2. get rid of DistSender.getNodeDescriptor
  3. finally, plumb in the local process's Locality into the DistSender (cfg.Locality in server.NewServer and server.makeTenantSQLServerArgs)
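
To make the proposal concrete, here is a rough, runnable sketch of what the generalized ordering could look like. Everything here uses simplified stand-in types rather than the real roachpb/kvcoord definitions, LatencyFunc is assumed (for brevity) to be keyed by NodeID, and matchingTiers is a hypothetical helper; the actual implementation may differ.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// Simplified stand-ins for the real roachpb/kvcoord types.
type NodeID int

type Locality struct{ Tiers []string }

type ReplicaInfo struct {
	NodeID   NodeID
	Locality Locality
}

type ReplicaSlice []ReplicaInfo

// LatencyFunc is assumed, for this sketch only, to be keyed by NodeID.
type LatencyFunc func(NodeID) (time.Duration, bool)

// matchingTiers is a hypothetical helper that counts how many leading
// locality tiers two localities have in common.
func matchingTiers(a, b Locality) int {
	n := 0
	for i := 0; i < len(a.Tiers) && i < len(b.Tiers) && a.Tiers[i] == b.Tiers[i]; i++ {
		n++
	}
	return n
}

// OptimizeReplicaOrder sketches the proposed generalization: a replica
// on the client's own node (if the NodeID is known) sorts first, then
// measured latency decides, and locality tier matching serves as a
// proxy for proximity when no latency samples are available.
func (rs ReplicaSlice) OptimizeReplicaOrder(
	nodeID NodeID, latencyFn LatencyFunc, locality Locality,
) {
	sort.Slice(rs, func(i, j int) bool {
		// nodeID == 0 means "unknown" (e.g. a SQL-only tenant pod) and is ignored.
		if nodeID != 0 && (rs[i].NodeID == nodeID) != (rs[j].NodeID == nodeID) {
			return rs[i].NodeID == nodeID
		}
		if latencyFn != nil {
			li, iok := latencyFn(rs[i].NodeID)
			lj, jok := latencyFn(rs[j].NodeID)
			if iok && jok && li != lj {
				return li < lj
			}
		}
		return matchingTiers(rs[i].Locality, locality) > matchingTiers(rs[j].Locality, locality)
	})
}

func main() {
	rs := ReplicaSlice{
		{NodeID: 1, Locality: Locality{Tiers: []string{"region=us-east"}}},
		{NodeID: 2, Locality: Locality{Tiers: []string{"region=eu-west"}}},
	}
	// A tenant SQL pod in eu-west: no NodeID and no latency samples, so
	// locality matching alone puts the eu-west replica first.
	rs.OptimizeReplicaOrder(0, nil, Locality{Tiers: []string{"region=eu-west"}})
	fmt.Println(rs[0].NodeID) // prints 2
}
```

Under this shape, a KV node passes its own NodeID and latency function and keeps the existing behavior, while a SQL-only tenant pod passes a zero NodeID plus its configured locality and still gets a sensible nearest-replica ordering instead of a shuffle.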

@arulajmani arulajmani self-assigned this Jul 27, 2022
arulajmani added a commit to arulajmani/cockroach that referenced this issue Aug 9, 2022

The dist sender uses node locality information to rank replicas of a
range by latency. Previously, this node locality information was read
off a node descriptor available in Gossip. Unfortunately, secondary
tenants do not have access to Gossip, and as such would end up
randomizing this list of replicas. This manifested itself as
unpredictable latencies when running follower reads.

With this patch we are no longer susceptible to this hazard: we eschew
the need for a node descriptor from gossip in the DistSender and
instead instantiate the DistSender with locality information.

However, we do still use Gossip to get the current node's ID when
ranking replicas. This is done to ascertain whether there is a local
replica, and if there is, to always route to it. Unfortunately, because
secondary tenants don't have access to Gossip, they can't conform to
these semantics. They're susceptible to a hazard where a request may be
routed to another replica in the same locality tier as the client even
though the client has a local replica as well. This shouldn't be a
concern in practice given the diversity heuristic.

Resolves cockroachdb#81000

Release note (bug fix): fix an issue where secondary tenants could
route follower reads to a random, far-away replica instead of a
closer one.
arulajmani added a commit to arulajmani/cockroach that referenced this issue Aug 9, 2022
arulajmani added a commit to arulajmani/cockroach that referenced this issue Aug 10, 2022

The dist sender uses node locality information to rank replicas of a
range by latency. Previously, this node locality information was read
off a node descriptor available in Gossip. Unfortunately, secondary
tenants do not have access to Gossip, and as such would end up
randomizing this list of replicas. This manifested itself as
unpredictable latencies when running follower reads.

With this patch we are no longer susceptible to this hazard: we eschew
the need for a node descriptor from gossip in the DistSender and
instead instantiate the DistSender with locality information.

However, we do still use Gossip to get the current node's ID when
ranking replicas. This is done to ascertain whether there is a local
replica, and if there is, to always route to it. Unfortunately, because
secondary tenants don't have access to Gossip, they can't conform to
these semantics. They're susceptible to a hazard where a request may be
routed to another replica in the same locality tier as the client even
though the client has a local replica as well. This shouldn't be a
concern in practice given the diversity heuristic, and given that
tenant SQL pods don't run in-process with KV nodes.

Resolves cockroachdb#81000

Release note (bug fix): fix an issue where secondary tenants could
route follower reads to a random, far-away replica instead of a
closer one.
arulajmani added a commit to arulajmani/cockroach that referenced this issue Aug 10, 2022
craig bot pushed a commit that referenced this issue Aug 14, 2022
85853: kv: ensure secondary tenants route follower reads to the closest replica r=arulajmani a=arulajmani

85878: gcjob: issue DeleteRange tombstones and then wait for GC r=ajwerner a=ajwerner

Note that this does not change anything about tenant GC.

Fixes #70427

Release note (sql change): The asynchronous garbage collection process has
been changed such that very soon after dropping a table, index, or database, or
after refreshing a materialized view, the system will issue range deletion
tombstones over the dropped data. These tombstones will result in the KV
statistics properly counting these bytes as garbage. Before this change, the
asynchronous "gc job" would wait out the TTL and then issue a lower-level
operation to clear out the data. That meant that while the job was waiting
out the TTL, the data would appear in the statistics to still be live. This
was confusing.

Co-authored-by: Arul Ajmani <arulajmani@gmail.com>
Co-authored-by: Andrew Werner <awerner32@gmail.com>
@craig craig bot closed this as completed in 3d87dde Aug 14, 2022