kv: unexpected preemptive transfer of leases upon adding new nodes #51867
@bobvawter thanks for raising this issue. Can you add your reproduction steps so we can test this exact scenario?
export CLUSTER=$USER-leases
./roachprod create $CLUSTER -n 10
./roachprod stage $CLUSTER release v20.1.1
./roachprod start $CLUSTER:1-3 --secure -a "--locality=cloud=gce,continent=na,zone=us-east1-b"
./roachprod start $CLUSTER:4-6 --secure -a "--locality=cloud=gce,continent=na,zone=us-west1-b"
./roachprod start $CLUSTER:7-9 --secure -a "--locality=cloud=gce,continent=eu,zone=europe-west1-b"
./roachprod sql $CLUSTER:1
create user me with password 'HelloWorld!';
grant admin to me;
# Load some data into the cluster
./roachprod ssh $CLUSTER:10
./cockroach workload init tpcc --drop --data-loader import --warehouses 500 'postgres://root@10.142.0.158:26257?sslcert=certs%2Fnode.crt&sslkey=certs%2Fnode.key&sslrootcert=certs%2Fca.crt&sslmode=verify-full'
# Back in a SQL shell: move the leases and wait for the DB to settle
ALTER DATABASE tpcc CONFIGURE ZONE USING
constraints = '[]',
lease_preferences = '[[+continent=na]]';
# Add a new node and watch the leases move
./roachprod start $CLUSTER:10 --secure -a "--locality=cloud=gce,continent=na,zone=us-west1-b"
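Not part of the original repro, but a small watcher like the following makes the lease movement visible while the new node joins. It is a sketch under a few assumptions: it reuses the me/HelloWorld! user created above, uses a placeholder node address, and polls the crdb_internal.ranges virtual table (which reports the current leaseholder for each range); adjust the connection string and polling interval as needed.

```go
// watchleases is an illustrative helper (not part of the original repro)
// that periodically prints how many range leases each leaseholder owns.
package main

import (
	"database/sql"
	"fmt"
	"log"
	"time"

	_ "github.com/lib/pq" // Postgres-wire driver; CockroachDB speaks pgwire.
)

func main() {
	// Placeholder connection string: point it at any node in the cluster.
	db, err := sql.Open("postgres",
		"postgresql://me:HelloWorld!@<node-addr>:26257/tpcc?sslmode=require")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	for {
		rows, err := db.Query(`
			SELECT lease_holder, count(*)
			FROM crdb_internal.ranges
			GROUP BY lease_holder
			ORDER BY lease_holder`)
		if err != nil {
			log.Fatal(err)
		}
		for rows.Next() {
			var holder, leases int
			if err := rows.Scan(&holder, &leases); err != nil {
				log.Fatal(err)
			}
			fmt.Printf("leaseholder %d: %d leases\n", holder, leases)
		}
		rows.Close()
		fmt.Println("---")
		time.Sleep(10 * time.Second)
	}
}
```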
I've been digging into this one a bit. My main observation is that we're moving these leases as part of the simulation for rebalancing to the new node. All of the nodes decide they want to remove themselves for various reasons; some of it seems random, and some of it comes down to hard-to-understand calculations around convergence. We end up transferring the leases extremely quickly, but we can't move the data nearly that fast. Even in less dramatic cases than this, in a single region, we see severe oscillations due to this very code path. My sense is that what would fix it is to make the …
tl;dr: I think we should make this better in the short term with an inbound PR that just slows the rate at which we transfer away leases for the purpose of rebalancing; long term, we should centralize the rebalancing planning in a whole new allocator scheme.

Alright, I'm becoming increasingly convinced that the shedding of leases inside the locality of the node being added is going to be quite hard, or at least subtle, to fix. We generally seem to prioritize rebalancing over lease transfers for lease balancing. In that world, the ranges which will always shed their leases when simulating rebalancing to the new node are the ones in the same locality.

As noted above, I think a more severe problem in some ways is the rate at which these nodes can shed their leases. This is actually worse in cases where there is just one locality: the leaseholder will shed its lease far faster than the new leaseholder will rebalance the range away. We can see some stores shedding several leases per second.

A repro with the transfer rate capped according to our usual lease transfer max rate still shows leases drained away before coming back to the new node's locality, but it's less severe. To me, the more compelling part of this prototype is that it remedies the severe fluctuations in leases in the single-locality case. Customers with active workloads have observed much more dramatic oscillations.
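A minimal sketch of the capping idea described above, under my own assumption (not taken from the actual patch) that the cap is simply a per-store minimum interval between rebalancing-driven lease transfers. The real allocator code is considerably more involved, but the effect is the same: the simulation may say "shed this lease" on every pass, yet a transfer only goes through once the interval has elapsed.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// leaseTransferLimiter is a hypothetical stand-in for the per-store cap:
// at most one rebalancing-driven lease transfer per minInterval, no matter
// how often the rebalance simulation decides the store should remove itself.
type leaseTransferLimiter struct {
	mu           sync.Mutex
	lastTransfer time.Time
	minInterval  time.Duration
}

// maybeTransfer reports whether a transfer may proceed now; if not, the
// caller keeps the lease and will be asked again on a later pass.
func (l *leaseTransferLimiter) maybeTransfer(now time.Time) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	if now.Sub(l.lastTransfer) < l.minInterval {
		return false
	}
	l.lastTransfer = now
	return true
}

func main() {
	lim := &leaseTransferLimiter{minInterval: time.Second}
	// The decision loop fires far more often than data can actually move.
	for i := 0; i < 8; i++ {
		fmt.Println("transfer allowed:", lim.maybeTransfer(time.Now()))
		time.Sleep(300 * time.Millisecond)
	}
}
```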
54322: kvserver: observe lease transfer interval in considerRebalance r=ajwerner a=ajwerner

This commit plumbs the lease transfer rate limit into the lease transfer which occurs as a part of rebalancing decision making. The decision about whether a node should remove itself for the purpose of rebalancing happens at a much higher rate than the actual rebalancing. This can lead to massive oscillations of leaseholders as the calculation of which replica should be removed in the face of node addition changes dramatically.

We utilize a rate limit in lease transfer decisions to smooth this decision making elsewhere; however, this limit is not currently utilized when deciding to transfer a lease for the purpose of rebalancing data. This change has shown a very positive impact mitigating these lease transfer storms. See the commentary on #51867.

Release note (bug fix): Made lease transfers during rebalancing adhere to the rate limit utilized in other lease transfer cases, which eliminates unexpected lease oscillations when adding a new node.

Co-authored-by: Andrew Werner <ajwerner@cockroachlabs.com>
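To make the rate mismatch in the commit message concrete, here is a toy simulation (not CockroachDB code) that counts how often the "remove myself" decision fires versus how many transfers a one-per-second cap actually admits. It uses the generic golang.org/x/time/rate token-bucket limiter with illustrative numbers.

```go
package main

import (
	"fmt"
	"time"

	"golang.org/x/time/rate"
)

func main() {
	// Allow at most one rebalancing-driven lease transfer per second,
	// with a burst of one. The numbers are illustrative only.
	limiter := rate.NewLimiter(rate.Every(time.Second), 1)

	decisions, transfers := 0, 0
	deadline := time.Now().Add(3 * time.Second)
	for time.Now().Before(deadline) {
		decisions++
		// The rebalance simulation says "shed this lease" on every pass,
		// but the limiter only admits a transfer once per interval.
		if limiter.Allow() {
			transfers++
		}
		time.Sleep(50 * time.Millisecond) // decision loop runs frequently
	}
	fmt.Printf("decisions made: %d, transfers allowed: %d\n", decisions, transfers)
}
```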
#61038 does not completely solve #51867, but it does dramatically reduce the impact. It was a two-level problem. Ideally, when adding a new node, leases and data would flow to that node, and at no point would the leases or data on other nodes diminish below the steady state toward which the cluster will eventually converge. My change doesn't make that the case.

However, the bigger problem here, on some level, was that a very big storm of lease transfers is quite bad for latency and availability. The dynamics above, where the current leaseholder would transfer leases away because of a placement decision, were not being governed by our lease transfer rate limiting. As such, stores were shedding leases far faster than the lease rebalancing logic could react. At that point the lease counts would get way out of balance and the lease rebalancing logic would get aggressive. When the lease transfers are slower than the information propagation, things remain pretty smooth and latency remains largely unaffected.

All that being said, if you're moving a bunch of data around reasonably slowly, then the leases are going to be unbalanced for a time. This is one of those deeper problems with a decentralized, heuristic-based placement driver.
Merging this issue into #67740, as both issues have identified the same behavior.
Impact: causes performance disruption during cluster scale-up
Background
An enterprise customer was running a "hot" cluster and, while adding additional nodes, saw unexpected lease transfers that exacerbated CPU use on heavily-loaded nodes. There was no data loss or downtime, but the concern is that a "galloping stampede" failure is made more likely by this behavior.
The customer can reliably reproduce this behavior, and I have reproduced it in my own testing as well.
Observation
Adding a node in one locality appears to cause all of the other nodes in that locality to drain away their leases. In the graph I captured (not included here), I added a new node to DC1, and the leases-per-store values of the DC1 nodes can be seen going down as their leases transfer to DC2.
There is also a DC3, but the localities and lease preferences are configured as follows. Notice that the lease preference (+continent=na) winds up grouping DC1 and DC2 together.
DC1: cloud=gce,continent=na,zone=us-east1-b
DC2: cloud=gce,continent=na,zone=us-west1-b
DC3: cloud=gce,continent=eu,zone=europe-west2-b
The leases eventually rebalance.
This behavior occurs with no load, light load, and in over-loaded cases.
Tested with 20.1.1.
gz#5876