kv: unexpected preemptive transfer of leases upon adding new nodes #51867
@bobvawter thanks for raising this issue. Can you add your reproduction steps so we can test this exact scenario?
export CLUSTER=$USER-leases
./roachprod create $CLUSTER -n 10
./roachprod stage $CLUSTER release v20.1.1
./roachprod start $CLUSTER:1-3 --secure -a "--locality=cloud=gce,continent=na,zone=us-east1-b"
./roachprod start $CLUSTER:4-6 --secure -a "--locality=cloud=gce,continent=na,zone=us-west1-b"
./roachprod start $CLUSTER:7-9 --secure -a "--locality=cloud=gce,continent=eu,zone=europe-west1-b"
./roachprod sql $CLUSTER:1
create user me with password 'HelloWorld!';
grant admin to me;
# Load some data into the cluster
./roachprod ssh $CLUSTER:10
./cockroach workload init tpcc --drop --data-loader import --warehouses 500 'postgres://root@10.142.0.158:26257?sslcert=certs%2Fnode.crt&sslkey=certs%2Fnode.key&sslrootcert=certs%2Fca.crt&sslmode=verify-full'
# Back in a SQL shell: move the leases and wait for the DB to settle
ALTER DATABASE tpcc CONFIGURE ZONE USING
constraints = '[]',
lease_preferences = '[[+continent=na]]';
# Add a new node and watch the leases move
./roachprod start $CLUSTER:10 --secure -a "--locality=cloud=gce,continent=na,zone=us-west1-b"
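Not part of the original repro, but a small watcher like the following makes the lease movement visible while the new node joins. It is a sketch under a few assumptions: it reuses the me/HelloWorld! user created above, uses a placeholder node address, and polls the crdb_internal.ranges virtual table (which reports the current leaseholder for each range); adjust the connection string and polling interval as needed.

```go
// watchleases is an illustrative helper (not part of the original repro)
// that periodically prints how many range leases each leaseholder owns.
package main

import (
	"database/sql"
	"fmt"
	"log"
	"time"

	_ "github.com/lib/pq" // Postgres-wire driver; CockroachDB speaks pgwire.
)

func main() {
	// Placeholder connection string: point it at any node in the cluster.
	db, err := sql.Open("postgres",
		"postgresql://me:HelloWorld!@<node-addr>:26257/tpcc?sslmode=require")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	for {
		rows, err := db.Query(`
			SELECT lease_holder, count(*)
			FROM crdb_internal.ranges
			GROUP BY lease_holder
			ORDER BY lease_holder`)
		if err != nil {
			log.Fatal(err)
		}
		for rows.Next() {
			var holder, leases int
			if err := rows.Scan(&holder, &leases); err != nil {
				log.Fatal(err)
			}
			fmt.Printf("leaseholder %d: %d leases\n", holder, leases)
		}
		rows.Close()
		fmt.Println("---")
		time.Sleep(10 * time.Second)
	}
}
```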
I've been digging into this one a bit. My main observation is that we're moving these leases as part of the simulation for rebalancing to the new node. All of the nodes decide they want to remove themselves for various reasons; some of it seems random, and some of it comes down to hard-to-understand calculations around convergence. We end up transferring the leases extremely quickly, but we can't move the data nearly that fast. Even in less dramatic cases than this, in a single region, we see severe oscillations due to this very code path. My sense is that what would fix it is to make the …
tl;dr: I think we should make this better in the short term with an inbound PR that just slows the rate at which we transfer away leases for the purpose of rebalancing; long term, we should centralize the rebalancing planning in a whole new allocator scheme.

Alright, I'm becoming increasingly convinced that the shedding of leases inside the locality of the node being added is going to be quite hard, or at least subtle, to fix. We generally seem to prioritize rebalancing over lease transfers for lease balancing. In that world, the ranges which will always shed their leases when simulating rebalancing to the new node are the ones in the same locality.

As noted above, I think a more severe problem in some ways is the rate at which these nodes can shed their leases. This is actually worse in cases where there is just one locality: the leaseholder will shed its lease far faster than the new leaseholder will rebalance the range away. We can see some stores shedding several leases per second.

A repro with the transfer rate capped according to our usual lease transfer max rate still shows leases drained away before coming back to the new node's locality, but it's less severe. To me, the more compelling part of this prototype is that it remedies the severe fluctuations in leases in the single-locality case. Customers with active workloads have observed much more dramatic oscillations.
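A minimal sketch of the capping idea described above, under my own assumption (not taken from the actual patch) that the cap is simply a per-store minimum interval between rebalancing-driven lease transfers. The real allocator code is considerably more involved, but the effect is the same: the simulation may say "shed this lease" on every pass, yet a transfer only goes through once the interval has elapsed.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// leaseTransferLimiter is a hypothetical stand-in for the per-store cap:
// at most one rebalancing-driven lease transfer per minInterval, no matter
// how often the rebalance simulation decides the store should remove itself.
type leaseTransferLimiter struct {
	mu           sync.Mutex
	lastTransfer time.Time
	minInterval  time.Duration
}

// maybeTransfer reports whether a transfer may proceed now; if not, the
// caller keeps the lease and will be asked again on a later pass.
func (l *leaseTransferLimiter) maybeTransfer(now time.Time) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	if now.Sub(l.lastTransfer) < l.minInterval {
		return false
	}
	l.lastTransfer = now
	return true
}

func main() {
	lim := &leaseTransferLimiter{minInterval: time.Second}
	// The decision loop fires far more often than data can actually move.
	for i := 0; i < 8; i++ {
		fmt.Println("transfer allowed:", lim.maybeTransfer(time.Now()))
		time.Sleep(300 * time.Millisecond)
	}
}
```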
54322: kvserver: observe lease transfer interval in considerRebalance r=ajwerner a=ajwerner

This commit plumbs the lease transfer rate limit into the lease transfer which occurs as a part of rebalancing decision making. The decision about whether a node should remove itself for the purpose of rebalancing happens at a much higher rate than the actual rebalancing. This can lead to massive oscillations of leaseholders as the calculation of which replica should be removed in the face of node addition changes dramatically.

We utilize a rate limit in lease transfer decisions to smooth this decision making elsewhere; however, this limit is not currently utilized when deciding to transfer a lease for the purpose of rebalancing data. This change has shown a very positive impact mitigating these lease transfer storms. See the commentary on #51867.

Release note (bug fix): Made lease transfers during rebalancing adhere to the rate limit utilized in other lease transfer cases, which eliminates unexpected lease oscillations when adding a new node.

Co-authored-by: Andrew Werner <ajwerner@cockroachlabs.com>
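To make the rate mismatch in the commit message concrete, here is a toy simulation (not CockroachDB code) that counts how often the "remove myself" decision fires versus how many transfers a one-per-second cap actually admits. It uses the generic golang.org/x/time/rate token-bucket limiter with illustrative numbers.

```go
package main

import (
	"fmt"
	"time"

	"golang.org/x/time/rate"
)

func main() {
	// Allow at most one rebalancing-driven lease transfer per second,
	// with a burst of one. The numbers are illustrative only.
	limiter := rate.NewLimiter(rate.Every(time.Second), 1)

	decisions, transfers := 0, 0
	deadline := time.Now().Add(3 * time.Second)
	for time.Now().Before(deadline) {
		decisions++
		// The rebalance simulation says "shed this lease" on every pass,
		// but the limiter only admits a transfer once per interval.
		if limiter.Allow() {
			transfers++
		}
		time.Sleep(50 * time.Millisecond) // decision loop runs frequently
	}
	fmt.Printf("decisions made: %d, transfers allowed: %d\n", decisions, transfers)
}
```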
#61038 does not completely solve #51867, but it does dramatically reduce the impact. It was a two-level problem. Ideally, when adding a new node, leases and data would flow to that node, and at no point would the leases or data on other nodes diminish below the steady state toward which the cluster will eventually converge. My change doesn't make that the case.

However, the bigger problem here, on some level, was that a very big storm of lease transfers is quite bad for latency and availability. The dynamics above, where the current leaseholder would transfer leases away because of a placement decision, were not being governed by our lease transfer rate limiting. As such, stores were shedding leases far faster than the lease rebalancing logic could react. At that point the lease counts would get way out of balance and the lease rebalancing logic would get aggressive. When the lease transfers are slower than the information propagation, things remain pretty smooth and latency remains largely unaffected.

All that being said, if you're moving a bunch of data around reasonably slowly, then the leases are going to be unbalanced for a time. This is one of those deeper problems with a decentralized, heuristic-based placement driver.
Merging this issue into #67740, as both issues have identified the same behavior.
Impact: causes performance disruption during cluster scale-up
Background
An enterprise customer was running a "hot" cluster and, while adding additional nodes, saw unexpected lease transfers that exacerbated CPU use on heavily-loaded nodes. There was no data loss or downtime, but the concern is that a "galloping stampede" failure is made more likely by this behavior.
The customer can reliably reproduce this behavior, and I have reproduced it in my own testing as well.
Observation
Adding a node in one locality appears to cause all of the other nodes in that locality to drain away their leases. In the graph I captured (not included here), I added a new node to DC1, and the leases-per-store values of the DC1 nodes can be seen going down as their leases transfer to DC2.
There is also a DC3, but the localities and lease preferences are configured as follows. Notice that the lease preference (+continent=na) winds up grouping DC1 and DC2 together.
DC1: cloud=gce,continent=na,zone=us-east1-b
DC2: cloud=gce,continent=na,zone=us-west1-b
DC3: cloud=gce,continent=eu,zone=europe-west2-b
The leases eventually rebalance.
This behavior occurs with no load, light load, and in over-loaded cases.
Tested with 20.1.1.
gz#5876