[Bug] leader balance doesn't work well #5669
Comments
In your example, what is the minLoad of h0? Is it 18?
I think the scenario you describe does exist: h0 only overlaps with h1, h2, h3, and h4, and they all have 18 leaders. But do we really need to make it a perfect 18?
Yes, minLoad is 18 and maxLoad is 19.
When the cluster is under high access pressure, for example when a server's CPU usage is nearly full, the client will receive many errors because one or more machines are under higher pressure while others may still have headroom. I think it would be better if each server held exactly 18 leaders, and if that can be done easily there is no harm, so it is worth doing.
We are observing this imbalance in v3.6.0. Below is our cluster info after several BALANCE LEADER attempts:
@songqing you have only 8 hosts; aren't you supposed to have an odd number of hosts for Raft?
I think the host count has nothing to do with leader distribution; both odd and even numbers are fine. The leader balance algorithm is the key problem.
Maybe for distribution, but aren't you supposed to have an odd number of hosts? In any case, this leader imbalance is hurting performance very badly on a huge graph. Our space has a total vertex count of 2.8 billion.
The number of metad hosts should be odd; storaged has no such limitation, I think.
Yes, we can have an even number of storage hosts; what should be odd is the replica factor for spaces.
Describe the bug (required)
In our cluster there are 8 hosts and each host holds 54 partitions; as the replica factor is 3, each host should have 18 leaders on average.
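A quick arithmetic check of that average (a minimal sketch; the numbers are taken from this report):

```cpp
// Sanity check of the expected per-host leader count,
// using the numbers from this report.
#include <iostream>

int main() {
    int hosts = 8;
    int partitionsPerHost = 54;  // partition replicas held by each host
    int replicaFactor = 3;

    // 8 * 54 = 432 replicas in total, i.e. 432 / 3 = 144 partitions,
    // and therefore 144 leaders spread across 8 hosts.
    int totalPartitions = hosts * partitionsPerHost / replicaFactor;  // 144
    std::cout << totalPartitions / hosts << " leaders per host\n";    // 18
}
```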
However, after leader balance, the leader distribution across hosts is 15, 18, 18, 18, 18, 19, 19, 19; call the hosts h0 through h7.
I think the balance result is not good enough; can we balance further so that each host has exactly 18 leaders?
One more detail: the partition peers of h0 are only h1, h2, h3, and h4, and those four hosts have 18 leaders each.
The leader balance code is here; it seems that when h0 tries to take a leader from h1, h2, h3, or h4, the transfer fails because the condition `minLoad < sourceLeaders.size()` is not met.
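Here is a minimal sketch of why that check gets stuck in this topology; the helper name is hypothetical, and only the quoted condition mirrors the actual balancer code:

```cpp
// Minimal reproduction of the stuck condition (hypothetical helper name;
// only the condition itself comes from the balancer code).
#include <cstddef>
#include <iostream>
#include <vector>

bool canTakeLeaderFrom(std::size_t minLoad, const std::vector<int>& sourceLeaders) {
    // The balancer only steals a leader when the source host holds
    // strictly more leaders than the minimum load.
    return minLoad < sourceLeaders.size();
}

int main() {
    std::size_t minLoad = 18;               // reported minLoad
    std::vector<int> sourceLeaders(18, 0);  // h1..h4 each hold 18 leaders

    // 18 < 18 is false, so h0 (15 leaders) can never take a leader from
    // any of its peers, while h5..h7 keep 19 leaders each.
    std::cout << std::boolalpha
              << canTakeLeaderFrom(minLoad, sourceLeaders) << "\n";  // false
}
```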
So maybe we need a better strategy for leader balance: for example, instead of focusing only on a partition's peers, consider the whole cluster when deciding transfers, as in the sketch below.
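For illustration, here is a toy simulation of one cluster-wide strategy: chain transfers through a shared peer, so that a leader can effectively hop from an overloaded host to h0 even though the two share no partition directly. The pairings (which overloaded host chains through which peer) are assumed, and this is not the actual balancer API:

```cpp
// Toy simulation of chained leader transfers (all pairings assumed).
#include <iostream>
#include <map>
#include <string>

int main() {
    std::map<std::string, int> leaders = {
        {"h0", 15}, {"h1", 18}, {"h2", 18}, {"h3", 18},
        {"h4", 18}, {"h5", 19}, {"h6", 19}, {"h7", 19}};

    // Assumed: h5/h6/h7 each share a partition with one of h0's peers,
    // so each chain moves one leader overloaded host -> peer -> h0.
    const char* hops[][2] = {{"h5", "h1"}, {"h1", "h0"},
                             {"h6", "h2"}, {"h2", "h0"},
                             {"h7", "h3"}, {"h3", "h0"}};
    for (auto& hop : hops) {
        --leaders[hop[0]];  // source gives up one leader
        ++leaders[hop[1]];  // destination takes it over
    }
    for (auto& [host, n] : leaders)
        std::cout << host << ": " << n << "\n";  // every host ends at 18
}
```

A real implementation would also have to verify that every hop is a legal Raft leader transfer within one partition's replica set, but the simulation shows that a perfect 18-per-host split is reachable once transfers are planned cluster-wide.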
Your Environments (required)
uname -a
g++ --version
or clang++ --version
lscpu
Commit id (e.g. a3ffc7d8)
How To Reproduce (required)
Steps to reproduce the behavior:
Expected behavior
Additional context