[Bug] leader balance don't work well #5669

Open
songqing opened this issue Aug 10, 2023 · 9 comments
Labels
type/enhancement Type: make the code neat or more efficient

Comments

@songqing
Contributor

Describe the bug (required)

Our cluster has 8 hosts, each holding 54 partition replicas. With a replica factor of 3, that is 144 partitions, so each host should have 18 leaders on average.
However, after leader balance, the leader distribution across hosts h0 through h7 is 15, 18, 18, 18, 18, 19, 19, 19.
I think this result is not good enough. Can the balancer bring every host to 18 leaders?

One more detail: the partitions on h0 have peers only on h1, h2, h3, and h4, and those 4 hosts have 18 leaders each.
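The expected figure can be checked with a little arithmetic. The sketch below uses only the numbers from the report; the variable names are chosen here for illustration.

```python
# Figures taken from the report; variable names are illustrative only.
hosts = 8
replicas_per_host = 54
replica_factor = 3

# Each partition has `replica_factor` replicas, so divide the total replica count.
partitions = hosts * replicas_per_host // replica_factor
ideal_per_host = partitions // hosts

observed = [15, 18, 18, 18, 18, 19, 19, 19]  # h0 .. h7

# Every partition has exactly one leader, so the counts must add up.
assert sum(observed) == partitions

# Signed distance of each host from the ideal leader count.
deviation = {f"h{i}": n - ideal_per_host for i, n in enumerate(observed)}
print(partitions, ideal_per_host, deviation)
```

Only h0 is far from the ideal: it is three leaders short, while h5, h6, and h7 each hold one extra.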

Looking at the leader balance code, it seems that when h0 tries to take a leader from h1, h2, h3, or h4, the transfer fails because the condition "minLoad < sourceLeaders.size()" is not met.
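To see why h0 is stuck, the guard can be modelled in a few lines. This is a simplified sketch, not the NebulaGraph implementation: only the comparison "minLoad < sourceLeaders.size()" comes from the report, the value 18 for minLoad is the one given in this thread, and the function name `can_take_from` is hypothetical.

```python
# Simplified model of the quoted guard; NOT the actual NebulaGraph code.
MIN_LOAD = 18  # minLoad as stated in this thread

# Leader counts from the report.
leaders = {"h0": 15, "h1": 18, "h2": 18, "h3": 18,
           "h4": 18, "h5": 19, "h6": 19, "h7": 19}

# Hosts that share partitions with h0, per the report.
h0_peers = ["h1", "h2", "h3", "h4"]


def can_take_from(source, leaders, min_load=MIN_LOAD):
    """Hypothetical helper: a source may give up a leader only if it
    currently holds more than min_load leaders."""
    return min_load < leaders[source]


# Peers of h0 that are allowed to hand a leader over.
eligible_sources = [h for h in h0_peers if can_take_from(h, leaders)]
print(eligible_sources)
```

Every peer of h0 sits exactly at minLoad, so the comparison 18 < 18 is false for all of them. The hosts that do pass the guard (h5, h6, h7) share no partition with h0.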

So we may need a better strategy for leader balance: one that considers the whole cluster instead of only a partition's peers.
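One possible whole-cluster approach, offered here only as a sketch, is to chain transfers: an overloaded host hands a leader to an intermediate host, which in turn hands a different leader to the underloaded host. The partition layout below is a hypothetical toy example, not the cluster from the report, and `find_transfer_chain` is an illustrative name rather than an existing NebulaGraph function.

```python
from collections import Counter, deque

# Hypothetical toy layout: h5 and h0 share no partition, but h1 bridges them.
partitions = {
    "p1": {"leader": "h5", "peers": ["h5", "h1", "h6"]},
    "p2": {"leader": "h1", "peers": ["h1", "h0", "h2"]},
}


def find_transfer_chain(partitions, source, target):
    """Breadth-first search over hosts. An edge a -> b exists when a leads
    a partition that has b as a peer. Returns a list of
    (partition, from_host, to_host) hops, or None if no chain exists."""
    prev = {source: None}
    queue = deque([source])
    while queue:
        host = queue.popleft()
        if host == target:
            break
        for pid, part in partitions.items():
            if part["leader"] != host:
                continue
            for peer in part["peers"]:
                if peer not in prev:
                    prev[peer] = (pid, host)
                    queue.append(peer)
    if target not in prev:
        return None
    # Walk back from the target to rebuild the chain in order.
    hops = []
    node = target
    while prev[node] is not None:
        pid, from_host = prev[node]
        hops.append((pid, from_host, node))
        node = from_host
    hops.reverse()
    return hops


def leader_counts(partitions):
    """Number of partitions each host currently leads."""
    return Counter(part["leader"] for part in partitions.values())


counts_before = leader_counts(partitions)
chain = find_transfer_chain(partitions, "h5", "h0")
for pid, _from_host, to_host in chain:
    partitions[pid]["leader"] = to_host
counts_after = leader_counts(partitions)
print(chain, dict(counts_before), dict(counts_after))
```

The net effect of a chain is that the first host loses one leader, the last host gains one, and every intermediate host stays at the same count. Whether such chains exist in a real cluster depends on its actual partition placement.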

@songqing songqing added the type/bug Type: something is unexpected label Aug 10, 2023
@github-actions github-actions bot added affects/none PR/issue: this bug affects none version. severity/none Severity of bug labels Aug 10, 2023
@critical27
Contributor

> It seems that when h0 tries to take a leader from h1, h2, h3, or h4, the transfer fails because the condition "minLoad < sourceLeaders.size()" is not met.

In your example, what is the minLoad of h0? Is it 18?

@critical27
Contributor

I think the scenario you describe does exist: h0 only overlaps with h1, h2, h3, and h4, but they all have 18 leaders.

But do we really need to make it a perfect 18?

@songqing
Contributor Author

songqing commented Sep 1, 2023

>> It seems that when h0 tries to take a leader from h1, h2, h3, or h4, the transfer fails because the condition "minLoad < sourceLeaders.size()" is not met.

> In your example, what is the minLoad of h0? Is it 18?

Yes, minLoad is 18 and maxLoad is 19.

@songqing
Contributor Author

songqing commented Sep 1, 2023

> I think the scenario you describe does exist: h0 only overlaps with h1, h2, h3, and h4, but they all have 18 leaders.

> But do we really need to make it a perfect 18?

When the cluster is under heavy load, for example when the servers' CPU usage is nearly full, clients receive many errors because one or more machines are under higher pressure while other machines still have headroom.

I think it would be better if every server had exactly 18 leaders. If that can be done easily, I see no harm in it, so it is worth doing.

@porscheme

porscheme commented Sep 26, 2023

@wey-gu

We are observing this imbalance in v3.6.0. Our cluster info is below:
metad: 3
graphd: 3
storaged: 7
replicaFactor: 3
Number of partitions: 140

After several BALANCE LEADER attempts:

Expected leader distribution: 20, 20, 20, 20, 20, 20, 20
Actual leader distribution:   26, 26, 27, 15, 14, 17, 15
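To put a number on the gap between the two distributions, the following sketch counts the fewest leader transfers needed to reach an even layout. It assumes any leader could move to any host, which the peer constraint discussed earlier in this thread may not allow, so the result is a lower bound. The function name `min_transfers` is illustrative.

```python
def min_transfers(distribution):
    """Lower bound on the leader transfers needed to even out a distribution,
    assuming any leader could be moved to any host."""
    total = sum(distribution)
    hosts = len(distribution)
    base, extra = divmod(total, hosts)
    # In a balanced layout, `extra` hosts hold base + 1 leaders, the rest hold base.
    # Assign the larger targets to the currently largest hosts to minimise moves.
    targets = [base + 1] * extra + [base] * (hosts - extra)
    ordered = sorted(distribution, reverse=True)
    return sum(max(0, have - want) for have, want in zip(ordered, targets))


actual = [26, 26, 27, 15, 14, 17, 15]    # seven storaged hosts, 140 partitions
spread = max(actual) - min(actual)        # gap between busiest and idlest host
needed = min_transfers(actual)
print(spread, needed)
```

For the 8-host cluster from the original report (15, 18, 18, 18, 18, 19, 19, 19) the same function returns 3, so the distribution here is considerably further from balanced.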

@songqing you have only 8 hosts; aren't you supposed to have an odd number of hosts for Raft?

@songqing
Contributor Author

> @wey-gu

> We are observing this imbalance in v3.6.0. Our cluster info is below:
> metad: 3
> graphd: 3
> storaged: 7
> replicaFactor: 3
> Number of partitions: 140

> After several BALANCE LEADER attempts:

> Expected leader distribution: 20, 20, 20, 20, 20, 20, 20
> Actual leader distribution:   26, 26, 27, 15, 14, 17, 15

> @songqing you have only 8 hosts; aren't you supposed to have an odd number of hosts for Raft?

I think the number of hosts has nothing to do with the leader distribution; both odd and even numbers are OK. The leader balance algorithm is the key problem.

@porscheme

porscheme commented Sep 26, 2023

>> @songqing you have only 8 hosts; aren't you supposed to have an odd number of hosts for Raft?

> I think the number of hosts has nothing to do with the leader distribution; both odd and even numbers are OK. The leader balance algorithm is the key problem.

Maybe for distribution, but aren't you supposed to have an odd number of hosts?

In any case, this leader imbalance is affecting performance very badly on a huge graph. Our space has a total vertex count of 2.8 billion and a total edge count of 1 billion.

@songqing
Contributor Author

>>> @songqing you have only 8 hosts; aren't you supposed to have an odd number of hosts for Raft?

>> I think the number of hosts has nothing to do with the leader distribution; both odd and even numbers are OK. The leader balance algorithm is the key problem.

> Maybe for distribution, but aren't you supposed to have an odd number of hosts?

> In any case, this leader imbalance is affecting performance very badly on a huge graph. Our space has a total vertex count of 2.8 billion and a total edge count of 1 billion.

The number of metad hosts should be odd; I don't think storaged has this limitation.

@wey-gu
Contributor

wey-gu commented Sep 26, 2023

Yes, we can have an even number of storage hosts; what should be odd is the replica factor for spaces.
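The reason an odd replica factor is preferred follows from Raft's majority rule, which can be illustrated with a short calculation. This is general quorum arithmetic rather than NebulaGraph code, and the function names are illustrative.

```python
def quorum(replicas):
    """Smallest majority of a Raft group with `replicas` members."""
    return replicas // 2 + 1


def tolerated_failures(replicas):
    """Members that can fail while a majority is still reachable."""
    return replicas - quorum(replicas)


# Compare odd and even group sizes.
table = {n: (quorum(n), tolerated_failures(n)) for n in (3, 4, 5)}
for n, (q, f) in table.items():
    print(f"replicas={n} quorum={q} tolerated_failures={f}")
```

Going from 3 to 4 replicas raises the quorum without tolerating any additional failure, which is why even replica factors bring no benefit. The rule applies to each partition's Raft group, not to the number of storage hosts in the cluster.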

@QingZ11 QingZ11 added type/enhancement Type: make the code neat or more efficient and removed type/bug Type: something is unexpected severity/none Severity of bug affects/none PR/issue: this bug affects none version. labels Dec 18, 2023