Discuss about reverting #16937 "skip mis-configured resource usage(>100%) in load balancer" #18598

Technoboy- · 2022-11-24T09:26:02Z

Search before asking

I searched in the issues and found nothing similar.

Motivation

#16937 was meant to handle misconfigured resource usage. But if the user does configure such a value, the error log is printed all the time. See the logs below:

After digging into the modification, we found that it is a breaking change.
Before #16937 the test below passes, but after #16937 it fails:


    @Test
    public void testBrokerThreshold() {
        LoadData loadData = new LoadData();
        LocalBrokerData broker1 = new LocalBrokerData();
        broker1.setCpu(new ResourceUsage(70, 100));    // Need to set `loadBalancerCPUResourceWeight=2`
        broker1.setMemory(new ResourceUsage(10, 100));
        broker1.setDirectMemory(new ResourceUsage(10, 100));
        broker1.setBandwidthIn(new ResourceUsage(500, 1000));
        broker1.setBandwidthOut(new ResourceUsage(500, 1000));
        broker1.setBundles(Sets.newHashSet("bundle-1", "bundle-2"));
        broker1.setMsgThroughputIn(Double.MAX_VALUE);

        LocalBrokerData broker2 = new LocalBrokerData();
        broker2.setCpu(new ResourceUsage(10, 100));
        broker2.setMemory(new ResourceUsage(10, 100));
        broker2.setDirectMemory(new ResourceUsage(10, 100));
        broker2.setBandwidthIn(new ResourceUsage(500, 1000));
        broker2.setBandwidthOut(new ResourceUsage(500, 1000));
        broker2.setBundles(Sets.newHashSet("bundle-3", "bundle-4"));

        BundleData bundleData = new BundleData();
        TimeAverageMessageData timeAverageMessageData = new TimeAverageMessageData();
        timeAverageMessageData.setMsgThroughputIn(1000);
        timeAverageMessageData.setMsgThroughputOut(1000);
        bundleData.setShortTermData(timeAverageMessageData);
        loadData.getBundleData().put("bundle-1", bundleData);

        loadData.getBrokerData().put("broker-1", new BrokerData(broker1));
        loadData.getBrokerData().put("broker-2", new BrokerData(broker2));

        assertFalse(thresholdShedder.findBundlesForUnloading(loadData, conf).isEmpty());
    }

Here the real CPU usage is only 70%, but we configure loadBalancerCPUResourceWeight=2, so the weighted CPU usage is 140%. Before #16937 this caused the broker to unload some bundles. Now it does not.
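To make the arithmetic concrete, here is a minimal sketch of the weighted-usage calculation for broker1 in the test. It is not Pulsar's implementation: the class `WeightedUsageSketch`, the `Reading` type, and the `skipAbove100` switch are hypothetical names used only for illustration, and the "skip" branch is a simplified model of the behaviour described in this issue, not a copy of the #16937 code.

```java
import java.util.List;

// Hypothetical, simplified model of the arithmetic in this issue.
// NOT Pulsar's code: every name in this class is made up for illustration.
class WeightedUsageSketch {

    /** One resource reading: raw usage, its limit, and the configured weight. */
    static final class Reading {
        final String name;
        final double usage;
        final double limit;
        final double weight;

        Reading(String name, double usage, double limit, double weight) {
            this.name = name;
            this.usage = usage;
            this.limit = limit;
            this.weight = weight;
        }

        /** Usage as a percentage of the limit, scaled by the weight. */
        double weightedPercent() {
            return 100.0 * usage / limit * weight;
        }
    }

    /**
     * Returns the largest weighted percentage among the readings.
     * When skipAbove100 is true, values above 100% are ignored
     * (assumed model of the behaviour this issue attributes to #16937).
     */
    static double maxWeightedPercent(List<Reading> readings, boolean skipAbove100) {
        double max = 0.0;
        for (Reading r : readings) {
            double p = r.weightedPercent();
            if (skipAbove100 && p > 100.0) {
                continue; // value treated as misconfigured and dropped
            }
            max = Math.max(max, p);
        }
        return max;
    }

    /** broker1 from the test above, with a CPU weight of 2 and all other weights 1. */
    static List<Reading> broker1() {
        return List.of(
                new Reading("cpu", 70, 100, 2.0),
                new Reading("memory", 10, 100, 1.0),
                new Reading("directMemory", 10, 100, 1.0),
                new Reading("bandwidthIn", 500, 1000, 1.0),
                new Reading("bandwidthOut", 500, 1000, 1.0));
    }

    public static void main(String[] args) {
        System.out.println("weighted max, nothing skipped: " + maxWeightedPercent(broker1(), false));
        System.out.println("weighted max, >100% skipped:   " + maxWeightedPercent(broker1(), true));
    }
}
```

In this model, dropping the CPU value leaves bandwidth as the largest reading, so the broker looks far less loaded than the weight was meant to express.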

Since #6772 added support for configurable resource weights, #16937 breaks the cases that #6772 mentions:

It is hard to determine the threshold value; the default threshold is 85%. But a broker's max resource usage rarely reaches 85%, which leads to unbalanced traffic between brokers. The read cache hit rate of the heavily loaded broker will decrease.

When you restart most of the brokers in a Pulsar cluster at the same time, all the traffic in the cluster goes to the remaining brokers. The restarted brokers will have no traffic for a long time, because the max resource usage of the remaining brokers does not reach the threshold.

So I think we need to revert #16937.

Solution

No response

Alternatives

No response

Anything else?

No response

Are you willing to submit a PR?

I'm willing to submit a PR!

eolivelli · 2022-11-24T10:34:31Z

Unfortunately, #16937 has also been released in 2.10.2!

Do you mean that now you cannot configure a weight that makes the usage go over 100%?

Technoboy- · 2022-11-24T10:47:19Z

> Unfortunately, #16937 has also been released in 2.10.2!
>
> Do you mean that now you cannot configure a weight that makes the usage go over 100%?

Ah, suppose the real CPU usage is only 70%, but we configure loadBalancerCPUResourceWeight=2, so the weighted CPU usage is 140%. Before #16937 this caused the broker to unload some bundles. Now it does not.

eolivelli · 2022-11-24T13:11:34Z

#16937 added some useful logs.

The "problem" you are pointing out is here:
https://github.com/apache/pulsar/pull/16937/files#diff-e1bcbd73e100f8ab5f179644dc45ee684a77ec10edbfe17f957e77ee2a043417R197

I am not sure whether this is a real problem. Setting loadBalancerCPUResourceWeight=100 and seeing CPU usage as 2000% looks like a hack.

Maybe we can add a flag to allow the old behaviour.
I am not aware of production clusters that use this kind of hack.

Do you know of a legitimate use case?

Currently I lean toward keeping the current version and, at most, adding a flag to allow the previous behaviour if you are aware of users who could be hurt by this change.
(As I said before, this change is already in 2.10.2; if we revert here we should revert on branch-2.10 as well.)

codelipenghui · 2022-11-25T01:46:01Z

I discussed this issue with Hang a few days ago. Hang is the initial designer of the threshold shedder. The weight is not required to be <= 1. I think we merged #16937 by mistake. Users can use any non-negative number for the weight. I support reverting the PR first and then revisiting the issue that @heesung-sn wants to fix.

Technoboy- · 2022-11-25T01:50:28Z

> I am not sure that this is a real problem or not, it looks like that setting loadBalancerCPUResourceWeight= 100 and see CPU usage as 2000% is like a hack.

I think the demo test was not clear, so I have updated the test.

hangc0276 · 2022-11-25T02:04:43Z

In the Pulsar load balancing strategies, including OverloadShedder and ThresholdShedder, the weight of each resource is not guaranteed to be in [0, 1]. The weighted resource usage of a broker is therefore not guaranteed to be less than 100%.

> Incorrectly scaled resource load usage(cpu, memory, network usages bigger than 100%) can harm the load computation in the load balancer logics, as the load balancer computation expects all resource usages are normalized to the 100% scale.

Regarding this motivation of #16937: a resource weight that pushes usage above 100% is not a misconfiguration. The PR breaks the old behavior, and load balancing stops working for users who apply it in their cluster.

I support reverting this PR.

heesung-sn · 2022-11-25T03:56:42Z

Sure. If resourceUsage * weight is expected to be more than 100%, we should revert this change. The motivation does not hold.

Thanks for taking care of this.

heesung-sn · 2022-11-25T06:22:57Z

Note:

If a resourceUsage > 100% becomes the winner, the moving average function will decay it more slowly than the others (unfair signal treatment).

But I assume this is the intention of the weighted configs too.
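One way to read this note is sketched below, under stated assumptions: the average is updated as `previous * historyWeight + (1 - historyWeight) * current`, and the history weight is 0.9. Both the formula and the value are assumptions for illustration, and `MovingAverageSketch` and its methods are hypothetical names, not Pulsar code. Each update shrinks both averages by the same factor, so the one that starts higher needs more updates to fall back under a fixed level.

```java
// Hypothetical sketch of an exponentially weighted moving average.
// The update formula and the 0.9 history weight are assumptions, not Pulsar's code.
class MovingAverageSketch {

    /** One update step: blend the previous average with the current sample. */
    static double next(double previousAvg, double current, double historyWeight) {
        return previousAvg * historyWeight + (1 - historyWeight) * current;
    }

    /**
     * Counts how many updates are needed until the average drops below target,
     * assuming every new sample equals current. Capped to avoid an endless loop.
     */
    static int updatesUntilBelow(double startAvg, double current, double historyWeight, double target) {
        double avg = startAvg;
        int updates = 0;
        while (avg >= target && updates < 10_000) {
            avg = next(avg, current, historyWeight);
            updates++;
        }
        return updates;
    }

    public static void main(String[] args) {
        // Load drops to 10% in both cases; only the starting average differs.
        System.out.println("from 140%: " + updatesUntilBelow(140.0, 10.0, 0.9, 50.0) + " updates");
        System.out.println("from  70%: " + updatesUntilBelow(70.0, 10.0, 0.9, 50.0) + " updates");
    }
}
```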

Technoboy- · 2022-11-25T09:53:36Z

@lhotari @Jason918 Could you give some ideas for this?

Jason918 · 2022-11-25T15:25:02Z

+1 for reverting the PR. I don't think this feature (setting the weight over 1.0) is widely used, so let's just do this the right way.

Please do the same with the 2.10 branch and we should add an important-note label as it's a breaking change.

eolivelli · 2022-11-25T20:39:36Z

Thanks for your explanations. I agree to revert the patch here and in 2.10

eolivelli changed the title ~~Discuss about reverting #16937~~ Discuss about reverting #16937 "skip mis-configured resource usage(>100%) in load balancer" Nov 24, 2022

sijie mentioned this issue Nov 25, 2022

ISSUE-18598: Discuss about reverting #16937 "skip mis-configured resource usage(>100%) in load balancer" streamnative/pulsar-archived#5175

Open

2 tasks

codelipenghui mentioned this issue Nov 25, 2022

[improve][broker] PIP-192: Define new load manager base interfaces #18084

Merged

4 tasks

Technoboy- mentioned this issue Nov 27, 2022

[fix][broker] Revert "[fix][load-balancer] skip mis-configured resource usage(>100%) in load balancer #18645

Merged

4 tasks

Technoboy- closed this as completed Nov 29, 2022

Technoboy- mentioned this issue Nov 29, 2022

[improve][test] Add test brokerReachThreshold for ThresholdShedderTest #18664

Merged

4 tasks
