Discuss about reverting #16937 "skip mis-configured resource usage(>100%) in load balancer" #18598

Closed
1 of 2 tasks
Technoboy- opened this issue Nov 24, 2022 · 11 comments

Comments

@Technoboy-
Contributor

Technoboy- commented Nov 24, 2022

Search before asking

  • I searched in the issues and found nothing similar.

Motivation

#16937 changed how misconfigured resource usage is handled. But if a user's configuration is considered wrong, the error log is printed continuously. See the logs below:

[image: screenshot of the error logs]

After digging into the modification, we found that it is a breaking change.
Before #16937 the test below passed, but after #16937 it fails:


    @Test
    public void testBrokerThreshold() {
        LoadData loadData = new LoadData();
        LocalBrokerData broker1 = new LocalBrokerData();
        broker1.setCpu(new ResourceUsage(70, 100));    // Need to set `loadBalancerCPUResourceWeight=2`
        broker1.setMemory(new ResourceUsage(10, 100));
        broker1.setDirectMemory(new ResourceUsage(10, 100));
        broker1.setBandwidthIn(new ResourceUsage(500, 1000));
        broker1.setBandwidthOut(new ResourceUsage(500, 1000));
        broker1.setBundles(Sets.newHashSet("bundle-1", "bundle-2"));
        broker1.setMsgThroughputIn(Double.MAX_VALUE);

        LocalBrokerData broker2 = new LocalBrokerData();
        broker2.setCpu(new ResourceUsage(10, 100));
        broker2.setMemory(new ResourceUsage(10, 100));
        broker2.setDirectMemory(new ResourceUsage(10, 100));
        broker2.setBandwidthIn(new ResourceUsage(500, 1000));
        broker2.setBandwidthOut(new ResourceUsage(500, 1000));
        broker2.setBundles(Sets.newHashSet("bundle-3", "bundle-4"));

        BundleData bundleData = new BundleData();
        TimeAverageMessageData timeAverageMessageData = new TimeAverageMessageData();
        timeAverageMessageData.setMsgThroughputIn(1000);
        timeAverageMessageData.setMsgThroughputOut(1000);
        bundleData.setShortTermData(timeAverageMessageData);
        loadData.getBundleData().put("bundle-1", bundleData);

        loadData.getBrokerData().put("broker-1", new BrokerData(broker1));
        loadData.getBrokerData().put("broker-2", new BrokerData(broker2));

        assertFalse(thresholdShedder.findBundlesForUnloading(loadData, conf).isEmpty());
    }

Here the real CPU usage is only 70%, but we configure loadBalancerCPUResourceWeight=2, so the weighted CPU usage is 140%. Before #16937 this caused the broker to unload some bundles. Now it does not.
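The arithmetic behind this test can be sketched in plain Java. This is only an illustration under stated assumptions, not Pulsar's code: the class and helper names (`WeightedUsageSketch`, `max`, `maxWithinLimit`, `overloaded`) are hypothetical, the "ignore values above the limit" policy is an assumed reading of what #16937 does, and the shedding check is reduced to "score exceeds the average by more than a threshold" with an assumed threshold of 10%.

```java
class WeightedUsageSketch {

    // Weighted usage as a fraction: usage 70 of limit 100 with weight 2 gives 1.4 (140%).
    static double weighted(double usage, double limit, double weight) {
        return usage / limit * weight;
    }

    // Policy A: plain maximum; weighted values above 1.0 (100%) are allowed.
    static double max(double... values) {
        double max = 0.0;
        for (double v : values) {
            max = Math.max(max, v);
        }
        return max;
    }

    // Policy B (assumed reading of #16937): values above the limit are ignored.
    static double maxWithinLimit(double limit, double... values) {
        double max = 0.0;
        for (double v : values) {
            if (v <= limit) {
                max = Math.max(max, v);
            }
        }
        return max;
    }

    // Same numbers as the test: cpu, memory, direct memory, bandwidth in, bandwidth out.
    static double[] brokerUsages(double cpuUsed, double cpuWeight) {
        return new double[] {
            weighted(cpuUsed, 100, cpuWeight),
            weighted(10, 100, 1.0),
            weighted(10, 100, 1.0),
            weighted(500, 1000, 1.0),
            weighted(500, 1000, 1.0)
        };
    }

    // Simplified shedding check: overloaded if the score exceeds average + threshold.
    static boolean overloaded(double score, double average, double threshold) {
        return score > average + threshold;
    }

    public static void main(String[] args) {
        double cpuWeight = 2.0;   // loadBalancerCPUResourceWeight=2 from the test comment
        double threshold = 0.10;  // assumed shedding threshold, for illustration only

        double[] broker1 = brokerUsages(70, cpuWeight);
        double[] broker2 = brokerUsages(10, cpuWeight);

        double a1 = max(broker1);
        double a2 = max(broker2);
        double b1 = maxWithinLimit(1.0, broker1);
        double b2 = maxWithinLimit(1.0, broker2);

        System.out.println("plain max:  broker-1 overloaded = "
                + overloaded(a1, (a1 + a2) / 2, threshold));
        System.out.println("with limit: broker-1 overloaded = "
                + overloaded(b1, (b1 + b2) / 2, threshold));
    }
}
```

Under the plain maximum, broker-1 scores 140% and stands out from broker-2. Once values above 100% are ignored, both brokers score 50% and nothing looks overloaded, which matches the test no longer finding bundles to unload.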

Since #6772 added support for configurable resource weights, #16937 breaks the case that #6772 describes:

It is hard to determine the threshold value; the default threshold is 85%. But a broker's max resource usage rarely reaches 85%, which leads to unbalanced traffic between brokers. The read cache hit rate of the brokers with heavy traffic will decrease.

When you restart most of the brokers in a Pulsar cluster at the same time, all the traffic in the cluster goes to the remaining brokers. The restarted brokers will have no traffic for a long time, because the max resource usage of the remaining brokers does not reach the threshold.

So I think we need to revert #16937

Solution

No response

Alternatives

No response

Anything else?

No response

Are you willing to submit a PR?

  • I'm willing to submit a PR!
@eolivelli changed the title from `Discuss about reverting #16937` to `Discuss about reverting #16937 "skip mis-configured resource usage(>100%) in load balancer"` on Nov 24, 2022
@eolivelli
Contributor

Unfortunately, #16937 has also been released in 2.10.2!

Do you mean that you can no longer configure a weight that pushes the usage over 100%?

@Technoboy-
Contributor Author

Technoboy- commented Nov 24, 2022

> Unfortunately, #16937 has also been released in 2.10.2!
>
> Do you mean that you can no longer configure a weight that pushes the usage over 100%?

Ah, suppose the real CPU usage is only 70%, but we configure loadBalancerCPUResourceWeight=2, so the weighted CPU usage is 140%. Before #16937 this caused the broker to unload some bundles. Now it does not.

@eolivelli
Contributor

#16937 added some useful logs.

The "problem" you are pointing out is here
https://github.com/apache/pulsar/pull/16937/files#diff-e1bcbd73e100f8ab5f179644dc45ee684a77ec10edbfe17f957e77ee2a043417R197

I am not sure whether this is a real problem. Setting loadBalancerCPUResourceWeight=100 and seeing a CPU usage of 2000% looks like a hack.

Maybe we can add a flag to allow the old behaviour.
I am not aware of production clusters that rely on this kind of hack.

Do you know of a legitimate use case?

Currently I lean toward keeping the current version and, at most, adding a flag to allow the previous behaviour if you are aware of users who could be hurt by this change.
(As I said before, this change is already in 2.10.2; if we revert here, we should revert on branch-2.10 as well.)

@codelipenghui
Contributor

I discussed this issue with Hang a few days ago. Hang is the original designer of the threshold shedder. The weight is not required to be <= 1, so I think we merged #16937 by mistake. Users can use any non-negative number for the weight. I support reverting the PR first and then revisiting the issue that @heesung-sn wants to fix.

@Technoboy-
Contributor Author

> I am not sure whether this is a real problem. Setting loadBalancerCPUResourceWeight=100 and seeing a CPU usage of 2000% looks like a hack.

I think the demo test was not clear, so I have updated it.

@hangc0276
Contributor

In the Pulsar load balancing strategies, including OverloadShedder and ThresholdShedder, the weight of each resource is not guaranteed to be in [0, 1], and the total resourceUsage of each broker is not guaranteed to be less than 100%.

The motivation of #16937 states:

> Incorrectly scaled resource load usage(cpu, memory, network usages bigger than 100%) can harm the load computation in the load balancer logics, as the load balancer computation expects all resource usages are normalized to the 100% scale.

In that case the resource weight is not misconfigured. The PR breaks the old behavior, and load balancing stops working for users once they apply it in their clusters.

I support reverting this PR.

@heesung-sn
Contributor

Sure. If resourceUsage * weight is expected to be more than 100%, we should revert this change. The motivation does not hold.

Thanks for taking care of this.

@heesung-sn
Contributor

Note:

If a resourceUsage > 100% becomes the winner, the moving average function will decay it more slowly than the others (unfair signal treatment).

But I assume this is the intention of the weighted configs too.
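One way to read this note is sketched below. It assumes an exponential moving average of the form `history * w + sample * (1 - w)`; the class name, the helper names, and the values (history weight 0.9, new sample 50%, target 60%) are all hypothetical and are not taken from Pulsar. Under that assumption, a smoothed value that starts above 100% needs more update rounds to fall back under a given level than one that starts at 100%.

```java
class MovingAverageDecaySketch {

    // One smoothing step: keep historyWeight of the old value and blend in the new sample.
    static double smooth(double history, double sample, double historyWeight) {
        return history * historyWeight + sample * (1 - historyWeight);
    }

    // Count smoothing steps until the smoothed value drops below the target.
    // The step cap only guards against a target that can never be reached.
    static int stepsUntilBelow(double start, double sample, double target, double historyWeight) {
        double value = start;
        int steps = 0;
        while (value >= target && steps < 10_000) {
            value = smooth(value, sample, historyWeight);
            steps++;
        }
        return steps;
    }

    public static void main(String[] args) {
        double historyWeight = 0.9; // assumed smoothing factor
        double sample = 0.5;        // every new sample reports 50% usage
        double target = 0.6;        // level the smoothed value has to fall below

        System.out.println("starting at 140%: "
                + stepsUntilBelow(1.4, sample, target, historyWeight) + " steps");
        System.out.println("starting at 100%: "
                + stepsUntilBelow(1.0, sample, target, historyWeight) + " steps");
    }
}
```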

@Technoboy-
Contributor Author

@lhotari @Jason918 Could you share your thoughts on this?

@Jason918
Contributor

+1 for reverting the PR. I don't think this feature (setting the weight above 1.0) is widely used, so let's just do this the right way.

Please do the same on the 2.10 branch. We should also add an important-note label, as it's a breaking change.

@eolivelli
Contributor

Thanks for your explanations. I agree to revert the patch here and in 2.10.
