PIP-192: New Pulsar Broker Load Balancer #16691

heesung-sn · 2022-07-20T05:08:38Z

Proposal: New Pulsar Broker Load Balancer

Motivation

As previously shared with the community, we observed many improvement areas around the Pulsar load balancer[1]. Since the improvement requires significant changes, first, we would like to share the overall goals for this project and the high-level components to design. This doc will highlight the architecture of the new broker load balancer.

Goals

We set up the project goals in the following areas.

User-facing goals

Logic

Balance cluster utilization as uniform as possible with minimal delays

Logs / Metrics

Show transparent load balance decisions with logs and metrics.

Admin API / Configurations

Provide ways to override system decisions.
Reduce the number of configurations to tune.
Provide the better configuration default values with explanations.
(second-phase) Provide ways to set custom load balance strategy.

Internal Implementation goals

Logic

Keep the three major load balance logics but make them more efficient and faster. We will discuss detailed algorithm improvements separately.
- Topic(Bundle)-broker-assignment: improve the randomization and assignment distribution
- Bundle-split: revisit the current threshold-based strategy
- Bundle-unload: revisit the frequency of unloading

Implementation

Distribute bundle-broker assignment and bundle-split decisions to local brokers.
Synchronize bundle unloading decisions by the leader broker.
Reduce load data replication among brokers.
Replace load data’s metadata stores with topic table-views.
Remove the client dependency in the load balance logic.
- Remove the client redirection in the assignment logic.
- Add the bundle transfer option in the unload logic instead of relying on clients’ broker discovery calls
Minimize the topic unavailability from bundle unloading with the bundle transfer option
Introduce the bundle state channel(table-view) to make bundle load balance operations consistent and fault-tolerant among brokers.
Isolate the new load balancer code in new classes.
Replace the bundle ownership metadata store(ZK znodes) with the bundle state channel.

Logs / Metrics

Add meaningful logs and metrics for all major load balance events.
Add documentation about how to read load balancer metrics and logs.

Admin API / Configurations

Add an admin CLI to transfer a bundle to a specific broker.
Add necessary configurations to override the load balance decisions.
Dynamically adjust internal configuration thresholds based on the load data.
Make the admin APIs fault-tolerant and easy to monitor.

Testing

Document the testing plan and coverage status.
Add unit tests for happy and unhappy cases.
Add global load balance logic tests, and compare the current load manager with the new load manager.

API Changes

We will add the transfer unload option --dest to specifically unload the topic(bundle) to the destination broker.

pulsar-admin topics unload persistent://tenant/namespace/topic --dest ${destination_broker}

Implementation (High-Level Components)

New Load Manager

It refactors the existing load balance logic with better modularity.
It isolates the new code in the new classes without breaking the existing logic.
This new load manager will be disabled in the first releases until proven stable.

Load Data Models

LocalBrokerData: broker’s factual data

e.g.) {webServiceUrl, pulsarServiceUrl, …}
Persisted in MetadataStore(ZK)

BrokerLoadData: broker’s load data

e.g.) {cpu, memory, io, msgIn/Out, ...}
Published in BrokerLoadDataStore(TableView)

BundlesLoadData: bundle’s load data

e.g.) { bundleName, msgIn/Out, ...}
Cached in the local broker only

TopBundlesLoadData: top-n high-loaded bundle load data from the broker

e.g.) {brokerUrl, high_load_bundles :[{bundleName, …}], …}
Published in TopBundlesLoadDataStore(TableView)

Load Data Write-Read Flow

LocalBrokerData

Write:

Upon init, each broker stores LocalBrokerData in its ephemeral znode in MetaDataStore(ZK) to monitor live brokers(same as now)
The broker-level load data moved to BrokerLoadData

Read:

All brokers check LocalBrokerData to confirm the list of live brokers.

BrokerLoadData

Write:

Each broker periodically computes local load(BrokerLoadData) and publishes it to BrokerLoadDataStore(TableView)(non-persistent)
- Because non-persistent TableView can often lose data, we will add a TTL policy to tombstone old KVs in BrokerLoadDataStore.

Read:

All brokers consume BrokerLoadDataStore
With aggregated BrokerLoadData, all brokers perform bundle assignments without going through the leader.

BundlesLoadData

Write:

Each broker monitors the allocated bundles' load and stores them in the local cache BundlesLoadData(In-memory-HashMap).
BundlesLoadData will not be replicated to other brokers’ caches.

Read:

Each broker locally reads BundlesLoadData and computes top n high load bundles, TopBundlesLoadData.
With the local BundlesLoadData, all brokers perform bundle splits without going through the leader.

TopBundlesLoadData

Write:

Each broker periodically compute TopBundlesLoadData and publishes it to TopBundlesLoadDataStore(TableView)(non-persistent)
- We will add a TTL policy to tombstone old KVs in TopBundlesLoadDataStore.

Read:

Only the leader broker consumes TopBundlesLoadDataStore
- With the aggregated TopBundlesLoadData and BrokerLoadData, the leader initiates bundle unload(transfer) operations.

Load Data Flow

Major Modifications on Bundle Split, Unload, and Assignment Flow

With the local BundlesLoadData, all brokers perform bundle splits without going through the leader. By default, newly split bundles will be the target to unload(transfer).
With aggregated BrokerLoadData, all brokers perform bundle assignments without going through the leader.
With aggregated TopBundlesLoadData and BrokerLoadData, the leader makes decisions to unload(transfer) bundles.
We will add a new bundle unload option, transfer, which transfers bundles from one broker to another.
We will introduce a global channel(Bundle State Channel) to share consistent/linearized bundle state changes with brokers.

Bundle State Channel

This bundle state channel is a persistent topic table-view used as a WAL to broadcast the total order of all bundle state changes in the cluster. All brokers will asynchronously consume messages in this channel in the same order and react to bundle state changes(sequential consistency). With the table-view compaction, the bundle state channel will eventually materialize the current bundle-broker ownership. Read operations on this channel can be deferred(e.g., clients’ topic lookup requests) in a few seconds, depending on the current state of the bundle.

Bundle State Lifecycles

We define the following states and actions and linearize the bundle state changes.
(This is a high-level design to explain the concept here. The final version may differ.)

Bundle Actions

Own: Own the bundle ownership
- The owner broker is selected by the local load manager.
Transfer: Transfer the bundle ownership to the destination broker.
- The source broker internally disables the bundle ownership.
- The destination broker owns the bundle.
Return: Return deferred client connections with the destination broker URL
- Close the connections if already being served
Split: Split the target(parent) bundle into child bundles.
Create: Create the child bundle entries in the channel, initially assigned to the local broker.
Discard: Discard the bundle entry in the channel(tombstone operation)
Unload: Unload the bundle ownership from the owner broker
- Disable the bundle ownership
- Close client connections under the bundle
- Run the Discard action

Bundle States

Assigned: assigned to a broker
Assigning: in the process of assigning the ownership
Splitting: in the process of splitting a bundle range.
Unassigned: unassigned to any broker (removed from the channel)

*New client connections to the bundle are deferred(with timeouts) in the Assigning state.

Bundle State Change Examples

The bundle state channel can be used like the followings.

Bundle Transfer Example

(State, Action) Sequence:
(Assigned, Transfer) => (Assigning, Return) => (Assigned,)

The leader finds target bundles from TopBundlesLoadData and initiates a bundle unload(a transfer) by broadcasting the unload state change to the bundle state channel, keyed by the bundleName.
e.g. {key:bundleName, value:{flow:transfer, action:transfer, state:assigning, from:A, to:B}}}
All brokers will consume the state change message in the channel.
Upon consuming the message from the channel, if any state change involves the local broker, the broker performs its role and updates the state back in the channel to continue the state change. If there are conflicting state changes with the ongoing one, ignore them.
Meanwhile, if other brokers(broker C) receive lookup requests for the bundle, the client's connections will be deferred(with timeouts) until they receive the “Return” action. When the “Return” action is broadcasted, all brokers will return the pending connections with the owner broker’s URL. Also, the existing connections from the source broker will be closed.

Bundle Split Example

(State, Action) Sequence:
(Assigned, Split) => (Splitting, Unload | Create) => {(Unassigned, ) | (Assigned, ), (Assigned, )}

Each owner broker monitors local BundlesLoadData and initiates a bundle split by broadcasting the transfer state change to the bundle state channel, keyed by the bundleName.
e.g. {key:bundleName, value:{flow: split, action:split, state: splitting, from: A, to: B, transfer: true}}}
Same as Bundle Transfer Example step 2.
Same as Bundle Transfer Example step 3.
a. After the “Split,” the owner broadcasts the children bundles’ ownership creation(state=assigned) and the parent bundle’s ownership unload(empty message).
b. By default, the owner publishes a message to the TopBundlesLoadData store asking the leader to unload(or transfer) the children bundles.

Bundle Assignment Example

(State, Action) Sequence:
(Unassigned, Own) => (Assigning, Return) => (Assigned,)

When requested by clients, the first connected brokers check if any broker in the state channel owns the bundle. Return the owner broker URL if found. Else, initiate a bundle assignment by broadcasting the assignment state change.
e.g. {key:bundleName, value:{flow: assignment, action:own, state:assigning, to: B}}}
Same as Bundle Transfer Example step 2.
Same as Bundle Transfer Example step 3.
Same as Bundle Transfer Example step 4.

Bundle-Broker Ownership State

Because the bundle state channel shows the current bundle-broker ownership, we can remove the redundant bundle ownership store(ZK znodes). Each broker will look up the bundle ownership channel to check which broker currently owns the requested bundles or is in the ownership assignment/unload(transfer) process. Besides, before return, the broker availability metadata store(LocalBrokerData znode existence) could be checked to confirm the owner brokers' availability further.

Bundle State Channel Owner Selection and Discovery

Bundle State Channel(BSC) is another topic, and because of its circular dependency, we can't use the BundleStateChannel to find the owner broker of the BSC topic. For example, when a cluster starts, each broker needs to initiate BSC TopicLookUp(to find the owner broker) in order to consume the messages in BSC. However, initially, each broker does not know which broker owns the BSC.

The ZK leader election can be a good option to break this circular dependency, like the followings.

Channel Owner Selection

The cluster can use the ZK leader election to select the owner broker. If the owner becomes unavailable, one of the followers will become the new owner. We can elect the owner for each bundle state partition.

Channel Owner Discovery

Then, in brokers’ TopicLookUp logic, we will add a special case to return the current leader(the elected BSC owner) for the BSC topics.

Conflict State Resolution(Race Conditions)

Without distributed locks, we can resolve conflicting state changes by a conflict state resolution algorithm in an optimistic and eventual manner. Brokers can take the first valid state change in the linearized view as the winner state and ignore the later ones.

One caveat is that because the current table-view compaction takes only the last ones as the result values, we need to introduce an internal compaction algo for this channel to follow the conflict resolution algorithm(the first valid state change as the result value).

Bundle State Conflict Resolution Algorithm Example

For each bundle:

    // A[i] is a linearized bundle state change action at i, and
    // S is the current bundle state after A[i-1],
    // where the sequence number i monotonically increases.
    for each A[i] and S:
	
        // no arrows in the state diagram
        If A[i] is invalid from S: 
            Reject A[i]

        Else: Accept A[i]

For instance, let’s say for bundle x, there are two conflicting assignments initiated. The linearized state change messages will be like the following.
(own, to:B), (own, to:A)
By the conflict resolution algorithm, the second state change (own, to:A) will be ignored by all brokers(and by the compaction algorithm). Eventually, the “return” message will be broadcasted by declaring that the owner is “B.”
(own, to:B), (own, to:A), (return, to:B)

Let’s take another example. Let’s say bundle x is already assigned to broker B, but another broker initiates the “own” action(before consuming the “return” action). This last “own” state change will be ignored since this action “own” is invalid from the previous state “assigned.” (in the above state diagram, there is no “own” action arrow from the “assigned” state.)
(own, to:B), (return, to:B), (own, to:A)

Failure Recovery

When a broker is down

When state change participants(brokers) are suddenly unavailable, the state change could become an orphan, as the participants do not play the role. For these orphan state changes, the leader broker will run orphan state clean-up logic. For instance, the leader can add the bundle state clean-up logic in the broker unavailability notification handler(znode watcher) in order to clean the pending bundle state changes and ownerships from unavailable brokers. Also, to make the clean-up logic further fault-tolerant, the leader broker will run the clean-up function when it initializes. Additionally, we could make the leader periodically call the clean-up in a separate monitor thread(we shouldn’t redundantly call this cleanup too often).

When the entire ZK is down and comes back

Every broker will be notified when its ZK session undergoes the connection issue. Then, the brokers will be in the "safe" mode, serving the existing topics as-is, but not allowing the ZK-related operations. The leader won't run the bundle cleanup, transfer, nor unload logic in this case when it knows ZK is down.

When ZK comes back, each broker will know ZK sessions are re-established. They will wait 2-3 mins for all brokers to complete the ZK hand-shaking. Then, they will recover the bundle state table-view and return to the normal mode.

Bundle State and Load Data TableView Scalability

Expected read/write traffic:
Write: there will be relatively fewer messages from the write path with occasional spikes
Read: the fan-out broadcast could cause bottlenecks when the cluster is enormous.

This bundle state channel is relatively lightweight from the producers because bundle state change is relatively less frequent. Still, message dispatch to consumers could be heavier if the cluster is very large. The same issue can happen to other table-views(BrokerLoadDataStorage) introduced in this proposal. We could consider the following methods to scale the table views’ produce/consume rates in a large cluster.

Split Broker Cluster to multiple clusters

Simply, one can split a massive broker cluster into multiple clusters with different endpoints. The bookkeeper and configuration layer can be shared among the broker clusters.

Partitioned Table-View (short-term)

One can make the table views based on partitioned topics. Then, we can distribute message load to multiple partition owner brokers.

Sharding (long-term)

As the conventional scalability method, one could shard the cluster to multiple groups of brokers. Then, we can create a separate channel for each shard of brokers. This means we need an additional discovery layer to map topics to broker shards(also need to align with Namespace Isolation Policies)

We need to mention that this metadata sync scalability issue is not new in Pulsar, as the current Pulsar uses n-replication. For instance, all brokers' and all bundles' load metadata are replicated to all brokers via ZK watchers. Currently, distributed ZK servers send znode watch notifications to its clients(brokers). In this proposal, multiple table-view owner brokers(with partitioned table-views) can dispatch metadata change messages to the participants(brokers).

We think this metadata sync scalability is relatively low-priority, as only a few customers run Pulsar clusters on such a large scale. We could ask the customers first to split the cluster into multiple clusters and then enable partitioned table views. It is not practical for a single cluster to have thousands of brokers. However, we still want to ensure this design is seamlessly extensible, as a two-way-door decision.

Reject Alternatives

why we can not enhance current load balancer

As the PIP changes almost every place (data models, event handlers, cache/storage, logs/metrics), creating a new load balancer and isolating the new code is safer and cleaner. Then, customers could safely enable/disable the new load balancer
by a configuration before deprecating the old one.

It gives the flexibility to start fresh without the existing baggage of choices and try a significantly different approach. The current ModularLoadManagerImpl will not go away. Once the new load manager will be ready and considered stable enough, there might be a new discussion on whether to change the default implementation. Even then, users will still be able to opt for the old load manager.

Modification Summary

The followings exclude logic and algorithm modifications as this pip does not focus on the logic and algorithm improvement.

Goals	Before	After
Make load balance operations fault-tolerant and consistent among brokers	The leader broker sends load balance commands to the owner brokers via RPC with retries.	We introduce a global bundle state channel(a persistent topic table-view), where a total order of bundle commands is reliably persisted and broadcasted by all brokers.
Distribute load balance operations	The leader broker decides on bundle assignment, unload, and splitting. The owner brokers run the unload and split operations notified via RPC.	Each broker decides and runs bundle assignment and split operations. The leader decides bundle unload(transfer), and the owner brokers run the unload operation, notified via the bundle state channel.
Reduce load data replication among brokers	All brokers’ and all bundles’ load data are stored in ZK and replicated to all brokers via ZK watchers.	All brokers’ load data is replicated to all brokers via a non-persistent topic(table-view). Only top-n bundles’ load data from each broker is replicated to the leader broker via a non-persistent topic(table-view).
Minimize the topic unavailability from unloading	After topic connections are closed, clients reconnect to a new broker, and the new broker initiates a new topic assignment. The leader broker assigns a new owner, and eventually, the client will be redirected to the new owner broker.	We introduce a new unload option, “transfer”, where the new owner is pre-assigned before the topic connections are closed. Clients immediately redirect to the new owner broker without the client-initiated topic assignments.
Share bundle-broker ownership metadata among brokers for owner broker discovery	The bundle-broker ownership data are stored in ZK. All brokers read bundle ownership info upon TopicLookUp requests(with caching local bundle ownership info).	The global ownership data is stored in the bundle state channel(a persistent topic table-view). With compaction, all brokers read its latest global ownership table-view(cached in memory) upon TopicLookUp requests.
Show transparent load balance decisions with logs and metrics	Emit logs best-effort basis.	We design logging/metrics as separate logical components. We document and share major log messages and metrics for all important load balance events

Post Update

Added ServiceConfiguration

### --- Load balancer extension --- ###

# Option to enable the debug mode for the load balancer logics.
# The debug mode prints more logs to provide more information such as load balance states and decisions.
# (only used in load balancer extension logics)
loadBalancerDebugModeEnabled=false

# The target standard deviation of the resource usage across brokers
# (100% resource usage is 1.0 load).
# The shedder logic tries to distribute bundle load across brokers to meet this target std.
# The smaller value will incur load balancing more frequently.
# (only used in load balancer extension TransferSheddeer)
loadBalancerBrokerLoadTargetStd=0.25

# Threshold to the consecutive count of fulfilled shedding(unload) conditions.
# If the unload scheduler consecutively finds bundles that meet unload conditions
# many times bigger than this threshold, the scheduler will shed the bundles.
# The bigger value will incur less bundle unloading/transfers.
# (only used in load balancer extension TransferSheddeer)
loadBalancerSheddingConditionHitCountThreshold=3

# Option to enable the bundle transfer mode when distributing bundle loads.
# On: transfer bundles from overloaded brokers to underloaded
# -- pre-assigns the destination broker upon unloading).
# Off: unload bundles from overloaded brokers
# -- post-assigns the destination broker upon lookups).
# (only used in load balancer extension TransferSheddeer)
loadBalancerTransferEnabled=true

# Maximum number of brokers to unload bundle load for each unloading cycle.
# The bigger value will incur more unloading/transfers for each unloading cycle.
# (only used in load balancer extension TransferSheddeer)
loadBalancerMaxNumberOfBrokerSheddingPerCycle=3

# Delay (in seconds) to the next unloading cycle after unloading.
# The logic tries to give enough time for brokers to recompute load after unloading.
# The bigger value will delay the next unloading cycle longer.
# (only used in load balancer extension TransferSheddeer)
loadBalanceSheddingDelayInSeconds=180

# Broker load data time to live (TTL in seconds).
# The logic tries to avoid (possibly unavailable) brokers with out-dated load data,
# and those brokers will be ignored in the load computation.
# When tuning this value, please consider loadBalancerReportUpdateMaxIntervalMinutes.
#The current default is loadBalancerReportUpdateMaxIntervalMinutes * 2.
# (only used in load balancer extension TransferSheddeer)
loadBalancerBrokerLoadDataTTLInSeconds=1800

# Max number of bundles in bundle load report from each broker.
# The load balancer distributes bundles across brokers,
# based on topK bundle load data and other broker load data.
# The bigger value will increase the overhead of reporting many bundles in load data.
# (only used in load balancer extension logics)
loadBalancerMaxNumberOfBundlesInBundleLoadReport=10

# Service units'(bundles) split interval. Broker periodically checks whether
# some service units(e.g. bundles) should split if they become hot-spots.
# (only used in load balancer extension logics)
loadBalancerSplitIntervalMinutes=1

# Max number of bundles to split to per cycle.
# (only used in load balancer extension logics)
loadBalancerMaxNumberOfBundlesToSplitPerCycle=10

# Threshold to the consecutive count of fulfilled split conditions.
# If the split scheduler consecutively finds bundles that meet split conditions
# many times bigger than this threshold, the scheduler will trigger splits on the bundles
# (if the number of bundles is less than loadBalancerNamespaceMaximumBundles).
# (only used in load balancer extension logics)
loadBalancerNamespaceBundleSplitConditionHitCountThreshold=3

# After this delay, the service-unit state channel tombstones any service units (e.g., bundles)
# in semi-terminal states. For example, after splits, parent bundles will be `deleted`,
# and then after this delay, the parent bundles' state will be `tombstoned`
# in the service-unit state channel.
# Pulsar does not immediately remove such semi-terminal states
# to avoid unnecessary system confusion,
# as the bundles in the `tombstoned` state might temporarily look available to reassign.
# Rarely, one could lower this delay in order to aggressively clean
# the service-unit state channel when there are a large number of bundles.
# minimum value = 30 secs
# (only used in load balancer extension logics)
loadBalancerServiceUnitStateTombstoneDelayTimeInSeconds=3600

# Option to automatically unload namespace bundles with affinity(isolation)
# or anti-affinity group policies.
# Such bundles are not ideal targets to auto-unload as destination brokers are limited.
# (only used in load balancer extension logics)
loadBalancerSheddingBundlesWithPoliciesEnabled = false

# Time to wait before fixing any stuck in-flight service unit states. 
# The leader monitor fixes any in-flight service unit(bundle) states 
# by reassigning the ownerships if stuck too long, longer than this period.
# (only used in load balancer extension logics)
loadBalancerInFlightServiceUnitStateWaitingTimeInMillis = 30000;

# The service unit(bundle) state channel is periodically monitored
# by the leader broker at this interval
# to fix any orphan bundle ownerships, stuck in-flight states, and other cleanup jobs.
# `loadBalancerServiceUnitStateTombstoneDelayTimeInSeconds` * 1000 must be bigger than 
# `loadBalancerInFlightServiceUnitStateWaitingTimeInMillis`.
# (only used in load balancer extension logics)
loadBalancerServiceUnitStateMonitorIntervalInSeconds = 60;

The text was updated successfully, but these errors were encountered:

heesung-sn · 2022-08-23T21:30:14Z

vote email : https://lists.apache.org/thread/4s0869ndt6ppjzgklqjnjk6bjt3zngkd

github-actions · 2022-09-23T02:19:07Z

The issue had no activity for 30 days, mark with Stale label.

### Motivation Currently, the table view only supports persistent topics with read compacted. However, some data don't require persistent storage, like load data in PIP-192 [#16691](#16691). ### Modifications Make table view support non-persistent topic

heesung-sn · 2022-11-17T17:19:35Z

Project Status: Sub-PIPs and PRs

Please help to review these items.

Sub-PIPs

PIP-215: Configurable TopicCompactionStrategy for StrategicTwoPhaseCompactor and TableView #18099
-> Vote Complete (got three +1 PMC votes)
PIP-220: TransferShedder (Only for PIP-192 New Broker Load Manager ) #18215
-> Vote Complete (got three +1 PMC votes)
PIP-221: Make TableView support TTL #18229
-> Pending Discussion

PRs

[improve][client] Make table view support read the non-persistent topic #18375
[improve][broker] PIP-192 Added ServiceUnitStateChannel interface #18079
[improve][broker] PIP-192: Define new load manager base interfaces #18084
[improve][broker][client] PIP-192 PIP-215 Added TopicCompactionStrategy for StrategicTwoPhaseCompactor and TableView. #18195
[improve][broker] PIP-192 Added ServiceUnitStateChannelImpl #18489
[improve][broker] PIP-192: Implement load data store #18777
[improve][broker] PIP-192: Implement broker registry for new load manager #18810
[improve][broker] PIP-192 PIP-220 Added TransferShedder #18865
[improve][broker] PIP-192: Write the child ownership to ServiceUnitStateChannel instead of ZK when handling bundle split #18858
[improve][broker] PIP-192 Added LeastResourceUsageWithWeight broker assignment strategy to the load balancer extensions #18964
[improve][broker] PIP-192 Added ServiceUnitStateCompactionStrategy #19045
[improve][broker] PIP-192: Implement extensible load manager #19102
[improve][broker] PIP-192: Implement broker version filter for new load manager #19023
[improve][broker] PIP-192 Moved the common broker load data feature(weightedMaxEMA) to BrokerLoadData #19154
[improve][broker] PIP-192 Added operation counters in ServiceUnitStateChannelImpl #19410
[improve][broker] PIP-192 Implemented ExtensibleLoadManagerWrapper.getLoadBalancingMetrics() #19440
[improve][broker] PIP-192 Added broker and top-bundles load reporters #19471
[improve][broker] PIP-192 Added namespace unload scheduler #19477
[improve][broker] PIP-192 Use msg property to mark compacted msgs from strategic compaction #19490
[improve][broker] PIP-192 Added Deleted and Init states in ServiceUnitState #19546
[improve][broker] PIP-192: Make unload and transfer admin API functional #19538
[improve][broker] PIP-192 Added SplitScheduler and DefaultNamespaceBundleSplitStrategyImpl #19622
[improve][broker] PIP-192: Added VersionId in ServiceUnitStateData #19620
[improve][broker] PIP-192: Add large topic count filter #19613
[improve][broker] PIP-192: Support broker isolation policy #19592
[improve][broker] PIP-192 Added --extensions option in BrokerMonitor #19654
[improve][broker] PIP-192 Switched Assigning and Releasing state order in the transfer protocol #19683
[improve][broker] PIP-192 Supports AntiAffinityGroupPolicy #19708
[improve][broker] PIP-192 Excluded bundles with isolation policy or anti-affinity-group policy from topk load bundles #19742
[improve][broker] PIP-192 Made only the leader consume TopBundlesLoadDataStore #19730
[improve][broker] PIP-192: Add metrics for unload operation #19749
[improve][broker] PIP-192: Make split admin API functional #19773
[improve][broker] PIP-192 Improved Auto Unload Logic #19813
[improve][broker] PIP-192: Support delete namespace bundle admin API #19851
[improve][broker] PIP-192 updated metrics and cleanup broker selector #19945
[improve][broker] PIP-192: Update the lookup data path to support deployment and rollback #19999
[improve][broker] PIP-192 made split handler idempotent #19988
[improve][broker] PIP-192: Filter the transfer dest broker #19958

…18084) Master Issue: #16691 ### Motivation We will start raising PRs to implement PIP-192, #16691 ### Modifications The PR adds base classes for the new broker load balance project and does not integrate with the existing load balance logic. This PR should not impact the existing broker load balance behavior. For the pip-192 project, this PR * defines the base interface under `org.apache.pulsar.broker.loadbalance.extensible` package. * defines this `BrokerRegistry` public interface and its expected behaviors. * defines `BrokerFilter` interfaces. * defines `LoadDataReporter` interfaces. * defines `NamespaceBundleSplitStrategy` interfaces. * defines `LoadManagerScheduler` interfaces. * defines `NamespaceUnloadStrategy` interfaces. * defines `LoadDataStore` interfaces. * defines `ExtensibleLoadManager` interfaces. * defines `LoadManagerContext` interfaces. * defines `BrokerLoadData` and `BrokerLookupData` data classes.

…t-forward cursor behavior after compaction (#20110) Master Issue: #16691 ### Motivation Raising a PR to implement: #16691 After the compaction, the cursor can fast-forward to the compacted horizon when a large number of messages are compacted before the next read. Hence, ServiceUnitStateCompactionStrategy also needs to cover this case. Currently, the existing and slow(their states are far behind) tableviews with ServiceUnitStateCompactionStrategy could not accept those compacted messages. In the load balance extension context, this means the ownership data could be inconsistent among brokers. ### Modifications This PR - fixes ServiceUnitStateCompactionStrategy to accept the state data if its version is bigger than the current version +1. - (minor fix) does not repeatedly update the replication_clusters in the policies when creating the system namespace. This update redundantly triggers ZK watchers when restarting brokers. - sets closeWithoutWaitingClientDisconnect=true, upon unload(following the same setting as the modular LM's) (cherry picked from commit 6cfa468)

cbornet · 2023-04-19T08:05:28Z

@merlimat this is marked for 3.0. Is everything done for this PIP ? Can it be closed ?

Anonymitaet · 2023-04-20T02:48:46Z

Hi @heesung-sn thanks for introducing this great feature!

I see some PRs related to this PIP were labeled with doc-not-need, so I want to double-check: for users, does this feature not pose any influence on usage? If it affects users, do we need to add “what is it” and “how to use it” to Pulsar docs (e.g. https://pulsar.apache.org/docs/next/administration-load-balance/)?
Thanks!

heesung-sn · 2023-04-20T02:54:54Z

Hi, Yes we are working on docs too.

@Demogorgon314 and I will raise PRs for the doc update.

Anonymitaet · 2023-04-20T03:10:39Z

@heesung-sn @Demogorgon314 thanks! Feel free to ping me if you need a review.

heesung-sn · 2023-05-04T16:32:46Z

The code work is done, and we are working on documentation.

PIP: #16691 ### Motivation When upgrading the pulsar version and changing the pulsar load manager to `ExtensibleLoadManagerImpl` it might cause NPE. The root cause is the old version of pulsar does not contain the `loadManagerClassName` field. ``` 2023-05-18T05:42:50,557+0000 [pulsar-io-4-1] INFO org.apache.pulsar.broker.service.ServerCnx - [/127.0.0.6:51345] connected with role=[pulsarinstance-v3-0-n@test.dev](mailto:pulsarinstance-v3-0-n@test.dev) using authMethod=token, clientVersion=Pulsar Go 0.9.0, clientProtocolVersion=18, proxyVersion=null 2023-05-18T05:42:50,558+0000 [pulsar-io-4-1] WARN org.apache.pulsar.broker.lookup.TopicLookupBase - Failed to lookup [pulsarinstance-v3-0-n@test.dev](mailto:pulsarinstance-v3-0-n@test.dev) for topic persistent://xxx with error java.lang.NullPointerException: Cannot invoke “String.equals(Object)” because the return value of “org.apache.pulsar.broker.loadbalance.extensions.data.BrokerLookupData.getLoadManagerClassName()” is null java.util.concurrent.CompletionException: java.lang.NullPointerException: Cannot invoke “String.equals(Object)” because the return value of “org.apache.pulsar.broker.loadbalance.extensions.data.BrokerLookupData.getLoadManagerClassName()” is null at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:315) ~[?:?] at java.util.concurrent.CompletableFuture.uniComposeStage(CompletableFuture.java:1194) ~[?:?] at java.util.concurrent.CompletableFuture.thenCompose(CompletableFuture.java:2309) ~[?:?] at org.apache.pulsar.broker.loadbalance.extensions.ExtensibleLoadManagerImpl.selectAsync(ExtensibleLoadManagerImpl.java:385) ~[io.streamnative-pulsar-broker-3.0.0.1.jar:3.0.0.1] at org.apache.pulsar.broker.loadbalance.extensions.ExtensibleLoadManagerImpl.lambda$assign$6(ExtensibleLoadManagerImpl.java:336) ~[io.streamnative-pulsar-broker-3.0.0.1.jar:3.0.0.1] at java.util.concurrent.CompletableFuture.uniComposeStage(CompletableFuture.java:1187) ~[?:?] at java.util.concurrent.CompletableFuture.thenCompose(CompletableFuture.java:2309) ~[?:?] at org.apache.pulsar.broker.loadbalance.extensions.ExtensibleLoadManagerImpl.lambda$assign$10(ExtensibleLoadManagerImpl.java:333) ~[io.streamnative-pulsar-broker-3.0.0.1.jar:3.0.0.1] at org.apache.pulsar.common.util.collections.ConcurrentOpenHashMap$Section.put(ConcurrentOpenHashMap.java:409) ~[io.streamnative-pulsar-common-3.0.0.1.jar:3.0.0.1] at org.apache.pulsar.common.util.collections.ConcurrentOpenHashMap.computeIfAbsent(ConcurrentOpenHashMap.java:243) ~[io.streamnative-pulsar-common-3.0.0.1.jar:3.0.0.1] at org.apache.pulsar.broker.loadbalance.extensions.ExtensibleLoadManagerImpl.assign(ExtensibleLoadManagerImpl.java:327) ~[io.streamnative-pulsar-broker-3.0.0.1.jar:3.0.0.1] at org.apache.pulsar.broker.loadbalance.extensions.ExtensibleLoadManagerWrapper.findBrokerServiceUrl(ExtensibleLoadManagerWrapper.java:66) ~[io.streamnative-pulsar-broker-3.0.0.1.jar:3.0.0.1] at org.apache.pulsar.broker.namespace.NamespaceService.lambda$getBrokerServiceUrlAsync$0(NamespaceService.java:191) ~[io.streamnative-pulsar-broker-3.0.0.1.jar:3.0.0.1] ``` ### Modifications * Add null check when using`getLoadManagerClassName`. * Add test to cover this case. * Add `RedirectManager` unit test.

heesung-sn · 2023-05-23T19:55:12Z

Currently, the system topics are non-partitioned and owned by a single leader. We can further partition these system topics with multi-leaders for really large clusters. This can be a follow-up task if requested.

Demogorgon314 · 2023-05-24T02:22:02Z

Currently, the system topics are non-partitioned and owned by a single leader. We can further partition these system topics with multi-leaders for really large clusters. This can be a follow-up task if requested.

Should we use a single partition as system topic instead of non-partitioned topic? Since the non-partitioned topic can't transfer to a partitioned topic. If I understand correct.

heesung-sn · 2023-05-24T03:35:58Z

Currently, the system topics are non-partitioned and owned by a single leader. We can further partition these system topics with multi-leaders for really large clusters. This can be a follow-up task if requested.

Should we use a single partition as system topic instead of non-partitioned topic? Since the non-partitioned topic can't transfer to a partitioned topic. If I understand correct.

I expect the single partition could have some overhead over non-partitioned topic, but I agree that starting from a single partitioned topic can be easier to extend to multi-partitions in the future. For this first iteration, I think we can stick to non-partitioned topics.

…ce extension (#20370) PIP: #16691 ### Motivation We need to create system topics without partitions explicitly. Currently, we do not support partitioned system topics. ### Modifications create system topics without partitions explicitly

PIP: #16691 ### Motivation When upgrading the pulsar version and changing the pulsar load manager to `ExtensibleLoadManagerImpl` it might cause NPE. The root cause is the old version of pulsar does not contain the `loadManagerClassName` field. ``` 2023-05-18T05:42:50,557+0000 [pulsar-io-4-1] INFO org.apache.pulsar.broker.service.ServerCnx - [/127.0.0.6:51345] connected with role=[pulsarinstance-v3-0-n@test.dev](mailto:pulsarinstance-v3-0-n@test.dev) using authMethod=token, clientVersion=Pulsar Go 0.9.0, clientProtocolVersion=18, proxyVersion=null 2023-05-18T05:42:50,558+0000 [pulsar-io-4-1] WARN org.apache.pulsar.broker.lookup.TopicLookupBase - Failed to lookup [pulsarinstance-v3-0-n@test.dev](mailto:pulsarinstance-v3-0-n@test.dev) for topic persistent://xxx with error java.lang.NullPointerException: Cannot invoke “String.equals(Object)” because the return value of “org.apache.pulsar.broker.loadbalance.extensions.data.BrokerLookupData.getLoadManagerClassName()” is null java.util.concurrent.CompletionException: java.lang.NullPointerException: Cannot invoke “String.equals(Object)” because the return value of “org.apache.pulsar.broker.loadbalance.extensions.data.BrokerLookupData.getLoadManagerClassName()” is null at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:315) ~[?:?] at java.util.concurrent.CompletableFuture.uniComposeStage(CompletableFuture.java:1194) ~[?:?] at java.util.concurrent.CompletableFuture.thenCompose(CompletableFuture.java:2309) ~[?:?] at org.apache.pulsar.broker.loadbalance.extensions.ExtensibleLoadManagerImpl.selectAsync(ExtensibleLoadManagerImpl.java:385) ~[io.streamnative-pulsar-broker-3.0.0.1.jar:3.0.0.1] at org.apache.pulsar.broker.loadbalance.extensions.ExtensibleLoadManagerImpl.lambda$assign$6(ExtensibleLoadManagerImpl.java:336) ~[io.streamnative-pulsar-broker-3.0.0.1.jar:3.0.0.1] at java.util.concurrent.CompletableFuture.uniComposeStage(CompletableFuture.java:1187) ~[?:?] at java.util.concurrent.CompletableFuture.thenCompose(CompletableFuture.java:2309) ~[?:?] at org.apache.pulsar.broker.loadbalance.extensions.ExtensibleLoadManagerImpl.lambda$assign$10(ExtensibleLoadManagerImpl.java:333) ~[io.streamnative-pulsar-broker-3.0.0.1.jar:3.0.0.1] at org.apache.pulsar.common.util.collections.ConcurrentOpenHashMap$Section.put(ConcurrentOpenHashMap.java:409) ~[io.streamnative-pulsar-common-3.0.0.1.jar:3.0.0.1] at org.apache.pulsar.common.util.collections.ConcurrentOpenHashMap.computeIfAbsent(ConcurrentOpenHashMap.java:243) ~[io.streamnative-pulsar-common-3.0.0.1.jar:3.0.0.1] at org.apache.pulsar.broker.loadbalance.extensions.ExtensibleLoadManagerImpl.assign(ExtensibleLoadManagerImpl.java:327) ~[io.streamnative-pulsar-broker-3.0.0.1.jar:3.0.0.1] at org.apache.pulsar.broker.loadbalance.extensions.ExtensibleLoadManagerWrapper.findBrokerServiceUrl(ExtensibleLoadManagerWrapper.java:66) ~[io.streamnative-pulsar-broker-3.0.0.1.jar:3.0.0.1] at org.apache.pulsar.broker.namespace.NamespaceService.lambda$getBrokerServiceUrlAsync$0(NamespaceService.java:191) ~[io.streamnative-pulsar-broker-3.0.0.1.jar:3.0.0.1] ``` ### Modifications * Add null check when using`getLoadManagerClassName`. * Add test to cover this case. * Add `RedirectManager` unit test. (cherry picked from commit b7f0004)

…ce extension (#20370) PIP: #16691 ### Motivation We need to create system topics without partitions explicitly. Currently, we do not support partitioned system topics. ### Modifications create system topics without partitions explicitly (cherry picked from commit 1080ad5)

…list in bundle admin API (#20528) PIP: #16691 ### Motivation When using `ExtensibleLoadManager` and list in bundle admin API, it will redirect forever because `isServiceUnitOwned` method is checking the `ownershipCache` as the ownership storage, however, when using `ExtensibleLoadManager`, it stored the ownership to table view. ### Modifications * Call `isServiceUnitOwnedAsync ` when using `isServiceUnitOwned `. * Add unit test to cover this case.

…list in bundle admin API (#20528) PIP: #16691 When using `ExtensibleLoadManager` and list in bundle admin API, it will redirect forever because `isServiceUnitOwned` method is checking the `ownershipCache` as the ownership storage, however, when using `ExtensibleLoadManager`, it stored the ownership to table view. * Call `isServiceUnitOwnedAsync ` when using `isServiceUnitOwned `. * Add unit test to cover this case.

heesung-sn · 2023-12-05T17:08:21Z

Post Update section has been added to the description to enumerate the added ServiceConfigurations for this project.

heesung-sn added the type/PIP label Jul 20, 2022

sijie mentioned this issue Jul 20, 2022

ISSUE-16691: PIP-192: New Pulsar Broker Load Balancer streamnative/pulsar-archived#4581

Open

gaozhangmin assigned heesung-sn Jul 22, 2022

codelipenghui added this to the 2.12.0 milestone Aug 11, 2022

github-actions bot added the Stale label Sep 23, 2022

dave2wave mentioned this issue Sep 27, 2022

[feature] Add Metrics to understand Load Balancing Decisions #17860

Open

2 tasks

sijie mentioned this issue Sep 28, 2022

ISSUE-17860: [feature] Add Metrics to understand Load Balancing Decisions streamnative/pulsar-archived#4936

Open

2 tasks

heesung-sn mentioned this issue Oct 18, 2022

[improve][broker] PIP-192 Added ServiceUnitStateChannel interface #18079

Merged

14 tasks

Demogorgon314 mentioned this issue Oct 18, 2022

[improve][broker] PIP-192: Define new load manager base interfaces #18084

Merged

4 tasks

heesung-sn mentioned this issue Oct 18, 2022

PIP-215: Configurable TopicCompactionStrategy for StrategicTwoPhaseCompactor and TableView #18099

Closed

sijie mentioned this issue Oct 19, 2022

ISSUE-18099: PIP-215: Configurable TopicCompactionStrategy for StrategicTwoPhaseCompactor and TableView streamnative/pulsar-archived#5006

Open

heesung-sn mentioned this issue Oct 27, 2022

PIP-220: TransferShedder (Only for PIP-192 New Broker Load Manager ) #18215

Closed

sijie mentioned this issue Oct 27, 2022

ISSUE-18215: PIP-220: TransferShedder (Only for PIP-192 New Broker Load Manager ) streamnative/pulsar-archived#5044

Open

Demogorgon314 mentioned this issue Oct 28, 2022

PIP-221: Make TableView support TTL #18229

Closed

sijie mentioned this issue Oct 28, 2022

ISSUE-18229: PIP-221: Make TableView support TTL streamnative/pulsar-archived#5046

Open

This was referenced Nov 7, 2022

[improve][client] Make table view support read the non-persistent topic Demogorgon314/pulsar#5

Open

[improve][client] Make table view support read the non-persistent topic #18375

Merged

This was referenced Nov 16, 2022

[improve][broker] PIP-192 Added ServiceUnitStateChannelImpl heesung-sn/pulsar#14

Closed

[improve][broker] PIP-192 Added ServiceUnitStateChannelImpl #18489

Merged

github-actions bot removed the Stale label Nov 18, 2022

merlimat modified the milestones: 3.1.0, 3.0.0 Apr 19, 2023

Demogorgon314 mentioned this issue Apr 19, 2023

[improve][test] Add integration test for ExtensibleLoadManager #20138

Merged

4 tasks

heesung-sn closed this as completed May 4, 2023

heesung-sn mentioned this issue May 13, 2023

[improve][broker] Gracefully shut down load balancer extension #20315

Merged

15 tasks

Demogorgon314 mentioned this issue May 18, 2023

[fix][broker] Fix broker load manager class filter NPE #20350

Merged

4 tasks

heesung-sn mentioned this issue May 23, 2023

[fix][broker] pre-create non-partitioned system topics for load balance extension #20370

Merged

15 tasks

This was referenced Jun 7, 2023

[improve][broker] Emit the namespace bundle listener event on extensible load manager #20525

Merged

[fix][broker] Fix redirect loop when using ExtensibleLoadManager and list in bundle admin API #20528

Merged

JooHyukKim mentioned this issue Jun 8, 2023

Verify at least one URL is present for cluster creation #19866

Closed

2 tasks

This was referenced Jun 9, 2023

[fix][broker] Handle heartbeat namespace in ExtensibleLoadManager #20551

Merged

[improve][broker] Handle get owned namespaces admin API in ExtensibleLoadManager #20552

Merged

Anonymitaet mentioned this issue Jun 13, 2023

[doc][feat] add docs for broker load balancer #20570

Open

2 tasks

poorbarcode mentioned this issue Aug 10, 2023

[improve] [ml] Persist mark deleted ops to ZK if create cursor ledger was failed #20935

Merged

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

PIP-192: New Pulsar Broker Load Balancer #16691

PIP-192: New Pulsar Broker Load Balancer #16691

heesung-sn commented Jul 20, 2022 •

edited

Loading

heesung-sn commented Aug 23, 2022

github-actions bot commented Sep 23, 2022

heesung-sn commented Nov 17, 2022 •

edited

Loading

cbornet commented Apr 19, 2023

Anonymitaet commented Apr 20, 2023 •

edited

Loading

heesung-sn commented Apr 20, 2023

Anonymitaet commented Apr 20, 2023

heesung-sn commented May 4, 2023

heesung-sn commented May 23, 2023

Demogorgon314 commented May 24, 2023

heesung-sn commented May 24, 2023 •

edited

Loading

heesung-sn commented Dec 5, 2023

PIP-192: New Pulsar Broker Load Balancer #16691

PIP-192: New Pulsar Broker Load Balancer #16691

Comments

heesung-sn commented Jul 20, 2022 • edited Loading

Proposal: New Pulsar Broker Load Balancer

Motivation

Goals

User-facing goals

Logic

Logs / Metrics

Admin API / Configurations

Internal Implementation goals

Logic

Implementation

Logs / Metrics

Admin API / Configurations

Testing

API Changes

Implementation (High-Level Components)

New Load Manager

Load Data Models

LocalBrokerData: broker’s factual data

BrokerLoadData: broker’s load data

BundlesLoadData: bundle’s load data

TopBundlesLoadData: top-n high-loaded bundle load data from the broker

Load Data Write-Read Flow

LocalBrokerData

BrokerLoadData

BundlesLoadData

TopBundlesLoadData

Load Data Flow

Major Modifications on Bundle Split, Unload, and Assignment Flow

Bundle State Channel

Bundle State Lifecycles

Bundle State Change Examples

Bundle Transfer Example

Bundle Split Example

Bundle Assignment Example

Bundle-Broker Ownership State

Bundle State Channel Owner Selection and Discovery

Channel Owner Selection

Channel Owner Discovery

Conflict State Resolution(Race Conditions)

Failure Recovery

When a broker is down

When the entire ZK is down and comes back

Bundle State and Load Data TableView Scalability

Split Broker Cluster to multiple clusters

Partitioned Table-View (short-term)

Sharding (long-term)

Reject Alternatives

Modification Summary

Post Update

heesung-sn commented Aug 23, 2022

github-actions bot commented Sep 23, 2022

heesung-sn commented Nov 17, 2022 • edited Loading

Project Status: Sub-PIPs and PRs

Sub-PIPs

PRs

cbornet commented Apr 19, 2023

Anonymitaet commented Apr 20, 2023 • edited Loading

heesung-sn commented Apr 20, 2023

Anonymitaet commented Apr 20, 2023

heesung-sn commented May 4, 2023

heesung-sn commented May 23, 2023

Demogorgon314 commented May 24, 2023

heesung-sn commented May 24, 2023 • edited Loading

heesung-sn commented Dec 5, 2023

heesung-sn commented Jul 20, 2022 •

edited

Loading

heesung-sn commented Nov 17, 2022 •

edited

Loading

Anonymitaet commented Apr 20, 2023 •

edited

Loading

heesung-sn commented May 24, 2023 •

edited

Loading