Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[fix][broker] Fix the broker registering might be blocked for long time #23371

Conversation

BewareMyPower
Copy link
Contributor

@BewareMyPower BewareMyPower commented Sep 29, 2024

Fixes #23365

Motivation

#23359 would register the broker again if the registered metadata node was somehow deleted. However, the listeners in BrokerRegistryImpl are triggered in the load manager thread (via Pulsar#Service#loadManagerExecutor).

this.scheduler.submit(() -> {
String brokerId = t.getPath().substring(LOADBALANCE_BROKERS_ROOT.length() + 1);
for (BiConsumer<String, NotificationType> listener : listeners) {
listener.accept(brokerId, t.getType());
}

However, this thread executes many blocking tasks, including monitor(), playLeader() and playFollower() of ExtensibleLoadManagerImpl. If there were no available brokers, the blocking tasks might never complete until timeout exceeds. Here is an example:

2024-09-29T15:24:15,675 - WARN  - [pulsar-io-24-2:ConnectionHandler] - [non-persistent://pulsar/system/loadbalancer-broker-load-data] [reader-40952cbf05] Could not get connection to broker: org.apache.pulsar.client.api.PulsarClientException$BrokerMetadataException: {"errorMsg":"No available broker found.","reqId":175112655758658573, "remote":"localhost/127.0.0.1:56266", "local":"/127.0.0.1:56281"} -- Will try again in 3.146 s
2024-09-29T15:24:17,686 - WARN  - [pulsar-load-manager-22-1:ExtensibleLoadManagerImpl] - The broker:localhost:56270 failed to set the role. Retrying 1 th ...
org.apache.pulsar.broker.loadbalance.extensions.store.LoadDataStoreException: java.util.concurrent.TimeoutException
	at org.apache.pulsar.broker.loadbalance.extensions.store.TableViewLoadDataStoreImpl.startTableView(TableViewLoadDataStoreImpl.java:179) ~[classes/:?]
	at org.apache.pulsar.broker.loadbalance.extensions.store.TableViewLoadDataStoreImpl.start(TableViewLoadDataStoreImpl.java:157) ~[classes/:?]
	at org.apache.pulsar.broker.loadbalance.extensions.store.TableViewLoadDataStoreImpl.init(TableViewLoadDataStoreImpl.java:141) ~[classes/:?]
	at org.apache.pulsar.broker.loadbalance.extensions.ExtensibleLoadManagerImpl.playLeader(ExtensibleLoadManagerImpl.java:849) ~[classes/:?]

In this case, registerAsync would have to wait until these methods are done, while they could have been recovered by a registerAsync call.

While the direct cause of failed BrokerRegistryIntegrationTest is that TableViewImpl#flush would not complete if there is no message in the topic. See #23365 (comment)

Modifications

Two fixes:

  1. Fix the TableViewImpl#flush by filtering the empty topics via checking if the last message id's entry id is non-negative
  2. Call registerAsync directly in the metadata store callback thread instead of the load manager thread. It's an asynchronous metadata store operation so it won't block the thread.

Tests enhancements:

  • Enable BrokerRegistryIntegrationTest again to verify fix 1 works
  • Add BrokerRegistryMetadataStoreIntegrationTest to verify fix 2 works
  • Check the broker close time to verify fix 1 is still required even if we have fix 2

Documentation

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

Matching PR in forked repository

PR in forked repository:

@BewareMyPower BewareMyPower added this to the 4.0.0 milestone Sep 29, 2024
@BewareMyPower BewareMyPower self-assigned this Sep 29, 2024
@github-actions github-actions bot added the doc-not-needed Your PR changes do not impact docs label Sep 29, 2024
@codecov-commenter
Copy link

codecov-commenter commented Sep 30, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 74.56%. Comparing base (bbc6224) to head (26edb49).
Report is 614 commits behind head on master.

Additional details and impacted files

Impacted file tree graph

@@             Coverage Diff              @@
##             master   #23371      +/-   ##
============================================
+ Coverage     73.57%   74.56%   +0.98%     
- Complexity    32624    34485    +1861     
============================================
  Files          1877     1934      +57     
  Lines        139502   145137    +5635     
  Branches      15299    15870     +571     
============================================
+ Hits         102638   108218    +5580     
+ Misses        28908    28628     -280     
- Partials       7956     8291     +335     
Flag Coverage Δ
inttests 27.80% <75.00%> (+3.22%) ⬆️
systests 24.53% <0.00%> (+0.21%) ⬆️
unittests 73.90% <100.00%> (+1.05%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
...ker/loadbalance/extensions/BrokerRegistryImpl.java 82.88% <100.00%> (-1.88%) ⬇️
...a/org/apache/pulsar/client/impl/TableViewImpl.java 86.59% <100.00%> (+1.28%) ⬆️

... and 602 files with indirect coverage changes

@codelipenghui codelipenghui merged commit 7d7dc80 into apache:master Sep 30, 2024
57 checks passed
@BewareMyPower BewareMyPower deleted the bewaremypower/fix-broker-registry-test branch September 30, 2024 06:34
heesung-sn pushed a commit to heesung-sn/pulsar that referenced this pull request Oct 23, 2024
heesung-sn pushed a commit to heesung-sn/pulsar that referenced this pull request Oct 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Flaky-test: BrokerRegistryIntegrationTest.testRecoverFromNodeDeletion
6 participants