Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[fix][broker] Fix multiple race conditions in topic unloading and loading #20540

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

lhotari
Copy link
Member

@lhotari lhotari commented Jun 8, 2023

Motivation

In Pulsar, there's a unresolved issue around topic unloading and loading which could lead to "Attempting to add producer to a fenced topic" & "Topic is temporarily unavailable" errors.
Very recently #20526 was reported against Pulsar 2.11.1. Other related issues are #5284, #14941 and multiple others.

Modifications

  • stop continuing to load a topic if the topic future has already been completed
    • the completion of the future is used to signal that the topic is being unloaded or it has timed out
  • stop continue to load topics when Pulsar broker is shutting down
  • address possible race conditions when the topic future is completed.
    • check CompletableFuture.complete / .completeExceptionally method's boolean return value.
      • if it's false, this mean that the topic was possibly unloaded concurrently and the newly loaded instance should be closed.
        • close the topic if this is detected.
  • when unloading a namespace, wait for all topic closing futures to complete (although the current closing logic is asynchronous)
    • ignore possible errors at close time since that could lead to the namespace unloading logic to fail in unexpected ways.
  • add cleanup logic for topic loading throttling solution that gets used when the number of concurrent topic loads exceeds maxConcurrentTopicLoadRequest (default: 5000). Stop continuing to process the pendingTopicLoadingQueue if Pulsar broker is closing. Skip loading topics which have a topic future that has been completed (signals topic unloading).
  • add a solution to cleanup topic cache without doing a namespace bundle lookup
    • at cleanup time, remove the topic from the cache where it was added to.

Documentation

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

Matching PR in forked repository

PR in forked repository: lhotari#150

@poorbarcode
Copy link
Contributor

@lhotari A quick update, I think these are helpful for these issues

@lhotari
Copy link
Member Author

lhotari commented Jul 5, 2023

@lhotari A quick update, I think these are helpful for these issues

@poorbarcode the PRs you referenced don't seem to make this current PR #20540 obsolete. Please review

Copy link
Member

@michaeljmarshall michaeljmarshall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work @lhotari. I left a few comments.

for (TopicLoadingContext topicLoadingContext = getNextPendingTopic(); topicLoadingContext != null;
topicLoadingContext = getNextPendingTopic()) {
topicLoadingContext.getTopicFuture()
.completeExceptionally(new NotAllowedException("Broker is shutting down"));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: what do you think about ServiceNotReady instead? I haven't looked too closely, but NotAllowed might not give the right client side behavior:

if (error.getError() == ServerError.NotAllowedError) {
log.error("Get not allowed error, {}", error.getMessage());
connectionFuture.completeExceptionally(new PulsarClientException.NotAllowedException(error.getMessage()));
}

TopicLoadingContext pendingTopic = pendingTopicLoadingQueue.poll();
if (!pulsar().isRunning()) {
log.warn("Pulsar is not running, skip create pending topic");
failPendingTopics();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the purpose of calling this here if the closeAsync method also calls it? The logic is correct, but it seems like we could skip this line and leave the cleanup logic to the closeAsync method.

Comment on lines -2276 to +2287
private void removeTopicFromCache(String topic, NamespaceBundle namespaceBundle,
CompletableFuture<Optional<Topic>> createTopicFuture) {
String bundleName = namespaceBundle.toString();
private void removeTopicFromCache(String topic,
String bundleName, CompletableFuture<Optional<Topic>> createTopicFuture) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: I prefer to keep the types as long as possible to prevent misuse of a method in the future. Why remove the NamespaceBundle type here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that keeping types would make sense. In this case, there was already String topic so I thought that it's more consistent to use String also for the namespace bundle. We could also take it to another direction. :)

@codecov-commenter
Copy link

Codecov Report

Merging #20540 (bc9a806) into master (ac33311) will decrease coverage by 36.12%.
The diff coverage is 53.53%.

Impacted file tree graph

@@              Coverage Diff              @@
##             master   #20540       +/-   ##
=============================================
- Coverage     73.00%   36.89%   -36.12%     
+ Complexity    32096    12083    -20013     
=============================================
  Files          1868     1691      -177     
  Lines        138999   129449     -9550     
  Branches      15292    14128     -1164     
=============================================
- Hits         101479    47760    -53719     
- Misses        29485    75426    +45941     
+ Partials       8035     6263     -1772     
Flag Coverage Δ
inttests 24.11% <36.36%> (?)
systests 25.19% <36.36%> (+0.09%) ⬆️
unittests 31.90% <50.50%> (-40.48%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...n/java/org/apache/pulsar/broker/service/Topic.java 9.09% <ø> (-22.73%) ⬇️
...oker/service/nonpersistent/NonPersistentTopic.java 45.30% <0.00%> (-27.21%) ⬇️
...sar/broker/service/persistent/PersistentTopic.java 54.92% <ø> (-24.90%) ⬇️
...rg/apache/pulsar/broker/service/BrokerService.java 56.22% <52.80%> (-25.24%) ⬇️
...rg/apache/pulsar/broker/service/AbstractTopic.java 60.34% <66.66%> (-27.81%) ⬇️

... and 1432 files with indirect coverage changes

@poorbarcode
Copy link
Contributor

poorbarcode commented Jul 21, 2023

the PRs you referenced don't seem to make this current PR #20540 obsolete.

Yes

@github-actions
Copy link

The pr had no activity for 30 days, mark with Stale label.

@github-actions github-actions bot added the Stale label Aug 31, 2023
@lightzhao
Copy link
Contributor

LGTM. Is there a way to add a test to verify this?

@github-actions github-actions bot removed the Stale label Sep 23, 2023
@github-actions
Copy link

The pr had no activity for 30 days, mark with Stale label.

@graysonzeng
Copy link
Contributor

graysonzeng commented Dec 7, 2023

Thanks for the fix, I tried using this PR but still encountered this problem on rolling restart bookies and was unable to consume until I restarted all the broker. I will try to troubleshoot this issue.

Pulsar version: v3.1.1

log:

WARN  org.apache.pulsar.client.impl.ConnectionHandler - [persistent://pulsar/default2/input_test2-partition-1] [subtest15] Error connecting to broker: org.apache.pulsar.client.api.PulsarClientException$LookupException: {"errorMsg":"Topic is temporarily unavailable","reqId":1138682772387920139, "remote":"21.21.47.12/21.21.47.12:6650", "local":"/11.145.30.236:60372"}
2023-12-05T19:03:59,569+0800 [pulsar-client-io-1-1] WARN  org.apache.pulsar.client.impl.ConnectionHandler - [persistent://pulsar/default2/input_test2-partition-1] [subtest15] Could not get connection to broker: org.apache.pulsar.client.api.PulsarClientException$LookupException: {"errorMsg":"Topic is temporarily unavailable","reqId":1138682772387920139, "remote":"21.21.47.12/21.21.47.12:6650", "local":"/11.145.30.236:60372"} -- Will try again in 0.779 s`

@Technoboy- Technoboy- modified the milestones: 3.2.0, 3.3.0 Dec 22, 2023
@lhotari lhotari force-pushed the lh-fix-topic-unloading-loading-raceconditions branch from bc9a806 to 76961f4 Compare December 26, 2023 19:50
@lhotari
Copy link
Member Author

lhotari commented Dec 27, 2023

This test app ends up with a topic being fenced: https://github.com/lhotari/pulsar-playground/blob/master/src/main/java/com/github/lhotari/pulsar/playground/TestScenarioUnloading.java , at least with Pulsar 3.1.1 (docker run --rm --name pulsar-standalone -d -p 8080:8080 -p 6650:6650 apachepulsar/pulsar:3.1.1 /pulsar/bin/pulsar standalone -nss -nfw) . Will check later if the fix helps.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants