[Bug] Key_Shared subscription could deliver messages late and nonContiguousDeletedMessagesRange could exceed managedLedgerMaxUnackedRangesToPersist #23200
Comments
I also tested with PIP-299 feature
Interestingly, there's a related problem description in another context written by @Shawyeok: #15445 (comment).
Search before asking
Read release policy
Version
This issue is present in the master, branch-3.0 and branch-3.3 branches.
It seems to be present in all released Pulsar versions since #7105.
Minimal reproduce step
Start Pulsar 3.3.1 in Docker:
Clone, compile & run the test app (source code) that reproduces the issue:
The test takes over 7 minutes to run and the final log lines show the results:
The test tracked a 53071 ms latency spike between two subsequent messages.
The maximum value of `totalNonContiguousDeletedMessagesRange` observed during the test was 27103, which is more than the default value of `managedLedgerMaxUnackedRangesToPersist`.
The test uses a random namespace name in the format `test_ns1723788349510`. Use this command to find out the namespace name:
After that, you can check the stats and stats-internal:
There's a large amount of "ack holes" seen in the `totalNonContiguousDeletedMessagesRange` stats metric of the subscription `sub`.
What did you expect to see?
In an application that uses Key_Shared subscription:
The number of ack holes (`totalNonContiguousDeletedMessagesRange`) in the subscription should not grow so large that it exceeds the default `managedLedgerMaxUnackedRangesToPersist` of 10000.
The test application: https://github.com/lhotari/pulsar-playground/blob/key_shared_issue-2024-08-19/src/main/java/com/github/lhotari/pulsar/playground/TestScenarioIssueKeyShared.java
What did you see instead?
In an application using Key_Shared subscription:
`totalNonContiguousDeletedMessagesRange` could exceed `managedLedgerMaxUnackedRangesToPersist`, which means that subscription state would be lost in a broker restart or topic load balancing event.
Anything else?
This problem seems to reproduce only in the backlogged cases where there's already an existing backlog or when consumers aren't able to keep up with the producer. The problem is resolved after the consumers catch up, but the intermediate state is that messages get delivered late and totalNonContiguousDeletedMessagesRange could exceed managedLedgerMaxUnackedRangesToPersist during the catch up time. This seems to be completely unnecessary.
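To make the "ack hole" count concrete, here is a minimal sketch of how non-contiguous acknowledged ranges can pile up behind an unacknowledged message. It is an illustrative model only, not Pulsar's cursor implementation: positions are simplified to plain `long` offsets, and the class and method names (`AckHoleSketch`, `countNonContiguousAckedRanges`) are hypothetical. The limit of 10000 is the default for `managedLedgerMaxUnackedRangesToPersist` mentioned in this issue.

```java
import java.util.TreeSet;

// Illustrative model only; NOT Pulsar's ManagedCursor implementation.
class AckHoleSketch {
    // Default of managedLedgerMaxUnackedRangesToPersist as stated in this issue.
    static final int MAX_UNACKED_RANGES_TO_PERSIST = 10_000;

    /**
     * Counts contiguous runs of individually acknowledged positions that lie
     * after the mark-delete position. Each run corresponds to one
     * non-contiguous deleted range. Assumes the caller has already advanced
     * markDeletePosition past any fully acknowledged prefix.
     */
    static int countNonContiguousAckedRanges(long markDeletePosition, TreeSet<Long> ackedPositions) {
        int ranges = 0;
        boolean first = true;
        long previous = 0;
        // Only positions strictly after the mark-delete position are relevant.
        for (long position : ackedPositions.tailSet(markDeletePosition, false)) {
            if (first || position != previous + 1) {
                ranges++; // a gap (or the first position) starts a new range
            }
            previous = position;
            first = false;
        }
        return ranges;
    }

    public static void main(String[] args) {
        // Hypothetical backlog of 30,000 messages where only every second
        // message has been acknowledged, so each ack is an isolated range.
        TreeSet<Long> acked = new TreeSet<>();
        for (long position = 1; position < 30_000; position += 2) {
            acked.add(position);
        }
        int ranges = countNonContiguousAckedRanges(-1, acked);
        System.out.println("ranges=" + ranges
                + " exceedsLimit=" + (ranges > MAX_UNACKED_RANGES_TO_PERSIST));
        // prints: ranges=15000 exceedsLimit=true
    }
}
```

In this model, ranges beyond the limit correspond to acknowledgement state that would not be persisted, which is why the issue treats exceeding the limit as a risk in a broker restart or topic load balancing event.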
I made an experiment where I reverted some of the #7105 changes in this commit: lhotari@5665b11.
Here's how to build and run pulsar-standalone for this experiment (instead of running pulsar-standalone in docker):
After this, it's possible to run `java -cp build/libs/pulsar-playground-all.jar com.github.lhotari.pulsar.playground.TestScenarioIssueKeyShared` as explained before.
The results for this experiment:
The observed problem is resolved in the experiment. The maximum latency spike is 922 ms compared to 53071.
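For readers comparing the two numbers, the following is a minimal sketch of one plausible way to track a "latency spike between 2 subsequent messages": record the largest gap between the receive timestamps of consecutive messages. This is an assumption about what is being measured; the actual test application may compute it differently, and the class name `LatencySpikeSketch` and the sample timestamps are hypothetical.

```java
// Illustrative sketch; the real test application may measure latency differently.
class LatencySpikeSketch {
    private long previousTimestampMillis = -1; // -1 means no message seen yet
    private long maxGapMillis = 0;

    // Call once per received message with its receive time in milliseconds.
    void onMessage(long receiveTimestampMillis) {
        if (previousTimestampMillis >= 0) {
            long gap = receiveTimestampMillis - previousTimestampMillis;
            maxGapMillis = Math.max(maxGapMillis, gap);
        }
        previousTimestampMillis = receiveTimestampMillis;
    }

    long maxGapMillis() {
        return maxGapMillis;
    }

    public static void main(String[] args) {
        LatencySpikeSketch tracker = new LatencySpikeSketch();
        // Hypothetical receive timestamps in milliseconds.
        long[] timestamps = {0, 10, 25, 947, 960};
        for (long timestamp : timestamps) {
            tracker.onMessage(timestamp);
        }
        System.out.println("max gap ms: " + tracker.maxGapMillis());
        // prints: max gap ms: 922
    }
}
```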
There is no ack hole problem since the maximum number of observed ack holes (`totalNonContiguousDeletedMessagesRange`) is 711 compared to 27103 ack holes.
However, it's likely that the correct fix is a broader change in how replays are handled in Key_Shared subscriptions.
--
It's possible that the issue observed in this report is related to #21199 and #21657 / #21656.
The PR #21656 is related to replays with Key_Shared subscriptions.
Are you willing to submit a PR?