
[Bug] Key_Shared subscription could deliver messages late and nonContiguousDeletedMessagesRange could exceed managedLedgerMaxUnackedRangesToPersist #23200

Closed
lhotari opened this issue Aug 19, 2024 · 2 comments · Fixed by #23231
Labels: type/bug



lhotari commented Aug 19, 2024

Search before asking

  • I searched in the issues and found nothing similar.

Read release policy

  • I understand that unsupported versions don't get bug fixes. I will attempt to reproduce the issue on a supported version of Pulsar client and Pulsar broker.

Version

This issue is present in the master, branch-3.0 and branch-3.3 branches.
It seems to be present in all released Pulsar versions since #7105.

Minimal reproduce step

Start Pulsar 3.3.1 in Docker:

docker run --name pulsar-standalone --rm -it -e PULSAR_STANDALONE_USE_ZOOKEEPER=1 -p 8080:8080 -p 6650:6650 apachepulsar/pulsar:3.3.1 /pulsar/bin/pulsar standalone -nss -nfw

Clone, compile & run the test app (source code) that reproduces the issue:

git clone -b key_shared_issue-2024-08-19 --depth=1 https://github.com/lhotari/pulsar-playground
cd pulsar-playground
./gradlew build
java -cp build/libs/pulsar-playground-all.jar com.github.lhotari.pulsar.playground.TestScenarioIssueKeyShared

The test takes over 7 minutes to run and the final log lines show the results:

2024-08-19T15:31:35,983+0300 [main] INFO  playground.TestScenarioIssueKeyShared - Done receiving. Remaining: 0 duplicates: 0 unique: 1000000
max latency difference of subsequent messages: 53071 ms
max ack holes: 27103
2024-08-19T15:31:35,983+0300 [main] INFO  playground.TestScenarioIssueKeyShared - Consumer consumer1 received 259642 unique messages 0 duplicates in 445 s, max latency difference of subsequent messages 52959 ms
2024-08-19T15:31:35,983+0300 [main] INFO  playground.TestScenarioIssueKeyShared - Consumer consumer2 received 233963 unique messages 0 duplicates in 417 s, max latency difference of subsequent messages 50930 ms
2024-08-19T15:31:35,983+0300 [main] INFO  playground.TestScenarioIssueKeyShared - Consumer consumer3 received 244279 unique messages 0 duplicates in 437 s, max latency difference of subsequent messages 53071 ms
2024-08-19T15:31:35,983+0300 [main] INFO  playground.TestScenarioIssueKeyShared - Consumer consumer4 received 262116 unique messages 0 duplicates in 443 s, max latency difference of subsequent messages 52651 ms

The test tracked a 53071 ms latency spike between two subsequent messages.
The maximum value of totalNonContiguousDeletedMessagesRange observed during the test was
27103, which exceeds the default value of managedLedgerMaxUnackedRangesToPersist (10000).
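
For illustration, here's a minimal sketch (not the test app's actual code; the class and field names are made up) of how the "max latency difference of subsequent messages" metric can be computed, assuming end-to-end latency is measured from publish time to receive time:

class LatencySpikeTracker {
    private long lastLatencyMs = -1;
    private long maxLatencyDiffMs = 0;

    // Call once per received message, in receive order.
    // publishTimeMs would come from Message#getPublishTime().
    void record(long publishTimeMs) {
        long latencyMs = System.currentTimeMillis() - publishTimeMs;
        if (lastLatencyMs >= 0) {
            // track the largest jump in latency between two subsequent messages
            maxLatencyDiffMs = Math.max(maxLatencyDiffMs, latencyMs - lastLatencyMs);
        }
        lastLatencyMs = latencyMs;
    }

    long maxLatencyDiffMs() {
        return maxLatencyDiffMs;
    }
}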

The test uses a random namespace name in the format test_ns<timestamp>, for example test_ns1723788349510. Use this command to find out the namespace name:

pulsar-admin namespaces list public

After that, you can check the stats and stats-internal:

pulsar-admin topics stats-internal persistent://public/test_ns1723788349510/test
pulsar-admin topics stats persistent://public/test_ns1723788349510/test

There's a large number of "ack holes" visible in the totalNonContiguousDeletedMessagesRange stats metric of the subscription sub.
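
A convenient way to follow this metric while the test runs (my own addition, not part of the repro steps; assumes jq is installed and that the stats JSON exposes the field under the subscription entry):

watch -n 5 'pulsar-admin topics stats persistent://public/test_ns1723788349510/test | jq ".subscriptions.sub.totalNonContiguousDeletedMessagesRange"'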

What did you expect to see?

In an application that uses Key_Shared subscription:

  • messages should be delivered without unnecessary delays.
  • there shouldn't be a large number of "ack holes" (totalNonContiguousDeletedMessagesRange) in the subscription, and certainly not one that exceeds the default managedLedgerMaxUnackedRangesToPersist of 10000

The test application, https://github.com/lhotari/pulsar-playground/blob/key_shared_issue-2024-08-19/src/main/java/com/github/lhotari/pulsar/playground/TestScenarioIssueKeyShared.java

  • acknowledges all messages in the received order
  • in the processing of each message, it adds a random 100 ms processing delay with 3% probability (this seems to be necessary to amplify the problem); see the sketch below
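
For illustration, a minimal sketch of that consumption pattern, assuming the standard Pulsar Java client API (this is not the test app's actual code; the class name and topic are placeholders):

import java.util.concurrent.ThreadLocalRandom;
import org.apache.pulsar.client.api.Consumer;
import org.apache.pulsar.client.api.Message;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.SubscriptionType;

public class KeySharedReproSketch {
    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")
                .build();
        Consumer<byte[]> consumer = client.newConsumer()
                .topic("persistent://public/test_ns1723788349510/test")
                .subscriptionName("sub")
                .subscriptionType(SubscriptionType.Key_Shared)
                .subscribe();
        while (true) {
            Message<byte[]> msg = consumer.receive();
            // ~3% of messages get a 100 ms processing delay to amplify the problem
            if (ThreadLocalRandom.current().nextInt(100) < 3) {
                Thread.sleep(100);
            }
            // every message is acknowledged in the order it was received
            consumer.acknowledge(msg);
        }
    }
}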

What did you see instead?

In an application using Key_Shared subscription:

  • messages were delivered late without a proper reason. In the repro app, a latency spike of over 50 seconds was observed when comparing the end-to-end latency (from producing to consuming) of subsequent messages.
  • there were a lot of "ack holes": totalNonContiguousDeletedMessagesRange could exceed managedLedgerMaxUnackedRangesToPersist, which means that subscription state would be lost in a broker restart or topic load-balancing event

Anything else?

This problem seems to reproduce only in backlogged cases, where there's already an existing backlog or the consumers aren't able to keep up with the producer. The problem resolves after the consumers catch up, but in the intermediate state messages get delivered late and totalNonContiguousDeletedMessagesRange can exceed managedLedgerMaxUnackedRangesToPersist during the catch-up period. This seems to be completely unnecessary.

I made an experiment where I reverted some of the #7105 changes in this commit: lhotari@5665b11.

Here's how to build and run pulsar-standalone for this experiment (instead of running pulsar-standalone in docker):

git clone -b lh-key-shared-experiment-2024-08-19 --depth=2 --single-branch https://github.com/lhotari/pulsar
cd pulsar
mvn -Pcore-modules,-main -T 1C clean install -DskipTests -Dspotbugs.skip=true -Dcheckstyle.skip=true -Dlicense.skip=true -DnarPluginPhase=none
rm -rf data
PULSAR_STANDALONE_USE_ZOOKEEPER=1 bin/pulsar standalone -nss -nfw 2>&1 | tee standalone.log

After this, it's possible to run java -cp build/libs/pulsar-playground-all.jar com.github.lhotari.pulsar.playground.TestScenarioIssueKeyShared as explained before.

The results for this experiment:

2024-08-19T15:48:49,718+0300 [main] INFO  playground.TestScenarioIssueKeyShared - Done receiving. Remaining: 0 duplicates: 0 unique: 1000000
max latency difference of subsequent messages: 922 ms
max ack holes: 711
2024-08-19T15:48:49,718+0300 [main] INFO  playground.TestScenarioIssueKeyShared - Consumer consumer1 received 259642 unique messages 0 duplicates in 449 s, max latency difference of subsequent messages 606 ms
2024-08-19T15:48:49,718+0300 [main] INFO  playground.TestScenarioIssueKeyShared - Consumer consumer2 received 233963 unique messages 0 duplicates in 449 s, max latency difference of subsequent messages 922 ms
2024-08-19T15:48:49,718+0300 [main] INFO  playground.TestScenarioIssueKeyShared - Consumer consumer3 received 244279 unique messages 0 duplicates in 449 s, max latency difference of subsequent messages 835 ms
2024-08-19T15:48:49,718+0300 [main] INFO  playground.TestScenarioIssueKeyShared - Consumer consumer4 received 262116 unique messages 0 duplicates in 450 s, max latency difference of subsequent messages 716 ms

The observed problem is resolved in the experiment. The maximum latency spike is 922 ms, compared to 53071 ms.
There is no ack hole problem either, since the maximum number of observed ack holes (totalNonContiguousDeletedMessagesRange) is 711, compared to 27103.
However, it's likely that the correct fix is a broader change in how replays are handled in Key_Shared subscriptions.

--

It's possible that the issue observed in this report is related to #21199 and #21657 / #21656.
The PR #21656 is related to replays with Key_Shared subscriptions.

Are you willing to submit a PR?

  • I'm willing to submit a PR!
lhotari added the type/bug label Aug 19, 2024

lhotari commented Aug 20, 2024

I also tested with the PIP-299 feature dispatcherPauseOnAckStatePersistentEnabled=true. It seems that the PIP-299 feature doesn't work properly: it doesn't keep the "ack holes" below 10000. I can see that the value exceeds 10000; the highest value I observed in the test was "nonContiguousDeletedMessagesRanges" : 11097. (These are the topic stats and internal stats.)
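
For reference, enabling the feature is a broker configuration change (in conf/broker.conf or conf/standalone.conf), e.g.:

# PIP-299: pause dispatching when the subscription's ack state can no longer be fully persisted
dispatcherPauseOnAckStatePersistentEnabled=true
# the limit the feature is expected to keep the ack holes under (10000 is the default)
managedLedgerMaxUnackedRangesToPersist=10000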

lhotari changed the title from [Bug] Key_Shared subscription could deliver messages late and totalNonContiguousDeletedMessagesRange could exceed managedLedgerMaxUnackedRangesToPersist to [Bug] Key_Shared subscription could deliver messages late and nonContiguousDeletedMessagesRange could exceed managedLedgerMaxUnackedRangesToPersist Aug 20, 2024

lhotari commented Aug 26, 2024

Interestingly, there's a related problem description in another context, written by @Shawyeok: #15445 (comment).
