[Bug] Key_Shared or Shared subscription doesn't always deliver messages from the replay queue after a consumer disconnects and leaves a backlog unless new messages are produced #23845
Comments
There's already a solution to cancel a pending read when a hash gets unblocked: Lines 109 to 122 in acac72e
There seems to be a bug in this behavior, since it didn't catch the case that was encountered. One possible reason is that a consumer didn't have permits when the unblocking happened. There would need to be some logic to handle that case.
It looks like this piece of code triggers a re-read here. I tried the related test cases, but no errors occurred. Could you provide more details about what you meant by "permit"? @lhotari
@walkinggo You can read more about permits here: https://pulsar.apache.org/docs/4.0.x/developing-binary-protocol/#flow-control . Just to be clear, I'm not looking for contributions to address this particular issue; I've assigned it to myself and I'm currently working on it.
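For readers unfamiliar with permits, here is a minimal toy model of permit-based flow control as described in the linked flow-control documentation: the consumer grants permits, and the broker may only push as many messages as there are permits available. The class `PermitModel` and its methods are hypothetical names made up for illustration; this is not Pulsar broker code.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

/** Toy model of permit-based flow control. Hypothetical; not Pulsar broker code. */
public class PermitModel {
    // Number of messages the consumer has allowed the broker to send.
    private int availablePermits = 0;
    // Messages waiting to be dispatched to this consumer.
    private final Queue<String> pending = new ArrayDeque<>();

    /** Models the consumer sending a flow request that grants n more permits. */
    public void flow(int n) {
        availablePermits += n;
    }

    /** Adds a message that is ready to be dispatched. */
    public void enqueue(String msg) {
        pending.add(msg);
    }

    /** Dispatches pending messages while permits remain; each message consumes one permit. */
    public List<String> dispatch() {
        List<String> sent = new ArrayList<>();
        while (availablePermits > 0 && !pending.isEmpty()) {
            sent.add(pending.poll());
            availablePermits--;
        }
        return sent;
    }

    /** Number of messages still waiting for permits. */
    public int pendingCount() {
        return pending.size();
    }
}
```

In this model, a call to `dispatch()` while the consumer has zero permits sends nothing, which is the kind of situation the earlier comment suggests might not be handled when a hash gets unblocked.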
OK, I got it.
I think I finally found a potential race condition by analysing the code. When a consumer is removed, the pending messages get added to the replay queue and a new read gets triggered: Lines 243 to 258 in ea56ada
This calls Lines 1432 to 1437 in ea56ada
In the "classic" implementation, there's a direct Lines 231 to 240 in ea56ada
This also supports this theory, that this problem appears in 4.0, but not with 3.x Key_Shared implementation. The reason why the Lines 338 to 346 in ea56ada
This is the code for Lines 792 to 814 in ea56ada
By default, the Lines 752 to 761 in ea56ada
It will first set the
A similar race condition problem could also happen with the Shared subscription type; this is not specific to Key_Shared.
I also noticed another subtle problem regarding Key_Shared subscription. Lines 230 to 255 in acac72e
For Shared subscription, it's not necessary to cancel the pending read, since ordering doesn't matter. The same applies when Key_Shared subscription is running in allowOutOfOrderDelivery mode.
Search before asking
Read release policy
Version
Pulsar 4.0.1
Minimal reproduce step
Exact steps to reproduce aren't yet confirmed.
This problem was encountered in a test where a large number of consumers were scaled so that consumers were added and removed. The problem was noticed at the end of the test case, where some messages didn't get delivered to consumers and remained in the backlog.
In the topic stats for the subscription, msgInReplay showed a positive value, and in the internal stats for the subscription, subscriptionHavePendingRead was true. By looking at the code, it seems to be a case that isn't handled for PersistentDispatcherMultipleConsumers/PersistentStickyKeyDispatcherMultipleConsumers.
What did you expect to see?
The cursor shouldn't go completely into the "waiting" state when there are messages in the replay queue.
What did you see instead?
Messages in the replay queue don't get dispatched to consumers.
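As a rough aid for spotting this state, the symptom reported above (a positive msgInReplay together with subscriptionHavePendingRead being true) can be written as a simple check. This is only a sketch: the class and method names are hypothetical, the values are assumed to have been read from the topic stats and internal stats beforehand, and a single matching sample may be transient, so the check would need to hold across several samples before it suggests a stuck subscription.

```java
/** Hypothetical helper; the inputs are values read from topic stats and internal stats. */
public class ReplaySymptomCheck {

    /**
     * Returns true when the sampled values match the symptom described in this report:
     * messages are waiting in the replay queue while the cursor still has a pending read.
     */
    public static boolean matchesReportedSymptom(long msgInReplay,
                                                 boolean subscriptionHavePendingRead) {
        return msgInReplay > 0 && subscriptionHavePendingRead;
    }

    public static void main(String[] args) {
        // Example values only; they are not taken from the original test run.
        System.out.println(matchesReportedSymptom(12, true));
        System.out.println(matchesReportedSymptom(0, true));
    }
}
```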
Anything else?
A possible workaround is to set dispatcherDispatchMessagesInSubscriptionThread=false in broker.conf to prevent the race condition causing this issue from happening.
Are you willing to submit a PR?