Flakiness issues with subscriptionKeySharedUseConsistentHashing=true / PIP-119 in CPP tests #13965
Comments
Yeah, here is a screenshot of my Java consumer application. The topic was created by the C++ UT and received 3000 messages from the C++ producer. The Java consumers should have received 3000 messages in total, and sometimes it works well. The code is:

private static int receive(Consumer<byte[]> consumer) throws PulsarClientException {
    int n = 0;
    while (true) {
        final Message<byte[]> msg = consumer.receive(2, TimeUnit.SECONDS);
        if (msg == null) {
            break;
        }
        n++;
        System.out.println("Received " + new String(msg.getValue())
                + " from " + msg.getMessageId() + ", key: " + msg.getKey());
    }
    return n;
}

public static void main(String[] args) throws PulsarClientException {
    final PulsarClient client = PulsarClient.builder().serviceUrl("pulsar://localhost:6650").build();
    final ConsumerBuilder<byte[]> builder = client.newConsumer()
            .topicsPattern(".*KeySharedConsumerTest-multi-topics.*")
            .subscriptionInitialPosition(SubscriptionInitialPosition.Earliest)
            .subscriptionType(SubscriptionType.Key_Shared)
            .subscriptionName("my-sub-1");
    final Consumer<byte[]> consumer1 = builder.clone().subscribe();
    final Consumer<byte[]> consumer2 = builder.clone().subscribe();
    final Consumer<byte[]> consumer3 = builder.clone().subscribe();
    // Drain each consumer in turn; note that no message is ever acknowledged.
    int n1 = receive(consumer1);
    int n2 = receive(consumer2);
    int n3 = receive(consumer3);
    System.out.println("n1: " + n1 + ", n2: " + n2 + ", n3: " + n3 + ", total: " + (n1 + n2 + n3));
    client.close();
}

But I cannot reproduce it with a Java UT easily at the moment. |
It's weird that it's hard to reproduce in a unit test but easy to reproduce with a standalone. Here is my project to reproduce it: https://github.com/BewareMyPower/pulsar-issue-13965-reproduce /cc @codelipenghui |
Great work on the repro case @BewareMyPower ! |
The problem can be reproduced with Pulsar 2.8.2 and Pulsar 2.7.4 too. A quick way to start a Pulsar 2.8.2 standalone with subscriptionKeySharedUseConsistentHashing=true:
|
The result in the repro seems to be the same with |
Also tried by starting the container without subscriptionKeySharedUseConsistentHashing=true,
then running
The problem reproduces after restart too. |
Yeah. It's weird. I also tried the C++ client to consume the topic produced by the Java client with:

#include <assert.h>
#include <iostream>
#include <vector>
#include <pulsar/Client.h>

using namespace pulsar;

int main() {
    Client client("pulsar://localhost:6650");
    std::vector<Consumer> consumers(3);
    for (size_t i = 0; i < consumers.size(); i++) {
        ConsumerConfiguration conf;
        conf.setConsumerType(ConsumerType::ConsumerKeyShared);
        conf.setSchema(SchemaInfo(STRING, "String", ""));
        conf.setSubscriptionInitialPosition(InitialPositionEarliest);
        conf.setPatternAutoDiscoveryPeriod(1);
        auto result =
            client.subscribeWithRegex(".*KeySharedConsumerTest-multi-topics.*", "my-sub", conf, consumers[i]);
        assert(result == ResultOk);
    }
    std::vector<int> numReceivedList;
    int numTotal = 0;
    for (Consumer& consumer : consumers) {
        int n = 0;
        Message msg;
        while (true) {
            // Receive until a 3-second timeout; note that no message is acknowledged.
            auto result = consumer.receive(msg, 3000);
            if (result == ResultTimeout) {
                break;
            }
            assert(result == ResultOk);
            n++;
        }
        numReceivedList.emplace_back(n);
        numTotal += n;
    }
    for (int n : numReceivedList) {
        std::cout << n << std::endl;
    }
    std::cout << numTotal << std::endl;
    client.close();
}

The code above is nearly the same as the testMultiTopics C++ test. But when I run the C++ UT now, it never fails when running:

./tests/main --gtest_filter='*testMultiTopics'

Here are 5 test results in a row:
|
The messages shouldn't get acknowledged (and thus shouldn't be lost), right? |
I just checked, adding consumer acks, and the messages are there in the backlog, so this shouldn't be characterized as message loss. |
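For reference, one way to check that the unconsumed messages are still in the backlog is the admin API. A minimal sketch, assuming a Pulsar 2.8+ admin client where the stats objects expose getters; the admin URL, topic, and subscription names below are placeholders:

import org.apache.pulsar.client.admin.PulsarAdmin;

public class BacklogCheck {
    public static void main(String[] args) throws Exception {
        PulsarAdmin admin = PulsarAdmin.builder()
                .serviceHttpUrl("http://localhost:8080") // placeholder admin URL
                .build();
        // Placeholder topic and subscription names.
        String topic = "persistent://public/default/my-topic";
        long backlog = admin.topics().getStats(topic)
                .getSubscriptions().get("my-sub-1")
                .getMsgBacklog();
        System.out.println("Backlog of my-sub-1 on " + topic + ": " + backlog);
        admin.close();
    }
}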
I think there is a problem with the repro code above and with the original C++ test, and the production code is actually correct here. The repro code is doing:
1. subscribing 3 Key_Shared consumers on the same subscription;
2. draining all messages from the 1st consumer until a receive timeout, then the 2nd, then the 3rd;
3. never acknowledging any message.
There are 2 problems in the way the repro code (and the test) are consuming:
1. the messages are never acknowledged;
2. each consumer is drained to completion before the next consumer starts receiving.
What happens is that 1000 messages are pushed to the 1st consumer, and then the other consumers are stalled because the 1st consumer didn't acknowledge. |
A couple of ways to fix the test:
|
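For example, here is a minimal sketch of one such fix, based on the Java repro above (not the original test code): acknowledging each message as it is received, so that unacked messages don't stall the dispatcher for the other Key_Shared consumers.

import java.util.concurrent.TimeUnit;
import org.apache.pulsar.client.api.Consumer;
import org.apache.pulsar.client.api.Message;
import org.apache.pulsar.client.api.PulsarClientException;

class AckingReceiveLoop {
    // Same receive loop as in the repro, but every message is acknowledged.
    static int receiveAndAck(Consumer<byte[]> consumer) throws PulsarClientException {
        int n = 0;
        while (true) {
            Message<byte[]> msg = consumer.receive(2, TimeUnit.SECONDS);
            if (msg == null) {
                break;
            }
            consumer.acknowledge(msg);
            n++;
        }
        return n;
    }
}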
Yes, I applied BewareMyPower/pulsar-issue-13965-reproduce@8e85c77 and the test works now. But I'm still confused about what makes the difference with the C++ test
I've tried both solutions above in the C++ client and it works well. But I'm confused about why it works; could you explain a little more? @merlimat |
I found another problem. The messages are produced with 2 different keys, and then two consumers are created to consume these messages. I expect each consumer to consume one key. However, it sometimes fails: we can see that one consumer received all the messages. Is this expected behavior when subscriptionKeySharedUseConsistentHashing=true is enabled? |
While it was clear for the repro code that you shared, I couldn't find the problem in the C++ test (since it was acking the messages).
The random behavior is due to the auto-generated consumer names. The selection of consumers is done by hashing the consumer names onto a hash ring and mapping each message key to a point on that ring (see the sketch after this comment).
For 2 given message keys, you're not guaranteed that they will be assigned to 2 different consumers, although statistically the keys will be evenly distributed across consumers. To make the test deterministic, you can set different consumer names on the 2 consumers. |
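To illustrate the idea, here is a simplified sketch, not the actual broker implementation: the class name, the number of ring points, and the hash function are placeholders (the broker uses Murmur3 32-bit hashing). Each consumer name is hashed onto several points of a ring, and a message key is routed to the first ring point at or after the key's own hash, wrapping around at the end.

import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class SimpleConsistentHashSelector {
    private final TreeMap<Integer, String> ring = new TreeMap<>();
    private static final int POINTS_PER_CONSUMER = 100; // illustrative value

    public SimpleConsistentHashSelector(List<String> consumerNames) {
        // Place several points per consumer on the ring, keyed by a hash of the consumer name.
        for (String name : consumerNames) {
            for (int i = 0; i < POINTS_PER_CONSUMER; i++) {
                ring.put(hash(name + "-" + i), name);
            }
        }
    }

    public String select(String messageKey) {
        int h = hash(messageKey);
        Map.Entry<Integer, String> e = ring.ceilingEntry(h);
        return e != null ? e.getValue() : ring.firstEntry().getValue(); // wrap around
    }

    // Placeholder hash function, kept non-negative for the ring lookup.
    private static int hash(String s) {
        int h = 0;
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            h = 31 * h + (b & 0xff);
        }
        return h & 0x7fffffff;
    }

    public static void main(String[] args) {
        SimpleConsistentHashSelector selector =
                new SimpleConsistentHashSelector(List.of("consumer-a", "consumer-b"));
        // Two keys may or may not land on different consumers:
        System.out.println("key-1 -> " + selector.select("key-1"));
        System.out.println("key-2 -> " + selector.select("key-2"));
    }
}

With auto-generated consumer names, the ring points are effectively random for every run, so two keys can easily land on the same consumer; fixing distinct consumer names makes the assignment deterministic across runs.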
I agree. I renamed the issue. I'll close this issue since it seems to be addressed. @BewareMyPower Please reopen if there's more to do. |
We are facing this issue, and it yields a constantly increasing backlog with millions of messages. Is there a fix for it, or guidance on how to implement our consumers? We end up with 2 out of 16 consumer service pods that can subscribe, with the others idling and not receiving any messages. |
Describe the bug
Quoting @BewareMyPower from #13963
I also made similar observations based on C++ test logs:
Example of failures:
full logs in https://github.com/apache/pulsar/suites/5064608592/artifacts/150614790
To Reproduce
Steps to reproduce the behavior:
subscriptionKeySharedUseConsistentHashing=true
Expected behavior
There shouldn't be any message loss