Fix pulsar intrinsic bugs #176
Conversation
```java
@Override
// currently each liiklus consumer consumes all partitions, and leaves the client to rebalance;
// in pulsar, the pulsar client does not rebalance naturally on failover or exclusive subscriptions
public void testMultipleGroups() throws Exception {
```
this test is pretty storage-agnostic, but does it have to be overridden? I am afraid we're introducing a regression here
I am not sure it's a regression; if anything, I would say that the test depends too much on the intrinsic behaviour of the underlying storage.
e.g. for kafka, we can have two consumers for a group name, and the rebalancing works properly: if one consumer disconnects, the second consumer will reset the offset based on the liiklus offset and continue consuming. However, in pulsar's failover mode with two consumers, doing an offset seek disconnects all consumers and causes the second consumer to also get stuck.
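To make that concrete, here is a minimal sketch of the failover setup in question, using the plain pulsar client (the service URL, topic and subscription names are made up for illustration):

```java
import org.apache.pulsar.client.api.Consumer;
import org.apache.pulsar.client.api.MessageId;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.PulsarClientException;
import org.apache.pulsar.client.api.SubscriptionType;

public class FailoverSeekSketch {

    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650") // assumed local broker
                .build();

        // two consumers share one Failover subscription: one is "active" per
        // partition, the other is a standby that takes over on disconnect
        Consumer<byte[]> active = newFailoverConsumer(client);
        Consumer<byte[]> standby = newFailoverConsumer(client);

        // the manual seek discussed above: at the time of this PR it
        // disconnects *all* consumers of the subscription, not just the caller
        active.seek(MessageId.earliest);

        client.close();
    }

    static Consumer<byte[]> newFailoverConsumer(PulsarClient client) throws PulsarClientException {
        return client.newConsumer()
                .topic("my-topic")
                .subscriptionName("liiklus-group")
                .subscriptionType(SubscriptionType.Failover)
                .subscribe();
    }
}
```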
Doesn't exclusive mode allow only one consumer to be connected at a time? So that we get assignment-like behaviour, assumed by this test
it differs at subscription time; let me check the tests in more detail
testExclusiveRecordDistribution
here the test checks that if two consumers connect at the same time, each consumer will have consumed some messages. however this is not correct, as the rebalancing can happen such that only one consumer ends up handling all partitions. this is not very liiklus-specific, but dependent on the records storage's native client implementation.
testMultipleGroups
here the test checks that if one consumer is connected, it gets all messages. the next thing tested is that if a second consumer is connected, the second consumer has also consumed some messages. this is the same wrong assumption as above, since liiklus does not do any balancing, but the records storage's native client implementation does.
checking the above tests: since liiklus always assigns all partitions to any connecting consumer, the rebalancing happens based on the records storage's native client implementation. So it is the test which is not consistent, or in other words, the test is expecting some native client implementation behaviour (in this case pulsar).
or am I missing something here?
it rebalances only if the consumer of a partition dies for some reason. but this rebalancing is more on the liiklus side, which connects the pulsar consumer with an exclusive subscription.
well, it falls back, but the "rebalancing" is not happening (where "rebalancing" is re-assigning partitions between consumers to balance the load)
Which makes me wonder... unlike in Kafka, in Pulsar partitions are topics, and there are N consumers for each "partitioned topic". Which means that Pulsar cannot detect that N consumers are coming from the same "process", hence no rebalancing.
Maybe there is some property to set when "exclusive" is used to make it detect over-assigned instances and do the "rebalancing"?
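As a sketch of what I mean (Pulsar exposes every partition of a partitioned topic as an internal topic named `<topic>-partition-<n>`; the client and names below are assumed, as in the earlier sketch):

```java
// assuming a connected PulsarClient `client`; "my-topic" is hypothetical
Consumer<byte[]> partition0Consumer = client.newConsumer()
        // partition 0 of the partitioned topic "my-topic"
        .topic("persistent://public/default/my-topic-partition-0")
        .subscriptionName("liiklus-group")
        .subscriptionType(SubscriptionType.Failover)
        .subscribe();
```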
Pulsar's abstraction is a bit different than Kafka's. In Kafka, there is partition -> consumer. I don't think the partition abstraction itself is that different (except that in pulsar a partition is internally exposed as another topic). In pulsar, there is partition -> subscription -> consumer.
So a pulsar client does not really connect to a partition, but to a subscription (based on a subscription name). In turn, a subscription is connected to 1 topic or more (in our case it's only 1 subscription, per subscription name, per partition). Exclusive means that only one consumer can be connected to a subscription. With Failover, we can have multiple consumers connected to a subscription, and if a consumer dies, pulsar will push all unacked messages to another consumer. So there is not really any rebalancing happening: once the first consumer is connected, the messages in the partition will go to that consumer in any subscription mode.
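A small sketch of the difference (again with made-up names, assuming a connected `client` as above): with Exclusive, a second subscribe on the same subscription is rejected while the first consumer is connected; with Failover it would instead become a standby.

```java
Consumer<byte[]> first = client.newConsumer()
        .topic("my-topic")
        .subscriptionName("exclusive-sub")
        .subscriptionType(SubscriptionType.Exclusive)
        .subscribe();

try {
    // Exclusive allows at most one connected consumer per subscription,
    // so this second subscribe fails (typically with a ConsumerBusy error)
    client.newConsumer()
            .topic("my-topic")
            .subscriptionName("exclusive-sub")
            .subscriptionType(SubscriptionType.Exclusive)
            .subscribe();
} catch (PulsarClientException e) {
    // second consumer rejected while `first` is still connected
}
```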
However, in liiklus we do a manual seek with the consumer. The offset of a partition actually lies with the subscription; hence, when a manual seek is done, pulsar disconnects all consumers of the subscription. A rebalancing might then happen on reconnect, since the one to reconnect first might be another consumer; this is, however, a side effect of the current pulsar records storage implementation. The next issue is that, because of this seek, for some reason, if the receiving consumer is disconnected, the other connected consumer won't get any messages.
Additionally, as extra information (this is not further investigated yet): since the seek only happens on the initial subscription, in case of a failover switch the second consumer will get all messages since the last seek, because liiklus does not ack any messages in pulsar, and on failover pulsar sends all unacked messages (based on the last saved offset in the subscription) to the other consumer. As the other connected liiklus consumer can't detect when its counterpart disconnects, it won't know when to do a new seek based on the liiklus offset, and may just receive all messages the other consumer has previously received. However, as mentioned above, this does not happen in the test, since we only test new consumer connections and never test what happens on failover, i.e. when the consumer receiving messages is killed.
So in short, to my understanding pulsar does not really rebalance automatically, and we can't assume it will happen on new consumer creation.
p.s. there is actually a way to always have rebalancing, and that is the Key_Shared or Shared subscription mode. However, liiklus offsets and acks do not really support this in a good way: in pulsar, acking is done on a per-message basis, while in liiklus it's done on an offset basis.
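To illustrate that last point with the real pulsar client API (assuming a connected `Consumer<byte[]> consumer`; the mapping to liiklus offsets is my interpretation):

```java
Message<byte[]> msg = consumer.receive();

// pulsar's native model: acknowledge this single message only
consumer.acknowledge(msg);

// the offset-style alternative: acknowledge everything up to and including
// this message's position. Note that cumulative acks are not permitted on
// Shared / Key_Shared subscriptions, which is part of why an offset-based
// model doesn't map onto them.
// consumer.acknowledgeCumulative(msg);
```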
> pulsar disconnects all consumers of the subscription
FYI it is fixed in master: https://apache-pulsar.slack.com/archives/C5Z4T36F7/p1567098757051900?thread_ts=1567090931.044200&cid=C5Z4T36F7
(I will comment more, just wanted to point out eagerly)
```java
@Override
// currently each liiklus consumer consumes all partitions, and leaves the client to rebalance;
// in pulsar, the pulsar client does not rebalance naturally on failover or exclusive subscriptions
public void testExclusiveRecordDistribution() throws Exception {
```
Same comment as in testMultipleGroups
same comment as in testMultipleGroups
this test verifies that:
- Eventually, both subscriptions will receive assignments
- their received records are not overlapping
this is not Kafka-specific, and if this test fails, there seems to be a bug in the storage implementation
```java
@Test
@Override
// since pulsar behaves differently for closed ledgers, we need to override this test to include the 1-entry setback
```
since the whole point of Liiklus is to abstract away the semantics of different event storages, I don't understand why this test does not pass as it is in the TCK?
Also, TCK tests should not be changed; only new tests can be added. There are assumptions for storages that do not support more than 1 partition, but other than that, the storages should behave similarly.
For this one, it's indeed because there is an inconsistency in pulsar's behaviour when checking what a valid entry id is. This would mean that we would need to patch pulsar and wait for a new release.
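For readers without the diff at hand, a purely hypothetical sketch of what such a one-entry setback could look like (the actual logic lives in PulsarRecordsStorage's adaptForSeek method, mentioned below):

```java
// hypothetical sketch only; not the actual implementation in this PR
static MessageId adaptForSeek(MessageIdImpl messageId) {
    // on an already-closed ledger pulsar validates entry ids differently,
    // so seek one entry earlier to avoid losing the first record
    return new MessageIdImpl(
            messageId.getLedgerId(),
            messageId.getEntryId() - 1,
            messageId.getPartitionIndex()
    );
}
```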
FYI I'm trying to get help in Pulsar's Slack about it, will report back as soon as I have any info
you can check the PulsarRecordsStorage comment on the adaptForSeek method.
I have no access.
I would highly suggest getting it if you plan to work with Pulsar; they have a very active community and help with different kinds of questions :)
got it
Resolved review threads (outdated):
- ...ar-records-storage/src/main/java/com/github/bsideup/liiklus/pulsar/PulsarRecordsStorage.java (2 threads)
- tck/src/main/java/com/github/bsideup/liiklus/records/tests/SubscribeTest.java
- ...ecords-storage/src/test/java/com/github/bsideup/liiklus/pulsar/PulsarRecordsStorageTest.java
```java
        toOffset(message.getMessageId())
    );
})
.delaySubscription(initialOffset.flatMap(offset -> resetSubscriptionOffset(consumer, offset)));
```
WDYT if we change the code so that on the current ledger we just don't subtract the offset?
that's maybe not the safest way in theory, since while we are asking about the latest message the ledger could be closed right after, but quick tests are working:
```java
.map(PulsarRecordsStorage::fromOffset)
.cast(MessageIdImpl.class)
.flatMap(messageId -> {
    return Mono.fromCompletionStage(() -> ConsumerImplAccessor.getLastMessageIdAsync(consumer))
            .map(last -> {
                var lastMessageId = (MessageIdImpl) last;
                // same ledger as the latest message => the ledger is still
                // open, so we can seek to the stored offset as-is
                if (lastMessageId.getLedgerId() == messageId.getLedgerId()) {
                    return messageId;
                }
                // the message lives on a closed ledger => apply the setback
                return adaptForSeek(messageId);
            });
})
```
There are currently a few bugs to fix in order to run pulsar with liiklus: