[feat] [broker] PIP-188 support blue-green cluster migration [part-2] #19605
Conversation
/pulsarbot run-failure-checks
I don't have context on the whole PR yet, but here are some initial comments.
log.info("[{}] redirect migrated producer to topic {}: " | ||
+ "producerId={}, producerName = {}, {}", remoteAddress, | ||
topicName, producerId, producerName, ex.getCause().getMessage()); | ||
commandSender.sendTopicMigrated(ResourceType.Producer, producerId, |
I see that we're sending this message without checking the client's protocol version. I know that your PR didn't introduce this behavior, but I think we should make sure to handle that before this feature is released. Are you able to fix this @vineeth1995, or should we open an issue for the work?
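For illustration, here is a minimal sketch of the kind of guard being discussed; the threshold constant and the callback shape are assumptions for this example, not the actual ServerCnx/PulsarCommandSender API:

```java
// Hypothetical guard: only send the TopicMigrated command to clients whose
// negotiated protocol version can decode it; older clients get a plain error.
class MigratedCommandGuard {
    static final int TOPIC_MIGRATED_MIN_PROTOCOL_VERSION = 20; // assumed threshold

    void sendMigratedOrError(int clientProtocolVersion,
                             Runnable sendTopicMigrated, Runnable sendError) {
        if (clientProtocolVersion >= TOPIC_MIGRATED_MIN_PROTOCOL_VERSION) {
            sendTopicMigrated.run(); // client understands the new command
        } else {
            sendError.run();         // older client: fall back to an error it can parse
        }
    }
}
```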
I'm so sorry @vineeth1995, I didn't realize that `commandSender.sendTopicMigrated` was doing that version check. I think it makes sense to log the error as you just did, but do we need to consider sending an unretriable error to the client to make sure they do not constantly attempt to reconnect?
It's not exactly a non-retriable error: the client has to retry until the topic's backlog is drained, and after that producers will be redirected to the new cluster. It's very similar to the backlog-quota-exceeded error, where the client has to keep retrying until the backlog is drained.
@michaeljmarshall will handle this in a separate PR, as it requires a client-side change as well.
We can also add a broker-side feature that restricts this functionality to specific client versions, since a feature like this requires all clients to upgrade. But that can be addressed in a separate PR.
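As a rough illustration of that idea (the setting and the check are hypothetical, not an existing broker option):

```java
// Hypothetical broker-side gate: migration redirects are activated only for
// clients at or above a configured minimum protocol version; older clients
// stay on the retry/error path until the whole fleet has upgraded.
class ClusterMigrationGate {
    private final int minClientProtocolVersion; // e.g. a hypothetical broker.conf knob

    ClusterMigrationGate(int minClientProtocolVersion) {
        this.minClientProtocolVersion = minClientProtocolVersion;
    }

    boolean canRedirect(int clientProtocolVersion) {
        return clientProtocolVersion >= minClientProtocolVersion;
    }
}
```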
```diff
@@ -1540,6 +1540,7 @@ private void buildProducerAndAddTopic(Topic topic, long producerId, String produ
         getPrincipal(), isEncrypted, metadata, schemaVersion, epoch,
         userProvidedProducerName, producerAccessMode, topicEpoch, supportsPartialProducer);

+        log.info("trying to add producer - {} to topic -{}", producerName, topicName);
```
We already log about this, so I think we can remove this before merging.
pulsar/pulsar-broker/src/main/java/org/apache/pulsar/broker/service/ServerCnx.java
Lines 1374 to 1377 in 8cc979d
```java
if (log.isDebugEnabled()) {
    log.debug("[{}][{}] Creating producer. producerId={}, schema is {}", remoteAddress, topicName,
            producerId, schema == null ? "absent" : "present");
}
```
Addressed.
/pulsarbot run-failure-checks
LGTM.
/pulsarbot run-failure-checks
LGTM.
/pulsarbot run-failure-checks
```java
    return null;

if (topic.isReplicationBacklogExist()) {
    log.info("Topic {} is migrated but replication backlog exist: "
```
Is it possible for `isReplicationBacklogExist` to become true later?
E.g., on one pass none of the replicators has a backlog (`isReplicationBacklogExist == false`), and then the next time one of the replicators gets more messages/backlog?
What guards against this case, or is it not a problem?
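For reference, the snippet above suggests a point-in-time check of roughly this shape (the interface and method names are assumptions, not the exact broker code). Note that it is re-evaluated on every call, so its result can flip between passes, which is exactly what this question is probing:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Assumed shape of the backlog check: true if any per-cluster replicator
// still has entries it has not yet replicated to its remote cluster.
class ReplicationBacklogCheckSketch {
    interface Replicator {
        long getNumberOfEntriesInBacklog(); // entries not yet replicated
    }

    private final Map<String, Replicator> replicators = new ConcurrentHashMap<>();

    boolean isReplicationBacklogExist() {
        return replicators.values().stream()
                .anyMatch(r -> r.getNumberOfEntriesInBacklog() > 0);
    }
}
```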
There seems to be a new flaky test introduced by this PR. The flaky test issue is #20375. @vineeth1995 do you have a chance to take a look? thanks
Hi,
I will create a fix for this.
Thanks,
Vineeth
Motivation
This completes #16551 and extends the part-1 PR #17962.
It handles the replicator and the message-ordering guarantee for blue-green deployment.
Replicator and message ordering handling
A. Incoming replication messages from other regions' replicator producers to the blue cluster
Migration does not impact the ordering of messages coming from other regions to the blue/green clusters. Once the blue cluster is marked migrated, it rejects replication writes from remote regions and redirects the remote replicator producers to the green cluster, where new messages are then written. Consumers on the blue cluster are redirected to green only once they have received all messages from blue. So migration gives an ordering guarantee for messages replicated from remote regions.
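A minimal sketch of that behavior, using assumed names rather than the broker's real API:

```java
// Assumed-name sketch: once a blue topic is marked migrated, replicated
// publishes from remote regions are rejected and the remote replicator
// producer is pointed at the green cluster instead.
class BlueTopicSketch {
    private volatile boolean migrated;
    private final String greenClusterUrl;

    BlueTopicSketch(String greenClusterUrl) {
        this.greenClusterUrl = greenClusterUrl;
    }

    void markMigrated() {
        migrated = true;
    }

    /** Returns null when the write is accepted, or a redirect URL when rejected. */
    String onReplicatedPublish(byte[] payload) {
        if (migrated) {
            return greenClusterUrl; // reject the write; the remote producer reconnects to green
        }
        // ... append payload to the topic's ledger (elided) ...
        return null;
    }
}
```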
B. Outgoing replication messages from the blue cluster's replicator producers to other regions
The broker can give an ordering guarantee in this case too, with the trade-off that the topic stays unavailable until the blue cluster has replicated all of its existing published messages before the topic is terminated.
When the blue cluster marks the topic terminated and migrated (sketched below):
1. The topic does not redirect producers/consumers until all replicators reach the end of the topic and have replicated every message to the remote regions. Until then, the topic sends TOPIC_UNAVAILABLE to producers/consumers so they keep retrying.
2. The broker disconnects all replicators and deletes them once they reach the end of the topic.
3. The broker then starts sending the migrated command to producers/consumers, redirecting the clients to the green cluster.
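A condensed sketch of these three steps (the types and method names are assumptions, not Pulsar's real broker API):

```java
import java.util.Collection;

// Assumed-name sketch of the terminate-and-migrate flow described above.
class MigratedTopicFlowSketch {
    interface Replicator {
        boolean hasReachedEndOfTopic(); // replicated everything up to termination
        void disconnectAndDelete();
    }

    String onClientConnect(Collection<Replicator> replicators) {
        boolean drained = replicators.stream().allMatch(Replicator::hasReachedEndOfTopic);
        if (!drained) {
            return "TOPIC_UNAVAILABLE";           // step 1: the client keeps retrying
        }
        replicators.forEach(Replicator::disconnectAndDelete); // step 2
        return "TOPIC_MIGRATED -> green cluster"; // step 3: redirect the client
    }
}
```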
Modifications
Example use case:
1. producer1 sends messages msg1 and msg2 to region1.
2. region1's replicator replicates msg1 to region2, but region2 then has a connectivity issue with region1, so region1 holds a replication backlog (msg2) for region2.
3. The blue-green migration is marked: region1 -> region1A.
4. If producer1 is redirected to region1A right away, it sends msg3 to region1A; region1A is connected to region2, so region1A replicates msg3 to region2.
5. Meanwhile, if region1 gets its connection back to region2, region1 replicates msg2 (the replication backlog) to region2.
6. The region2 consumer then consumes msg1, msg3, msg2, which is the wrong order; it should be msg1, msg2, msg3.

So we must not redirect producer1 until the replicator has no backlog. This PR handles that use case by making sure the replication backlog is drained before redirecting producers to the green cluster (see the toy simulation below).
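To make the ordering concrete, here is a toy simulation of the two outcomes (plain Java, no Pulsar APIs involved):

```java
import java.util.List;

// Toy simulation of region2's arrival order with and without draining
// region1's replication backlog before redirecting producer1.
public class OrderingDemo {
    public static void main(String[] args) {
        // Without the drain gate: producer1 is redirected while msg2 is still stuck.
        List<String> withoutGate = List.of(
                "msg1",  // replicated before the region1->region2 link broke
                "msg3",  // published to region1A (green) and replicated from there
                "msg2"); // old backlog flushes once region1's link recovers
        System.out.println("without gate: " + withoutGate); // [msg1, msg3, msg2], wrong

        // With the gate: the redirect happens only after region1's backlog drains.
        List<String> withGate = List.of("msg1", "msg2", "msg3");
        System.out.println("with gate:    " + withGate);    // correct order
    }
}
```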
Verifying this change
Added an end-to-end test to verify this change.
Does this pull request potentially affect one of the following parts:
If the box was checked, please highlight the changes
Documentation
doc
doc-required
doc-not-needed
doc-complete