[feat] [broker] PIP-188 support blue-green cluster migration [part-2] #19605
Conversation
/pulsarbot run-failure-checks
I don't have context on the whole PR yet, but here are some initial comments.
log.info("[{}] redirect migrated producer to topic {}: " | ||
+ "producerId={}, producerName = {}, {}", remoteAddress, | ||
topicName, producerId, producerName, ex.getCause().getMessage()); | ||
commandSender.sendTopicMigrated(ResourceType.Producer, producerId, |
I see that we're sending this message without checking the client's protocol version. I know that your PR didn't introduce this behavior, but I think we should make sure to handle that before this feature is released. Are you able to fix this @vineeth1995, or should we open an issue for the work?
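For illustration, here is a minimal sketch of the kind of guard being discussed; the threshold constant and the callback shape are assumptions for this example, not the actual ServerCnx/PulsarCommandSender API:

```java
// Hypothetical guard: only send the TopicMigrated command to clients whose
// negotiated protocol version can decode it; older clients get a plain error.
class MigratedCommandGuard {
    static final int TOPIC_MIGRATED_MIN_PROTOCOL_VERSION = 20; // assumed threshold

    void sendMigratedOrError(int clientProtocolVersion,
                             Runnable sendTopicMigrated, Runnable sendError) {
        if (clientProtocolVersion >= TOPIC_MIGRATED_MIN_PROTOCOL_VERSION) {
            sendTopicMigrated.run(); // client understands the new command
        } else {
            sendError.run();         // older client: fall back to an error it can parse
        }
    }
}
```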
I'm so sorry @vineeth1995, I didn't realize that `commandSender.sendTopicMigrated` was doing that version check. I think it makes sense to log the error as you just did, but do we need to consider sending an unretriable error to the client to make sure they do not constantly attempt to reconnect?
It's not exactly a non-retriable error: the client has to retry until the topic's backlog is drained, and after that producers will be redirected to the new cluster. It's very similar to the backlog-quota-exceeded error, where the client has to keep retrying until the backlog is drained.
@michaeljmarshall will handle this in a separate PR, as it requires a client-side change as well.
We can also add a broker-side feature that restricts this functionality to specific client versions, since a feature like this requires all clients to upgrade. But that can be addressed in a separate PR.
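As a rough illustration of that idea (the setting and the check are hypothetical, not an existing broker option):

```java
// Hypothetical broker-side gate: migration redirects are activated only for
// clients at or above a configured minimum protocol version; older clients
// stay on the retry/error path until the whole fleet has upgraded.
class ClusterMigrationGate {
    private final int minClientProtocolVersion; // e.g. a hypothetical broker.conf knob

    ClusterMigrationGate(int minClientProtocolVersion) {
        this.minClientProtocolVersion = minClientProtocolVersion;
    }

    boolean canRedirect(int clientProtocolVersion) {
        return clientProtocolVersion >= minClientProtocolVersion;
    }
}
```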
```diff
@@ -1540,6 +1540,7 @@ private void buildProducerAndAddTopic(Topic topic, long producerId, String produ
         getPrincipal(), isEncrypted, metadata, schemaVersion, epoch,
         userProvidedProducerName, producerAccessMode, topicEpoch, supportsPartialProducer);

+        log.info("trying to add producer - {} to topic -{}", producerName, topicName);
```
We already log about this, so I think we can remove this before merging.
pulsar/pulsar-broker/src/main/java/org/apache/pulsar/broker/service/ServerCnx.java
Lines 1374 to 1377 in 8cc979d
```java
if (log.isDebugEnabled()) {
    log.debug("[{}][{}] Creating producer. producerId={}, schema is {}", remoteAddress, topicName,
            producerId, schema == null ? "absent" : "present");
}
```
Addressed.
/pulsarbot run-failure-checks
LGTM.
/pulsarbot run-failure-checks
LGTM.
/pulsarbot run-failure-checks
```java
    return null;

if (topic.isReplicationBacklogExist()) {
    log.info("Topic {} is migrated but replication backlog exist: "
```
Is it possible for `isReplicationBacklogExist` to become true later?
E.g., on one pass none of the replicators has a backlog (`isReplicationBacklogExist == false`), and then the next time one of the replicators gets more messages/backlog?
What guards against this case, or is it not a problem?
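For reference, the snippet above suggests a point-in-time check of roughly this shape (the interface and method names are assumptions, not the exact broker code). Note that it is re-evaluated on every call, so its result can flip between passes, which is exactly what this question is probing:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Assumed shape of the backlog check: true if any per-cluster replicator
// still has entries it has not yet replicated to its remote cluster.
class ReplicationBacklogCheckSketch {
    interface Replicator {
        long getNumberOfEntriesInBacklog(); // entries not yet replicated
    }

    private final Map<String, Replicator> replicators = new ConcurrentHashMap<>();

    boolean isReplicationBacklogExist() {
        return replicators.values().stream()
                .anyMatch(r -> r.getNumberOfEntriesInBacklog() > 0);
    }
}
```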
There seems to be a new flaky test introduced by this PR. The flaky test issue is #20375. @vineeth1995 do you have a chance to take a look? thanks
Hi,
I will create a fix for this.
Thanks,
Vineeth
Motivation
This completes #16551 and extends the part-1 PR #17962.
It handles the replicator and the message-ordering guarantee for blue-green deployment.
Replicator and message ordering handling
A. Incoming replication messages from other regions' replicator producers to the blue cluster
Migration does not impact the ordering of messages coming from other regions to the blue/green clusters. Once the blue cluster is marked migrated, it rejects replication writes from remote regions and redirects the remote replicator producers to the green cluster, where new messages are then written. Consumers on the blue cluster are redirected to green only once they have received all messages from blue. So migration gives an ordering guarantee for messages replicated from remote regions.
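A minimal sketch of that behavior, using assumed names rather than the broker's real API:

```java
// Assumed-name sketch: once a blue topic is marked migrated, replicated
// publishes from remote regions are rejected and the remote replicator
// producer is pointed at the green cluster instead.
class BlueTopicSketch {
    private volatile boolean migrated;
    private final String greenClusterUrl;

    BlueTopicSketch(String greenClusterUrl) {
        this.greenClusterUrl = greenClusterUrl;
    }

    void markMigrated() {
        migrated = true;
    }

    /** Returns null when the write is accepted, or a redirect URL when rejected. */
    String onReplicatedPublish(byte[] payload) {
        if (migrated) {
            return greenClusterUrl; // reject the write; the remote producer reconnects to green
        }
        // ... append payload to the topic's ledger (elided) ...
        return null;
    }
}
```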
B. Outgoing replication messages from the blue cluster's replicator producers to other regions
The broker can give an ordering guarantee in this case too, with the trade-off that the topic stays unavailable until the blue cluster has replicated all of its existing published messages before the topic is terminated.
When the blue cluster marks the topic terminated and migrated (sketched below):
1. The topic does not redirect producers/consumers until all replicators reach the end of the topic and have replicated every message to the remote regions. Until then, the topic sends TOPIC_UNAVAILABLE to producers/consumers so they keep retrying.
2. The broker disconnects all replicators and deletes them once they reach the end of the topic.
3. The broker then starts sending the migrated command to producers/consumers, redirecting the clients to the green cluster.
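A condensed sketch of these three steps (the types and method names are assumptions, not Pulsar's real broker API):

```java
import java.util.Collection;

// Assumed-name sketch of the terminate-and-migrate flow described above.
class MigratedTopicFlowSketch {
    interface Replicator {
        boolean hasReachedEndOfTopic(); // replicated everything up to termination
        void disconnectAndDelete();
    }

    String onClientConnect(Collection<Replicator> replicators) {
        boolean drained = replicators.stream().allMatch(Replicator::hasReachedEndOfTopic);
        if (!drained) {
            return "TOPIC_UNAVAILABLE";           // step 1: the client keeps retrying
        }
        replicators.forEach(Replicator::disconnectAndDelete); // step 2
        return "TOPIC_MIGRATED -> green cluster"; // step 3: redirect the client
    }
}
```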
Modifications
Example use case:
1. producer1 sends messages msg1 and msg2 to region1.
2. region1's replicator replicates msg1 to region2, but region2 then has a connectivity issue with region1, so region1 holds a replication backlog (msg2) for region2.
3. The blue-green migration is marked: region1 -> region1A.
4. If producer1 is redirected to region1A right away, it sends msg3 to region1A; region1A is connected to region2, so region1A replicates msg3 to region2.
5. Meanwhile, if region1 gets its connection back to region2, region1 replicates msg2 (the replication backlog) to region2.
6. The region2 consumer then consumes msg1, msg3, msg2, which is the wrong order; it should be msg1, msg2, msg3.

So we must not redirect producer1 until the replicator has no backlog. This PR handles that use case by making sure the replication backlog is drained before redirecting producers to the green cluster (see the toy simulation below).
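To make the ordering concrete, here is a toy simulation of the two outcomes (plain Java, no Pulsar APIs involved):

```java
import java.util.List;

// Toy simulation of region2's arrival order with and without draining
// region1's replication backlog before redirecting producer1.
public class OrderingDemo {
    public static void main(String[] args) {
        // Without the drain gate: producer1 is redirected while msg2 is still stuck.
        List<String> withoutGate = List.of(
                "msg1",  // replicated before the region1->region2 link broke
                "msg3",  // published to region1A (green) and replicated from there
                "msg2"); // old backlog flushes once region1's link recovers
        System.out.println("without gate: " + withoutGate); // [msg1, msg3, msg2], wrong

        // With the gate: the redirect happens only after region1's backlog drains.
        List<String> withGate = List.of("msg1", "msg2", "msg3");
        System.out.println("with gate:    " + withGate);    // correct order
    }
}
```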
Verifying this change
Added an end-to-end test to verify this change.
Does this pull request potentially affect one of the following parts:
If the box was checked, please highlight the changes
Documentation
doc
doc-required
doc-not-needed
doc-complete