Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[feat] [broker] PIP-188 support blue-green cluster migration [part-2] #19605

Merged
merged 1 commit into from
Apr 10, 2023

Conversation

vineeth1995
Copy link
Contributor

@vineeth1995 vineeth1995 commented Feb 22, 2023

Motivation

This completes #16551 and extension to this part-1 pr #17962

This handles Replicator and message ordering guarantee part for blue-green deployment.

Replicator and message ordering handling

A. Incoming replication messages from other region's replicator producers to Blue cluster
This will not impact ordering messages coming from the other regions to blue/green cluster. After marking blue cluster, blue cluster will reject replication writes from remote regions and redirects remote producers to the Green cluster where new messages will be written. Consumers of Blue clusters will only be redirected to green once they received all messages from blue. So, migration gives an ordering guarantee for messages replicating from remote regions.

B. Outgoing replication messages from Blue cluster's replicator producers to other regions
The broker can give an ordering guarantee in this case with the trade-off of topic unavailability until the blue cluster replicates all existing published messages in the blue cluster before the topic gets terminated.

Blue cluster marks topic terminated and migrated
Topic will not redirect producers/consumers until all the replicators reaches end of topic and replicates all messages to remote regions. Topic will send TOPIC_UNAVAILABLE message to producers/consumers so, they can keep retrying until replicators reach to end of topics.
Broker disconnects all the replicators and delete them once they reach end of topic.
Broker start sending migrated-command to producer/consumers to redirect clients to green cluster.

Modifications

  • Handle producers so that message ordering is guaranteed when topic has been migrated but replication backlog still exists.

Example use case:

producer1 sends messages msg1, msg2 -> region1

region1 replicator -> msg1 ->region2
but region2 has a connectivity issue with region1
as a result region1 has a replication backlog msg2 with region2

Marked blue-green
region1 -> region1A

If you redirect producer1 to region1A
producer1 sends msg3 to region1A
region1A is connected to region2
region1A sends msg3 to region2

Meanwhile if region1 gets it's connection back to region2
region1 sends msg2(replication backlog) to region2

region2 consumer consumes in the order msg1, msg3, msg2

which is a wrong order of messages as it should be msg1, msg2, msg3


So we don't want to redirect producer1 until Replicator has no backlog. This pr handles this use case by making sure replication backlog is drained before redirecting the producers to green cluster.

Verifying this change

Added end t end test to verify this change.

Does this pull request potentially affect one of the following parts:

If the box was checked, please highlight the changes

  • Dependencies (add or upgrade a dependency)
  • The public API
  • The schema
  • The default values of configurations
  • The threading model
  • The binary protocol
  • The REST endpoints
  • The admin CLI options
  • Anything that affects deployment

Documentation

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

@github-actions github-actions bot added doc-complete Your PR changes impact docs and the related docs have been already added. doc-not-needed Your PR changes do not impact docs and removed doc-complete Your PR changes impact docs and the related docs have been already added. labels Feb 22, 2023
@vineeth1995 vineeth1995 reopened this Feb 22, 2023
@rdhabalia rdhabalia added this to the 2.11.0 milestone Feb 22, 2023
@vineeth1995
Copy link
Contributor Author

/pulsarbot run-failure-checks

Copy link
Member

@michaeljmarshall michaeljmarshall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have context on the whole PR yet, but here are some initial comments.

log.info("[{}] redirect migrated producer to topic {}: "
+ "producerId={}, producerName = {}, {}", remoteAddress,
topicName, producerId, producerName, ex.getCause().getMessage());
commandSender.sendTopicMigrated(ResourceType.Producer, producerId,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see that we're sending this message without checking the client's protocol version. I know that your PR didn't introduce this behavior, but I think we should make sure to handle that before this feature is released. Are you able to fix this @vineeth1995, or should we open an issue for the work?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm so sorry @vineeth1995, I didn't realize that commandSender.sendTopicMigrated was doing that version check. I think it makes sense to log the error as you just did, but do we need to consider sending an unretriable error to the client to make sure they do not constantly attempt to reconnect?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's not exactly non-retriable error but client has to retry until topic backlog will be drained , after that producers will be redirected to a new cluster. it's exactly similar to backlog-quota exceeded error where it's non-retriable until backlog is drained.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@michaeljmarshall will handle this in a separate pr as this requires client side change as well.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can also add broker-side feature to restrict specific client versions to activate this functionality as such a feature requires all clients to upgrade the version. but that can be addressed into a separate PR.

@@ -1540,6 +1540,7 @@ private void buildProducerAndAddTopic(Topic topic, long producerId, String produ
getPrincipal(), isEncrypted, metadata, schemaVersion, epoch,
userProvidedProducerName, producerAccessMode, topicEpoch, supportsPartialProducer);

log.info("trying to add producer - {} to topic -{}", producerName, topicName);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We already log about this, so I think we can remove this before merging.

if (log.isDebugEnabled()) {
log.debug("[{}][{}] Creating producer. producerId={}, schema is {}", remoteAddress, topicName,
producerId, schema == null ? "absent" : "present");
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

addressed

@vineeth1995
Copy link
Contributor Author

/pulsarbot run-failure-checks

1 similar comment
@vineeth1995
Copy link
Contributor Author

/pulsarbot run-failure-checks

Copy link
Contributor

@rdhabalia rdhabalia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM..

@vineeth1995 vineeth1995 force-pushed the blue-green branch 2 times, most recently from cc0e0fb to 5b8e137 Compare March 28, 2023 17:21
@vineeth1995 vineeth1995 force-pushed the blue-green branch 2 times, most recently from 15bc074 to 4761dee Compare April 9, 2023 22:03
@vineeth1995
Copy link
Contributor Author

/pulsarbot run-failure-checks

Copy link
Contributor

@eolivelli eolivelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lgtm

@vineeth1995
Copy link
Contributor Author

/pulsarbot run-failure-checks

@rdhabalia rdhabalia merged commit 34b6e89 into apache:master Apr 10, 2023
return null;

if (topic.isReplicationBacklogExist()) {
log.info("Topic {} is migrated but replication backlog exist: "
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible for isReplicationBacklogExist to become true later?
e.g. on one pass all replicators don't have backlog (isReplicationBacklogExist == false), next time one of the replicators gets more messages/backlog?
What guards against this case / is it a problem?

Demogorgon314 pushed a commit to Demogorgon314/pulsar that referenced this pull request Apr 11, 2023
…apache#19605)

Co-authored-by: Vineeth <vineeth.polamreddy@verizonmedia.com>
@lhotari
Copy link
Member

lhotari commented May 23, 2023

There seems to be a new flaky test introduced by this PR. The flaky test issue is #20375. @vineeth1995 do you have a chance to take a look? thanks

@vineeth1995
Copy link
Contributor Author

vineeth1995 commented May 23, 2023 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
doc-not-needed Your PR changes do not impact docs ready-to-test
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants