PIP-121: Pulsar cluster level auto failover on client side #13315
Comments
If cluster B has different authentication data from cluster A, such as another token string, how do we change the authentication settings in the client when auto failover happens?
Cluster A and Cluster B should be configured with the same authentication; otherwise, the client should update its authentication settings when it switches clusters.
Is it possible to automatically update the authentication settings when switching clusters by implementing
@wangjialing218 Thanks for your suggestion. I have updated the design to support different authentication configurations for different clusters.
The proposal makes sense to me. So I would like to see it stated in the title of the PIP
The PIP mentions primary and secondary. Can we make it more general for an arbitrary number of clusters?
@eolivelli +1 for this, was thinking in the same direction:
@eolivelli The secondary just represents the backup clusters, and I use a
@hpvd Thanks for your suggestion. Choosing a cluster from the secondary cluster list according to probe latency is a good idea. I will add a parameter to configure the selection policy to support it.
@eolivelli I updated the PIP title.
@hangc0276 awesome! many thanks!
Related to #13315 ### Modification 1. add Pulsar cluster level auto failover
The issue had no activity for 30 days, mark with Stale label.
The implementation is #13316, close this issue.
@hangc0276 Hello, according to the documentation, controlled failover requires a service provider to manage the metadata. Will Pulsar provide this provider service, or do users need to write it themselves? Thanks!
Motivation
We have geo-replication to support Pulsar cluster level failover. We can set up Pulsar cluster A as the primary cluster in data center A, and set up Pulsar cluster B as the backup cluster in data center B. Then we configure geo-replication between cluster A and cluster B. All clients connect to the Pulsar cluster through DNS. If cluster A goes down, we should switch the DNS record so that it points to cluster B instead of cluster A. Once the clients resolve to cluster B, they can produce and consume messages normally. After cluster A recovers, the administrator should switch the DNS back to cluster A.
However, the current method has two shortcomings.
Goal
It's better to provide an automatic cluster level failure recovery mechanism to make Pulsar cluster failover more effective. Pulsar clients should automatically switch from cluster A to cluster B when they detect, according to the configured detection policy, that cluster A is down, and switch back to cluster A once it has recovered. The reason for switching back to cluster A is that most applications are likely deployed in data center A, where communicating with Pulsar cluster A has a low network cost. If they keep using Pulsar cluster B, they pay a higher network cost, which causes higher produce/consume latency.
To mitigate the DNS cache problem, we should provide an administrator-controlled switch provider that lets administrators update service URLs.
In summary, we should provide both an automatic service URL switch provider and an administrator-controlled switch provider.
Design
We have already provided the `ServiceUrlProvider` interface to support different service URLs. To support automatic cluster level failure recovery, we can provide different `ServiceUrlProvider` implementations. For current requirements, we can provide `AutoClusterFailover` and `ControlledClusterFailover`.
AutoClusterFailover
To support automatic switching from the primary cluster to the secondary, we can provide a probe task that probes the availability of the primary cluster and the secondary one. When it finds that the primary cluster has been failing for more than `failoverDelayMs`, it switches to the secondary cluster by calling `updateServiceUrl`. After switching to the secondary cluster, `AutoClusterFailover` continues to probe the primary cluster. If the primary cluster comes back and remains active for `switchBackDelayMs`, it switches back to the primary cluster.
The APIs are listed as follows.
To support multiple secondary clusters, the secondary cluster URLs are stored in a List. When the primary cluster probe has failed for failoverDelayMs, the provider probes the secondary cluster list one by one and switches to the first active cluster it finds. Note: If you configure multiple clusters, you should turn on cluster level geo-replication so that topic data stays in sync across all primary and secondary clusters. Otherwise, topic data may end up spread across different clusters, and consumers won't get the complete data of the topic.
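The timing rules described above (fail over after the primary has been failing for more than failoverDelayMs, scan the secondaries in order, switch back after the primary has stayed healthy for switchBackDelayMs) can be illustrated with a small state machine. This is only a sketch under assumptions: the class `FailoverSketch`, the method `probeOnce`, and the use of a `Predicate` as the probe are hypothetical names chosen for illustration, not the PIP's API. Time is passed in explicitly so the logic can be exercised without a real timer.

```java
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical sketch of the failover timing logic; not the actual Pulsar class. */
final class FailoverSketch {
    private final String primary;
    private final List<String> secondaries;
    private final long failoverDelayMs;
    private final long switchBackDelayMs;

    private String current;
    private long primaryFailedSince = -1;    // -1: primary is not in a failing streak
    private long primaryRecoveredSince = -1; // -1: primary is not in a healthy streak

    FailoverSketch(String primary, List<String> secondaries,
                   long failoverDelayMs, long switchBackDelayMs) {
        this.primary = primary;
        this.secondaries = secondaries;
        this.failoverDelayMs = failoverDelayMs;
        this.switchBackDelayMs = switchBackDelayMs;
        this.current = primary;
    }

    /** Runs one probe round at time nowMs and returns the service URL to use afterwards. */
    String probeOnce(long nowMs, Predicate<String> isAvailable) {
        boolean primaryUp = isAvailable.test(primary);
        if (current.equals(primary)) {
            if (primaryUp) {
                primaryFailedSince = -1; // a successful probe resets the failing streak
            } else {
                if (primaryFailedSince < 0) {
                    primaryFailedSince = nowMs;
                }
                if (nowMs - primaryFailedSince > failoverDelayMs) {
                    // Probe the secondaries one by one and take the first active one.
                    for (String candidate : secondaries) {
                        if (isAvailable.test(candidate)) {
                            current = candidate;
                            primaryFailedSince = -1;
                            primaryRecoveredSince = -1;
                            break;
                        }
                    }
                }
            }
        } else if (!primaryUp) {
            primaryRecoveredSince = -1; // primary is still down, reset the healthy streak
        } else {
            if (primaryRecoveredSince < 0) {
                primaryRecoveredSince = nowMs;
            }
            if (nowMs - primaryRecoveredSince >= switchBackDelayMs) {
                current = primary;
                primaryRecoveredSince = -1;
            }
        }
        return current;
    }
}
```

In a real provider, the returned URL would be handed to the client whenever it changes, and `probeOnce` would be driven by a periodic task rather than called by hand.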
To support different authentication configurations between clusters, the authentication-related configuration is updated together with the target cluster when a switch happens.
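One way to picture the per-cluster authentication handling is a lookup keyed by service URL that is consulted at switch time. The sketch below is an assumption for illustration only: `ClusterAuth`, `AuthSelector`, and the fallback-to-default behavior are hypothetical and are not taken from the PIP.

```java
import java.util.Map;

/** Hypothetical holder for the authentication settings of one cluster. */
final class ClusterAuth {
    final String pluginClassName;
    final String params;

    ClusterAuth(String pluginClassName, String params) {
        this.pluginClassName = pluginClassName;
        this.params = params;
    }
}

/** Hypothetical helper that picks the authentication settings for the switch target. */
final class AuthSelector {
    /**
     * Returns the settings registered for targetUrl. If none are registered,
     * the fallback is returned (assumed behavior: clusters without their own
     * entry share the same authentication).
     */
    static ClusterAuth forCluster(String targetUrl,
                                  Map<String, ClusterAuth> perCluster,
                                  ClusterAuth fallback) {
        return perCluster.getOrDefault(targetUrl, fallback);
    }
}
```

The provider would apply the selected settings to the client in the same step in which it updates the service URL, so the client never connects to the new cluster with the old credentials.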
To create an `AutoClusterFailover` instance, we use the `AutoClusterFailoverBuilder` interface to build the target instance. The `AutoClusterFailoverBuilder` interface is located in the `pulsar-client-api` package.
In the `probeAvailable` method, we probe the Pulsar service port and check whether the port is open. This probe method has several disadvantages, such as:
- We're connecting to a Pulsar proxy, but there are no available brokers.
- Istio is used on the server side and always accepts the connection, even if the broker is in a bad state.
- We might have deadlocks in (all) brokers, so the connections get accepted but the brokers are not able to serve them.
To solve this problem, we'd better provide a health check command on the broker side, just like ZooKeeper's `ruok` command.
We can use the port probe method first and, in the next step, provide the health check command on the broker side.
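The port probe described above can be sketched with a plain TCP connect and a timeout. This is a minimal illustration, not the actual `probeAvailable` implementation: the class name `PortProbe`, the `timeoutMs` parameter, and the choice to treat a URL without an explicit host and port as unavailable are assumptions.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.net.URI;

/** Hypothetical sketch of a port-level availability probe. */
final class PortProbe {
    /** Returns true if a TCP connection to the URL's host and port succeeds within timeoutMs. */
    static boolean probeAvailable(String serviceUrl, int timeoutMs) {
        try {
            URI uri = URI.create(serviceUrl);
            if (uri.getHost() == null || uri.getPort() < 0) {
                return false; // assumption: URLs without explicit host and port are not probed
            }
            try (Socket socket = new Socket()) {
                socket.connect(new InetSocketAddress(uri.getHost(), uri.getPort()), timeoutMs);
                return true; // the port accepted the connection
            }
        } catch (IOException | IllegalArgumentException e) {
            return false; // refused, timed out, unresolved host, or malformed URL
        }
    }
}
```

As the list of disadvantages notes, a `true` result only shows that something accepted the TCP connection; it says nothing about whether a broker behind it can serve requests.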
ControlledClusterFailover
If users want to control the cluster switch operation themselves, they can expose the current service URL through an HTTP service. `ControlledClusterFailover` periodically fetches the newest service URL from that HTTP service.
The APIs are listed as follows.
The configuration returned by the third-party URL provider is defined as a Java bean in JSON format. The configuration includes authentication-related parameters so that clusters with different authentication configurations are supported. These authentication-related parameters can support all current authentication plugin types.
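The periodic fetch can be sketched as a small poller that switches only when the fetched URL differs from the current one. Everything named here is an assumption for illustration: `ControlledSwitchSketch`, the `Supplier` standing in for the HTTP call and JSON parsing, and the `Consumer` standing in for the service URL update are hypothetical, and authentication parameters are left out for brevity.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;
import java.util.function.Supplier;

/** Hypothetical sketch of an administrator-controlled switch provider. */
final class ControlledSwitchSketch implements AutoCloseable {
    private final Supplier<String> fetchServiceUrl; // stands in for the HTTP request
    private final Consumer<String> onSwitch;        // stands in for updating the client
    private final ScheduledExecutorService timer =
            Executors.newSingleThreadScheduledExecutor();
    private volatile String current;

    ControlledSwitchSketch(String initialUrl, Supplier<String> fetchServiceUrl,
                           Consumer<String> onSwitch) {
        this.current = initialUrl;
        this.fetchServiceUrl = fetchServiceUrl;
        this.onSwitch = onSwitch;
    }

    /** Starts polling the control service every intervalMs milliseconds. */
    void start(long intervalMs) {
        timer.scheduleAtFixedRate(this::pollOnce, intervalMs, intervalMs, TimeUnit.MILLISECONDS);
    }

    /** Fetches the latest URL once and switches only if it changed. */
    synchronized void pollOnce() {
        try {
            String latest = fetchServiceUrl.get();
            if (latest != null && !latest.isEmpty() && !latest.equals(current)) {
                current = latest;
                onSwitch.accept(latest);
            }
        } catch (RuntimeException e) {
            // Keep the current URL if the control service cannot be reached.
        }
    }

    String getServiceUrl() {
        return current;
    }

    /** Stops the timer thread so the provider does not leak resources. */
    @Override
    public void close() {
        timer.shutdownNow();
    }
}
```

Keeping `pollOnce` separate from the scheduling makes the switch logic testable without a running HTTP service, and `close` releases the timer thread when the provider is no longer needed.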
To create a `ControlledClusterFailover` instance, we use the `ControlledClusterFailoverBuilder` interface to build the target instance. The `ControlledClusterFailoverBuilder` interface is located in the `pulsar-client-api` package.
API Changes
For the current `ServiceUrlProvider` interface, we should add a `close` method to release allocated resources, such as a timer thread.
Tests
Add tests for the two service provider implementations.
For `AutoClusterFailover`, when the primary cluster shuts down, the client should switch to the secondary cluster. When the primary cluster comes back, the client should switch back.
For `ControlledClusterFailover`, when the service URL is changed on the HTTP service side, the client should switch to the newest service URL.