This repository has been archived by the owner on Apr 1, 2024. It is now read-only.
Motivation
Pulsar supports cluster-level failover through geo-replication. We can set up Pulsar cluster A as the primary cluster in data center A and Pulsar cluster B as the backup cluster in data center B, then configure geo-replication between them. All clients connect to the Pulsar cluster through DNS. If cluster A goes down, the administrator switches the DNS entry from cluster A to cluster B. Once the clients resolve to cluster B, they can produce and consume messages normally. After cluster A recovers, the administrator switches the DNS back to cluster A.
However, the current method has two shortcomings.
The administrator has to monitor the status of all Pulsar clusters and switch the DNS as soon as possible when cluster A goes down. Neither the switch nor the recovery is automatic, and the recovery time depends on how quickly the administrator reacts, which puts the administrator under heavy load.
Both the Pulsar client and the DNS system cache resolved addresses. When the administrator switches the DNS from cluster A to cluster B, the cached entries take some time to expire. This delays client recovery, and producers and consumers fail to send or receive messages in the meantime.
Goal
We should provide an automatic cluster-level failure recovery mechanism to make Pulsar cluster failover more effective. Pulsar clients should switch automatically from cluster A to cluster B when they detect, according to the configured detection policy, that cluster A is down, and switch back to cluster A once it has recovered. Switching back matters because most applications are likely deployed in data center A, where communicating with cluster A has a low network cost. If they keep using cluster B, they pay a higher network cost and see higher produce/consume latency.
To mitigate the DNS cache problem, we should also provide an administrator-controlled switch provider that lets administrators update service URLs.
In summary, we should provide two providers: an automatic service URL switch provider and an administrator-controlled switch provider.
Design
Pulsar already provides the ServiceUrlProvider interface to support different service URLs. To support automatic cluster-level failure recovery, we can provide different ServiceUrlProvider implementations. For the current requirements, we can provide AutoClusterFailover and ControlledClusterFailover.
AutoClusterFailover
To support automatic switching from the primary cluster to the secondary, we can provide a probe task that probes the availability of both the primary cluster and the secondary one. When it finds that the primary cluster has been down for more than failoverDelayMs, it switches to the secondary cluster by calling updateServiceUrl. After switching to the secondary cluster, AutoClusterFailover continues to probe the primary cluster. If the primary cluster comes back and remains active for switchBackDelayMs, it switches back to the primary cluster.
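The switching rule described above can be sketched as a small state machine. This is only an illustration under stated assumptions, not the proposed implementation: the class name `FailoverDecision`, the method `onProbe`, passing the current time in as a parameter, and requiring the secondary to be up before failing over are all hypothetical choices made so the logic runs without a Pulsar client. Only `failoverDelayMs` and `switchBackDelayMs` come from the proposal, and the sketch treats both as "at least this long".

```java
/** Hypothetical helper: tracks probe results and decides which cluster to use. */
final class FailoverDecision {
    private final long failoverDelayMs;
    private final long switchBackDelayMs;
    private boolean onPrimary = true;
    // Time the primary was first seen down while we are on it, or -1 if it is up.
    private long failedSince = -1;
    // Time the primary was first seen up again while we are on the secondary, or -1.
    private long recoveredSince = -1;

    FailoverDecision(long failoverDelayMs, long switchBackDelayMs) {
        this.failoverDelayMs = failoverDelayMs;
        this.switchBackDelayMs = switchBackDelayMs;
    }

    /**
     * Feeds one probe result into the state machine.
     *
     * @return true if the client should currently use the primary cluster
     */
    boolean onProbe(boolean primaryUp, boolean secondaryUp, long nowMs) {
        if (onPrimary) {
            if (primaryUp) {
                failedSince = -1; // any successful probe resets the failure window
            } else {
                if (failedSince < 0) {
                    failedSince = nowMs;
                }
                // Assumption: only fail over if the secondary is reachable.
                if (nowMs - failedSince >= failoverDelayMs && secondaryUp) {
                    onPrimary = false;
                    failedSince = -1;
                    recoveredSince = -1;
                }
            }
        } else {
            if (!primaryUp) {
                recoveredSince = -1; // primary must stay up for the whole window
            } else {
                if (recoveredSince < 0) {
                    recoveredSince = nowMs;
                }
                if (nowMs - recoveredSince >= switchBackDelayMs) {
                    onPrimary = true;
                    recoveredSince = -1;
                }
            }
        }
        return onPrimary;
    }
}
```

In a real provider, a change in the returned value would be the point at which `updateServiceUrl` is called with the primary or secondary URL.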
The APIs are listed as follows.
The probeAvailable method probes the Pulsar service port and checks whether the port is open.
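A port probe of this kind could look roughly like the following. It is a minimal sketch, assuming a service URL with a single host and an explicit port (for example `pulsar://broker.example:6650`); the class name `PortProbe` and the `timeoutMs` parameter are hypothetical, and the proposal does not say how the actual probeAvailable is implemented.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.net.URI;

/** Hypothetical helper showing one way to check whether a service port is open. */
final class PortProbe {
    /**
     * Returns true if a TCP connection to the URL's host and port succeeds
     * within timeoutMs. Assumes a single host with an explicit port.
     */
    static boolean probeAvailable(String serviceUrl, int timeoutMs) {
        final String host;
        final int port;
        try {
            URI uri = URI.create(serviceUrl);
            host = uri.getHost();
            port = uri.getPort();
        } catch (IllegalArgumentException e) {
            return false; // malformed URL counts as unavailable
        }
        if (host == null || port < 0) {
            return false; // no host or no explicit port in the URL
        }
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            return true;
        } catch (IOException e) {
            return false; // refused, unreachable, or timed out
        }
    }
}
```

An open port only shows that something is listening; it does not prove the broker can serve requests, which is a trade-off of probing at the TCP level.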
ControlledClusterFailover
If users want to control the cluster switch themselves, they can publish the current service URL through an HTTP service. ControlledClusterFailover periodically fetches the newest service URL from that HTTP service.
The APIs are listed as follows.
    public class ControlledClusterFailover implements ServiceUrlProvider {
        private ControlledClusterFailover(String defaultServiceUrl, String urlProvider) throws IOException {
        }

        @Override
        public void initialize(PulsarClient client) {
            this.pulsarClient = client;
            // start to check service url every 30 seconds
            this.timer.scheduleAtFixedRate(new TimerTask() {
                @Override
                public void run() {
                    // do check and switch operation.
                }
            }, 30_000, 30_000);
        }

        private String fetchServiceUrl() throws IOException {
            // call the service to get service URL
        }

        @Override
        public String getServiceUrl() {
            return this.currentPulsarServiceUrl;
        }

        @Override
        public void close() {
            this.timer.cancel();
        }
    }
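The fetch-and-switch step left as a comment in the listing could be filled in along these lines. This is a sketch under assumptions, not the proposed implementation: the class name `ServiceUrlFetcher`, the `refresh` method, the 5 second timeouts, and the response format (a plain-text body containing only the service URL) are all hypothetical, since the proposal does not specify what the HTTP service returns. It uses the JDK's `java.net.http.HttpClient` (Java 11 or later) so it runs without a Pulsar client.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

/** Hypothetical helper: polls an HTTP endpoint for the current service URL. */
final class ServiceUrlFetcher {
    private final HttpClient http = HttpClient.newBuilder()
            .connectTimeout(Duration.ofSeconds(5))
            .build();
    private final URI endpoint;
    private volatile String currentServiceUrl;

    ServiceUrlFetcher(String defaultServiceUrl, String urlProvider) {
        this.currentServiceUrl = defaultServiceUrl;
        this.endpoint = URI.create(urlProvider);
    }

    /** Assumption: the service answers 200 with the URL as a plain-text body. */
    String fetchServiceUrl() throws IOException, InterruptedException {
        HttpRequest request = HttpRequest.newBuilder(endpoint)
                .timeout(Duration.ofSeconds(5))
                .GET()
                .build();
        HttpResponse<String> response = http.send(request, HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IOException("unexpected status " + response.statusCode());
        }
        return response.body().trim();
    }

    /**
     * Fetches the URL once and records it if it changed.
     *
     * @return true if the service URL changed; a real provider would then call
     *         pulsarClient.updateServiceUrl(getServiceUrl())
     */
    boolean refresh() {
        try {
            String fetched = fetchServiceUrl();
            if (!fetched.isEmpty() && !fetched.equals(currentServiceUrl)) {
                currentServiceUrl = fetched;
                return true;
            }
        } catch (IOException e) {
            // Keep the current URL if the HTTP service is unreachable.
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return false;
    }

    String getServiceUrl() {
        return currentServiceUrl;
    }
}
```

Keeping the last known URL when the HTTP service is unreachable is a design choice of this sketch; it avoids tying client availability to the availability of the URL service.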
API Changes
We should add a close method to the current ServiceUrlProvider interface so that a provider can release the resources it has allocated, such as a timer thread.
    public interface ServiceUrlProvider {
        /**
         * Close the resource that the provider allocated.
         */
        default void close() {
            // do nothing
        }
    }
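To illustrate why a default method suits this change, here is a small self-contained sketch. The interface `UrlProviderSketch` is a hypothetical stand-in for ServiceUrlProvider, reduced to what the example needs, and both implementing classes are invented for illustration. The point is that a provider written before close existed still compiles and inherits the no-op, while a provider that owns a timer overrides it.

```java
import java.util.Timer;
import java.util.TimerTask;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for ServiceUrlProvider. */
interface UrlProviderSketch {
    String getServiceUrl();

    /** Default no-op, so existing implementations keep compiling. */
    default void close() {
        // do nothing
    }
}

/** A provider with nothing to release; it inherits the no-op close(). */
final class FixedUrlProvider implements UrlProviderSketch {
    private final String url;

    FixedUrlProvider(String url) {
        this.url = url;
    }

    @Override
    public String getServiceUrl() {
        return url;
    }
}

/** A provider that owns a timer thread and cancels it in close(). */
final class PollingUrlProvider implements UrlProviderSketch {
    private final Timer timer = new Timer("service-url-poll", true);
    private final AtomicInteger polls = new AtomicInteger();
    private final String url;

    PollingUrlProvider(String url, long periodMs) {
        this.url = url;
        timer.scheduleAtFixedRate(new TimerTask() {
            @Override
            public void run() {
                polls.incrementAndGet(); // placeholder for a real check
            }
        }, periodMs, periodMs);
    }

    @Override
    public String getServiceUrl() {
        return url;
    }

    int pollCount() {
        return polls.get();
    }

    @Override
    public void close() {
        timer.cancel(); // no further polls are scheduled after this
    }
}
```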
Tests
Add tests for the two service provider implementations.
For AutoClusterFailover, when the primary cluster shuts down, the client should switch to the secondary cluster. When the primary cluster comes back, the client should switch back.
For ControlledClusterFailover, when the service URL is changed on the HTTP service side, the client should switch to the newest service URL.
Original Issue: apache#13315