Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[improve][broker]PIP-340 Optimization of Probe Implementation for Automatic Failover #22133

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

yyj8
Copy link
Contributor

@yyj8 yyj8 commented Feb 27, 2024

For detailed improvement instructions, please refer to issues:
#22134

Motivation

The current Java client implementation has certain flaws in automatic fault switching.

org.apache.pulsar.client.impl.AutoClusterFailover.java
boolean probeAvailable(String url) {
        try {
            resolver.updateServiceUrl(url);
            InetSocketAddress endpoint = resolver.resolveHost();
            Socket socket = new Socket();
            socket.connect(new InetSocketAddress(endpoint.getHostName(), endpoint.getPort()), TIMEOUT);
            socket.close();

            return true
        } catch (Exception e) {
            log.warn("Failed to probe available, url: {}", url, e);
            return false;
        }
    }

The client only establishes a TCP connection with the exposed connection address of the cluster to determine whether the cluster is available, which cannot adapt to scenarios where the cluster is partially unavailable (half dead). In this scenario, we hope to make corresponding fault switching judgments by initiating cluster health status requests to the cluster. Then within the cluster, we provide an admin management command to update the cluster's health status. To avoid this scenario, all businesses that need to connect to this cluster need to manually switch cluster connection addresses and restart applications, resulting in inconsistent link data among multiple business team due to inconsistent operation steps.

Modifications

  1. Add a new cluster health status request and response request;
case HEALTH_CHECK:
	checkArgument(cmd.hasHealthCheck());
	handleHealthCheck(cmd.getHealthCheck());
	break;

case HEALTH_CHECK_RESPONSE:
	checkArgument(cmd.hasHealthCheckResponse());
	handleHealthCheckResponse(cmd.getHealthCheckResponse());
	break;        

//PulsarApi.proto define two new request types.
message CommandHealthCheck {
    repeated KeyValue metadata = 1;
}

message CommandHealthCheckResponse {
    optional bool available = 1 [default = true];
}    
  1. Add a new admin management command to manually update the cluster health status;
//Update cluster health status, available or unavailable. default available
bin/pulsar-admin clusters update-health-status --status unavailable

For other detailed information, please refer to the PR code.

Verifying this change

  • Make sure that the change passes the CI checks.

(Please pick either of the following options)

This change is a trivial rework / code cleanup without any test coverage.

(or)

This change is already covered by existing tests, such as (please describe tests).

(or)

This change added tests and can be verified as follows:

(example:)

  • Added integration tests for end-to-end deployment with large payloads (10MB)
  • Extended integration test for recovery after broker failure

Does this pull request potentially affect one of the following parts:

If the box was checked, please highlight the changes

  • Dependencies (add or upgrade a dependency)
  • The public API
  • The schema
  • The default values of configurations
  • The threading model
  • The binary protocol
  • The REST endpoints
  • The admin CLI options
  • The metrics
  • Anything that affects deployment

Documentation

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

Matching PR in forked repository

PR in forked repository:
yyj8#8

Mail list: https://lists.apache.org/thread/kk8lbc92mtgt0hw3tl7dfw7fmpl4jwyq

@lhotari
Copy link
Member

lhotari commented Feb 27, 2024

@yyj8 thanks for the contribution. The intention of the PIP process is that you'd first start the discussion on the mailing list before deciding the solution. There might be alternative ways to solve the problem. For example, In Pulsar we already have the Ping/Pong messages in the protocol.

@lhotari
Copy link
Member

lhotari commented Feb 27, 2024

Your proposal seems to make sense a lot of sense, but the naming is perhaps not optimal. When looking at the changes, this looks like an active/passive status for a cluster and having that could make it more flexible. There might be other reasons than "health" to mark a cluster passive. Let's say when there's a need to do maintenance where it is desirable that clients move to the other cluster.

There has been some changes in Pulsar PIP process. We have a PIP template https://github.com/apache/pulsar/blob/master/pip/TEMPLATE.md for capturing the proposal in a markdown file. However, that could also happen later in the process, after the initial mailing list discussion.

@lhotari lhotari requested a review from heesung-sn February 27, 2024 12:25
@lhotari
Copy link
Member

lhotari commented Feb 27, 2024

@yyj8 One of the challenges is making this proposal consistent with the Blue-Green deployment feature. PIP-188 #16551 (please note that this PIP was filed before we switched to use markdown files for capturing PIPs).
Would there be a way to make the requirements part of the Blue-Green deployment feature?

@yyj8
Copy link
Contributor Author

yyj8 commented Feb 29, 2024

@yyj8 thanks for the contribution. The intention of the PIP process is that you'd first start the discussion on the mailing list before deciding the solution. There might be alternative ways to solve the problem. For example, In Pulsar we already have the Ping/Pong messages in the protocol.

@lhotari Thank you for your help.
Ping/Pong can only detect whether the process is alive, but cannot determine whether the process can provide produce and consume services normally.

@yyj8
Copy link
Contributor Author

yyj8 commented Feb 29, 2024

@yyj8 One of the challenges is making this proposal consistent with the Blue-Green deployment feature. PIP-188 #16551 (please note that this PIP was filed before we switched to use markdown files for capturing PIPs). Would there be a way to make the requirements part of the Blue-Green deployment feature?

PIP-188 appears to be a very good solution for handling internal forwarding of traffic. It can forward the traffic written by clients to the Blue cluster in the Blue cluster to the Green cluster. However, what we hope for is that when the Blue cluster is abnormal, our client can directly complete the connection address switch to the Green cluster, instead of needing to occupy additional bandwidth and do another forwarding to the Blue cluster.

@yyj8
Copy link
Contributor Author

yyj8 commented Feb 29, 2024

Your proposal seems to make sense a lot of sense, but the naming is perhaps not optimal. When looking at the changes, this looks like an active/passive status for a cluster and having that could make it more flexible. There might be other reasons than "health" to mark a cluster passive. Let's say when there's a need to do maintenance where it is desirable that clients move to the other cluster.

There has been some changes in Pulsar PIP process. We have a PIP template https://github.com/apache/pulsar/blob/master/pip/TEMPLATE.md for capturing the proposal in a markdown file. However, that could also happen later in the process, after the initial mailing list discussion.

This is a very good suggestion. After the discussion on the mailing list is completed, I will make modifications according to the PIP template and the suggested modifications discussed by everyone.

@merlimat
Copy link
Contributor

@yyj8 Wouldn't this already be covered by the ControlledClusterFailover mechanism? https://pulsar.apache.org/api/client/3.2.x/org/apache/pulsar/client/api/ControlledClusterFailoverBuilder.html

@merlimat
Copy link
Contributor

The idea for ControlledClusterFailover is to move the control of whether a cluster is "healthy" outside of the Pulsar scope. You just need to implement a HTTP service that will tell the clients where to connect.

@yyj8
Copy link
Contributor Author

yyj8 commented Feb 29, 2024

The idea for ControlledClusterFailover is to move the control of whether a cluster is "healthy" outside of the Pulsar scope. You just need to implement a HTTP service that will tell the clients where to connect.

The client class AutoClusterFailover.java has implemented a timed detection service to check if the IP and port are in a normal state, and based on the detection results, determine whether it is necessary to switch cluster connection addresses. So, we only need to add a request type (CommandHealthCheck) in the detection logic to obtain the cluster health status.

Check and update the logic as follows:

  private void probeAndUpdateServiceUrl(List<String> targetServiceUrls,
                                          Map<String, Authentication> authentications,
                                          Map<String, String> tlsTrustCertsFilePaths,
                                          Map<String, String> tlsTrustStorePaths,
                                          Map<String, String> tlsTrustStorePasswords) {
        if (probeAvailable(currentPulsarServiceUrl)) {
            failedTimestamp = -1;
            return;
        }

        long currentTimestamp = System.nanoTime();
        if (failedTimestamp == -1) {
            failedTimestamp = currentTimestamp;
        } else if (currentTimestamp - failedTimestamp >= failoverDelayNs) {
            for (String targetServiceUrl : targetServiceUrls) {
                if (probeAvailable(targetServiceUrl)) {
                    log.info("Current Pulsar service is {}, it has been down for {} ms, "
                                    + "switch to the service {}. The current service down at {}",
                            currentPulsarServiceUrl, nanosToMillis(currentTimestamp - failedTimestamp),
                            targetServiceUrl, failedTimestamp);
                    updateServiceUrl(targetServiceUrl,
                            authentications != null ? authentications.get(targetServiceUrl) : null,
                            tlsTrustCertsFilePaths != null ? tlsTrustCertsFilePaths.get(targetServiceUrl) : null,
                            tlsTrustStorePaths != null ? tlsTrustStorePaths.get(targetServiceUrl) : null,
                            tlsTrustStorePasswords != null ? tlsTrustStorePasswords.get(targetServiceUrl) : null);
                    failedTimestamp = -1;
                    break;
                } else {
                    log.warn("Current Pulsar service is {}, it has been down for {} ms. "
                                    + "Failed to switch to service {}, "
                                    + "because it is not available, continue to probe next pulsar service.",
                        currentPulsarServiceUrl, nanosToMillis(currentTimestamp - failedTimestamp), targetServiceUrl);
                }
            }
        }
    }

Then, in the probe method, add a newly defined request logic for obtaining cluster health status.

boolean probeAvailable(String url) {
        try {
            resolver.updateServiceUrl(url);
            InetSocketAddress endpoint = resolver.resolveHost();
            Socket socket = new Socket();
            socket.connect(new InetSocketAddress(endpoint.getHostName(), endpoint.getPort()), TIMEOUT);
            socket.close();

            //Below is the newly added code
            ClientCnx clientCnx = pulsarClient.getCnxPool()
                    .getConnection(new InetSocketAddress(endpoint.getHostName(), endpoint.getPort()))
                    .get();
            clientCnx.ctx().writeAndFlush(Commands.newHealthCheck()).sync();

            return clientCnx.isClusterAvailable();
        } catch (Exception e) {
            log.warn("Failed to probe available, url: {}", url, e);
            return false;
        }
    }

@lhotari
Copy link
Member

lhotari commented Feb 29, 2024

@yyj8 Wouldn't this already be covered by the ControlledClusterFailover mechanism? https://pulsar.apache.org/api/client/3.2.x/org/apache/pulsar/client/api/ControlledClusterFailoverBuilder.html

+1

@yyj8 Please check "PIP-121: Pulsar cluster level auto failover on client side", doc: #13315 and impl: #13316

Instead of adding yet another solution, could you simply use PIP-121?
You should be able to provide your own custom logic to customize it.

@yyj8
Copy link
Contributor Author

yyj8 commented Mar 1, 2024

@yyj8 Wouldn't this already be covered by the ControlledClusterFailover mechanism? https://pulsar.apache.org/api/client/3.2.x/org/apache/pulsar/client/api/ControlledClusterFailoverBuilder.html

+1

@yyj8 Please check "PIP-121: Pulsar cluster level auto failover on client side", doc: #13315 and impl: #13316

Instead of adding yet another solution, could you simply use PIP-121? You should be able to provide your own custom logic to customize it.

@lhotari I have read "PIP-121: Pulsar cluster level auto failover on client side", only the detection of whether the IP and port can be connected has been implemented.

It is a very complex matter to determine whether a cluster can truly provide production and consumption services normally, and whether it has affected business processes. So, the goal of my PIP is to manually intervene in cluster state switching when the cluster part is unavailable and affects business processes, in order to achieve client to cluster connection switching.

Because it is difficult to accurately determine whether this partially abnormal scenario has affected the business program through program automation, it is likely to cause misjudgment and ultimately affect the business program.

Since initiating a request to obtain cluster health status is part of cluster probing, I placed the client's probing logic in the probeAvailable method of class AutoClusterFailover.java

We can consider providing an automated and accurate health detection implementation in future optimizations that can determine whether a cluster can provide services normally. Replace the current manual update of health status with automation.

@yyj8 yyj8 closed this Mar 1, 2024
@yyj8 yyj8 reopened this Mar 1, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
doc-not-needed Your PR changes do not impact docs
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants