[1.15.4] Raft leader election unexpected behavior #24995

Open
framsouza opened this issue Jan 23, 2024 · 1 comment
Labels
core/ha specific to high-availability k8s

Comments

@framsouza

framsouza commented Jan 23, 2024

We're seeing an odd case in a 3-node cluster where restarting a single follower causes the Vault leader to be replaced. I'm running Vault 1.15.4 on GKE, deployed with Helm chart version 0.26.1.

  1. The leader was vault-0. I deleted one of the followers, vault-1, and once the vault-1 pod was recreated it became the leader.
  2. I then deleted vault-2 (a follower), and vault-0 became the leader.
  3. I deleted the current leader, vault-0, and the recreated vault-1 became the leader (expected behavior).

I was expecting a real HA scenario where, in steps 1 and 2 (follower deletion), the leader stays the same and no leader election is triggered.
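To make the steps above easier to reproduce, here is a minimal sketch for recording which node holds leadership before and after each pod deletion. It assumes the table layout printed by `vault operator raft list-peers` (columns Node, Address, State, Voter). The helper name `leader_from_peers` is hypothetical, and the pod names are the ones from this report.

```shell
#!/bin/sh
# Sketch: read `vault operator raft list-peers` output on stdin and
# print the name of the node whose State column is "leader".
# Assumption: columns are Node, Address, State, Voter, in that order.
leader_from_peers() {
  awk '$3 == "leader" { print $1 }'
}

# Hypothetical usage against the cluster described in this issue:
#   before=$(vault operator raft list-peers | leader_from_peers)
#   kubectl delete pod vault-1
#   # ...wait until the pod is recreated and has rejoined...
#   after=$(vault operator raft list-peers | leader_from_peers)
#   [ "$before" = "$after" ] || echo "leader changed: $before -> $after"
```

Comparing the two values after a follower deletion shows whether an election actually moved leadership, rather than inferring it from pod logs.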

Here's my config:

        listener "tcp" {
          tls_disable = false
          address = "0.0.0.0:8200"
          cluster_address = "0.0.0.0:8201"
          http_read_timeout = "600s"
          tls_cert_file = "tls.crt"
          tls_key_file  = "tls.key"
          tls_client_ca_file = "tls.ca"
        }

        telemetry {
          prometheus_retention_time = "12h"
          disable_hostname = true
          enable_hostname_label = true
        }

        seal "gcpckms" {
          project     = "abc"
          key_ring    = "abc"
          crypto_key  = "abc"
        }

        storage "raft" {
          path = "/vault/data"
          retry_join {
            leader_api_addr = "https://vault-0:8200"
            leader_ca_cert_file = "ca.crt"
            leader_client_cert_file = "tls.crt"
            leader_client_key_file = "tls.key"
          }
          retry_join {
            leader_api_addr = "https://vault-1:8200"
            leader_ca_cert_file = "ca.crt"
            leader_client_cert_file = "tls.crt"
            leader_client_key_file = "tls.key"
          }
          retry_join {
            leader_api_addr = "https://vault-2:8200"
            leader_ca_cert_file = "ca.crt"
            leader_client_cert_file = "tls.crt"
            leader_client_key_file = "tls.key"
          }
          performance_multiplier = 1

        }
        service_registration "kubernetes" {}

The autopilot config:

vault operator raft autopilot get-config
Key                                   Value
---                                   -----
Cleanup Dead Servers                  false
Last Contact Threshold                10s
Dead Server Last Contact Threshold    24h0m0s
Server Stabilization Time             10s
Min Quorum                            0
Max Trailing Logs                     1000
Disable Upgrade Migration             false

And probes:

Liveness:   http-get https://:8200/v1/sys/health%3Fstandbyok=true delay=60s timeout=3s period=5s #success=1 #failure=2
Readiness:  http-get https://:8200/v1/sys/health%3Fstandbyok=true&sealedcode=204&uninitcode=204 delay=5s timeout=3s period=5s #success=1 #failure=2
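Both probes pass `standbyok=true`, so active and standby nodes return the same status code and the probes themselves do not reveal a leadership change. A rough sketch for telling the two apart by querying `/v1/sys/health` without query parameters, using Vault's default status codes; the helper name `health_state`, the address, and the use of `curl -k` are assumptions for illustration only.

```shell
#!/bin/sh
# Sketch: map the HTTP status of GET /v1/sys/health (no query
# parameters, default codes) to a coarse node state.
health_state() {
  case "$1" in
    200) echo "active" ;;          # initialized, unsealed, active
    429) echo "standby" ;;         # unsealed, standby
    501) echo "uninitialized" ;;
    503) echo "sealed" ;;
    *)   echo "other ($1)" ;;
  esac
}

# Hypothetical usage from inside a pod (address and -k are assumptions):
#   code=$(curl -sk -o /dev/null -w '%{http_code}' https://127.0.0.1:8200/v1/sys/health)
#   health_state "$code"
```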

This seems to be related to #14153.

@biazmoreira

@framsouza #14153 seems to be resolved after a fix was introduced. Would you be able to confirm if hashicorp/raft#494 fixes your issue as well?
