Split-brain? #2733

Closed
CSharpRU opened this issue May 16, 2017 · 13 comments

Comments

@CSharpRU
Contributor

Hi there,

I'm running Vault with the etcd v3 storage backend in HA mode. I've unexpectedly ended up in a situation where all nodes are online in standby mode, but none of them tries to become the active node (master).

How can I fix that?

@vishalnayak
Member

@CSharpRU Can you please share the Vault version, config file and the server logs?

@CSharpRU
Contributor Author

Version: 0.7.0

Config:

listener "tcp" {
  address = "ip:8200"

  cluster_address = "ip:8201"

  tls_disable = "false"
  tls_cert_file = "/etc/vault/cert.pem"
  tls_key_file = "/etc/vault/key.pem"
}

storage "etcd" {
  address = "https://localhost:2379"
  etcd_api = "v3"

  ha_enabled = "true"

  tls_ca_file = "/etc/ssl/ca.pem"
  tls_cert_file = "/etc/ssl/cert.pem"
  tls_key_file = "/etc/ssl/key.pem"

  cluster_addr = "hostname:8201"
  disable_clustering = "false"
  redirect_addr = "https://hostname:8200"
}

Logs:

2017/05/16 08:32:30.598868 [TRACE] physical/cache: creating LRU cache: size=32768
2017/05/16 08:32:30.605254 [TRACE] cluster listener addresses synthesized: cluster_addresses=
2017/05/16 08:32:41.092506 [TRACE] physical/cache: creating LRU cache: size=32768
2017/05/16 08:32:41.098427 [TRACE] cluster listener addresses synthesized: cluster_addresses=
2017/05/16 08:32:55.128303 [INFO ] core: vault is unsealed
2017/05/16 08:32:55.128412 [INFO ] core: entering standby mode
2017/05/16 08:32:55.131117 [TRACE] core: clearing forwarding clients
2017/05/16 08:32:55.131121 [TRACE] core: done clearing forwarding clients
2017/05/16 08:32:58.846141 [TRACE] core: found new active node information, refreshing
2017/05/16 08:32:58.848357 [TRACE] core: parsing information for new active node: 
2017/05/16 08:32:58.848496 [TRACE] core: refreshing forwarding connection
2017/05/16 08:32:58.848501 [TRACE] core: clearing forwarding clients
2017/05/16 08:32:58.848505 [TRACE] core: done clearing forwarding clients
2017/05/16 08:32:58.848536 [TRACE] core: done refreshing forwarding connection
2017/05/16 08:41:02.456065 [INFO ] core: acquired lock, enabling active operation

@vishalnayak
Member

@CSharpRU The node for which you have attached the logs seems to have become an active node. How many nodes are in the cluster and what is the output of vault status on each?
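For anyone checking for the same symptom, here is a minimal Python sketch of the comparison being asked for. It assumes you have already collected the text output of vault status from each node, and that the output contains `Mode:` and `Leader:` lines as quoted later in this thread. The node names and sample outputs are hypothetical.

```python
import re

def parse_status(text):
    """Extract the HA mode and leader address from `vault status` text output."""
    mode = re.search(r"^\s*Mode:\s*(\S+)", text, re.MULTILINE)
    leader = re.search(r"^\s*Leader:\s*(\S+)", text, re.MULTILINE)
    return (mode.group(1) if mode else None,
            leader.group(1) if leader else None)

def no_active_node(outputs):
    """Return True when every node reports standby, i.e. no node is active."""
    modes = [parse_status(text)[0] for text in outputs.values()]
    return bool(modes) and all(m == "standby" for m in modes)

# Hypothetical outputs, trimmed to the two fields discussed in this issue.
outputs = {
    "node1": "Mode: standby\nLeader: https://node1:8200\n",
    "node2": "Mode: standby\nLeader: https://node1:8200\n",
    "node3": "Mode: standby\nLeader: https://node1:8200\n",
}

if no_active_node(outputs):
    print("all nodes are standby; no node holds the HA lock")
```

If every node reports standby while naming the same leader, the leader entry in the storage backend is probably stale.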

@CSharpRU
Contributor Author

@vishalnayak I've already fixed it by removing the leader and lock keys from etcd. The vault status output on each node showed Mode: standby, with the same leader every time (even after restarting the whole cluster); everything else looked as usual. There are 3 nodes in the cluster.
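As a rough illustration of that cleanup, the Python sketch below picks the HA-related keys out of a key listing so they can be reviewed before anything is deleted. The `/vault/` prefix and the `core/lock` and `core/leader` key names are assumptions about how Vault lays out its data in etcd, not something confirmed in this thread, and the sample keys are hypothetical. Check the names against your own listing first.

```python
def stale_ha_keys(keys, prefix="/vault/"):
    """Select the HA lock and leader keys from a listing of etcd keys.

    The prefix and the core/lock and core/leader names are assumptions;
    verify them against your own etcd contents before deleting anything.
    """
    roots = (prefix + "core/lock", prefix + "core/leader")
    return sorted(
        k for k in keys
        if any(k == r or k.startswith(r + "/") for r in roots)
    )

# Hypothetical key listing, e.g. copied from a keys-only dump of etcd.
keys = [
    "/vault/core/lock/694d5c1e",
    "/vault/core/leader/1f2e",
    "/vault/core/keyring",
    "/vault/sys/policy/default",
]

for key in stale_ha_keys(keys):
    print("candidate for removal:", key)
```

The function only selects keys and deletes nothing, so the list can be checked by hand before any change is made to the cluster.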

@vishalnayak
Member

@CSharpRU Glad to know that it's working. If you happen to find out what caused the lock keys to end up in that state, please let us know. Closing this issue for now.

@jefferai reopened this May 16, 2017

@jefferai
Member

@xiang90 do you want to look into this?

@CSharpRU
Contributor Author

@vishalnayak I think it was caused by an etcd and Vault outage (the processes were killed because of memory usage). But I can't find anything in the logs (maybe because they were running with level=err), and I can't explain why those keys stayed in place and no new "election" was started.

@jefferai
Member

It should recover; xiang90 maintains the etcdv3 backend, hence my ping.

@CSharpRU
Contributor Author

@jefferai Thanks, waiting for @xiang90's answer :)

@xiang90
Contributor

xiang90 commented May 16, 2017

@CSharpRU Can you reproduce it? Could you provide a step-by-step guide or a script so that we can look into it further?

@CSharpRU
Contributor Author

@xiang90 Nope, it was done by our devops guy. I'll ask him tomorrow; maybe he can share some info about it.

@raoofm
Contributor

raoofm commented May 16, 2017

@CSharpRU @xiang90 @jefferai

This should be fixed by #2526

The Vault version used in this issue is 0.7.0; upgrading to 0.7.1 or later should fix it.
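A small Python sketch for checking a fleet against that threshold. It assumes plain dotted numeric version strings such as "0.7.0" and does not handle pre-release suffixes. The function names are made up for illustration.

```python
def parse_version(version):
    """Turn a plain dotted version like '0.7.1' (or 'v0.7.1') into a tuple of ints."""
    return tuple(int(part) for part in version.lstrip("v").split("."))

def has_fix(version, fixed_in="0.7.1"):
    """Return True when `version` is at or above the release said to carry the fix."""
    return parse_version(version) >= parse_version(fixed_in)

for v in ("0.7.0", "0.7.1", "0.8.0"):
    print(v, "includes the fix" if has_fix(v) else "needs an upgrade")
```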

@CSharpRU
Contributor Author

CSharpRU commented May 17, 2017

Thanks, will update and try to repeat it!
