Race-condition in Consul KV updates during failover #1273
Comments
Hi @timvaillancourt 👋 thanks for the elaborate analysis. I can see the need for a transaction. OK, I'm open to this -- at the time I tried using the "real" consul client and that required an unreasonable amount of dependencies. If you can make this work, please go for it. I have a recollection of someone already suggesting the same in the past, but my search yields nothing. Thank you.
@shlomi-noach thanks for reviewing this! Agreed, there is still a shocking number of dependencies for the official client and a lot of them don't seem to be actually needed 👎. This creates a big, ugly diff to I tried finding an up-to-date but unofficial library with fewer dependencies, but I wasn't able to find one. Another option is creating a Consul client from scratch in this repo. I guess the drawback there is maintaining it. Any preference?
Let's avoid creating a new client, and see if we can perhaps strip down some of the functionality + dependencies of the original? For reference: #320
@shlomi-noach that sounds good 👍. From my testing I think the dependencies
OK cool, let's try to exclude the original
@shlomi-noach a draft PR is ready here: #1276 I was able to reduce the deps from 1500+ files to 500, but it's still a huge list of dependencies. I haven't deleted
@shlomi-noach, kindly ping
Whoops! Dropped this. Looking into.
closed by #1276
👋 Recently we encountered a key-value update race condition under Orchestrator release v3.1.4, using a setup that stores key-values for the current MySQL master in Consul.
Our setup has Orchestrator deployed in Raft mode with KV updates sent to Consul on topology changes. Our MySQL load balancer reads from the same Consul server(s) and updates HaProxy backend configs (via github.com/hashicorp/consul-template) when KVs are changed in Consul, causing a restart of HaProxy - which is used to direct our MySQL traffic. This setup is outlined by @shlomi-noach in this blog post: https://github.blog/2018-06-20-mysql-high-availability-at-github/
Currently go/kv/consul.go uses 4 x separate Consul "put" calls to update 4 x different keys in Consul without using an atomic Consul transaction.

In the problem scenario:
graceful-master-takeover to a new Primary

The timing of this scenario is under 1-3 seconds. We have many HaProxy instances in each site and this race-condition happens on only a portion of the proxies, somewhat intermittently, perhaps 1 in every 10-20 failovers. This situation is remedied by manually triggering a reload of the Load Balancer once the Consul KVs are consistently updated by Orchestrator.
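To illustrate why separate puts can expose a mixed state, here is a small in-memory sketch in Go. It does not talk to Consul; the `kv` type, the `tornStates` helper and the key names in the usage note are hypothetical and only model the idea that a reader (such as consul-template) may observe the store between individual writes.

```go
package main

// kv is a hypothetical key/value update used only for this illustration.
type kv struct{ Key, Value string }

// sameState reports whether two key/value states are identical.
func sameState(a, b map[string]string) bool {
	if len(a) != len(b) {
		return false
	}
	for k, v := range a {
		if bv, ok := b[k]; !ok || bv != v {
			return false
		}
	}
	return true
}

// tornStates applies the updates one key at a time, as separate "put" calls
// would, and counts the intermediate states that match neither the old state
// nor the fully updated state. Each such state is a window in which a reader
// could see a mix of old and new values.
func tornStates(old map[string]string, updates []kv) int {
	current := make(map[string]string, len(old))
	final := make(map[string]string, len(old))
	for k, v := range old {
		current[k] = v
		final[k] = v
	}
	for _, u := range updates {
		final[u.Key] = u.Value
	}

	torn := 0
	for _, u := range updates {
		current[u.Key] = u.Value
		if !sameState(current, old) && !sameState(current, final) {
			torn++
		}
	}
	return torn
}
```

For four keys that all change (for example hypothetical `hostname`, `port`, `ipv4` and `ipv6` entries), the first three writes each leave a mixed state behind. An atomic update has no such intermediate state for a reader to pick up.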
I would like to propose adding a new, optional KVStore interface that uses a Consul transaction to update the KVs atomically in a single operation. This removes the race condition in the update of Consul KVs. It is also more efficient to perform a single HTTP call to Consul.

The drawback of this solution is that the Consul client used in Orchestrator (https://github.com/armon/consul-api) is too out-of-date to support Consul Transactions. This client library is deprecated and has seen no updates in 6-7 years, and it recommends that users switch to the official Consul client library:

Moving towards using the updated, official Consul client would allow Transactions to be used, resolving this race condition. This would also help Orchestrator move away from a stale, deprecated library. However, this would require the official Consul client to be integrated, something I'm happy to tackle myself in some PR(s).
My hypothetical approach to that PR would be:
vendor/
KVStore interface that uses Consul Transactions for KV changes
KVStore interface for Consul communications
github.com/armon/consul-api from the vendor/ dir

cc @shlomi-noach for thoughts on this approach