
etcd v2 failing after update to 1298.5.0 stable #1838

Open
pgburt opened this issue Mar 1, 2017 · 3 comments

@pgburt

pgburt commented Mar 1, 2017

Bug

Container Linux Version

1298.5.0 stable

Expected Behavior

The etcd cluster works after an update.

Actual Behavior

The etcd cluster reports a failure immediately after the update to 1298.5.0 stable.

etcd cluster is unavailable or misconfigured - error from locksmith
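
For anyone hitting this, the cluster state on an affected node can be sanity-checked with the v2 etcdctl client (this assumes the stock etcdctl shipped with Container Linux and the default local endpoint):

etcdctl cluster-health
etcdctl member list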

Other Information

Logs from IRC, where this was reported:

hi all.   has anyone seen any regressions in etcd with the latest stable release?
5:38 PM we've been having massive problems across our cluster these past 24 hours or so
5:39 PM various things that depend on etcd are exhibiting odd behaviour
5:39 PM fleet has inexplicably shut down all managed units when it timed out talking to etcd proxy
5:39 PM locksmith appeared to fail and allowed two nodes in a group to reboot
5:39 PM at the same time
5:40 PM "etcd cluster is unavilable or misconfigured"  - error from locksmith
5:40 PM we don't know if this is a regression or if we've reached some kind of scale limit with our 5-node etcdv2 cluster
5:41 PM certainly nothing has changed config-wise with etcd.  cluster has been amazingly stable until now
5:42 PM etcd may be a red herring here because etcd2 has been at the same version for so long
5:42 PM it could be a kernel bug affecting networking
5:43 PM one thing we noted was that some nodes in the cluster updated to the latest -stable in just the last few hours
so we were running a mix of releases for a while

5:44 PM I think etcd *could* be a red herring (etcd2 has been on the same version for so long now)
5:45 PM it could be a kernel bug...just spitballing here tho

5:45 PM we had the previous stable release on some nodes up until just now
@coresolve

coresolve commented Mar 1, 2017

I saw this as well. It seemed that locksmith came up and tried to hit etcd before etcd was up.

Since the update strategy relies on etcd, this should be addressed by setting update-engine.service to require etcd-member.service.

I suspect this has to do with moving etcd to a container and it taking a little longer to start.
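
For anyone who wants to try that ordering now, a cloud-config drop-in along these lines should do it (untested sketch; the drop-in filename is arbitrary, and a similar drop-in on locksmithd.service may also be needed, since locksmith is what reports the etcd error):

coreos:
  units:
    - name: update-engine.service
      drop-ins:
        - name: 10-wait-for-etcd.conf
          content: |
            [Unit]
            # do not start this unit until the etcd member unit has started
            Requires=etcd-member.service
            After=etcd-member.service

Note that this only orders startup against the etcd unit being started; it does not wait for etcd to actually be healthy.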

@chrissnell

Thanks for the bug, @pgburt. I'm the one who reported this on IRC. We are following this closely.

@chrissnell

As a work-around for people having problems with fleet because of this, @stensonb here at Revinate implemented a change to the fleet configuration:

coreos:
  fleet:
    # how often (seconds) the engine reconciles the cluster schedule
    engine-reconcile-interval: 10
    # seconds to allow a single etcd request before treating it as failed
    etcd-request-timeout: 5
    # heartbeat TTL before an agent is considered dead and its units rescheduled
    agent-ttl: "120s"

This helps fleet but does not address the larger issues with other etcd dependents.
