
etcd v2 failing after update to 1298.5.0 stable #1838

Open
pgburt opened this issue Mar 1, 2017 · 3 comments

@pgburt

pgburt commented Mar 1, 2017

Bug

Container Linux Version

1298.5.0 stable

Expected Behavior

The etcd cluster works after an update.

Actual Behavior

The etcd cluster reports a failure immediately after the update to 1298.5.0 stable.

etcd cluster is unavailable or misconfigured - error from locksmith
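
For anyone hitting this, the cluster state on an affected node can be sanity-checked with the v2 etcdctl client (this assumes the stock etcdctl shipped with Container Linux and the default local endpoint):

etcdctl cluster-health
etcdctl member list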

Other Information

Logs from IRC, where this was reported:

hi all.   has anyone seen any regressions in etcd with the latest stable release?
5:38 PM we've been having massive problems across our cluster these past 24 hours or so
5:39 PM various things that depend on etcd are exhibiting odd behaviour
5:39 PM fleet has inexplicably shut down all managed units when it timed out talking to etcd proxy
5:39 PM locksmith appeared to fail and allowed two nodes in a group to reboot
5:39 PM at the same time
5:40 PM "etcd cluster is unavilable or misconfigured"  - error from locksmith
5:40 PM we don't know if this is a regression or if we've reached some kind of scale limit with our 5-node etcdv2 cluster
5:41 PM certainly nothing has changed config-wise with etcd.  cluster has been amazingly stable until now
5:42 PM etcd may be a red herring here because etcd2 has been at the same version for so long
5:42 PM it could be a kernel bug affecting networking
5:43 PM one thing we noted was that some nodes in the cluster updated to the latest -stable in just the last few hours
so we were running a mix of releases for a while

5:44 PM I think etcd *could* be a red herring (etcd2 has been on the same version for so long now)
5:45 PM it could be a kernel bug...just spitballing here tho

5:45 PM we had the previous stable release on some nodes up until just now
@coresolve

coresolve commented Mar 1, 2017

I saw this as well. It seemed that locksmith came up and tried to hit etcd before etcd was up.

Since the update strategy relies on etcd, this should be addressed by setting update-engine.service to require etcd-member.service.

I suspect this has to do with moving etcd to a container and it taking a little longer to start.
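
For anyone who wants to try that ordering now, a cloud-config drop-in along these lines should do it (untested sketch; the drop-in filename is arbitrary, and a similar drop-in on locksmithd.service may also be needed, since locksmith is what reports the etcd error):

coreos:
  units:
    - name: update-engine.service
      drop-ins:
        - name: 10-wait-for-etcd.conf
          content: |
            [Unit]
            # do not start this unit until the etcd member unit has started
            Requires=etcd-member.service
            After=etcd-member.service

Note that this only orders startup against the etcd unit being started; it does not wait for etcd to actually be healthy.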

@chrissnell

Thanks for the bug, @pgburt. I'm the one who reported this on IRC. We are following this closely.

@chrissnell

As a work-around for people having problems with fleet because of this, @stensonb here at Revinate implemented a change to the fleet configuration:

coreos:
  fleet:
    # how often (seconds) the engine reconciles the cluster schedule
    engine-reconcile-interval: 10
    # seconds to allow a single etcd request before treating it as failed
    etcd-request-timeout: 5
    # heartbeat TTL before an agent is considered dead and its units rescheduled
    agent-ttl: "120s"

This helps fleet but does not address the larger issues with other etcd dependents.
