This repository has been archived by the owner on Sep 30, 2020. It is now read-only.

Unable to create cluster with more than 1 etcd #1206

Closed
mludvig opened this issue Mar 29, 2018 · 72 comments

Comments

@mludvig

mludvig commented Mar 29, 2018

Hi, I've been trying for a few hours to create a cluster with 3 etcd instances, but it always times out. It looks like the ASG for Etcd0 is created first and its instance keeps trying to connect to the other two etcd instances, which do not yet exist, so the initialisation times out. If the Etcd1 and Etcd2 ASGs were created in parallel it would probably work, as the instances would start up simultaneously and could connect to each other.

I had the same result both with .etcd.memberIdentityProvider == eip and with eni: in both cases etcd0 tried to connect to the other, not-yet-existing nodes, either over EIP or over ENI, and timed out.

I'm using a pre-existing VPC with existing subnets: 3x private with NAT and 3x DMZ with public IPs enabled by default. I tried to put the etcd nodes both in the private and in the DMZ subnets, and both failed when more than 1 node was requested.
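
For context, the setup described above corresponds to an etcd section in cluster.yaml roughly like the following. This is only a sketch reconstructed from the settings mentioned in this issue; the subnet names are placeholders, not the actual ones:

etcd:
  count: 3
  memberIdentityProvider: eip   # also tried: eni
  subnets:
  - name: PrivateSubnet0
  - name: PrivateSubnet1
  - name: PrivateSubnet2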

@steinfletcher

Hi, I am also seeing similar behaviour today using both v0.9.8 and v0.9.9.

I have etcd.count: 3 deploying into an existing private subnet. I'm getting this from journalctl on the first etcd node, which is trying to reach the other 2 etcd nodes (which never launch).

Mar 29 19:37:20 ip-x.eu-west-1.compute.internal etcd-wrapper[1467]: 2018-03-29 19:37:20.996256 W | rafthttp: health check for peer b48943dd77f32763 could not connect: dial tcp x.x.x.x:2380: i/o timeout
Mar 29 19:37:20 ip-x.eu-west-1.compute.internal etcd-wrapper[1467]: 2018-03-29 19:37:20.996304 W | rafthttp: health check for peer 62fde287b92dfdf could not connect: dial tcp x.x.x.x:2380: i/o timeout

Looks like the cfn signal is never sent from Etcd0 and the control plane nested stack fails. From the cfn event log:

Etcd0 | Received 0 SUCCESS signal(s) out of 1. Unable to satisfy 100% MinSuccessfulInstancesPercent requirement

If I set etcd.count: 1 then everything works fine. I am a bit stumped and will continue poking around...

@luck02
Contributor

luck02 commented Mar 29, 2018

I'm seeing the same behaviour; we're on v0.9.8...

Mar 29 21:32:32 ip-yyyec2.internal etcd-wrapper[1476]: 2018-03-29 21:32:32.087296 E | etcdserver: publish error: etcdserver: request timed out
Mar 29 21:32:32 ip-yyyec2.internal etcd-wrapper[1476]: 2018-03-29 21:32:32.630920 I | raft: 719e986611adb617 is starting a new election at term 510
Mar 29 21:32:32 ip-yyyec2.internal etcd-wrapper[1476]: 2018-03-29 21:32:32.630962 I | raft: 719e986611adb617 became candidate at term 511
Mar 29 21:32:32 ip-yyyec2.internal etcd-wrapper[1476]: 2018-03-29 21:32:32.630974 I | raft: 719e986611adb617 received MsgVoteResp from 719e986611adb617 at term 511
Mar 29 21:32:32 ip-yyyec2.internal etcd-wrapper[1476]: 2018-03-29 21:32:32.630984 I | raft: 719e986611adb617 [logterm: 1, index: 3] sent MsgVote request to a0d815f4b93422a9 at term 511
Mar 29 21:32:32 ip-yyyec2.internal etcd-wrapper[1476]: 2018-03-29 21:32:32.630992 I | raft: 719e986611adb617 [logterm: 1, index: 3] sent MsgVote request to f8cabdc7bae4698a at term 511
Mar 29 21:32:34 ip-yyyec2.internal etcd-wrapper[1476]: 2018-03-29 21:32:34.430903 I | raft: 719e986611adb617 is starting a new election at term 511
Mar 29 21:32:34 ip-yyyec2.internal etcd-wrapper[1476]: 2018-03-29 21:32:34.430945 I | raft: 719e986611adb617 became candidate at term 512
Mar 29 21:32:34 ip-yyyec2.internal etcd-wrapper[1476]: 2018-03-29 21:32:34.430958 I | raft: 719e986611adb617 received MsgVoteResp from 719e986611adb617 at term 512
Mar 29 21:32:34 ip-yyyec2.internal etcd-wrapper[1476]: 2018-03-29 21:32:34.430968 I | raft: 719e986611adb617 [logterm: 1, index: 3] sent MsgVote request to a0d815f4b93422a9 at term 512
Mar 29 21:32:34 ip-yyyec2.internal etcd-wrapper[1476]: 2018-03-29 21:32:34.430977 I | raft: 719e986611adb617 [logterm: 1, index: 3] sent MsgVote request to f8cabdc7bae4698a at term 512
Mar 29 21:32:35 ip-yyyec2.internal etcd-wrapper[1476]: 2018-03-29 21:32:35.090013 W | rafthttp: health check for peer a0d815f4b93422a9 could not connect: dial tcp x.y.z.a:2380: i/o timeout
Mar 29 21:32:35 ip-yyyec2.internal etcd-wrapper[1476]: 2018-03-29 21:32:35.091244 W | rafthttp: health check for peer f8cabdc7bae4698a could not connect: dial tcp a.x.y.z:2380: i/o timeout

@luck02
Contributor

luck02 commented Mar 29, 2018

It's related to the wait signal. @steinfletcher + @mludvig, try adding this to your cluster.yaml:

waitSignal:
  enabled: false
  maxBatchSize: 1

The relevant template is:

"{{$etcdInstance.LogicalName}}": {
      "Type": "AWS::AutoScaling::AutoScalingGroup",
      "Properties": {
        "HealthCheckGracePeriod": 600,
        "HealthCheckType": "EC2",
        "LaunchConfigurationName": {
          "Ref": "{{$etcdInstance.LaunchConfigurationLogicalName}}"
        },
        "MaxSize": "1",
        "MetricsCollection": [
          {
            "Granularity": "1Minute"
          }
        ],
        "MinSize": "1",
        "Tags": [
          {
            "Key": "kubernetes.io/cluster/{{$.ClusterName}}",
            "PropagateAtLaunch": "true",
            "Value": "true"
          },
          {
            "Key": "Name",
            "PropagateAtLaunch": "true",
            "Value": "{{$.ClusterName}}-{{$.StackName}}-kube-aws-etcd-{{$etcdIndex}}"
          },
          {
            "Key": "kube-aws:role",
            "PropagateAtLaunch": "true",
            "Value": "etcd"
          }
        ],
        "VPCZoneIdentifier": [
          {{$etcdInstance.SubnetRef}}
        ]
      },
      {{if $.WaitSignal.Enabled}}
      "CreationPolicy" : {
        "ResourceSignal" : {
          "Count" : "1",
          "Timeout" : "{{$.Controller.CreateTimeout}}"
        }
      },
      {{end}}
      "UpdatePolicy" : {
        "AutoScalingRollingUpdate" : {
          "MinInstancesInService" : "0",
          "MaxBatchSize" : "1",
          {{if $.WaitSignal.Enabled}}
          "WaitOnResourceSignals" : "true",
          "PauseTime": "{{$.Controller.CreateTimeout}}"
          {{else}}
          "PauseTime": "PT2M"
          {{end}}
        }
      },

I was able to get this working by disabling the signal. The next question is: how did this ever work? Something underlying in the cfn engine must have changed with respect to simultaneous execution.

Here's my etcd log after setting wait to false:

-- Logs begin at Thu 2018-03-29 22:20:06 UTC. --
Mar 29 22:25:21 ip-d.d.d.d.ec2.internal systemd[1]: Started Session 1 of user core.
Mar 29 22:25:21 ip-d.d.d.d.ec2.internal systemd-logind[780]: New session 1 of user core.
Mar 29 22:25:21 ip-d.d.d.d.ec2.internal systemd[1530]: Reached target Paths.
Mar 29 22:25:21 ip-d.d.d.d.ec2.internal systemd[1530]: Reached target Sockets.
Mar 29 22:25:21 ip-d.d.d.d.ec2.internal systemd[1530]: Reached target Timers.
Mar 29 22:25:21 ip-d.d.d.d.ec2.internal systemd[1530]: Reached target Basic System.
Mar 29 22:25:21 ip-d.d.d.d.ec2.internal systemd[1530]: Reached target Default.
Mar 29 22:25:21 ip-d.d.d.d.ec2.internal systemd[1530]: Startup finished in 23ms.
Mar 29 22:25:21 ip-d.d.d.d.ec2.internal systemd[1]: Started User Manager for UID 500.
Mar 29 22:25:23 ip-d.d.d.d.ec2.internal etcd-wrapper[1465]: 2018-03-29 22:25:23.592989 W | rafthttp: health check for peer 596daac612174e37 could not connect: dial tcp d.d.d.d:2380: i/o timeout
Mar 29 22:25:25 ip-d.d.d.d.ec2.internal etcd-wrapper[1465]: 2018-03-29 22:25:25.906323 W | etcdserver: failed to reach the peerURL(https://d.d.d.d.compute-1.amazonaws.com:2380) of member 596daac612174e37 (Get https://d.d.d.d.compute-1.amazonaws.com:2380/version: dial tcp d.d.d.d:2380: i/o timeout)
Mar 29 22:25:25 ip-d.d.d.d.ec2.internal etcd-wrapper[1465]: 2018-03-29 22:25:25.906359 W | etcdserver: cannot get the version of member 596daac612174e37 (Get https://d.d.d.d.compute-1.amazonaws.com:2380/version: dial tcp d.d.d.d:2380: i/o timeout)
Mar 29 22:25:28 ip-d.d.d.d.ec2.internal etcd-wrapper[1465]: 2018-03-29 22:25:28.593180 W | rafthttp: health check for peer 596daac612174e37 could not connect: dial tcp d.d.d.d:2380: i/o timeout
Mar 29 22:25:31 ip-d.d.d.d.ec2.internal etcd-wrapper[1465]: 2018-03-29 22:25:31.108598 W | etcdserver: failed to reach the peerURL(https://d.d.d.d.compute-1.amazonaws.com:2380) of member 596daac612174e37 (Get https://d.d.d.d.compute-1.amazonaws.com:2380/version: dial tcp d.d.d.d:2380: i/o timeout)
Mar 29 22:25:31 ip-d.d.d.d.ec2.internal etcd-wrapper[1465]: 2018-03-29 22:25:31.108630 W | etcdserver: cannot get the version of member 596daac612174e37 (Get https://d.d.d.d.compute-1.amazonaws.com:2380/version: dial tcp d.d.d.d:2380: i/o timeout)
Mar 29 22:25:33 ip-d.d.d.d.ec2.internal etcd-wrapper[1465]: 2018-03-29 22:25:33.593363 W | rafthttp: health check for peer 596daac612174e37 could not connect: dial tcp d.d.d.d:2380: i/o timeout
Mar 29 22:25:36 ip-d.d.d.d.ec2.internal etcd-wrapper[1465]: 2018-03-29 22:25:36.310906 W | etcdserver: failed to reach the peerURL(https://d.d.d.d.compute-1.amazonaws.com:2380) of member 596daac612174e37 (Get https://d.d.d.d.compute-1.amazonaws.com:2380/version: dial tcp d.d.d.d:2380: i/o timeout)
Mar 29 22:25:36 ip-d.d.d.d.ec2.internal etcd-wrapper[1465]: 2018-03-29 22:25:36.310939 W | etcdserver: cannot get the version of member 596daac612174e37 (Get https://d.d.d.d.compute-1.amazonaws.com:2380/version: dial tcp d.d.d.d:2380: i/o timeout)
Mar 29 22:25:38 ip-d.d.d.d.ec2.internal etcd-wrapper[1465]: 2018-03-29 22:25:38.593555 W | rafthttp: health check for peer 596daac612174e37 could not connect: dial tcp d.d.d.d:2380: i/o timeout
Mar 29 22:25:41 ip-d.d.d.d.ec2.internal etcd-wrapper[1465]: 2018-03-29 22:25:41.513243 W | etcdserver: failed to reach the peerURL(https://d.d.d.d.compute-1.amazonaws.com:2380) of member 596daac612174e37 (Get https://d.d.d.d.compute-1.amazonaws.com:2380/version: dial tcp d.d.d.d:2380: i/o timeout)
Mar 29 22:25:41 ip-d.d.d.d.ec2.internal etcd-wrapper[1465]: 2018-03-29 22:25:41.513275 W | etcdserver: cannot get the version of member 596daac612174e37 (Get https://d.d.d.d.compute-1.amazonaws.com:2380/version: dial tcp d.d.d.d:2380: i/o timeout)
Mar 29 22:25:41 ip-d.d.d.d.ec2.internal etcd-wrapper[1465]: 2018-03-29 22:25:41.901783 I | rafthttp: peer 596daac612174e37 became active
Mar 29 22:25:41 ip-d.d.d.d.ec2.internal etcd-wrapper[1465]: 2018-03-29 22:25:41.901825 I | rafthttp: established a TCP streaming connection with peer 596daac612174e37 (stream MsgApp v2 reader)
Mar 29 22:25:41 ip-d.d.d.d.ec2.internal etcd-wrapper[1465]: 2018-03-29 22:25:41.902251 I | rafthttp: established a TCP streaming connection with peer 596daac612174e37 (stream Message reader)
Mar 29 22:25:45 ip-d.d.d.d.ec2.internal etcd-wrapper[1465]: 2018-03-29 22:25:45.526774 I | etcdserver: updating the cluster version from 3.0 to 3.2
Mar 29 22:25:45 ip-d.d.d.d.ec2.internal etcd-wrapper[1465]: 2018-03-29 22:25:45.529138 N | etcdserver/membership: updated the cluster version from 3.0 to 3.2
Mar 29 22:25:45 ip-d.d.d.d.ec2.internal etcd-wrapper[1465]: 2018-03-29 22:25:45.529325 I | etcdserver/api: enabled capabilities for version 3.2

@luck02
Contributor

luck02 commented Mar 29, 2018

We've asked our AWS Technical Account Managers to see if the CF team can shed any insight.

The other thing I'm wondering about, and haven't had a chance to check yet: perhaps the etcd version / image isn't locked down and something changed there? I'll look later this evening when I have time.

@steinfletcher

Thanks @luck02. "Something underlying in the cfn engine must have changed with respect to simultaneous execution." Yeah, I'm also suspecting this.

@mumoshu
Contributor

mumoshu commented Apr 2, 2018

Each etcd node has a dedicated ASG which depends on the next etcd node for sequential launch and rolling update, so there should be no simultaneous execution (if that's what you meant).
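
For illustration, the one-at-a-time ordering visible in the stack events is what a CloudFormation DependsOn chain between the per-node etcd ASGs produces. A simplified sketch, not the exact generated template (the logical names here are assumptions):

"Etcd1": {
  "Type": "AWS::AutoScaling::AutoScalingGroup",
  "DependsOn": "Etcd0",
  "Properties": {
    "MinSize": "1",
    "MaxSize": "1",
    "LaunchConfigurationName": { "Ref": "Etcd1LaunchConfiguration" },
    "VPCZoneIdentifier": [ { "Ref": "Etcd1Subnet" } ]
  }
}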

The first etcd node in your cluster should just start without waiting for any other etcd nodes, as implemented in etcdadm, so in my understanding something like what's reported here shouldn't happen normally.

I had troubleshot a case before where certain user-provided EC2 tags on etcd nodes confused etcdadm so that it was unable to calculate the correct number of "running etcd nodes", and therefore it failed to bootstrap any etcd cluster with more than 1 node.

Can you confirm whether you have any such stackTags in cluster.yaml, and whether omitting them resolves the issue? Thx!

@mludvig
Author

mludvig commented Apr 2, 2018

Hi, thanks for the answer. Nope, I don't have stackTags set:

# AWS Tags for cloudformation stack resources
#stackTags:
#  Name: "Kubernetes"
#  Environment: "Production"

@mumoshu
Contributor

mumoshu commented Apr 2, 2018

@mludvig Thx! Would you mind sharing the result of journalctl -u etcdadm-reconfigure.service on your failing etcd node? A GitHub gist would be nice.

@mludvig
Author

mludvig commented Apr 2, 2018

Here:

ip-10-0-10-151 ~ # journalctl -u etcdadm-reconfigure.service
-- Logs begin at Mon 2018-04-02 08:59:20 UTC, end at Mon 2018-04-02 09:04:37 UTC. --
Apr 02 09:00:09 ip-10-0-10-151.ap-southeast-2.compute.internal systemd[1]: Starting etcdadm reconfigure runner...
Apr 02 09:00:09 ip-10-0-10-151.ap-southeast-2.compute.internal etcdadm[1370]: declare -x ETCDCTL_CACERT="/etc/ssl/certs/etcd-trusted-ca.pem"
Apr 02 09:00:09 ip-10-0-10-151.ap-southeast-2.compute.internal etcdadm[1370]: declare -x ETCDCTL_CA_FILE="/etc/ssl/certs/etcd-trusted-ca.pem"
Apr 02 09:00:09 ip-10-0-10-151.ap-southeast-2.compute.internal etcdadm[1370]: declare -x ETCDCTL_CERT="/etc/ssl/certs/etcd-client.pem"
Apr 02 09:00:09 ip-10-0-10-151.ap-southeast-2.compute.internal etcdadm[1370]: declare -x ETCDCTL_CERT_FILE="/etc/ssl/certs/etcd-client.pem"
Apr 02 09:00:09 ip-10-0-10-151.ap-southeast-2.compute.internal etcdadm[1370]: declare -x ETCDCTL_KEY="/etc/ssl/certs/etcd-client-key.pem"
Apr 02 09:00:09 ip-10-0-10-151.ap-southeast-2.compute.internal etcdadm[1370]: declare -x ETCDCTL_KEY_FILE="/etc/ssl/certs/etcd-client-key.pem"
Apr 02 09:00:09 ip-10-0-10-151.ap-southeast-2.compute.internal sudo[1376]:     root : TTY=unknown ; PWD=/ ; USER=root ; COMMAND=/bin/[ -w /var/run/coreos/etcdadm ]
Apr 02 09:00:09 ip-10-0-10-151.ap-southeast-2.compute.internal sudo[1376]: pam_unix(sudo:session): session opened for user root by (uid=0)
Apr 02 09:00:09 ip-10-0-10-151.ap-southeast-2.compute.internal sudo[1376]: pam_unix(sudo:session): session closed for user root
Apr 02 09:00:10 ip-10-0-10-151.ap-southeast-2.compute.internal sudo[1391]:     root : TTY=unknown ; PWD=/ ; USER=root ; COMMAND=/bin/[ -w /var/run/coreos/etcdadm/snapshots ]
Apr 02 09:00:10 ip-10-0-10-151.ap-southeast-2.compute.internal sudo[1391]: pam_unix(sudo:session): session opened for user root by (uid=0)
Apr 02 09:00:10 ip-10-0-10-151.ap-southeast-2.compute.internal sudo[1391]: pam_unix(sudo:session): session closed for user root
Apr 02 09:00:10 ip-10-0-10-151.ap-southeast-2.compute.internal etcdadm[1370]: panic! etcd data dir "/var/lib/etcd2" does not exist
Apr 02 09:00:10 ip-10-0-10-151.ap-southeast-2.compute.internal systemd[1]: Started etcdadm reconfigure runner.
ip-10-0-10-151 ~ # 

Interestingly /var/lib/etcd2 exists:

ip-10-0-10-151 ~ # find /var/lib/etcd2
/var/lib/etcd2
/var/lib/etcd2/member
/var/lib/etcd2/member/snap
/var/lib/etcd2/member/snap/db
/var/lib/etcd2/member/wal
/var/lib/etcd2/member/wal/0.tmp
/var/lib/etcd2/member/wal/0000000000000000-0000000000000000.wal
/var/lib/etcd2/lost+found
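
(A quick way to confirm whether /var/lib/etcd2 is the mounted data volume rather than just a directory on the root filesystem, and to check the state of the related units; a hedged diagnostic sketch using standard tools and the unit names that come up later in this thread:)

findmnt /var/lib/etcd2
systemctl status var-lib-etcd2.mount format-etcd2-volume.service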

@luck02
Contributor

luck02 commented Apr 2, 2018

@mumoshu FWIW, here's our stack tags:

# AWS Tags for cloudformation stack resources
stackTags:
  environment: "{{ stack_env }}"
  project:     "{{ PROJECT_NAME }}"
  owner:       "{{ PROJECT_OWNER }}"

Note that this hasn't changed, and we're using a template to populate it; on execution it would look more like:

# AWS Tags for cloudformation stack resources
stackTags:
  environment: "test"
  project:     "ub-data-infrastructure/cluster"
  owner:       "dataops"

Again, this hasn't changed. I'd be interested in hearing more along the lines of: "I had troubleshot a case before where certain user-provided EC2 tags on etcd nodes confused etcdadm so that it was unable to calculate the correct number of "running etcd nodes", and therefore it failed to bootstrap any etcd cluster with more than 1 node."

In the meantime, if you'd like to see our etcd logs as requested above I can provide them as well; I just need to undo the waitSignal change:

waitSignal:
  enabled: false
  maxBatchSize: 1

@jcrugzz

jcrugzz commented Apr 3, 2018

Yeah, this started happening to me last Friday when I tried to create a new cluster with a basically identical config to a cluster I created a few weeks earlier. Something subtle definitely must have changed. I'm currently afraid to run kube-aws update on any of my clusters, but I need to soon. Can I trust that waitSignal workaround for updating a live prod cluster? Or do I need to think about other options?

I have a hard time thinking it's a stackTags issue in my case since it was never a problem previously.

How this manifested for me was an "etcdadm-check.service: Failed with result 'exit-code'." on the first etcd node that tried to come up, preventing anything else from happening.
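
(For anyone else hitting this, the failing unit can be inspected directly on the node; a small sketch using standard systemd tooling:)

systemctl status etcdadm-check.service
journalctl -u etcdadm-check.service --no-pager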

@luck02
Contributor

luck02 commented Apr 3, 2018

@jcrugzz I'm just working my way through some fixes (waitSignal included). I expect to be deploying to our production this evening / tomorrow. I will update with my experiences. I am running into some other issues, but they may be unrelated to this issue.

@jcrugzz

jcrugzz commented Apr 3, 2018

Thanks @luck02 appreciate it!

@iherbmatt

Hey guys. I disabled the wait signal and it generated all the appropriate machines; however, the masters are no longer healthy. The cluster.yaml file I'm using is one I've been using since 0.9.9 originally came out. Should it work simply by uncommenting the waitSignal block and setting it to disabled?

@luck02
Contributor

luck02 commented Apr 4, 2018

@jcrugzz / everyone else.

I've burned quite a bit of time testing this. I don't think disabling waitSignal is going to be viable; quite a few of my validation steps start failing randomly. Of course YMMV, but we want to validate that our cluster is healthy at the end, and disabling waitSignal makes that challenging.

I did hear back from our AWS technical account managers. They claim 0 changes in the underlying CFN code. They've offered to investigate a failed stack for us, which I'll set up tomorrow morning (PST). I didn't see the etcd container pinned to a specific version, so my next theory is that if the image isn't locked down we could be pulling a different container and seeing drift there (i.e. perhaps it's not reporting success / failure in the same way, etc.).

I'll continue investigating.

@ktateish
Contributor

ktateish commented Apr 4, 2018

I'm facing the same problem too.
I noticed some behavior:

  • When it failed, journalctl -u etcdadm-reconfigure on the etcd0 node showed logs like @mludvig reported.
  • When I ran systemctl restart etcdadm-reconfigure on the etcd0 node after kube-aws up failed, etcdadm-reconfigure looked like it was working properly (the logs show it pulling container images successfully).
  • I tried kube-aws up several times, and so far it always succeeds in my environment after applying the following patch to the userdata:
diff --git a/userdata/cloud-config-etcd b/userdata/cloud-config-etcd
index cf306f6..f613337 100644
--- a/userdata/cloud-config-etcd
+++ b/userdata/cloud-config-etcd
@@ -156,6 +156,7 @@ coreos:
         RestartSec=5
         EnvironmentFile=-/etc/etcd-environment
         EnvironmentFile=-/var/run/coreos/etcdadm-environment
+        ExecStartPre=/usr/bin/sleep 60
         ExecStartPre=/usr/bin/systemctl is-active cfn-etcd-environment.service
         ExecStartPre=/usr/bin/mkdir -p /var/run/coreos/etcdadm/snapshots
         ExecStart=/opt/bin/etcdadm reconfigure

I think the etcdadm-reconfigure unit is started too early during the etcd nodes' boot.
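
(One way to check that ordering hypothesis on a node is to look at the unit's startup chain and declared dependencies; a sketch using standard systemd tooling:)

systemd-analyze critical-chain etcdadm-reconfigure.service
systemctl show etcdadm-reconfigure.service -p After -p Wants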

@luck02
Contributor

luck02 commented Apr 4, 2018

I just checked and a new version of etcd was released 6 days ago, so presumably it's related.

I'm just cleaning up a semi-related mess and then going to set our etcd version back to what was out a month ago. I'm assuming that's going to solve the issue as well.

I'll report back when I'm done.

etcd version is set here:
https://github.com/kubernetes-incubator/kube-aws/blob/master/core/controlplane/config/templates/cluster.yaml#L648

I'd expect to be testing that this evening / tomorrow.

@kylegoch

kylegoch commented Apr 4, 2018

Seeing the exact same behavior as well. We were testing a dev build. Had a known working cluster.yaml, went to recreate and got the same errors as above.

We are using etcd version 3.2.10

Edit: Using @ktateish's patch from above on the userdata made everything work again. Wonder why it broke in the first place.

@luck02
Contributor

luck02 commented Apr 4, 2018

@kylegoch So you've pinned your etcd version to v3.2.1, which according to GitHub was built on Jun 23, 2017?

Ok, that's really odd. Something changed... If it wasn't CFN and it wasn't etcd...

I'm going to experiment with pinning my version to something older than last month just to replicate the issue with a pinned version (previously we weren't pinning the version)

@kylegoch

kylegoch commented Apr 4, 2018

We are using 3.2.10 from November. Not sure why that version, but that's what we have always used.

And the cluster.yaml I'm working with right now worked just fine about 10 days ago.

@iherbmatt

iherbmatt commented Apr 4, 2018 via email

@mludvig
Author

mludvig commented Apr 4, 2018

I can confirm that @ktateish's workaround with sleep 60 works for me. I just created a cluster with 3 etcd nodes:

+00:02:57	Controlplane	CREATE_IN_PROGRESS      		Etcd0                 
+00:02:57	Controlplane	CREATE_IN_PROGRESS      		Etcd0                 	"Resource creation Initiated"
+00:06:24	Controlplane	CREATE_IN_PROGRESS      		Etcd0                 	"Received SUCCESS signal with UniqueId i-0b5da874acdc0e7bb"
+00:06:25	Controlplane	CREATE_COMPLETE         		Etcd0                 
+00:06:30	Controlplane	CREATE_IN_PROGRESS      		Etcd1                 
+00:06:31	Controlplane	CREATE_IN_PROGRESS      		Etcd1                 	"Resource creation Initiated"
+00:09:56	Controlplane	CREATE_IN_PROGRESS      		Etcd1                 	"Received SUCCESS signal with UniqueId i-0be602b3afcacc247"
+00:09:58	Controlplane	CREATE_COMPLETE         		Etcd1                 
+00:10:02	Controlplane	CREATE_IN_PROGRESS      		Etcd2                 
+00:10:03	Controlplane	CREATE_IN_PROGRESS      		Etcd2                 	"Resource creation Initiated"
+00:12:47	Controlplane	CREATE_IN_PROGRESS      		Etcd2                 	"Received SUCCESS signal with UniqueId i-097a60f76baa844f7"
+00:12:48	Controlplane	CREATE_COMPLETE         		Etcd2                 

@luck02
Contributor

luck02 commented Apr 4, 2018

Now I'm wondering if the versioning provided in the cluster.yaml is effective. I just added this to my cluster.yaml config:

etcd:
  # etc
  version: 3.3.1

but when I log into the etcd from my failed cluster I get:

core@ip-x-y-z-etc ~ $ etcdctl version
etcdctl version: 3.2.15
API version: 3.2

3.2.15 was built in January, and I see it's a failed cluster, so presumably that's the end of the line for this enquiry. I'll do the sleep workaround for now.

@iherbmatt

iherbmatt commented Apr 4, 2018 via email

@luck02
Contributor

luck02 commented Apr 4, 2018

@iherbmatt It depends on your setup. For us it's a bit complicated, and the easiest way for me to do this is to apply a hotfix to the kube-aws source code and build myself a new hotfix version. But that's because our deployment pipeline doesn't leave the artifacts for me to locally jury-rig. We do have some pipeline stuff I could jury-rig to apply the fix, but it's really ugly (ansible - lineinfile - regex etc).

@iherbmatt

iherbmatt commented Apr 4, 2018 via email

@luck02
Contributor

luck02 commented Apr 4, 2018

That's correct. In this case there's no commit to cherry-pick, but applying a diff amounts to the same thing. I'm running off of v0.9.8, so this is the patch I applied:

commit 19ad26bd147ec9882dfb7e67f5aa854a331cf2cd (HEAD -> v0.9.8-hotfix4, tag: v0.9.8-hotfix4)
Author: Gary Lucas <gary.lucas@unbounce.com>
Date:   Wed Apr 4 16:03:06 2018 -0700

    more fixii

diff --git a/core/controlplane/config/templates/cloud-config-etcd b/core/controlplane/config/templates/cloud-config-etcd
index 2d25d487..c8ae763b 100644
--- a/core/controlplane/config/templates/cloud-config-etcd
+++ b/core/controlplane/config/templates/cloud-config-etcd
@@ -166,6 +166,7 @@ coreos:
         RestartSec=5
         EnvironmentFile=-/etc/etcd-environment
         EnvironmentFile=-/var/run/coreos/etcdadm-environment
+        ExecStartPre=/usr/bin/sleep 60
         ExecStart=/opt/bin/etcdadm member_status_set_started
         {{if .Etcd.Snapshot.IsAutomatedForEtcdVersion .Etcd.Version -}}
         ExecStartPost=/usr/bin/systemctl start etcdadm-save.timer

Mind you, my new stack isn't up yet.

@luck02
Contributor

luck02 commented Apr 4, 2018

Goddamn it, I put the 'fix' in the wrong stanza (the update service instead of reconfigure).

I'll try again this eve.

@luck02
Contributor

luck02 commented Apr 4, 2018

Applied this:

commit 4d6a8b89431828638a5414a5a73b4404c58514e9 (HEAD -> v0.9.8-hotfix5, tag: v0.9.8-hotfix5, v0.9.8-hotfix4)
Author: Gary Lucas <gary.lucas@unbounce.com>
Date:   Wed Apr 4 16:46:36 2018 -0700

    moved the sleep command

diff --git a/core/controlplane/config/templates/cloud-config-etcd b/core/controlplane/config/templates/cloud-config-etcd
index c8ae763b..e85ca23c 100644
--- a/core/controlplane/config/templates/cloud-config-etcd
+++ b/core/controlplane/config/templates/cloud-config-etcd
@@ -140,6 +140,7 @@ coreos:
         RestartSec=5
         EnvironmentFile=-/etc/etcd-environment
         EnvironmentFile=-/var/run/coreos/etcdadm-environment
+        ExecStartPre=/usr/bin/sleep 60
         ExecStartPre=/usr/bin/systemctl is-active cfn-etcd-environment.service
         ExecStartPre=/usr/bin/mkdir -p /var/run/coreos/etcdadm/snapshots
         ExecStart=/opt/bin/etcdadm reconfigure
@@ -166,7 +167,6 @@ coreos:
         RestartSec=5
         EnvironmentFile=-/etc/etcd-environment
         EnvironmentFile=-/var/run/coreos/etcdadm-environment
-        ExecStartPre=/usr/bin/sleep 60
         ExecStart=/opt/bin/etcdadm member_status_set_started
         {{if .Etcd.Snapshot.IsAutomatedForEtcdVersion .Etcd.Version -}}
         ExecStartPost=/usr/bin/systemctl start etcdadm-save.timer

@davidmccormick
Contributor

davidmccormick commented Apr 5, 2018

Isn't having a service that reconfigures the type of the etcd service a lot of added complexity? Isn't the point of the disasterRecovery option that it can recover nodes that have failed to be part of the etcd cluster? I would rather it be left as notify, but with all etcd nodes initially created in parallel. What do you think?

@ktateish
Contributor

ktateish commented Apr 5, 2018

Oh, I missed something. I thought it would also need to be fixed on starting etcdadm-reconfigure, in addition to your patch. But your patch alone has the same effect in a nicer way. Am I right?

@iherbmatt

Hi Everyone,

I'm seeing this now:

member [omitted] is healthy: got healthy result from https://[omitted].compute.amazonaws.com:2379
member [omitted] is healthy: got healthy result from https://[omitted].compute.amazonaws.com:2379
member [omitted] is healthy: got healthy result from https://[omitted].compute.amazonaws.com:2379
cluster is healthy

It appears etcd is healthy, and I'm seeing this in the controller logs as well. I'm having trouble getting the controllers to come up now, however. I'm going to try to build it again and see what happens.

@luck02
Contributor

luck02 commented Apr 5, 2018

I applied this:

commit 65722a891eca5e8a5ff9538e2837d7bbeb84390f (HEAD -> unbounce-v0.9.8, tag: v0.9.8-hotfix6, origin/unbounce-v0.9.8, v0.9.8-hotfix6, v0.9.8-hotfix5)
Author: Gary Lucas <gary.lucas@unbounce.com>
Date:   Thu Apr 5 08:40:40 2018 -0700

    trying mumoshis fix

diff --git a/core/controlplane/config/templates/cloud-config-etcd b/core/controlplane/config/templates/cloud-config-etcd
index e85ca23c..b8a56949 100644
--- a/core/controlplane/config/templates/cloud-config-etcd
+++ b/core/controlplane/config/templates/cloud-config-etcd
@@ -140,7 +140,7 @@ coreos:
         RestartSec=5
         EnvironmentFile=-/etc/etcd-environment
         EnvironmentFile=-/var/run/coreos/etcdadm-environment
-        ExecStartPre=/usr/bin/sleep 60
+        ExecStartPre=/usr/bin/systemctl is-active format-etcd2-volume.service
         ExecStartPre=/usr/bin/systemctl is-active cfn-etcd-environment.service
         ExecStartPre=/usr/bin/mkdir -p /var/run/coreos/etcdadm/snapshots
         ExecStart=/opt/bin/etcdadm reconfigure

Cluster came up, I'm happy :D

@iherbmatt

iherbmatt commented Apr 5, 2018

I wonder if it has something to do with the fact that I'm using 0.9.9 instead of 0.9.8. The etcd cluster comes up fine, but my controllers now don't come online, even though they are built.

Here is the output I'm seeing looping in journalctl on the controllers:

output.txt

@mumoshu
Contributor

mumoshu commented Apr 6, 2018

@iherbmatt Hi! Kubelet seems fine to me. Can you share the full output from journalctl, rather than kubelet's log only?

@mumoshu
Contributor

mumoshu commented Apr 6, 2018

@davidmccormick

Isn't the point of the disasterRecovery option that it can recover nodes that have failed to be a part of the etcd cluster?

Partially yes, and partially no? I guess you may be confusing two things. Generally there are two major categories of failure cases: transient and permanent failures of etcd node(s).

A transient failure is when the underlying EC2 instance fails due to an AWS infrastructure issue. In this case, the ASG just recreates the EC2 instance to resolve the issue. If you have a 3-node etcd cluster, you may notice that you now have 3 ASGs in total, each matching one etcd node=member. We also have a pool of EIP+EBS pairs from which each etcd member borrows its identity and data dir.

A permanent failure is when, e.g., the EBS volume serving the etcd data dir is corrupted, so that you have to recover the etcd member from an etcd snapshot (not an EBS snapshot).

etcd.disasterRecovery.automated and etcd.snapshot.automated are for the latter case. And AFAICS, we have no simpler way to do that. Just marking every etcd-member unit type as simple results in losing that ability.
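
(For reference, those options live under etcd in cluster.yaml, roughly as in the sketch below; this is inferred from the option paths named above and may differ slightly between kube-aws versions:)

etcd:
  snapshot:
    automated: true
  disasterRecovery:
    automated: true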

That being said,

Isn't having a service reconfigure the type of etcd service a lot of added complexity?

Definitely. I'm open to ideas for setting the type to notify statically while somehow still covering these use-cases:

  1. Rolling update of etcd nodes, postponed and rolled back when a new member fails to join the existing cluster
    • kube-aws as of today achieves this by setting a cfn DependsOn from the previous to the next etcd ASG=node
  2. Initial bootstrap of the etcd cluster
    • DependsOn requires us to provision the etcd ASGs one by one, so we have to set the type to simple for the first N/2 etcd ASGs.

@mumoshu
Contributor

mumoshu commented Apr 6, 2018

@davidmccormick

What might make more sense is to deploy all 3 (n) at once when you perform a fresh cluster install but only roll in one-by-one when upgrading

Good point! This is what I gave up on when I first implemented the H/A etcd about a year ago. It may be time to consider alternative implementations or possible enhancements.

  • I'm not all that familiar with cloud-formation but I think I might have seen the controllers behaving in this way?

Did you mean kube-aws controller nodes? Then yes, controller nodes behave that way: there's a single multi-AZ ASG managing the desired number of controller EC2 instances.

Implementation-wise, we can't do the same for etcd nodes though.
We have to give each etcd node a stable network identity plus an EBS volume, and an EBS volume is tied to a single AZ.
What if we had a 3-AZ ASG and 3 EBS volumes, each tied to a separate AZ, for 3 etcd nodes, and then one of the AZs failed? The ASG would try to launch a replacement etcd node in one of the 2 available AZs, in which the EBS volume holding the original etcd data doesn't exist!
In that sense, I believe we have to live with the 1-etcd-ASG-per-AZ pattern.

But anyway,

This way quorum can be achieved before the cfn-signal is sent. In a fresh install I would personally also bring up the controllers and nodes without waiting too.

This should be discussed further. How about just omitting DependsOn on the etcd ASGs for the initial bootstrap via kube-aws up, and then adding the DependsOns on the subsequent kube-aws update run? Would the newly added DependsOns actually result in a rolling update of the etcd ASGs?

@iherbmatt

@mumoshu I was really excited to see my etcd nodes build successfully. I even logged in and saw they were all healthy, but then I saw the same CloudFormation timeouts on the controllers. I will redact some identifying data from the journalctl log and attach it. Thank you in advance for your time :)

@mumoshu
Contributor

mumoshu commented Apr 7, 2018

@iherbmatt Thanks!

If I could ask for more, sharing your cluster.yaml would also help! I know cluster bootstrapping shouldn't be such an exciting and hard thing to do, but there are certainly many failure cases, some of which can be pinpointed just by looking at your cluster.yaml.

@iherbmatt

@mumoshu Here is the cluster.yaml file.
cluster-yaml.txt

@iherbmatt

@mumoshu Here is the journalctl log from the controllers that would not start up.
journalctl-redacted.log

@Vince-Cercury

For me the issue starts with CoreOS 1688.5.3, released in April.
The previous version (1632.3.0, released February 15, 2018) is not affected.

With the patch from @mumoshu the etcd nodes get updated fine with CoreOS 1688.5.3. However, the controllers don't, and they roll back.

@iherbmatt

@mumoshu Any thoughts?

@iherbmatt

@Vincemd Are you unable to build clusters as well?

@mludvig
Author

mludvig commented Apr 13, 2018

@iherbmatt I had the same problem while testing the proposed fix, because I changed the cluster name in cluster.yaml but the certificates were still for the old name. That led to exactly the same issue you observe: after creating the etcd nodes, the controllers failed to build. Removing credentials/ and recreating the certs fixed it.
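
(For anyone doing the same, a sketch of the recreation step, assuming the standard kube-aws workflow; back up the old assets first and double-check the flags for your kube-aws version:)

mv credentials credentials.bak
kube-aws render credentials --generate-ca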

@Vince-Cercury

@iherbmatt Correct, with the latest version of CoreOS. If I use the Feb release of the AMI, then all is fine. A colleague of mine is also facing the same issue.

@iherbmatt

I wish that change would work for me.

It just sits there and eventually times out when trying to build the controllers.
I even used CoreOS-stable-1632.3.0-hvm (ami-862140e9)

It's been almost 2 weeks that I've been unable to build clusters :(

@mumoshu
Contributor

mumoshu commented Apr 14, 2018

@iherbmatt Sorry for the trouble!
Your etcd seems fine, but from the logs I see the Calico installer is complaining.

Perhaps you are hit by the recent regression in master? Would you mind trying kube-aws v0.9.10-rc.3? If it still doesn't work, trying k8s 1.9.3, which is the default in 0.9.10-rc.3, may change something.

@iherbmatt

Hi @mumoshu. I was able to generate a cluster with 0.9.10-rc.3, but it had to be running version 1.9.3, otherwise it wouldn't work. Another issue I have, however, is that I cannot use m5s for the etcd nodes. Any reason you can think of that might explain why? Thanks!

@Confushion
Contributor

Hi @mumoshu

Seems you were right about etcdadm-reconfigure.service wanting a formatted /var/lib/etcd2.
However, your fix did not wait for the service to become active; it just failed when the service was not active yet...
So the timeouts were still happening, unfortunately.

The patch below fixes this by actually depending on the var-lib-etcd2.mount unit (which is the one it should depend on, and which in turn depends on format-etcd2-volume.service anyway...).

Also the WantedBy line wasn't doing anything useful AFAIK...

Thanks.

diff --git a/core/controlplane/config/templates/cloud-config-etcd b/core/controlplane/config/templates/cloud-config-etcd
index fc077436..a291fdbf 100644
--- a/core/controlplane/config/templates/cloud-config-etcd
+++ b/core/controlplane/config/templates/cloud-config-etcd
@@ -151,6 +151,7 @@ coreos:
         Wants=cfn-etcd-environment.service
         After=cfn-etcd-environment.service
         After=network.target
+        After=var-lib-etcd2.mount

         [Service]
         Type=oneshot
@@ -158,7 +159,7 @@ coreos:
         RestartSec=5
         EnvironmentFile=-/etc/etcd-environment
         EnvironmentFile=-/var/run/coreos/etcdadm-environment
-        ExecStartPre=/usr/bin/systemctl is-active format-etcd2-volume.service
+        ExecStartPre=/usr/bin/systemctl is-active var-lib-etcd2.mount
         ExecStartPre=/usr/bin/systemctl is-active cfn-etcd-environment.service
         ExecStartPre=/usr/bin/mkdir -p /var/run/coreos/etcdadm/snapshots
         ExecStart=/opt/bin/etcdadm reconfigure
@@ -167,9 +168,6 @@ coreos:
         {{end -}}
         TimeoutStartSec=120

-        [Install]
-        WantedBy=cfn-etcd-environment.service
-
     - name: etcdadm-update-status.service
       enable: true
       content: |

@mumoshu
Contributor

mumoshu commented May 2, 2018

@iherbmatt Ah, sorry for the late reply! The bad news is that m5 and also c5 instances aren't supported out of the box yet, as mentioned in #1230.

The good news is that there is a patch composed of scripts and systemd units to adapt the NVMe devices to look like legacy devices so that they can be successfully consumed by kube-aws. The patch can be found in issues linked from #1230.

Please don't hesitate to ask me if you still had trouble on anything.

@mumoshu
Contributor

mumoshu commented May 2, 2018

@Confushion Certainly - I realized that my patch wasn't complete at all after seeing your work! Thank you so much for that.

Everyone, @Confushion has kindly contributed #1270 to make etcd bootstrapping even more reliable.
It is already merged and will be available in v0.9.10-rc.6 or v0.9.10.

@davidmccormick
Contributor

Implementation of my previous suggestion to bring the etcd servers up in parallel on a new cluster build. #1357
