This repository has been archived by the owner on Sep 30, 2020. It is now read-only.

Unable to create cluster with more than 1 etcd #1206

Closed
mludvig opened this issue Mar 29, 2018 · 72 comments

Comments

@mludvig

mludvig commented Mar 29, 2018

Hi, I've been trying for a few hours to create a cluster with 3 etcd instances, but it always times out. It looks like the ASG for Etcd0 is created first and its instance keeps trying to connect to the other two etcd instances, which do not yet exist, so the initialisation times out. If the Etcd1 and Etcd2 ASGs were created in parallel it would probably work, as the instances would start up simultaneously and could connect to each other.

I had the same result both with .etcd.memberIdentityProvider == eip and with eni: in both cases etcd0 tried to connect to the other, not-yet-existing nodes, either over EIP or over ENI, and timed out.

I'm using a pre-existing VPC with existing subnets: 3x private with NAT and 3x DMZ with public IPs enabled by default. I tried to put the etcd nodes both in the private and in the DMZ subnets, and both failed when more than 1 node was requested.
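
For context, the setup described above corresponds to an etcd section in cluster.yaml roughly like the following. This is only a sketch reconstructed from the settings mentioned in this issue; the subnet names are placeholders, not the actual ones:

etcd:
  count: 3
  memberIdentityProvider: eip   # also tried: eni
  subnets:
  - name: PrivateSubnet0
  - name: PrivateSubnet1
  - name: PrivateSubnet2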

@steinfletcher

Hi, I am also seeing similar behaviour today using both v0.9.8 and v0.9.9.

I have etcd.count: 3 deploying into an existing private subnet. I'm getting this from journalctl on the first etcd node, which is trying to reach the other 2 etcd nodes (which never launch).

Mar 29 19:37:20 ip-x.eu-west-1.compute.internal etcd-wrapper[1467]: 2018-03-29 19:37:20.996256 W | rafthttp: health check for peer b48943dd77f32763 could not connect: dial tcp x.x.x.x:2380: i/o timeout
Mar 29 19:37:20 ip-x.eu-west-1.compute.internal etcd-wrapper[1467]: 2018-03-29 19:37:20.996304 W | rafthttp: health check for peer 62fde287b92dfdf could not connect: dial tcp x.x.x.x:2380: i/o timeout

Looks like the cfn signal is never sent from Etcd0 and the control plane nested stack fails. From the cfn event log:

Etcd0 | Received 0 SUCCESS signal(s) out of 1. Unable to satisfy 100% MinSuccessfulInstancesPercent requirement

If I set etcd.count: 1 then everything works fine. I am a bit stumped and will continue poking around...

@luck02
Contributor

luck02 commented Mar 29, 2018

I'm seeing the same behaviour; we're on v0.9.8...

Mar 29 21:32:32 ip-yyyec2.internal etcd-wrapper[1476]: 2018-03-29 21:32:32.087296 E | etcdserver: publish error: etcdserver: request timed out
Mar 29 21:32:32 ip-yyyec2.internal etcd-wrapper[1476]: 2018-03-29 21:32:32.630920 I | raft: 719e986611adb617 is starting a new election at term 510
Mar 29 21:32:32 ip-yyyec2.internal etcd-wrapper[1476]: 2018-03-29 21:32:32.630962 I | raft: 719e986611adb617 became candidate at term 511
Mar 29 21:32:32 ip-yyyec2.internal etcd-wrapper[1476]: 2018-03-29 21:32:32.630974 I | raft: 719e986611adb617 received MsgVoteResp from 719e986611adb617 at term 511
Mar 29 21:32:32 ip-yyyec2.internal etcd-wrapper[1476]: 2018-03-29 21:32:32.630984 I | raft: 719e986611adb617 [logterm: 1, index: 3] sent MsgVote request to a0d815f4b93422a9 at term 511
Mar 29 21:32:32 ip-yyyec2.internal etcd-wrapper[1476]: 2018-03-29 21:32:32.630992 I | raft: 719e986611adb617 [logterm: 1, index: 3] sent MsgVote request to f8cabdc7bae4698a at term 511
Mar 29 21:32:34 ip-yyyec2.internal etcd-wrapper[1476]: 2018-03-29 21:32:34.430903 I | raft: 719e986611adb617 is starting a new election at term 511
Mar 29 21:32:34 ip-yyyec2.internal etcd-wrapper[1476]: 2018-03-29 21:32:34.430945 I | raft: 719e986611adb617 became candidate at term 512
Mar 29 21:32:34 ip-yyyec2.internal etcd-wrapper[1476]: 2018-03-29 21:32:34.430958 I | raft: 719e986611adb617 received MsgVoteResp from 719e986611adb617 at term 512
Mar 29 21:32:34 ip-yyyec2.internal etcd-wrapper[1476]: 2018-03-29 21:32:34.430968 I | raft: 719e986611adb617 [logterm: 1, index: 3] sent MsgVote request to a0d815f4b93422a9 at term 512
Mar 29 21:32:34 ip-yyyec2.internal etcd-wrapper[1476]: 2018-03-29 21:32:34.430977 I | raft: 719e986611adb617 [logterm: 1, index: 3] sent MsgVote request to f8cabdc7bae4698a at term 512
Mar 29 21:32:35 ip-yyyec2.internal etcd-wrapper[1476]: 2018-03-29 21:32:35.090013 W | rafthttp: health check for peer a0d815f4b93422a9 could not connect: dial tcp x.y.z.a:2380: i/o timeout
Mar 29 21:32:35 ip-yyyec2.internal etcd-wrapper[1476]: 2018-03-29 21:32:35.091244 W | rafthttp: health check for peer f8cabdc7bae4698a could not connect: dial tcp a.x.y.z:2380: i/o timeout

@luck02
Contributor

luck02 commented Mar 29, 2018

It's related to the wait signal. @steinfletcher + @mludvig, try adding this to your cluster.yaml:

waitSignal:
  enabled: false
  maxBatchSize: 1

The relevant template is:

"{{$etcdInstance.LogicalName}}": {
      "Type": "AWS::AutoScaling::AutoScalingGroup",
      "Properties": {
        "HealthCheckGracePeriod": 600,
        "HealthCheckType": "EC2",
        "LaunchConfigurationName": {
          "Ref": "{{$etcdInstance.LaunchConfigurationLogicalName}}"
        },
        "MaxSize": "1",
        "MetricsCollection": [
          {
            "Granularity": "1Minute"
          }
        ],
        "MinSize": "1",
        "Tags": [
          {
            "Key": "kubernetes.io/cluster/{{$.ClusterName}}",
            "PropagateAtLaunch": "true",
            "Value": "true"
          },
          {
            "Key": "Name",
            "PropagateAtLaunch": "true",
            "Value": "{{$.ClusterName}}-{{$.StackName}}-kube-aws-etcd-{{$etcdIndex}}"
          },
          {
            "Key": "kube-aws:role",
            "PropagateAtLaunch": "true",
            "Value": "etcd"
          }
        ],
        "VPCZoneIdentifier": [
          {{$etcdInstance.SubnetRef}}
        ]
      },
      {{if $.WaitSignal.Enabled}}
      "CreationPolicy" : {
        "ResourceSignal" : {
          "Count" : "1",
          "Timeout" : "{{$.Controller.CreateTimeout}}"
        }
      },
      {{end}}
      "UpdatePolicy" : {
        "AutoScalingRollingUpdate" : {
          "MinInstancesInService" : "0",
          "MaxBatchSize" : "1",
          {{if $.WaitSignal.Enabled}}
          "WaitOnResourceSignals" : "true",
          "PauseTime": "{{$.Controller.CreateTimeout}}"
          {{else}}
          "PauseTime": "PT2M"
          {{end}}
        }
      },

I was able to get this working by disabling the signal. The next question is: how did this ever work? Something underlying in the cfn engine must have changed with respect to simultaneous execution.

Here's my etcd log after setting wait to false:

-- Logs begin at Thu 2018-03-29 22:20:06 UTC. --
Mar 29 22:25:21 ip-d.d.d.d.ec2.internal systemd[1]: Started Session 1 of user core.
Mar 29 22:25:21 ip-d.d.d.d.ec2.internal systemd-logind[780]: New session 1 of user core.
Mar 29 22:25:21 ip-d.d.d.d.ec2.internal systemd[1530]: Reached target Paths.
Mar 29 22:25:21 ip-d.d.d.d.ec2.internal systemd[1530]: Reached target Sockets.
Mar 29 22:25:21 ip-d.d.d.d.ec2.internal systemd[1530]: Reached target Timers.
Mar 29 22:25:21 ip-d.d.d.d.ec2.internal systemd[1530]: Reached target Basic System.
Mar 29 22:25:21 ip-d.d.d.d.ec2.internal systemd[1530]: Reached target Default.
Mar 29 22:25:21 ip-d.d.d.d.ec2.internal systemd[1530]: Startup finished in 23ms.
Mar 29 22:25:21 ip-d.d.d.d.ec2.internal systemd[1]: Started User Manager for UID 500.
Mar 29 22:25:23 ip-d.d.d.d.ec2.internal etcd-wrapper[1465]: 2018-03-29 22:25:23.592989 W | rafthttp: health check for peer 596daac612174e37 could not connect: dial tcp d.d.d.d:2380: i/o timeout
Mar 29 22:25:25 ip-d.d.d.d.ec2.internal etcd-wrapper[1465]: 2018-03-29 22:25:25.906323 W | etcdserver: failed to reach the peerURL(https://d.d.d.d.compute-1.amazonaws.com:2380) of member 596daac612174e37 (Get https://d.d.d.d.compute-1.amazonaws.com:2380/version: dial tcp d.d.d.d:2380: i/o timeout)
Mar 29 22:25:25 ip-d.d.d.d.ec2.internal etcd-wrapper[1465]: 2018-03-29 22:25:25.906359 W | etcdserver: cannot get the version of member 596daac612174e37 (Get https://d.d.d.d.compute-1.amazonaws.com:2380/version: dial tcp d.d.d.d:2380: i/o timeout)
Mar 29 22:25:28 ip-d.d.d.d.ec2.internal etcd-wrapper[1465]: 2018-03-29 22:25:28.593180 W | rafthttp: health check for peer 596daac612174e37 could not connect: dial tcp d.d.d.d:2380: i/o timeout
Mar 29 22:25:31 ip-d.d.d.d.ec2.internal etcd-wrapper[1465]: 2018-03-29 22:25:31.108598 W | etcdserver: failed to reach the peerURL(https://d.d.d.d.compute-1.amazonaws.com:2380) of member 596daac612174e37 (Get https://d.d.d.d.compute-1.amazonaws.com:2380/version: dial tcp d.d.d.d:2380: i/o timeout)
Mar 29 22:25:31 ip-d.d.d.d.ec2.internal etcd-wrapper[1465]: 2018-03-29 22:25:31.108630 W | etcdserver: cannot get the version of member 596daac612174e37 (Get https://d.d.d.d.compute-1.amazonaws.com:2380/version: dial tcp d.d.d.d:2380: i/o timeout)
Mar 29 22:25:33 ip-d.d.d.d.ec2.internal etcd-wrapper[1465]: 2018-03-29 22:25:33.593363 W | rafthttp: health check for peer 596daac612174e37 could not connect: dial tcp d.d.d.d:2380: i/o timeout
Mar 29 22:25:36 ip-d.d.d.d.ec2.internal etcd-wrapper[1465]: 2018-03-29 22:25:36.310906 W | etcdserver: failed to reach the peerURL(https://d.d.d.d.compute-1.amazonaws.com:2380) of member 596daac612174e37 (Get https://d.d.d.d.compute-1.amazonaws.com:2380/version: dial tcp d.d.d.d:2380: i/o timeout)
Mar 29 22:25:36 ip-d.d.d.d.ec2.internal etcd-wrapper[1465]: 2018-03-29 22:25:36.310939 W | etcdserver: cannot get the version of member 596daac612174e37 (Get https://d.d.d.d.compute-1.amazonaws.com:2380/version: dial tcp d.d.d.d:2380: i/o timeout)
Mar 29 22:25:38 ip-d.d.d.d.ec2.internal etcd-wrapper[1465]: 2018-03-29 22:25:38.593555 W | rafthttp: health check for peer 596daac612174e37 could not connect: dial tcp d.d.d.d:2380: i/o timeout
Mar 29 22:25:41 ip-d.d.d.d.ec2.internal etcd-wrapper[1465]: 2018-03-29 22:25:41.513243 W | etcdserver: failed to reach the peerURL(https://d.d.d.d.compute-1.amazonaws.com:2380) of member 596daac612174e37 (Get https://d.d.d.d.compute-1.amazonaws.com:2380/version: dial tcp d.d.d.d:2380: i/o timeout)
Mar 29 22:25:41 ip-d.d.d.d.ec2.internal etcd-wrapper[1465]: 2018-03-29 22:25:41.513275 W | etcdserver: cannot get the version of member 596daac612174e37 (Get https://d.d.d.d.compute-1.amazonaws.com:2380/version: dial tcp d.d.d.d:2380: i/o timeout)
Mar 29 22:25:41 ip-d.d.d.d.ec2.internal etcd-wrapper[1465]: 2018-03-29 22:25:41.901783 I | rafthttp: peer 596daac612174e37 became active
Mar 29 22:25:41 ip-d.d.d.d.ec2.internal etcd-wrapper[1465]: 2018-03-29 22:25:41.901825 I | rafthttp: established a TCP streaming connection with peer 596daac612174e37 (stream MsgApp v2 reader)
Mar 29 22:25:41 ip-d.d.d.d.ec2.internal etcd-wrapper[1465]: 2018-03-29 22:25:41.902251 I | rafthttp: established a TCP streaming connection with peer 596daac612174e37 (stream Message reader)
Mar 29 22:25:45 ip-d.d.d.d.ec2.internal etcd-wrapper[1465]: 2018-03-29 22:25:45.526774 I | etcdserver: updating the cluster version from 3.0 to 3.2
Mar 29 22:25:45 ip-d.d.d.d.ec2.internal etcd-wrapper[1465]: 2018-03-29 22:25:45.529138 N | etcdserver/membership: updated the cluster version from 3.0 to 3.2
Mar 29 22:25:45 ip-d.d.d.d.ec2.internal etcd-wrapper[1465]: 2018-03-29 22:25:45.529325 I | etcdserver/api: enabled capabilities for version 3.2

@luck02
Contributor

luck02 commented Mar 29, 2018

We've asked our AWS Technical Account Managers to see if the CF team can shed any insight.

The other thing I'm wondering about, and haven't had a chance to check yet: perhaps the etcd version / image isn't locked down and something changed there? I'll look later this evening when I have time.

@steinfletcher

Thanks @luck02. "Something underlying in the cfn engine must have changed with respect to simultaneous execution." Yeah, I'm also suspecting this.

@mumoshu
Contributor

mumoshu commented Apr 2, 2018

Each etcd node has a dedicated ASG which depends on the next etcd node for sequential launch and rolling update, so there should be no simultaneous execution (if that's what you meant).
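
For illustration, the one-at-a-time ordering visible in the stack events is what a CloudFormation DependsOn chain between the per-node etcd ASGs produces. A simplified sketch, not the exact generated template (the logical names here are assumptions):

"Etcd1": {
  "Type": "AWS::AutoScaling::AutoScalingGroup",
  "DependsOn": "Etcd0",
  "Properties": {
    "MinSize": "1",
    "MaxSize": "1",
    "LaunchConfigurationName": { "Ref": "Etcd1LaunchConfiguration" },
    "VPCZoneIdentifier": [ { "Ref": "Etcd1Subnet" } ]
  }
}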

The first etcd node in your cluster should just start without waiting for any other etcd nodes, as implemented in etcdadm, so in my understanding something like what's reported here shouldn't happen normally.

I had troubleshot a case before where certain user-provided EC2 tags on etcd nodes confused etcdadm so that it was unable to calculate the correct number of "running etcd nodes", and therefore it failed to bootstrap any etcd cluster with more than 1 node.

Can you confirm whether you have any such stackTags in cluster.yaml, and whether omitting them resolves the issue? Thx!

@mludvig
Author

mludvig commented Apr 2, 2018

Hi, thanks for the answer. Nope, I don't have stackTags set:

# AWS Tags for cloudformation stack resources
#stackTags:
#  Name: "Kubernetes"
#  Environment: "Production"

@mumoshu
Contributor

mumoshu commented Apr 2, 2018

@mludvig Thx! Would you mind sharing the result of journalctl -u etcdadm-reconfigure.service on your failing etcd node? A GitHub gist would be nice.

@mludvig
Author

mludvig commented Apr 2, 2018

Here:

ip-10-0-10-151 ~ # journalctl -u etcdadm-reconfigure.service
-- Logs begin at Mon 2018-04-02 08:59:20 UTC, end at Mon 2018-04-02 09:04:37 UTC. --
Apr 02 09:00:09 ip-10-0-10-151.ap-southeast-2.compute.internal systemd[1]: Starting etcdadm reconfigure runner...
Apr 02 09:00:09 ip-10-0-10-151.ap-southeast-2.compute.internal etcdadm[1370]: declare -x ETCDCTL_CACERT="/etc/ssl/certs/etcd-trusted-ca.pem"
Apr 02 09:00:09 ip-10-0-10-151.ap-southeast-2.compute.internal etcdadm[1370]: declare -x ETCDCTL_CA_FILE="/etc/ssl/certs/etcd-trusted-ca.pem"
Apr 02 09:00:09 ip-10-0-10-151.ap-southeast-2.compute.internal etcdadm[1370]: declare -x ETCDCTL_CERT="/etc/ssl/certs/etcd-client.pem"
Apr 02 09:00:09 ip-10-0-10-151.ap-southeast-2.compute.internal etcdadm[1370]: declare -x ETCDCTL_CERT_FILE="/etc/ssl/certs/etcd-client.pem"
Apr 02 09:00:09 ip-10-0-10-151.ap-southeast-2.compute.internal etcdadm[1370]: declare -x ETCDCTL_KEY="/etc/ssl/certs/etcd-client-key.pem"
Apr 02 09:00:09 ip-10-0-10-151.ap-southeast-2.compute.internal etcdadm[1370]: declare -x ETCDCTL_KEY_FILE="/etc/ssl/certs/etcd-client-key.pem"
Apr 02 09:00:09 ip-10-0-10-151.ap-southeast-2.compute.internal sudo[1376]:     root : TTY=unknown ; PWD=/ ; USER=root ; COMMAND=/bin/[ -w /var/run/coreos/etcdadm ]
Apr 02 09:00:09 ip-10-0-10-151.ap-southeast-2.compute.internal sudo[1376]: pam_unix(sudo:session): session opened for user root by (uid=0)
Apr 02 09:00:09 ip-10-0-10-151.ap-southeast-2.compute.internal sudo[1376]: pam_unix(sudo:session): session closed for user root
Apr 02 09:00:10 ip-10-0-10-151.ap-southeast-2.compute.internal sudo[1391]:     root : TTY=unknown ; PWD=/ ; USER=root ; COMMAND=/bin/[ -w /var/run/coreos/etcdadm/snapshots ]
Apr 02 09:00:10 ip-10-0-10-151.ap-southeast-2.compute.internal sudo[1391]: pam_unix(sudo:session): session opened for user root by (uid=0)
Apr 02 09:00:10 ip-10-0-10-151.ap-southeast-2.compute.internal sudo[1391]: pam_unix(sudo:session): session closed for user root
Apr 02 09:00:10 ip-10-0-10-151.ap-southeast-2.compute.internal etcdadm[1370]: panic! etcd data dir "/var/lib/etcd2" does not exist
Apr 02 09:00:10 ip-10-0-10-151.ap-southeast-2.compute.internal systemd[1]: Started etcdadm reconfigure runner.
ip-10-0-10-151 ~ # 

Interestingly /var/lib/etcd2 exists:

ip-10-0-10-151 ~ # find /var/lib/etcd2
/var/lib/etcd2
/var/lib/etcd2/member
/var/lib/etcd2/member/snap
/var/lib/etcd2/member/snap/db
/var/lib/etcd2/member/wal
/var/lib/etcd2/member/wal/0.tmp
/var/lib/etcd2/member/wal/0000000000000000-0000000000000000.wal
/var/lib/etcd2/lost+found
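
(A quick way to confirm whether /var/lib/etcd2 is the mounted data volume rather than just a directory on the root filesystem, and to check the state of the related units; a hedged diagnostic sketch using standard tools and the unit names that come up later in this thread:)

findmnt /var/lib/etcd2
systemctl status var-lib-etcd2.mount format-etcd2-volume.service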

@luck02
Contributor

luck02 commented Apr 2, 2018

@mumoshu FWIW, here's our stack tags:

# AWS Tags for cloudformation stack resources
stackTags:
  environment: "{{ stack_env }}"
  project:     "{{ PROJECT_NAME }}"
  owner:       "{{ PROJECT_OWNER }}"

Note that this hasn't changed, and we're using a template to populate it; on execution it would look more like:

# AWS Tags for cloudformation stack resources
stackTags:
  environment: "test"
  project:     "ub-data-infrastructure/cluster"
  owner:       "dataops"

Again, this hasn't changed. I'd be interested in hearing more along the lines of: "I had troubleshot a case before where certain user-provided EC2 tags on etcd nodes confused etcdadm so that it was unable to calculate the correct number of "running etcd nodes", and therefore it failed to bootstrap any etcd cluster with more than 1 node."

In the meantime, if you'd like to see our etcd logs as requested above I can provide them as well; I just need to undo the waitSignal change:

waitSignal:
  enabled: false
  maxBatchSize: 1

@jcrugzz

jcrugzz commented Apr 3, 2018

Yeah, this started happening to me last Friday when I tried to create a new cluster with a basically identical config to a cluster I created a few weeks earlier. Something subtle definitely must have changed. I'm currently afraid to run kube-aws update on any of my clusters, but I need to soon. Can I trust that waitSignal workaround for updating a live prod cluster? Or do I need to think about other options?

I have a hard time thinking it's a stackTags issue in my case since it was never a problem previously.

How this manifested for me was an "etcdadm-check.service: Failed with result 'exit-code'." on the first etcd node that tried to come up, preventing anything else from happening.
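
(For anyone else hitting this, the failing unit can be inspected directly on the node; a small sketch using standard systemd tooling:)

systemctl status etcdadm-check.service
journalctl -u etcdadm-check.service --no-pager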

@luck02
Contributor

luck02 commented Apr 3, 2018

@jcrugzz I'm just working my way through some fixes (waitSignal included). I expect to be deploying to our production this evening / tomorrow. I will update with my experiences. I am running into some other issues, but they may be unrelated to this issue.

@jcrugzz

jcrugzz commented Apr 3, 2018

Thanks @luck02 appreciate it!

@iherbmatt

Hey guys. I disabled the wait signal and it generated all the appropriate machines; however, the masters are no longer healthy. The cluster.yaml file I'm using is one I've been using since 0.9.9 originally came out. Should it work simply by uncommenting the waitSignal block and setting it to disabled?

@luck02
Contributor

luck02 commented Apr 4, 2018

@jcrugzz / everyone else.

I've burned quite a bit of time testing this. I don't think disabling waitSignal is going to be viable; quite a few of my validation steps start failing randomly. Of course YMMV, but we want to validate that our cluster is healthy at the end, and disabling waitSignal makes that challenging.

I did hear back from our AWS technical account managers. They claim 0 changes in the underlying CFN code. They've offered to investigate a failed stack for us, which I'll set up tomorrow morning (PST). I didn't see the etcd container pinned to a specific version, so my next theory is that if the image isn't locked down we could be pulling a different container and seeing drift there (i.e. perhaps it's not reporting success / failure in the same way, etc.).

I'll continue investigating.

@ktateish
Contributor

ktateish commented Apr 4, 2018

I'm facing the same problem too.
I noticed some behavior:

  • When it failed, journalctl -u etcdadm-reconfigure on the etcd0 node showed logs like @mludvig reported.
  • When I ran systemctl restart etcdadm-reconfigure on the etcd0 node after kube-aws up failed, etcdadm-reconfigure looked like it was working properly (the logs show it pulling container images successfully).
  • I tried kube-aws up several times, and so far it always succeeds in my environment after applying the following patch to the userdata:
diff --git a/userdata/cloud-config-etcd b/userdata/cloud-config-etcd
index cf306f6..f613337 100644
--- a/userdata/cloud-config-etcd
+++ b/userdata/cloud-config-etcd
@@ -156,6 +156,7 @@ coreos:
         RestartSec=5
         EnvironmentFile=-/etc/etcd-environment
         EnvironmentFile=-/var/run/coreos/etcdadm-environment
+        ExecStartPre=/usr/bin/sleep 60
         ExecStartPre=/usr/bin/systemctl is-active cfn-etcd-environment.service
         ExecStartPre=/usr/bin/mkdir -p /var/run/coreos/etcdadm/snapshots
         ExecStart=/opt/bin/etcdadm reconfigure

I think the etcdadm-reconfigure unit is started too early during the etcd nodes' boot.
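
(One way to check that ordering hypothesis on a node is to look at the unit's startup chain and declared dependencies; a sketch using standard systemd tooling:)

systemd-analyze critical-chain etcdadm-reconfigure.service
systemctl show etcdadm-reconfigure.service -p After -p Wants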

@luck02
Contributor

luck02 commented Apr 4, 2018

I just checked and a new version of etcd was released 6 days ago, so presumably it's related.

I'm just cleaning up a semi-related mess and then going to set our etcd version back to what was out a month ago. I'm assuming that's going to solve the issue as well.

I'll report back when I'm done.

etcd version is set here:
https://github.com/kubernetes-incubator/kube-aws/blob/master/core/controlplane/config/templates/cluster.yaml#L648

I'd expect to be testing that this evening / tomorrow.

@kylegoch

kylegoch commented Apr 4, 2018

Seeing the exact same behavior as well. We were testing a dev build. Had a known working cluster.yaml, went to recreate and got the same errors as above.

We are using etcd version 3.2.10

Edit: Using @ktateish's patch from above on the userdata made everything work again. Wonder why it broke in the first place.

@luck02
Contributor

luck02 commented Apr 4, 2018

@kylegoch So you've pinned your etcd version to v3.2.1, which according to GitHub was built on Jun 23, 2017?

Ok, that's really odd. Something changed... If it wasn't CFN and it wasn't etcd...

I'm going to experiment with pinning my version to something older than last month just to replicate the issue with a pinned version (previously we weren't pinning the version)

@kylegoch

kylegoch commented Apr 4, 2018

We are using 3.2.10 from November. Not sure why that version, but that's what we have always used.

And the cluster.yaml I'm working with right now worked just fine about 10 days ago.

@iherbmatt

iherbmatt commented Apr 4, 2018 via email

@mludvig
Author

mludvig commented Apr 4, 2018

I can confirm that @ktateish's workaround with sleep 60 works for me. I just created a cluster with 3 etcd nodes:

+00:02:57	Controlplane	CREATE_IN_PROGRESS      		Etcd0                 
+00:02:57	Controlplane	CREATE_IN_PROGRESS      		Etcd0                 	"Resource creation Initiated"
+00:06:24	Controlplane	CREATE_IN_PROGRESS      		Etcd0                 	"Received SUCCESS signal with UniqueId i-0b5da874acdc0e7bb"
+00:06:25	Controlplane	CREATE_COMPLETE         		Etcd0                 
+00:06:30	Controlplane	CREATE_IN_PROGRESS      		Etcd1                 
+00:06:31	Controlplane	CREATE_IN_PROGRESS      		Etcd1                 	"Resource creation Initiated"
+00:09:56	Controlplane	CREATE_IN_PROGRESS      		Etcd1                 	"Received SUCCESS signal with UniqueId i-0be602b3afcacc247"
+00:09:58	Controlplane	CREATE_COMPLETE         		Etcd1                 
+00:10:02	Controlplane	CREATE_IN_PROGRESS      		Etcd2                 
+00:10:03	Controlplane	CREATE_IN_PROGRESS      		Etcd2                 	"Resource creation Initiated"
+00:12:47	Controlplane	CREATE_IN_PROGRESS      		Etcd2                 	"Received SUCCESS signal with UniqueId i-097a60f76baa844f7"
+00:12:48	Controlplane	CREATE_COMPLETE         		Etcd2                 

@luck02
Contributor

luck02 commented Apr 4, 2018

Now I'm wondering if the versioning provided in the cluster.yaml is effective. I just added this to my cluster.yaml config:

etcd:
  # etc
  version: 3.3.1

but when I log into the etcd from my failed cluster I get:

core@ip-x-y-z-etc ~ $ etcdctl version
etcdctl version: 3.2.15
API version: 3.2

3.2.15 was built in January, and I see it's a failed cluster, so presumably that's the end of the line for this enquiry. I'll do the sleep workaround for now.

@iherbmatt

iherbmatt commented Apr 4, 2018 via email

@luck02
Contributor

luck02 commented Apr 4, 2018

@iherbmatt It depends on your setup. For us it's a bit complicated, and the easiest way for me to do this is to apply a hotfix to the kube-aws source code and build myself a new hotfix version. But that's because our deployment pipeline doesn't leave the artifacts for me to locally jury-rig. We do have some pipeline stuff I could jury-rig to apply the fix, but it's really ugly (ansible - lineinfile - regex etc).

@iherbmatt

iherbmatt commented Apr 4, 2018 via email

@luck02
Contributor

luck02 commented Apr 4, 2018

That's correct. In this case there's no commit to cherry-pick, but applying a diff amounts to the same thing. I'm running off of v0.9.8, so this is the patch I applied:

commit 19ad26bd147ec9882dfb7e67f5aa854a331cf2cd (HEAD -> v0.9.8-hotfix4, tag: v0.9.8-hotfix4)
Author: Gary Lucas <gary.lucas@unbounce.com>
Date:   Wed Apr 4 16:03:06 2018 -0700

    more fixii

diff --git a/core/controlplane/config/templates/cloud-config-etcd b/core/controlplane/config/templates/cloud-config-etcd
index 2d25d487..c8ae763b 100644
--- a/core/controlplane/config/templates/cloud-config-etcd
+++ b/core/controlplane/config/templates/cloud-config-etcd
@@ -166,6 +166,7 @@ coreos:
         RestartSec=5
         EnvironmentFile=-/etc/etcd-environment
         EnvironmentFile=-/var/run/coreos/etcdadm-environment
+        ExecStartPre=/usr/bin/sleep 60
         ExecStart=/opt/bin/etcdadm member_status_set_started
         {{if .Etcd.Snapshot.IsAutomatedForEtcdVersion .Etcd.Version -}}
         ExecStartPost=/usr/bin/systemctl start etcdadm-save.timer

Mind you, my new stack isn't up yet.

@luck02
Contributor

luck02 commented Apr 4, 2018

Goddamn it, I put the 'fix' in the wrong stanza (the update service instead of reconfigure).

I'll try again this eve.

@luck02
Contributor

luck02 commented Apr 4, 2018

Applied this:

commit 4d6a8b89431828638a5414a5a73b4404c58514e9 (HEAD -> v0.9.8-hotfix5, tag: v0.9.8-hotfix5, v0.9.8-hotfix4)
Author: Gary Lucas <gary.lucas@unbounce.com>
Date:   Wed Apr 4 16:46:36 2018 -0700

    moved the sleep command

diff --git a/core/controlplane/config/templates/cloud-config-etcd b/core/controlplane/config/templates/cloud-config-etcd
index c8ae763b..e85ca23c 100644
--- a/core/controlplane/config/templates/cloud-config-etcd
+++ b/core/controlplane/config/templates/cloud-config-etcd
@@ -140,6 +140,7 @@ coreos:
         RestartSec=5
         EnvironmentFile=-/etc/etcd-environment
         EnvironmentFile=-/var/run/coreos/etcdadm-environment
+        ExecStartPre=/usr/bin/sleep 60
         ExecStartPre=/usr/bin/systemctl is-active cfn-etcd-environment.service
         ExecStartPre=/usr/bin/mkdir -p /var/run/coreos/etcdadm/snapshots
         ExecStart=/opt/bin/etcdadm reconfigure
@@ -166,7 +167,6 @@ coreos:
         RestartSec=5
         EnvironmentFile=-/etc/etcd-environment
         EnvironmentFile=-/var/run/coreos/etcdadm-environment
-        ExecStartPre=/usr/bin/sleep 60
         ExecStart=/opt/bin/etcdadm member_status_set_started
         {{if .Etcd.Snapshot.IsAutomatedForEtcdVersion .Etcd.Version -}}
         ExecStartPost=/usr/bin/systemctl start etcdadm-save.timer

@davidmccormick
Contributor

davidmccormick commented Apr 5, 2018

Isn't having a service that reconfigures the type of the etcd service a lot of added complexity? Isn't the point of the disasterRecovery option that it can recover nodes that have failed to be part of the etcd cluster? I would rather it be left as notify, but with all etcd nodes initially created in parallel. What do you think?

@ktateish
Contributor

ktateish commented Apr 5, 2018

Oh, I missed something. I thought it would also need to be fixed on starting etcdadm-reconfigure, in addition to your patch. But your patch alone has the same effect in a nicer way. Am I right?

@iherbmatt

Hi Everyone,

I'm seeing this now:

member [omitted] is healthy: got healthy result from https://[omitted].compute.amazonaws.com:2379
member [omitted] is healthy: got healthy result from https://[omitted].compute.amazonaws.com:2379
member [omitted] is healthy: got healthy result from https://[omitted].compute.amazonaws.com:2379
cluster is healthy

It appears etcd is healthy, and I'm seeing this in the controller logs as well. I'm having trouble getting the controllers to come up now, however. I'm going to try to build it again and see what happens.

@luck02
Contributor

luck02 commented Apr 5, 2018

I applied this:

commit 65722a891eca5e8a5ff9538e2837d7bbeb84390f (HEAD -> unbounce-v0.9.8, tag: v0.9.8-hotfix6, origin/unbounce-v0.9.8, v0.9.8-hotfix6, v0.9.8-hotfix5)
Author: Gary Lucas <gary.lucas@unbounce.com>
Date:   Thu Apr 5 08:40:40 2018 -0700

    trying mumoshis fix

diff --git a/core/controlplane/config/templates/cloud-config-etcd b/core/controlplane/config/templates/cloud-config-etcd
index e85ca23c..b8a56949 100644
--- a/core/controlplane/config/templates/cloud-config-etcd
+++ b/core/controlplane/config/templates/cloud-config-etcd
@@ -140,7 +140,7 @@ coreos:
         RestartSec=5
         EnvironmentFile=-/etc/etcd-environment
         EnvironmentFile=-/var/run/coreos/etcdadm-environment
-        ExecStartPre=/usr/bin/sleep 60
+        ExecStartPre=/usr/bin/systemctl is-active format-etcd2-volume.service
         ExecStartPre=/usr/bin/systemctl is-active cfn-etcd-environment.service
         ExecStartPre=/usr/bin/mkdir -p /var/run/coreos/etcdadm/snapshots
         ExecStart=/opt/bin/etcdadm reconfigure

Cluster came up, I'm happy :D

@iherbmatt

iherbmatt commented Apr 5, 2018

I wonder if it has something to do with the fact that I'm using 0.9.9 instead of 0.9.8. The etcd cluster comes up fine, but my controllers now don't come online, even though they are built.

Here is the output I'm seeing looping in journalctl on the controllers:

output.txt

@mumoshu
Contributor

mumoshu commented Apr 6, 2018

@iherbmatt Hi! Kubelet seems fine to me. Can you share the full output from journalctl, rather than kubelet's log only?

@mumoshu
Contributor

mumoshu commented Apr 6, 2018

@davidmccormick

Isn't the point of the disasterRecovery option that it can recover nodes that have failed to be a part of the etcd cluster?

Partially yes, and partially no? I guess you may be confusing two things. Generally there are two major categories of failure cases: transient and permanent failures of etcd node(s).

A transient failure is when the underlying EC2 instance fails due to an AWS infrastructure issue. In this case, the ASG just recreates the EC2 instance to resolve the issue. If you have a 3-node etcd cluster, you may notice that you now have 3 ASGs in total, each matching one etcd node=member. We also have a pool of EIP+EBS pairs from which each etcd member borrows its identity and data dir.

A permanent failure is when, e.g., the EBS volume serving the etcd data dir is corrupted, so that you have to recover the etcd member from an etcd snapshot (not an EBS snapshot).

etcd.disasterRecovery.automated and etcd.snapshot.automated are for the latter case. And AFAICS, we have no simpler way to do that. Just marking every etcd-member unit type as simple results in losing that ability.
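
(For reference, those options live under etcd in cluster.yaml, roughly as in the sketch below; this is inferred from the option paths named above and may differ slightly between kube-aws versions:)

etcd:
  snapshot:
    automated: true
  disasterRecovery:
    automated: true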

That being said,

Isn't having a service reconfigure the type of etcd service a lot of added complexity?

Definitely. I'm open to ideas for setting the type to notify statically while somehow still covering these use-cases:

  1. Rolling update of etcd nodes, postponed and rolled back when a new member fails to join the existing cluster
    • kube-aws as of today achieves this by setting a cfn DependsOn from the previous to the next etcd ASG=node
  2. Initial bootstrap of the etcd cluster
    • DependsOn requires us to provision the etcd ASGs one by one, so we have to set the type to simple for the first N/2 etcd ASGs.

@mumoshu
Contributor

mumoshu commented Apr 6, 2018

@davidmccormick

What might make more sense is to deploy all 3 (n) at once when you perform a fresh cluster install but only roll in one-by-one when upgrading

Good point! This is what I gave up on when I first implemented the H/A etcd about a year ago. It may be time to consider alternative implementations or possible enhancements.

  • I'm not all that familiar with cloud-formation but I think I might have seen the controllers behaving in this way?

Did you mean kube-aws controller nodes? Then yes, controller nodes behave that way: there's a single multi-AZ ASG managing the desired number of controller EC2 instances.

Implementation-wise, we can't do the same for etcd nodes though.
We have to give each etcd node a stable network identity plus an EBS volume, and an EBS volume is tied to a single AZ.
What if we had a 3-AZ ASG and 3 EBS volumes, each tied to a separate AZ, for 3 etcd nodes, and then one of the AZs failed? The ASG would try to launch a replacement etcd node in one of the 2 available AZs, in which the EBS volume holding the original etcd data doesn't exist!
In that sense, I believe we have to live with the 1-etcd-ASG-per-AZ pattern.

But anyway,

This way quorum can be achieved before the cfn-signal is sent. In a fresh install I would personally also bring up the controllers and nodes without waiting too.

This should be discussed further. How about just omitting DependsOn on the etcd ASGs for the initial bootstrap via kube-aws up, and then adding the DependsOns on the subsequent kube-aws update run? Would the newly added DependsOns actually result in a rolling update of the etcd ASGs?

@iherbmatt

@mumoshu I was really excited to see my etcd nodes build successfully. I even logged in and saw they were all healthy, but then I saw the same CloudFormation timeouts on the controllers. I will redact some identifying data from the journalctl log and attach it. Thank you in advance for your time :)

@mumoshu
Contributor

mumoshu commented Apr 7, 2018

@iherbmatt Thanks!

If I could ask for more, sharing your cluster.yaml would also help! I know cluster bootstrapping shouldn't be such an exciting and hard thing to do, but there are certainly many failure cases, some of which can be pinpointed just by looking at your cluster.yaml.

@iherbmatt

@mumoshu Here is the cluster.yaml file.
cluster-yaml.txt

@iherbmatt

@mumoshu Here is the journalctl log from the controllers that would not start up.
journalctl-redacted.log

@Vince-Cercury

For me the issue starts with CoreOS 1688.5.3, released in April.
The previous version (1632.3.0, released February 15, 2018) is not affected.

With the patch from @mumoshu the etcd nodes get updated fine with CoreOS 1688.5.3. However, the controllers don't, and they roll back.

@iherbmatt

@mumoshu Any thoughts?

@iherbmatt

@Vincemd Are you unable to build clusters as well?

@mludvig
Author

mludvig commented Apr 13, 2018

@iherbmatt I had the same problem while testing the proposed fix, because I changed the cluster name in cluster.yaml but the certificates were still for the old name. That led to exactly the same issue you observe: after creating the etcd nodes, the controllers failed to build. Removing credentials/ and recreating the certs fixed it.
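
(For anyone doing the same, a sketch of the recreation step, assuming the standard kube-aws workflow; back up the old assets first and double-check the flags for your kube-aws version:)

mv credentials credentials.bak
kube-aws render credentials --generate-ca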

@Vince-Cercury

@iherbmatt Correct, with the latest version of CoreOS. If I use the Feb release of the AMI, then all is fine. A colleague of mine is also facing the same issue.

@iherbmatt

I wish that change would work for me.

It just sits there and eventually times out when trying to build the controllers.
I even used CoreOS-stable-1632.3.0-hvm (ami-862140e9)

It's been almost 2 weeks that I've been unable to build clusters :(

@mumoshu
Contributor

mumoshu commented Apr 14, 2018

@iherbmatt Sorry for the trouble!
Your etcd seems fine, but from the logs I see the Calico installer is complaining.

Perhaps you are hit by the recent regression in master? Would you mind trying kube-aws v0.9.10-rc.3? If it still doesn't work, trying k8s 1.9.3, which is the default in 0.9.10-rc.3, may change something.

@iherbmatt

Hi @mumoshu. I was able to generate a cluster with 0.9.10-rc.3, but it had to be running version 1.9.3, otherwise it wouldn't work. Another issue I have, however, is that I cannot use m5s for the etcd nodes. Any reason you can think of that might explain why? Thanks!

@Confushion
Contributor

Hi @mumoshu

Seems you were right about etcdadm-reconfigure.service wanting a formatted /var/lib/etcd2.
However, your fix did not wait for the service to become active; it just failed when the service was not active yet...
So the timeouts were still happening, unfortunately.

The patch below fixes this by actually depending on the var-lib-etcd2.mount unit (which is the one it should depend on, and which in turn depends on format-etcd2-volume.service anyway...).

Also the WantedBy line wasn't doing anything useful AFAIK...

Thanks.

diff --git a/core/controlplane/config/templates/cloud-config-etcd b/core/controlplane/config/templates/cloud-config-etcd
index fc077436..a291fdbf 100644
--- a/core/controlplane/config/templates/cloud-config-etcd
+++ b/core/controlplane/config/templates/cloud-config-etcd
@@ -151,6 +151,7 @@ coreos:
         Wants=cfn-etcd-environment.service
         After=cfn-etcd-environment.service
         After=network.target
+        After=var-lib-etcd2.mount

         [Service]
         Type=oneshot
@@ -158,7 +159,7 @@ coreos:
         RestartSec=5
         EnvironmentFile=-/etc/etcd-environment
         EnvironmentFile=-/var/run/coreos/etcdadm-environment
-        ExecStartPre=/usr/bin/systemctl is-active format-etcd2-volume.service
+        ExecStartPre=/usr/bin/systemctl is-active var-lib-etcd2.mount
         ExecStartPre=/usr/bin/systemctl is-active cfn-etcd-environment.service
         ExecStartPre=/usr/bin/mkdir -p /var/run/coreos/etcdadm/snapshots
         ExecStart=/opt/bin/etcdadm reconfigure
@@ -167,9 +168,6 @@ coreos:
         {{end -}}
         TimeoutStartSec=120

-        [Install]
-        WantedBy=cfn-etcd-environment.service
-
     - name: etcdadm-update-status.service
       enable: true
       content: |

@mumoshu
Contributor

mumoshu commented May 2, 2018

@iherbmatt Ah, sorry for the late reply! The bad news is that m5 and also c5 instances aren't supported out of the box yet, as mentioned in #1230.

The good news is that there is a patch composed of scripts and systemd units to adapt the NVMe devices to look like legacy devices so that they can be successfully consumed by kube-aws. The patch can be found in issues linked from #1230.

Please don't hesitate to ask me if you still had trouble on anything.

@mumoshu
Contributor

mumoshu commented May 2, 2018

@Confushion Certainly - I realized that my patch wasn't complete at all after seeing your work! Thank you so much for that.

Everyone, @Confushion has kindly contributed #1270 to make etcd bootstrapping even more reliable.
It is already merged and will be available in v0.9.10-rc.6 or v0.9.10.

@davidmccormick
Contributor

Implementation of my previous suggestion to bring the etcd servers up in parallel on a new cluster build. #1357
