Unable to create cluster with more than 1 etcd #1206
Hi, I am also seeing similar behaviour today using both v0.9.8 and v0.9.9. I have
Looks like the cfn signal is never sent from Etcd0 and the control plane nested stack fails. From the cfn event log:
If I set
I'm seeing the same behaviour, we're on v0.9.8...
It's related to the wait signal. @steinfletcher + @mludvig, try adding this to your cluster.yaml:

waitSignal:
  enabled: false
  maxBatchSize: 1

The relevant template is:

"{{$etcdInstance.LogicalName}}": {
"Type": "AWS::AutoScaling::AutoScalingGroup",
"Properties": {
"HealthCheckGracePeriod": 600,
"HealthCheckType": "EC2",
"LaunchConfigurationName": {
"Ref": "{{$etcdInstance.LaunchConfigurationLogicalName}}"
},
"MaxSize": "1",
"MetricsCollection": [
{
"Granularity": "1Minute"
}
],
"MinSize": "1",
"Tags": [
{
"Key": "kubernetes.io/cluster/{{$.ClusterName}}",
"PropagateAtLaunch": "true",
"Value": "true"
},
{
"Key": "Name",
"PropagateAtLaunch": "true",
"Value": "{{$.ClusterName}}-{{$.StackName}}-kube-aws-etcd-{{$etcdIndex}}"
},
{
"Key": "kube-aws:role",
"PropagateAtLaunch": "true",
"Value": "etcd"
}
],
"VPCZoneIdentifier": [
{{$etcdInstance.SubnetRef}}
]
},
{{if $.WaitSignal.Enabled}}
"CreationPolicy" : {
"ResourceSignal" : {
"Count" : "1",
"Timeout" : "{{$.Controller.CreateTimeout}}"
}
},
{{end}}
"UpdatePolicy" : {
"AutoScalingRollingUpdate" : {
"MinInstancesInService" : "0",
"MaxBatchSize" : "1",
{{if $.WaitSignal.Enabled}}
"WaitOnResourceSignals" : "true",
"PauseTime": "{{$.Controller.CreateTimeout}}"
{{else}}
"PauseTime": "PT2M"
{{end}}
}
},

I was able to get this working by disabling the signal. The next question is: how did this ever work? Something underlying in the cfn engine must have changed WRT simultaneous execution. Here's my etcd log after setting wait to false:

-- Logs begin at Thu 2018-03-29 22:20:06 UTC. --
Mar 29 22:25:21 ip-d.d.d.d.ec2.internal systemd[1]: Started Session 1 of user core.
Mar 29 22:25:21 ip-d.d.d.d.ec2.internal systemd-logind[780]: New session 1 of user core.
Mar 29 22:25:21 ip-d.d.d.d.ec2.internal systemd[1530]: Reached target Paths.
Mar 29 22:25:21 ip-d.d.d.d.ec2.internal systemd[1530]: Reached target Sockets.
Mar 29 22:25:21 ip-d.d.d.d.ec2.internal systemd[1530]: Reached target Timers.
Mar 29 22:25:21 ip-d.d.d.d.ec2.internal systemd[1530]: Reached target Basic System.
Mar 29 22:25:21 ip-d.d.d.d.ec2.internal systemd[1530]: Reached target Default.
Mar 29 22:25:21 ip-d.d.d.d.ec2.internal systemd[1530]: Startup finished in 23ms.
Mar 29 22:25:21 ip-d.d.d.d.ec2.internal systemd[1]: Started User Manager for UID 500.
Mar 29 22:25:23 ip-d.d.d.d.ec2.internal etcd-wrapper[1465]: 2018-03-29 22:25:23.592989 W | rafthttp: health check for peer 596daac612174e37 could not connect: dial tcp d.d.d.d:2380: i/o timeout
Mar 29 22:25:25 ip-d.d.d.d.ec2.internal etcd-wrapper[1465]: 2018-03-29 22:25:25.906323 W | etcdserver: failed to reach the peerURL(https://d.d.d.d.compute-1.amazonaws.com:2380) of member 596daac612174e37 (Get https://d.d.d.d.compute-1.amazonaws.com:2380/version: dial tcp d.d.d.d:2380: i/o timeout)
Mar 29 22:25:25 ip-d.d.d.d.ec2.internal etcd-wrapper[1465]: 2018-03-29 22:25:25.906359 W | etcdserver: cannot get the version of member 596daac612174e37 (Get https://d.d.d.d.compute-1.amazonaws.com:2380/version: dial tcp d.d.d.d:2380: i/o timeout)
Mar 29 22:25:28 ip-d.d.d.d.ec2.internal etcd-wrapper[1465]: 2018-03-29 22:25:28.593180 W | rafthttp: health check for peer 596daac612174e37 could not connect: dial tcp d.d.d.d:2380: i/o timeout
Mar 29 22:25:31 ip-d.d.d.d.ec2.internal etcd-wrapper[1465]: 2018-03-29 22:25:31.108598 W | etcdserver: failed to reach the peerURL(https://d.d.d.d.compute-1.amazonaws.com:2380) of member 596daac612174e37 (Get https://d.d.d.d.compute-1.amazonaws.com:2380/version: dial tcp d.d.d.d:2380: i/o timeout)
Mar 29 22:25:31 ip-d.d.d.d.ec2.internal etcd-wrapper[1465]: 2018-03-29 22:25:31.108630 W | etcdserver: cannot get the version of member 596daac612174e37 (Get https://d.d.d.d.compute-1.amazonaws.com:2380/version: dial tcp d.d.d.d:2380: i/o timeout)
Mar 29 22:25:33 ip-d.d.d.d.ec2.internal etcd-wrapper[1465]: 2018-03-29 22:25:33.593363 W | rafthttp: health check for peer 596daac612174e37 could not connect: dial tcp d.d.d.d:2380: i/o timeout
Mar 29 22:25:36 ip-d.d.d.d.ec2.internal etcd-wrapper[1465]: 2018-03-29 22:25:36.310906 W | etcdserver: failed to reach the peerURL(https://d.d.d.d.compute-1.amazonaws.com:2380) of member 596daac612174e37 (Get https://d.d.d.d.compute-1.amazonaws.com:2380/version: dial tcp d.d.d.d:2380: i/o timeout)
Mar 29 22:25:36 ip-d.d.d.d.ec2.internal etcd-wrapper[1465]: 2018-03-29 22:25:36.310939 W | etcdserver: cannot get the version of member 596daac612174e37 (Get https://d.d.d.d.compute-1.amazonaws.com:2380/version: dial tcp d.d.d.d:2380: i/o timeout)
Mar 29 22:25:38 ip-d.d.d.d.ec2.internal etcd-wrapper[1465]: 2018-03-29 22:25:38.593555 W | rafthttp: health check for peer 596daac612174e37 could not connect: dial tcp d.d.d.d:2380: i/o timeout
Mar 29 22:25:41 ip-d.d.d.d.ec2.internal etcd-wrapper[1465]: 2018-03-29 22:25:41.513243 W | etcdserver: failed to reach the peerURL(https://d.d.d.d.compute-1.amazonaws.com:2380) of member 596daac612174e37 (Get https://d.d.d.d.compute-1.amazonaws.com:2380/version: dial tcp d.d.d.d:2380: i/o timeout)
Mar 29 22:25:41 ip-d.d.d.d.ec2.internal etcd-wrapper[1465]: 2018-03-29 22:25:41.513275 W | etcdserver: cannot get the version of member 596daac612174e37 (Get https://d.d.d.d.compute-1.amazonaws.com:2380/version: dial tcp d.d.d.d:2380: i/o timeout)
Mar 29 22:25:41 ip-d.d.d.d.ec2.internal etcd-wrapper[1465]: 2018-03-29 22:25:41.901783 I | rafthttp: peer 596daac612174e37 became active
Mar 29 22:25:41 ip-d.d.d.d.ec2.internal etcd-wrapper[1465]: 2018-03-29 22:25:41.901825 I | rafthttp: established a TCP streaming connection with peer 596daac612174e37 (stream MsgApp v2 reader)
Mar 29 22:25:41 ip-d.d.d.d.ec2.internal etcd-wrapper[1465]: 2018-03-29 22:25:41.902251 I | rafthttp: established a TCP streaming connection with peer 596daac612174e37 (stream Message reader)
Mar 29 22:25:45 ip-d.d.d.d.ec2.internal etcd-wrapper[1465]: 2018-03-29 22:25:45.526774 I | etcdserver: updating the cluster version from 3.0 to 3.2
Mar 29 22:25:45 ip-d.d.d.d.ec2.internal etcd-wrapper[1465]: 2018-03-29 22:25:45.529138 N | etcdserver/membership: updated the cluster version from 3.0 to 3.2
Mar 29 22:25:45 ip-d.d.d.d.ec2.internal etcd-wrapper[1465]: 2018-03-29 22:25:45.529325 I | etcdserver/api: enabled capabilities for version 3.2
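For context on the wait signal discussed above: the CreationPolicy in that template is only satisfied when something on the instance reports success back to CloudFormation. A minimal sketch of such a resource signal (illustrative only; kube-aws on Container Linux wraps this differently, so the exact invocation, stack name, and logical resource ID here are assumptions):

# Report success for this instance against the ASG's CreationPolicy.
# STACK_NAME and the "Etcd0" logical resource name are placeholders.
cfn-signal --success true \
  --stack "${STACK_NAME}" \
  --resource "Etcd0" \
  --region us-east-1

If no such signal arrives within the configured timeout, CloudFormation marks the resource as failed and rolls back the nested stack, which matches the behaviour reported above.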
We've asked our AWS Technical Account Managers to see if the CF team can shed any insight. The other thing I'm wondering about, and haven't had a chance to check yet: perhaps the etcd version / image isn't locked down and something changed there? I'll look later this eve when I have time.
Thanks @luck02. "Something underlying in the cfn engine must have changed WRT simultaneous execution." Yeah, I am also suspecting this.
Each etcd node has a dedicated ASG which depends on the next etcd node for sequential launch and rolling update, so there should be no simultaneous execution (if that's what you meant). The first etcd node in your cluster should just start without waiting for any other etcd nodes, as implemented in etcdadm, so in my understanding something like what's reported here shouldn't happen normally. I had troubleshooted before that certain user-provided EC2 tags on etcd nodes confused etcdadm so that it had been unable to calculate a correct number of "running etcd nodes", and therefore it failed to bootstrap any etcd cluster with more than 1 node. Can you confirm whether you do have bad tags on your etcd nodes?
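If it helps anyone checking for this, the tags that would be seen on the etcd instances can be dumped with a quick AWS CLI query (illustrative; it filters on the kube-aws:role=etcd tag from the template above and assumes default credentials and region):

# List running etcd instances and their tags for inspection.
aws ec2 describe-instances \
  --filters "Name=tag:kube-aws:role,Values=etcd" \
            "Name=instance-state-name,Values=running" \
  --query 'Reservations[].Instances[].{Id:InstanceId,Tags:Tags}' \
  --output json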
Hi, thanks for the answer. Nope, I don't have any such tags.
@mludvig Thx! Would you mind sharing the result of
Here:
Interestingly
@mumoshu FWIW, here's our stack tags:

# AWS Tags for cloudformation stack resources
stackTags:
environment: "{{ stack_env }}"
project: "{{ PROJECT_NAME }}"
owner: "{{ PROJECT_OWNER }}"

Note that this hasn't changed, and we're using a template to populate it; on execution it would be more like:

# AWS Tags for cloudformation stack resources
stackTags:
environment: "test"
project: "ub-data-infrastructure/cluster"
owner: "dataops"

Again, this hasn't changed. I'd be interested in hearing more along the lines of: "I had troubleshooted before that certain user-provided EC2 tags on etcd nodes confused etcdadm so that it had been unable to calculate a correct number of "running etcd nodes", and therefore it failed to bootstrap any etcd cluster with more than 1 node." In the meantime, if you'd like to see our etcd logs as requested above, I can provide them as well; I just need to undo the waitSignal change:

waitSignal:
  enabled: false
  maxBatchSize: 1
Yeah, this started happening to me last Friday when I tried to create a new cluster with a basically identical config to a cluster I created a few weeks earlier. Something subtle definitely must have changed. I'm currently afraid to run
I have a hard time thinking it's a stackTags issue in my case since it was never a problem previously. How this manifested for me was a
@jcrugzz I'm just working my way through some fixes (
Thanks @luck02, appreciate it!
Hey guys. I disabled the wait signal, and it generated all the appropriate machines; however, the masters are no longer healthy. The cluster.yaml file I'm using is one I've been using since 0.9.9 originally came out. Should it work simply by uncommenting the waitSignal block and setting it to disabled?
@jcrugzz / everyone else: I've burned quite a bit of time testing this. I don't think disabling waitSignal is going to be viable; quite a few of my validation steps start failing randomly. Of course YMMV, but we want to validate that our cluster is healthy at the end, and disabling the waitSignal makes that challenging. I did hear back from our AWS technical account managers. They claim zero changes in the underlying CFN code. They've offered to investigate a failed stack for us, which I'll set up tomorrow morning (PST). I didn't see the etcd container having a specific version, so my next theory is that if the image isn't locked down, we could be pulling a different container and seeing drift there (i.e. perhaps they're not reporting success / failure in the same way, etc.). I'll continue investigating.
I'm facing the same problem too.
I think that the
I just checked, and a new version of etcd was released 6 days ago, so presumably it's related. I'm just cleaning up a semi-related mess and then going to set our etcd version back to what was out a month ago. I'm assuming that's going to solve the issue as well. I'll report back when I'm done. The etcd version is set here:
I'd expect to be testing that this evening / tomorrow.
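For anyone wanting to try the same, pinning etcd to an older release is a cluster.yaml setting (a minimal sketch; the version shown is only an example of a pre-regression release, not a recommendation):

etcd:
  # Pin etcd to a known-good release instead of whatever the current default resolves to.
  version: 3.2.10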
Seeing the exact same behavior as well. We were testing a dev build and had a known working cluster.yaml. We are using etcd version 3.2.10.
Edit: Using @ktateish's patch from above on the userdata made everything work again. Wonder why it broke in the first place.
@kylegoch so you've pinned your etcd version to v3.2.1, which according to GitHub was built on Jun 23, 2017? OK, that's really odd. Something changed... If it wasn't CFN and it wasn't etcd... I'm going to experiment with pinning my version to something older than last month, just to replicate the issue with a pinned version (previously we weren't pinning the version).
We are using 3.2.10 from November. Not sure why that version, but that's what we have always used. And the cluster.yaml I'm working with right now worked just fine about 10 days ago.
I've been using etcd 3.2.6 and I still incurred this issue as well.
I can confirm that @ktateish's workaround with
Now I'm wondering if the versioning provided in the cluster.yaml is effective. I just added this to my cluster.yaml config:

etcd:
  #etc
  version: 3.3.1

but when I log into the etcd node from my failed cluster I get:

core@ip-x-y-z-etc ~ $ etcdctl version
etcdctl version: 3.2.15
API version: 3.2

3.2.15 was built in January, and I see it's a failed cluster, so presumably that's the end of the line for this enquiry. I'll do the sleep workaround for now.
Hi Gary, are you baking that into a build? Or what would typically be the best way to make this change manually? Should we be doing this manually? Thank you!
@iherbmatt Depends on your setup; for us it's a bit complicated, and the easiest way for me to do this is to apply a hotfix to the kube-aws source code and build myself a new hotfix version. But that's because our deployment pipeline doesn't leave the artifacts for me to locally jury-rig. We do have some pipeline stuff I could jury-rig to apply the fix, but it's really ugly (ansible - lineinfile - regex, etc.).
Your fix for the iops seemed to work well by cherry-picking - not sure if that's what you mean by hotfix. I haven't been able to build clusters in over a week, so I'm desperate and really appreciate your looking into this :)
That's correct; in this case there's no commit to cherry-pick, but applying a diff amounts to the same thing. I'm running off of v0.9.8, so this is the patch I applied:

commit 19ad26bd147ec9882dfb7e67f5aa854a331cf2cd (HEAD -> v0.9.8-hotfix4, tag: v0.9.8-hotfix4)
Author: Gary Lucas <gary.lucas@unbounce.com>
Date: Wed Apr 4 16:03:06 2018 -0700
more fixii
diff --git a/core/controlplane/config/templates/cloud-config-etcd b/core/controlplane/config/templates/cloud-config-etcd
index 2d25d487..c8ae763b 100644
--- a/core/controlplane/config/templates/cloud-config-etcd
+++ b/core/controlplane/config/templates/cloud-config-etcd
@@ -166,6 +166,7 @@ coreos:
RestartSec=5
EnvironmentFile=-/etc/etcd-environment
EnvironmentFile=-/var/run/coreos/etcdadm-environment
+ ExecStartPre=/usr/bin/sleep 60
ExecStart=/opt/bin/etcdadm member_status_set_started
{{if .Etcd.Snapshot.IsAutomatedForEtcdVersion .Etcd.Version -}}
ExecStartPost=/usr/bin/systemctl start etcdadm-save.timer

Mind you, my new stack isn't up yet.
Goddamn it, I put the 'fix' in the wrong stanza (the update service instead of reconfigure). I'll try again this eve.
Applied this:
Isn't having a service reconfigure the type of etcd service a lot of added complexity? Isn't the point of the disasterRecovery option that it can recover nodes that have failed to be a part of the etcd cluster? I would rather that it be left as notify, but that all etcd nodes are initially created in parallel. What do you think?
Oh, I missed something. I thought it should be fixed on starting
Hi everyone, I'm seeing this now:

member [omitted] is healthy: got healthy result from https://[omitted].compute.amazonaws.com:2379

It appears etcd is healthy, and I'm seeing this in the controller logs as well. I'm having trouble getting the controllers to generate now, however. I'm going to try to build it again and see what happens.
I applied this:

commit 65722a891eca5e8a5ff9538e2837d7bbeb84390f (HEAD -> unbounce-v0.9.8, tag: v0.9.8-hotfix6, origin/unbounce-v0.9.8, v0.9.8-hotfix6, v0.9.8-hotfix5)
Author: Gary Lucas <gary.lucas@unbounce.com>
Date: Thu Apr 5 08:40:40 2018 -0700
trying mumoshis fix
diff --git a/core/controlplane/config/templates/cloud-config-etcd b/core/controlplane/config/templates/cloud-config-etcd
index e85ca23c..b8a56949 100644
--- a/core/controlplane/config/templates/cloud-config-etcd
+++ b/core/controlplane/config/templates/cloud-config-etcd
@@ -140,7 +140,7 @@ coreos:
RestartSec=5
EnvironmentFile=-/etc/etcd-environment
EnvironmentFile=-/var/run/coreos/etcdadm-environment
- ExecStartPre=/usr/bin/sleep 60
+ ExecStartPre=/usr/bin/systemctl is-active format-etcd2-volume.service
ExecStartPre=/usr/bin/systemctl is-active cfn-etcd-environment.service
ExecStartPre=/usr/bin/mkdir -p /var/run/coreos/etcdadm/snapshots
ExecStart=/opt/bin/etcdadm reconfigure
Cluster came up, I'm happy :D
I wonder if it has something to do with the fact that I'm using 0.9.9 instead of 0.9.8. The etcd cluster comes up fine, but my controllers now don't come online, even though they are built. Here is the output I'm seeing loop in journalctl on the controllers:
@iherbmatt Hi! Kubelet seems fine to me. Can you share the full output from
Partially yes, and partially no? I guess you may be confusing two things. Generally there are two major categories of failure: transient and permanent failures of etcd node(s). A transient failure is when the underlying EC2 instance fails due to an AWS infrastructure issue; in this case, the ASG just recreates the EC2 instance to resolve the issue. If you have a 3-node etcd cluster, you may notice that you now have 3 ASGs in total, each matching one etcd node (= member). We also have a pool of EIP+EBS pairs from which each etcd member borrows its identity and data dir. A permanent failure is when, e.g., the EBS volume serving the etcd data dir is corrupted, so that you have to recover the etcd member from an etcd snapshot (not an EBS snapshot).
That being said,
Definitely. I'm open to ideas to set
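For readers following the thread, the H/A etcd machinery described above maps onto cluster.yaml settings roughly like the following. This is a minimal sketch; the option names are taken from kube-aws documentation of this era and should be double-checked against your kube-aws version:

etcd:
  count: 3
  # Each member borrows its network identity (EIP or ENI) and its EBS-backed data dir
  # from a per-member pool managed by the per-member ASGs described above.
  memberIdentityProvider: eip
  snapshot:
    # Periodically save etcd snapshots so a member can be rebuilt after a permanent failure.
    automated: true
  disasterRecovery:
    # Recover a member from the latest snapshot when its data dir is lost or corrupted.
    automated: true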
Good point! This is what I gave up when I first implemented the H/A etcd about a year ago. It may be time to consider alternative implementations or possible enhancements.
Did you mean kube-aws controller nodes? Then yes, controller nodes are behaving that way - there's a single multi-AZ ASG managing the desired number of controller EC2 instances. Implementation-wise, we can't do the same for etcd nodes, though. But anyway,
This should be discussed further. How about just omitting
@mumoshu I was really excited to see my etcd nodes build successfully. I even logged in and saw they were all healthy, but then I saw the same CloudFormation timeouts on the controllers. I will redact some identifying data from the journalctl log and attach it. Thank you in advance for your time :)
@iherbmatt Thanks! If I could ask for more, sharing your cluster.yaml would also help! I know cluster bootstrapping shouldn't be such an exciting and hard thing to do, but there are certainly many failure cases, some of which can be pinpointed just by looking at your cluster.yaml.
@mumoshu Here is the cluster.yaml file.
@mumoshu Here is the journalctl log from the controllers that would not start up.
For me the issue started with CoreOS 1688.5.3, released in April. With the patch from @mumoshu, the etcd nodes get updated fine on CoreOS 1688.5.3. However, the controllers don't, and the stack rolls back.
@mumoshu Any thoughts?
@Vincemd Are you unable to build clusters as well?
@iherbmatt I had the same problem while testing the proposed fix because I changed the cluster name in
@iherbmatt Correct, with the latest version of CoreOS. If I use the Feb release of the AMI, then all is fine. A colleague of mine is also facing the same issue.
@iherbmatt Sorry for the trouble! Perhaps you are hit by the recent regression in master? Would you mind trying with kube-aws v0.9.10-rc.3? If it still doesn't work, trying k8s 1.9.3, which is the default in 0.9.10-rc.3, may change something.
Hi @mumoshu. I was able to generate a cluster with 0.9.10-rc.3, but it had to be running version 1.9.3, otherwise it wouldn't work. Another issue I have, however, is that I cannot use m5s for the etcd nodes. Any reason you can think of that might explain why? Thanks!
Hi @mumoshu, seems you were right about
The patch below fixes this by actually depending on the service
Also the
Thanks.
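Since the actual patch isn't reproduced in this thread, here is a rough illustration of the difference being discussed between checking a unit with is-active and actually depending on it. The unit names follow the cloud-config-etcd snippets quoted earlier; treat this as a sketch of the idea, not the merged change:

# Illustrative systemd drop-in for the etcdadm reconfigure unit (sketch only).
# An ExecStartPre=/usr/bin/systemctl is-active ... line merely fails fast if the other
# unit isn't active yet and relies on Restart= retries; declaring Requires=/After= makes
# systemd pull in and order the dependency properly instead.
[Unit]
Requires=format-etcd2-volume.service cfn-etcd-environment.service
After=format-etcd2-volume.service cfn-etcd-environment.service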
@iherbmatt Ah, sorry for the late reply! The bad news is that m5 and also c5 aren't supported out of the box yet, as mentioned in #1230. The good news is that there is a patch composed of scripts and systemd units to adapt the NVMe devices to look like legacy devices so that they can be successfully consumed by kube-aws. The patch can be found in issues linked from #1230. Please don't hesitate to ask me if you still have trouble with anything.
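For reference, the community workarounds of this kind generally follow the same pattern: a udev rule plus a small helper script that maps each NVMe-attached EBS volume back to the legacy /dev/xvdX name it was attached as. A rough, illustrative sketch only; the file paths, rule details, and byte offsets here are assumptions, not the actual patch linked from #1230:

# /etc/udev/rules.d/90-ebs-nvme.rules (sketch): symlink EBS NVMe volumes to legacy names.
KERNEL=="nvme[0-9]*n[0-9]*", ATTRS{model}=="Amazon Elastic Block Store", PROGRAM="/opt/bin/ebs-nvme-name /dev/%k", SYMLINK+="%c"

#!/bin/bash
# /opt/bin/ebs-nvme-name (sketch): print the legacy device name (e.g. "xvdf") that EBS
# records in the vendor-specific area of the NVMe identify-controller data.
set -euo pipefail
name="$(nvme id-ctrl --raw-binary "$1" | cut -c3073-3104 | tr -d ' ')"
echo "${name#/dev/}"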
@Confushion Certainly - I realized that my patch wasn't complete at all after seeing your work! Thank you so much for that. Everyone, @Confushion has kindly contributed #1270 to make etcd bootstrapping even more reliable.
Implementation of my previous suggestion to bring the etcd servers up in parallel on a new cluster build: #1357
Hi, I've been trying for a few hours to create a cluster with 3 etcd instances but always got a timeout. It looks like the ASG for Etcd0 is created first and the instance keeps trying to connect to the other two etcd instances, but they do not yet exist, so the initialisation times out. If the Etcd1 and Etcd2 ASGs were created in parallel it would probably work, as the instances would start up simultaneously and could connect to each other.
I had the same results both with .etcd.memberIdentityProvider == eip and with eni - in both cases Etcd0 tried to connect to the other not-yet-existing nodes, either over EIP or over ENI. In either case it timed out.
I'm using a pre-existing VPC with existing subnets - 3x Private with NAT and 3x DMZ with public IP enabled by default. I tried to put the etcd nodes both in Private and in DMZ, and both failed when more than 1 node was requested.