Fix volume attachments while upgrade (v12 v13) #999
Conversation
@@ -4,6 +4,9 @@ const Instance = `{{define "instance"}}
{{ .Instance.Master.Instance.ResourceName }}:
  Type: "AWS::EC2::Instance"
  Description: Master instance
+  DependsOn:
The problem was that the VM did not depend on its volumes, so it was created immediately with the etcd volume as its first disk. At that point the docker volume was still being resized (which takes 3-5 minutes); it was then attached as the second disk and ended up mounted to /var/lib/etcd.
This dependency makes sure the VM waits for both volumes to be ready before it starts.
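For reference, here is a minimal sketch, in the repo's Go-constant style, of what the added dependency could look like in the rendered CloudFormation. The package name and the EtcdVolume / DockerVolume logical IDs are placeholders for illustration, not names taken from this PR.

package cloudformation

// Illustrative only: logical IDs are placeholders, not the names the
// real template renders for the etcd and docker volumes.
const masterInstanceDependsOnSketch = `MasterInstance:
  Type: "AWS::EC2::Instance"
  Description: Master instance
  DependsOn:
  - EtcdVolume
  - DockerVolume
`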
Very good catch, thanks!
Properties:
  Encrypted: true
-  Size: 100
+  Size: 50
I'm rolling the master docker disk size back to 50GB because we don't really need 100GB on the master node; 100GB is only needed for workers. This prevents an unnecessary delay (3-5 minutes caused by resizing) during upgrades.
Yes makes sense to leave the volume at 50GB for masters. Great that it avoids the delay.
LGTM. Thanks for taking care here! <3
In terms of versioning, there are other changes being worked on in v13, but I'd rather not introduce a patch unless we have to.
Also, I've tested K8s 1.10.4 and it fixes the configmap problem, so upgrading from 1.10.2 is an option. WDYT?
  VolumeType: gp2
-  AvailabilityZone: !GetAtt {{ .Instance.Master.Instance.ResourceName }}.AvailabilityZone
+  AvailabilityZone: {{ .Instance.Master.AZ }}
This is because the master resource now depends on the volumes?
Yes, otherwise it hits a circular dependency error :(
That makes sense to me. Thanks.
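Roughly, the cycle and its fix can be sketched like this (placeholder logical ID; the package name and property values are assumptions based on the surrounding diff). The volume's AvailabilityZone now comes straight from the template data instead of from the instance:

package cloudformation

// Before: the volume's AZ came from !GetAtt MasterInstance.AvailabilityZone,
// so adding MasterInstance -> DependsOn -> DockerVolume would close a cycle.
// After (sketch): the AZ is rendered directly from the template value.
const dockerVolumeSketch = `DockerVolume:
  Type: "AWS::EC2::Volume"
  Properties:
    Encrypted: true
    Size: 50
    VolumeType: gp2
    AvailabilityZone: {{ .Instance.Master.AZ }}
`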
Thanks for the review. Unfortunately this is only part of the problem. The thing is that with m5 instances we are in general affected by the "disk ordering problem": disks can randomly come up as nvme1 / nvme2, so we need to extract the disk label that we set in CloudFormation. This is already hacked around in the community, so we also need to do some kind of hack. I'll work on that tomorrow.
Yes, 1.10.4 makes total sense.
- Set proper dependencies (the VM depends on the volumes, not the opposite)
- Set the docker disk size to 50GB for the master (prevents unnecessary resizing)
Tested in ginger. Test list added in description.
I'll be merging this on Monday.
@@ -8,7 +8,7 @@ ConditionPathExists=!/var/lib/docker

[Service]
Type=oneshot
-ExecStart=/bin/bash -c "([ -b "/dev/xvdc" ] && /usr/sbin/mkfs.xfs -f /dev/xvdc -L docker) || ([ -b "/dev/nvme1n1" ] && /usr/sbin/mkfs.xfs -f /dev/nvme1n1 -L docker)"
+ExecStart=/bin/bash -c "[ -e "/dev/xvdc" ] && /usr/sbin/mkfs.xfs -f /dev/xvdc -L docker"
-e checks that the file exists (any type), because /dev/xvdc is a symlink in the NVMe case and a block device in the m3 case.
@@ -0,0 +1,26 @@
+package cloudconfig
+
+const NVMEUdevRule = `KERNEL=="nvme[0-9]*n[0-9]*", ENV{DEVTYPE}=="disk", ATTRS{model}=="Amazon Elastic Block Store", PROGRAM="/opt/ebs-nvme-mapping /dev/%k", SYMLINK+="%c"
udev rule that calls a script on any nvme event; the script creates/deletes symlinks with the device name that was specified in EBS, e.g. /dev/xvdh.
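For context, here is a hypothetical sketch of the kind of helper the rule expects at /opt/ebs-nvme-mapping, following the approach of the linked ebs-automatic-nvme-mapping project. The byte range and output format are assumptions, not copied from this PR.

package cloudconfig

// Hypothetical sketch: AWS stores the EBS device name (e.g. xvdh) in the
// vendor-specific area of the NVMe Identify Controller data; printing it
// lets udev create /dev/xvdh -> /dev/nvme1n1 style symlinks via SYMLINK+="%c".
const nvmeMappingScriptSketch = `#!/bin/bash
if [[ -x /usr/sbin/nvme ]] && [[ -b "${1}" ]]; then
  vol="$(/usr/sbin/nvme id-ctrl --raw-binary "${1}" | cut -c3073-3104 | tr -d ' ')"
  vol="${vol#/dev/}"
  # Simplified: print only the xvd-style name for the symlink.
  [[ -n "${vol}" ]] && echo "${vol/sd/xvd}"
fi
`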
+fi
+`
+
+const NVMEUdevTriggerUnit = `[Unit]
This unit is only necessary on the first boot, because the udev rule was just added and we need to retrigger udev.
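A sketch of what such a one-shot trigger unit could look like; the unit contents here are assumptions, not the PR's actual definition.

package cloudconfig

// Assumed sketch of a first-boot trigger unit: reload udev rules and
// retrigger block device events so the new NVMe rule creates the symlinks.
const nvmeUdevTriggerUnitSketch = `[Unit]
Description=Retrigger udev so the NVMe EBS mapping rule creates device symlinks

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/usr/bin/udevadm control --reload-rules
ExecStart=/usr/bin/udevadm trigger --subsystem-match=block --action=add

[Install]
WantedBy=multi-user.target
`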
@r7vme @rossf7 I'm excited and pleased that you've discovered my solution[1] for dealing with NVMe devices in AWS by way of the CoreOS and kube-aws projects, but as a courtesy, can you include a copy of the license and copyright notice[2] in your derivative work?
1: https://github.com/oogali/ebs-automatic-nvme-mapping
Fixes: https://github.com/giantswarm/giantswarm/issues/3307
Updated the v12 and v13 versions, as there are no clusters on v12. Or do I need to create v12patch1?
QA

v12
- successfully created with m5.large and master node survives reboot
- v11 to v12 with m5.large
- v11 to v12 with m3.large