
Nodegroup scaling #254

Merged
richardcase merged 6 commits into master from nodegroup-scaling on Oct 16, 2018

Conversation

@richardcase (Contributor) commented Oct 13, 2018

Description

Initial version of nodegroup scaling has been added. This scales
by modifying the CloudFormation template for the nodegroup. The
modified template is used to create a changeset that is then
executed.

When scaling down/in (i.e. reducing the number of nodes) we rely
solely on the resulting change to the ASG. This means that the
node(s) to be terminated aren't drained, so pods running on the
terminating nodes may encounter errors. In the future we may consider
picking the EC2 instances to terminate, draining those nodes first,
and then creating a termination policy to ensure those nodes are the
ones removed.

This supersedes #191 and relates to issue #116.
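
As a rough illustration of the changeset flow the description mentions, here is a minimal sketch against the AWS SDK for Go (v1). The helper name, the changeset naming scheme, and the CAPABILITY_IAM setting are assumptions for the example, not the PR's actual code.

package scale

import (
	"fmt"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/cloudformation"
)

// applyScaledTemplate creates a changeset for the nodegroup stack from an
// already-modified template body, waits for the changeset to be ready,
// executes it, and then waits for the resulting stack update to finish.
func applyScaledTemplate(svc *cloudformation.CloudFormation, stackName, templateBody string) error {
	changesetName := fmt.Sprintf("eksctl-scale-nodegroup-%d", time.Now().Unix())

	if _, err := svc.CreateChangeSet(&cloudformation.CreateChangeSetInput{
		StackName:     aws.String(stackName),
		ChangeSetName: aws.String(changesetName),
		TemplateBody:  aws.String(templateBody),
		Capabilities:  []*string{aws.String(cloudformation.CapabilityCapabilityIam)},
	}); err != nil {
		return err
	}

	describeInput := &cloudformation.DescribeChangeSetInput{
		StackName:     aws.String(stackName),
		ChangeSetName: aws.String(changesetName),
	}
	if err := svc.WaitUntilChangeSetCreateComplete(describeInput); err != nil {
		return err
	}

	if _, err := svc.ExecuteChangeSet(&cloudformation.ExecuteChangeSetInput{
		StackName:     aws.String(stackName),
		ChangeSetName: aws.String(changesetName),
	}); err != nil {
		return err
	}

	// Executing the changeset only starts the update; wait for the stack
	// (and hence the ASG's desired capacity) to settle.
	return svc.WaitUntilStackUpdateComplete(&cloudformation.DescribeStacksInput{
		StackName: aws.String(stackName),
	})
}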

Todo:

  • CloudFormation template modification
  • Changeset implementation
  • Workload testing whilst scaling up
  • Workload testing whilst scaling down

Checklist

  • Code compiles correctly (i.e. make build)
  • Added tests that cover your change (if possible)
  • All tests passing (i.e. make test)
  • Added/modified documentation as required (such as the README)
  • Added yourself to the humans.txt file

@richardcase force-pushed the nodegroup-scaling branch 2 times, most recently from 2b459d9 to cc9c25b on October 15, 2018 10:02
@richardcase changed the title from "WIP: Nodegroup scaling" to "Nodegroup scaling" on Oct 15, 2018
@errordeveloper (Contributor) left a comment

Broadly, LGTM. I'll test it and have another look locally, as I'd like to better understand how it works. I may add some cosmetic changes, and probably a few lines in the docs. Thanks a lot, I think we should be able to merge and release this soon enough! 👍 🥇

fs := cmd.Flags()

fs.StringVarP(&cfg.ClusterName, "name", "n", "", "EKS cluster name")
fs.IntVarP(&cfg.Nodes, "nodes", "N", 0, "total number of nodes (scale to this number)")

}
logger.Debug("changes = %#v", changeset.Changes)
if err := c.doExecuteChangeset(stackName, changesetName); err != nil {
logger.Warning("error executing Cloudformation changeset %s in stack %s. Check the Cloudformation console for further details", changesetName, stackName)

cmd/eksctl/scale.go (review thread resolved)
logger.Debug("describeChangesetErr=%v", err)
} else {
logger.Critical("unexpected status %q while %s", *s.Status, msg)
c.troubleshootStackFailureCause(i, desiredStatus)

@errordeveloper (Contributor) commented Oct 15, 2018 via email

@richardcase (Contributor, Author) replied:

> Scaling to zero is a legit use-case actually (basically a way to save money), we shouldn't prevent it as such. I just think there should be no default value, that's all.

Good point. I'll change that.
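
A minimal sketch of one way to drop the default while keeping 0 as a valid value, using pflag's Changed() check; the command wiring and error message here are illustrative assumptions, not the change that was eventually made.

package main

import (
	"fmt"

	"github.com/spf13/cobra"
)

func newScaleNodeGroupCmd() *cobra.Command {
	var nodes int
	cmd := &cobra.Command{
		Use: "nodegroup",
		RunE: func(cmd *cobra.Command, args []string) error {
			// Changed() reports whether the user passed the flag at all,
			// so "--nodes 0" is accepted while omitting --nodes fails.
			if !cmd.Flags().Changed("nodes") {
				return fmt.Errorf("--nodes is required")
			}
			fmt.Printf("scaling nodegroup to %d nodes\n", nodes)
			return nil
		},
	}
	cmd.Flags().IntVarP(&nodes, "nodes", "N", 0, "total number of nodes (scale to this number)")
	return cmd
}

func main() {
	if err := newScaleNodeGroupCmd().Execute(); err != nil {
		fmt.Println(err)
	}
}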

@errordeveloper (Contributor)

@richardcase so annoying GitHub doesn't attach email reply to the thread... sorry about that, I was hoping they fixed it.

)

const (
desirecCapacityPath = "Resources.NodeGroup.Properties.DesiredCapacity"
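
That constant reads as a dotted JSON path into the nodegroup's CloudFormation template. A minimal sketch of how such a path could be used to rewrite the desired capacity, assuming a path-based JSON library such as github.com/tidwall/sjson (the excerpt doesn't confirm which library the PR uses):

package scale

import (
	"strconv"

	"github.com/tidwall/sjson"
)

const desiredCapacityPath = "Resources.NodeGroup.Properties.DesiredCapacity"

// setDesiredCapacity returns a copy of the template body with the ASG's
// DesiredCapacity replaced. CloudFormation accepts the value as a string,
// so the node count is stringified before it is written.
func setDesiredCapacity(templateBody string, nodes int) (string, error) {
	return sjson.Set(templateBody, desiredCapacityPath, strconv.Itoa(nodes))
}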

logger.Warning("error executing Cloudformation changeset %s in stack %s. Check the Cloudformation console for further details", changesetName, stackName)
return err
}
return c.doWaitUntilStackIsUpdated(i)

@Lazyshot

We may want to go ahead and add integration tests for this:

Scale to 2 nodes, check for kubernetes nodes via kube api, scale back down to 1 (no wait?)

@richardcase (Contributor, Author)

> We may want to go ahead and add integration tests for this:
> Scale to 2 nodes, check for kubernetes nodes via kube api, scale back down to 1 (no wait?)

That would be very nice. I'll also add this.
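
For reference, a minimal sketch of what the suggested integration test could look like. The eksctl invocation and flag names follow the diff shown earlier, the environment variables are hypothetical, and a recent client-go is assumed for the node listing.

package integration

import (
	"context"
	"os"
	"os/exec"
	"testing"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// scaleNodegroup shells out to eksctl; the flag names mirror the diff above.
func scaleNodegroup(t *testing.T, clusterName, nodes string) {
	t.Helper()
	cmd := exec.Command("eksctl", "scale", "nodegroup", "--name", clusterName, "--nodes", nodes)
	if out, err := cmd.CombinedOutput(); err != nil {
		t.Fatalf("scale to %s nodes failed: %v\n%s", nodes, err, out)
	}
}

func TestNodegroupScaling(t *testing.T) {
	clusterName := os.Getenv("EKSCTL_TEST_CLUSTER")

	scaleNodegroup(t, clusterName, "2")

	// Check the node count via the Kubernetes API, as suggested above.
	config, err := clientcmd.BuildConfigFromFlags("", os.Getenv("KUBECONFIG"))
	if err != nil {
		t.Fatal(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		t.Fatal(err)
	}
	nodes, err := clientset.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		t.Fatal(err)
	}
	if len(nodes.Items) != 2 {
		t.Fatalf("expected 2 nodes, got %d", len(nodes.Items))
	}

	scaleNodegroup(t, clusterName, "1")
}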

func (c *StackCollection) doCreateChangesetRequest(i *Stack, action string, description string, templateBody []byte,
parameters map[string]string, withIAM bool) (string, error) {

changesetName := fmt.Sprintf("eksctl-%s-%d", action, time.Now().Unix())

}
logger.Debug("start %s", msg)
if waitErr := w.WaitWithContext(ctx); waitErr != nil {
s, err := c.describeStackChangeset(i, changesetName)

errordeveloper previously approved these changes Oct 16, 2018

@errordeveloper (Contributor) left a comment

@richardcase I'm done with nitpicking here! We can merge and add tests in another PR, then cut a release, but I'm also equally happy to see tests here - up to you :)

@Lazyshot

We can definitely have the integration tests in another PR

errordeveloper previously approved these changes Oct 16, 2018
Lazyshot previously approved these changes Oct 16, 2018
richardcase and others added 5 commits October 16, 2018 18:46
Initial version of nodegroup scaling has been added. This scales
by modifying the CloudFormation template for the nodegroup. The
modified template is used to create a changeset that is then
executed.

When scaling down/in (i.e. reducing the number of nodes) we rely
solely on the resulting change to the ASG. This means that the
node(s) to be terminated aren't drained, so pods running on the
terminating nodes may encounter errors. In the future we may consider
picking the EC2 instances to terminate, draining those nodes first,
and then creating a termination policy to ensure those nodes are the
ones removed.

Issue #116

Signed-off-by: Richard Case <richard.case@outlook.com>
If the desired capacity is greater/less than the current
max/min of the ASG then the max/min will be updated to match the
desired node count.

Signed-off-by: Richard Case <richard.case@outlook.com>
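
The commit message above describes widening the ASG bounds when the requested capacity falls outside them; a minimal sketch of that rule, using a hypothetical helper rather than the PR's actual code:

package scale

// clampScalingBounds widens the ASG's [min, max] range so that it always
// contains the requested desired capacity, leaving the bounds untouched
// when the desired value already fits between them.
func clampScalingBounds(desired, min, max int) (newMin, newMax int) {
	newMin, newMax = min, max
	if desired < newMin {
		newMin = desired
	}
	if desired > newMax {
		newMax = desired
	}
	return newMin, newMax
}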
@richardcase (Contributor, Author)

Rebased but build is failing.

@errordeveloper (Contributor)

@richardcase I've re-triggered, looks like it could be a flake...

@richardcase (Contributor, Author) commented Oct 16, 2018

Thanks @errordeveloper. Could you approve again when you get time and I'll merge it in.

@richardcase (Contributor, Author)

I've created #267 to make sure we don't forget to add the integration test.

@richardcase merged commit e229797 into master on Oct 16, 2018
@errordeveloper deleted the nodegroup-scaling branch on October 16, 2018 22:38
torredil pushed a commit to torredil/eksctl that referenced this pull request on May 20, 2022