
Nodegroup scaling #254

Merged
richardcase merged 6 commits into master from nodegroup-scaling on Oct 16, 2018

Conversation

@richardcase (Contributor) commented Oct 13, 2018

Description

Initial version of nodegroup scaling has been added. This scales
by modifying the CloudFormation template for the nodegroup. The
modified template is used to create a changeset that is then
executed.

When scaling down/in (i.e. reducing the number of nodes) we rely
solely on the resulting change to the ASG. This means that the
node(s) to be terminated aren't drained, so pods running on the
terminating nodes may encounter errors. In the future we may consider
picking the EC2 instances to terminate, draining those nodes first,
and then creating a termination policy to ensure those nodes are the
ones removed.

This supersedes #191 and relates to issue #116.
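
As a rough illustration of the changeset flow the description mentions, here is a minimal sketch against the AWS SDK for Go (v1). The helper name, the changeset naming scheme, and the CAPABILITY_IAM setting are assumptions for the example, not the PR's actual code.

package scale

import (
	"fmt"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/cloudformation"
)

// applyScaledTemplate creates a changeset for the nodegroup stack from an
// already-modified template body, waits for the changeset to be ready,
// executes it, and then waits for the resulting stack update to finish.
func applyScaledTemplate(svc *cloudformation.CloudFormation, stackName, templateBody string) error {
	changesetName := fmt.Sprintf("eksctl-scale-nodegroup-%d", time.Now().Unix())

	if _, err := svc.CreateChangeSet(&cloudformation.CreateChangeSetInput{
		StackName:     aws.String(stackName),
		ChangeSetName: aws.String(changesetName),
		TemplateBody:  aws.String(templateBody),
		Capabilities:  []*string{aws.String(cloudformation.CapabilityCapabilityIam)},
	}); err != nil {
		return err
	}

	describeInput := &cloudformation.DescribeChangeSetInput{
		StackName:     aws.String(stackName),
		ChangeSetName: aws.String(changesetName),
	}
	if err := svc.WaitUntilChangeSetCreateComplete(describeInput); err != nil {
		return err
	}

	if _, err := svc.ExecuteChangeSet(&cloudformation.ExecuteChangeSetInput{
		StackName:     aws.String(stackName),
		ChangeSetName: aws.String(changesetName),
	}); err != nil {
		return err
	}

	// Executing the changeset only starts the update; wait for the stack
	// (and hence the ASG's desired capacity) to settle.
	return svc.WaitUntilStackUpdateComplete(&cloudformation.DescribeStacksInput{
		StackName: aws.String(stackName),
	})
}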

Todo:

  • CloudFormation template modification
  • Changeset implementation
  • Workload testing whilst scaling up
  • Workload testing whilst scaling down

Checklist

  • Code compiles correctly (i.e. make build)
  • Added tests that cover your change (if possible)
  • All tests passing (i.e. make test)
  • Added/modified documentation as required (such as the README)
  • Added yourself to the humans.txt file

@richardcase force-pushed the nodegroup-scaling branch 2 times, most recently from 2b459d9 to cc9c25b on October 15, 2018 10:02
@richardcase changed the title from "WIP: Nodegroup scaling" to "Nodegroup scaling" on Oct 15, 2018
@errordeveloper (Contributor) left a comment

Broadly, LGTM. I'll test it and have another look locally, as I'd like to better understand how it works. I may add some cosmetic changes, and probably a few lines in the docs. Thanks a lot, I think we should be able to merge and release this soon enough! 👍 🥇

fs := cmd.Flags()

fs.StringVarP(&cfg.ClusterName, "name", "n", "", "EKS cluster name")
fs.IntVarP(&cfg.Nodes, "nodes", "N", 0, "total number of nodes (scale to this number)")

}
logger.Debug("changes = %#v", changeset.Changes)
if err := c.doExecuteChangeset(stackName, changesetName); err != nil {
logger.Warning("error executing Cloudformation changeset %s in stack %s. Check the Cloudformation console for further details", changesetName, stackName)

cmd/eksctl/scale.go (review thread resolved)
logger.Debug("describeChangesetErr=%v", err)
} else {
logger.Critical("unexpected status %q while %s", *s.Status, msg)
c.troubleshootStackFailureCause(i, desiredStatus)

@errordeveloper (Contributor) commented Oct 15, 2018 via email

@richardcase (Contributor, Author) replied:

> Scaling to zero is a legit use-case actually (basically a way to save money), we shouldn't prevent it as such. I just think there should be no default value, that's all.

Good point. I'll change that.
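
A minimal sketch of one way to drop the default while keeping 0 as a valid value, using pflag's Changed() check; the command wiring and error message here are illustrative assumptions, not the change that was eventually made.

package main

import (
	"fmt"

	"github.com/spf13/cobra"
)

func newScaleNodeGroupCmd() *cobra.Command {
	var nodes int
	cmd := &cobra.Command{
		Use: "nodegroup",
		RunE: func(cmd *cobra.Command, args []string) error {
			// Changed() reports whether the user passed the flag at all,
			// so "--nodes 0" is accepted while omitting --nodes fails.
			if !cmd.Flags().Changed("nodes") {
				return fmt.Errorf("--nodes is required")
			}
			fmt.Printf("scaling nodegroup to %d nodes\n", nodes)
			return nil
		},
	}
	cmd.Flags().IntVarP(&nodes, "nodes", "N", 0, "total number of nodes (scale to this number)")
	return cmd
}

func main() {
	if err := newScaleNodeGroupCmd().Execute(); err != nil {
		fmt.Println(err)
	}
}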

@errordeveloper (Contributor)

@richardcase so annoying GitHub doesn't attach email reply to the thread... sorry about that, I was hoping they fixed it.

)

const (
desirecCapacityPath = "Resources.NodeGroup.Properties.DesiredCapacity"
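
That constant reads as a dotted JSON path into the nodegroup's CloudFormation template. A minimal sketch of how such a path could be used to rewrite the desired capacity, assuming a path-based JSON library such as github.com/tidwall/sjson (the excerpt doesn't confirm which library the PR uses):

package scale

import (
	"strconv"

	"github.com/tidwall/sjson"
)

const desiredCapacityPath = "Resources.NodeGroup.Properties.DesiredCapacity"

// setDesiredCapacity returns a copy of the template body with the ASG's
// DesiredCapacity replaced. CloudFormation accepts the value as a string,
// so the node count is stringified before it is written.
func setDesiredCapacity(templateBody string, nodes int) (string, error) {
	return sjson.Set(templateBody, desiredCapacityPath, strconv.Itoa(nodes))
}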

logger.Warning("error executing Cloudformation changeset %s in stack %s. Check the Cloudformation console for further details", changesetName, stackName)
return err
}
return c.doWaitUntilStackIsUpdated(i)

@Lazyshot

We may want to go ahead and add integration tests for this:

Scale to 2 nodes, check for kubernetes nodes via kube api, scale back down to 1 (no wait?)

@richardcase (Contributor, Author)

> We may want to go ahead and add integration tests for this:
> Scale to 2 nodes, check for kubernetes nodes via kube api, scale back down to 1 (no wait?)

That would be very nice. I'll also add this.
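
For reference, a minimal sketch of what the suggested integration test could look like. The eksctl invocation and flag names follow the diff shown earlier, the environment variables are hypothetical, and a recent client-go is assumed for the node listing.

package integration

import (
	"context"
	"os"
	"os/exec"
	"testing"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// scaleNodegroup shells out to eksctl; the flag names mirror the diff above.
func scaleNodegroup(t *testing.T, clusterName, nodes string) {
	t.Helper()
	cmd := exec.Command("eksctl", "scale", "nodegroup", "--name", clusterName, "--nodes", nodes)
	if out, err := cmd.CombinedOutput(); err != nil {
		t.Fatalf("scale to %s nodes failed: %v\n%s", nodes, err, out)
	}
}

func TestNodegroupScaling(t *testing.T) {
	clusterName := os.Getenv("EKSCTL_TEST_CLUSTER")

	scaleNodegroup(t, clusterName, "2")

	// Check the node count via the Kubernetes API, as suggested above.
	config, err := clientcmd.BuildConfigFromFlags("", os.Getenv("KUBECONFIG"))
	if err != nil {
		t.Fatal(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		t.Fatal(err)
	}
	nodes, err := clientset.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		t.Fatal(err)
	}
	if len(nodes.Items) != 2 {
		t.Fatalf("expected 2 nodes, got %d", len(nodes.Items))
	}

	scaleNodegroup(t, clusterName, "1")
}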

func (c *StackCollection) doCreateChangesetRequest(i *Stack, action string, description string, templateBody []byte,
parameters map[string]string, withIAM bool) (string, error) {

changesetName := fmt.Sprintf("eksctl-%s-%d", action, time.Now().Unix())

}
logger.Debug("start %s", msg)
if waitErr := w.WaitWithContext(ctx); waitErr != nil {
s, err := c.describeStackChangeset(i, changesetName)

errordeveloper previously approved these changes Oct 16, 2018

@errordeveloper (Contributor) left a comment

@richardcase I'm done with nitpicking here! We can merge and add tests in another PR, then cut a release, but I'm also equally happy to see tests here - up to you :)

@Lazyshot

We can definitely have the integration tests in another PR

errordeveloper previously approved these changes Oct 16, 2018
Lazyshot previously approved these changes Oct 16, 2018
richardcase and others added 5 commits October 16, 2018 18:46
Initial version of nodegroup scaling has been added. This scales
by modifying the CloudFormation template for the nodegroup. The
modified template is used to create a changeset that is then
executed.

When scaling down/in (i.e. reducing the number of nodes) we rely
solely on the resulting change to the ASG. This means that the
node(s) to be terminated aren't drained, so pods running on the
terminating nodes may encounter errors. In the future we may consider
picking the EC2 instances to terminate, draining those nodes first,
and then creating a termination policy to ensure those nodes are the
ones removed.

Issue #116

Signed-off-by: Richard Case <richard.case@outlook.com>
If the desired capacity is greater/less than the current
max/min of the ASG then the max/min will be updated to match the
desired node count.

Signed-off-by: Richard Case <richard.case@outlook.com>
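
The commit message above describes widening the ASG bounds when the requested capacity falls outside them; a minimal sketch of that rule, using a hypothetical helper rather than the PR's actual code:

package scale

// clampScalingBounds widens the ASG's [min, max] range so that it always
// contains the requested desired capacity, leaving the bounds untouched
// when the desired value already fits between them.
func clampScalingBounds(desired, min, max int) (newMin, newMax int) {
	newMin, newMax = min, max
	if desired < newMin {
		newMin = desired
	}
	if desired > newMax {
		newMax = desired
	}
	return newMin, newMax
}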
@richardcase (Contributor, Author)

Rebased but build is failing.

@errordeveloper (Contributor)

@richardcase I've re-triggered, looks like it could be a flake...

@richardcase (Contributor, Author) commented Oct 16, 2018

Thanks @errordeveloper. Could you approve again when you get time and I'll merge it in.

@richardcase (Contributor, Author)

I've created #267 to make sure we don't forget to add the integration test.

@richardcase merged commit e229797 into master on Oct 16, 2018
@errordeveloper deleted the nodegroup-scaling branch on October 16, 2018 22:38
torredil pushed a commit to torredil/eksctl that referenced this pull request on May 20, 2022