Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(eks): spot support for managed nodegroups #11962

Merged
merged 29 commits into from
Dec 22, 2020
Merged

feat(eks): spot support for managed nodegroups #11962

merged 29 commits into from
Dec 22, 2020

Conversation

pahud
Copy link
Contributor

@pahud pahud commented Dec 9, 2020

This PR adds the CapacityType support and allows users to create Spot managed node groups for Amazon EKS.

  1. The CapacityType attribute is supported by cloudformation but not yet documented. We tentatively use addPropertyOverride() to enable it.
  2. instanceType will be deprecated and we introduced the new instanceTypes
  3. instanceTypes with different CPU architectures will throw an error.
  4. amiType is still optional, however, when specified, incorrect amiType will throw the error.
  5. According to the document, we are allowed to specify instance type(s) in either instanceTypes property or launch template but not both. As we can't check the content of the launch template passed in, we allow instanceTypes and launch template both specified and encourage to use instanceTypes when possible.

Sample

cluster.addNodegroupCapacity('extra-ng-spot', {
  instanceTypes: [
    new ec2.InstanceType('c5.large'),
    new ec2.InstanceType('c5a.large'),
    new ec2.InstanceType('c5d.large'),
  ],
  minSize: 3,
  capacityType: eks.CapacityType.SPOT,
});

Closes #11827


By submitting this pull request, I confirm that my contribution is made under the terms of the Apache-2.0 license

@gitpod-io
Copy link

gitpod-io bot commented Dec 9, 2020

@github-actions github-actions bot added the @aws-cdk/aws-eks Related to Amazon Elastic Kubernetes Service label Dec 9, 2020
@pahud pahud marked this pull request as draft December 9, 2020 15:41
@mergify
Copy link
Contributor

mergify bot commented Dec 9, 2020

Title does not follow the guidelines of Conventional Commits. Please adjust title before merge.

@pahud pahud marked this pull request as ready for review December 9, 2020 23:16
@pahud
Copy link
Contributor Author

pahud commented Dec 10, 2020

@iliapolo I'm ready for the first round.

@iliapolo iliapolo added the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. label Dec 13, 2020
@pahud
Copy link
Contributor Author

pahud commented Dec 14, 2020

Hi @iliapolo

Please check my comments before I continue revise it. Thanks.

@github-actions github-actions bot removed the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. label Dec 14, 2020
@iliapolo
Copy link
Contributor

@pahud see my responses

@iliapolo iliapolo added the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. label Dec 14, 2020
@pahud pahud marked this pull request as draft December 15, 2020 13:02
@github-actions github-actions bot removed the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. label Dec 16, 2020
@pahud pahud marked this pull request as ready for review December 16, 2020 10:40
@pahud
Copy link
Contributor Author

pahud commented Dec 16, 2020

@iliapolo I am ready for the 2nd round.

@pahud
Copy link
Contributor Author

pahud commented Dec 18, 2020

Hi @iliapolo

All addressed. Please check it out again.

@pahud pahud requested a review from iliapolo December 18, 2020 13:16
));
test.done();
},
'create nodegroup with ondemand capacity type'(test: Test) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this a duplicate of create nodegroup with on-demand capacity type on line 230?

Copy link
Contributor Author

@pahud pahud Dec 18, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@iliapolo

Let me change the name to create nodegroup with on-demand capacity type and multiple instance types

so we have two test cases:

  1. on-demand capacity type with one instance type in instanceTypes
  2. on-demand capacity type with multiple instance types in instanceTypes, which is also accepted by CFN. Under the hood the CFN will create an ASG with launch template with multiple instance types but only the first one will be selected. It's not documented specificly in CFN but it's also accepted.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pahud Wait, creating an on-demand node group with multiple instance types will result in a launch template with just the first instance type being selected? ignoring all the others?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hold on a few minutes. Let me paste some screenshots here later.

Copy link
Contributor Author

@pahud pahud Dec 19, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we create the MNG like this

this.cluster.addNodegroupCapacity('MNG', {
  capacityType: eks.CapacityType.ON_DEMAND,
  desiredSize: 10,
  instanceTypes: [
    new ec2.InstanceType('t3.large'),
    new ec2.InstanceType('m5.large'),
    new ec2.InstanceType('c5.large'),
  ],
})

We actually synthesize it to this

  ClusterNodegroupMNGF9B82D86:
    Type: AWS::EKS::Nodegroup
    Properties:
      ClusterName:
        Ref: Cluster9EE0221C
      NodeRole:
        Fn::GetAtt:
          - ClusterNodegroupMNGNodeGroupRole3AB840AB
          - Arn
      Subnets:
        - subnet-05b946a306d81274b
        - subnet-0ae2846156ce0118c
        - subnet-063ca7ed4aacdbad7
      AmiType: AL2_x86_64
      ForceUpdateEnabled: true
      InstanceTypes:
        - t3.large
        - m5.large
        - c5.large
      ScalingConfig:
        DesiredSize: 10
        MaxSize: 10
        MinSize: 1
      CapacityType: ON_DEMAND

And the MNG will create an ASG with LT with 100% OD + 0% Spot like:

圖片

圖片

圖片

And all the instances will be the first instance type

圖片

I believe this is expected from ASG's perspective when ASG+LT for 100% on-demand instance types, ASG will just not span multiple on-demand instance types. It's worth a note here.

Copy link
Contributor

@iliapolo iliapolo Dec 22, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pahud Its not that all the other instance types are ignored, the distribution of instance types across the ASG depends on the allocation strategy, which for on-demand is prioritized .

To my understanding this means it will try to fulfill all the capacity by trying the first instance type, but will use other types if there is still a capacity shortage and the first type is depleted.

I kind of lost sight of why we got into this discussion honestly :) - does this make a difference to us somehow?

packages/@aws-cdk/aws-eks/test/test.nodegroup.ts Outdated Show resolved Hide resolved
@iliapolo iliapolo added the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. label Dec 18, 2020
pahud and others added 2 commits December 18, 2020 23:53
Co-authored-by: Eli Polonsky <Eli.polonsky@gmail.com>
@pahud
Copy link
Contributor Author

pahud commented Dec 18, 2020

Hi @iliapolo

All fixed. Please check it out again.

@pahud pahud requested a review from iliapolo December 18, 2020 16:13
@github-actions github-actions bot removed the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. label Dec 19, 2020
@iliapolo
Copy link
Contributor

@pahud I'm good with merging i think once the conflicts are resolved.

@iliapolo iliapolo added the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. label Dec 22, 2020
@pahud
Copy link
Contributor Author

pahud commented Dec 22, 2020

Hi @iliapolo

All fixed now. :-)

packages/@aws-cdk/aws-eks/README.md Outdated Show resolved Hide resolved
packages/@aws-cdk/aws-eks/README.md Outdated Show resolved Hide resolved
packages/@aws-cdk/aws-eks/README.md Outdated Show resolved Hide resolved
@mergify
Copy link
Contributor

mergify bot commented Dec 22, 2020

Thank you for contributing! Your pull request will be updated from master and then merged automatically (do not update manually, and be sure to allow changes to be pushed to your fork).

@mergify mergify bot merged commit 6ccd00f into aws:master Dec 22, 2020
@aws-cdk-automation
Copy link
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: AutoBuildProject6AEA49D1-qxepHUsryhcu
  • Commit ID: 394cefe
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

flochaz pushed a commit to flochaz/aws-cdk that referenced this pull request Jan 5, 2021
This PR adds the `CapacityType` support and allows users to create Spot managed node groups for Amazon EKS.

1. The `CapacityType` attribute is supported by cloudformation but not yet documented. We tentatively use addPropertyOverride() to enable it.
2. `instanceType` will be deprecated and we introduced the new `instanceTypes`
3. `instanceTypes` with different CPU architectures will throw an error.
4. `amiType` is still optional, however, when specified, incorrect `amiType` will throw the error.
5. According to the [document](https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-eks-nodegroup.html#cfn-eks-nodegroup-instancetypes), we are allowed to specify instance type(s) in either `instanceTypes` property or launch template but not both. As we can't check the content of the launch template passed in, we allow `instanceTypes` and launch template both specified and encourage to use `instanceTypes` when possible.


## Sample

```ts
cluster.addNodegroupCapacity('extra-ng-spot', {
  instanceTypes: [
    new ec2.InstanceType('c5.large'),
    new ec2.InstanceType('c5a.large'),
    new ec2.InstanceType('c5d.large'),
  ],
  minSize: 3,
  capacityType: eks.CapacityType.SPOT,
});
```

Closes aws#11827 

----

*By submitting this pull request, I confirm that my contribution is made under the terms of the Apache-2.0 license*
mergify bot pushed a commit that referenced this pull request Jan 12, 2021
…that is not compatible to the default instance type (#12441)

> Note: both issues here were introduced in #11962

## Problem 1

When creating a `Nodegroup` without passing instance types, we currently default to use `t3.medium`:

https://github.com/aws/aws-cdk/blob/da1ed08a6a2de584f5ddf43dab4efbb530541419/packages/%40aws-cdk/aws-eks/lib/managed-nodegroup.ts#L294

This default is then used to calculate the expected AMI type, and assert that the configured AMI type is indeed as expected:

https://github.com/aws/aws-cdk/blob/da1ed08a6a2de584f5ddf43dab4efbb530541419/packages/%40aws-cdk/aws-eks/lib/managed-nodegroup.ts#L302-L304

However, a user might configure instance types on the launch template, and an AMI type on the nodegroup. In this scenario, we still use the default instance type to perform the validation, which will fail if the ami type is not compatible with it.

To make things worse, we don't actually use the default instance type at all, apart from the validation:

https://github.com/aws/aws-cdk/blob/da1ed08a6a2de584f5ddf43dab4efbb530541419/packages/%40aws-cdk/aws-eks/lib/managed-nodegroup.ts#L329-L330

And in-fact, this default was only introduced in this [PR](#11962), which also added the problematic validation. 

### Solution

Drop the default instance type altogether, like it was before. The new validation will only take place if the user explicitly configured both `instanceTypes` and `amiType` on the nodegroup. Since the default value was never actually used, this doesn't incur any behavior change.

## Problem 2

When a launch template is used, we currently ignore the value of `amiType` explicitly passed by the user:

https://github.com/aws/aws-cdk/blob/da1ed08a6a2de584f5ddf43dab4efbb530541419/packages/%40aws-cdk/aws-eks/lib/managed-nodegroup.ts#L324-L325

This behavior means that users who configured a launch template without a custom ami, and passing an `amiType` to the nodegroup, would now result in no ami specification at all, defaulting to whatever EKS does, which might not be what the user had in mind.

There's no good reason to do this, we should either throw a validation error if both are used, or pass the explicit value nevertheless, even though it might cause problems.

### Solution

When a user explicitly passes an AMI type, just use it and assume the user knows what he/she is doing. When a user does not explicitly pass it, only apply the default if a launch template is not used. 

> If we apply the default in the presence of a launch template, a user would not be able to escape if they also have a custom AMI in the launch template.

This change means that users who previously "relied" on this override, might now experience a deployment failure if they are using a custom AMI in the launch template, those users can resolve the problem by removing the `amiType` property from the nodegroup (since it wasn't used, its not needed). I don't imagine many such users exist since this behavior is new and it doesn't make much sense to configure both a custom AMI and an `amiType`.

--------------------

Fixes #12389

BREAKING CHANGE: Explicitly passing `amiType` to nodegroups will now take affect even if a launch template is configured as well. If your launch template contains a custom AMI, this will cause a deployment failure, to resolve, remove the explicit `amiType` from the nodegroup configuration.

----

*By submitting this pull request, I confirm that my contribution is made under the terms of the Apache-2.0 license*
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
@aws-cdk/aws-eks Related to Amazon Elastic Kubernetes Service response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

(aws-eks): managed nodegroup for spot instances
4 participants