Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

(eks): Managed node with LaunchTemplate failed to set amiType #12389

Closed
vandreykiv opened this issue Jan 6, 2021 · 3 comments · Fixed by #12441
Closed

(eks): Managed node with LaunchTemplate failed to set amiType #12389

vandreykiv opened this issue Jan 6, 2021 · 3 comments · Fixed by #12441
Assignees
Labels
@aws-cdk/aws-eks Related to Amazon Elastic Kubernetes Service bug This issue is a bug. effort/small Small work item – less than a day of effort p1

Comments

@vandreykiv
Copy link

vandreykiv commented Jan 6, 2021

After update to the latest cdk found that npm run build fails with the next error:

    at new Nodegroup (./aws-infra/node_modules/@aws-cdk/aws-eks/lib/managed-nodegroup.ts:303:13)
    at Cluster.addNodegroupCapacity (./aws-infra/node_modules/@aws-cdk/aws-eks/lib/cluster.ts:1203:12)
    at new EKSStack (./aws-infra/lib/eks-stack.ts:383:37)
    at Object.<anonymous> (./aws-infra/bin/aws-infra.ts:51:13)
    at Module._compile (node:internal/modules/cjs/loader:1108:14)
    at Module.m._compile (./aws-infra/node_modules/ts-node/src/index.ts:1056:23)
    at Module._extensions..js (node:internal/modules/cjs/loader:1137:10)
    at Object.require.extensions.<computed> [as .ts] (./aws-infra/node_modules/ts-node/src/index.ts:1059:12)
    at Module.load (node:internal/modules/cjs/loader:973:32)
    at Function.Module._load (node:internal/modules/cjs/loader:813:14)

Nodegroup config:

this.cluster.addNodegroupCapacity('gpu', {
            amiType: eks.NodegroupAmiType.AL2_X86_64_GPU,
            minSize: this.config.eks.gpuNodeGroupMinCapacity,
            desiredSize: this.config.eks.gpuNodeGroupMinCapacity,
            maxSize: this.config.eks.gpuNodeGroupMaxCapacity,
            launchTemplateSpec: {
                id: GPUlaunchTemplate.ref,
                version: GPUlaunchTemplate.attrLatestVersionNumber
            },
            nodeRole: eksInstanceRole,
            labels: {
                instance_type: 'gpu'
            },
            subnets: {
                subnetGroupName: 'Kube Private'
            }
        });

LaunchTemplate:

        const GPUlaunchTemplate = new ec2.CfnLaunchTemplate(this, 'GPUNodeGroupLaunchTemplate', {
            launchTemplateData: {
                instanceType: new ec2.InstanceType(this.config.eks.gpuNodeGroupInstanceType).toString(),
                userData: userData,
                securityGroupIds: [this.vpcStack.eksSecurityGroup.securityGroupId, this.vpcStack.esSecurityGroup.securityGroupId, this.vpcStack.rdsSecurityGroup.securityGroupId],
                blockDeviceMappings: [
                    {
                        deviceName: "/dev/xvda",
                        ebs: {
                            volumeSize: this.config.eks.gpuNodeGroupDiskSize
                        }
                    }
                ],
                monitoring: { enabled: true }
            }
        });

According to docs we can't set instanceType for LaunchTemplate and NodeGroup at the same time. If I remove instance type from LaunchTemplate and add it to NodeGroup amiType would be removed from NodeGroup and this cause NodeGroup recreation.

Resources
[~] AWS::EC2::LaunchTemplate GPUNodeGroupLaunchTemplate GPUNodeGroupLaunchTemplate
 └─ [~] LaunchTemplateData
     └─ [-] Removed: .InstanceType
[~] AWS::EC2::LaunchTemplate CPUNodeGroupLaunchTemplate CPUNodeGroupLaunchTemplate
 └─ [~] LaunchTemplateData
     └─ [-] Removed: .InstanceType
[~] AWS::EC2::LaunchTemplate SSDNodeGroupLaunchTemplate SSDNodeGroupLaunchTemplate
 └─ [~] LaunchTemplateData
     └─ [-] Removed: .InstanceType
[~] Custom::AWSCDK-EKS-KubernetesResource eks-cluster/AwsAuth/manifest/Resource eksclusterAwsAuthmanifest769BDE7D
 └─ [+] Overwrite
     └─ true
[~] AWS::EKS::Nodegroup eks-cluster/Nodegroupgpu eksclusterNodegroupgpu2C0AC521 replace
 ├─ [-] AmiType (requires replacement)
 │   └─ AL2_x86_64_GPU
 └─ [+] InstanceTypes (requires replacement)
     └─ ["g4dn.4xlarge"]
[~] AWS::EKS::Nodegroup eks-cluster/Nodegroupcpu eksclusterNodegroupcpu16E5F90A replace
 ├─ [-] AmiType (requires replacement)
 │   └─ AL2_x86_64
 └─ [+] InstanceTypes (requires replacement)
     └─ ["m5d.4xlarge"]
[~] AWS::EKS::Nodegroup eks-cluster/Nodegroupssd eksclusterNodegroupssd5E0EB251 replace
 ├─ [-] AmiType (requires replacement)
 │   └─ AL2_x86_64
 └─ [+] InstanceTypes (requires replacement)
     └─ ["c5d.xlarge"]

Reproduction Steps

        const GPUlaunchTemplate = new ec2.CfnLaunchTemplate(this, 'GPUNodeGroupLaunchTemplate', {
            launchTemplateData: {
                instanceType: new ec2.InstanceType(this.config.eks.gpuNodeGroupInstanceType).toString(),
                userData: userData,
                securityGroupIds: [this.vpcStack.eksSecurityGroup.securityGroupId, this.vpcStack.esSecurityGroup.securityGroupId, this.vpcStack.rdsSecurityGroup.securityGroupId],
                blockDeviceMappings: [
                    {
                        deviceName: "/dev/xvda",
                        ebs: {
                            volumeSize: this.config.eks.gpuNodeGroupDiskSize
                        }
                    }
                ],
                monitoring: { enabled: true }
            }
        });

        this.cluster.addNodegroupCapacity('gpu', {
            amiType: eks.NodegroupAmiType.AL2_X86_64_GPU,
            minSize: this.config.eks.gpuNodeGroupMinCapacity,
            desiredSize: this.config.eks.gpuNodeGroupMinCapacity,
            maxSize: this.config.eks.gpuNodeGroupMaxCapacity,
            launchTemplateSpec: {
                id: GPUlaunchTemplate.ref,
                version: GPUlaunchTemplate.attrLatestVersionNumber
            },
            nodeRole: eksInstanceRole,
            labels: {
                instance_type: 'gpu'
            },
            subnets: {
                subnetGroupName: 'Kube Private'
            }
        });

Environment

  • CDK CLI Version : 1.83.0 (build 827c5f4)
  • Framework Version:
  • Node.js Version: v15.5.1
  • OS : macOS 11.1
  • Language (Version): TypeScript (4.1.3)

This is 🐛 Bug Report

@vandreykiv vandreykiv added bug This issue is a bug. needs-triage This issue or PR still needs to be triaged. labels Jan 6, 2021
@vandreykiv vandreykiv changed the title (module eks): Managed node with LaunchTemplate failed to set amiType (eks): Managed node with LaunchTemplate failed to set amiType Jan 6, 2021
@github-actions github-actions bot added the @aws-cdk/aws-eks Related to Amazon Elastic Kubernetes Service label Jan 6, 2021
@iliapolo
Copy link
Contributor

@vandreykiv Thanks for reporting this.

Fix coming up: #12441

@iliapolo iliapolo added effort/small Small work item – less than a day of effort p1 and removed needs-triage This issue or PR still needs to be triaged. labels Jan 10, 2021
@iliapolo iliapolo added this to the [GA] @aws-cdk/aws-eks milestone Jan 10, 2021
@mergify mergify bot closed this as completed in #12441 Jan 12, 2021
mergify bot pushed a commit that referenced this issue Jan 12, 2021
…that is not compatible to the default instance type (#12441)

> Note: both issues here were introduced in #11962

## Problem 1

When creating a `Nodegroup` without passing instance types, we currently default to use `t3.medium`:

https://github.com/aws/aws-cdk/blob/da1ed08a6a2de584f5ddf43dab4efbb530541419/packages/%40aws-cdk/aws-eks/lib/managed-nodegroup.ts#L294

This default is then used to calculate the expected AMI type, and assert that the configured AMI type is indeed as expected:

https://github.com/aws/aws-cdk/blob/da1ed08a6a2de584f5ddf43dab4efbb530541419/packages/%40aws-cdk/aws-eks/lib/managed-nodegroup.ts#L302-L304

However, a user might configure instance types on the launch template, and an AMI type on the nodegroup. In this scenario, we still use the default instance type to perform the validation, which will fail if the ami type is not compatible with it.

To make things worse, we don't actually use the default instance type at all, apart from the validation:

https://github.com/aws/aws-cdk/blob/da1ed08a6a2de584f5ddf43dab4efbb530541419/packages/%40aws-cdk/aws-eks/lib/managed-nodegroup.ts#L329-L330

And in-fact, this default was only introduced in this [PR](#11962), which also added the problematic validation. 

### Solution

Drop the default instance type altogether, like it was before. The new validation will only take place if the user explicitly configured both `instanceTypes` and `amiType` on the nodegroup. Since the default value was never actually used, this doesn't incur any behavior change.

## Problem 2

When a launch template is used, we currently ignore the value of `amiType` explicitly passed by the user:

https://github.com/aws/aws-cdk/blob/da1ed08a6a2de584f5ddf43dab4efbb530541419/packages/%40aws-cdk/aws-eks/lib/managed-nodegroup.ts#L324-L325

This behavior means that users who configured a launch template without a custom ami, and passing an `amiType` to the nodegroup, would now result in no ami specification at all, defaulting to whatever EKS does, which might not be what the user had in mind.

There's no good reason to do this, we should either throw a validation error if both are used, or pass the explicit value nevertheless, even though it might cause problems.

### Solution

When a user explicitly passes an AMI type, just use it and assume the user knows what he/she is doing. When a user does not explicitly pass it, only apply the default if a launch template is not used. 

> If we apply the default in the presence of a launch template, a user would not be able to escape if they also have a custom AMI in the launch template.

This change means that users who previously "relied" on this override, might now experience a deployment failure if they are using a custom AMI in the launch template, those users can resolve the problem by removing the `amiType` property from the nodegroup (since it wasn't used, its not needed). I don't imagine many such users exist since this behavior is new and it doesn't make much sense to configure both a custom AMI and an `amiType`.

--------------------

Fixes #12389

BREAKING CHANGE: Explicitly passing `amiType` to nodegroups will now take affect even if a launch template is configured as well. If your launch template contains a custom AMI, this will cause a deployment failure, to resolve, remove the explicit `amiType` from the nodegroup configuration.

----

*By submitting this pull request, I confirm that my contribution is made under the terms of the Apache-2.0 license*
@github-actions
Copy link

⚠️COMMENT VISIBILITY WARNING⚠️

Comments on closed issues are hard for our team to see.
If you need more assistance, please either tag a team member or open a new issue that references this one.
If you wish to keep having a conversation with other community members under this issue feel free to do so.

@iliapolo
Copy link
Contributor

@vandreykiv When the next version of the CDK comes, you should be able to use with the original code you had and it won't cause any errors or replacements.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
@aws-cdk/aws-eks Related to Amazon Elastic Kubernetes Service bug This issue is a bug. effort/small Small work item – less than a day of effort p1
Projects
None yet
2 participants