eks: fail to create eks nodegroup in cn-north-1 #24696

Open
BruceLuX opened this issue Mar 20, 2023 · 17 comments
Labels
@aws-cdk/aws-eks Related to Amazon Elastic Kubernetes Service bug This issue is a bug. p2

Comments

@BruceLuX

BruceLuX commented Mar 20, 2023

Describe the bug

Hi folks,

I ran into a problem when using the AWS CDK for Python to create an EKS cluster. Details below:

My local env:
(.venv) [ec2-user@ip-10-0-1-73 python-cdk]$ cdk --version
2.67.0 (build b6f7f39)
(.venv) [ec2-user@ip-10-0-1-73 python-cdk]$ python3 --version
Python 3.7.10
(.venv) [ec2-user@ip-10-0-1-73 python-cdk]$ cat /proc/version
Linux version 5.10.144-127.601.amzn2.x86_64 (mockbuild@ip-10-0-44-229) (gcc10-gcc (GCC) 10.3.1 20210422 (Red Hat 10.3.1-1), GNU ld version 2.35-21.amzn2.0.1) #1 SMP Thu Sep 29 01:11:59 UTC 2022

Here is the core code:

node_role = iam.Role.from_role_arn(self, 'eks-node-role-arn-lookup', 'arn:aws-cn:iam::xxxxxxxxxxx:role/eks-node-role')

cluster.add_nodegroup_capacity(
    nodegroup_name,
    nodegroup_name=nodegroup_name,
    instance_types=[ec2.InstanceType(instance_type)],
    min_size=1,
    max_size=3,
    capacity_type=capacity_type,
    disk_size=disk_size,
    ami_type=ami_type,
    node_role=node_role
)

If I create the node role manually and pass it in, the CDK deploys successfully. But when I remove the node_role parameter, like this:

cluster.add_nodegroup_capacity(
    nodegroup_name,
    nodegroup_name=nodegroup_name,
    instance_types=[ec2.InstanceType(instance_type)],
    min_size=1,
    max_size=2,
    capacity_type=capacity_type,
    disk_size=disk_size,
    ami_type=ami_type
)

The following error is thrown:

Resource handler returned message: "Following required service principals [ec2.amazonaws.com.cn] were not found in the trust relationships of nodeRole arn:aws-cn:iam::4123xxxxxxx:role/eks-cluster-stack-eksgitlabrunnerclusterNodegroupg-1EPH8PW36YZ3A (Service: Eks, Status Code: 400, Request ID: 6f4cc1b1-4fd2-4072-887c-abc6ddf60d58)" (RequestToken: 7c7be61d-a2a5-3e36-1a34-e6a54c71d72a, HandlerErrorCode: InvalidRequest)

But I believe the principal [ec2.amazonaws.com.cn] is the correct one for the cn-north-1 region.

Could you please help look into this?

Expected Behavior

When I do not specify the node role in the method call, I expect CDK to create the node role automatically.

Method doc : https://docs.aws.amazon.com/cdk/api/v1/python/aws_cdk.aws_eks/Cluster.html#aws_cdk.aws_eks.Cluster.add_nodegroup_capacity

Current Behavior

In the cn-north-1 region, deployment fails when CDK creates the node role itself.

I checked the trust policy of another EC2 role in my account, and the principal [ec2.amazonaws.com.cn] is configured correctly there.

It seems that CDK does not recognize this principal.

Reproduction Steps

Use the CDK code above. With node_role removed, creation fails in the cn-north-1 region.

Possible Solution

Manually create the node role and hard-code its ARN in the CDK code.
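
For anyone taking the manual route, the role's trust policy has to name the principal from the error message. Below is a minimal sketch of that document using only the Python standard library. The helper name `node_role_trust_policy` is hypothetical, and the principal is simply the one quoted in the error above, not something taken from CDK itself.

```python
import json


def node_role_trust_policy(service_principal: str) -> dict:
    """Build an IAM trust policy that lets `service_principal` assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": service_principal},
                "Action": "sts:AssumeRole",
            }
        ],
    }


# Principal quoted in the EKS error message for cn-north-1.
policy = node_role_trust_policy("ec2.amazonaws.com.cn")
print(json.dumps(policy, indent=2))
```

The printed JSON can be compared against the "Trust relationships" tab of the manually created role.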

Additional Information/Context

No response

CDK CLI Version

2.67.0

Framework Version

No response

Node.js Version

v16.18.0

OS

Amazon Linux 2

Language

Python

Language Version

3.7.10

Other information

No response

@BruceLuX BruceLuX added bug This issue is a bug. needs-triage This issue or PR still needs to be triaged. labels Mar 20, 2023
@github-actions github-actions bot added the @aws-cdk/aws-eks Related to Amazon Elastic Kubernetes Service label Mar 20, 2023
@pahud
Contributor

pahud commented Mar 20, 2023

Hi,

Let me clarify this first.

  1. Does it happen only when you update your EKS deployment by removing your custom nodeRole?
  2. Are you having this error in cn-north-1?

@pahud pahud added investigating This issue is being investigated and/or work is in progress to resolve the issue. response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. and removed needs-triage This issue or PR still needs to be triaged. bug This issue is a bug. labels Mar 20, 2023
@pahud pahud self-assigned this Mar 20, 2023
@pahud
Contributor

pahud commented Mar 20, 2023

I found the root cause here:

assumedBy: new ServicePrincipal('ec2.amazonaws.com'),

In China, this should be ec2.amazonaws.com.cn instead.
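
To make the mismatch concrete, here is a small illustrative sketch of the mapping EKS appears to expect, based only on the error message in this thread. The function name `ec2_service_principal` and the "cn-" prefix check are assumptions made for illustration; this is not how CDK resolves principals internally.

```python
def ec2_service_principal(region: str) -> str:
    """Return the EC2 service principal that EKS validated against in this thread."""
    # Assumption: China regions are recognised by their "cn-" prefix
    # (cn-north-1, cn-northwest-1); all other regions use the global name.
    if region.startswith("cn-"):
        return "ec2.amazonaws.com.cn"
    return "ec2.amazonaws.com"


for region in ("cn-north-1", "cn-northwest-1", "us-east-1"):
    print(region, "->", ec2_service_principal(region))
```

With a hard-coded 'ec2.amazonaws.com', the first two regions get a trust policy that EKS rejects.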

@github-actions github-actions bot removed the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. label Mar 20, 2023
@pahud
Contributor

pahud commented Mar 20, 2023

OK I guess #22589 broke this.

image

This was removed in #22589, but it is actually required for the AWS China regions.

@BruceLuX
Author

BruceLuX commented Mar 21, 2023

@pahud
Hi Pahud,
Many thanks for troubleshooting this.
I also ran into a similar problem when creating the EKS cluster with CDK.

Here I only create the EKS cluster, without the nodegroup or the nodegroup role. This is my CDK code:

class EksClusterStack(Stack):
    def __init__(self, scope: Construct, identifier, **kwargs):
        super().__init__(scope, identifier, **kwargs)

        vpc = ec2.Vpc.from_lookup(
            self, "my-vpc", vpc_id=vpc_id
        )

        # eks cluster
        cluster = self.create_eks_cluster(vpc)
        
        """
        CfnOutput(self, "eks-cluster-arn-export", value=cluster.cluster_name, export_name="eks-cluster-name")
        """

    def create_eks_cluster(self, vpc):
        cluster = eks.Cluster(
            self,
            "eks-gitlab-runner-cluster",
            cluster_name=cluster_name,
            vpc=vpc,
            version=eks.KubernetesVersion.V1_24,
            default_capacity=0,
        )
        return cluster

I deployed the stack in cn-north-1, but it eventually rolled back.

The CloudFormation stack reported that a nested stack failed to create, so I checked the nested stack's error and found the following log:
Policy arn:aws-cn:iam::aws:policy/AmazonElasticContainerRegistryPublicReadOnly does not exist or is not attachable. (Service: AmazonIdentityManagement; Status Code: 404; Error Code: NoSuchEntity; Request ID: 99a1c2a5-992a-4e47-a0d0-357d9c73c70d; Proxy: null)

As far as I can tell, 'AmazonElasticContainerRegistryPublicReadOnly' is an AWS managed policy that exists only in the global regions; I cannot find this IAM policy in the China regions.

Could you please check whether this has the same root cause?

@pahud
Contributor

pahud commented Mar 21, 2023

Yes, I can't even deploy this to cn-north-1:

import { App, Stack, StackProps,
  aws_eks as eks,
  aws_ec2 as ec2, 
} from 'aws-cdk-lib';
import { KubectlV25Layer as KubectlLayer } from '@aws-cdk/lambda-layer-kubectl-v25';

const vpc = ec2.Vpc.fromLookup(this, 'Vpc', { isDefault: true });
const cluster = new eks.Cluster(this, 'Cluster', {
  vpc,
  version: eks.KubernetesVersion.V1_25,
  kubectlLayer: new KubectlLayer(this, 'KubectlLayer'),
})

The error message is just as you described above:

Resource handler returned message: "Following required service principals [ec2.amazonaws.com.cn] were not found in the trust relationships of nodeRole arn:aws-cn:iam::4123xxxxxxx:role/eks-cluster-stack-eksgitlabrunnerclusterNodegroupg-1EPH8PW36YZ3A (Service: Eks, Status Code: 400, Request ID: 6f4cc1b1-4fd2-4072-887c-abc6ddf60d58)" (RequestToken: 7c7be61d-a2a5-3e36-1a34-e6a54c71d72a, HandlerErrorCode: InvalidRequest)

It looks like EKS expects the EC2 service principal to be ec2.amazonaws.com.cn, but CDK provides ec2.amazonaws.com. I am still working with internal teams to get this sorted out.

@pahud pahud added bug This issue is a bug. p1 labels Mar 21, 2023
@pahud pahud changed the title Use AWS Python CDK to create EKS Cluster and Nodegroup eks: fail to create eks nodegroup in cn-north-1 Mar 22, 2023
@pahud
Contributor

pahud commented Mar 22, 2023

@Bruce-Lu674 I created #24743 for the missing AmazonElasticContainerRegistryPublicReadOnly bug FYR.

@BruceLuX
Author

@pahud Many thanks for your help.
By the way, is there an expected resolution time for this issue? I can use the old version (2.65) to create the EKS cluster, but I don't think staying on the old version is a long-term solution.

@pahud
Contributor

pahud commented Mar 24, 2023

@Bruce-Lu674 The relevant team is working on it. I don't have an ETA at this moment, but I will update here when I see the issue is fixed (hopefully very soon).

btw, are you able to successfully deploy eks with cdk 2.65 in cn-north-1?

@BruceLuX
Author

BruceLuX commented Mar 25, 2023

@pahud Yes, I can deploy the EKS Cluster via cdk v2.65 in cn-north-1.

@pahud
Contributor

pahud commented Mar 28, 2023

Hi @Bruce-Lu674

Are you able to deploy the cluster AND a nodegroup with cdk v2.65.0 in cn-north-1 like this?

const cluster = new eks.Cluster(this, 'Cluster', {
  vpc,
  version: eks.KubernetesVersion.V1_24,
  defaultCapacity: 0,
  kubectlLayer,
});
const ng = cluster.addNodegroupCapacity('NG', {
  desiredSize: 2,
});  

@BruceLuX
Author

BruceLuX commented Mar 31, 2023

Hi Pahud @pahud, yes, I can create the EKS cluster with v2.65 and v2.66, but without the Nodegroup resource.
I think it is equivalent to this:

const cluster = new eks.Cluster(this, 'Cluster', {
  vpc,
  version: eks.KubernetesVersion.V1_24,
  defaultCapacity: 0,
  kubectlLayer,
});

Here is my python code:

vpc = ec2.Vpc.from_lookup(
    self, "my-vpc", vpc_id=vpc_id
)
# eks cluster
cluster = self.create_eks_cluster(vpc)

def create_eks_cluster(self, vpc):
    cluster = eks.Cluster(
        self,
        "eks-cluster",
        cluster_name=cluster_name,
        vpc=vpc,
        default_capacity=0,
        version=eks.KubernetesVersion.V1_24
    )
    return cluster

@pahud
Contributor

pahud commented Mar 31, 2023

@Bruce-Lu674

Unfortunately I can't even successfully deploy the cluster. I'll keep diving deep for the root cause.

btw, do you have an account on the cdk.dev Slack? Can you ping me there so we can discuss the details directly?

@ItielOlenick

Hi

I am on CDK version 2.74.0, and this is still an issue.
Any updates / ETA on a fix?

Thanks

@pahud
Contributor

pahud commented Apr 20, 2023

@ItielOlenick

Looks like the EKS is expecting ec2 service principal name as ec2.amazonaws.com.cn but CDK is giving ec2.amazonaws.com. I am still working on this to get it sorted with internal teams.

We are still working with internal teams to fix this but unfortunately no ETA at this moment. I'll share the update if any.

EKS in CN has 2 additional issues as well, and we probably need to fix them before we can deploy with the latest CDK.

mergify bot pushed a commit that referenced this issue Apr 21, 2023
…Cloud regions (#25215)

Reopening this PR because #25170 was closed by accident.

As ECR Public is not available in China regions and GovCloud, `AmazonElasticContainerRegistryPublicReadOnly` IAM managed policy would not be available in those affected regions and should not be attached to the role. This PR implements a CfnCondition to determine if ECR public is available based on `Aws.Partition` of the deploying region and conditionally attach `AmazonElasticContainerRegistryPublicReadOnly` to the kubectl-provider handler role. 

This PR has been tested in the following regions:

- [x] *cn-north-1
- [x] *cn-northwest-1
- [x] us-east-1

* I can confirm the role is created correctly in cn regions but due to 
   - #24358 
   - #24696  
The cluster and nodegroup are still failing to create in CN.

Closes #24743 #24808 #25178
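
The decision described in that commit message can be sketched in plain Python. This is only an illustration of the logic: the PR itself uses a CfnCondition evaluated at deploy time, whereas the hypothetical helpers below (`should_attach_ecr_public_policy`, `managed_policy_arn`) run eagerly. The partition names `aws-cn` and `aws-us-gov` are assumed to be the ones meant by "China regions and GovCloud".

```python
# Partitions where, per the commit message, ECR Public is not available.
PARTITIONS_WITHOUT_ECR_PUBLIC = {"aws-cn", "aws-us-gov"}

ECR_PUBLIC_POLICY = "AmazonElasticContainerRegistryPublicReadOnly"


def should_attach_ecr_public_policy(partition: str) -> bool:
    """Attach the managed policy only where ECR Public exists."""
    return partition not in PARTITIONS_WITHOUT_ECR_PUBLIC


def managed_policy_arn(partition: str, name: str) -> str:
    """Build an AWS managed policy ARN in the format shown in the error log."""
    return f"arn:{partition}:iam::aws:policy/{name}"


for partition in ("aws", "aws-cn", "aws-us-gov"):
    arn = managed_policy_arn(partition, ECR_PUBLIC_POLICY)
    action = "attach" if should_attach_ecr_public_policy(partition) else "skip"
    print(f"{action}: {arn}")
```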
@pahud
Contributor

pahud commented May 3, 2023

I can confirm we can successfully deploy EKS cluster in China regions with escape hatches as below:

import { aws_eks as eks, aws_iam as iam } from 'aws-cdk-lib';
import { KubectlV26Layer as KubectlLayer } from '@aws-cdk/lambda-layer-kubectl-v26';

// `scope` and `vpc` are defined elsewhere in the stack
const cluster = new eks.Cluster(scope, 'EksCluster', {
  vpc,
  version: eks.KubernetesVersion.V1_26,
  kubectlLayer: new KubectlLayer(scope, 'KubectlLayer'),
  defaultCapacity: 2,
});

// override the service principal for the default nodegroup
overrideServicePrincipal(cluster.defaultNodegroup?.role.node.defaultChild as iam.CfnRole);

const ng = cluster.addNodegroupCapacity('NG', {
  desiredSize: 2,
});

// override the service principal for the additional nodegroup
overrideServicePrincipal(ng.role.node.defaultChild as iam.CfnRole);

// allow both the global and the China EC2 service principals to assume the role
function overrideServicePrincipal(role: iam.CfnRole) {
  role.addPropertyOverride('AssumeRolePolicyDocument.Statement.0.Principal.Service', ['ec2.amazonaws.com', 'ec2.amazonaws.com.cn']);
}

% kubectl get no
NAME                                          STATUS   ROLES    AGE     VERSION
ip-10-0-140-206.cn-north-1.compute.internal   Ready    <none>   2m34s   v1.26.2-eks-a59e1f0
ip-10-0-141-57.cn-north-1.compute.internal    Ready    <none>   2m20s   v1.26.2-eks-a59e1f0
ip-10-0-174-210.cn-north-1.compute.internal   Ready    <none>   2m34s   v1.26.2-eks-a59e1f0

This is a temporary fix for this issue from CDK.
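
Since the issue was reported from a Python project, here is a rough Python rendering of the same escape hatch. It is a sketch that has not been deployed against cn-north-1. It assumes `role` is the `aws_cdk.aws_iam.Role` that CDK created for the nodegroup, so that `role.node.default_child` is the underlying `CfnRole`. The function and constant names are illustrative.

```python
# Both principals, exactly as in the TypeScript workaround above.
EC2_PRINCIPALS = ["ec2.amazonaws.com", "ec2.amazonaws.com.cn"]
PRINCIPAL_PATH = "AssumeRolePolicyDocument.Statement.0.Principal.Service"


def override_service_principal(role) -> None:
    """Override the trust policy principal on the CfnRole behind `role`.

    Assumption: `role` was created by CDK, so `role.node.default_child`
    is the L1 CfnRole and supports `add_property_override`.
    """
    cfn_role = role.node.default_child
    cfn_role.add_property_override(PRINCIPAL_PATH, EC2_PRINCIPALS)


# Usage inside a stack, mirroring the TypeScript above (not executed here):
#   ng = cluster.add_nodegroup_capacity("NG", desired_size=2)
#   override_service_principal(ng.role)
#   if cluster.default_nodegroup is not None:
#       override_service_principal(cluster.default_nodegroup.role)
```

The function takes the role rather than the nodegroup so that it can be reused for both the default nodegroup and any additional ones.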

@pahud pahud removed the investigating This issue is being investigated and/or work is in progress to resolve the issue. label May 3, 2023
@iliapolo iliapolo added the p1.5 label May 16, 2023
@justin007755

justin007755 commented May 22, 2023

Hello @pahud ,

We are still encountering the error below when using the latest CDK version to create an EKS cluster and related resources such as Helm charts. I tested cdk-2.65.0, which works, but it is hard for us to stay on that version for other reasons. Is there an ETA or a workaround for this issue?


2023-05-19 14:12:02 UTC+0800 HandlerServiceRoleFCDC14AE
CREATE_FAILED Policy arn:aws-cn:iam::aws:policy/AmazonElasticContainerRegistryPublicReadOnly does not exist or is not attachable. (Service: AmazonIdentityManagement; Status Code: 404; Error Code: NoSuchEntity; Request ID: 8a2723e1-3330-40e4-af9c-d45b6e6aa3b3; Proxy: null)


@pahud
Contributor

pahud commented Aug 2, 2023

@justin007755 This bug should have been fixed in #25215

Please install the latest AWS CDK and let me know if it works for you.

6 participants