
(EksBlueprint name): (eks-blueprints version incompatibility) #1036

Closed
vpopiolrccl opened this issue Jul 2, 2024 · 14 comments
Labels: bug

Comments

@vpopiolrccl

Describe the bug

When running cdk deploy for a cluster previously created with @aws-quickstart/eks-blueprints v1.14.1, after upgrading to @aws-quickstart/eks-blueprints v1.15.0, I get errors in the CloudFormation events.

Expected Behavior

No changes should be made to the cluster, as nothing changed in the stack.

Current Behavior

The Cluster Provider nested stack produces this error when creating the Provider Waiter State Machine:
Resource handler returned message: "Resource of type 'AWS::Logs::LogGroup' with identifier '{"/properties/LogGroupName":"/aws/vendedlogs/states/waiter-state-machine-rcg-ecom-cluster-sandbox--ProviderframeworkisCompl-q4ar3IV7b2Li-c823a05924272663236e0df94090e3304c5d23966c"}' already exists." (RequestToken: 48ba77a8-b8d7-7e17-71f3-1e29a5cfca0d, HandlerErrorCode: AlreadyExists)

Reproduction Steps

  • Open package.json
  • Bump the version of "@aws-quickstart/eks-blueprints" to "1.15.0"
  • Bump the version of "aws-cdk-lib" to "2.147.1"
  • Bump the version of "aws-cdk" to "2.147.1" (see the sketch after this list)
  • Run npm install
  • Run cdk synth
  • Run cdk deploy
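
For reference, a minimal sketch of the relevant package.json entries after the bump (assuming aws-cdk sits in devDependencies, as is typical; all other entries omitted):

    {
      "dependencies": {
        "@aws-quickstart/eks-blueprints": "1.15.0",
        "aws-cdk-lib": "2.147.1"
      },
      "devDependencies": {
        "aws-cdk": "2.147.1"
      }
    }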

Possible Solution

No response

Additional Information/Context

No response

CDK CLI Version

2.147.1

EKS Blueprints Version

1.15.0

Node.js Version

21.6.1

Environment details (OS name and version, etc.)

Mac OS 14.5

Other information

No response

@vpopiolrccl added the bug label Jul 2, 2024
@vpopiolrccl (Author)

While doing some troubleshooting, I deleted the LogGroup that the message referred to, and after executing a new cdk deploy, it completed successfully. This doesn't resolve the issue, as we have multiple clusters running in different AWS accounts that we would like to continue maintaining with the CDK blueprints. For those other accounts, access to make this type of change (deleting a Log Group) is very restricted.
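
For anyone with sufficient permissions, the manual workaround amounts to deleting the log group with the AWS CLI; a minimal sketch, where the log group name is a placeholder to be replaced with the exact name from the error message:

    aws logs delete-log-group \
        --log-group-name "/aws/vendedlogs/states/waiter-state-machine-<your-stack-suffix>"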

@shapirov103 (Collaborator) commented Jul 2, 2024

@vpopiolrccl we just released 1.15.1 as a patch release for some of the backwards compatibility issues. Please give it a try with a cluster that was produced with 1.14.1, and if the issue persists, I will need a blueprint example to reproduce.

@vpopiolrccl (Author)

@vpopiolrccl we just released 1.15.1 as a patch release for some of the backwards compatibility issues. Please give it a try with a cluster that was produced with 1.14.1, and if the issue persists, I will need a blueprint example to reproduce.

Thanks @shapirov103. I also tried 1.15.1 before opening the issue, with the same results.

@shapirov103 (Collaborator)

Looking further into the log, I see that it is most likely related to the cluster log implementation addressing this issue: #997

Let me take a look at whether we can introduce an option to reuse the existing log group for that.

@vpopiolrccl (Author)

But that Log Group seems to belong to the Step Function used by the Custom Resource.

@shapirov103 (Collaborator)

Yes, the native CDK implementation of logging uses step functions to orchestrate log creation after the cluster is created. I am unclear about the name collision. Do I assume correctly that you have control plane logging enabled with the blueprint?

@vpopiolrccl (Author)

Do I assume correctly that you have control plane logging enabled with the blueprint?

Currently, no. But good point. I will most likely change this setting.
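
If I'm reading the 1.15.x builder API correctly, enabling it would look roughly like the sketch below (enableControlPlaneLogTypes and the ControlPlaneLogType enum are assumptions based on the eks-blueprints docs; verify them against your installed version):

    import * as cdk from 'aws-cdk-lib';
    import * as blueprints from '@aws-quickstart/eks-blueprints';

    const app = new cdk.App();

    // Sketch: turn on API and audit control plane logs for the blueprint.
    // enableControlPlaneLogTypes/ControlPlaneLogType assumed from the
    // eks-blueprints 1.15.x docs.
    blueprints.EksBlueprint.builder()
        .enableControlPlaneLogTypes(
            blueprints.ControlPlaneLogType.API,
            blueprints.ControlPlaneLogType.AUDIT)
        .build(app, 'my-cluster');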

@shapirov103 (Collaborator) commented Jul 2, 2024

Just FYI, I ran provisioning with 1.14.1 for a cluster that resembles your setup (I could not reproduce it directly, as I don't have access to your env settings and the AMI version that you use).

    import { KubernetesVersion } from 'aws-cdk-lib/aws-eks';
    import { InstanceClass, InstanceSize, InstanceType } from 'aws-cdk-lib/aws-ec2';
    import * as blueprints from '@aws-quickstart/eks-blueprints';

    const stackID = `${id}-blueprint`;

    // Managed node group cluster provider matching the reported setup.
    const clusterProps: blueprints.MngClusterProviderProps = {
        version: KubernetesVersion.V1_29,
        nodegroupName: 'my-ng',
        instanceTypes: [InstanceType.of(InstanceClass.M5, InstanceSize.LARGE)],
        minSize: 1,
        maxSize: 3
    };
    console.log(`clusterProps: ${JSON.stringify(clusterProps)}`);
    const clusterProvider = new blueprints.MngClusterProvider(clusterProps);

    blueprints.EksBlueprint.builder()
        .clusterProvider(clusterProvider)
        .addOns(
            new blueprints.AwsLoadBalancerControllerAddOn(),
            new blueprints.VpcCniAddOn(),
            new blueprints.MetricsServerAddOn(),
            new blueprints.ClusterAutoScalerAddOn(),
        )
        .teams()
        .build(scope, stackID);

Provisioned the cluster with 1.14.1, then upgraded the blueprints to 1.15.1 and reran deploy. I got no errors, and all addons were upgraded to the newer versions (e.g. load balancer, metrics server). That also confirms the experience of other customers who did not have issues with the log group when upgrading.

I will need a full blueprint example to reproduce.

@vpopiolrccl (Author)

Thanks so much @shapirov103. Looks like the problem was with 1.15.0 and not with 1.15.1. It now works for me.

@paulchambers commented Jul 3, 2024

I'm also seeing failures when going from 1.14.1 to 1.15.1; my stack does have control plane logging enabled.

Resource handler returned message: "Resource of type 'AWS::Logs::LogGroup' with identifier '{"/properties/LogGroupName":"/aws/vendedlogs/states/waiter-state-machine-STACKNAME-ProviderframeworkisCompl-S6XDAkzUUmoq-c8b1cfed19641073278d59059a5ed9e648e1781c7c"}' already exists." (RequestToken: 5fd41341-3e15-f2e5-826f-2f51001f349e, HandlerErrorCode: AlreadyExists)

@shapirov103 (Collaborator)

@paulchambers These logs are not produced by the blueprints; they represent lambda logs for the custom resources in the CDK native implementation. I see a somewhat related issue about it on the CDK repo here.
If you can drop the log groups, similarly to what vpopiolrccl described, that would resolve it. Please also consider running the latest cdk bootstrap on the account/region.
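
For completeness, re-bootstrapping an environment is a one-liner (the account id and region below are placeholders):

    cdk bootstrap aws://<account-id>/<region>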

If the problem persists, please share the blueprint to reproduce the issue.

@paulchambers

@shapirov103 manually removing the log group does clear the error, but I'm seeing it on each cluster that I upgrade to 1.15.1.

When going from 1.14.1 to 1.15.1, the first deploy fails with "No changes needed for the logging config provided" from the Custom::AWSCDK-EKS-Cluster resource.

The second attempt fails with the log group error as above.

Removing the log group then allows the deploy to succeed.

@shapirov103 (Collaborator)

@paulchambers as I mentioned in #1036 (comment), in my test I provisioned a cluster with 1.14.1, upgraded to 1.15.1, and was able to deploy successfully; all addons were updated to the latest version.
It could be an issue specific to the CDK upgrade, as these log groups are created by the CDK implementation.

If there is an example that I can use to reproduce the issue, I am happy to give it a shot; if needed, I will create an issue against CDK.

@shapirov103 (Collaborator)

About to close this; please let me know if anyone still has trouble with the step function log group.
