
(aws-autoscaling): machine ami id not caching to cdk.context.json #12484

Closed
andreialecu opened this issue Jan 13, 2021 · 20 comments · Fixed by #16021
Labels: bug (This issue is a bug.), documentation (This is a problem with documentation.), p1

Comments

@andreialecu
Contributor

andreialecu commented Jan 13, 2021

We are running a cdk deploy '*' step as part of CI.

Even if there are no stack changes whatsoever, every 10 days or so cdk starts recreating an autoscaling group, thus breaking the ECS cluster and resulting in downtime until everything is built again:

Here are relevant logs:

Stack ARN:
arn:aws:cloudformation:*********:<...>:stack/App-AnalyticsApi-ECR-Production/39030a30-03d5-11eb-9455-02ea080ae8b9
App-AnalyticsEcsCluster2-Production
App-AnalyticsEcsCluster2-Production: deploying...
App-AnalyticsEcsCluster2-Production: creating CloudFormation changeset...
 0/3 | 9:02:13 PM | UPDATE_IN_PROGRESS   | AWS::CloudFormation::Stack            | App-AnalyticsEcsCluster2-Production User Initiated
 1/3 | 9:02:19 PM | UPDATE_IN_PROGRESS   | AWS::AutoScaling::LaunchConfiguration | Cluster/Scaling/LaunchConfig (ClusterScalingLaunchConfig3E9D5827) Requested update requires the creation of a new physical resource; hence creating one.
 1/3 | 9:02:20 PM | UPDATE_IN_PROGRESS   | AWS::AutoScaling::LaunchConfiguration | Cluster/Scaling/LaunchConfig (ClusterScalingLaunchConfig3E9D5827) Resource creation Initiated
 1/3 | 9:02:20 PM | UPDATE_COMPLETE      | AWS::AutoScaling::LaunchConfiguration | Cluster/Scaling/LaunchConfig (ClusterScalingLaunchConfig3E9D5827) 
 2/3 | 9:02:24 PM | UPDATE_IN_PROGRESS   | AWS::AutoScaling::AutoScalingGroup    | Cluster/Scaling/ASG (ClusterScalingASGE8638730) Requested update requires the creation of a new physical resource; hence creating one.
 2/3 | 9:02:24 PM | UPDATE_IN_PROGRESS   | AWS::AutoScaling::AutoScalingGroup    | Cluster/Scaling/ASG (ClusterScalingASGE8638730) Resource creation Initiated
 2/3 | 9:02:25 PM | UPDATE_COMPLETE      | AWS::AutoScaling::AutoScalingGroup    | Cluster/Scaling/ASG (ClusterScalingASGE8638730) 
 2/3 | 9:02:27 PM | UPDATE_COMPLETE_CLEA | AWS::CloudFormation::Stack            | App-AnalyticsEcsCluster2-Production 
 2/3 | 9:02:28 PM | DELETE_IN_PROGRESS   | AWS::AutoScaling::AutoScalingGroup    | Cluster/Scaling/ASG (ClusterScalingASGE8638730) 
2/3 Currently in progress: App-AnalyticsEcsCluster2-Production, ClusterScalingASGE8638730
 1/3 | 9:06:02 PM | DELETE_COMPLETE      | AWS::AutoScaling::AutoScalingGroup    | Cluster/Scaling/ASG (ClusterScalingASGE8638730) 
 1/3 | 9:06:03 PM | DELETE_IN_PROGRESS   | AWS::AutoScaling::LaunchConfiguration | Cluster/Scaling/LaunchConfig (ClusterScalingLaunchConfig3E9D5827) 
 1/3 | 9:06:03 PM | DELETE_COMPLETE      | AWS::AutoScaling::LaunchConfiguration | Cluster/Scaling/LaunchConfig (ClusterScalingLaunchConfig3E9D5827) 
 1/3 | 9:06:04 PM | UPDATE_COMPLETE      | AWS::CloudFormation::Stack            | App-AnalyticsEcsCluster2-Production 

 ✅  App-AnalyticsEcsCluster2-Production

It's always the same output.

Note that the cluster uses spot instances, which may be relevant.

Reproduction Steps

Potentially relevant stack code:

import * as cdk from "@aws-cdk/core";
import * as ecs from "@aws-cdk/aws-ecs";
import * as ec2 from "@aws-cdk/aws-ec2";
import { VpcStack } from "./vpc-stack";

export class EcsClusterStack extends cdk.Stack {
  cluster: ecs.Cluster;
  constructor(
    scope: cdk.App,
    id: string,
    props: cdk.StackProps,
    {
      vpcStack,
      instanceType,
      spotPrice,
      subnetType,
    }: {
      vpcStack: VpcStack;
      instanceType: ec2.InstanceType;
      spotPrice: string;
      subnetType?: ec2.SubnetType;
    }
  ) {
    super(scope, id, props);

    // Create an ECS cluster
    this.cluster = new ecs.Cluster(this, "Cluster", {
      vpc: vpcStack.vpc,
    });

    // Add capacity to it
    this.cluster.addCapacity("Scaling", {
      maxCapacity: 2,
      minCapacity: 2,
      //desiredCapacity: 2,
      taskDrainTime: cdk.Duration.seconds(0),
      vpcSubnets: { subnetType },
      instanceType,
      spotPrice,
      // Enable the Automated Spot Draining support for Amazon ECS
      spotInstanceDraining: true,
    });
  }
}
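
For context, the stack gets wired up roughly like this (a sketch only; the VpcStack constructor signature and the concrete instance type, spot price and subnet values are illustrative, not copied from our app):

// Sketch of instantiating the stack in the CDK app (values illustrative;
// VpcStack's constructor is assumed here to take just (scope, id)).
const app = new cdk.App();
const vpcStack = new VpcStack(app, "Vpc-Production");

new EcsClusterStack(app, "App-AnalyticsEcsCluster2-Production", {}, {
  vpcStack,
  instanceType: new ec2.InstanceType("t3.medium"),
  spotPrice: "0.05",
  subnetType: ec2.SubnetType.PRIVATE,
});

app.synth();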

What did you expect to happen?

No stack updates to be issued. No downtime.

What actually happened?

Stack is updated unnecessarily. Downtime for 10+ minutes.

Environment

  • CDK CLI Version : 1.71.0
  • Framework Version: 1.71.0

Other

It happens on CircleCI, but also when running locally after not being run for a while. I believe it is around 10 days, but I'm not completely sure.


This is a 🐛 Bug Report

andreialecu added the bug and needs-triage labels on Jan 13, 2021
@andreialecu
Contributor Author

Note that I have recently updated from 1.71.0 to 1.84.0, but while looking at the changelog I didn't notice anything relevant that might fix this.

The problem has existed for a long time, but it was a development cluster so it wasn't too important.

@andreialecu
Contributor Author

This is the JSON changeset listed in the CloudFormation console:

[
  {
    "resourceChange": {
      "logicalResourceId": "ClusterScalingASGE8638730",
      "action": "Modify",
      "physicalResourceId": "App-AnalyticsEcsCluster2-Production-ClusterScalingASGE8638730-D5WKYH5KZXGB",
      "resourceType": "AWS::AutoScaling::AutoScalingGroup",
      "replacement": "Conditional",
      "moduleInfo": null,
      "details": [
        {
          "target": {
            "name": "LaunchConfigurationName",
            "requiresRecreation": "Conditionally",
            "attribute": "Properties"
          },
          "causingEntity": "ClusterScalingLaunchConfig3E9D5827",
          "evaluation": "Static",
          "changeSource": "ResourceReference"
        }
      ],
      "changeSetId": null,
      "scope": [
        "Properties"
      ]
    },
    "type": "Resource"
  },
  {
    "resourceChange": {
      "logicalResourceId": "ClusterScalingLaunchConfig3E9D5827",
      "action": "Modify",
      "physicalResourceId": "App-AnalyticsEcsCluster2-Production-ClusterScalingLaunchConfig3E9D5827-1R99WCAZR3PS8",
      "resourceType": "AWS::AutoScaling::LaunchConfiguration",
      "replacement": "True",
      "moduleInfo": null,
      "details": [
        {
          "target": {
            "name": "ImageId",
            "requiresRecreation": "Always",
            "attribute": "Properties"
          },
          "causingEntity": null,
          "evaluation": "Static",
          "changeSource": "DirectModification"
        }
      ],
      "changeSetId": null,
      "scope": [
        "Properties"
      ]
    },
    "type": "Resource"
  }
]

ImageId seems to be changing


@andreialecu
Contributor Author

andreialecu commented Feb 17, 2021

Reopening. This is still an issue.

The auto-scaling group is being recreated every time there's an update to the underlying Linux image.

It is NOT being cached to the metadata file as mentioned in the documentation. (cdk.context.json contains no value related to a pinned image)

andreialecu reopened this on Feb 17, 2021
andreialecu changed the title from "(aws-ecs): recreates cluster every 10(?) days resulting in downtime" to "(aws-ecs): recreates cluster when Linux image is updated, resulting in downtime" on Feb 17, 2021
peterwoodworth added the @aws-cdk/aws-ecs label on Feb 17, 2021
SoManyHs added the guidance and p2 labels and removed the needs-triage label on Feb 25, 2021
@andreialecu
Contributor Author

This is still happening in all our clusters. The machine ami id is definitely not being cached to cdk.context.json.

Am I doing something wrong? Could someone point me to the relevant section of code inside aws-cdk where the value is supposed to be written? I could try debugging it myself.

peterwoodworth removed the guidance label on Jul 8, 2021
peterwoodworth changed the title from "(aws-ecs): recreates cluster when Linux image is updated, resulting in downtime" to "(aws-autoscaling): machine ami id not caching to cdk.context.json" on Jul 30, 2021
peterwoodworth removed the @aws-cdk/aws-ecs label on Jul 30, 2021
github-actions bot added the @aws-cdk/aws-autoscaling label on Jul 30, 2021
peterwoodworth added the p1 and documentation labels and removed the @aws-cdk/aws-autoscaling and p2 labels on Jul 30, 2021
@peterwoodworth
Contributor

This section in the docs states that:

NOTE: The Amazon Linux images selected will be cached in your cdk.json, so that your AutoScalingGroups don't automatically change out from under you when you're making unrelated changes. To update to the latest version of Amazon Linux, remove the cache entry from the context section of your cdk.json.

@rix0rrr was this how it previously worked? I can't get my machine AMI ID to cache, so this seems to be an error in our documentation. The only things that cache to cdk.context.json are the context methods, so this note should be updated or removed.
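
For comparison, a context-method lookup is the kind of call that does get written to cdk.context.json and stays pinned until the entry is removed. A minimal sketch (the AMI name and owner filter values here are just illustrative):

// Sketch: MachineImage.lookup() goes through the context provider, so the
// resolved AMI ID is cached in cdk.context.json.
const pinnedImage = ec2.MachineImage.lookup({
  name: "amzn2-ami-ecs-hvm-*-x86_64-ebs",
  owners: ["amazon"],
});

// The SSM-parameter-based image that addCapacity defaults to, by contrast,
// is resolved again on every deployment and never touches cdk.context.json.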

@rix0rrr
Contributor

rix0rrr commented Aug 11, 2021

@peterwoodworth the docs you are referring to are about the aws-autoscaling module, and refer to the examples shown just above (which are where the user is in control of the MachineImage instantiation). The code the OP is talking about though is using the aws-ecs module.

Specifically, they are getting an auto-updating image because they're not passing a machineImage parameter to addCapacity, and the default is an auto-updating Amazon Linux 2 image: https://docs.aws.amazon.com/cdk/api/latest/docs/@aws-cdk_aws-ecs.AddCapacityOptions.html#machineimage

I'll agree that the docs don't make this very clear. Explicitly picking a non-updating AmazonLinux image would do it.

I am more interested in why this is causing downtime. The ECS cluster instances should be replaced one at a time in a rolling update, and ECS should be rescheduling Tasks onto other machines to compensate.

@rix0rrr
Contributor

rix0rrr commented Aug 11, 2021

The most trivial way to fix this, to start, is by making the ECS docs more clear.

@rix0rrr
Contributor

rix0rrr commented Aug 12, 2021

Also, by default I think we enable instance draining, as explained here: https://blog.alterway.fr/en/update-your-ecs-container-instances-with-no-downtime.html

So I'm even more mystified by the downtime

@andreialecu
Contributor Author

andreialecu commented Aug 12, 2021

I ended up with the following workaround in the meantime:

// pin image id:
// aws ssm get-parameters --names /aws/service/ecs/optimized-ami/amazon-linux-2/recommended --region eu-west-1
const machineImage = new ec2.GenericLinuxImage({
  // latest as of 23 july 2021
  "eu-west-1": "ami-0ea9d963ed3029ca3",
});

Then specifically pinned this machine image in the addCapacity call.
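
Roughly like this (a sketch of the same addCapacity call from the repro stack above, just with the image passed explicitly):

// Sketch: identical to the repro stack's addCapacity call, except the
// pinned GenericLinuxImage is passed explicitly so the launch
// configuration stops tracking the SSM-recommended AMI.
this.cluster.addCapacity("Scaling", {
  maxCapacity: 2,
  minCapacity: 2,
  machineImage,
  taskDrainTime: cdk.Duration.seconds(0),
  vpcSubnets: { subnetType },
  instanceType,
  spotPrice,
  spotInstanceDraining: true,
});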

@rix0rrr there's definitely downtime as the cluster goes down then back up. It's not long, but it can be 5-10 minutes while the load balancer responds with 502.

Because this is the default and there's no mention of this caveat in the documentation, I think this is a pretty serious problem. Our system needs to run with 99.999% availability, and these 5-10 minutes of random downtime can cause serious problems.

@andreialecu
Contributor Author

andreialecu commented Aug 12, 2021

@rix0rrr FWIW, maybe related, but if you look in the original repro code I mentioned, we use taskDrainTime: cdk.Duration.seconds(0), and spot instances.

@rix0rrr
Contributor

rix0rrr commented Aug 12, 2021

Aha, well, that would explain it: without draining you would always get a service interruption while your instances are being replaced. You will run into the same whenever you decide to upgrade your AL AMI in the future.

I see the option spotInstanceDraining. The name seems to indicate it would apply for your situation, but I have to confess I'm not sure how it works.

@rix0rrr
Contributor

rix0rrr commented Aug 12, 2021

Not sure why spot instance draining is disabled by default 😭

@andreialecu
Contributor Author

andreialecu commented Aug 12, 2021

You will run into the same whenever you decide to upgrade your AL AMI in the future.

Yes, I'm aware of that and will schedule it for maintenance windows. 🙂

The main problem was that this was completely unexpected, and for a good amount of time I didn't understand where the downtime was originating, because, as mentioned, it was being deployed as part of CI, yet there were NO stack changes most of the time.

@andreialecu
Contributor Author

Not sure why spot instance draining is disabled by default 😭

Note that even though it's disabled by default, enabling it doesn't help at all. It's possible that taskDrainTime of 0 seconds is the culprit.

I don't remember the exact reason for changing it to 0 seconds, but I think without it things would hang for up to 5 minutes in certain cases. So we preferred to simply nuke things and quickly start over (and not wait to drain anything).

rix0rrr added a commit that referenced this issue on Aug 12, 2021:
Most `MachineImage` implementations look up AMIs from SSM Parameters,
and by default they will all look up the Parameters on each deployment.

This leads to instance replacement. Since we already know the SSM
Parameter Name and CDK already has a cached SSM context lookup, it
should be simple to get a stable AMI ID. This is not ideal because the
AMI will grow outdated over time, but users should have the option to
pick non-updating images in a convenient way.

Fixes #12484.
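
Once that change is in, opting in would presumably look something like the following (a sketch; the cachedInContext option and the fact that it hangs off the ECS-optimized image are assumptions based on the commit message above, not verified API):

// Sketch (assumption: the fix exposes a cachedInContext flag on the
// SSM-backed machine images). The SSM parameter would then be resolved once
// through the context provider and pinned in cdk.context.json.
this.cluster.addCapacity("Scaling", {
  maxCapacity: 2,
  minCapacity: 2,
  instanceType,
  spotPrice,
  machineImage: ecs.EcsOptimizedImage.amazonLinux2(ecs.AmiHardwareType.STANDARD, {
    cachedInContext: true,
  }),
});
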
@rix0rrr
Contributor

rix0rrr commented Aug 12, 2021

I don't suppose you're feeling like helping us figure out what's wrong with the draining config? 😉

@andreialecu
Contributor Author

I'd be glad to try tomorrow, if you can give me some instructions on what to look for.

I didn't realise until you mentioned it that there was something wrong with the draining config. 🙂

@rix0rrr
Contributor

rix0rrr commented Aug 13, 2021

The thing is, I'm not THAT much of an expert. I don't operate a cluster myself, I don't have a host of production-like tasks ready to try, and I'm not intricately familiar with ECS. To pass it on to ECS team for investigation or clarification, I think we need more of a smoking gun.

The first thing I would do is look at the various logs (application logs, CloudWatch logs) to see whether shutdown signals were sent properly, when the instances were brought up, when tasks were scheduled on the new ones and terminated on the old ones, etc., and see if I can figure something out.

@andreialecu
Contributor Author

Here's the sequence of events:

FooBar-All-EcsCluster-Production
FooBar-All-EcsCluster-Production: deploying...
FooBar-All-EcsCluster-Production: creating CloudFormation changeset...
[██████████████████████████████████████████████████████████] (3/3)

11:12:49 | UPDATE_COMPLETE_CLEA | AWS::CloudFormation::Stack            | FooBar-All-EcsCluster-Production
11:12:50 | DELETE_IN_PROGRESS   | AWS::AutoScaling::AutoScalingGroup    | Cluster/Scaling/ASG

Notice new spot requests: [screenshot, 2021-08-13 11:15]

Old EC2 spot instances shutting down: [screenshot, 11:16]

And terminated: [screenshot, 11:14]

In the ECS console the old ones get shut down first: [screenshot, 11:19]

While the new ones are pending: [screenshot, 11:20]

Maybe it's related to my taskDrainTime being 0. I notice that:

⚠️ Deprecated: The lifecycle draining hook is not configured if using the EC2 Capacity Provider. Enable managed termination protection instead.

And the managed termination protection mentions:

When managed termination protection is enabled, Amazon ECS prevents the Amazon EC2 instances in an Auto Scaling group that contain tasks from being terminated during a scale-in action. The Auto Scaling group and each instance in the Auto Scaling group must have instance protection from scale-in actions enabled as well. For more information, see Instance Protection in the AWS Auto Scaling User Guide.

When managed termination protection is disabled, your Amazon EC2 instances are not protected from termination when the Auto Scaling group scales in.

I think we still prefer a task drain time of 0 for actual deployments though. Otherwise we risk having two concurrently running versions of our services for a short time, which can cause issues during deployments.

Several seconds of downtime is fine when we actually deploy updates on purpose, but having the autoscaling group replace instances at random times due to the AMI ID update was not something we were expecting.
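
For reference, the non-deprecated route that warning points at seems to be registering the Auto Scaling group as an ECS capacity provider with managed termination protection. A rough sketch (the AsgCapacityProvider construct and its option names here are my assumptions, not something verified in this thread):

// Sketch (assumed constructs/options: AsgCapacityProvider,
// addAsgCapacityProvider, enableManagedTerminationProtection).
// Requires: import * as autoscaling from "@aws-cdk/aws-autoscaling";
// spotPrice is omitted here to keep the sketch minimal.
const asg = new autoscaling.AutoScalingGroup(this, "Asg", {
  vpc: vpcStack.vpc,
  instanceType,
  machineImage: ecs.EcsOptimizedImage.amazonLinux2(),
  minCapacity: 2,
  maxCapacity: 2,
});

const capacityProvider = new ecs.AsgCapacityProvider(this, "CapacityProvider", {
  autoScalingGroup: asg,
  enableManagedTerminationProtection: true,
});
this.cluster.addAsgCapacityProvider(capacityProvider);
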

mergify bot closed this as completed in #16021 on Sep 9, 2021
mergify bot pushed a commit that referenced this issue on Sep 9, 2021 (same commit message as above)
