
(aws-autoscaling): machine ami id not caching to cdk.context.json #12484

Closed
andreialecu opened this issue Jan 13, 2021 · 20 comments · Fixed by #16021
Labels: bug (This issue is a bug.), documentation (This is a problem with documentation.), p1

Comments

@andreialecu
Contributor

andreialecu commented Jan 13, 2021

We are running a cdk deploy '*' step as part of CI.

Even if there are no stack changes whatsoever, every 10 days or so cdk starts recreating an autoscaling group, thus breaking the ECS cluster and resulting in downtime until everything is built again:

Here are relevant logs:

Stack ARN:
arn:aws:cloudformation:*********:<...>:stack/App-AnalyticsApi-ECR-Production/39030a30-03d5-11eb-9455-02ea080ae8b9
App-AnalyticsEcsCluster2-Production
App-AnalyticsEcsCluster2-Production: deploying...
App-AnalyticsEcsCluster2-Production: creating CloudFormation changeset...
 0/3 | 9:02:13 PM | UPDATE_IN_PROGRESS   | AWS::CloudFormation::Stack            | App-AnalyticsEcsCluster2-Production User Initiated
 1/3 | 9:02:19 PM | UPDATE_IN_PROGRESS   | AWS::AutoScaling::LaunchConfiguration | Cluster/Scaling/LaunchConfig (ClusterScalingLaunchConfig3E9D5827) Requested update requires the creation of a new physical resource; hence creating one.
 1/3 | 9:02:20 PM | UPDATE_IN_PROGRESS   | AWS::AutoScaling::LaunchConfiguration | Cluster/Scaling/LaunchConfig (ClusterScalingLaunchConfig3E9D5827) Resource creation Initiated
 1/3 | 9:02:20 PM | UPDATE_COMPLETE      | AWS::AutoScaling::LaunchConfiguration | Cluster/Scaling/LaunchConfig (ClusterScalingLaunchConfig3E9D5827) 
 2/3 | 9:02:24 PM | UPDATE_IN_PROGRESS   | AWS::AutoScaling::AutoScalingGroup    | Cluster/Scaling/ASG (ClusterScalingASGE8638730) Requested update requires the creation of a new physical resource; hence creating one.
 2/3 | 9:02:24 PM | UPDATE_IN_PROGRESS   | AWS::AutoScaling::AutoScalingGroup    | Cluster/Scaling/ASG (ClusterScalingASGE8638730) Resource creation Initiated
 2/3 | 9:02:25 PM | UPDATE_COMPLETE      | AWS::AutoScaling::AutoScalingGroup    | Cluster/Scaling/ASG (ClusterScalingASGE8638730) 
 2/3 | 9:02:27 PM | UPDATE_COMPLETE_CLEA | AWS::CloudFormation::Stack            | App-AnalyticsEcsCluster2-Production 
 2/3 | 9:02:28 PM | DELETE_IN_PROGRESS   | AWS::AutoScaling::AutoScalingGroup    | Cluster/Scaling/ASG (ClusterScalingASGE8638730) 
2/3 Currently in progress: App-AnalyticsEcsCluster2-Production, ClusterScalingASGE8638730
 1/3 | 9:06:02 PM | DELETE_COMPLETE      | AWS::AutoScaling::AutoScalingGroup    | Cluster/Scaling/ASG (ClusterScalingASGE8638730) 
 1/3 | 9:06:03 PM | DELETE_IN_PROGRESS   | AWS::AutoScaling::LaunchConfiguration | Cluster/Scaling/LaunchConfig (ClusterScalingLaunchConfig3E9D5827) 
 1/3 | 9:06:03 PM | DELETE_COMPLETE      | AWS::AutoScaling::LaunchConfiguration | Cluster/Scaling/LaunchConfig (ClusterScalingLaunchConfig3E9D5827) 
 1/3 | 9:06:04 PM | UPDATE_COMPLETE      | AWS::CloudFormation::Stack            | App-AnalyticsEcsCluster2-Production 

 ✅  App-AnalyticsEcsCluster2-Production

It's always the same output.

Note that the cluster uses spot instances, which may be relevant.

Reproduction Steps

Potentially relevant stack code:

import * as cdk from "@aws-cdk/core";
import * as ecs from "@aws-cdk/aws-ecs";
import * as ec2 from "@aws-cdk/aws-ec2";
import { VpcStack } from "./vpc-stack";

export class EcsClusterStack extends cdk.Stack {
  cluster: ecs.Cluster;
  constructor(
    scope: cdk.App,
    id: string,
    props: cdk.StackProps,
    {
      vpcStack,
      instanceType,
      spotPrice,
      subnetType,
    }: {
      vpcStack: VpcStack;
      instanceType: ec2.InstanceType;
      spotPrice: string;
      subnetType?: ec2.SubnetType;
    }
  ) {
    super(scope, id, props);

    // Create an ECS cluster
    this.cluster = new ecs.Cluster(this, "Cluster", {
      vpc: vpcStack.vpc,
    });

    // Add capacity to it
    this.cluster.addCapacity("Scaling", {
      maxCapacity: 2,
      minCapacity: 2,
      //desiredCapacity: 2,
      taskDrainTime: cdk.Duration.seconds(0),
      vpcSubnets: { subnetType },
      instanceType,
      spotPrice,
      // Enable the Automated Spot Draining support for Amazon ECS
      spotInstanceDraining: true,
    });
  }
}
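
For context, the stack gets wired up roughly like this (a sketch only; the VpcStack constructor signature and the concrete instance type, spot price and subnet values are illustrative, not copied from our app):

// Sketch of instantiating the stack in the CDK app (values illustrative;
// VpcStack's constructor is assumed here to take just (scope, id)).
const app = new cdk.App();
const vpcStack = new VpcStack(app, "Vpc-Production");

new EcsClusterStack(app, "App-AnalyticsEcsCluster2-Production", {}, {
  vpcStack,
  instanceType: new ec2.InstanceType("t3.medium"),
  spotPrice: "0.05",
  subnetType: ec2.SubnetType.PRIVATE,
});

app.synth();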

What did you expect to happen?

No stack updates to be issued. No downtime.

What actually happened?

Stack is updated unnecessarily. Downtime for 10+ minutes.

Environment

  • CDK CLI Version : 1.71.0
  • Framework Version: 1.71.0

Other

It happens on CircleCI, but also when running locally after not being run for a while. I believe it is around 10 days, but I'm not completely sure.


This is a 🐛 Bug Report

andreialecu added the bug and needs-triage labels on Jan 13, 2021
@andreialecu
Contributor Author

Note that I have recently updated from 1.71.0 to 1.84.0, but while looking at the changelog I didn't notice anything relevant that might fix this.

The problem has existed for a long time, but it was a development cluster so it wasn't too important.

@andreialecu
Contributor Author

This is the JSON changeset listed in the CloudFormation console:

[
  {
    "resourceChange": {
      "logicalResourceId": "ClusterScalingASGE8638730",
      "action": "Modify",
      "physicalResourceId": "App-AnalyticsEcsCluster2-Production-ClusterScalingASGE8638730-D5WKYH5KZXGB",
      "resourceType": "AWS::AutoScaling::AutoScalingGroup",
      "replacement": "Conditional",
      "moduleInfo": null,
      "details": [
        {
          "target": {
            "name": "LaunchConfigurationName",
            "requiresRecreation": "Conditionally",
            "attribute": "Properties"
          },
          "causingEntity": "ClusterScalingLaunchConfig3E9D5827",
          "evaluation": "Static",
          "changeSource": "ResourceReference"
        }
      ],
      "changeSetId": null,
      "scope": [
        "Properties"
      ]
    },
    "type": "Resource"
  },
  {
    "resourceChange": {
      "logicalResourceId": "ClusterScalingLaunchConfig3E9D5827",
      "action": "Modify",
      "physicalResourceId": "App-AnalyticsEcsCluster2-Production-ClusterScalingLaunchConfig3E9D5827-1R99WCAZR3PS8",
      "resourceType": "AWS::AutoScaling::LaunchConfiguration",
      "replacement": "True",
      "moduleInfo": null,
      "details": [
        {
          "target": {
            "name": "ImageId",
            "requiresRecreation": "Always",
            "attribute": "Properties"
          },
          "causingEntity": null,
          "evaluation": "Static",
          "changeSource": "DirectModification"
        }
      ],
      "changeSetId": null,
      "scope": [
        "Properties"
      ]
    },
    "type": "Resource"
  }
]

ImageId seems to be changing


@andreialecu
Contributor Author

andreialecu commented Feb 17, 2021

Reopening. This is still an issue.

The auto-scaling group is being recreated every time there's an update to the underlying Linux image.

It is NOT being cached to the metadata file as mentioned in the documentation. (cdk.context.json contains no value related to a pinned image)

andreialecu reopened this on Feb 17, 2021
andreialecu changed the title from "(aws-ecs): recreates cluster every 10(?) days resulting in downtime" to "(aws-ecs): recreates cluster when Linux image is updated, resulting in downtime" on Feb 17, 2021
peterwoodworth added the @aws-cdk/aws-ecs label on Feb 17, 2021
SoManyHs added the guidance and p2 labels and removed the needs-triage label on Feb 25, 2021
@andreialecu
Contributor Author

This is still happening in all our clusters. The machine ami id is definitely not being cached to cdk.context.json.

Am I doing something wrong? Could someone point me to the relevant section of code inside aws-cdk where the value is supposed to be written? I could try debugging it myself.

peterwoodworth removed the guidance label on Jul 8, 2021
peterwoodworth changed the title from "(aws-ecs): recreates cluster when Linux image is updated, resulting in downtime" to "(aws-autoscaling): machine ami id not caching to cdk.context.json" on Jul 30, 2021
peterwoodworth removed the @aws-cdk/aws-ecs label on Jul 30, 2021
github-actions bot added the @aws-cdk/aws-autoscaling label on Jul 30, 2021
peterwoodworth added the p1 and documentation labels and removed the @aws-cdk/aws-autoscaling and p2 labels on Jul 30, 2021
@peterwoodworth
Contributor

This section in the docs states that:

NOTE: The Amazon Linux images selected will be cached in your cdk.json, so that your AutoScalingGroups don't automatically change out from under you when you're making unrelated changes. To update to the latest version of Amazon Linux, remove the cache entry from the context section of your cdk.json.

@rix0rrr was this how it previously worked? I can't get my machine AMI ID to cache, so this seems to be an error in our documentation. The only things that cache to cdk.context.json are the context methods, so this note should be updated or removed.
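
For comparison, a context-method lookup is the kind of call that does get written to cdk.context.json and stays pinned until the entry is removed. A minimal sketch (the AMI name and owner filter values here are just illustrative):

// Sketch: MachineImage.lookup() goes through the context provider, so the
// resolved AMI ID is cached in cdk.context.json.
const pinnedImage = ec2.MachineImage.lookup({
  name: "amzn2-ami-ecs-hvm-*-x86_64-ebs",
  owners: ["amazon"],
});

// The SSM-parameter-based image that addCapacity defaults to, by contrast,
// is resolved again on every deployment and never touches cdk.context.json.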

@rix0rrr
Contributor

rix0rrr commented Aug 11, 2021

@peterwoodworth the docs you are referring to are about the aws-autoscaling module, and refer to the examples shown just above (which are where the user is in control of the MachineImage instantiation). The code the OP is talking about though is using the aws-ecs module.

Specifically, they are getting an auto-updating image because they're not passing a machineImage parameter to addCapacity, and the default is an auto-updating Amazon Linux 2 image: https://docs.aws.amazon.com/cdk/api/latest/docs/@aws-cdk_aws-ecs.AddCapacityOptions.html#machineimage

I'll agree that the docs don't make this very clear. Explicitly picking a non-updating AmazonLinux image would do it.

I am more interested in why this is causing downtime. The ECS cluster instances should be replaced one at a time in a rolling update, and ECS should be rescheduling Tasks onto other machines to compensate.

@rix0rrr
Contributor

rix0rrr commented Aug 11, 2021

The most trivial way to fix this, to start, is by making the ECS docs more clear.

@rix0rrr
Contributor

rix0rrr commented Aug 12, 2021

Also, by default I think we enable instance draining, as explained here: https://blog.alterway.fr/en/update-your-ecs-container-instances-with-no-downtime.html

So I'm even more mystified by the downtime

@andreialecu
Contributor Author

andreialecu commented Aug 12, 2021

I ended up with the following workaround in the meantime:

// pin image id:
// aws ssm get-parameters --names /aws/service/ecs/optimized-ami/amazon-linux-2/recommended --region eu-west-1
const machineImage = new ec2.GenericLinuxImage({
  // latest as of 23 july 2021
  "eu-west-1": "ami-0ea9d963ed3029ca3",
});

Then specifically pinned this machine image in the addCapacity call.
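
Roughly like this (a sketch of the same addCapacity call from the repro stack above, just with the image passed explicitly):

// Sketch: identical to the repro stack's addCapacity call, except the
// pinned GenericLinuxImage is passed explicitly so the launch
// configuration stops tracking the SSM-recommended AMI.
this.cluster.addCapacity("Scaling", {
  maxCapacity: 2,
  minCapacity: 2,
  machineImage,
  taskDrainTime: cdk.Duration.seconds(0),
  vpcSubnets: { subnetType },
  instanceType,
  spotPrice,
  spotInstanceDraining: true,
});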

@rix0rrr there's definitely downtime as the cluster goes down then back up. It's not long, but it can be 5-10 minutes while the load balancer responds with 502.

Because this is the default and there's no mention of this caveat in the documentation, I think this is a pretty serious problem. Our system needs to run with 99.999% availability, and these 5-10 minutes of random downtime can cause serious problems.

@andreialecu
Contributor Author

andreialecu commented Aug 12, 2021

@rix0rrr FWIW, maybe related, but if you look in the original repro code I mentioned, we use taskDrainTime: cdk.Duration.seconds(0), and spot instances.

@rix0rrr
Contributor

rix0rrr commented Aug 12, 2021

Aha, well, that would explain it: without draining you would always get a service interruption while your instances are being replaced. You will run into the same whenever you decide to upgrade your AL AMI in the future.

I see the option spotInstanceDraining. The name seems to indicate it would apply for your situation, but I have to confess I'm not sure how it works.

@rix0rrr
Contributor

rix0rrr commented Aug 12, 2021

Not sure why spot instance draining is disabled by default 😭

@andreialecu
Contributor Author

andreialecu commented Aug 12, 2021

You will run into the same whenever you decide to upgrade your AL AMI in the future.

Yes, I'm aware of that and will schedule it for maintenance windows. 🙂

The main problem was that this was completely unexpected, and for a good amount of time I didn't understand where the downtime was originating, because, as mentioned, it was being deployed as part of CI, yet there were NO stack changes most of the time.

@andreialecu
Contributor Author

Not sure why spot instance draining is disabled by default 😭

Note that even though it's disabled by default, enabling it doesn't help at all. It's possible that taskDrainTime of 0 seconds is the culprit.

I don't remember the exact reason for changing it to 0 seconds, but I think without it things would hang for up to 5 minutes in certain cases. So we preferred to simply nuke things and quickly start over (and not wait to drain anything).

rix0rrr added a commit that referenced this issue on Aug 12, 2021:
Most `MachineImage` implementations look up AMIs from SSM Parameters,
and by default they will all look up the Parameters on each deployment.

This leads to instance replacement. Since we already know the SSM
Parameter Name and CDK already has a cached SSM context lookup, it
should be simple to get a stable AMI ID. This is not ideal because the
AMI will grow outdated over time, but users should have the option to
pick non-updating images in a convenient way.

Fixes #12484.
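
Once that change is in, opting in would presumably look something like the following (a sketch; the cachedInContext option and the fact that it hangs off the ECS-optimized image are assumptions based on the commit message above, not verified API):

// Sketch (assumption: the fix exposes a cachedInContext flag on the
// SSM-backed machine images). The SSM parameter would then be resolved once
// through the context provider and pinned in cdk.context.json.
this.cluster.addCapacity("Scaling", {
  maxCapacity: 2,
  minCapacity: 2,
  instanceType,
  spotPrice,
  machineImage: ecs.EcsOptimizedImage.amazonLinux2(ecs.AmiHardwareType.STANDARD, {
    cachedInContext: true,
  }),
});
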
@rix0rrr
Contributor

rix0rrr commented Aug 12, 2021

I don't suppose you're feeling like helping us figure out what's wrong with the draining config? 😉

@andreialecu
Contributor Author

I'd be glad to try tomorrow, if you can give me some instructions on what to look for.

I didn't realise until you mentioned it that there was something wrong with the draining config. 🙂

@rix0rrr
Contributor

rix0rrr commented Aug 13, 2021

The thing is, I'm not THAT much of an expert. I don't operate a cluster myself, I don't have a host of production-like tasks ready to try, and I'm not intricately familiar with ECS. To pass it on to ECS team for investigation or clarification, I think we need more of a smoking gun.

The first thing I would do is look at the various logs (application logs, CloudWatch logs) to see whether shutdown signals were sent properly, when the instances were brought up, when tasks were scheduled on the new ones and terminated on the old ones, etc., and see if I can figure something out.

@andreialecu
Contributor Author

Here's the sequence of events:

FooBar-All-EcsCluster-Production
FooBar-All-EcsCluster-Production: deploying...
FooBar-All-EcsCluster-Production: creating CloudFormation changeset...
[██████████████████████████████████████████████████████████] (3/3)

11:12:49 | UPDATE_COMPLETE_CLEA | AWS::CloudFormation::Stack            | FooBar-All-EcsCluster-Production
11:12:50 | DELETE_IN_PROGRESS   | AWS::AutoScaling::AutoScalingGroup    | Cluster/Scaling/ASG

Notice new spot requests: [screenshot, 2021-08-13 11:15]

Old EC2 spot instances shutting down: [screenshot, 11:16]

And terminated: [screenshot, 11:14]

In the ECS console the old ones get shut down first: [screenshot, 11:19]

While the new ones are pending: [screenshot, 11:20]

Maybe it's related to my taskDrainTime being 0. I notice that:

⚠️ Deprecated: The lifecycle draining hook is not configured if using the EC2 Capacity Provider. Enable managed termination protection instead.

And the managed termination protection mentions:

When managed termination protection is enabled, Amazon ECS prevents the Amazon EC2 instances in an Auto Scaling group that contain tasks from being terminated during a scale-in action. The Auto Scaling group and each instance in the Auto Scaling group must have instance protection from scale-in actions enabled as well. For more information, see Instance Protection in the AWS Auto Scaling User Guide.

When managed termination protection is disabled, your Amazon EC2 instances are not protected from termination when the Auto Scaling group scales in.

I think we still prefer a task drain time of 0 for actual deployments though. Otherwise we risk having two concurrently running versions of our services for a short time, which can cause issues during deployments.

Several seconds of downtime is fine when we actually deploy updates on purpose, but having the autoscaling group replace instances at random times due to the AMI ID update was not something we were expecting.
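
For reference, the non-deprecated route that warning points at seems to be registering the Auto Scaling group as an ECS capacity provider with managed termination protection. A rough sketch (the AsgCapacityProvider construct and its option names here are my assumptions, not something verified in this thread):

// Sketch (assumed constructs/options: AsgCapacityProvider,
// addAsgCapacityProvider, enableManagedTerminationProtection).
// Requires: import * as autoscaling from "@aws-cdk/aws-autoscaling";
// spotPrice is omitted here to keep the sketch minimal.
const asg = new autoscaling.AutoScalingGroup(this, "Asg", {
  vpc: vpcStack.vpc,
  instanceType,
  machineImage: ecs.EcsOptimizedImage.amazonLinux2(),
  minCapacity: 2,
  maxCapacity: 2,
});

const capacityProvider = new ecs.AsgCapacityProvider(this, "CapacityProvider", {
  autoScalingGroup: asg,
  enableManagedTerminationProtection: true,
});
this.cluster.addAsgCapacityProvider(capacityProvider);
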

mergify bot closed this as completed in #16021 on Sep 9, 2021
mergify bot pushed a commit that referenced this issue on Sep 9, 2021 (same commit message as above)
