(aws-autoscaling): machine ami id not caching to cdk.context.json #12484
Note that I have recently updated from 1.71.0 to 1.84.0, but while looking at the changelog I didn't notice anything relevant that may fix this. The problem has existed for a long time, but it was a development cluster so it wasn't too important. |
This is the JSON changeset listed in the CloudFormation console:

```json
[
  {
    "resourceChange": {
      "logicalResourceId": "ClusterScalingASGE8638730",
      "action": "Modify",
      "physicalResourceId": "App-AnalyticsEcsCluster2-Production-ClusterScalingASGE8638730-D5WKYH5KZXGB",
      "resourceType": "AWS::AutoScaling::AutoScalingGroup",
      "replacement": "Conditional",
      "moduleInfo": null,
      "details": [
        {
          "target": {
            "name": "LaunchConfigurationName",
            "requiresRecreation": "Conditionally",
            "attribute": "Properties"
          },
          "causingEntity": "ClusterScalingLaunchConfig3E9D5827",
          "evaluation": "Static",
          "changeSource": "ResourceReference"
        }
      ],
      "changeSetId": null,
      "scope": [
        "Properties"
      ]
    },
    "type": "Resource"
  },
  {
    "resourceChange": {
      "logicalResourceId": "ClusterScalingLaunchConfig3E9D5827",
      "action": "Modify",
      "physicalResourceId": "App-AnalyticsEcsCluster2-Production-ClusterScalingLaunchConfig3E9D5827-1R99WCAZR3PS8",
      "resourceType": "AWS::AutoScaling::LaunchConfiguration",
      "replacement": "True",
      "moduleInfo": null,
      "details": [
        {
          "target": {
            "name": "ImageId",
            "requiresRecreation": "Always",
            "attribute": "Properties"
          },
          "causingEntity": null,
          "evaluation": "Static",
          "changeSource": "DirectModification"
        }
      ],
      "changeSetId": null,
      "scope": [
        "Properties"
      ]
    },
    "type": "Resource"
  }
]
```
|
|
Reopening. This is still an issue. The auto-scaling group is being recreated every time there's an update to the underlying Linux image. It is NOT being cached to the metadata file as mentioned in the documentation. |
This is still happening in all our clusters. The machine AMI ID is definitely not being cached to cdk.context.json. Am I doing something wrong? Could someone point me to the relevant section of code inside aws-cdk where the value is supposed to be written? I could try debugging it myself. |
This section in the docs states that:
@rix0rrr was this how it previously worked? I can't get my machine AMI ID to cache; this seems to be an error in our documentation. The only things that cache to cdk.context.json are the context methods, so this note should be updated or removed. |
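For reference, a minimal sketch of what a context method looks like, assuming CDK v1 imports and code running inside a Stack's constructor (illustrative, not taken from this project):

```ts
import * as ec2 from '@aws-cdk/aws-ec2';

// Vpc.fromLookup() is a context method: it queries the account at synth time
// and writes the result into cdk.context.json, so later synths reuse the
// cached value instead of looking it up again.
const vpc = ec2.Vpc.fromLookup(this, 'Vpc', { isDefault: true });
```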
@peterwoodworth the docs you are referring to are about the Specifically, they are getting an auto-updating image because they're not passing a I'll agree that the docs don't make this very clear. Explicitly picking a non-updating AmazonLinux image would do it. I am more interested in why this is causing downtime. The ECS cluster instances should be replaced one at a time in a rolling update, but ECS should be rescheduling Tasks onto other machines to compensate. |
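One way to pick a non-updating image, sketched here under the assumption that an AMI lookup is acceptable (the name filter and owner are illustrative, not taken from the thread), is `ec2.MachineImage.lookup()`, which is itself a context method and is therefore cached in cdk.context.json:

```ts
import * as ec2 from '@aws-cdk/aws-ec2';

// MachineImage.lookup() resolves the AMI once via DescribeImages and caches
// the resulting AMI ID in cdk.context.json, so it stays stable across deploys
// until the cached context entry is cleared.
const machineImage = ec2.MachineImage.lookup({
  name: 'amzn2-ami-ecs-hvm-*-x86_64-ebs', // assumed filter for the ECS-optimized AL2 AMI
  owners: ['amazon'],
});
```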
The most trivial way to start fixing this is by making the ECS docs clearer. |
Also, by default I think we enable instance draining, as explained here: https://blog.alterway.fr/en/update-your-ecs-container-instances-with-no-downtime.html So I'm even more mystified by the downtime |
I ended up with the following workaround in the meantime:

```ts
// pin image id:
// aws ssm get-parameters --names /aws/service/ecs/optimized-ami/amazon-linux-2/recommended --region eu-west-1
const machineImage = new ec2.GenericLinuxImage({
  // latest as of 23 july 2021
  "eu-west-1": "ami-0ea9d963ed3029ca3",
});
```

Then specifically pinned this machine image in the

@rix0rrr there's definitely downtime as the cluster goes down then back up. It's not long, but it can be 5-10 minutes while the load balancer responds with 502. Because this is the default and there's no mention of this caveat in the documentation, I think this is a pretty serious problem. Our system needs to run with 99.999% availability, and these 5-10 minutes of random downtime can cause serious problems. |
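The property where the pinned image gets wired in is cut off above; a rough sketch of how it might look, assuming capacity is added through `cluster.addCapacity()` and using an illustrative instance type and spot price:

```ts
// `cluster` is assumed to be an existing ecs.Cluster, and `machineImage` is
// the pinned GenericLinuxImage from the snippet above.
cluster.addCapacity('ClusterScaling', {
  instanceType: new ec2.InstanceType('c5.large'), // illustrative
  spotPrice: '0.05',                              // illustrative; the cluster uses spot instances
  machineImage,                                   // pin the AMI so ImageId stops changing
});
```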
@rix0rrr FWIW, maybe related, but if you look in the original repro code I mentioned, we use |
Aha well that would explain it: without draining you would always get service interruption while your instances are being replaced. You will run into the same thing whenever you decide to upgrade your AL AMI in the future. I see the option |
Not sure why spot instance draining is disabled by default 😭 |
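Presumably the option being referred to is `spotInstanceDraining` on the capacity options; a hedged sketch of turning it on (instance type and price are placeholders):

```ts
// spotInstanceDraining sets ECS_ENABLE_SPOT_INSTANCE_DRAINING in the agent
// config so Tasks are drained off a spot instance before it is reclaimed.
cluster.addCapacity('SpotCapacity', {
  instanceType: new ec2.InstanceType('c5.large'), // placeholder
  spotPrice: '0.05',                              // placeholder
  spotInstanceDraining: true,                     // currently defaults to false
});
```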
Yes, I'm aware of that and will schedule it for maintenance windows. 🙂 The main problem was that this was completely unexpected, and for a good amount of time I didn't understand where the downtime was originating because, as mentioned, it was being deployed as part of CI - but there were NO stack changes most of the time. |
Note that even though it's disabled by default, enabling it doesn't help at all. It's possible that I don't remember the exact reason for changing it to |
Most `MachineImage` implementations look up AMIs from SSM Parameters, and by default they will all look up the Parameters on each deployment. This leads to instance replacement. Since we already know the SSM Parameter Name and CDK already has a cached SSM context lookup, it should be simple to get a stable AMI ID. This is not ideal because the AMI will grow outdated over time, but users should have the option to pick non-updating images in a convenient way. Fixes #12484.
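Until that lands, the same idea can be approximated by hand. This is an assumption-laden sketch (not the PR's actual implementation) that combines the known public parameter name with the existing cached SSM context lookup:

```ts
import * as ec2 from '@aws-cdk/aws-ec2';
import * as ssm from '@aws-cdk/aws-ssm';

// valueFromLookup() is a context method: the parameter value is fetched once
// at synth time and cached in cdk.context.json until the context is cleared.
const amiId = ssm.StringParameter.valueFromLookup(
  this, // inside a Stack's constructor
  '/aws/service/ecs/optimized-ami/amazon-linux-2/recommended/image_id',
);

// Wrap the cached AMI ID in a fixed machine image so the LaunchConfiguration's
// ImageId only changes when the cached context entry is refreshed.
const machineImage = ec2.MachineImage.genericLinux({ 'eu-west-1': amiId });
```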
I don't suppose you're feeling like helping us figure out what's wrong with the draining config? 😉 |
I'd be glad to try tomorrow, if you can give me some instructions on what to look for. I didn't realise until you mentioned it that there was something wrong with a draining config. 🙂 |
The thing is, I'm not THAT much of an expert. I don't operate a cluster myself, I don't have a host of production-like tasks ready to try, and I'm not intricately familiar with ECS. To pass it on to the ECS team for investigation or clarification, I think we need more of a smoking gun. The first thing I would do is look at various logs (application logs, CloudWatch logs) to check whether shutdown signals were being sent properly, when the instances got brought up, when tasks were scheduled on the new ones and terminated on the old ones, etc., to see if I can figure something out. |
We are running a `cdk deploy '*'` step as part of CI. Even if there are no stack changes whatsoever, every 10 days or so, cdk starts recreating an autoscaling group, breaking the ECS cluster and resulting in downtime until everything is built again:
Here are relevant logs:
It's always the same output.
Note that the cluster uses spot instances, which may be relevant.
Reproduction Steps
Potentially relevant stack code:
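The actual stack code is not included here; as a hypothetical minimal sketch of the kind of setup described in the thread (an ECS cluster with spot capacity using the default ECS-optimized AMI), with all names and sizes assumed:

```ts
import * as cdk from '@aws-cdk/core';
import * as ec2 from '@aws-cdk/aws-ec2';
import * as ecs from '@aws-cdk/aws-ecs';

// Hypothetical reproduction: addCapacity() defaults to the ECS-optimized AMI,
// which is resolved from its public SSM parameter at deploy time, so a new AMI
// release shows up as a LaunchConfiguration ImageId change and a replacement.
class ReproStack extends cdk.Stack {
  constructor(scope: cdk.App, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    const vpc = new ec2.Vpc(this, 'Vpc', { maxAzs: 2 });
    const cluster = new ecs.Cluster(this, 'Cluster', { vpc });

    cluster.addCapacity('ClusterScaling', {
      instanceType: new ec2.InstanceType('c5.large'), // assumed
      spotPrice: '0.05',                              // the thread mentions spot instances
      minCapacity: 1,
      maxCapacity: 3,
    });
  }
}
```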
What did you expect to happen?
No stack updates to be issued. No downtime.
What actually happened?
Stack is updated unnecessarily. Downtime for 10+ minutes.
Environment
Other
It happens on CircleCI, but also when running locally after not being run for a while. I believe it is around 10 days, but I'm not completely sure.
This is a 🐛 Bug Report