-
Notifications
You must be signed in to change notification settings - Fork 3.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(ecs): enable Alarm Based Rollbacks in ECS L2s #25346
Conversation
Closes aws#24243. ---- *By submitting this pull request, I confirm that my contribution is made under the terms of the Apache-2.0 license*
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @bvtujo, these are my proposed changes, and I will now be working on implementing and testing them out. There may be some wrinkles to iron out still.
There are a lot of comments in this review, so I've put them into categories:
- [API Change] = A change I am proposing to the public API. These are most important to have a conversation and align on. The rest should be two-way doors.
- [Important] = Must be addressed before merge
- [Minor] = Minor feedback, take it or leave it
```ts | ||
declare const cluster: ecs.Cluster; | ||
declare const taskDefinition: ecs.TaskDefinition; | ||
declare const alarm: cloudwatch.Alarm; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[Minor]: We should also demonstrate the new alarm/metric creation utilities in this example. Or in an additional example
} | ||
|
||
/** | ||
* The ecs metric class |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[Minor]: Is this the documentation source for these? https://docs.aws.amazon.com/AmazonECS/latest/developerguide/cloudwatch-metrics.html#available_cloudwatch_metrics
Let's add a link to it so that in the future it is easier to update if new metrics are added.
* Whether to enable the deployment alarms. If this property is defined, alarms will be implicitly | ||
* enabled. | ||
* @default - disabled |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[Minor]: Suggestion to add more detail here:
* Whether to enable the deployment alarms. If this property is defined, alarms will be implicitly | |
* enabled. | |
* @default - disabled | |
* The alarm(s) to monitor during deployment, and behavior to apply if | |
* at least one enters a state of alarm during the deployment or bake time. | |
* If this property is defined, deployment alarms will be implicitly enabled. | |
* | |
* @default - No alarms will be monitored during deployment |
/** | ||
* Ecs Metric props | ||
*/ | ||
public readonly ecsMetricProps?: EcsMetricEnumProps, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[API Change]: More concise, and make it private. We normally don't just surface the construct props of something as-is as a public property. And, it's not necessary because we have provided the isServiceConnectMetric()
public function.
public readonly ecsMetricProps?: EcsMetricEnumProps, | |
private readonly props?: EcsMetricProps, |
export interface EcsMetricAlarmProps { | ||
/** | ||
* Alarm props | ||
* @default - Alarm props to be used for Ecs metric alarm | ||
*/ | ||
readonly alarmProps?: cloudwatch.AlarmProps; | ||
/** | ||
* Metric props | ||
* @default - Metric props to be used for Ecs metric alarm | ||
*/ | ||
readonly metricProps?: cloudwatch.MetricProps; | ||
/** | ||
* Whether to use as deployment alarm | ||
* @default false | ||
*/ | ||
readonly useAsDeploymentAlarm: boolean; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[API Change]:
- Change name to EcsAlarmProps - more concise
- extend cloudwatch.AlarmProps to that you can customize all the alarm props without additional nesting in other APIs.
- remove alarmProps because now all of the properties are already part of this interface by extending
cloudwatch.AlarmProps
- remove
metricProps
because those would have been configured when you create your metric that you pass in. Either using theEcsMetric
class, or on your own. - Make
useAsDeploymentAlarm
optional. I think this was already the intention, since there is a default in the doc comment. - Add
deploymentAlarmBehavior
so that the behavior of the deployment alarms can be configured with all methods of creating deployment alarms.
export interface EcsMetricAlarmProps { | |
/** | |
* Alarm props | |
* @default - Alarm props to be used for Ecs metric alarm | |
*/ | |
readonly alarmProps?: cloudwatch.AlarmProps; | |
/** | |
* Metric props | |
* @default - Metric props to be used for Ecs metric alarm | |
*/ | |
readonly metricProps?: cloudwatch.MetricProps; | |
/** | |
* Whether to use as deployment alarm | |
* @default false | |
*/ | |
readonly useAsDeploymentAlarm: boolean; | |
} | |
export interface EcsAlarmProps extends cloudwatch.AlarmProps { | |
readonly useAsDeploymentAlarm?: boolean; | |
readonly deploymentAlarmBehavior?: AlarmBehavior; | |
} |
/** | ||
* Disassociate existing deployment alarms | ||
*/ | ||
public disableDeploymentAlarms() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[API Change]: There are a few issues with this function in the current implementation.
- Can we make this private? I know you know it's not a great experience to require users to call this function when they want to update their stacks to no longer use this feature. Disabling alarms should just happen as a result of not using any of the features to configure alarms. I am going to do some testing, but it would be best to make the disabled configuration also the default configuration. This might require a feature flag if there is concern about triggering a service update for no reason for customers who upgrade their CDK and obviously didn't have any deployment alarms before.
- In the current implementation, this is always called anyway when the
deploymentAlarms
prop is not provided. - This function will do nothing if you removed your code that configures alarms, and then also call this function. Which is what I would have assumed the customer would do if they are required to call this function in order to disable previously deployed deployment alarms.
- In any case, this needs to be well documented in the README and doc comments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem is twofold:
- the ECS UpdateService API is stateful. If the property is omitted from subsequent deployments, it's not removed, it's left unchanged.
- some partitions do not support DeploymentAlarms yet; GovCloud is the major example, so we have to be careful not to specify any of the related properties (
DeployentConfiguration.Alarms: !Ref AWS::NoValue
works in pure CFN, though)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay 1 is clear, and I think we can address that by making the disabled configuration the default configuration.
(enabled: false, alarms: []
)
I didn't know about 2, that makes it harder. Are you saying that DeployentConfiguration.Alarms: !Ref AWS::NoValue
will work to actually remove a previously deployed deploymentAlarm? Or will CFN treat that as if the property is omitted, leaving behind any deployment alarms that were previously deployed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The way Copilot handles it is by specifying NoValue
when we detect a deployment into GovCloud (handled with CFN conditions and the AWS::Partition
pseudoparameter, but specifying DeploymentConfiguration.enabled: false
as default otherwise.
I believe NoValue
makes the CFN resource handler treat the property as if it doesn't exist in the request.
metric: ecsMetric, | ||
threshold: ecsMetricAlarmProps?.alarmProps?.threshold || 85, | ||
evaluationPeriods: ecsMetricAlarmProps?.alarmProps?.evaluationPeriods || 3, | ||
alarmName, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[Important]: Why are we setting this instead of just leaving it to cloudformation and using references?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've tried to remove this and it turns out that we need to be able to create a name for the alarm, otherwise it's super easy to create a circular dependency.
For example:
const svc = new FargateService(stack, 'svc', {});
// Metrics aren't actually constructs; they're convenience objects
// that make it easy to create alarms
const metric = svc.metricCpuUtilization();
svc.createAlarm({
useAsDeploymentAlarm: true,
metric,
evaluationPeriods: 4,
threshold: 80,
});
// This creates a circular dependency because the metric dimensions
// include the service name! That means that the alarm has a reference
// to the service, but the service has a reference to the alarm, because
// we have to pass the alarm name to `deploymentAlarms.alarmNames`.
To break this, we need deterministic string names, which unfortunately means we face a tradeoff.
- Provide a unique id which we generate on each deployment, which re-creates the resource every single time.
- Use a deterministic name, which means we can't do any updates on the alarm which require replacement. Currently,
AlarmName
(docs) is the only CFN field which requires replacement, but I don't know if there's a guarantee that this will stay the case.
I've pushed my code which has failing tests due to this reason for visibility.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Woops: a couple more comments I forgot.
if (!this.deploymentAlarms) { | ||
this.deploymentAlarms = { | ||
enable: true, | ||
rollback: true, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can just refactor this to re-use the enableDeploymentAlarms
function, and that should address the concern.
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
This PR has been in the CHANGES REQUESTED state for 3 weeks, and looks abandoned. To keep this PR from being closed, please continue work on it. If not, it will automatically be closed in a week. |
This PR has been deemed to be abandoned, and will be automatically closed. Please create a new PR for these changes if you think this decision has been made in error. |
Reopening this, I am still working on a review! Sorry it got closed (I was on vacation last week). |
This PR has been deemed to be abandoned, and will be automatically closed. Please create a new PR for these changes if you think this decision has been made in error. |
I'll reopen. |
|
This PR enables the new Deployment Alarms feature in ECS for Service L2s [blog post](https://aws.amazon.com/blogs/containers/automate-rollbacks-for-amazon-ecs-rolling-deployments-with-cloudwatch-alarms/). This PR contains changes after the first round of revision by the CDK team. Continuation of #25346 **Why are so many integration tests impacted by this change? There are 2 reasons:** 1. This PR changes the ECS L2s to set the default configuration for the `CfnService.deploymentConfiguration.alarms` property to: ``` alarmNames: [], rollback: false, enable: false, ``` This is necessary, because adding deployment alarms, deploying your CFN stack, then removing the deployment alarms from the CFN template, and deploying again WILL NOT remove the deployment alarms from the service. To remove previously configured deployment alarms, you must explicitly use the configuration shown above. Making this update will cause no interruption to existing ECS services, and does not trigger any update to the service itself during the CFN update. The ECS UpdateService API is stateful, meaning that if a field is not present in the CloudFormation object, it will be ignored in the update. This was originally implemented due to the problems with desiredCount resetting to bad values after autoscaling changed it, but requires us to set an explicit disableDeploymentAlarms() method to cause the service update to behave correctly. 2. Most ECS integ tests have not been executed in a long time. So, many of the changes are bring the snapshots up to date with the latest integ test format, or lambda layers, etc. **Detailed list of changes:** - entire alarm based rollback task into single commit - resolving merge conflicts - integ files modules updated to resolve merge conflicts - regarding merge conflicts issue - integ files import changes for merge conflicts - adding this change for PR not to abondended - chore: refactor createAlarm & remove unnecessary unit tests - chore: fix api but failing unit tests - chore: fix validation - chore: readme changes - chore: fix integ test, remove unnecessary ec2 test - chore: refactor enableAlarms for clarity - test: fix integ test, finally - docs: update docstrings for new props and methods ---- *By submitting this pull request, I confirm that my contribution is made under the terms of the Apache-2.0 license*
This PR enables the new Deployment Alarms feature in ECS for Service L2s (blog post).
It adds a new enum to contain the default ECS metrics, logic to automatically create metrics and alarms based on those default metrics, and utility methods to clear all fields so that customers can disable the feature easily. This last component is necessary because the ECS UpdateService API is stateful, meaning that if a field is not present in the CloudFormation object, it will be ignored in the update.
This was originally implemented due to the problems with desiredCount resetting to bad values after autoscaling changed it, but requires us to set an explicit disableDeploymentAlarms() method to cause the service update to behave correctly.
This PR reproduces #24182.
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache-2.0 license