Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(ecs): enable Alarm Based Rollbacks in ECS L2s #25346

Closed
wants to merge 36 commits into from

Conversation

bvtujo
Copy link
Contributor

@bvtujo bvtujo commented Apr 27, 2023

  • entire alarm based rollback task into single commit
  • resolving merge conflicts
  • integ files modules updated to resolve merge conflicts
  • regarding merge conflicts issue
  • integ files import changes for merge conflicts
  • adding this change for PR not to abondended

This PR enables the new Deployment Alarms feature in ECS for Service L2s (blog post).

It adds a new enum to contain the default ECS metrics, logic to automatically create metrics and alarms based on those default metrics, and utility methods to clear all fields so that customers can disable the feature easily. This last component is necessary because the ECS UpdateService API is stateful, meaning that if a field is not present in the CloudFormation object, it will be ignored in the update.

This was originally implemented due to the problems with desiredCount resetting to bad values after autoscaling changed it, but requires us to set an explicit disableDeploymentAlarms() method to cause the service update to behave correctly.

This PR reproduces #24182.


By submitting this pull request, I confirm that my contribution is made under the terms of the Apache-2.0 license

Naumel and others added 28 commits March 28, 2023 21:54
Closes aws#24243.

----

*By submitting this pull request, I confirm that my contribution is made under the terms of the Apache-2.0 license*
@aws-cdk-automation aws-cdk-automation requested a review from a team April 27, 2023 16:08
@github-actions github-actions bot added the beginning-contributor [Pilot] contributed between 0-2 PRs to the CDK label Apr 27, 2023
Copy link
Contributor

@madeline-k madeline-k left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @bvtujo, these are my proposed changes, and I will now be working on implementing and testing them out. There may be some wrinkles to iron out still.

There are a lot of comments in this review, so I've put them into categories:

  • [API Change] = A change I am proposing to the public API. These are most important to have a conversation and align on. The rest should be two-way doors.
  • [Important] = Must be addressed before merge
  • [Minor] = Minor feedback, take it or leave it

packages/aws-cdk-lib/aws-ecs/README.md Show resolved Hide resolved
```ts
declare const cluster: ecs.Cluster;
declare const taskDefinition: ecs.TaskDefinition;
declare const alarm: cloudwatch.Alarm;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Minor]: We should also demonstrate the new alarm/metric creation utilities in this example. Or in an additional example

}

/**
* The ecs metric class
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Minor]: Is this the documentation source for these? https://docs.aws.amazon.com/AmazonECS/latest/developerguide/cloudwatch-metrics.html#available_cloudwatch_metrics

Let's add a link to it so that in the future it is easier to update if new metrics are added.

Comment on lines 469 to 471
* Whether to enable the deployment alarms. If this property is defined, alarms will be implicitly
* enabled.
* @default - disabled
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Minor]: Suggestion to add more detail here:

Suggested change
* Whether to enable the deployment alarms. If this property is defined, alarms will be implicitly
* enabled.
* @default - disabled
* The alarm(s) to monitor during deployment, and behavior to apply if
* at least one enters a state of alarm during the deployment or bake time.
* If this property is defined, deployment alarms will be implicitly enabled.
*
* @default - No alarms will be monitored during deployment

packages/aws-cdk-lib/aws-ecs/lib/base/base-service.ts Outdated Show resolved Hide resolved
/**
* Ecs Metric props
*/
public readonly ecsMetricProps?: EcsMetricEnumProps,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[API Change]: More concise, and make it private. We normally don't just surface the construct props of something as-is as a public property. And, it's not necessary because we have provided the isServiceConnectMetric() public function.

Suggested change
public readonly ecsMetricProps?: EcsMetricEnumProps,
private readonly props?: EcsMetricProps,

Comment on lines 212 to 228
export interface EcsMetricAlarmProps {
/**
* Alarm props
* @default - Alarm props to be used for Ecs metric alarm
*/
readonly alarmProps?: cloudwatch.AlarmProps;
/**
* Metric props
* @default - Metric props to be used for Ecs metric alarm
*/
readonly metricProps?: cloudwatch.MetricProps;
/**
* Whether to use as deployment alarm
* @default false
*/
readonly useAsDeploymentAlarm: boolean;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[API Change]:

  • Change name to EcsAlarmProps - more concise
  • extend cloudwatch.AlarmProps to that you can customize all the alarm props without additional nesting in other APIs.
  • remove alarmProps because now all of the properties are already part of this interface by extending cloudwatch.AlarmProps
  • remove metricProps because those would have been configured when you create your metric that you pass in. Either using the EcsMetric class, or on your own.
  • Make useAsDeploymentAlarm optional. I think this was already the intention, since there is a default in the doc comment.
  • Add deploymentAlarmBehavior so that the behavior of the deployment alarms can be configured with all methods of creating deployment alarms.
Suggested change
export interface EcsMetricAlarmProps {
/**
* Alarm props
* @default - Alarm props to be used for Ecs metric alarm
*/
readonly alarmProps?: cloudwatch.AlarmProps;
/**
* Metric props
* @default - Metric props to be used for Ecs metric alarm
*/
readonly metricProps?: cloudwatch.MetricProps;
/**
* Whether to use as deployment alarm
* @default false
*/
readonly useAsDeploymentAlarm: boolean;
}
export interface EcsAlarmProps extends cloudwatch.AlarmProps {
readonly useAsDeploymentAlarm?: boolean;
readonly deploymentAlarmBehavior?: AlarmBehavior;
}

/**
* Disassociate existing deployment alarms
*/
public disableDeploymentAlarms() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[API Change]: There are a few issues with this function in the current implementation.

  • Can we make this private? I know you know it's not a great experience to require users to call this function when they want to update their stacks to no longer use this feature. Disabling alarms should just happen as a result of not using any of the features to configure alarms. I am going to do some testing, but it would be best to make the disabled configuration also the default configuration. This might require a feature flag if there is concern about triggering a service update for no reason for customers who upgrade their CDK and obviously didn't have any deployment alarms before.
  • In the current implementation, this is always called anyway when the deploymentAlarms prop is not provided.
  • This function will do nothing if you removed your code that configures alarms, and then also call this function. Which is what I would have assumed the customer would do if they are required to call this function in order to disable previously deployed deployment alarms.
  • In any case, this needs to be well documented in the README and doc comments.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is twofold:

  1. the ECS UpdateService API is stateful. If the property is omitted from subsequent deployments, it's not removed, it's left unchanged.
  2. some partitions do not support DeploymentAlarms yet; GovCloud is the major example, so we have to be careful not to specify any of the related properties (DeployentConfiguration.Alarms: !Ref AWS::NoValue works in pure CFN, though)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay 1 is clear, and I think we can address that by making the disabled configuration the default configuration.
(enabled: false, alarms: [])

I didn't know about 2, that makes it harder. Are you saying that DeployentConfiguration.Alarms: !Ref AWS::NoValue will work to actually remove a previously deployed deploymentAlarm? Or will CFN treat that as if the property is omitted, leaving behind any deployment alarms that were previously deployed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The way Copilot handles it is by specifying NoValue when we detect a deployment into GovCloud (handled with CFN conditions and the AWS::Partition pseudoparameter, but specifying DeploymentConfiguration.enabled: false as default otherwise.

I believe NoValue makes the CFN resource handler treat the property as if it doesn't exist in the request.

metric: ecsMetric,
threshold: ecsMetricAlarmProps?.alarmProps?.threshold || 85,
evaluationPeriods: ecsMetricAlarmProps?.alarmProps?.evaluationPeriods || 3,
alarmName,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Important]: Why are we setting this instead of just leaving it to cloudformation and using references?

Copy link
Contributor Author

@bvtujo bvtujo May 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've tried to remove this and it turns out that we need to be able to create a name for the alarm, otherwise it's super easy to create a circular dependency.

For example:

const svc = new FargateService(stack, 'svc', {});

// Metrics aren't actually constructs; they're convenience objects 
// that make it easy to create alarms
const metric = svc.metricCpuUtilization();

svc.createAlarm({
  useAsDeploymentAlarm: true,
  metric,
  evaluationPeriods: 4,
  threshold: 80,
});
// This creates a circular dependency because the metric dimensions
// include the service name! That means that the alarm has a reference 
// to the service, but the service has a reference to the alarm, because 
// we have to pass the alarm name to `deploymentAlarms.alarmNames`.

To break this, we need deterministic string names, which unfortunately means we face a tradeoff.

  1. Provide a unique id which we generate on each deployment, which re-creates the resource every single time.
  2. Use a deterministic name, which means we can't do any updates on the alarm which require replacement. Currently, AlarmName (docs) is the only CFN field which requires replacement, but I don't know if there's a guarantee that this will stay the case.

I've pushed my code which has failing tests due to this reason for visibility.

Copy link
Contributor

@madeline-k madeline-k left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Woops: a couple more comments I forgot.

packages/aws-cdk-lib/aws-ecs/lib/base/base-service.ts Outdated Show resolved Hide resolved
if (!this.deploymentAlarms) {
this.deploymentAlarms = {
enable: true,
rollback: true,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can just refactor this to re-use the enableDeploymentAlarms function, and that should address the concern.

@gitpod-io
Copy link

gitpod-io bot commented May 17, 2023

@mergify mergify bot dismissed madeline-k’s stale review May 17, 2023 19:34

Pull request has been modified.

@aws-cdk-automation
Copy link
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: AutoBuildv2Project1C6BFA3F-wQm2hXv2jqQv
  • Commit ID: bae8694
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@aws-cdk-automation
Copy link
Collaborator

This PR has been in the CHANGES REQUESTED state for 3 weeks, and looks abandoned. To keep this PR from being closed, please continue work on it. If not, it will automatically be closed in a week.

@aws-cdk-automation
Copy link
Collaborator

This PR has been deemed to be abandoned, and will be automatically closed. Please create a new PR for these changes if you think this decision has been made in error.

@aws-cdk-automation aws-cdk-automation added the closed-for-staleness This issue was automatically closed because it hadn't received any attention in a while. label Jun 1, 2023
@madeline-k madeline-k reopened this Jun 1, 2023
@madeline-k
Copy link
Contributor

Reopening this, I am still working on a review! Sorry it got closed (I was on vacation last week).

@aws-cdk-automation
Copy link
Collaborator

This PR has been deemed to be abandoned, and will be automatically closed. Please create a new PR for these changes if you think this decision has been made in error.

@bvtujo
Copy link
Contributor Author

bvtujo commented Jun 2, 2023

I'll reopen.

@mergify
Copy link
Contributor

mergify bot commented Jun 2, 2023

⚠️ The sha of the head commit of this PR conflicts with #25840. Mergify cannot evaluate rules on this PR. ⚠️

mergify bot pushed a commit that referenced this pull request Jul 5, 2023
This PR enables the new Deployment Alarms feature in ECS for Service L2s [blog post](https://aws.amazon.com/blogs/containers/automate-rollbacks-for-amazon-ecs-rolling-deployments-with-cloudwatch-alarms/).

This PR contains changes after the first round of revision by the CDK team. 

Continuation of #25346

**Why are so many integration tests impacted by this change? There are 2 reasons:**
1. This PR changes the ECS L2s to set the default configuration for the `CfnService.deploymentConfiguration.alarms` property to:
```
alarmNames: [],
rollback: false,
enable: false,
```
This is necessary, because adding deployment alarms, deploying your CFN stack, then removing the deployment alarms from the CFN template, and deploying again WILL NOT remove the deployment alarms from the service. To remove previously configured deployment alarms, you must explicitly use the configuration shown above. Making this update will cause no interruption to existing ECS services, and does not trigger any update to the service itself during the CFN update.

The ECS UpdateService API is stateful, meaning that if a field is not present in the CloudFormation object, it will be ignored in the update. This was originally implemented due to the problems with desiredCount resetting to bad values after autoscaling changed it, but requires us to set an explicit disableDeploymentAlarms() method to cause the service update to behave correctly.

2. Most ECS integ tests have not been executed in a long time. So, many of the changes are bring the snapshots up to date with the latest integ test format, or lambda layers, etc.


**Detailed list of changes:**
- entire alarm based rollback task into single commit
- resolving merge conflicts
- integ files modules updated to resolve merge conflicts
- regarding merge conflicts issue
- integ files import changes for merge conflicts
- adding this change for PR not to abondended
- chore: refactor createAlarm & remove unnecessary unit tests
- chore: fix api but failing unit tests
- chore: fix validation
- chore: readme changes
- chore: fix integ test, remove unnecessary ec2 test
- chore: refactor enableAlarms for clarity
- test: fix integ test, finally
- docs: update docstrings for new props and methods

----

*By submitting this pull request, I confirm that my contribution is made under the terms of the Apache-2.0 license*
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
beginning-contributor [Pilot] contributed between 0-2 PRs to the CDK closed-for-staleness This issue was automatically closed because it hadn't received any attention in a while. p2
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants