fix(aws-ecs-patterns): fix non-editable Scaling Policy causes race conditions & dropped tasks #23310

AnuragMohapatra · 2022-12-11T06:26:46Z

fixes #20706

Added an optional parameter, defaulting to false, that controls the CPU-based scaling policy, which conflicts with the policy based on the number of visible messages in the queue.

Note: If this parameter is enabled then this bug will crop up again and the user has to handle the container termination manually.

Updated integration tests and unit tests are working.

All Submissions:

Have you followed the guidelines in our Contributing guide?

Adding new Construct Runtime Dependencies:

This PR adds new construct runtime dependencies following the process described here

New Features

Have you added the new feature to an integration test?
- Did you use yarn integ to deploy the infrastructure and generate the snapshot (i.e. yarn integ without --dry-run)?

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache-2.0 license

gitpod-io · 2022-12-11T06:26:54Z

aws-cdk-automation

The pull request linter has failed. See the aws-cdk-automation comment below for failure reasons. If you believe this pull request should receive an exemption, please comment and provide a justification.

✅ Updated pull request passes all PRLinter validations. Dismissing previous PRLinter review.

HBobertz

So I am on board with the fact that this needs to be configurable for different use cases, but I'm not so on board with just deleting it outright, as this could well be a desired state for other customers. Can we not add optional parameters/props to this method/class and make this configurable from a parent construct?

HBobertz · 2022-12-12T15:17:59Z

@kaizencc

bvtujo · 2022-12-12T17:45:10Z

I really appreciate your taking the time to implement this fix and getting it through integ tests. However, if we release a version of the CDK which will mutate stacks on a deployment with the same code, we risk breaking other customers.

How would you feel about the following solution?

- An optional parameter in QueueProcessingServiceBaseProps called enableCpuScaling?: boolean which defaults to true (this means that the default behavior on a deployment is the same as before the change).
- Additional logic which uses this parameter to conditionally create the CPU scaling policy.
- A change to an integ test which exercises the new logic.
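
The proposal above can be sketched roughly as follows. This is a minimal illustration, not the CDK source: `QueueProcessingPropsSketch`, `ScalingTargetSketch`, `RecordingTarget`, and `configureAutoscaling` are hypothetical stand-ins so the snippet runs without the CDK, and only the `enableCpuScaling` name and its default of true come from the comment.

```typescript
// Hypothetical, simplified stand-ins for the CDK types (assumptions, not the real API).
interface QueueProcessingPropsSketch {
  /** Proposed flag; undefined is treated as true so existing stacks do not change. */
  readonly enableCpuScaling?: boolean;
}

interface ScalingTargetSketch {
  scaleOnCpuUtilization(id: string, opts: { targetUtilizationPercent: number }): void;
  scaleOnMetric(id: string, opts: { metricName: string }): void;
}

// Records which policies were created, so the behavior can be inspected.
class RecordingTarget implements ScalingTargetSketch {
  readonly policies: string[] = [];
  scaleOnCpuUtilization(id: string): void {
    this.policies.push(id);
  }
  scaleOnMetric(id: string): void {
    this.policies.push(id);
  }
}

function configureAutoscaling(props: QueueProcessingPropsSketch, target: ScalingTargetSketch): void {
  // Default to true: a deployment with unchanged code keeps the CPU policy.
  const enableCpuScaling = props.enableCpuScaling ?? true;

  if (enableCpuScaling) {
    target.scaleOnCpuUtilization('CpuScaling', { targetUtilizationPercent: 50 });
  }

  // The queue-depth policy is always created in this sketch.
  target.scaleOnMetric('QueueMessagesVisibleScaling', {
    metricName: 'ApproximateNumberOfMessagesVisible',
  });
}
```

With `{}` the sketch creates both policies, and with `{ enableCpuScaling: false }` it creates only the queue-based one, which is the opt-out behavior being asked for.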

AnuragMohapatra · 2022-12-12T20:59:16Z

@HBobertz @bvtujo I am happy to add the optional parameter, but how do we warn new users that enabling CPU scaling while queue-based scaling is enabled by default will cause this bug on their stack? This will not solve the underlying issue but only hide it for the time being.

We cannot disable the queue-based scaling when CPU-based scaling is enabled, as that just means we are converting the queue-based Fargate service to a schedule-based Fargate service where the schedule is being handled by the internal user.

HBobertz · 2022-12-12T21:38:48Z

This will not solve the underlying issue but only hide it for the time being.

@AnuragMohapatra So maybe I am misunderstanding the interaction of these scaling policies. My understanding is that the issue with these 2 policies arises when tasks of significantly varying processing requirements are within the same queue. One very CPU-intensive task may take up a lot of CPU while 4+ tasks sit in the queue and thus trigger the scaling policy to scale out. These other tasks may not be nearly as computationally intensive as the first, or may simply be waiting on another task/input, so the CPU utilization on these new containers will be low and trigger a scale-in, which could occur midway through processing a message, leading to time/data loss.

I agree that this interaction is a bug for this use case, but I don't necessarily see how this is a bug for all use cases, which it would need to be to justify a breaking change for all users. To me it would seem that someone could have regular very computationally heavy tasks and may want both scaling policies.

Is this interpretation wrong? Is it just that these 2 policies always create situations which could lead to improper scale ins under any workload?

Pull request has been modified.

AnuragMohapatra · 2022-12-14T09:46:43Z

@HBobertz Yes, I understand how it might be a breaking change. I have added the optional parameter.

My concern is if a new user uses this optional parameter to enable the CPU scaling on the fargate service, then this issue will start occurring, how should they be warned of the potential issue?

Also, since there is no feature to mark a container as termination-protected (as there is for an EC2 instance under an ASG), use of this parameter will be buggy, and users should handle graceful termination or the message-loss scenario themselves when using the pattern.

…//github.com/AnuragMohapatra/aws-cdk into fix/issue-20706-fixscalingbasedracecondition

HBobertz · 2022-12-14T15:17:31Z

My concern is if a new user uses this optional parameter to enable the CPU scaling on the fargate service, then this issue will start occurring, how should they be warned of the potential issue?

@AnuragMohapatra I will do a more thorough review in a sec, but my initial thoughts on this are:

- We can't default to False here as this would be a breaking change for people. This default config has been in the wild for 4 years so I'm not seeing how this has been an anti pattern for everyone for those 4 years. Someone may want both scaling policies and we can't just move them off the current behavior randomly if that's what they are expecting
- Unsure if we do actually need to warn people but I will consult with some other engineers on the team about this

HBobertz · 2022-12-14T16:01:31Z

We could also put this behind a feature flag if we want to leave this as default. I will consult with some other engineers about this

packages/@aws-cdk/aws-ecs-patterns/lib/base/queue-processing-service-base.ts

HBobertz

Left an alternative design for these props so we could combine the props together. But tl;dr is that we can't default to false for this value as it would be a breaking change. If we want to default this to false then we will handle that as a different PR as that change will require a feature flag

packages/@aws-cdk/aws-ecs-patterns/lib/base/queue-processing-service-base.ts

HBobertz · 2022-12-14T17:50:04Z

packages/@aws-cdk/aws-ecs-patterns/lib/base/queue-processing-service-base.ts

@@ -342,6 +367,9 @@ export abstract class QueueProcessingServiceBase extends Construct {
      }
    }

+    this.enableCpuBasedScaling = props.enableCpuBasedScaling ?? false;


Same as above. This needs to default to true if we don't follow alternative prop design

HBobertz · 2022-12-14T17:50:55Z

...s-cdk/aws-ecs-patterns/test/fargate/integ.queue-processing-fargate-service-scaling-policy.ts

+  testCases: [stack],
+});
+
+app.synth();


integ test is fine but can we get a unit test for this prop functionality?

sure, i don't see why not

HBobertz · 2022-12-14T17:54:32Z

packages/@aws-cdk/aws-ecs-patterns/lib/base/queue-processing-service-base.ts

+   *
+   * @default 50
+   */
+  readonly cpuBasedScalingTargetUtilization?: number;


Alternatively, instead of the enableCpuBasedScaling flag, we can combine both of these new props by checking the number passed. If it is 1-100, CPU scaling is enabled at that value. If it is -1, CPU utilization scaling is disabled. Then, instead of checking a true/false flag, we check the number's value. The default would stay the same at 50.

If we keep your current implementation then we will need to check for the situation where a customer passes enableCpuBasedScaling: false and cpuBasedScalingTargetUtilization: <a_valid_int>, and then throw an error as this is not a valid prop configuration.
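
For the two-prop variant, the consistency check described above might look something like this. It is only a sketch: the prop names are taken from the diff in this PR, while `CpuScalingPropsSketch`, `assertCpuScalingPropsConsistent`, and the error message are hypothetical.

```typescript
// Hypothetical subset of the props under discussion; not the real CDK interface.
interface CpuScalingPropsSketch {
  readonly enableCpuBasedScaling?: boolean;
  readonly cpuBasedScalingTargetUtilization?: number;
}

// Throws when a target utilization is supplied although CPU scaling is explicitly disabled.
function assertCpuScalingPropsConsistent(props: CpuScalingPropsSketch): void {
  const explicitlyDisabled = props.enableCpuBasedScaling === false;
  const targetProvided = props.cpuBasedScalingTargetUtilization !== undefined;

  if (explicitlyDisabled && targetProvided) {
    throw new Error(
      'cpuBasedScalingTargetUtilization cannot be set when enableCpuBasedScaling is false',
    );
  }
}
```

Folding both settings into a single numeric prop removes the need for this check, because the invalid combination can no longer be expressed.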

packages/@aws-cdk/aws-ecs-patterns/lib/base/queue-processing-service-base.ts

Pull request has been modified.

AnuragMohapatra · 2022-12-15T10:16:20Z

@HBobertz

This default config has been in the wild for 4 years so I'm not seeing how this has been an anti pattern for everyone for those 4 years.

I am assuming a limited number of users are using this. Users who are using it are facing this issue, so either they apply a manual fix (see the comment from danilvalov on the issue) or they move to an EC2-based ECS service, where you can mark the EC2 instance as termination-protected through a Lambda when it is instantiated and gracefully remove the protection once the job is complete (some engineers I know have done it this way).

Someone may want both scaling policies and we can't just move them off the current behavior randomly if that's what they are expecting

Yes, I agree

Alternatively, instead of the enableCpuBasedScaling flag, we can combine both of these new props by checking the number passed. If it is 1-100, CPU scaling is enabled at that value. If it is -1, CPU utilization scaling is disabled. Then, instead of checking a true/false flag, we check the number's value. The default would stay the same at 50.

Thanks for this. It looks like a much cleaner and safer implementation; I have updated the PR.

And thanks for engaging with such a feedback-based review.

HBobertz · 2022-12-15T14:33:16Z

packages/@aws-cdk/aws-ecs-patterns/lib/base/queue-processing-service-base.ts

-      targetUtilizationPercent: 50,
-    });
+
+    if (this.cpuBasedScalingTargetUtilization > -1) {


Sorry this is a bit nitty, but this functionality won't be obvious looking at the code, and it probably shouldn't take -2 as an option (which it seems to do at the moment). Could we change this to a != -1 and add a comment saying something like "if -1 is passed, we disable cpu based scaling for this fargate service" just so it's abundantly clear for anyone who is looking through this source file. I think values such as -8 should also be caught by the 0 <= n <= 100 check and throw an error which they don't seem to be under this implementation

HBobertz · 2022-12-15T18:47:08Z

I am assuming a limited number of users are using this

That could well be true, but we don't actually know, and we shouldn't risk a breaking change without knowing fully.

Either way added a review and (unless my understanding is wrong) I'd like to see it only accept -1 as a valid value to disable, and all other negatives throw an error. Also is 0 a valid value for this? Scaling based off 0 cpu usage sounds pretty weird to me
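
The stricter validation requested in this review could be sketched as below. The names `resolveCpuScalingTarget`, `CPU_SCALING_DISABLED`, and `DEFAULT_CPU_TARGET` are hypothetical. Rejecting 0 is an assumption based on the earlier "1-100" wording, since whether 0 should be accepted is still an open question in this thread.

```typescript
const CPU_SCALING_DISABLED = -1; // sentinel discussed in this thread
const DEFAULT_CPU_TARGET = 50;   // default stays the same as before the change

/**
 * Returns the target utilization percent, or undefined when CPU-based
 * scaling is disabled. Throws on anything other than -1 or a value in 1..100.
 */
function resolveCpuScalingTarget(value?: number): number | undefined {
  const target = value ?? DEFAULT_CPU_TARGET;

  // If -1 is passed, we disable CPU-based scaling for this Fargate service.
  if (target === CPU_SCALING_DISABLED) {
    return undefined;
  }

  // Every other value, including other negatives such as -2 or -8, must be in range.
  // Assumption: 0 is rejected; the lower bound follows the "1-100" wording.
  if (!Number.isFinite(target) || target < 1 || target > 100) {
    throw new RangeError(
      `cpuBasedScalingTargetUtilization must be -1 or between 1 and 100, got ${target}`,
    );
  }
  return target;
}
```

A caller would then create the CPU policy only when the result is not undefined, so the sentinel is interpreted in exactly one place.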

…//github.com/AnuragMohapatra/aws-cdk into fix/issue-20706-fixscalingbasedracecondition

Pull request has been modified.

aws-cdk-automation · 2022-12-16T01:23:36Z

AWS CodeBuild CI Report

CodeBuild project: AutoBuildv2Project1C6BFA3F-wQm2hXv2jqQv
Commit ID: f3ea871
Result: FAILED
Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

bvtujo · 2022-12-16T15:56:33Z

@AnuragMohapatra on a related note, you may be able to prevent this scale in behavior with no modifications by using the newly released task scale-in protection endpoint. This may not solve your underlying problem entirely, but it does allow Fargate tasks to protect themselves during long-running compute work.

If your worker detects that it's going to be processing a long-running job, it can call the ECS Agent URI (automatically injected into all containers as an environment variable) like so:

PUT $ECS_AGENT_URI/task-protection/v1/state -d '{"ProtectionEnabled":true,"ExpiresInMinutes":60}'

to enable 60 minutes of scale-in protection for itself. The ExpiresInMinutes param is optional; if it is omitted, protection defaults to 2 hours, and it can be set to at most 48 hours (2,880 minutes).

The downside is that this requires updates to your service code but does offer configuration of what tasks the scheduler will allow to be killed.
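
As a rough illustration of what that service-code change could look like, the following sketch builds the request described above. The endpoint path and JSON field names come from the comment; `TaskProtectionRequest`, `buildTaskProtectionRequest`, and the Content-Type header are assumptions made for this example.

```typescript
// Hypothetical request shape for illustration only.
interface TaskProtectionRequest {
  readonly url: string;
  readonly method: 'PUT';
  readonly headers: Record<string, string>;
  readonly body: string;
}

function buildTaskProtectionRequest(
  agentUri: string,
  protectionEnabled: boolean,
  expiresInMinutes?: number,
): TaskProtectionRequest {
  const payload: { ProtectionEnabled: boolean; ExpiresInMinutes?: number } = {
    ProtectionEnabled: protectionEnabled,
  };
  // ExpiresInMinutes is optional, so it is only sent when provided.
  if (expiresInMinutes !== undefined) {
    payload.ExpiresInMinutes = expiresInMinutes;
  }

  return {
    url: `${agentUri}/task-protection/v1/state`,
    method: 'PUT',
    // Assumption: the payload is sent as JSON.
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(payload),
  };
}
```

A worker would read the agent URI from the ECS_AGENT_URI environment variable, send the request with its HTTP client of choice before starting a long job, and send another request with ProtectionEnabled set to false once the job completes.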

aws-cdk-automation · 2023-01-07T00:08:47Z

This PR has been in the BUILD FAILING state for 3 weeks, and looks abandoned. To keep this PR from being closed, please continue work on it. If not, it will automatically be closed in a week.

aws-cdk-automation · 2023-01-15T00:09:12Z

This PR has been deemed to be abandoned, and will be automatically closed. Please create a new PR for these changes if you think this decision has been made in error.

aws-cdk-automation

The pull request linter has failed. See the aws-cdk-automation comment below for failure reasons. If you believe this pull request should receive an exemption, please comment and provide a justification.

aws-cdk-automation · 2023-01-15T00:11:02Z

The pull request linter fails with the following errors:

❌ The title of the pull request should omit 'aws-' from the name of modified packages. Use 'ecs-patterns' instead of 'aws-ecs-patterns'.

PRs must pass status checks before we can provide a meaningful review.

AnuragMohapatra and others added 3 commits November 12, 2022 12:54

Remove CPU based scaling and update the integ tests

964abac

Merge branch 'main' into fix/issue-20706-fixscalingbasedracecondition

3a9b9e4

Adding integ test for scaling policy

23dc34d

aws-cdk-automation requested a review from a team December 11, 2022 06:26

github-actions bot added the beginning-contributor, bug, and p1 labels Dec 11, 2022

aws-cdk-automation previously requested changes Dec 11, 2022

AnuragMohapatra changed the title ~~Fix/issue 20706 fixscalingbasedracecondition~~ fix(aws-ecs-patterns): fix non-editable Scaling Policy causes race conditions & dropped tasks Dec 11, 2022

HBobertz self-assigned this Dec 12, 2022

HBobertz previously requested changes Dec 12, 2022

Merge branch 'main' into fix/issue-20706-fixscalingbasedracecondition

b7c21d6

Adding optional parameter instead of removing the scaling

a97b69b

AnuragMohapatra added 2 commits December 14, 2022 09:47

Adding optional parameter

6d83d79

Merge branch 'fix/issue-20706-fixscalingbasedracecondition' of https:…

f31a493

…//github.com/AnuragMohapatra/aws-cdk into fix/issue-20706-fixscalingbasedracecondition

comcalvi reviewed Dec 14, 2022

packages/@aws-cdk/aws-ecs-patterns/lib/base/queue-processing-service-base.ts

HBobertz previously requested changes Dec 14, 2022

Updating the parameter format and integ test, adding unit tests

e949098

Merge branch 'main' into fix/issue-20706-fixscalingbasedracecondition

ed49270

HBobertz previously requested changes Dec 15, 2022

AnuragMohapatra added 2 commits December 16, 2022 01:01

Update the enable disable mechanism and add unit tests

c47a945

Merge branch 'fix/issue-20706-fixscalingbasedracecondition' of https:…

f3ea871

…//github.com/AnuragMohapatra/aws-cdk into fix/issue-20706-fixscalingbasedracecondition

aws-cdk-automation added the closed-for-staleness label Jan 15, 2023

aws-cdk-automation closed this Jan 15, 2023

aws-cdk-automation requested changes Jan 15, 2023

keenangraham mentioned this pull request Dec 8, 2023

aws-ecs-patterns (QueueProcessingFargateService): non-editable Scaling Policy causes race conditions & dropped tasks #20706

Closed
