Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(aws-ecs-patterns): fix non-editable Scaling Policy causes race conditions & dropped tasks #23310

Closed
wants to merge 11 commits into from
Closed

fix(aws-ecs-patterns): fix non-editable Scaling Policy causes race conditions & dropped tasks #23310

wants to merge 11 commits into from

Conversation

AnuragMohapatra
Copy link
Contributor

@AnuragMohapatra AnuragMohapatra commented Dec 11, 2022


fixes #20706

Added an optional parameter that defaults to false over the CPU-based scaling policy that is conflicting with the queue visible message-based policy.

Note: If this parameter is enabled then this bug will crop up again and the user has to handle the container termination manually.

Updated integration tests and unit tests are working.

All Submissions:

Adding new Construct Runtime Dependencies:

  • This PR adds new construct runtime dependencies following the process described here

New Features

  • Have you added the new feature to an integration test?
    • Did you use yarn integ to deploy the infrastructure and generate the snapshot (i.e. yarn integ without --dry-run)?

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache-2.0 license

@gitpod-io
Copy link

gitpod-io bot commented Dec 11, 2022

@aws-cdk-automation aws-cdk-automation requested a review from a team December 11, 2022 06:26
@github-actions github-actions bot added beginning-contributor [Pilot] contributed between 0-2 PRs to the CDK bug This issue is a bug. p1 labels Dec 11, 2022
Copy link
Collaborator

@aws-cdk-automation aws-cdk-automation left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The pull request linter has failed. See the aws-cdk-automation comment below for failure reasons. If you believe this pull request should receive an exemption, please comment and provide a justification.

@AnuragMohapatra AnuragMohapatra changed the title Fix/issue 20706 fixscalingbasedracecondition fix(aws-ecs-patterns): fix non-editable Scaling Policy causes race conditions & dropped tasks Dec 11, 2022
@aws-cdk-automation aws-cdk-automation dismissed their stale review December 11, 2022 06:32

✅ Updated pull request passes all PRLinter validations. Dissmissing previous PRLinter review.

@HBobertz HBobertz self-assigned this Dec 12, 2022
Copy link
Contributor

@HBobertz HBobertz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I am onboard with the the fact that this needs to be configurable for different use cases but I'm not so onboard with just deleting it outright as this likely could be a desired state by other customers. Can we not add optional parameters/props to this method/class and make this configurable from a parent construct?

@HBobertz
Copy link
Contributor

@kaizencc

@bvtujo
Copy link
Contributor

bvtujo commented Dec 12, 2022

I really appreciate your taking the time to implement this fix and getting it through integ tests. However, if we release a version of the CDK which will mutate stacks on a deployment with the same code, we risk breaking other customers.

How would you feel about the following solution?

  1. An optional parameter in QueueProcessingServiceBaseProps called enableCpuScaling?: boolean which defaults to true (this means that the default behavior on a deployment is the same as before the change).
  2. Additional logic which uses this parameter to conditionally create the cpu scaling policy.
  3. A change to an integ test which exercises the new logic.

@AnuragMohapatra
Copy link
Contributor Author

AnuragMohapatra commented Dec 12, 2022

@HBobertz @bvtujo I am happy to add the optional parameter but how do we warn new users that enabling CPU scaling while queue-based scaling is default enabled will cause this bug on their stack? This will not solve the underlying issue but only hide it for the time being.

We cannot disable the queue-based scaling when CPU-based scaling is enabled as that just means we are converting the queue-based fargate to a scheduled-based fargate service where the schedule is being handled by the internal user.

@HBobertz
Copy link
Contributor

This will not solve the underlying issue but only hide it for the time being.

@AnuragMohapatra So maybe I am misunderstanding the interaction of these scaling policies. My understanding is that the issue with these 2 policies arrises when tasks of significantly varying processing requirements are within the same queue. One very CPU intensive task may take up alot of CPU usage while 4+ tasks sit in the queue and thus trigger the scaling policy to scale out. These other tasks may not be nearly as computationally intensive as the first, or simply waiting on another task/input, thus the CPU utilization on these new containers will be low and trigger a scale in, which could potentially occur mid way through processing a message leading to time/data loss.

I agree that this interaction is a bug for this use case, but I don't necessarily see how this is a bug for all use cases, which it would need to be to justify a breaking change for all users. To me it would seem that someone could have regular very computationally heavy tasks and may want both scaling policies.

Is this interpretation wrong? Is it just that these 2 policies always create situations which could lead to improper scale ins under any workload?

@mergify mergify bot dismissed HBobertz’s stale review December 13, 2022 08:31

Pull request has been modified.

@AnuragMohapatra
Copy link
Contributor Author

AnuragMohapatra commented Dec 14, 2022

@HBobertz Yes, I understand how it might be breaking change, I have added the optional parameter.

My concern is if a new user uses this optional parameter to enable the CPU scaling on the fargate service, then this issue will start occurring, how should they be warned of the potential issue?

Also considering there is no feature to mark a container termination protected similar to an EC2 instance under ASG usage of this parameter will be buggy, and users should handle the graceful termination or message loss scenario themselves while using the pattern.

@HBobertz
Copy link
Contributor

My concern is if a new user uses this optional parameter to enable the CPU scaling on the fargate service, then this issue will start occurring, how should they be warned of the potential issue?

@AnuragMohapatra I will do a more thorough review in a sec by my initial thoughts on this are:

  1. We can't default to False here as this would be a breaking change for people. This default config has been in the wild for 4 years so I'm not seeing how this has been an anti pattern for everyone for those 4 years. Someone may want both scaling policies and we cant just move them off the current behavior randomly if that's what they are expecting
  2. Unsure if we do actually need to warn people but I will consult with some other engineers on the team about this

@HBobertz
Copy link
Contributor

We could also put this behind a feature flag if we want to leave this as default. I will consult with some other engineers about this

Copy link
Contributor

@HBobertz HBobertz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left an alternative design for these props so we could combine the props together. But tl;dr is that we can't default to false for this value as it would be a breaking change. If we want to default this to false then we will handle that as a different PR as that change will require a feature flag

@@ -342,6 +367,9 @@ export abstract class QueueProcessingServiceBase extends Construct {
}
}

this.enableCpuBasedScaling = props.enableCpuBasedScaling ?? false;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above. This needs to default to true if we don't follow alternative prop design

testCases: [stack],
});

app.synth();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

integ test is fine but can we get a unit test for this prop funcitonality?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure, i don't see why not

*
* @default 50
*/
readonly cpuBasedScalingTargetUtilization?: number;
Copy link
Contributor

@HBobertz HBobertz Dec 14, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternatively instead of the enableCpuBasedScaling flag, we can combine both these new props by checking the number passed. If 1-100 then it is enabled at that value. If -1 then cpu utilization is disabled. And then instead of checking on true/false flag then we can check on the numbers value. Default would stay the same at 50

If we keep your current implementation then we will need to check for the situation a customer passes
enableCpuBasedScaling: false and cpuBasedScalingTargetUtilization: <a_valid_int> and then throw an error as this is not a valid prop configuration.

@mergify mergify bot dismissed HBobertz’s stale review December 15, 2022 10:10

Pull request has been modified.

@AnuragMohapatra
Copy link
Contributor Author

AnuragMohapatra commented Dec 15, 2022

@HBobertz

This default config has been in the wild for 4 years so I'm not seeing how this has been an anti pattern for everyone for those 4 years.

I am assuming limited number users are using this, user who are using it are facing issue so either they are performing a manual fix, which you can find as a comment from danilvalov on the issue or they are just moving to EC2 based ECS service in which you can mark the EC2 instance as termination protected through a lambda upon instantiated and gracefully remove the protection once the job is complete (some of the engineers in my known circle have done this way).

Someone may want both scaling policies and we cant just move them off the current behavior randomly if that's what they are expecting

Yes, I agree

Alternatively instead of the enableCpuBasedScaling flag, we can combine both these new props by checking the number passed. If 1-100 then it is enabled at that value. If -1 then cpu utilization is disabled. And then instead of checking on true/false flag then we can check on the numbers value. Default would stay the same at 50

Thanks for this, this looks much cleaner implementation and safer, I have updated the PR.

And thanks for engaging with such feedback based review.

targetUtilizationPercent: 50,
});

if (this.cpuBasedScalingTargetUtilization > -1) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry this is a bit nitty, but this functionality won't be obvious looking at the code, and it probably shouldn't take -2 as an option (which it seems to do at the moment). Could we change this to a != -1 and add a comment saying something like "if -1 is passed, we disable cpu based scaling for this fargate service" just so it's abundantly clear for anyone who is looking through this source file. I think values such as -8 should also be caught by the 0 <= n <= 100 check and throw an error which they don't seem to be under this implementation

@HBobertz
Copy link
Contributor

I am assuming limited number users are using this

Likely could be true but we don't actually know and we shouldn't risk a breaking change without knowing fully.|

Either way added a review and (unless my understanding is wrong) I'd like to see it only accept -1 as a valid value to disable, and all other negatives throw an error. Also is 0 a valid value for this? Scaling based off 0 cpu usage sounds pretty weird to me

@mergify mergify bot dismissed HBobertz’s stale review December 16, 2022 01:02

Pull request has been modified.

@aws-cdk-automation
Copy link
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: AutoBuildv2Project1C6BFA3F-wQm2hXv2jqQv
  • Commit ID: f3ea871
  • Result: FAILED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@bvtujo
Copy link
Contributor

bvtujo commented Dec 16, 2022

@AnuragMohapatra on a related note, you may be able to prevent this scale in behavior with no modifications by using the newly released task scale-in protection endpoint. This may not solve your underlying problem entirely, but it does allow Fargate tasks to protect themselves during long-running compute work.

If your worker detects that it's going to be processing a long-running job, it can call the ECS Agent URI (automatically injected into all containers as an environment variable) like so:

PUT $ECS_AGENT_URI/task-protection/v1/state -d 
'{"ProtectionEnabled":true,"ExpiresInMinutes":60}'

to enable 60 minutes of scale in protection for itself. The ExpiresInMinutes param is optional, and can be set indefinitely.

The downside is that this requires updates to your service code but does offer configuration of what tasks the scheduler will allow to be killed.

@aws-cdk-automation
Copy link
Collaborator

This PR has been in the BUILD FAILING state for 3 weeks, and looks abandoned. To keep this PR from being closed, please continue work on it. If not, it will automatically be closed in a week.

@aws-cdk-automation
Copy link
Collaborator

This PR has been deemed to be abandoned, and will be automatically closed. Please create a new PR for these changes if you think this decision has been made in error.

@aws-cdk-automation aws-cdk-automation added the closed-for-staleness This issue was automatically closed because it hadn't received any attention in a while. label Jan 15, 2023
Copy link
Collaborator

@aws-cdk-automation aws-cdk-automation left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The pull request linter has failed. See the aws-cdk-automation comment below for failure reasons. If you believe this pull request should receive an exemption, please comment and provide a justification.

@aws-cdk-automation
Copy link
Collaborator

The pull request linter fails with the following errors:

❌ The title of the pull request should omit 'aws-' from the name of modified packages. Use 'ecs-patterns' instead of 'aws-ecs-patterns'.

PRs must pass status checks before we can provide a meaningful review.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
beginning-contributor [Pilot] contributed between 0-2 PRs to the CDK bug This issue is a bug. closed-for-staleness This issue was automatically closed because it hadn't received any attention in a while. p1
Projects
None yet
5 participants