-
Notifications
You must be signed in to change notification settings - Fork 3.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(aws-ecs-patterns): fix non-editable Scaling Policy causes race conditions & dropped tasks #23310
fix(aws-ecs-patterns): fix non-editable Scaling Policy causes race conditions & dropped tasks #23310
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The pull request linter has failed. See the aws-cdk-automation comment below for failure reasons. If you believe this pull request should receive an exemption, please comment and provide a justification.
✅ Updated pull request passes all PRLinter validations. Dissmissing previous PRLinter review.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I am onboard with the the fact that this needs to be configurable for different use cases but I'm not so onboard with just deleting it outright as this likely could be a desired state by other customers. Can we not add optional parameters/props to this method/class and make this configurable from a parent construct?
I really appreciate your taking the time to implement this fix and getting it through integ tests. However, if we release a version of the CDK which will mutate stacks on a deployment with the same code, we risk breaking other customers. How would you feel about the following solution?
|
@HBobertz @bvtujo I am happy to add the optional parameter but how do we warn new users that enabling CPU scaling while queue-based scaling is default enabled will cause this bug on their stack? This will not solve the underlying issue but only hide it for the time being. We cannot disable the queue-based scaling when CPU-based scaling is enabled as that just means we are converting the queue-based fargate to a scheduled-based fargate service where the schedule is being handled by the internal user. |
@AnuragMohapatra So maybe I am misunderstanding the interaction of these scaling policies. My understanding is that the issue with these 2 policies arrises when tasks of significantly varying processing requirements are within the same queue. One very CPU intensive task may take up alot of CPU usage while 4+ tasks sit in the queue and thus trigger the scaling policy to scale out. These other tasks may not be nearly as computationally intensive as the first, or simply waiting on another task/input, thus the CPU utilization on these new containers will be low and trigger a scale in, which could potentially occur mid way through processing a message leading to time/data loss. I agree that this interaction is a bug for this use case, but I don't necessarily see how this is a bug for all use cases, which it would need to be to justify a breaking change for all users. To me it would seem that someone could have regular very computationally heavy tasks and may want both scaling policies. Is this interpretation wrong? Is it just that these 2 policies always create situations which could lead to improper scale ins under any workload? |
Pull request has been modified.
@HBobertz Yes, I understand how it might be breaking change, I have added the optional parameter. My concern is if a new user uses this optional parameter to enable the CPU scaling on the fargate service, then this issue will start occurring, how should they be warned of the potential issue? Also considering there is no feature to mark a container termination protected similar to an EC2 instance under ASG usage of this parameter will be buggy, and users should handle the graceful termination or message loss scenario themselves while using the pattern. |
…//github.com/AnuragMohapatra/aws-cdk into fix/issue-20706-fixscalingbasedracecondition
@AnuragMohapatra I will do a more thorough review in a sec by my initial thoughts on this are:
|
We could also put this behind a feature flag if we want to leave this as default. I will consult with some other engineers about this |
packages/@aws-cdk/aws-ecs-patterns/lib/base/queue-processing-service-base.ts
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left an alternative design for these props so we could combine the props together. But tl;dr is that we can't default to false for this value as it would be a breaking change. If we want to default this to false then we will handle that as a different PR as that change will require a feature flag
packages/@aws-cdk/aws-ecs-patterns/lib/base/queue-processing-service-base.ts
Outdated
Show resolved
Hide resolved
@@ -342,6 +367,9 @@ export abstract class QueueProcessingServiceBase extends Construct { | |||
} | |||
} | |||
|
|||
this.enableCpuBasedScaling = props.enableCpuBasedScaling ?? false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above. This needs to default to true if we don't follow alternative prop design
testCases: [stack], | ||
}); | ||
|
||
app.synth(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
integ test is fine but can we get a unit test for this prop funcitonality?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure, i don't see why not
* | ||
* @default 50 | ||
*/ | ||
readonly cpuBasedScalingTargetUtilization?: number; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alternatively instead of the enableCpuBasedScaling flag, we can combine both these new props by checking the number passed. If 1-100 then it is enabled at that value. If -1 then cpu utilization is disabled. And then instead of checking on true/false flag then we can check on the numbers value. Default would stay the same at 50
If we keep your current implementation then we will need to check for the situation a customer passes
enableCpuBasedScaling: false
and cpuBasedScalingTargetUtilization: <a_valid_int>
and then throw an error as this is not a valid prop configuration.
packages/@aws-cdk/aws-ecs-patterns/lib/base/queue-processing-service-base.ts
Outdated
Show resolved
Hide resolved
Pull request has been modified.
I am assuming limited number users are using this, user who are using it are facing issue so either they are performing a manual fix, which you can find as a comment from danilvalov on the issue or they are just moving to EC2 based ECS service in which you can mark the EC2 instance as termination protected through a lambda upon instantiated and gracefully remove the protection once the job is complete (some of the engineers in my known circle have done this way).
Yes, I agree
Thanks for this, this looks much cleaner implementation and safer, I have updated the PR. And thanks for engaging with such feedback based review. |
targetUtilizationPercent: 50, | ||
}); | ||
|
||
if (this.cpuBasedScalingTargetUtilization > -1) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry this is a bit nitty, but this functionality won't be obvious looking at the code, and it probably shouldn't take -2 as an option (which it seems to do at the moment). Could we change this to a != -1 and add a comment saying something like "if -1 is passed, we disable cpu based scaling for this fargate service" just so it's abundantly clear for anyone who is looking through this source file. I think values such as -8 should also be caught by the 0 <= n <= 100 check and throw an error which they don't seem to be under this implementation
Likely could be true but we don't actually know and we shouldn't risk a breaking change without knowing fully.| Either way added a review and (unless my understanding is wrong) I'd like to see it only accept -1 as a valid value to disable, and all other negatives throw an error. Also is 0 a valid value for this? Scaling based off 0 cpu usage sounds pretty weird to me |
…//github.com/AnuragMohapatra/aws-cdk into fix/issue-20706-fixscalingbasedracecondition
Pull request has been modified.
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
@AnuragMohapatra on a related note, you may be able to prevent this scale in behavior with no modifications by using the newly released task scale-in protection endpoint. This may not solve your underlying problem entirely, but it does allow Fargate tasks to protect themselves during long-running compute work. If your worker detects that it's going to be processing a long-running job, it can call the ECS Agent URI (automatically injected into all containers as an environment variable) like so:
to enable 60 minutes of scale in protection for itself. The The downside is that this requires updates to your service code but does offer configuration of what tasks the scheduler will allow to be killed. |
This PR has been in the BUILD FAILING state for 3 weeks, and looks abandoned. To keep this PR from being closed, please continue work on it. If not, it will automatically be closed in a week. |
This PR has been deemed to be abandoned, and will be automatically closed. Please create a new PR for these changes if you think this decision has been made in error. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The pull request linter has failed. See the aws-cdk-automation comment below for failure reasons. If you believe this pull request should receive an exemption, please comment and provide a justification.
The pull request linter fails with the following errors:
PRs must pass status checks before we can provide a meaningful review. |
fixes #20706
Added an optional parameter that defaults to false over the CPU-based scaling policy that is conflicting with the queue visible message-based policy.
Note: If this parameter is enabled then this bug will crop up again and the user has to handle the container termination manually.
Updated integration tests and unit tests are working.
All Submissions:
Adding new Construct Runtime Dependencies:
New Features
yarn integ
to deploy the infrastructure and generate the snapshot (i.e.yarn integ
without--dry-run
)?By submitting this pull request, I confirm that my contribution is made under the terms of the Apache-2.0 license