
aws-ecs-patterns (QueueProcessingFargateService): non-editable Scaling Policy causes race conditions & dropped tasks #20706

Closed
AnuragMohapatra opened this issue Jun 11, 2022 · 9 comments · Fixed by #28315
Labels
@aws-cdk/aws-ecs Related to Amazon Elastic Container @aws-cdk/aws-ecs-patterns Related to ecs-patterns library @aws-cdk/aws-sqs Related to Amazon Simple Queue Service bug This issue is a bug. p2

Comments

@AnuragMohapatra
Contributor

AnuragMohapatra commented Jun 11, 2022

Describe the bug

Current Scenario

For the scaling policy of the queue-processing Fargate service, two parts are added -

  1. Queue-length-based scaling - In this scenario, if the user has not provided a step-down counter, the system determines that it does not need scale-in and does not create a scale-in alarm.
  2. CPU-based scaling - In this scenario, scale-in and scale-out depend on the average CPU utilisation of the service.

This can be found here -

protected configureAutoscalingForService(service: BaseService) {

Issue

The CPU-based scaling does not seem appropriate for a queue-processing Fargate service; the service should only scale out or in depending on how many messages are in the queue, not on the CPU utilization of the system.

Because of the CPU-based scaling, the service may start a new task that will process the same message again if the message triggers a CPU-intensive process that does not complete before the scaling alarm fires.

Also, if the process is memory-intensive, the CPU-based scale-in alarm will constantly be in the alarm state, causing the service to keep removing tasks until it reaches the minimum capacity.

The same scenarios are also relevant for the memory utilization metric when the running task is actually CPU-intensive.

Since there is no task-level termination protection, and the disable-scale-in feature is missing from the patterns, this can cause the service to terminate a task that is mid-execution.

Expected Behavior

When a queue-processing Fargate service has been set up to scale out only on the approximate number of messages in the queue, and scale-in has been disabled, it should not terminate tasks.

Current Behavior

The queue-processing Fargate service starts terminating tasks when a task is memory-intensive and has a long processing time, because a CloudWatch scale-in alarm is triggered by the CPUUtilization scaling policy, terminating a random task mid-execution.

Reproduction Steps

The following CDK stack -

  import { Stack, StackProps } from 'aws-cdk-lib';
  import { Construct } from 'constructs';
  import { QueueProcessingFargateService } from 'aws-cdk-lib/aws-ecs-patterns';
  import { ContainerImage } from 'aws-cdk-lib/aws-ecs';

  export class QueueProcessingFargateServiceAutoscaleTestStack extends Stack {
    constructor(scope: Construct, id: string, props?: StackProps) {
      super(scope, id, props);
      const containerImage = ContainerImage.fromAsset('test');
      new QueueProcessingFargateService(this, 'test', {
        image: containerImage,
        scalingSteps: [
          { upper: 0, change: 0 },
          { lower: 1, change: +1 },
        ],
      });
    }
  }

will create a new QueueProcessingFargateService with following type of scaling policy -

(screenshot: the scaling policies generated on the service)

which causes conflicting alarms to always be triggered -

(screenshots: the queue-depth scale-out alarm and the CPU scale-in alarm, both firing)

Possible Solution

The issue is with this method in the Queue processing fargate service pattern base -

protected configureAutoscalingForService(service: BaseService) {

It adds a default CPUUtilizationScalingPolicy that cannot be removed, edited, or disabled.

Solution 1

Remove the CPU utilization scaling policy if it is not strictly required.

Solution 2

Add optional properties that let the user disable scale-in on the CPU utilization metric, or modify the values as they see fit.
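As a sketch of what Solution 2 could look like (the property name and helper below are illustrative only, not the actual CDK API), an opt-out flag could gate which scaling policies the construct attaches:

```typescript
// Hypothetical sketch of Solution 2: an optional flag (name is illustrative,
// not a real aws-ecs-patterns prop) decides which policies get attached.
interface QueueProcessingScalingOptions {
  disableCpuBasedScaling?: boolean; // hypothetical; would default to false
}

// Returns the names of the scaling policies the construct would attach.
function policiesToAttach(opts: QueueProcessingScalingOptions = {}): string[] {
  // Queue-depth scaling is always attached: it is the whole point of
  // the queue-processing pattern.
  const policies = ['QueueMessagesVisibleScaling'];
  // CPU-based scaling is only attached when the user has not opted out.
  if (!opts.disableCpuBasedScaling) {
    policies.push('CpuScaling');
  }
  return policies;
}
```

This keeps the current behavior as the default while letting users whose workloads are IO- or memory-bound drop the conflicting CPU policy entirely.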

Additional Information/Context

No response

CDK CLI Version

2.27

Framework Version

No response

Node.js Version

16.14.2

OS

Linux

Language

Typescript

Language Version

3.9.7

Other information

No response

@AnuragMohapatra AnuragMohapatra added bug This issue is a bug. needs-triage This issue or PR still needs to be triaged. labels Jun 11, 2022
@github-actions github-actions bot added the @aws-cdk/aws-ecs-patterns Related to ecs-patterns library label Jun 11, 2022
@AnuragMohapatra AnuragMohapatra changed the title aws-ecs-patterns (QueueProcessingFargateService): Non-editable CpuScalingPolicy is added by default to the queue processing service. aws-ecs-patterns (QueueProcessingFargateService): ASG task termination race-condition due to a non-editable CPUScalingPolicy being added by default to service. Jun 16, 2022
@indrora
Contributor

indrora commented Aug 5, 2022

It is the responsibility of the queue worker, when it pulls a message from the queue, to hide it from the rest of the workers (with a maximum visibility timeout of 12 hours).

However, what I'm hearing from this ticket is that a sufficiently complex task (e.g. video de/encoding) that pegs CPU or memory usage for a prolonged time will be culled early because of the scaling configuration. Is that correct?

@AnuragMohapatra
Contributor Author

AnuragMohapatra commented Aug 6, 2022

Yes, something like -

  1. a. There are 5 messages in the queue, and queue-based scaling says to scale out if there are more than 2 messages.
     b. The Fargate service scales out, creating 2 new containers one after another.
     c. But none of the containers is performing a CPU-intensive activity, just IO-bound activity (maybe they need to wait for a
     third-party document-generation service to complete before they can move ahead), so the average CPU utilisation is low.
     d. Now the CPU-based scale-in alarm kicks in and removes a container that was processing a message midway; you have
     lost valuable time and potentially some information that might have been time-bound.
     e. You can only hide the messages from consumers, not remove them, or the queue-message scale-in will trigger and
     remove the container.

  2. a. There are only 3 messages in the queue, and queue-based scaling says to scale out if there are more than 4 messages.
     b. The Fargate service does nothing since no alarm has fired yet.
     c. But the container is performing a CPU-intensive activity, so the average CPU utilisation becomes high.
     d. Now the CPU-based scale-out alarm kicks in and adds a container, which starts processing a message.
     e. Then the queue-based scale-in kicks in because the messages are below the threshold, removing a container that is
     processing a message midway.
     f. Again, you have now lost the processing done by this container, which may cause issues in downstream
     applications.
     g. You can only hide the messages from consumers, not remove them, or the queue-message scale-in will trigger
     and remove the container.
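The thrashing in scenario 1 can be sketched as a toy simulation (plain TypeScript, not CDK code; the thresholds and the simplified step-scaling logic are made up for illustration):

```typescript
// Toy model of two conflicting policies on one service: a queue-depth policy
// that wants to scale out and a CPU policy that wants to scale in.
interface ClusterState {
  desired: number;  // desired task count
  messages: number; // visible messages on the queue
  avgCpu: number;   // average CPU utilisation, percent
}

// Scale out by 1 when more than 2 messages are visible (scenario 1's rule).
function queuePolicy(s: ClusterState): number {
  return s.messages > 2 ? s.desired + 1 : s.desired;
}

// Scale in by 1 when average CPU is below 50% (tasks are IO-bound, so it is).
function cpuPolicy(s: ClusterState): number {
  return s.avgCpu < 50 ? Math.max(1, s.desired - 1) : s.desired;
}

// One evaluation cycle: both alarms fire, and they cancel each other out.
function tick(s: ClusterState): ClusterState {
  const afterQueue = queuePolicy(s);
  const afterCpu = cpuPolicy({ ...s, desired: afterQueue });
  return { ...s, desired: afterCpu };
}

// 5 IO-bound messages: the queue policy adds a task each cycle, the CPU
// policy immediately removes one, so the fleet never grows and whichever
// task gets culled loses its in-flight message.
let state: ClusterState = { desired: 1, messages: 5, avgCpu: 10 };
const history: number[] = [];
for (let i = 0; i < 4; i++) {
  state = tick(state);
  history.push(state.desired);
}
```

The desired count stays pinned at 1 through every cycle even though the queue is backed up, which is exactly the "no work gets done" behavior reported later in this thread.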

An analogy is the SQS-Lambda pattern: you do not see any memory- or CPU-based scaling there because it would have the same effect as above; scaling should be flexible and have a priority. Lambda just times out on high memory and CPU issues.

QueueProcessingFargateService is a solution for the SQS-Lambda type of pattern where we know the task will take more than Lambda's 15-minute timeout to process and we do not want to maintain our own EC2 instances.

ASGs/ECS also have the same issue, but EC2 instances have a feature called termination protection, which lets you disable scale-in for an instance until its work has completed and no more activity is required.

@danilvalov

I have the same problem.

In my case I have a service using QueueProcessingFargateService for text parsing. Sometimes there are more than 1000 tasks in the SQS queue. Each task has a low CPU load. When my SQS count scaling rule adds a new instance, the CPU scaling rule stops it within 1 minute. Then my SQS count scaling rule adds a new instance again, and the CPU scaling rule stops it too.

To work around this I manually deleted the CPU scaling rule in the AWS console, but I don't think doing it manually in the console is a good solution.

@indrora indrora added p1 @aws-cdk/aws-ecs Related to Amazon Elastic Container @aws-cdk/aws-sqs Related to Amazon Simple Queue Service and removed needs-triage This issue or PR still needs to be triaged. labels Aug 25, 2022
@indrora indrora changed the title aws-ecs-patterns (QueueProcessingFargateService): ASG task termination race-condition due to a non-editable CPUScalingPolicy being added by default to service. aws-ecs-patterns (QueueProcessingFargateService): non-editable Scaling Policy causes race conditions & dropped tasks Aug 25, 2022
@inh3

inh3 commented Dec 16, 2022

Sharing how I am working around this issue at the moment.

I created a class that derives from QueueProcessingFargateService and overrides the protected configureAutoscalingForService(service: BaseService) method:

// see: https://github.com/aws/aws-cdk/issues/20706
import { BaseService } from 'aws-cdk-lib/aws-ecs';
import { QueueProcessingFargateService } from 'aws-cdk-lib/aws-ecs-patterns';

export class QueueAndMessageBasedScalingOnlyFargateService extends QueueProcessingFargateService {
  /**
   * Configure autoscaling based only on the number of messages visible in the SQS queue.
   *
   * @param service the ECS/Fargate service to apply the autoscaling rules to
   */
  protected configureAutoscalingForService(service: BaseService) {
    const scalingTarget = service.autoScaleTaskCount({
      maxCapacity: this.maxCapacity,
      minCapacity: this.minCapacity,
    });
    scalingTarget.scaleOnMetric('QueueMessagesVisibleScaling', {
      metric: this.sqsQueue.metricApproximateNumberOfMessagesVisible(),
      scalingSteps: this.scalingSteps,
    });
  }
}

Then I can simply use the derived class (QueueAndMessageBasedScalingOnlyFargateService) in place of QueueProcessingFargateService within my software.

@bvtujo
Contributor

bvtujo commented Dec 16, 2022

For others who are running into this problem while we work on the linked PR, there's an alternative: ECS now offers task-level scale-in protection via the newly released task scale-in protection endpoint. Individual tasks can now protect themselves during long-running compute work.

If your worker detects that it's going to be processing a long-running job, it can call the ECS Agent URI (automatically injected into all containers as an environment variable) like so:

curl -X PUT "$ECS_AGENT_URI/task-protection/v1/state" \
  -d '{"ProtectionEnabled":true,"ExpiresInMinutes":60}'

to enable 60 minutes of scale-in protection for itself. The ExpiresInMinutes parameter is optional and defaults to 2 hours of protection.

The downside is that this requires updates to your service code, but it does give you control over which tasks the scheduler is allowed to kill.
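A minimal sketch of wiring this into Node.js task code (the helper name is hypothetical and not part of any SDK; `ECS_AGENT_URI` is the environment variable ECS injects, and the endpoint path is the one quoted above):

```typescript
// Hypothetical helper: builds the task scale-in protection request described
// above so the task can protect itself before starting a long-running job.
interface TaskProtectionRequest {
  url: string;
  method: 'PUT';
  body: string;
}

function buildTaskProtectionRequest(
  agentUri: string,
  expiresInMinutes?: number,
): TaskProtectionRequest {
  const payload: { ProtectionEnabled: boolean; ExpiresInMinutes?: number } = {
    ProtectionEnabled: true,
  };
  // ExpiresInMinutes is optional; when omitted, ECS applies its own default
  // (2 hours of protection).
  if (expiresInMinutes !== undefined) {
    payload.ExpiresInMinutes = expiresInMinutes;
  }
  return {
    url: `${agentUri}/task-protection/v1/state`,
    method: 'PUT',
    body: JSON.stringify(payload),
  };
}

// Inside the task, before picking up a long-running message, you would send
// it with fetch, e.g.:
//   const req = buildTaskProtectionRequest(process.env.ECS_AGENT_URI!, 60);
//   await fetch(req.url, { method: req.method, body: req.body });
```

The same call with `ProtectionEnabled: false` would release protection once the job finishes, so the scheduler can reclaim the task again.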

@pahud pahud added the p1.5 label May 19, 2023
@madeline-k madeline-k removed their assignment Oct 30, 2023
@keenangraham

Running into this scaler thrashing in production.

If you look at the ECS events, one alarm is increasing the desired count because there are a lot of objects on the queue, and one is decreasing it because the task is using little CPU. This repeats every couple of minutes and no work gets done.

Our task makes sequential HTTP requests and is mainly IO- and memory-bound, so hard to saturate the CPU.

I'm not sure what the original intention of sneaking the CPU scaler into this construct was but it seems like a bug if the two scalers conflict like this.

Sad that the fix in #23310 died a slow death (there seems to have been some uncertainty about whether this was legitimate behavior or not).

@AnuragMohapatra
Contributor Author

@keenangraham I was just looking at this yesterday, thinking no one had resolved it yet. I added a new PR; hopefully it will get reviewed and merged.

@mergify mergify bot closed this as completed in #28315 Dec 22, 2023
mergify bot pushed a commit that referenced this issue Dec 22, 2023
…rget utilization (#28315)

Added an optional parameter, defaulting to false, that disables the CPU-based scaling policy which conflicts with the queue-visible-messages policy.

When CPU-based scaling is disabled, this stops the race condition mentioned in #20706 by scaling only on the number of messages in the queue, similar to the SQS-Lambda pattern.

Note: if CPU-based scaling is left enabled, this bug will crop up again and the user has to handle container termination manually.

Updated integration tests and unit tests are passing.

Closes #20706.

----

*By submitting this pull request, I confirm that my contribution is made under the terms of the Apache-2.0 license*
paulhcsun pushed a commit to paulhcsun/aws-cdk that referenced this issue Jan 5, 2024
@keenangraham

@keenangraham I was just looking at this yesterday thinking no one has resolved this yet, added a new PR hopefully this will get reviewed and merged.

Nice. Thanks @AnuragMohapatra!
