
[ECS] [request]: Control which containers are terminated on scale in #125

Closed
lox opened this issue Jan 21, 2019 · 97 comments
Labels
ECS (Amazon Elastic Container Service), Shipped (This feature request was delivered)

Comments

@lox

lox commented Jan 21, 2019

We use ECS for auto-scaling build agents for buildkite.com. We are using custom metrics and several Lambdas for scaling the ECS service that runs our agent based on pending CI jobs. Presently, when we scale in the DesiredCount on the service, it seems random which running containers get killed. It would be great to have more control over this, either a customizable timeframe to wait for containers to stop after being signaled or something similar to EC2 Lifecycle Hooks.

We're presently working around this by handling termination as gracefully as possible, but it often means cancelling an in-flight CI build, which we'd prefer not to do if other idle containers could be selected.

@lox lox added the Proposed (Community submitted issue) label Jan 21, 2019
@pgarbe

pgarbe commented Jan 21, 2019

What we do is use StepScaling (instead of SimpleScaling), because once the ASG triggers the termination process, it does not block any further scaling activities. In addition, we have a lifecycle hook which sets the instance to DRAINING (in ECS) and waits until all tasks are gone (or the timeout expires). It's based on this blog post: https://aws.amazon.com/blogs/compute/how-to-automate-container-instance-draining-in-amazon-ecs/
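
For reference, a minimal sketch of registering such a lifecycle hook with boto3 (the hook and ASG names are made up; the draining Lambda from the blog post is assumed to exist separately):

```python
import boto3

autoscaling = boto3.client("autoscaling")

# Pause terminating instances in the Terminating:Wait state so a separate
# Lambda (as in the linked blog post) can set the container instance to
# DRAINING in ECS and wait for its tasks to drain before the termination proceeds.
autoscaling.put_lifecycle_hook(
    LifecycleHookName="ecs-drain-on-terminate",   # hypothetical hook name
    AutoScalingGroupName="my-ecs-asg",            # hypothetical ASG name
    LifecycleTransition="autoscaling:EC2_INSTANCE_TERMINATING",
    HeartbeatTimeout=900,                         # seconds to keep the instance waiting
    DefaultResult="CONTINUE",                     # proceed with termination on timeout
)
```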

@lox
Author

lox commented Jan 25, 2019

Thanks @pgarbe, but I'm not sure how that helps! I'm talking about scaling in ECS tasks when an ECS service gets a decreased DesiredCount.

@pgarbe

pgarbe commented Jan 29, 2019

What I understood is that you want to keep the EC2 hosts running as long as some tasks run on them, right? Even when an instance is marked to be terminated by the Auto Scaling Group. Actually, you can't really control which EC2 instance gets terminated. But with the lifecycle hook I mentioned above, you can delay the termination until all tasks are gone.

@lox
Author

lox commented Feb 1, 2019

Apologies if I've done a bad job of explaining myself @pgarbe, that is not at all what I mean. The autoscaling I am talking about is the autoscaling of ECS Tasks in an ECS Service, not the EC2 instances underneath them. As you say, there are a heap of tools for controlling the scale in and out of the underlying instances, but what I'm after are similar mechanisms for the ECS services.

Imagine you have 100 "jobs" that need processing, and you run "agents" to process those jobs as ECS tasks in an ECS service which is controlled by auto-scaling the DesiredCount. The specific problem I am trying to solve is how to intelligently scale in the ECS tasks that aren't running jobs. Currently, setting DesiredCount on the ECS Service seems to basically pick Tasks at random to kill. I would like some control (like lifecycle hooks provide for EC2) to make sure that Tasks finish their work before being randomly terminated.

@pgarbe

pgarbe commented Feb 4, 2019

Ok, got it. Unfortunately, in that case, I can't help you much.

@travis-south

I have this same issue. I am using Target Tracking as my scaling policy, tracking CPU Utilization. So whenever it does a scale-in, it kills the tasks for that service even if there are clients connected to them. I would love to know if there's a way to implement some kind of lifecycle hook or a draining status so it will only kill a task when all connections are drained.

@wbingli

wbingli commented Mar 28, 2019

I think there are two things in ECS which can help with connection/job draining before the ECS task is stopped.

  • ELB connection draining: If the ECS service is attached to an ELB target group, ECS will ensure the target is drained in the ELB before stopping the task.
  • Task stopTimeout: ECS won't hard-kill the container directly. Instead, it will send a stop signal and wait for a configurable amount of time before forcefully killing it. The application can gracefully drain in-flight jobs during the shutdown process (a sketch follows below).

Are they able to handle your case? @lox @travis-south
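
To make the second point concrete, a minimal sketch of a worker reacting to the stop signal so it can drain before the timeout expires (the job-polling logic is a placeholder):

```python
import signal
import sys
import time

shutting_down = False

def handle_sigterm(signum, frame):
    # ECS sends SIGTERM first and waits up to stopTimeout seconds before SIGKILL.
    global shutting_down
    shutting_down = True

signal.signal(signal.SIGTERM, handle_sigterm)

while not shutting_down:
    # poll_for_job()/run_job() would go here; sleep stands in for idle waiting.
    time.sleep(1)

# Finish or checkpoint any in-flight work here, then exit cleanly.
sys.exit(0)
```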

@travis-south

Thanks @wbingli. Is there an option for ELB connection draining for ALBs? I can't seem to find it.

As for the stopTimeout, I'll try this and will give feedback.

Thanks.

@lox
Author

lox commented Mar 29, 2019

Yeah, stopTimeout looks interesting for my use case too! I was in the process of moving away from Services to ad-hoc Tasks, but that might work.

@travis-south

I don't think stopTimeout is in CloudFormation yet, or am I missing something? 😃

@lox
Author

lox commented Mar 29, 2019

I certainly hadn't heard of it before!

@whereisaaron

@lox @travis-south The documentation says startTimeout and stopTimeout are only available for tasks using Fargate in us-east-2. That's pretty narrow availability! 😄

This parameter is available for tasks using the Fargate launch type in the Ohio (us-east-2) region only and the task or service requires platform version 1.3.0 or later.

@travis-south

I see, well, I think I'll resort to ECS_CONTAINER_STOP_TIMEOUT for now to test it.

@wbingli

wbingli commented Mar 29, 2019

@travis-south I think this is the document for configuring ELB connection draining: ELB Deregistration Delay. There is no need for any configuration on the ECS service side; ECS will always respect ELB target draining and stop the task once target draining has completed.

The stopTimeout feature is pretty new; it launched on Mar 7.

As for availability, it should be available in all regions if using the EC2 launch type (agent version 1.26.0+ required). The document is kind of misleading when it says "This parameter is available for tasks using the Fargate launch type in the Ohio (us-east-2) region only"; it actually means "For tasks using the Fargate launch type, it's only available in the Ohio (us-east-2) region and requires platform version 1.3.0 or later".
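
For completeness, a sketch of changing that deregistration delay with boto3 (the target group ARN is a placeholder):

```python
import boto3

elbv2 = boto3.client("elbv2")

# Adjust the deregistration delay on the target group; ECS waits for target
# draining to complete before stopping the task.
elbv2.modify_target_group_attributes(
    TargetGroupArn="arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/my-tg/0123456789abcdef",  # placeholder ARN
    Attributes=[
        {"Key": "deregistration_delay.timeout_seconds", "Value": "30"},
    ],
)
```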

@travis-south

travis-south commented Apr 1, 2019

@wbingli thanks for the explanation. I'll try this. At the moment, my deregistration delay is 10 secs; I'll try to increase this and see what happens.

@jtoberon jtoberon added the ECS (Amazon Elastic Container Service) label Apr 16, 2019
@coultn

coultn commented Apr 16, 2019

Hi everyone, the stopTimeout parameter is available for both ECS and Fargate task definitions. It controls how long the delay is between SIGTERM and SIGKILL. Additionally, if you combine this with the container ordering feature (also available on both ECS and Fargate), you can control the order of termination of your containers, and the time each container is allowed to take to shut down.

We are in the process of updating ECS/Fargate and CloudFormation docs to reflect the fact that these features are available in all regions where those services are available.
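
As a sketch, this is roughly how per-container stopTimeout and dependsOn ordering fit together in a task definition registered via boto3 (family, container names, and images are placeholders):

```python
import boto3

ecs = boto3.client("ecs")

# Per-container stopTimeout plus dependsOn ordering: the sidecar starts only
# after the app container has started, and each container gets its own
# SIGTERM-to-SIGKILL budget on shutdown.
ecs.register_task_definition(
    family="ordered-shutdown-example",            # hypothetical family name
    requiresCompatibilities=["EC2"],
    containerDefinitions=[
        {
            "name": "app",
            "image": "example/app:latest",        # hypothetical image
            "memory": 512,
            "essential": True,
            "stopTimeout": 120,
        },
        {
            "name": "sidecar",
            "image": "example/sidecar:latest",    # hypothetical image
            "memory": 128,
            "essential": False,
            "stopTimeout": 30,
            "dependsOn": [{"containerName": "app", "condition": "START"}],
        },
    ],
)
```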

@lox
Author

lox commented Apr 16, 2019

How would one disable SIGKILL entirely @coultn? Sometimes tasks might take 30+ minutes to finish.

@coultn

coultn commented Apr 16, 2019

You can't disable SIGKILL entirely, but you can set the value to a very large number (on ECS; there is a 2 minute limit on Fargate).

@travis-south

I tried increasing my deregistration delay to 100 secs and it made things worse for my case. I receive a lot of 5xx errors during deployments.

@coultn

coultn commented Apr 19, 2019

update: the stop timeout parameter is now documented in CloudFormation (see release history here: https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/ReleaseHistory.html). However, the original issue is about controlling which tasks get selected for termination when a service is scaling in due to a scaling policy action. Per-container stop timeouts can help with that but won't provide a complete solution.

@lox
Author

lox commented Apr 19, 2019

This basically brings things up to parity with lifecycle hooks on EC2, so I'd say this pretty much addresses my original issue. Happy to close this out, thanks for your help @coultn.

@ajenkins-cargometrics

The stop timeout does provide an ECS equivalent for EC2's termination lifecycle hook. However ECS is still missing an equivalent of EC2's instance protection, which would allow solving exactly the problem in this issue's title.

Using EC2 instance protection, you can mark some instances in an autoscaling group as protected from scale in. When scaling in, EC2 will only consider instances without scaling protection enabled for termination. By manipulating the instance protection flag, an application can control exactly which EC2 instances are terminated during scale-in. If ECS would add an equivalent "task protection" flag for ECS tasks, problems like the one @lox described would have a straightforward solution. You'd simply set the protection flag to on for tasks that are busy, and turn it off when a task is idle. When an ECS service was told to scale in, it would only be allowed to kill tasks with protection turned off.
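
For context, this is roughly what the EC2-side mechanism looks like with boto3 today (the ASG name and instance ID are placeholders); the request here is for an equivalent per-task flag in ECS:

```python
import boto3

autoscaling = boto3.client("autoscaling")

# Mark a busy instance as protected; the ASG then only considers unprotected
# instances when it scales in.
autoscaling.set_instance_protection(
    AutoScalingGroupName="my-worker-asg",         # hypothetical ASG name
    InstanceIds=["i-0123456789abcdef0"],          # placeholder instance ID
    ProtectedFromScaleIn=True,
)
```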

I've been wrestling with a similar problem recently, and it would be very helpful if AWS would add a "task protection" feature.

@MaerF0x0

Is there a maximum value for stopTimeout?

@shmulikd9

In my case I preferred stopTimeout to be 0 so the container will be killed immediately, but apparently the minimum value is 2 seconds.

What is the reason of not allowing less than 2 seconds values?
Where I can see documentation on the limits?

@kaushikthedeveloper

This would be a great value addition, since ECS-EC2 tasks are usually run for processes that need to be always up and running. And in scenarios where the process cannot be stopped for an hour or more because of work still running when SIGTERM is sent, this can mean incomplete tasks. It would be wonderful to see a managed solution to this, instead of us having to build an entire architecture around it and maintain the lifecycle of the process ourselves.

@Zogoo

Zogoo commented Oct 5, 2020

@ajenkins-cargometrics I 100% agree with your suggestion; ECS tasks for a job should be able to enable "task protection".
Can I ask how you would use that "instance protection" for EC2, given that in ECS you cannot know where the task will be placed?
Or are you enabling "instance protection" from inside the Docker container with the AWS CLI?

@Bi0max

Bi0max commented Feb 8, 2021

Is there any way to work around this?
We have a problem scaling down Celery workers, which are running on ECS Fargate. When AWS decides to shut down a container, that container can still be running a long-running task, which is then lost.
Without this feature, ECS Fargate seems quite useless for workers which run long-running jobs.

@bruce-wayne-256

Thanks for the updates. I checked some more and described my approach and remaining doubts in this Stack Overflow question - ECS choose idle tasks while scale in & deployment checkpointing for batch processing. Could someone please have a look?

@cgazit

cgazit commented Aug 24, 2022 via email

@francoislagier

We were trying to keep the containers themselves Cloud-agnostics

I agree, it works but it's not ideal since it brings AWS-specific code to my container. I'll be happy to remove the AWS-specific code once ECS supports a similar feature.

@Huffdiddy420

Huffdiddy420 commented Aug 25, 2022 via email

@ArunGitOps

We ran into the same issue. Can we expect this "termination protection flag" feature for ECS tasks to be available before the end of this year?

@AbhishekNautiyal

AbhishekNautiyal commented Sep 15, 2022

Hi All,

The Amazon ECS team is actively working on delivering this feature, and we appreciate your patience. I wanted to share a little more detail about the solution for your feedback -

As mentioned previously, the way it would work is that you could set a scale-in protection flag on tasks that require protection along with an optional expiration time for the protection. The scale-in protection flag would prevent your tasks from being terminated during scale-in events from service autoscaling or deployments. You would be able to modify the flag using new Amazon ECS APIs (Get/UpdateTaskProtection) or through a new endpoint from within the container (similar to the task metadata endpoint https://docs.aws.amazon.com/AmazonECS/latest/developerguide/task-metadata-endpoint-v4.html). These options provide you the flexibility to modify the flag from outside the boundary of a container or from within a container - to mark its task as protected - respectively.

The scale-in protection flag will have a default expiration of 8 hours, which you can optionally set to up to 48 hours. The expiration helps ensure that your tasks are not inadvertently left in protected status when something goes awry and are cleaned up automatically. Note that invoking the API again will always reset/extend the protection expiration.

Let us know if this sounds good, or if you have any feedback. We're especially curious to learn about the time duration requirement for keeping your tasks in the protected state.

Edit: The protection expiration will be modifiable to any value up to 48 hours. It'll be an integer - expiresInMinutes - with a max value of 2880 (48h) and a default of 480 (8h)
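
To illustrate, a rough sketch of how setting and clearing the flag could look with boto3 once the API is available (cluster name and task ARN are placeholders):

```python
import boto3

ecs = boto3.client("ecs")

CLUSTER = "my-cluster"                                                   # hypothetical cluster
TASK_ARN = "arn:aws:ecs:us-east-1:123456789012:task/my-cluster/abc123"  # placeholder task ARN

# Protect the task while it is busy; renew before the expiration if the work
# runs long.
ecs.update_task_protection(
    cluster=CLUSTER,
    tasks=[TASK_ARN],
    protectionEnabled=True,
    expiresInMinutes=60,
)

# ...and clear the flag once the task is idle again.
ecs.update_task_protection(
    cluster=CLUSTER,
    tasks=[TASK_ARN],
    protectionEnabled=False,
)
```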

@skeggse

skeggse commented Sep 15, 2022

The expiration time functionality would work great for our use-case. Generally we have tasks that start in an idle state, receive some work and process it for a bounded period of time, and then self-terminate. We'd like to be able to have the pool of tasks scale up and down depending on the volume of pending work, but need to be able to do the scale-down without interrupting the tasks that are busy.

@guss77

guss77 commented Sep 15, 2022

It would be great if the expiration time could be set to less than 8 hours - I'd rather have our tasks set a short timeout (maybe 5 minutes) and renew it if needed.

@AbhishekNautiyal

It would be great if the expiration time could be set to less than 8 hours - I'd rather have our tasks set a short timeout (maybe 5 minutes) and renew it if needed.

Please see the edit. You can set it to 5 minutes and renew it repeatedly.

@abdulloooh

abdulloooh commented Sep 15, 2022

This is a great way to go about it; it covers all the bases. How soon should we expect this? @AbhishekNautiyal

@HughVolpe

@AbhishekNautiyal that solution is great for my use case.

I am running queue workers (a single worker thread per container) that pick up jobs with peaky load. The jobs can take around 40 minutes, with no practical way of being broken down or having their state saved. I wanted to scale up and down based on queue depth, but without this there was no way to do it.

If I had this feature I would expect to protect the container for around an hour each time a job is picked up and release it when the job is complete.

@joe-shad

joe-shad commented Oct 5, 2022

@AbhishekNautiyal This solution might help our use-case as well

@cgazit

cgazit commented Oct 11, 2022 via email

@jimmymic

@AbhishekNautiyal you told us it was being actively built 10 months ago. Is it actually being built?
What is the actual timeframe we can expect to see this feature being delivered?

@AbhishekNautiyal

I want to have a clarification. You said:

You would be able to modify the flag using new Amazon ECS APIs (Get/UpdateTaskProtection) or through a new endpoint from within the container (similar to the task metadata endpoint)

I want to stay as generic as possible, and therefore would like to avoid AWS APIs in my container. The use of a new endpoint from within the container would be ideal. However, reading the provided URL, it seems those stats are read-only. So, to verify my understanding: there will be some environment variable inside the container, let's call it Scale-inProtectionFlag. Initially, when the container starts, I want my job to run, so I will have it set (by default) to Scale-inProtectionFlag=true. When my job is done, I'll set Scale-inProtectionFlag=false, and now this container is fair game to be killed. If my task was not killed (it was idle and just listening on a queue) and it starts processing again, it will set Scale-inProtectionFlag=true and be safe again (and when done will set Scale-inProtectionFlag=false). This is a very simplified scenario. Did I get it right? Thanks, Carmi

@cgazit Your understanding is correct. We will be releasing a new environment variable for mutating task protection status (the current task metadata URL is and will continue to be read-only).
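
To illustrate, a rough sketch of the container-side flow, assuming the agent injects an ECS_AGENT_URI environment variable and exposes a task-protection state path under it (both are assumptions here; check the eventual ECS documentation for the exact names):

```python
import os
import requests

# The agent URI is assumed to be injected into the container environment; the
# task marks itself protected by PUTting the desired state to the agent.
AGENT_URI = os.environ["ECS_AGENT_URI"]

def set_task_protection(enabled, expires_in_minutes=60):
    body = {"ProtectionEnabled": enabled}
    if enabled:
        body["ExpiresInMinutes"] = expires_in_minutes
    response = requests.put(f"{AGENT_URI}/task-protection/v1/state", json=body, timeout=5)
    response.raise_for_status()

set_task_protection(True)    # mark the task as protected before starting a job
# ... run the job ...
set_task_protection(False)   # fair game for scale-in again
```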

@jimmymic Appreciate your patience. While we cannot commit to a launch date, I can confirm that this feature is in the works to be released fairly soon and have updated the status likewise. Please feel free to reach out to me at nautiya[at]amazon[dot]com for more details.

@agk23

agk23 commented Nov 3, 2022

Hi Abhishek,

Thanks for the update and letting me know via email that this is a few weeks out from release. As requested, here is our use case.

We have a Fargate service that scales up/down based on the number of messages in a SQS Queue. However, each task can take several minutes to process (and potentially longer in the future), so we didn't want to go the Lambda approach. Our use case is that we don't want a task that is actively running a job to die in the middle of processing it, because we'd have to restart the job from scratch, which is not ideal.

What we'd like to do is, when our ECS task gets a new message from SQS, set the termination protection flag to true, and then set it back to false when it's done processing (sketched below).

Thanks again.
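
For what it's worth, here is the shape of the loop I have in mind, as a sketch only (queue URL, cluster name, and how the task discovers its own ARN are placeholders):

```python
import os
import boto3

sqs = boto3.client("sqs")
ecs = boto3.client("ecs")

QUEUE_URL = os.environ["QUEUE_URL"]    # hypothetical configuration
CLUSTER = os.environ["ECS_CLUSTER"]    # hypothetical configuration
TASK_ARN = os.environ["ECS_TASK_ARN"]  # placeholder; in practice discovered via task metadata

def process(message):
    # Placeholder for the real several-minute job.
    pass

while True:
    messages = sqs.receive_message(
        QueueUrl=QUEUE_URL, MaxNumberOfMessages=1, WaitTimeSeconds=20
    ).get("Messages", [])
    if not messages:
        continue

    # Protect the task before starting work, release once the job is done.
    ecs.update_task_protection(
        cluster=CLUSTER, tasks=[TASK_ARN], protectionEnabled=True, expiresInMinutes=60
    )
    try:
        for message in messages:
            process(message)
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=message["ReceiptHandle"])
    finally:
        ecs.update_task_protection(cluster=CLUSTER, tasks=[TASK_ARN], protectionEnabled=False)
```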

@jlsalmon

jlsalmon commented Nov 3, 2022

I would like to add my use case here as well for posterity. We use ECS tasks to run multiplayer game server instances, and scale up the service according to the number of players looking to join a game. We would like to control which containers are terminated on scale-in to prevent tasks which are currently hosting an active game from being killed (and hence unceremoniously ending the game for connected players).

@xtagon

xtagon commented Nov 3, 2022

I have almost the exact same type of use case. Presently, scaling in has a high chance of ECS deciding to kill the queue workers that have long-running tasks instead of the ones that are idle.

@threewordphrase

I am very much looking forward to this feature. @AbhishekNautiyal do you have any rough estimate of when we might see it? If it's soon I could potentially avoid a side quest to work around it :)

@Vlaaaaaaad

Looks like this'll be coming soon: https://github.com/aws-containers/ecs-task-protection-examples was made public about 12 hours ago 👀

Reading through the example code, I have to say I love the implementation of this ❤️

@AbhishekNautiyal

We're excited to announce the launch of Amazon ECS task scale-in protection! Please see the What's new and blog posts for details.

Appreciate the engagement, feedback, and patience from everyone on this thread. We look forward to any additional comments/feedback.

@AbhishekNautiyal AbhishekNautiyal added the Shipped (This feature request was delivered) label and removed the Coming Soon label Nov 10, 2022
@agk23

agk23 commented Nov 10, 2022

I love you

@AbhishekNautiyal

Closing this issue with the aforementioned launch. Please feel free to open a new issue if you have any further feature requests/feedback.

@ashwani1000

@AbhishekNautiyal thanks a lot for this feature. I'm using it, and it protects the ECS tasks from getting terminated as promised. However, is there a way I can view which tasks are protected at any given time?

Specifically, I want to monitor for a given time range how many scale-in events were bypassed by protecting the ECS tasks.
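
For point-in-time checks I can call the GetTaskProtection API mentioned earlier, for example with this boto3 sketch (cluster and service names are placeholders), but that doesn't give me the time-range view of bypassed scale-in events I'm after:

```python
import boto3

ecs = boto3.client("ecs")

CLUSTER = "my-cluster"    # hypothetical cluster name
SERVICE = "my-service"    # hypothetical service name

# List the service's running tasks and ask which of them are currently protected.
task_arns = ecs.list_tasks(cluster=CLUSTER, serviceName=SERVICE)["taskArns"]
if task_arns:
    protection = ecs.get_task_protection(cluster=CLUSTER, tasks=task_arns)
    for entry in protection["protectedTasks"]:
        print(entry["taskArn"], entry["protectionEnabled"], entry.get("expirationDate"))
```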
