AWS Lambda function to terminate ECS instances with stopped/failed agents.
The Amazon EC2 Container Service runs an agent as a Docker container on each EC2 instance registered in an ECS cluster. Unfortunately, on occasion this agent crashes, and this causes other services running on that instance to be inaccessible.
This doesn't have to be a big deal, though. In the world of micro-services we ought to expect failure and plan for automatically dealing with it.
This AWS Lambda function checks every instance in a given ECS cluster and
verifies its agent connection status. Any instance whose connection status is
false will be deregistered from its AWS Auto Scaling group and then terminated.
AWS automatically replaces each instance deregistered from its Auto Scaling group with an identical, healthy new instance.
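The check-and-replace flow described above can be sketched roughly as follows. This is an illustrative sketch, not the function's actual source: the helper names `find_disconnected_instances` and `replace_instance` are hypothetical, the clients are assumed to be boto3 clients passed in by the caller, and using `detach_instances` for the deregistration step is an assumption. Only the first page of container instances is read, so a large cluster would also need pagination.

```python
def find_disconnected_instances(ecs, cluster):
    """Return EC2 instance IDs in `cluster` whose ECS agent is disconnected.

    `ecs` is assumed to be a boto3 ECS client (or an object with the same
    methods). Only the first page of results is inspected in this sketch.
    """
    arns = ecs.list_container_instances(cluster=cluster)["containerInstanceArns"]
    if not arns:
        return []
    described = ecs.describe_container_instances(
        cluster=cluster, containerInstances=arns
    )["containerInstances"]
    return [ci["ec2InstanceId"] for ci in described if not ci["agentConnected"]]


def replace_instance(autoscaling, ec2, instance_id):
    """Detach one instance from its Auto Scaling group, then terminate it.

    `autoscaling` and `ec2` are assumed to be boto3 clients.
    """
    found = autoscaling.describe_auto_scaling_instances(InstanceIds=[instance_id])
    for asg_instance in found["AutoScalingInstances"]:
        autoscaling.detach_instances(
            InstanceIds=[instance_id],
            AutoScalingGroupName=asg_instance["AutoScalingGroupName"],
            # Keep the desired capacity so the group launches a replacement.
            ShouldDecrementDesiredCapacity=False,
        )
    ec2.terminate_instances(InstanceIds=[instance_id])
```

In a real invocation the clients would come from `boto3.client("ecs")`, `boto3.client("autoscaling")` and `boto3.client("ec2")`. Passing them in as arguments keeps the logic easy to exercise with stand-in objects.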
We use Codeship to build AWS Lambda deployment packages; you can download the latest stable version here.
You will need to create a new Lambda function. Configuration parameters, both recommended and required:
- Name: ecs-agent-monitor
- Runtime: Python 2.7
- Handler: ecs-agent-monitor.main
- Role: (see below)
- Description: Terminate ECS Instances with stopped agents
- Memory (MB): 128
- Timeout: 20 sec
Load the function code, either directly from the zipped deployment package, by pasting in the S3 URL to that package, or by building your own package from the source.
Make sure the package contains redis. To install the dependencies into the package directory, use pip install -r requirements.txt -t path/to/package
You will need to create a new IAM role for this Lambda function to assume,
so that it has the necessary permissions to access ECS clusters and to
deregister and terminate instances. See sample_policy.json for an IAM
permissions policy that could be applied to this role.
It is also necessary to configure the role's trust relationships, in order to
allow the Lambda function to assume it when run. See sample_trust.json for an
IAM trust policy that should be applied to enable this.
Once you have created this role, configure the Lambda function to assume it (see above).
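For orientation, a trust policy that lets the Lambda service assume a role usually has the shape shown below. This is a sketch of the standard form, not a copy of sample_trust.json, which may differ. The helper `create_lambda_role` is hypothetical, and `iam` is assumed to be a boto3 IAM client.

```python
import json

# Standard trust policy allowing the Lambda service to assume a role.
# Illustrative only: the repository's sample_trust.json may differ.
LAMBDA_TRUST_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "lambda.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}


def create_lambda_role(iam, role_name):
    """Create a role that Lambda can assume; `iam` is a boto3 IAM client."""
    return iam.create_role(
        RoleName=role_name,
        AssumeRolePolicyDocument=json.dumps(LAMBDA_TRUST_POLICY),
    )
```

The permissions policy (see sample_policy.json) still has to be attached to the role separately. The trust policy only controls who may assume it.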
This function requires a connection to a Firebase database. Please set up a database beforehand and pass in the required configuration.
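import pyrebase  # assumed client library; the sign-in call below matches Pyrebase's API

config = {
    "apiKey": "apiKey",
    "authDomain": "projectId.firebaseapp.com",
    "databaseURL": "https://databaseName.firebaseio.com",
    "storageBucket": "projectId.appspot.com",
}
firebase = pyrebase.initialize_app(config)
auth = firebase.auth()
# email and password are the Firebase user's credentials
user = auth.sign_in_with_email_and_password(email, password)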
This function is controlled by the JSON event variable passed when it is invoked. It expects something like this:
{
"cluster": "default",
"snsLogArn": "arn:aws:sns:region:account-id:topicname",
"apiKey": "apiKey",
"authDomain": "projectId.firebaseapp.com",
"databaseURL": "https://databaseName.firebaseio.com",
"storageBucket": "projectId.appspot.com",
"firebaseEmail": "user@example.com",
"firebasePassword": "password",
"fail_after": 3
}
It looks in the event for nine keys:
- cluster: the ECS cluster to scan for stopped agents
- snsLogArn: (optional) ARN of an AWS SNS topic
- apiKey: API key for Firebase
- authDomain: auth domain for Firebase
- databaseURL: database URL for Firebase
- storageBucket: storage bucket for Firebase
- firebaseEmail: email of the Firebase user
- firebasePassword: password of the Firebase user
- fail_after: the number of failures needed to terminate an instance
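A handler could read these keys along the following lines. This is a minimal sketch under two assumptions: every key except snsLogArn is treated as required, and an instance is terminated once its failure count reaches fail_after. The names `parse_event` and `should_terminate` are hypothetical and not taken from the function's source.

```python
# Keys assumed to be mandatory; only snsLogArn is documented as optional.
REQUIRED_KEYS = (
    "cluster", "apiKey", "authDomain", "databaseURL", "storageBucket",
    "firebaseEmail", "firebasePassword", "fail_after",
)
FIREBASE_CONFIG_KEYS = ("apiKey", "authDomain", "databaseURL", "storageBucket")


def parse_event(event):
    """Validate the invocation event and group its values by purpose."""
    missing = [key for key in REQUIRED_KEYS if key not in event]
    if missing:
        raise ValueError("event is missing keys: {}".format(", ".join(missing)))
    return {
        "cluster": event["cluster"],
        "sns_log_arn": event.get("snsLogArn"),  # None when not supplied
        "fail_after": int(event["fail_after"]),
        "firebase_config": {key: event[key] for key in FIREBASE_CONFIG_KEYS},
        "firebase_email": event["firebaseEmail"],
        "firebase_password": event["firebasePassword"],
    }


def should_terminate(failure_count, fail_after):
    """Assumed rule: terminate once the recorded failures reach fail_after."""
    return failure_count >= fail_after
```

Failing early on a missing key gives a clear error in the Lambda logs instead of a KeyError partway through a scan.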
If snsLogArn is available, the function will send a formatted information
message to that SNS topic whenever it terminates EC2 instances. You can then
add subscriptions to that topic to generate email or text notifications.
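Publishing such a notification might look like the sketch below. The message wording, the subject line and the helper name `notify_termination` are illustrative assumptions, and `sns` is assumed to be a boto3 SNS client.

```python
def notify_termination(sns, topic_arn, cluster, instance_ids):
    """Publish a summary of terminated instances; skip if nothing to report.

    Returns the publish response, or None when no topic is configured or
    no instances were terminated.
    """
    if not topic_arn or not instance_ids:
        return None
    message = (
        "Terminated {count} instance(s) in ECS cluster '{cluster}' "
        "with disconnected agents:\n{ids}"
    ).format(
        count=len(instance_ids),
        cluster=cluster,
        ids="\n".join(instance_ids),
    )
    return sns.publish(
        TopicArn=topic_arn,
        Subject="ECS agent monitor: instances terminated",
        Message=message,
    )
```

Returning early when `topic_arn` is missing mirrors the optional nature of snsLogArn: the scan still runs, it just does not report anywhere.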
You can invoke this function manually from the AWS Web Console or the aws
command-line tool (by passing in the necessary JSON event), but for regular
cluster scanning it is better to configure a CloudWatch rule to invoke it
on a scheduled basis. Be sure to configure the rule to pass the correct JSON
event to the function.
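A scheduled rule that passes a fixed JSON event could be set up roughly as follows. This is a sketch under assumptions: the rule name, target ID and five-minute rate are placeholders, `schedule_monitor` is a hypothetical helper, and `events` is assumed to be a boto3 CloudWatch Events client.

```python
import json


def schedule_monitor(events, function_arn, event,
                     rule_name="ecs-agent-monitor-schedule",
                     rate="rate(5 minutes)"):
    """Create a scheduled rule that invokes the function with `event`.

    Returns the ARN of the rule. The rule name and rate are placeholders.
    """
    rule = events.put_rule(
        Name=rule_name,
        ScheduleExpression=rate,
        State="ENABLED",
    )
    events.put_targets(
        Rule=rule_name,
        Targets=[{
            "Id": "ecs-agent-monitor",
            "Arn": function_arn,
            # The constant JSON passed to the function on every invocation.
            "Input": json.dumps(event),
        }],
    )
    return rule["RuleArn"]
```

The Lambda function also needs a resource permission that allows events.amazonaws.com to invoke it. The web console adds this automatically when the rule is created there, but a scripted setup has to add it explicitly.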
We welcome contributions in the form of opened issues or pull requests.
This repository uses git-flow for its VCS commit model. So, if you want to use
the most recent stable version of the function, just check out the latest tag
on the master branch.
The zipped deployment package is automatically built from the latest tag on master.