Added option to enable spot instance draining #1100

Merged

Conversation

mkulke
Contributor

@mkulke mkulke commented Sep 2, 2020

Issue number:

fixes #1099

Description of changes:

Added an option to enable instance draining on Spot Instances. I used https://github.com/aws/amazon-ecs-agent/blob/a250409cf5eb4ad84a7b889023f1e4d2e274b7ab/agent/config/types.go as a reference.

Testing done:

Terms of contribution:

By submitting this pull request, I agree that this contribution is dual-licensed under the terms of both the Apache License, version 2.0, and the MIT license.

@mkulke mkulke requested review from bcressey and zmrow September 2, 2020 15:51
@zmrow zmrow requested a review from samuelkarp September 2, 2020 15:52
Contributor

@samuelkarp samuelkarp left a comment

Thanks so much for opening this pull request! We’re super excited to see the interest in the ECS variant of Bottlerocket, and adding new settings is a great way to help out! There are a few more things that will need to be done to take this PR across the finish line. I’m happy to describe them here; if you want to handle implementing them that would be awesome, but if you’d prefer to have us do it that’s also fine (though I can’t guarantee when we’ll be able to get to it). Let us know which you’d prefer!

You can take a look at this commit for inspiration, as it has most of the things you’ll want to do. Specifically:

  • Add the new setting to sources/models/src/lib.rs in the ECSSettings struct. This is how we define the settings that show up in the Bottlerocket API and hook it up with the component that can read settings from user-data.
  • Read the new setting and inject it into the generated ECSConfig struct in sources/api/ecs-settings-applier/src/ecs.rs; you can see that on lines 79-100 of the file.
  • (Not shown in the commit I linked you to) Bottlerocket settings are effectively stored in a database with a strict schema; adding new settings changes that schema. We’ll need a migration program (like in here, this is a good example) to handle adding the new setting to the schema.
  • You don’t need to change sources/models/src/aws-ecs-1/override-defaults.toml even though it's shown in the other commit; the default can remain unset.
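
As a rough illustration of the first two steps in the list above, here is a minimal, dependency-free sketch of an optional setting flowing from a settings model into the generated agent config. The type and function names (`EcsSettings`, `EcsConfig`, `build_config`, `to_json`) are hypothetical stand-ins, not the actual Bottlerocket code, which relies on its own model and serialization machinery. Only the JSON key names are taken from the config output shown later in this thread.

```rust
/// Simplified, hypothetical stand-in for the ECS settings model.
/// Every field is optional so that an unset setting stays unset.
#[derive(Debug, Default, Clone, PartialEq)]
pub struct EcsSettings {
    pub cluster: Option<String>,
    pub enable_spot_instance_draining: Option<bool>,
}

/// Simplified, hypothetical stand-in for the generated agent config.
#[derive(Debug, Default, Clone, PartialEq)]
pub struct EcsConfig {
    pub cluster: Option<String>,
    pub spot_instance_draining_enabled: Option<bool>,
}

/// Copy the user-facing settings into the agent config.
pub fn build_config(settings: &EcsSettings) -> EcsConfig {
    EcsConfig {
        cluster: settings.cluster.clone(),
        spot_instance_draining_enabled: settings.enable_spot_instance_draining,
    }
}

/// Render the config as compact JSON, skipping unset fields.
/// Hand-rolled to keep the sketch dependency-free; no string escaping is done.
pub fn to_json(config: &EcsConfig) -> String {
    let mut fields = Vec::new();
    if let Some(cluster) = &config.cluster {
        fields.push(format!("\"Cluster\":\"{}\"", cluster));
    }
    if let Some(enabled) = config.spot_instance_draining_enabled {
        fields.push(format!("\"SpotInstanceDrainingEnabled\":{}", enabled));
    }
    format!("{{{}}}", fields.join(","))
}
```

Keeping the field an `Option<bool>` mirrors the last bullet: when no default is supplied, the key is simply omitted from the rendered config.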

@mkulke mkulke force-pushed the ecs-spot-instance-draining-option branch from f6229eb to b57a735 Compare September 2, 2020 19:44
@mkulke
Contributor Author

mkulke commented Sep 2, 2020

@samuelkarp Thanks a ton for the kind words, the very comprehensive review, and the concrete pointers! I suppose I missed a layer of indirection; things make more sense now. I think I covered the points you listed, but I'm not sure about the project's versioning scheme: I just put the migration in a v1.0.1 folder.

@mkulke mkulke requested a review from samuelkarp September 2, 2020 19:50
@mkulke mkulke force-pushed the ecs-spot-instance-draining-option branch from b57a735 to a9a4f9f Compare September 2, 2020 19:53
Contributor

@samuelkarp samuelkarp left a comment

Thanks so much, this looks super good! The last step I think is just for us to test and make sure the following scenarios are all working properly:

  • upgrade from 1.0.0 and make sure the new setting is now available in the API
  • downgrade from 1.0.1 to 1.0.0 and make sure the new setting is removed
  • when enabling spot instance draining through the API, make sure the right codepath in the agent is executing

This is work that we can take care of; staging updates in a repo to execute upgrade-downgrade testing is not super trivial yet. But if you're able to build an AMI (cargo make && cargo make ami) you can test whether the spot instance draining feature functions on a spot instance.
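
To make the upgrade and downgrade expectations above concrete, here is a toy model of a migration that adds one optional setting. It assumes a flat key/value datastore and a hypothetical key name (`settings.ecs.enable-spot-instance-draining`); the real migration in this PR is built on Bottlerocket's own migration helpers, which this sketch does not reproduce.

```rust
use std::collections::HashMap;

/// Hypothetical key for the new setting in a flat key/value datastore.
pub const SETTING_KEY: &str = "settings.ecs.enable-spot-instance-draining";

/// Toy datastore: setting key -> serialized value.
pub type Datastore = HashMap<String, String>;

/// Upgrade: the older version never wrote the key, so nothing needs to change.
/// The new version's schema simply starts accepting it.
pub fn migrate_forward(datastore: Datastore) -> Datastore {
    datastore
}

/// Downgrade: drop the key so the older schema, which does not know it,
/// still validates.
pub fn migrate_backward(mut datastore: Datastore) -> Datastore {
    datastore.remove(SETTING_KEY);
    datastore
}
```

In this model the downgrade is the only direction that changes data, which is why the second scenario in the list checks that the setting has been removed.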

@samuelkarp samuelkarp added area/ecs ECS type/enhancement New feature or request labels Sep 3, 2020
@mkulke
Contributor Author

mkulke commented Sep 3, 2020

@samuelkarp

Spot termination notices are hard to test; I'm not sure I can trigger such a notification myself. However, I built an AMI and started an EC2 instance with this user data:

[settings.host-containers.admin]
enabled = true

[settings.ecs]
cluster = "redacted"
enable-spot-instance-draining = true

...and it registered successfully as a cluster instance, so the agent works in principle. The agent's introspection API does not seem to expose configuration details (possibly because they might contain secrets):

[ec2-user@ip-redacted ~]$ curl -s http://localhost:51678/v1/metadata
{"Cluster":"redacted","ContainerInstanceArn":"arn:aws:ecs:eu-west-1:redacted","Version":"Amazon ECS Agent - v1.43.0 (1ebf0604)"}

However, I checked whether the parameter is written to the config, and that seems to be the case:

[ec2-user@ip-redacted ~]$ sudo sheltie
bash-5.0# cat /etc/ecs/ecs.config.json
{"Cluster":"redacted","InstanceAttributes":{"bottlerocket.version":"1.0.0","bottlerocket.variant":"aws-ecs-1"},"PrivilegedDisabled":true,"AvailableLoggingDrivers":["json-file","awslogs","none"],"TaskIAMRoleEnabled":true,"TaskIAMRoleEnabledForNetworkHost":true,"SELinuxCapable":true,"OverrideAWSLogsExecutionRole":true,"SpotInstanceDrainingEnabled":true}

@mkulke mkulke requested a review from samuelkarp September 3, 2020 15:36
@samuelkarp
Contributor

> Spot termination notices are hard to test, I'm not sure I can trigger such a notification myself.

I didn't know this until today, but Spot blocks allow specifying a duration for your instance to run, and termination notices are delivered at the end of the block. I was able to test an AMI based on your code, and correctly saw the ECS agent flip the container instance's state to DRAINING roughly two minutes before it was terminated.
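
For readers unfamiliar with the mechanism, the behavior observed above can be pictured roughly like this. EC2 publishes a pending Spot interruption in instance metadata under `spot/instance-action` as a small JSON document with an `action` and a `time`. The sketch assumes the agent polls that document and, when the setting is enabled, moves the container instance to DRAINING. The names `InstanceState` and `next_state` are hypothetical, the sketch only looks at the `terminate` action, and a substring check stands in for real JSON parsing.

```rust
/// Container instance states relevant to this sketch.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum InstanceState {
    Active,
    Draining,
}

/// Decide the next state from one poll of the interruption notice.
/// `notice` is the response body, or `None` when no notice is pending.
pub fn next_state(draining_enabled: bool, notice: Option<&str>) -> InstanceState {
    match notice {
        // Naive substring check; a real implementation would parse the JSON.
        Some(body) if draining_enabled && body.contains("\"terminate\"") => {
            InstanceState::Draining
        }
        // No notice, draining disabled, or an action this sketch ignores.
        _ => InstanceState::Active,
    }
}
```

With the setting disabled, the same notice leaves the instance in its current state, which is the behavior before this PR.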

I haven't had a chance to run the other test I want, which is upgrade-downgrade to check the migration. (I'm pretty confident that it'll work, but I still want to run the test.)

The other thing that needs to be taken care of (and we can take care of this if you don't want to) is that the migration needs to be moved to 1.0.2 since we've released 1.0.1.

@mkulke
Contributor Author

mkulke commented Sep 13, 2020

> I didn't know this until today, but Spot blocks allow specifying a duration for your instance to run, and termination notices are delivered at the end of the block. I was able to test an AMI based on your code, and correctly saw the ECS agent flip the container instance's state to DRAINING roughly two minutes before it was terminated.

Ah, wasn't aware, either. This will be helpful for testing spot tools.

> I haven't had a chance to run the other test I want, which is upgrade-downgrade to check the migration. (I'm pretty confident that it'll work, but I still want to run the test.)

👍

> The other thing that needs to be taken care of (and we can take care of this if you don't want to) is that the migration needs to be moved to 1.0.2 since we've released 1.0.1.

will do

@mkulke mkulke force-pushed the ecs-spot-instance-draining-option branch from 6163303 to 3f5e68c Compare September 13, 2020 11:56
@samuelkarp
Contributor

I was able to test upgrade-downgrade and exercise the migration successfully, but there's one change I had to make: the Release.toml file needs to list the migration in a new line at the end like this:

"(1.0.1, 1.0.2)" = ["migrate_v1.0.2_add-enable-spot-instance-draining.lz4"]

Can you add that to the PR? I'll be happy to approve it after. 😄
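
The shape of the Release.toml entry quoted above can be reproduced with a small helper, shown here only to spell out the naming pattern: a `(from, to)` version pair maps to a list of `migrate_v<version>_<name>.lz4` files. The function names are hypothetical and the pattern is inferred from that single entry, so treat it as a sketch rather than a description of the build tooling.

```rust
/// Build the file name of a migration, following the pattern visible in the
/// quoted entry (inferred from that one example).
pub fn migration_file(version: &str, name: &str) -> String {
    format!("migrate_v{}_{}.lz4", version, name)
}

/// Build one Release.toml line mapping a (from, to) version pair to the
/// migrations that run between those versions.
pub fn release_entry(from: &str, to: &str, names: &[&str]) -> String {
    let files: Vec<String> = names
        .iter()
        .map(|name| format!("\"{}\"", migration_file(to, name)))
        .collect();
    format!("\"({}, {})\" = [{}]", from, to, files.join(", "))
}
```

Calling `release_entry("1.0.1", "1.0.2", &["add-enable-spot-instance-draining"])` yields the same line as the one requested for this PR.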

@mkulke mkulke force-pushed the ecs-spot-instance-draining-option branch from c5902be to 41c04bc Compare September 17, 2020 20:28
Contributor

@samuelkarp samuelkarp left a comment

Thank you!

Contributor

@zmrow zmrow left a comment

🦄

Thanks!

@samuelkarp samuelkarp requested a review from tjkirch September 17, 2020 20:54
Contributor

@tjkirch tjkirch left a comment

Thanks!

@samuelkarp
Contributor

Hi @mkulke! Would you mind squashing your commits into a single commit? Then we can merge from there.

(If we don't hear back from you in a couple days, we'll take care of the squash and merge.)

Co-authored-by: Samuel Karp <samuelkarp@users.noreply.github.com>
@mkulke mkulke force-pushed the ecs-spot-instance-draining-option branch from 260b7b3 to 58493ad Compare September 24, 2020 12:05
@samuelkarp samuelkarp merged commit 17c1e29 into bottlerocket-os:develop Sep 24, 2020
Development

Successfully merging this pull request may close these issues.

[ECS] SpotInstanceDraining Agent Configuration
5 participants