Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a lifecycle hook for instance launch #1015

Open
Niksko opened this issue Apr 28, 2022 · 3 comments
Open

Add a lifecycle hook for instance launch #1015

Niksko opened this issue Apr 28, 2022 · 3 comments

Comments

@Niksko
Copy link

Niksko commented Apr 28, 2022

Is your feature request related to a problem? Please describe.
At the moment, the agent startup process uses cfn-signal to signal back to the Cloudformation stack that the instance has been launched successfully. However cfn-signal is only useful when the stack is being updated.
If an instance is created (e.g. due to a difference between desired and actual capacity in the ASG), then the cfn-signal is not useful.

We observed a case where the instance failed to retrieve some files from S3 during bootstrap, because the instance metadata service failed to respond.
The bootstrap then ends up in this on error handler, where it again failed to contact the instance metadata service.

The result is an instance in the ASG that can't accept any jobs, and the result is a queue that doesn't autoscale (because the right number of instances is present), but also doesn't ever accept any jobs.

Describe the solution you'd like
This seems like it could be fixed by adding an Autoscaling Lifecyle Hook on instance launch.
The hook would default to terminating the instance if not completed after some amount of time, and the bootstrap script would complete the lifecycle action.

Describe alternatives you've considered
We're not sure why the instance metadata endpoint failed in our case, but we'll look in to it.
Regardless though, any failure during startup may have the same effect, as long as the instance creation is a result of scale up due to capacity mismatches and not during an update of the Cloudformation stack.

@lantrix
Copy link

lantrix commented May 4, 2022

We're not sure why the instance metadata endpoint failed in our case, but we'll look in to it.

Amazon does note that they recommended consumption of the instance meta data to have an exponential backoff strategy.

We throttle queries to the instance metadata service on a per-instance basis, and we place limits on the number of simultaneous connections from an instance to the instance metadata service ... If you are throttled while accessing the instance metadata service, retry your query with an exponential backoff strategy.

The curl call at this line (and other lines) has a --fail setting which will fail fast with no output at all on server errors. Given AWS can throttle queries to the instance metadata service on a per-instance basis, maybe curl can be set on calling the instance metadata service http://169.254.169.254 to have an exponential backoff strategy.

For example curls --retry-delay <seconds> makes curl sleep this amount of time before each retry when a transfer has failed with a transient error (it changes the default backoff time algorithm between retries).

@moskyb
Copy link
Contributor

moskyb commented May 4, 2022

hey @Niksko and @lantrix!

this seems like an awesome idea, and it'll solve a bunch of problems that buildkite users have, so go for your life if you're interested in implementing something :)

something that's on the roadmap for the BK Agent at the moment is some sort of configurable healthchecking, which ties into this idea a little bit. productionising this is a fair ways off at this point though, and i think something like what you've proposed would slot in nicely alongside some sort of more robust healthchecking system.

@josh-ross-ai
Copy link

+1

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants