-
Notifications
You must be signed in to change notification settings - Fork 271
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add a lifecycle hook for instance launch #1015
Comments
Amazon does note that they recommended consumption of the instance meta data to have an exponential backoff strategy.
The curl call at this line (and other lines) has a For example curls |
this seems like an awesome idea, and it'll solve a bunch of problems that buildkite users have, so go for your life if you're interested in implementing something :) something that's on the roadmap for the BK Agent at the moment is some sort of configurable healthchecking, which ties into this idea a little bit. productionising this is a fair ways off at this point though, and i think something like what you've proposed would slot in nicely alongside some sort of more robust healthchecking system. |
+1 |
Is your feature request related to a problem? Please describe.
At the moment, the agent startup process uses
cfn-signal
to signal back to the Cloudformation stack that the instance has been launched successfully. Howevercfn-signal
is only useful when the stack is being updated.If an instance is created (e.g. due to a difference between desired and actual capacity in the ASG), then the
cfn-signal
is not useful.We observed a case where the instance failed to retrieve some files from S3 during bootstrap, because the instance metadata service failed to respond.
The bootstrap then ends up in this on error handler, where it again failed to contact the instance metadata service.
The result is an instance in the ASG that can't accept any jobs, and the result is a queue that doesn't autoscale (because the right number of instances is present), but also doesn't ever accept any jobs.
Describe the solution you'd like
This seems like it could be fixed by adding an Autoscaling Lifecyle Hook on instance launch.
The hook would default to terminating the instance if not completed after some amount of time, and the bootstrap script would complete the lifecycle action.
Describe alternatives you've considered
We're not sure why the instance metadata endpoint failed in our case, but we'll look in to it.
Regardless though, any failure during startup may have the same effect, as long as the instance creation is a result of scale up due to capacity mismatches and not during an update of the Cloudformation stack.
The text was updated successfully, but these errors were encountered: