Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle DescribeInstances failures when creating new instance #414

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

dbnicholson
Copy link
Contributor

Sometimes when launching a new instance, DescribeInstances does not report data within the amount of time that getInstanceWithRetry waits. However, the instance is created and eventually Jenkins can't create any more nodes because we're at maximum capacity.

This PR attempts 2 fixes:

  1. Increase the timeout waiting for DescribeInstances to return data when creating a new instance

  2. If no data could be found, send a termination request for the request so it doesn't pollute the capacity accounting.

@dbnicholson
Copy link
Contributor Author

BTW, I haven't actually run or built this code. I don't have a setup to build Jenkins plugins.

@jvz
Copy link
Member

jvz commented Oct 31, 2019

Looks like you have a compiler error. You can build the plugin by running mvn install.

Sometimes DescribeInstances takes a very long time to return data for
new instances. Wait up to 1 minute in this case before assuming that the
instance is dead.
src/main/java/hudson/plugins/ec2/EC2AbstractSlave.java Outdated Show resolved Hide resolved
fetchLiveInstanceData(true);
try {
// Wait up to 1 minute for the instance to show up
fetchLiveInstanceData(true, 60);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When does this get called? If its during a provision operation, having such a long timeout might not be ideal as during provisioning the queue lock is held which means during this 1 minute builds you start won't be queued up and builds can't be picked up by nodes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not totally sure what Jenkins context this is called in, but I'm pretty sure it doesn't block other builds from queuing. That said, it already blocks for up to 25 seconds. This just extends it to 60 seconds.

We could try leaving out this commit and just having the commit that tries to terminate the instance if it couldn't get the information about the instance. That would hopefully keep it from getting deadlocked.

Yet another possible fix is for the EC2 plugin to keep track of which instances it launched and actively terminate any other instances it finds in EC2 with the same details. I didn't look into that, though.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we keep the timeout to be the same and explore increasing it if it does not solve the issue completely

If DescribeInstances did not return any data, send a termination request
for it. If it shows up later, it would be included in capacity
calculations and Jenkins could refuse to instantiate new nodes if the
maximum capacity was reached.
* it doesn't affect capacity calculations.
*/
LOGGER.log(Level.WARNING, "Failed to get instance data for new instance " + getInstanceId() + ", terminating");
terminateInstance();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@thoulen also made a point that this terminateInstance call can fail if the api gets throttled

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
4 participants