Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle DescribeInstances failures when creating new instance #414

Open
wants to merge 2 commits into
base: master
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
9 changes: 7 additions & 2 deletions src/main/java/hudson/plugins/ec2/CloudHelper.java
Original file line number Diff line number Diff line change
Expand Up @@ -16,11 +16,12 @@
final class CloudHelper {
private static final Logger LOGGER = LoggerFactory.getLogger(CloudHelper.class);

static Instance getInstanceWithRetry(String instanceId, EC2Cloud cloud) throws AmazonClientException, InterruptedException {
static Instance getInstanceWithRetry(String instanceId, EC2Cloud cloud, int timeout) throws AmazonClientException, InterruptedException {
// Sometimes even after a successful RunInstances, DescribeInstances
// returns an error for a few seconds. We do a few retries instead of
// failing instantly. See [JENKINS-15319].
for (int i = 0; i < 5; i++) {
int waitCount = timeout / 5;
for (int i = 0; i < waitCount; i++) {
try {
return getInstance(instanceId, cloud);
} catch (AmazonServiceException e) {
Expand All @@ -36,6 +37,10 @@ static Instance getInstanceWithRetry(String instanceId, EC2Cloud cloud) throws A
return getInstance(instanceId, cloud);
}

static Instance getInstanceWithRetry(String instanceId, EC2Cloud cloud) throws AmazonClientException, InterruptedException {
return getInstanceWithRetry(instanceId, cloud, 25);
}

@CheckForNull
static Instance getInstance(String instanceId, EC2Cloud cloud) throws AmazonClientException {
if (StringUtils.isEmpty(instanceId) || cloud == null)
Expand Down
22 changes: 19 additions & 3 deletions src/main/java/hudson/plugins/ec2/EC2AbstractSlave.java
Original file line number Diff line number Diff line change
Expand Up @@ -162,7 +162,19 @@ public EC2AbstractSlave(String name, String instanceId, String templateDescripti
this.amiType = amiType;
this.maxTotalUses = maxTotalUses;
readResolve();
fetchLiveInstanceData(true);
try {
// Wait up to 1 minute for the instance to show up
fetchLiveInstanceData(true, 60);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When does this get called? If its during a provision operation, having such a long timeout might not be ideal as during provisioning the queue lock is held which means during this 1 minute builds you start won't be queued up and builds can't be picked up by nodes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not totally sure what Jenkins context this is called in, but I'm pretty sure it doesn't block other builds from queuing. That said, it already blocks for up to 25 seconds. This just extends it to 60 seconds.

We could try leaving out this commit and just having the commit that tries to terminate the instance if it couldn't get the information about the instance. That would hopefully keep it from getting deadlocked.

Yet another possible fix is for the EC2 plugin to keep track of which instances it launched and actively terminate any other instances it finds in EC2 with the same details. I didn't look into that, though.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we keep the timeout to be the same and explore increasing it if it does not solve the issue completely

} catch (com.amazonaws.AmazonClientException e) {
/*
* If DescribeInstances didn't return any information about this
* instance, try to terminate it so that if it does come up later
* it doesn't affect capacity calculations.
*/
LOGGER.log(Level.WARNING, "Failed to get instance data for new instance " + getInstanceId() + ", terminating");
terminateInstance();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@thoulen also made a point that this terminateInstance call can fail if the api gets throttled

throw e;
}
}

@Deprecated
Expand Down Expand Up @@ -541,7 +553,7 @@ protected boolean isAlive(boolean force) {
* Much of the EC2 data is beyond our direct control, therefore we need to refresh it from time to time to ensure we
* reflect the reality of the instances.
*/
private void fetchLiveInstanceData(boolean force) throws AmazonClientException {
private void fetchLiveInstanceData(boolean force, int timeout) throws AmazonClientException {
/*
* If we've grabbed the data recently, don't bother getting it again unless we are forced
*/
Expand All @@ -563,7 +575,7 @@ private void fetchLiveInstanceData(boolean force) throws AmazonClientException {

Instance i = null;
try {
i = CloudHelper.getInstanceWithRetry(getInstanceId(), getCloud());
i = CloudHelper.getInstanceWithRetry(getInstanceId(), getCloud(), timeout);
} catch (InterruptedException e) {
// We'll just retry next time we test for idleness.
LOGGER.fine("InterruptedException while get " + getInstanceId()
Expand Down Expand Up @@ -594,6 +606,10 @@ private void fetchLiveInstanceData(boolean force) throws AmazonClientException {
}
}

private void fetchLiveInstanceData(boolean force) throws AmazonClientException {
fetchLiveInstanceData(force, 25);
}

/*
* Clears all existing tag data so that we can force the instance into a known state
*/
Expand Down