Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[JENKINS-57795] Retry when waking up orphans or creating new nodes #399

Merged
merged 4 commits into from
Oct 22, 2019

Conversation

sdhoka
Copy link
Contributor

@sdhoka sdhoka commented Sep 19, 2019

Root Cause Analysis:
The use of threadpool at times causes some threads to return old values of instance state when making the getState() call. I was able to replicate similar thing using a ruby script (preferred language) that started a stopped instance and used about 10 threads to poll the state method, few threads still returned stopped for a millisecond interval although the instance had already moved to pending. This causes some instances to be started on AWS but are never connected and also, never stopped/terminated
Solution:

  1. Add a retry logic on receiving stopped state from the getstate() method. The thread will wait for 5 seconds before retrying. I have been testing this fix in our internal environment and a retry attempt of 1 seems to work perfectly. The retry count can be increased but that can keep the thread waiting for long.
  2. Remove the uptime < idleMilliseconds check from EC2RetentionStrategy.java. For planned stopped nodes that are started but not connected, idleMilliseconds will be counted from the time even before when it was last stopped and will be always greater than uptime. This prevents such faulty nodes from being terminated and also incurs AWS costs. The condition is ideally true for instances in the launching phase that haven't been connected yet. However, this is already taken care of by the condition here which protects termination of launching nodes from getting terminated before 30 minutes (STARTUP_TIMEOUT)

@sdhoka
Copy link
Contributor Author

sdhoka commented Sep 19, 2019

@res0nance
Sending this PR as discussed, I'd appreciate if you can review.
Thanks,
Shubham

@@ -164,11 +164,6 @@ private long internalCheck(EC2Computer computer) {

final long idleMilliseconds = this.clock.millis() - computer.getIdleStartMilliseconds();

//Stopped instance restarted and Idletime has not be reset
if ( uptime < idleMilliseconds) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was added to prevent computers with stale uptimes from terminating.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes but this PR should have fixed the stale uptimes, right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Much like this PR that number could possibly be stale due to eventual consistency

Copy link
Contributor Author

@sdhoka sdhoka Sep 20, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure I understand, I am under the impression that #359 should permanently fix the stale uptime issue in EC2RetentionStrategy.java since getstate() will get the latest info from AWS and refresh uptime value. Could you please elaborate on the concern?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not know how the EC2 API behaves in this case. Previously we had issues of stale uptimes causing nodes to terminate but those nodes were just spun but did throw the exception seen in this PR

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried out an experiment on this scenario and was able to confirm that uptime < idleMilliseconds can be reproduced as follows

  1. Consider a node that was stopped by ec2-plugin on idle timeout after finishing the build
  2. For this node , uptime > idleMilliseconds since uptime is counted from the time when the instance was last started but idle time is calculated after the connection process and build completes.
  3. Now, manually start this stopped instance. Since ec2-plugin is not aware of this launch, ec2-plugin will never connect to the node and it will not come online. Idle time here will represent the time since the node got idle before it was stopped by the plugin while uptime will be the time since you just manually launched this machine on AWS. Thus, uptime < idleMilliseconds. The presence of this condition in EC2RetentionStrategy.java basically prevents the ec2-plugin from ever shutting down that manually started instance unless you manually initiate the connection procedure which in turn resets the idle time.
  4. A similar issue occurs in JENKINS-57795 where ec2-plugin starts the instance but never connects to it, this causes idlemilliseconds to never reset
  5. Use the following script in the Jenkins script console to try this out
package hudson.plugins.ec2;
import java.util.concurrent.TimeUnit;

import org.apache.commons.lang.math.NumberUtils;

Jenkins jenkins = Jenkins.getInstance()
int STARTUP_TIME_DEFAULT_VALUE = 30;
int STARTUP_TIMEOUT = NumberUtils.toInt(System.getProperty(EC2RetentionStrategy.class.getCanonicalName() + ".startupTimeout",String.valueOf(STARTUP_TIME_DEFAULT_VALUE)), STARTUP_TIME_DEFAULT_VALUE)

EC2Computer computer = jenkins.getComputer("Your node name in jenkins")

EC2RetentionStrategy ret = new EC2RetentionStrategy('-5')

idleTerminationMinutes = ret.idleTerminationMinutes

println computer.isIdle()
println computer.getState();
println computer.isOffline()

long uptime = computer.getUptime()
long idleMilliseconds = System.currentTimeMillis() - computer.getIdleStartMilliseconds()

println uptime < idleMilliseconds

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In light of this new discovery would you re-add that check?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, I meant to explain that removing that check is necessary otherwise it will prevent some slaves from getting stopped even after idle timeout

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I misunderstood

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that was introduced because the startup time can be longer than idle

@@ -642,6 +642,8 @@ else if (jenkinsInstance.isTerminating()) {
private PlannedNode createPlannedNode(final SlaveTemplate t, final EC2AbstractSlave slave) {
return new PlannedNode(t.getDisplayName(),
Computer.threadPoolForRemoting.submit(new Callable<Node>() {
int retryCount = 0;
int DESCRIBE_LIMIT = 1;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There probably should be more than 1 retry and it should be final

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for catching that, just fixed it and increased the count to 2

@res0nance res0nance changed the title Retry when waking up orphans or creating new nodes [JENKINS-57795] Retry when waking up orphans or creating new nodes Sep 19, 2019
@res0nance res0nance closed this Oct 17, 2019
@res0nance res0nance reopened this Oct 17, 2019
@res0nance res0nance requested review from jvz and thoulen October 17, 2019 08:21
Copy link
Member

@jvz jvz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know enough about how this feature works to offer a useful review.

Copy link
Contributor

@fcojfernandez fcojfernandez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems reasonable.

@res0nance res0nance merged commit 12b1d27 into jenkinsci:master Oct 22, 2019
@mirzmaster
Copy link

Thank you @sdhoka & @res0nance for your efforts towards resolving this issue, which I believe we've been running into as well in our environment. When can we expect to see this in a release?

@sdhoka
Copy link
Contributor Author

sdhoka commented Oct 22, 2019

@res0nance
Thanks for getting this merged. Do you think that this fix might also be applicable for JENKINS-56036? Not sure if spot instances use the same logic

@res0nance
Copy link
Contributor

It should be that as well

@res0nance
Copy link
Contributor

@mirzmaster I'm hoping to release this week

@chkelly
Copy link

chkelly commented Nov 7, 2019

Appreciate all the work put into this PR as we believe this will solve issues properly launching build slaves with our Jenkins installs.

Is there any sense of when the next release will be cut?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
8 participants