[JENKINS-57795] Retry when waking up orphans or creating new nodes #399

sdhoka · 2019-09-19T00:33:10Z

Root Cause Analysis:
Because a thread pool is used, some threads occasionally get a stale instance state back from the getState() call. I was able to reproduce something similar with a Ruby script (my preferred language) that started a stopped instance and polled its state from about 10 threads: a few threads still returned stopped for a millisecond-scale interval even though the instance had already moved to pending. As a result, some instances are started on AWS but are never connected, and are never stopped or terminated either.
Solution:

Add retry logic for when getState() returns the stopped state. The thread waits 5 seconds before retrying. I have been testing this fix in our internal environment, and a single retry seems to work perfectly. The retry count can be increased, but that keeps the thread waiting longer.
Remove the uptime < idleMilliseconds check from EC2RetentionStrategy.java. For nodes that were stopped as planned and then started but never connected, idleMilliseconds is counted from before the node was last stopped, so it is always greater than uptime. The check therefore prevents such faulty nodes from being terminated, which incurs AWS costs. The condition is meant to hold for instances in the launching phase that have not connected yet. However, that case is already handled by the condition here, which protects launching nodes from being terminated before 30 minutes (STARTUP_TIMEOUT) have passed.
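
The retry described in the first item can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the class name, the awaitNonStoppedState helper and the Supplier-based state reader are hypothetical stand-ins for the plugin's real call to AWS, and the constants only mirror the values mentioned in this conversation (a 5 second wait and a retry limit of 2).

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.function.Supplier;

/** Hypothetical sketch of retrying when a stale "stopped" state is reported. */
final class StoppedStateRetrySketch {
    // Illustrative values taken from the discussion, not read from the plugin.
    static final int DESCRIBE_LIMIT = 2;
    static final long RETRY_WAIT_MILLIS = 5_000L;

    /**
     * Reads the state once, then re-reads it up to DESCRIBE_LIMIT times while it
     * is still "stopped", sleeping waitMillis between reads. Returns the last
     * state seen, which may still be "stopped" if every retry was stale.
     */
    static String awaitNonStoppedState(Supplier<String> stateReader, long waitMillis) {
        String state = stateReader.get();
        int retryCount = 0;
        while ("stopped".equals(state) && retryCount < DESCRIBE_LIMIT) {
            retryCount++;
            try {
                Thread.sleep(waitMillis);
            } catch (InterruptedException e) {
                // Preserve the interrupt flag and give up early with what we have.
                Thread.currentThread().interrupt();
                return state;
            }
            state = stateReader.get();
        }
        return state;
    }

    public static void main(String[] args) {
        // Simulated reader: the first answer is stale, the second reflects the start request.
        Iterator<String> answers = Arrays.asList("stopped", "pending").iterator();
        String state = awaitNonStoppedState(answers::next, 10L);
        System.out.println(state); // prints "pending"
    }
}
```

A bounded retry keeps the worst-case wait predictable: with these numbers a thread is held for at most DESCRIBE_LIMIT × RETRY_WAIT_MILLIS, which is the trade-off noted above about raising the retry count.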

… even after sending the start signal

sdhoka · 2019-09-19T00:36:35Z

@res0nance
Sending this PR as discussed. I'd appreciate it if you could review.
Thanks,
Shubham

res0nance · 2019-09-19T01:18:51Z

src/main/java/hudson/plugins/ec2/EC2RetentionStrategy.java

@@ -164,11 +164,6 @@ private long internalCheck(EC2Computer computer) {

            final long idleMilliseconds = this.clock.millis() - computer.getIdleStartMilliseconds();

-            //Stopped instance restarted and Idletime has not be reset
-            if ( uptime <  idleMilliseconds) {


This was added to prevent computers with stale uptimes from terminating.

Yes, but this PR should have fixed the stale uptimes, right?

Much like in this PR, that number could possibly be stale due to eventual consistency.

I am not sure I understand. I am under the impression that #359 should permanently fix the stale uptime issue in EC2RetentionStrategy.java, since getState() gets the latest info from AWS and refreshes the uptime value. Could you please elaborate on the concern?

I do not know how the EC2 API behaves in this case. Previously we had issues of stale uptimes causing nodes to terminate but those nodes were just spun but did throw the exception seen in this PR

I tried out an experiment on this scenario and was able to confirm that uptime < idleMilliseconds can be reproduced as follows.

Consider a node that was stopped by ec2-plugin on idle timeout after finishing a build.

For this node, uptime > idleMilliseconds, since uptime is counted from the time the instance was last started, while idle time only starts counting after the connection process and the build have completed.

Now, manually start this stopped instance. Since ec2-plugin is not aware of this launch, it will never connect to the node, and the node will not come online. Idle time here represents the time since the node went idle before the plugin stopped it, while uptime is the time since you manually launched the machine on AWS. Thus, uptime < idleMilliseconds. The presence of this condition in EC2RetentionStrategy.java basically prevents ec2-plugin from ever shutting down that manually started instance unless you manually initiate the connection procedure, which in turn resets the idle time.

A similar issue occurs in JENKINS-57795, where ec2-plugin starts the instance but never connects to it, so idleMilliseconds is never reset.

Use the following script in the Jenkins script console to try this out:

    package hudson.plugins.ec2;
    import java.util.concurrent.TimeUnit;
    import org.apache.commons.lang.math.NumberUtils;
    Jenkins jenkins = Jenkins.getInstance()
    int STARTUP_TIME_DEFAULT_VALUE = 30;
    int STARTUP_TIMEOUT = NumberUtils.toInt(System.getProperty(EC2RetentionStrategy.class.getCanonicalName() + ".startupTimeout", String.valueOf(STARTUP_TIME_DEFAULT_VALUE)), STARTUP_TIME_DEFAULT_VALUE)
    EC2Computer computer = jenkins.getComputer("Your node name in jenkins")
    EC2RetentionStrategy ret = new EC2RetentionStrategy('-5')
    idleTerminationMinutes = ret.idleTerminationMinutes
    println computer.isIdle()
    println computer.getState();
    println computer.isOffline()
    long uptime = computer.getUptime()
    long idleMilliseconds = System.currentTimeMillis() - computer.getIdleStartMilliseconds()
    println uptime < idleMilliseconds
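
The arithmetic behind the scenario above can be worked through with made-up timestamps. The sketch below is only an illustration: the class, the method names and the minute values are hypothetical, and it assumes, as this thread describes, that the removed guard skipped the idle-timeout handling whenever uptime < idleMilliseconds.

```java
/** Hypothetical worked example of uptime versus idle time for a restarted, unconnected node. */
final class UptimeVsIdleSketch {
    /** Mirrors the removed comparison: true means the idle-timeout handling was skipped. */
    static boolean removedGuardSkips(long uptimeMillis, long idleMillis) {
        return uptimeMillis < idleMillis;
    }

    /** idleMilliseconds minus uptime; this does not depend on the current time. */
    static long gapMillis(long idleStartMillis, long restartedAtMillis) {
        return restartedAtMillis - idleStartMillis;
    }

    public static void main(String[] args) {
        long minute = 60_000L;
        long idleStart = 40 * minute;   // node went idle and was later stopped
        long restartedAt = 95 * minute; // instance started again but never connected

        for (long now : new long[] {100 * minute, 500 * minute}) {
            long idleMilliseconds = now - idleStart; // never reset without a connection
            long uptime = now - restartedAt;         // counted from the latest start
            // The guard stays true no matter how much time passes.
            System.out.println(removedGuardSkips(uptime, idleMilliseconds));
        }
        // The difference is fixed at 55 minutes, so the comparison can never flip.
        System.out.println(gapMillis(idleStart, restartedAt) / minute);
    }
}
```

Because both values grow at the same rate, the gap between them is constant, which is why such a node would never reach the stop or terminate path while the guard was in place.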

In light of this new discovery would you re-add that check?

No, I meant to explain that removing that check is necessary; otherwise it will prevent some slaves from getting stopped even after the idle timeout.

Sorry, I misunderstood

That was introduced because the startup time can be longer than the idle time.

res0nance · 2019-09-19T01:24:30Z

src/main/java/hudson/plugins/ec2/EC2Cloud.java

@@ -642,6 +642,8 @@ else if (jenkinsInstance.isTerminating()) {
    private PlannedNode createPlannedNode(final SlaveTemplate t, final EC2AbstractSlave slave) {
        return new PlannedNode(t.getDisplayName(),
                Computer.threadPoolForRemoting.submit(new Callable<Node>() {
+                    int retryCount     = 0;
+                    int DESCRIBE_LIMIT = 1;


There probably should be more than 1 retry and it should be final

Thanks for catching that, just fixed it and increased the count to 2

jvz

I don't know enough about how this feature works to offer a useful review.

fcojfernandez

Seems reasonable.

mirzmaster · 2019-10-22T13:09:32Z

Thank you @sdhoka & @res0nance for your efforts towards resolving this issue, which I believe we've been running into as well in our environment. When can we expect to see this in a release?

sdhoka · 2019-10-22T14:31:55Z

@res0nance
Thanks for getting this merged. Do you think that this fix might also be applicable for JENKINS-56036? Not sure if spot instances use the same logic

res0nance · 2019-10-22T15:09:38Z

It should apply to that as well.

res0nance · 2019-10-22T15:58:39Z

@mirzmaster I'm hoping to release this week

chkelly · 2019-11-07T18:46:45Z

Appreciate all the work put into this PR as we believe this will solve issues properly launching build slaves with our Jenkins installs.

Is there any sense of when the next release will be cut?

sdhoka added 2 commits September 15, 2019 18:42

Add a retry logic to connect to the node if aws reports stopped state…

7c0f2f8

… even after sending the start signal

Stop/terminate disconnected instances after idle timeout

1fdc689

res0nance reviewed Sep 19, 2019

sdhoka added 2 commits September 18, 2019 21:35

Increase retry attempts to 2

b78c4bd

Use static for DESCRIBE_LIMIT

8c6d855

sdhoka requested a review from res0nance September 19, 2019 03:35

res0nance mentioned this pull request Sep 19, 2019

[JENKINS-57795] Retry when creating planned nodes #398

Closed

res0nance changed the title ~~Retry when waking up orphans or creating new nodes~~ [JENKINS-57795] Retry when waking up orphans or creating new nodes Sep 19, 2019

res0nance added the bugfix label Sep 19, 2019

res0nance closed this Oct 17, 2019

res0nance reopened this Oct 17, 2019

res0nance approved these changes Oct 17, 2019

res0nance requested review from jvz and thoulen October 17, 2019 08:21

jvz reviewed Oct 17, 2019

res0nance requested a review from fcojfernandez October 18, 2019 00:27

fcojfernandez approved these changes Oct 21, 2019

res0nance merged commit 12b1d27 into jenkinsci:master Oct 22, 2019
