-
Notifications
You must be signed in to change notification settings - Fork 690
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[JENKINS-57795] Retry when waking up orphans or creating new nodes #399
Conversation
… even after sending the start signal
@res0nance |
@@ -164,11 +164,6 @@ private long internalCheck(EC2Computer computer) { | |||
|
|||
final long idleMilliseconds = this.clock.millis() - computer.getIdleStartMilliseconds(); | |||
|
|||
//Stopped instance restarted and Idletime has not be reset | |||
if ( uptime < idleMilliseconds) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was added to prevent computers with stale uptimes from terminating.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes but this PR should have fixed the stale uptimes, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Much like this PR that number could possibly be stale due to eventual consistency
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure I understand, I am under the impression that #359 should permanently fix the stale uptime issue in EC2RetentionStrategy.java
since getstate()
will get the latest info from AWS and refresh uptime value. Could you please elaborate on the concern?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not know how the EC2 API behaves in this case. Previously we had issues of stale uptimes causing nodes to terminate but those nodes were just spun but did throw the exception seen in this PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried out an experiment on this scenario and was able to confirm that uptime < idleMilliseconds
can be reproduced as follows
- Consider a node that was stopped by ec2-plugin on idle timeout after finishing the build
- For this node ,
uptime > idleMilliseconds
since uptime is counted from the time when the instance was last started but idle time is calculated after the connection process and build completes. - Now, manually start this stopped instance. Since ec2-plugin is not aware of this launch, ec2-plugin will never connect to the node and it will not come online. Idle time here will represent the time since the node got idle before it was stopped by the plugin while uptime will be the time since you just manually launched this machine on AWS. Thus,
uptime < idleMilliseconds
. The presence of this condition inEC2RetentionStrategy.java
basically prevents the ec2-plugin from ever shutting down that manually started instance unless you manually initiate the connection procedure which in turn resets the idle time. - A similar issue occurs in
JENKINS-57795
where ec2-plugin starts the instance but never connects to it, this causes idlemilliseconds to never reset - Use the following script in the Jenkins script console to try this out
package hudson.plugins.ec2;
import java.util.concurrent.TimeUnit;
import org.apache.commons.lang.math.NumberUtils;
Jenkins jenkins = Jenkins.getInstance()
int STARTUP_TIME_DEFAULT_VALUE = 30;
int STARTUP_TIMEOUT = NumberUtils.toInt(System.getProperty(EC2RetentionStrategy.class.getCanonicalName() + ".startupTimeout",String.valueOf(STARTUP_TIME_DEFAULT_VALUE)), STARTUP_TIME_DEFAULT_VALUE)
EC2Computer computer = jenkins.getComputer("Your node name in jenkins")
EC2RetentionStrategy ret = new EC2RetentionStrategy('-5')
idleTerminationMinutes = ret.idleTerminationMinutes
println computer.isIdle()
println computer.getState();
println computer.isOffline()
long uptime = computer.getUptime()
long idleMilliseconds = System.currentTimeMillis() - computer.getIdleStartMilliseconds()
println uptime < idleMilliseconds
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In light of this new discovery would you re-add that check?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, I meant to explain that removing that check is necessary otherwise it will prevent some slaves from getting stopped even after idle timeout
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I misunderstood
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that was introduced because the startup time can be longer than idle
@@ -642,6 +642,8 @@ else if (jenkinsInstance.isTerminating()) { | |||
private PlannedNode createPlannedNode(final SlaveTemplate t, final EC2AbstractSlave slave) { | |||
return new PlannedNode(t.getDisplayName(), | |||
Computer.threadPoolForRemoting.submit(new Callable<Node>() { | |||
int retryCount = 0; | |||
int DESCRIBE_LIMIT = 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There probably should be more than 1 retry and it should be final
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for catching that, just fixed it and increased the count to 2
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know enough about how this feature works to offer a useful review.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems reasonable.
Thank you @sdhoka & @res0nance for your efforts towards resolving this issue, which I believe we've been running into as well in our environment. When can we expect to see this in a release? |
@res0nance |
It should be that as well |
@mirzmaster I'm hoping to release this week |
Appreciate all the work put into this PR as we believe this will solve issues properly launching build slaves with our Jenkins installs. Is there any sense of when the next release will be cut? |
Root Cause Analysis:
The use of threadpool at times causes some threads to return old values of instance state when making the getState() call. I was able to replicate similar thing using a ruby script (preferred language) that started a stopped instance and used about 10 threads to poll the state method, few threads still returned stopped for a millisecond interval although the instance had already moved to pending. This causes some instances to be started on AWS but are never connected and also, never stopped/terminated
Solution:
stopped
state from thegetstate()
method. The thread will wait for 5 seconds before retrying. I have been testing this fix in our internal environment and a retry attempt of 1 seems to work perfectly. The retry count can be increased but that can keep the thread waiting for long.uptime < idleMilliseconds
check from EC2RetentionStrategy.java. For planned stopped nodes that are started but not connected,idleMilliseconds
will be counted from the time even before when it was last stopped and will be always greater thanuptime
. This prevents such faulty nodes from being terminated and also incurs AWS costs. The condition is ideally true for instances in the launching phase that haven't been connected yet. However, this is already taken care of by the condition here which protects termination of launching nodes from getting terminated before 30 minutes (STARTUP_TIMEOUT)