-
Notifications
You must be signed in to change notification settings - Fork 690
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Offload terminates to separate thread #433
Conversation
@@ -459,6 +465,11 @@ public Node reconfigure(final StaplerRequest req, JSONObject form) throws FormEx | |||
return null; | |||
} | |||
|
|||
@Override | |||
public boolean isAcceptingTasks() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From the javadocs this signals that the computer can't accept tasks so when terminating once out of the lock this node should not be able to receive a task
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems to look good overall, though I'd like to avoid introducing sleep
calls in new tests.
rs.check(c); | ||
EC2AbstractSlave node = c.getNode(); | ||
while (node.terminateScheduled) | ||
Thread.sleep(TimeUnit.SECONDS.toMillis(1)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At first, I was going to suggest mocking the Clock
instance in the test to advance time, but then I realized this is a spinlock. To be a more reliable test that depends less on the passage of time (i.e., stateful), use of CountDownLatch which counts down at the end of the Runnable
submitted would be more appropriate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, how do you propose that to work. Since the call happens inside terminate() I don't see a way of introducing it in tests.
EDIT: I think i understand what you mean, but do we want to change the boolean into something like a CountDownLatch we would need to reassign the latch everytime we call terminate (Which ideally should be once). I do agree that it makes the tests better
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you want to make it repeatable, there's a CyclicBarrier class you can use to reset the counter each time. Basically, you count down on terminate, and in the unit test, you use await() on the latch or barrier.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As for testability, it'd be ok to expose as package-private for unit tests (I think there might be a @VisibleForTesting
annotation or something similar you can use).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like a reasonable approach! Perhaps one day someone will introduce RxJava or Reactor or something that makes it a little easier to do these types of things.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the late review, but these days have been a bit crazy
This PR seems to have introduced a critical bug @res0nance https://issues.jenkins-ci.org/browse/JENKINS-62062 |
@@ -109,6 +112,10 @@ | |||
/* The time at which we fetched the last instance data */ | |||
protected transient long lastFetchTime; | |||
|
|||
/** Terminate was scheduled */ | |||
@Nonnull | |||
protected transient ResettableCountDownLatch terminateScheduled = new ResettableCountDownLatch(1, false); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@res0nance My java isn't very good, but I think when these objects are deserialized ResettableCountDownLatch
is getting set to null
by default and not a new object.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you can fix this in readResolve
@@ -459,6 +466,11 @@ public Node reconfigure(final StaplerRequest req, JSONObject form) throws FormEx | |||
return null; | |||
} | |||
|
|||
@Override | |||
public boolean isAcceptingTasks() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The addition of this method has resulted in a critical bug (as noted by @nigelarmstrong-bc) - this method does not exist on serialized versions of this class created by previous plugins, and you get this exception in Jenkins on any access:
Caught exception evaluating: c.iconClassName in /jenkins/ajaxExecutors. Reason: java.lang.reflect.InvocationTargetException
java.lang.NullPointerException
at hudson.plugins.ec2.EC2AbstractSlave.isAcceptingTasks(EC2AbstractSlave.java:471)
at hudson.model.Computer.isAcceptingTasks(Computer.java:1612)
at hudson.slaves.SlaveComputer.isAcceptingTasks(SlaveComputer.java:177)
at hudson.model.Computer.getIconClassName(Computer.java:753)
Caused: java.lang.reflect.InvocationTargetException
EDIT: upon further reflection (ha), I think @nigelarmstrong-bc is correct and readResolve is the place to fix this, as the new field's not getting initialized due to Java serialization semantics - the method is in fact there since the latest version of the class is present in the classloader, but its dependent field is not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc @thoulen ^^
* [JENKINS-62043] Fix NPE which broke Jenkins builds In PR #433, which was released in AWS EC2 Plugin 1.50, a bug was introduced due to improper handling of an important transient field which is accessed very frequently by the Jenkins server due to it being referenced in `EC2AbstractSlave.isAcceptingTasks()`. This broke any Jenkins user which had running EC2 Nodes during the update to plugin 1.50 (see https://issues.jenkins-ci.org/browse/JENKINS-62043 for details). * Excluded false-positive nullcheck Spotbugs rule Turned off a Spotbugs rule which incorrectly understands how Java serialization works, and which thought a transient field couldn't be null in Object.readResolve() when it actually can be. * Removed @nonnull annotation from transient field Removed a javax.annotation.Nonnull annotation from a transient field to silence a Spotbugs warning which incorrectly flagged a field as never being null when it can in fact be null due to Java serialization semantics * Removed unneeded NONNULL exclusion Code in EC2AbstractSlave was updated to remove a javax.annotation.Nonnull annotation, so this Spotbugs warning should no longer trigger and we can remove the exclusion.
Offload termination of agents so as to not stall
NodeProvisioner.update
when NodeProvisioner.update is called it will hold the queue lock. In order to reduce the hotness of this lock schedule all terminates to be off thread