Jobs are not resubmitted. Looks like unexpected disconnect treated as not unexpected. #121

Closed
jsmirnov opened this issue Jul 15, 2019 · 4 comments

jsmirnov commented Jul 15, 2019

Hello,

I recently switched to your plugin. It works great, thank you for your work. I am currently using version 1.9.1.

I've noticed that some jobs are not resubmitted.
I found one example and investigated it a little.

AWS Console.

Instance termination: Server.SpotInstanceTermination: Spot instance termination

7/15/2019, 12:06:50 PM | instanceChange | launched | {"instanceType":"m5.12xlarge","image":"...","productDescription":"Linux/UNIX","availabilityZone":"..."} | i-053b75f7181fa021b

7/15/2019, 12:10:33 PM | instanceChange | terminated | {"instanceType":"m5.12xlarge","image":"...","productDescription":"Linux/UNIX","availabilityZone":"..."} | i-053b75f7181fa021b

Jenkins job output

Cannot contact i-053b75f7181fa021b: hudson.remoting.ChannelClosedException: Channel "unknown": Remote call on i-053b75f7181fa021b failed. The channel is closing down or has closed down
---our timeout---
Cancelling nested steps due to timeout
Could not connect to i-053b75f7181fa021b to send interrupt signal to process

Jenkins logs:

2019-07-15 09:10:39.361+0000 [id=53]	INFO	c.a.j.ec2fleet.EC2FleetCloud#info: FleetCloud [docker] Fleet (docker) no longer has the instance i-053b75f7181fa021b, removing from Jenkins.
2019-07-15 09:10:39.362+0000 [id=4297605]	INFO	c.a.j.e.EC2FleetAutoResubmitComputerLauncher#afterDisconnect: Unexpected removing fleet node termination but resubmit disabled, no actions, disableTaskResubmit: false, offline: true, offlineCause: class hudson.slaves.OfflineCause$SimpleOfflineCause
2019-07-15 09:10:39.364+0000 [id=4301538]	INFO	c.a.j.e.EC2FleetAutoResubmitComputerLauncher#afterDisconnect: Unexpected removing fleet node termination but resubmit disabled, no actions, disableTaskResubmit: false, offline: true, offlineCause: class hudson.slaves.OfflineCause$SimpleOfflineCause

I see that the code only handles OfflineCause.ChannelTermination, so it doesn't resubmit the job, because in our case the cause is class hudson.slaves.OfflineCause$SimpleOfflineCause.

So it looks like the node was terminated by AWS, but for some reason the disconnect was not reported as an unexpected ChannelTermination.

Should this check also include other causes, or do we need some additional configuration?
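
To illustrate the check described above, here is a minimal self-contained sketch. The classes are hypothetical stand-ins, not the real Jenkins `hudson.slaves.OfflineCause` types or the plugin's actual code; it only assumes the guard is an `instanceof` test combined with the `disableTaskResubmit` flag seen in the log.

```java
public class NarrowCauseCheck {
    // Hypothetical stand-ins for the Jenkins offline-cause hierarchy (not the real classes).
    static class OfflineCause {}
    static class ChannelTermination extends OfflineCause {}
    static class SimpleOfflineCause extends OfflineCause {}

    // Assumed shape of the guard: resubmit only when resubmit is enabled
    // and the offline cause is a ChannelTermination.
    static boolean shouldResubmit(OfflineCause cause, boolean disableTaskResubmit) {
        return !disableTaskResubmit && cause instanceof ChannelTermination;
    }

    public static void main(String[] args) {
        // A channel termination passes the guard.
        System.out.println(shouldResubmit(new ChannelTermination(), false)); // prints "true"
        // A SimpleOfflineCause does not, even though disableTaskResubmit is false.
        System.out.println(shouldResubmit(new SimpleOfflineCause(), false)); // prints "false"
    }
}
```

Under these assumptions, the second call matches the logged case: `disableTaskResubmit: false` with a `SimpleOfflineCause`, and no resubmission.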

@jsmirnov
Author

Just found a similar one:
State transition reason message Server.SpotInstanceTermination: Spot instance termination

2019-07-15 10:08:01.206+0000 [id=55]	INFO	c.a.j.ec2fleet.EC2FleetCloud#info: FleetCloud [docker] Fleet (docker) no longer has the instance i-0c09857bbb04db080, removing from Jenkins.
2019-07-15 10:08:01.212+0000 [id=4458832]	INFO	c.a.j.e.EC2FleetAutoResubmitComputerLauncher#afterDisconnect: Unexpected removing fleet node termination but resubmit disabled, no actions, disableTaskResubmit: false, offline: true, offlineCause: class hudson.slaves.OfflineCause$SimpleOfflineCause
2019-07-15 10:08:01.213+0000 [id=55]	INFO	c.a.jenkins.ec2fleet.CloudNanny#doRun: Error during fleet FleetCloud stats update
java.util.ConcurrentModificationException
	at java.util.HashMap$HashIterator.nextNode(HashMap.java:1445)
	at java.util.HashMap$KeyIterator.next(HashMap.java:1469)
	at com.amazon.jenkins.ec2fleet.EC2FleetCloud.updateStatus(EC2FleetCloud.java:367)
	at com.amazon.jenkins.ec2fleet.CloudNanny$1.call(CloudNanny.java:44)
	at com.amazon.jenkins.ec2fleet.CloudNanny$1.call(CloudNanny.java:41)
	at hudson.model.Queue._withLock(Queue.java:1438)
	at hudson.model.Queue.withLock(Queue.java:1299)
	at com.amazon.jenkins.ec2fleet.CloudNanny.doRun(CloudNanny.java:41)
	at hudson.triggers.SafeTimerTask.run(SafeTimerTask.java:72)
	at jenkins.security.ImpersonatingScheduledExecutorService$1.run(ImpersonatingScheduledExecutorService.java:58)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
2019-07-15 10:08:01.214+0000 [id=4453143]	INFO	c.a.j.e.EC2FleetAutoResubmitComputerLauncher#afterDisconnect: Unexpected removing fleet node termination but resubmit disabled, no actions, disableTaskResubmit: false, offline: true, offlineCause: class hudson.slaves.OfflineCause$SimpleOfflineCause

@jsmirnov changed the title from "Jobs are not resubmitted. Looks like disconnect is not unexpected." to "Jobs are not resubmitted. Looks like unexpected disconnect treated as not unexpected." on Jul 15, 2019

terma commented Jul 17, 2019

Hi, thanks for the report. It looks like the plugin should check this cause too. Which version of Jenkins do you use?

@jsmirnov
Author

Jenkins ver. 2.186

SrodriguezO added a commit to lucidsoftware/ec2-fleet-plugin that referenced this issue Sep 9, 2020
(computer.getOfflineCause() instanceof OfflineCause.ChannelTermination) does not always hold after an unexpected instance termination.
As reported in issue jenkinsci#121, the offline cause is sometimes simply hudson.slaves.OfflineCause$SimpleOfflineCause; this led to executables only being occasionally resubmitted.

This commit makes it so that active executables are always resubmitted regardless of the offline cause.
The assumption is that active executables that fail due to an instance going offline should always be automatically rescheduled (unless disableTaskResubmit is specified)
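
The behavior this commit message describes could be sketched roughly as follows. This is not the plugin's code: `OfflineCause` is a hypothetical stand-in, `tasksToResubmit` is an illustrative helper name, and tasks are modeled as plain strings. The only point shown is that the offline cause is no longer inspected and `disableTaskResubmit` is the sole opt-out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ResubmitAnyCause {
    // Hypothetical stand-in for an offline cause; the relaxed guard never inspects it.
    static class OfflineCause {}

    // Returns the tasks to reschedule after a node goes offline.
    // Every active task is returned unless resubmission is disabled.
    static List<String> tasksToResubmit(List<String> activeTasks,
                                        OfflineCause cause,
                                        boolean disableTaskResubmit) {
        if (disableTaskResubmit) {
            return new ArrayList<>();
        }
        return new ArrayList<>(activeTasks);
    }

    public static void main(String[] args) {
        List<String> active = Arrays.asList("build-1", "build-2");
        // Any cause, resubmit enabled: both tasks are rescheduled.
        System.out.println(tasksToResubmit(active, new OfflineCause(), false));
        // Resubmit disabled: nothing is rescheduled.
        System.out.println(tasksToResubmit(active, new OfflineCause(), true));
    }
}
```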
@imuqtadir

This should now be fixed with #209
