
Hetzner agents are (still) not cleaned up every time #56

Closed
sandrinr opened this issue Oct 5, 2022 · 8 comments · Fixed by #59
Labels
bug Something isn't working

Comments


sandrinr commented Oct 5, 2022

Jenkins and plugins versions report

Environment
Jenkins: 2.361.1
OS: Linux - 5.4.0-105-generic
---
ace-editor:1.1
analysis-model-api:10.17.0
ansicolor:1.0.2
antisamy-markup-formatter:2.7
apache-httpcomponents-client-4-api:4.5.13-138.v4e7d9a_7b_a_e61
authentication-tokens:1.4
bootstrap5-api:5.2.1-2
bouncycastle-api:2.26
branch-api:2.1046.v0ca_37783ecc5
build-blocker-plugin:1.7.8
build-monitor-plugin:1.13+build.202205140447
caffeine-api:2.9.3-65.v6a_47d0f4d1fe
checks-api:1.7.5
cloud-stats:0.27
cloudbees-folder:6.758.vfd75d09eea_a_1
cobertura:1.17
code-coverage-api:3.1.0
command-launcher:90.v669d7ccb_7c31
copyartifact:1.47
credentials:1189.vf61b_a_5e2f62e
credentials-binding:523.vd859a_4b_122e6
dark-theme:245.vb_a_2b_b_010ea_96
data-tables-api:1.12.1-3
display-url-api:2.3.6
docker-commons:1.21
docker-workflow:521.v1a_a_dd2073b_2e
downstream-build-cache:1.7
durable-task:500.v8927d9fd99d8
echarts-api:5.3.3-1
extended-read-permission:3.2
font-awesome-api:6.2.0-3
forensics-api:1.16.0
git:4.12.1
git-client:3.12.0
git-server:99.va_0826a_b_cdfa_d
gitlab-api:5.0.1-78.v47a_45b_9f78b_7
gitlab-branch-source:640.v7101b_1c0def9
gitlab-plugin:1.5.35
handy-uri-templates-2-api:2.1.8-22.v77d5b_75e6953
hetzner-cloud:66.v5cb45cbd155e
instance-identity:116.vf8f487400980
ionicons-api:28.va_f3a_84439e5f
jackson2-api:2.13.3-285.vc03c0256d517
jacoco:3.3.2
jakarta-activation-api:2.0.1-1
jakarta-mail-api:2.0.1-1
javax-activation-api:1.2.0-4
javax-mail-api:1.6.2-7
jaxb:2.3.6-1
jdk-tool:55.v1b_32b_6ca_f9ca
jersey2-api:2.37-1
jquery3-api:3.6.1-1
jsch:0.1.55.61.va_e9ee26616e7
junit:1143.v8d9a_e3355270
lockable-resources:2.16
mailer:438.v02c7f0a_12fa_4
matrix-project:785.v06b_7f47b_c631
mina-sshd-api-common:2.9.1-44.v476733c11f82
mina-sshd-api-core:2.9.1-44.v476733c11f82
parameterized-trigger:2.45
pipeline-build-step:2.18
pipeline-groovy-lib:612.v84da_9c54906d
pipeline-input-step:451.vf1a_a_4f405289
pipeline-milestone-step:101.vd572fef9d926
pipeline-model-api:2.2114.v2654ca_721309
pipeline-model-definition:2.2114.v2654ca_721309
pipeline-model-extensions:2.2114.v2654ca_721309
pipeline-stage-step:296.v5f6908f017a_5
pipeline-stage-tags-metadata:2.2114.v2654ca_721309
pipeline-utility-steps:2.13.0
plain-credentials:139.ved2b_9cf7587b
plot:2.1.11
plugin-util-api:2.17.0
popper2-api:2.11.6-1
prism-api:1.29.0-1
role-strategy:561.v9846c7351a_41
saml:4.352.vb_722786ea_79d
scm-api:621.vda_a_b_055e58f7
script-security:1183.v774b_0b_0a_a_451
scriptler:3.5
snakeyaml-api:1.32-86.ve3f030a_75631
ssh-agent:295.v9ca_a_1c7cc3a_a_
ssh-credentials:305.v8f4381501156
ssh-slaves:2.846.v1b_70190624f5
sshd:3.249.v2dc2ea_416e33
structs:324.va_f5d6774f3a_d
theme-manager:1.5
timestamper:1.20
token-macro:308.v4f2b_ed62b_b_16
trilead-api:2.72.v2a_3236754f73
variant:59.vf075fe829ccb
warnings-ng:9.20.0
workflow-aggregator:590.v6a_d052e5a_a_b_5
workflow-api:1192.v2d0deb_19d212
workflow-basic-steps:994.vd57e3ca_46d24
workflow-cps:2801.vf82a_b_b_e3e8a_5
workflow-durable-task-step:1199.v02b_9244f8064
workflow-job:1239.v71b_b_a_124a_725
workflow-multibranch:716.vc692a_e52371b_
workflow-scm-step:400.v6b_89a_1317c9a_
workflow-step-api:639.v6eca_cd8c04a_a_
workflow-support:838.va_3a_087b_4055b
yet-another-build-visualizer:1.16

What Operating System are you using (both controller, and any agents involved in the problem)?

Ubuntu 20.04 as the host; Jenkins is running inside a Docker container using the image jenkins/jenkins:lts.

Reproduction steps

Unclear...

  1. Observe agents being created and deleted according to the expectation of the configured shutdown policy
  2. Wait some time, maybe restart Jenkins, ...unclear
  3. Observe agents not being deleted anymore, even though the configured shutdown policy would mandate it

Expected Results

Agents are always deleted according to the configured shutdown policy.

Actual Results

After a while, agents are no longer deleted and are effectively used as if they were static agents.

Anything else?

After manually deleting those stale agents in Jenkins, Jenkins creates new ones on demand, and there is a chance that these new agents are cleaned up again according to the configured shutdown policy.

Example:

Consider the following screenshot from our Jenkins' cloud statistics:
[screenshot: cloud statistics overview]

In this screenshot you can see three areas: blue, red and purple.

  • The purple area denotes the agents that were created and removed according to the expectation of the shutdown policy
  • The red area denotes agents that were not shut down automatically and were deleted manually
  • The blue area denotes an agent that was created after the manual clean up and that was deleted automatically according to the expectation of the shutdown policy.

Maybe important additional information:

  • The Docker container was restarted (re-created) on 2022-10-01 14:24:33, which falls between the purple and the red area. So essentially, after the Jenkins container was recreated, the shutdown policy no longer worked.
  • In the logs I don't see any log message from the Hetzner plugin between 2022-10-01 21:20:42 and 2022-10-05 07:33:00 (essentially the period during which the agents were stale, i.e. the red area)
  • What I see a lot in the log is an exception of the form [0]. However, I think this exception is caused by a bug in another plugin; it also happens during times when the Hetzner plugin deletes agents as expected.
[0]
2022-10-04 22:28:49.554+0000 [id=108311]	SEVERE	hudson.plugins.plot.CSVSeries#loadSeries: Exception trying to retrieve series files
Also:   hudson.remoting.Channel$CallSiteStackTrace: Remote call to hcloud-ttvgwgquximu2tex
		at hudson.remoting.Channel.attachCallSiteStackTrace(Channel.java:1784)
		at hudson.remoting.UserRequest$ExceptionResponse.retrieve(UserRequest.java:356)
		at hudson.remoting.Channel.call(Channel.java:1000)
		at hudson.FilePath.act(FilePath.java:1186)
		at hudson.FilePath.act(FilePath.java:1175)
		at hudson.FilePath.list(FilePath.java:2125)
		at hudson.FilePath.list(FilePath.java:2108)
		at hudson.FilePath.list(FilePath.java:2092)
		at hudson.plugins.plot.CSVSeries.loadSeries(CSVSeries.java:149)
		at hudson.plugins.plot.Plot.addBuild(Plot.java:720)
		at hudson.plugins.plot.PlotBuilder.perform(PlotBuilder.java:235)
		at jenkins.tasks.SimpleBuildStep.perform(SimpleBuildStep.java:123)
		at org.jenkinsci.plugins.workflow.steps.CoreStep$Execution.run(CoreStep.java:101)
		at org.jenkinsci.plugins.workflow.steps.CoreStep$Execution.run(CoreStep.java:71)
		at org.jenkinsci.plugins.workflow.steps.SynchronousNonBlockingStepExecution.lambda$start$0(SynchronousNonBlockingStepExecution.java:47)
		at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
		at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
		at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
		at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
		at java.base/java.lang.Thread.run(Thread.java:829)
java.io.IOException: Expecting Ant GLOB pattern, but saw '/home/jenkins/workspace/vices_infrastructuretest_SA-8466/jenkins-sgapplianceInfrastructureTesttest.csv'. See https://ant.apache.org/manual/Types/fileset.html for syntax
	at hudson.FilePath.glob(FilePath.java:2162)
	at hudson.FilePath$ListGlob.invoke(FilePath.java:2143)
	at hudson.FilePath$ListGlob.invoke(FilePath.java:2128)
	at hudson.FilePath$FileCallableWrapper.call(FilePath.java:3492)
	at hudson.remoting.UserRequest.perform(UserRequest.java:211)
	at hudson.remoting.UserRequest.perform(UserRequest.java:54)
	at hudson.remoting.Request$2.run(Request.java:376)
	at hudson.remoting.InterceptingExecutorService.lambda$wrap$0(InterceptingExecutorService.java:78)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)
sandrinr added the bug label on Oct 5, 2022

rkosegi commented Oct 5, 2022

Hi @sandrinr, thank you for the bug report.
I'm afraid this looks like something we have seen in the past and for which we have no fix available.

Can you please provide:

  • the shutdown policy configured on the affected agents
  • the output of this script executed in the script console (a stand-in sketch is shown below, since the script link itself is not preserved in this thread)
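
As a rough stand-in for the linked script, something along these lines in the script console surfaces similar state; the HetznerServerComputer class name is taken from the output further down in this thread, and isIdle()/isOffline() are standard hudson.model.Computer accessors:

// Stand-in sketch for the linked diagnostic script, not the original.
// Lists all Hetzner-backed computers together with their basic state.
def hetzner = Jenkins.instance.computers.findAll {
    it instanceof cloud.dnation.jenkins.plugins.hetzner.HetznerServerComputer
}
hetzner.each { c ->
    println "${c.name}: idle=${c.isIdle()}, offline=${c.isOffline()}"
}
return hetzner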


sandrinr commented Oct 9, 2022

Hi @rkosegi, thanks for picking it up.

Now it has happened again, again after a restart of the Jenkins Docker container, and again the first node(s) spawned did not get shut down when they should have been.

[screenshot: cloud statistics overview]

The output of the script:
true
HetznerServerInfo(sshKeyDetail=SshKeyDetail(created=2022-03-23T06:14:25+00:00, fingerprint=31:b3:c1:4f:d4:6b:5f:97:b4:50:f1:1f:3e:4d:49:7c, labels={jenkins.io/managed-by=hetzner-jenkins-plugin, jenkins.io/credentials-id=..}, name=..., publicKey=...), serverDetail=ServerDetail(name=hcloud-jdawffnkmzmsg2l5, status=running, created=2022-10-09T03:48:11+00:00, publicNet=PublicNetDetail(), privateNet=[], serverType=ServerType(name=cpx41, description=CPX 41, cores=8, memory=16, disk=240), datacenter=DatacenterDetail(name=nbg1-dc3, description=Nuremberg 1 DC 3, location=LocationDetail(name=nbg1, description=Nuremberg DC Park 1, country=DE, city=Nuremberg)), image=ImageDetail(type=snapshot, status=available, name=null, description=Generated by jenkins_node_hetzner_cloud_image.jenkins)))
true
-------------
Result: [cloud.dnation.jenkins.plugins.hetzner.HetznerServerComputer@2ede2248]

As I see it, that node should be shut down (all conditions are met). But somehow, I suspect, the code that decides whether to shut the node down is never called.

@sandrinr

Adding another data point: the issue is currently happening for us all the time. There is no automatic deletion of agents anymore, and restarts did not help either.

What I did now was go to the cloud settings, change the shutdown policy to time-out based, save, then change it back to billing-cycle based and save again. Now agents are automatically deleted as expected. My guess is that it will work for a while until it stops working again.

@sandrinr

@rkosegi I think I can share some updates.

Recently the issue came up again. I started to dig and made the following discovery.

I wanted to see the actual retention strategy active on the Hetzner cloud nodes. For that I ran the following script:

def hetznerComputers = Jenkins.instance.computers.findAll {it instanceof cloud.dnation.jenkins.plugins.hetzner.HetznerServerComputer}
hetznerComputers.each { println("${it.name}: ${it.retentionStrategy}") }

The output of the script during the time when nodes were not shut down correctly was:

hcloud-gdazcgia5at0q8dq: hudson.slaves.RetentionStrategy$Always@10fad782
hcloud-i6hclb7t28kxuabz: hudson.slaves.RetentionStrategy$Always@10fad782
hcloud-i9e1nplpjyl5eh3b: hudson.slaves.RetentionStrategy$Always@10fad782
hcloud-jjvddplmktenjx9h: hudson.slaves.RetentionStrategy$Always@10fad782
hcloud-qwlivqfkvtxulvpl: hudson.slaves.RetentionStrategy$Always@10fad782
hcloud-ul9dz5soavq4qnf5: hudson.slaves.RetentionStrategy$Always@10fad782

Also, after deleting all nodes manually and running a build that triggers the creation of a new node, the retention strategy is the same:

hcloud-rnkbnep9dcpisyyb: hudson.slaves.RetentionStrategy$Always@10fad782

Actually, even restarting Jenkins did not change this. After a restart and triggering a node creation, it was still the same:

hcloud-whjnixzynhf653qf: hudson.slaves.RetentionStrategy$Always@7f2c4f8b

I was able to fix the issue by changing a setting in the Hetzner cloud settings (in my case I changed the max number of allowed nodes). After that, and without a Jenkins restart, the next newly created node had the retention policy I expected:

hcloud-j5pxincalzklmnox: cloud.dnation.jenkins.plugins.hetzner.shutdown.BeforeHourWrapsPolicy$RetentionStrategyImpl@5d46de5c

The node was also shut down as expected by the retention strategy.

So the issue seems to be that, after some point, all newly created nodes get the retention policy Always assigned and as such are never shut down.
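
A script console check along these lines would flag agents stuck in that state (a sketch; it relies only on the HetznerServerComputer class used above and on the standard hudson.slaves.RetentionStrategy API):

import hudson.slaves.RetentionStrategy

// Flag Hetzner agents whose effective retention strategy is Always and
// which would therefore never be shut down by the plugin.
Jenkins.instance.computers
    .findAll { it instanceof cloud.dnation.jenkins.plugins.hetzner.HetznerServerComputer }
    .each { c ->
        def strategy = c.node?.retentionStrategy
        def marker = strategy instanceof RetentionStrategy.Always ? ' <-- never shut down' : ''
        println "${c.name}: ${strategy?.class?.name}${marker}"
    }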

I was unable to find out where this is coming from. Even when checking the main Jenkins config.xml, the configured retention strategy was:

<shutdownPolicy class="cloud.dnation.jenkins.plugins.hetzner.shutdown.BeforeHourWrapsPolicy"/>

Since the issue seems to persist even during restarts, this needs to be persisted somewhere...
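
For what it's worth, the cloud-level <shutdownPolicy> in the main config.xml is separate from the per-agent state: Jenkins persists each agent under $JENKINS_HOME/nodes/<node-name>/config.xml, and that file carries the node's own retention strategy. A healthy node would be expected to reference the policy's strategy class roughly like this (a sketch; the element layout is illustrative, only the class name is taken from the script output above):

<!-- Sketch of $JENKINS_HOME/nodes/<node-name>/config.xml (layout illustrative).
     A broken agent would reference hudson.slaves.RetentionStrategy$Always here,
     or lack the element entirely. -->
<retentionStrategy class="cloud.dnation.jenkins.plugins.hetzner.shutdown.BeforeHourWrapsPolicy$RetentionStrategyImpl"/>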

Also, I have no idea what is causing the nodes to have the retention policy Always assigned in the first place.


rkosegi commented Oct 25, 2022

Hi @sandrinr, this is a very interesting finding. I will try to find some time soon to look into it.

rkosegi added a commit that referenced this issue Oct 30, 2022
For some reason, the value maintained by hudson.model.Slave is null under
the circumstances described in #56, which causes the return value to be
set to RetentionStrategy.Always.INSTANCE.

Closes #56

Signed-off-by: Richard Kosegi <richard.kosegi@gmail.com>
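
To make that mechanism concrete, here is a minimal Groovy illustration of the pattern the commit message describes: a getter that masks a null field with a permissive default (illustration only, not the actual Jenkins or plugin code; the names are made up):

// Illustration only, not the actual hudson.model.Slave code.
class SlaveLike {
    def retentionStrategy   // null under the circumstances described in #56

    def getRetentionStrategy() {
        // A null field silently degrades to "keep this agent forever",
        // the equivalent of RetentionStrategy.Always.INSTANCE.
        retentionStrategy ?: 'Always'
    }
}

// An agent whose persisted strategy was lost now looks permanently retained:
assert new SlaveLike().getRetentionStrategy() == 'Always'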

rkosegi commented Oct 30, 2022

@sandrinr I hope I have fixed the root cause, but I want to emphasize that when you restart the Jenkins controller, the plugin assumes that no servers were running.

@sandrinr

@rkosegi We have had the fix in operation for a while now and have not experienced the issue again. Thank you very much for the fix 🙏.


rkosegi commented Nov 16, 2022

@sandrinr that is great news, thanks for the confirmation.
