Unable to start tasks "logmon: fork/exec /usr/bin/nomad;62edd7f7: no such file or directory" #14079

ktf · 2022-08-10T15:40:23Z

Nomad version

Nomad v1.3.3 (428b2cd)

Operating system and Environment details

Centos 7

Issue

Today a given machine in our cluster was failing to run some services and the only message which was there is:

"logmon: fork/exec /usr/bin/nomad;62edd7f7: no such file or directory"

restarting the machine did not help. Then suddenly after I pushed a raw_exec batch job for test to the same machine, things got back into business. Any idea of what might have been going on?

Reproduction steps

No idea.

Expected Result

Having tasks running on that machine, as usual.

Actual Result

Tasks were not running.

The text was updated successfully, but these errors were encountered:

shoenig · 2022-08-10T19:24:14Z

Hi @ktf, can't say I've seen anything like this.

Were the failures coming from tasks being launched for the first time? Or were these tasks that Nomad was trying to reattach to? What task driver were these tasks using?

ktf · 2022-08-10T19:41:59Z

Thank you for your reply. No, the tasks have been there since a while and they have a constrain of the given machine. They are stateless long running services. They are using the exec driver. Could it be that some internal nomad state was not fully committed and that the scheduling of the raw_exec batch job cleaned up something? Does the error message mean that the executable /usr/bin/nomad is not found?

shoenig · 2022-08-12T15:12:23Z

@ktf, was the nomad executable itself modified / replaced / upgraded while the tasks were running? The error message suggests os.Executable returned the literal string: /usr/bin/nomad;62edd7f7 (i.e. including the odd ;62edd7f7 suffix). I'm not aware of anything in Nomad code that would do that 🤔

https://github.com/hashicorp/nomad/blob/v1.3.3/client/logmon/plugin.go#L19

ktf · 2022-08-12T15:36:20Z

Yes, I suspect this correlates with the move from 1.3.2 to 1.3.3, which was done via yum update in the background. I will pin the versions for the future, but right now I am still unable to launch jobs on many of the machines (but not all).

ktf · 2022-08-12T15:42:57Z

for the record... I also tried adding a symlink to nomad;62edd7f7 and it still does not work.

ktf · 2022-08-12T15:43:11Z

also, the hash is different on every machine which does not work.

shoenig · 2022-08-12T15:55:39Z

Can we get the exact kernel you're using via uname -a ?

ktf · 2022-08-12T16:07:21Z

Linux XXX.cern.ch 3.10.0-1127.19.1.el7.x86_64 #1 SMP Tue Aug 25 17:23:54 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

ktf · 2022-08-13T07:25:43Z

I am not 100% sure, but apparently:

https://github.com/golang/go/blob/946b4baaf6521d521928500b2b57429c149854e7/src/syscall/syscall_linux.go#L262

does not keep into account that readlink does not need to return a 0 terminated string. No?
Am I wrong?

shoenig · 2022-08-15T14:02:17Z

I think the readlink call is okay - the buffer being passed in to store the result is created via Go's make which automatically zero-fills the values

https://cs.opensource.google/go/go/+/refs/tags/go1.19:src/os/file_unix.go;l=370

I did however come across https://superuser.com/questions/1714705/bash-processes-with-deleted-link-targets-in-proc-pid-exe which describes exactly the same format you're seeing - /usr/bin/bash;61f1808a (deleted) (in our case, the (deleted) is being removed by the os.Executable implementation). They are also using CentOS 7 / a 3.10 kernel.

ktf · 2022-08-15T14:32:10Z

Ah, ok, sorry my go-fu is limited. :-)

Thank you for the pointer, I will also inquire with our VM experts, since one of the commonalities seems to be the fact it happens in both cases when running on an hypervisor.

That said, do you think you could simply strip whatever is right of ; on your side (just like you remove (deleted)?

ktf · 2022-08-15T15:02:09Z

I suspect this:

https://www.reddit.com/r/linux/comments/apmptq/comment/egcc313/

is also relevant. Apparently there is some container related vulnerability with /proc/self/exe and containers, which might be somehow worked around by RH.

shoenig · 2022-08-15T19:30:50Z

That said, do you think you could simply strip whatever is right of ; on your side (just like you remove (deleted)?

Yeah I think that's likely where this is headed, it would just be so much more comforting if we could at least track down where the suffix is originating 😬

One question that came up in an internal discussion - you mentioned rebooting the machine didn't fix it - do you mean you continued to see the exact same issue - or was there some other error condition?

ktf · 2022-08-16T06:03:34Z

One question that came up in an internal discussion - you mentioned rebooting the machine didn't fix it - do you mean you continued to see the exact same issue - or was there some other error condition?

Exact same condition after a reboot (i.e. issuing "reboot" on a root shell). Notice this is a VM on OpenStack (I think the actual backend is QEMU), I do not know if that might be the reason. I have not tried an hard reboot from the OpenStack console.

I am pretty much convinced this is SELinux coming into play. What I can do on my side is to:

Hard reboot the VM
Hard reboot the VM disabling SELinux

and see if there is any improvements. I will try in a bit.

This PR causes the logmon task runner to acquire the binary of the Nomad executable in an 'init' block, so as to almost certainly get the name while the nomad file still exists. This is an attempt at fixing the case where a deleted Nomad file (e.g. during upgrade) may be getting renamed with a mysterious suffix first. If this doesn't work, as a last resort we can literally just trim the mystery string. Fixes: #14079

github-actions · 2022-12-23T02:13:55Z

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

ktf added the type/bug label Aug 10, 2022

shoenig added the stage/needs-investigation label Aug 10, 2022

shoenig self-assigned this Aug 12, 2022

tgross added this to Needs Triage in Nomad - Community Issues Triage via automation Aug 22, 2022

tgross moved this from Needs Triage to In Progress in Nomad - Community Issues Triage Aug 22, 2022

tgross moved this from In Progress to Triaging in Nomad - Community Issues Triage Aug 22, 2022

shoenig mentioned this issue Aug 24, 2022

client/logmon: acquire executable in init block #14297

Merged

shoenig closed this as completed in #14297 Aug 24, 2022

Nomad - Community Issues Triage automation moved this from Triaging to Done Aug 24, 2022

github-actions bot locked as resolved and limited conversation to collaborators Dec 23, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Unable to start tasks "logmon: fork/exec /usr/bin/nomad;62edd7f7: no such file or directory" #14079

Unable to start tasks "logmon: fork/exec /usr/bin/nomad;62edd7f7: no such file or directory" #14079

ktf commented Aug 10, 2022 •

edited

Loading

shoenig commented Aug 10, 2022

ktf commented Aug 10, 2022 •

edited

Loading

shoenig commented Aug 12, 2022

ktf commented Aug 12, 2022

ktf commented Aug 12, 2022

ktf commented Aug 12, 2022

shoenig commented Aug 12, 2022

ktf commented Aug 12, 2022

ktf commented Aug 13, 2022

shoenig commented Aug 15, 2022

ktf commented Aug 15, 2022

ktf commented Aug 15, 2022

shoenig commented Aug 15, 2022

ktf commented Aug 16, 2022

github-actions bot commented Dec 23, 2022

Unable to start tasks "logmon: fork/exec /usr/bin/nomad;62edd7f7: no such file or directory" #14079

Unable to start tasks "logmon: fork/exec /usr/bin/nomad;62edd7f7: no such file or directory" #14079

Comments

ktf commented Aug 10, 2022 • edited Loading

Nomad version

Operating system and Environment details

Issue

Reproduction steps

Expected Result

Actual Result

shoenig commented Aug 10, 2022

ktf commented Aug 10, 2022 • edited Loading

shoenig commented Aug 12, 2022

ktf commented Aug 12, 2022

ktf commented Aug 12, 2022

ktf commented Aug 12, 2022

shoenig commented Aug 12, 2022

ktf commented Aug 12, 2022

ktf commented Aug 13, 2022

shoenig commented Aug 15, 2022

ktf commented Aug 15, 2022

ktf commented Aug 15, 2022

shoenig commented Aug 15, 2022

ktf commented Aug 16, 2022

github-actions bot commented Dec 23, 2022

ktf commented Aug 10, 2022 •

edited

Loading

ktf commented Aug 10, 2022 •

edited

Loading