
tasks fail to restart because of failure to launch logmon #13198

Open
krishnaprateek6 opened this issue Jun 1, 2022 · 13 comments

Comments

@krishnaprateek6 commented Jun 1, 2022

Hi there,

Nomad version

v1.2.6

Operating system and Environment details

Centos 7

Issue

Once a Docker container is restarted successfully, Nomad fails to re-attach the job. This was reported a long time ago in #5296, but it still appears to be an issue on Nomad v1.2.6. The Docker container is still running on the VM, but Nomad does not show the job as running; it's an intermittent problem where containers keep running yet the jobs mysteriously disappear from Nomad. Another issue we notice is that running Docker containers emit tons of "got error while decoding json" warnings in dockerd, and `systemctl restart nomad` takes much longer than expected, though Nomad does eventually start. We suspect these issues are interconnected. The big problem is that we are unable to reproduce this at will; it comes up every now and then and we have to react. Restarting Docker and Nomad fixes it, but we have to do that every time the problem appears.

Reproduction steps

Run `nomad stop <job>` followed by `nomad run <job>`. The Docker container comes up and runs, but the job fails to show up in the Nomad UI or in `nomad status`, even though `docker ps` shows the container is up. It isn't consistent, but it happens every now and then.

Expected Result

Jobs should show up in Nomad whenever their Docker containers are fully up.

Actual Result

`docker ps` shows the container is up, but it does not show up in Nomad as a registered job.

dockerd logs

dockerd[30372]: time="2022-06-01T09:15:25.317857180-06:00" level=warning msg="got error while decoding json" error="unexpected EOF" retries=170
Jun 01 09:15:25  dockerd[30372]: time="2022-06-01T09:15:25.317871168-06:00" level=warning msg="got error while decoding json" error="unexpected EOF" retries=171
Jun 01 09:15:25  dockerd[30372]: time="2022-06-01T09:15:25.317884580-06:00" level=warning msg="got error while decoding json" error="unexpected EOF" retries=172
Jun 01 09:15:25  dockerd[30372]: time="2022-06-01T09:15:25.317899266-06:00" level=warning msg="got error while decoding json" error="unexpected EOF" retries=173
Jun 01 09:15:25  dockerd[30372]: time="2022-06-01T09:15:25.317913751-06:00" level=warning msg="got error while decoding json" error="unexpected EOF" retries=174
Jun 01 09:15:25  dockerd[30372]: time="2022-06-01T09:15:25.317928000-06:00" level=warning msg="got error while decoding json" error="unexpected EOF" retries=175
Jun 01 09:15:25  dockerd[30372]: time="2022-06-01T09:15:25.317942272-06:00" level=warning msg="got error while decoding json" error="unexpected EOF" retries=176
Jun 01 09:15:25  dockerd[30372]: time="2022-06-01T09:15:25.317956568-06:00" level=warning msg="got error while decoding json" error="unexpected EOF" retries=177
Jun 01 09:15:25  dockerd[30372]: time="2022-06-01T09:15:25.317980363-06:00" level=warning msg="got error while decoding json" error="unexpected EOF" retries=178
Jun 01 09:15:25  dockerd[30372]: time="2022-06-01T09:15:25.318001335-06:00" level=warning msg="got error while decoding json" error="unexpected EOF" retries=179
Jun 01 09:15:25  dockerd[30372]: time="2022-06-01T09:15:25.318019534-06:00" level=warning msg="got error while decoding json" error="unexpected EOF" retries=180
Jun 01 09:15:25  dockerd[30372]: time="2022-06-01T09:15:25.318034561-06:00" level=warning msg="got error while decoding json" error="unexpected EOF" retries=181
Jun 01 09:15:25  dockerd[30372]: time="2022-06-01T09:15:25.318048961-06:00" level=warning msg="got error while decoding json" error="unexpected EOF" retries=182
Jun 01 09:15:25  dockerd[30372]: time="2022-06-01T09:15:25.318068103-06:00" level=warning msg="got error while decoding json" error="unexpected EOF" retries=183
Jun 01 09:15:25 dockerd[30372]: time="2022-06-01T09:15:25.318085230-06:00" level=warning msg="got error while decoding json" error="unexpected EOF" retries=184
Jun 01 09:15:25  dockerd[30372]: time="2022-06-01T09:15:25.318099485-06:00" level=warning msg="got error while decoding json" error="unexpected EOF" retries=185
Jun 01 09:15:25  dockerd[30372]: time="2022-06-01T09:15:25.318113410-06:00" level=warning msg="got error while decoding json" error="unexpected EOF" retries=186
Jun 01 09:15:25  dockerd[30372]: time="2022-06-01T09:15:25.318128129-06:00" level=warning msg="got error while decoding json" error="unexpected EOF" retries=187
Jun 01 09:15:25  dockerd[30372]: time="2022-06-01T09:15:25.323296331-06:00" level=warning msg="got error while decoding json" error="unexpected EOF" retries=188
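
For context on those dockerd warnings: the json-file log driver stores one JSON object per line, and a partially written trailing line makes a JSON decoder fail with `unexpected EOF`, which matches the messages above. Here is a minimal Go sketch of that failure mode (an illustration under that assumption, not dockerd's actual code; the `logEntry` shape mirrors the json-file entry format):

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// logEntry mirrors the shape of a json-file log line:
// {"log":"...","stream":"stdout","time":"..."}
type logEntry struct {
	Log    string `json:"log"`
	Stream string `json:"stream"`
	Time   string `json:"time"`
}

func main() {
	// One complete entry followed by a truncated one, as you would see
	// if the last line of the log file was only partially written.
	data := `{"log":"hello\n","stream":"stdout","time":"2022-06-01T09:15:25Z"}
{"log":"trunc`
	dec := json.NewDecoder(strings.NewReader(data))
	for {
		var e logEntry
		if err := dec.Decode(&e); err != nil {
			// Prints: decode error: unexpected EOF
			fmt.Println("decode error:", err)
			break
		}
		fmt.Printf("entry: %q\n", e.Log)
	}
}
```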

Nomad Client logs (if appropriate)

I have attached the Nomad logs below.

@krishnaprateek6 (Author)

docker logs

@krishnaprateek6 (Author)

NomadFailure (1).txt

Specifically, at line 160 of the attached file, Nomad starts throwing logmon errors and everything goes sideways for all Nomad jobs.

@krishnaprateek6 (Author)

Task hook failed logmon: Unrecognized remote plugin message: This usually means that the plugin is either invalid or simply needs to be recompiled to support the latest protocol

@tgross added this to Needs Triage in Nomad - Community Issues Triage via automation Jun 2, 2022
@tgross (Member) commented Jun 6, 2022

@krishnaprateek6 I've edited your original request because all the information was hidden inside an HTML comment for some reason. The relevant bits from the attached logs look to be here:

May 31 06:06:50 test-server nomad[3720]:  client.driver_mgr.docker: created container: driver=docker container_id=6fd72c65b07b8791b2693ec3170a69fa3a9004e1d3bcf029c8
May 31 06:06:50 test-server sh[3714]: 2022-05-31T06:06:50.459-0600 [INFO]  client.driver_mgr.docker: created container: driver=docker container_id=6fd72c65b07b8791b
May 31 06:07:31 test-server nomad[3720]: client.driver_mgr.docker: failed to start container: driver=docker container_id=ea0fea5546d4058a99d92ba88f1bbc5ab29fa3737ed
May 31 06:07:31 test-server sh[3714]: 2022-05-31T06:07:31.281-0600 [ERROR] client.driver_mgr.docker: failed to start container: driver=docker container_id=ea0fea554
May 31 06:07:31 test-server nomad[3720]: client.alloc_runner.task_runner: running driver failed: alloc_id=fd98c3f3-a865-c30d-14db-7d8420b948f0 task=service error=
May 31 06:07:31 test-server nomad[3720]:  client.alloc_runner.task_runner: restarting task: alloc_id=fd98c3f3-a865-c30d-14db-7d8420b948f0 task=service reason="Res
May 31 06:07:31 test-server sh[3714]: 2022-05-31T06:07:31.293-0600 [ERROR] client.alloc_runner.task_runner: running driver failed: alloc_id=fd98c3f3-a865-c30d-14db-
May 31 06:07:31 test-server sh[3714]: 2022-05-31T06:07:31.293-0600 [INFO]  client.alloc_runner.task_runner: restarting task: alloc_id=fd98c3f3-a865-c30d-14db-7d8420
May 31 06:07:46 test-server nomad[3720]: client.alloc_runner.task_runner.task_hook: failed to launch logmon process: alloc_id=fd98c3f3-a865-c30d-14db-7d8420b948f0 t
                                          error=
                                          | Unrecognized remote plugin message:
                                          |
                                          | This usually means that the plugin is either invalid or simply
                                          | needs to be recompiled to support the latest protocol.
May 31 06:07:46 test-server sh[3714]: 2022-05-31T06:07:46.960-0600 [ERROR] client.alloc_runner.task_runner.task_hook: failed to launch logmon process: alloc_id=fd98
May 31 06:07:46 test-server sh[3714]: error=
May 31 06:07:46 test-server sh[3714]: | Unrecognized remote plugin message:
May 31 06:07:46 test-server sh[3714]: |
May 31 06:07:46 test-server sh[3714]: | This usually means that the plugin is either invalid or simply
May 31 06:07:46 test-server sh[3714]: | needs to be recompiled to support the latest protocol.
May 31 06:07:46 test-server nomad[3720]: client.alloc_runner.task_runner: prestart failed: alloc_id=fd98c3f3-a865-c30d-14db-7d8420b948f0 task=service
                                          error=
                                          | prestart hook "logmon" failed: Unrecognized remote plugin message:
                                          |
                                          | This usually means that the plugin is either invalid or simply
                                          | needs to be recompiled to support the latest protocol.
May 31 06:07:46 test-server sh[3714]: 2022-05-31T06:07:46.971-0600 [ERROR] client.alloc_runner.task_runner: prestart failed: alloc_id=fd98c3f3-a865-c30d-14db-7d8420
May 31 06:07:46 test-server sh[3714]: error=
May 31 06:07:46 test-server sh[3714]: | prestart hook "logmon" failed: Unrecognized remote plugin message:
May 31 06:07:46 test-server sh[3714]: |
May 31 06:07:46 test-server sh[3714]: | This usually means that the plugin is either invalid or simply
May 31 06:07:46 test-server sh[3714]: | needs to be recompiled to support the latest protocol.
May 31 06:07:46 test-server sh[3714]: 2022-05-31T06:07:46.971-0600 [INFO]  client.alloc_runner.task_runner: restarting task: alloc_id=fd98c3f3-a865-c30d-14db-7d8420

It's a little unclear to me from your description how the situation starts. Is this happening after restarting the Nomad client agent, or after restarting the Docker container? If you mean the Docker container, was it restarted because it failed or was it restarted through a docker command?

Possible duplicate of #11939 or #6715, except it looks like this is a non-Windows case.
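
For background on the `Unrecognized remote plugin message` error itself: logmon is launched as a go-plugin subprocess, and go-plugin expects the child's first line on stdout to be a pipe-delimited handshake (`CORE-PROTOCOL-VERSION|APP-PROTOCOL-VERSION|NETWORK-TYPE|NETWORK-ADDR|PROTOCOL`). A rough sketch of that check (illustrative only, based on go-plugin's documented handshake format, not Nomad's actual source):

```go
package main

import (
	"fmt"
	"strings"
)

// parseHandshake sketches how a go-plugin-style client validates the first
// line a plugin subprocess writes to stdout, e.g. "1|2|unix|/tmp/plugin123|grpc".
func parseHandshake(line string) error {
	parts := strings.Split(strings.TrimSpace(line), "|")
	if len(parts) < 4 {
		return fmt.Errorf("Unrecognized remote plugin message: %s\n\n"+
			"This usually means that the plugin is either invalid or simply\n"+
			"needs to be recompiled to support the latest protocol.", line)
	}
	fmt.Printf("core=%s app=%s net=%s addr=%s\n", parts[0], parts[1], parts[2], parts[3])
	return nil
}

func main() {
	_ = parseHandshake("1|2|unix|/tmp/plugin123|grpc") // a well-formed handshake line
	// Stray output (or no output) where the handshake should be triggers the error.
	fmt.Println(parseHandshake("some stray log output"))
}
```

In practice this suggests the logmon child process exited or emitted unexpected output before completing the handshake, rather than an actual protocol-version mismatch.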

@krishnaprateek6 (Author)

Possibly this is related to #5577, but that looks like it was fixed in an older version of Nomad. It's after we start seeing these logmon errors in the Nomad client logs that we notice the odd behavior where Nomad cannot re-attach an existing job, even though `docker ps` shows the container is up and running.

@krishnaprateek6 (Author) commented Jun 7, 2022

@tgross It looks like I have more concrete info now. The Nomad job itself is running, but a job running on a static port becomes inaccessible after we see the logmon errors above. It's as if Nomad cannot release static ports after the job restarts itself; maybe that's useful for you. https://discuss.hashicorp.com/t/nomad-job-unreachable-in-bridge-mode-networking-even-though-container-runs-fine/38202/2

@tgross (Member) commented Jun 8, 2022

@krishnaprateek6 let's try to avoid hypothesizing for now and focus on describing the sequence of events and symptoms you're seeing. So far we only know that:

  • The nomad logs show that a task is restarting, followed by logmon launch errors.
  • The dockerd logs show a bunch of JSON errors.
  • The nomad logs show that we may be launching a docker container but then it's immediately getting killed because Nomad can't run logmon.

Stuff that would be helpful for me to help debug:

  • Earlier I asked if you knew what started the problem in the first place. Did the client host get restarted, or did the task restart for some other reason? Did this problem start appearing after an upgrade?
  • Can you provide the output of docker --version? (Also: is this upstream Docker or RedHat's podman pretending to be docker?)

@krishnaprateek6 (Author)

@tgross To answer your first question: no, the client did not get restarted, but the task did get restarted by Nomad. The first time we looked at it, it was a DB connection issue, which we resolved; the second time it happened there were no errors. The task was running in Nomad, but the port it was running on became inaccessible. FYI, we have all firewall rules open for the static ports. Nomad somehow (we're not sure why) seems to hold on to the static port, so when the job tries to restart, even though no port is actually allocated, Nomad thinks the port is already allocated, and the service running on that static port becomes inaccessible. Unfortunately, we are unable to replicate the missing-jobs issue in Nomad.

`docker --version` output: 19.03.13, build 4484c46d9d (this is on a CentOS 7 VM).

@tgross (Member) commented Jun 8, 2022

@krishnaprateek6 if you have a running task that can't be reached on the network, that's a completely different problem than the logmon attachment. Please open a separate issue for that.

@tgross (Member) commented Jun 8, 2022

I don't have a solution for this but I'm going to mark this issue for further investigation. There may be something related to #11939 or #6715 here but we'll need to build an independent reproduction for this one I think. I've retitled the issue for clarity.

@tgross removed their assignment Jun 8, 2022
@tgross moved this from In Progress to Done in Nomad - Community Issues Triage Jun 8, 2022
@tgross moved this from Done to Needs Roadmapping in Nomad - Community Issues Triage Jun 8, 2022
@tgross changed the title from "Nomad cannot re-attach Job inspite of docker containers still running on VM & healthy" to "tasks fail to restart because of failure to launch logmon" Jun 8, 2022
@krishnaprateek6 (Author)

Is there any info or progress on why `systemctl restart nomad` takes so long for the Nomad service to restart, please?

@tgross (Member) commented Sep 21, 2022

@krishnaprateek6 open a new issue detailing that problem. Please don't pollute existing issues with unrelated questions.

@nobert commented Oct 13, 2022

Hi @tgross, we are running into the same problem. We're running 1.4.1 on Debian 11. It happens on a nomad client restart. We pretty much only restart clients when upgrading nomad, but I imagine it can happen on a normal restart as well.

Up to now we've only detected it via expiring vault tokens which are still in use by allocations which failed to restart after receiving new secrets. Looking at the logs it appears that the logmon error is happening to most, if not all, allocations after a nomad client restart. However, the number of allocations that continue using old (expired) vault tokens after that is very small (most recently, 1 allocation out of 14 allocations on 1 node, out of 3 nodes total with roughly ~14-15 allocations per node), so it would appear that most are restarting successfully in the end anyway.

It's a strange issue and we're not sure how to proceed. Is there anything you can suggest to debug further?
