Nomad marks deployments as successful before Consul checks have passed #15536

Closed
Nalgorithm opened this issue Dec 12, 2022 · 4 comments · Fixed by #15822

Nalgorithm commented Dec 12, 2022

Nomad version and environment details

Nomad 1.1.2 -> 1.4.2
Consul 1.9.5
Ubuntu 20.04

Environment configuration:

Servers: 3 Nomad servers and Consul servers (services are registered here), plus Consul servers with ACLs.
Clients: each client runs a Nomad client and a Consul agent.

Issue

After upgrading Nomad from 1.1.2 to 1.4.2, some deployments are marked as successful even though the service has not yet been registered with Consul. The problem reproduces when several deployments are run at the same time.

Reproduction steps

  1. Build a Docker image of a service that runs for about 3 seconds after starting and then fails.
  2. Run several deployments at the same time via the script nomad-reproducer.go.txt (a hypothetical sketch of such a driver is shown after this list).
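
The attachment itself is not reproduced here. Below is a hypothetical sketch of what such a driver could look like, assuming the Nomad Go API client (github.com/hashicorp/nomad/api), a reachable local agent, and a job spec file (named job.hcl purely for illustration) whose task runs the failing image; the deployment count and job ID prefix are likewise assumptions, not details taken from the attachment.

// Hypothetical reconstruction of a reproducer, not the attached
// nomad-reproducer.go.txt: register several copies of a job whose task
// exits non-zero after ~3 seconds, then report each deployment's status.
package main

import (
	"fmt"
	"log"
	"os"
	"sync"
	"time"

	"github.com/hashicorp/nomad/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig()) // honors NOMAD_ADDR
	if err != nil {
		log.Fatal(err)
	}

	hcl, err := os.ReadFile("job.hcl") // job spec for the failing service
	if err != nil {
		log.Fatal(err)
	}

	// Run several deployments at the same time.
	var wg sync.WaitGroup
	for i := 0; i < 20; i++ {
		wg.Add(1)
		go func(n int) {
			defer wg.Done()
			job, err := client.Jobs().ParseHCL(string(hcl), true)
			if err != nil {
				log.Println("parse:", err)
				return
			}
			id := fmt.Sprintf("run-fail-test-%d", n)
			job.ID, job.Name = &id, &id
			if _, _, err := client.Jobs().Register(job, nil); err != nil {
				log.Println("register:", err)
			}
		}(i)
	}
	wg.Wait()

	// Give the deployments time to finish, then list their final statuses.
	// Expectation: all fail; with the bug, some end up "successful".
	time.Sleep(2 * time.Minute)
	deployments, _, err := client.Deployments().List(nil)
	if err != nil {
		log.Fatal(err)
	}
	for _, d := range deployments {
		fmt.Printf("%s (%s): %s - %s\n", d.JobID, d.ID, d.Status, d.StatusDescription)
	}
}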

Expected Behaviour

All deployments fail.

Actual Behaviour

A few deployments are successful.

Logs for one allocation for successful deployment

Nomad starts processing the deployment:

nomad[2668973]:     2022-12-09T13:34:35.361+0300 [DEBUG] client.alloc_runner.task_runner: lifecycle start condition has been met, proceeding: alloc_id=384708cc-5ca5-1e14-ca6b-7017045ee3c7 task=shiva-ci-TestNomad-Run-Fail-test-25-2704-task
nomad[2668973]:     2022-12-09T13:34:35.422+0300 [WARN]  client.alloc_runner.task_runner.task_hook.logmon: plugin configured with a nil SecureConfig: alloc_id=384708cc-5ca5-1e14-ca6b-7017045ee3c7 task=shiva-ci-TestNomad-Run-Fail-test-25-2704-task
nomad[2668973]:     2022-12-09T13:34:35.422+0300 [DEBUG] client.alloc_runner.task_runner.task_hook.logmon: starting plugin: alloc_id=384708cc-5ca5-1e14-ca6b-7017045ee3c7 task=shiva-ci-TestNomad-Run-Fail-test-25-2704-task path=/usr/bin/nomad args=["/usr/bin/nomad", "logmon"]
nomad[2668973]:     2022-12-09T13:34:35.422+0300 [DEBUG] client.alloc_runner.task_runner.task_hook.logmon: plugin started: alloc_id=384708cc-5ca5-1e14-ca6b-7017045ee3c7 task=shiva-ci-TestNomad-Run-Fail-test-25-2704-task path=/usr/bin/nomad pid=1424417
nomad[2668973]:     2022-12-09T13:34:35.422+0300 [DEBUG] client.alloc_runner.task_runner.task_hook.logmon: waiting for RPC address: alloc_id=384708cc-5ca5-1e14-ca6b-7017045ee3c7 task=shiva-ci-TestNomad-Run-Fail-test-25-2704-task path=/usr/bin/nomad
nomad[2668973]:     2022-12-09T13:34:35.435+0300 [DEBUG] client.alloc_runner.task_runner.task_hook.logmon.nomad: plugin address: alloc_id=384708cc-5ca5-1e14-ca6b-7017045ee3c7 task=shiva-ci-TestNomad-Run-Fail-test-25-2704-task @module=logmon address=/tmp/plugin1614056546 network=unix timestamp="2022-12-09T13:34:35.435+0300"
nomad[2668973]:     2022-12-09T13:34:35.435+0300 [DEBUG] client.alloc_runner.task_runner.task_hook.logmon: using plugin: alloc_id=384708cc-5ca5-1e14-ca6b-7017045ee3c7 task=shiva-ci-TestNomad-Run-Fail-test-25-2704-task version=2
nomad[2668973]:     2022-12-09T13:34:35.442+0300 [INFO]  client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=384708cc-5ca5-1e14-ca6b-7017045ee3c7 task=shiva-ci-TestNomad-Run-Fail-test-25-2704-task @module=logmon path=/var/lib/nomad/alloc/384708cc-5ca5-1e14-ca6b-7017045ee3c7/alloc/logs/.shiva-ci-TestNomad-Run-Fail-test-25-2704-task.stdout.fifo timestamp="2022-12-09T13:34:35.442+0300"
nomad[2668973]:     2022-12-09T13:34:35.443+0300 [INFO]  client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=384708cc-5ca5-1e14-ca6b-7017045ee3c7 task=shiva-ci-TestNomad-Run-Fail-test-25-2704-task @module=logmon path=/var/lib/nomad/alloc/384708cc-5ca5-1e14-ca6b-7017045ee3c7/alloc/logs/.shiva-ci-TestNomad-Run-Fail-test-25-2704-task.stderr.fifo timestamp="2022-12-09T13:34:35.443+0300"
nomad[2668973]:     2022-12-09T13:34:35.534+0300 [DEBUG] client.driver_mgr.docker: binding directories: driver=docker task_name=shiva-ci-TestNomad-Run-Fail-test-25-2704-task binds="[]string{\"/var/lib/nomad/alloc/384708cc-5ca5-1e14-ca6b-7017045ee3c7/alloc:/alloc\", \"/var/lib/nomad/alloc/384708cc-5ca5-1e14-ca6b-7017045ee3c7/shiva-ci-TestNomad-Run-Fail-test-25-2704-task/local:/local\", \"/var/lib/nomad/alloc/384708cc-5ca5-1e14-ca6b-7017045ee3c7/shiva-ci-TestNomad-Run-Fail-test-25-2704-task/secrets:/secrets\", \"/etc/docker/resolv.conf:/etc/resolv.conf:ro\"}"
nomad[2668973]:     2022-12-09T13:34:35.535+0300 [DEBUG] client.driver_mgr.docker: applied labels on the container: driver=docker task_name=shiva-ci-TestNomad-Run-Fail-test-25-2704-task labels=map[com.hashicorp.nomad.alloc_id:384708cc-5ca5-1e14-ca6b-7017045ee3c7]
nomad[2668973]:     2022-12-09T13:34:35.535+0300 [DEBUG] client.driver_mgr.docker: setting container name: driver=docker task_name=shiva-ci-TestNomad-Run-Fail-test-25-2704-task container_name=shiva-ci-TestNomad-Run-Fail-test-25-2704-task-384708cc-5ca5-1e14-ca6b-7017045ee3c7

dockerd starts the container:

dockerd[60232]: 2022-12-09T13:34:35.535621200+03:00 level=debug msg="Calling POST /containers/create?name=shiva-ci-TestNomad-Run-Fail-test-25-2704-task-384708cc-5ca5-1e14-ca6b-7017045ee3c7"
dockerd[60232]: 2022-12-09T13:34:51.808265565+03:00 level=debug msg="Assigning addresses for endpoint shiva-ci-TestNomad-Run-Fail-test-25-2704-task-384708cc-5ca5-1e14-ca6b-7017045ee3c7's interface on network vertis"

The container fails:

dockerd[60232]: 2022-12-09T13:34:57.848499016+03:00 level=debug msg="Revoking external connectivity on endpoint shiva-ci-TestNomad-Run-Fail-test-25-2704-task-384708cc-5ca5-1e14-ca6b-7017045ee3c7 (0469b6ae1bb562a08331195c330fb5fb05e231499861b4f5927ddc25020ef9cc)"
dockerd[60232]: 2022-12-09T13:35:03.354392690+03:00 level=debug msg="Releasing addresses for endpoint shiva-ci-TestNomad-Run-Fail-test-25-2704-task-384708cc-5ca5-1e14-ca6b-7017045ee3c7's interface on network vertis"

Nomad marks the deployment as successful:

2022-12-09 13:35:06.120 Deployment Update for ID "cae9bf7e-d534-eab4-a963-7e68893d8474": Status "successful"; Description "Deployment completed successfully"
_PID=241887 _SYSTEMD_UNIT=nomad.service

The container restarts due to the restart policy:

 nomad[2668973]:     2022-12-09T13:35:06.778+0300 [INFO]  client.alloc_runner.task_runner: restarting task: alloc_id=384708cc-5ca5-1e14-ca6b-7017045ee3c7 task=shiva-ci-TestNomad-Run-Fail-test-25-2704-task reason="Restart within policy" delay=1.155116655s
 nomad[2668973]:     2022-12-09T13:35:07.938+0300 [DEBUG] client.alloc_runner.task_runner: lifecycle start condition has been met, proceeding: alloc_id=384708cc-5ca5-1e14-ca6b-7017045ee3c7 task=shiva-ci-TestNomad-Run-Fail-test-25-2704-task
 nomad[2668973]:     2022-12-09T13:35:07.940+0300 [INFO]  client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=384708cc-5ca5-1e14-ca6b-7017045ee3c7 task=shiva-ci-TestNomad-Run-Fail-test-25-2704-task @module=logmon path=/var/lib/nomad/alloc/384708cc-5ca5-1e14-ca6b-7017045ee3c7/alloc/logs/.shiva-ci-TestNomad-Run-Fail-test-25-2704-task.stdout.fifo timestamp="2022-12-09T13:35:07.940+0300"
 nomad[2668973]:     2022-12-09T13:35:07.940+0300 [INFO]  client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=384708cc-5ca5-1e14-ca6b-7017045ee3c7 task=shiva-ci-TestNomad-Run-Fail-test-25-2704-task path=/var/lib/nomad/alloc/384708cc-5ca5-1e14-ca6b-7017045ee3c7/alloc/logs/.shiva-ci-TestNomad-Run-Fail-test-25-2704-task.stderr.fifo @module=logmon timestamp="2022-12-09T13:35:07.940+0300"
 nomad[2668973]:     2022-12-09T13:35:07.946+0300 [DEBUG] client.driver_mgr.docker: binding directories: driver=docker task_name=shiva-ci-TestNomad-Run-Fail-test-25-2704-task binds="[]string{\"/var/lib/nomad/alloc/384708cc-5ca5-1e14-ca6b-7017045ee3c7/alloc:/alloc\", \"/var/lib/nomad/alloc/384708cc-5ca5-1e14-ca6b-7017045ee3c7/shiva-ci-TestNomad-Run-Fail-test-25-2704-task/local:/local\", \"/var/lib/nomad/alloc/384708cc-5ca5-1e14-ca6b-7017045ee3c7/shiva-ci-TestNomad-Run-Fail-test-25-2704-task/secrets:/secrets\", \"/etc/docker/resolv.conf:/etc/resolv.conf:ro\"}"
 nomad[2668973]:     2022-12-09T13:35:07.946+0300 [DEBUG] client.driver_mgr.docker: applied labels on the container: driver=docker task_name=shiva-ci-TestNomad-Run-Fail-test-25-2704-task labels=map[com.hashicorp.nomad.alloc_id:384708cc-5ca5-1e14-ca6b-7017045ee3c7]
 nomad[2668973]:     2022-12-09T13:35:07.946+0300 [DEBUG] client.driver_mgr.docker: setting container name: driver=docker task_name=shiva-ci-TestNomad-Run-Fail-test-25-2704-task container_name=shiva-ci-TestNomad-Run-Fail-test-25-2704-task-384708cc-5ca5-1e14-ca6b-7017045ee3c7

First Consul check logs for alloc_id=384708cc-5ca5-1e14-ca6b-7017045ee3c7:

consul[48541]:     2022-12-09T13:36:17.347+0300 [DEBUG] agent: Check in sync: check=_nomad-check-a73262881d59e860d85f2ba206bc9adbd8c3d3eb
consul[48541]:     2022-12-09T13:36:17.599+0300 [DEBUG] agent: Check in sync: check=_nomad-check-a73262881d59e860d85f2ba206bc9adbd8c3d3eb
consul[48541]:     2022-12-09T13:36:17.636+0300 [DEBUG] agent: Check in sync: check=_nomad-check-a73262881d59e860d85f2ba206bc9adbd8c3d3eb
consul[48541]:     2022-12-09T13:36:17.684+0300 [DEBUG] agent: Check in sync: check=_nomad-check-a73262881d59e860d85f2ba206bc9adbd8c3d3eb
consul[48541]:     2022-12-09T13:36:17.685+0300 [DEBUG] agent: Check in sync: check=_nomad-check-a73262881d59e860d85f2ba206bc9adbd8c3d3eb
consul[48541]:     2022-12-09T13:36:17.717+0300 [DEBUG] agent: Check in sync: check=_nomad-check-a73262881d59e860d85f2ba206bc9adbd8c3d3eb
consul[48541]:     2022-12-09T13:36:17.732+0300 [DEBUG] agent: Check in sync: check=_nomad-check-a73262881d59e860d85f2ba206bc9adbd8c3d3eb
consul[48541]:     2022-12-09T13:36:17.772+0300 [DEBUG] agent: Check in sync: check=_nomad-check-a73262881d59e860d85f2ba206bc9adbd8c3d3eb
consul[48541]:     2022-12-09T13:36:17.817+0300 [DEBUG] agent: Check in sync: check=_nomad-check-a73262881d59e860d85f2ba206bc9adbd8c3d3eb
consul[48541]:     2022-12-09T13:36:17.818+0300 [DEBUG] agent: Check in sync: check=_nomad-check-a73262881d59e860d85f2ba206bc9adbd8c3d3eb
consul[48541]:     2022-12-09T13:36:17.855+0300 [DEBUG] agent: Check in sync: check=_nomad-check-a73262881d59e860d85f2ba206bc9adbd8c3d3eb
consul[48541]:     2022-12-09T13:36:17.858+0300 [DEBUG] agent: Check in sync: check=_nomad-check-a73262881d59e860d85f2ba206bc9adbd8c3d3eb
consul[48541]:     2022-12-09T13:36:17.877+0300 [DEBUG] agent: Check in sync: check=_nomad-check-a73262881d59e860d85f2ba206bc9adbd8c3d3eb
consul[48541]:     2022-12-09T13:36:17.879+0300 [DEBUG] agent: Check in sync: check=_nomad-check-a73262881d59e860d85f2ba206bc9adbd8c3d3eb
consul[48541]:     2022-12-09T13:36:17.880+0300 [DEBUG] agent: Check in sync: check=_nomad-check-a73262881d59e860d85f2ba206bc9adbd8c3d3eb
consul[48541]:     2022-12-09T13:36:17.898+0300 [DEBUG] agent: Check in sync: check=_nomad-check-a73262881d59e860d85f2ba206bc9adbd8c3d3eb
consul[48541]:     2022-12-09T13:36:17.981+0300 [DEBUG] agent: Check in sync: check=_nomad-check-a73262881d59e860d85f2ba206bc9adbd8c3d3eb
consul[48541]:     2022-12-09T13:36:17.996+0300 [DEBUG] agent: Check in sync: check=_nomad-check-a73262881d59e860d85f2ba206bc9adbd8c3d3eb
consul[48541]:     2022-12-09T13:36:18.120+0300 [WARN]  agent: Check is now critical: check=_nomad-check-a73262881d59e860d85f2ba206bc9adbd8c3d3eb
@usovamaria

@shoenig Hi! Could you please look at this issue?

tgross added this to Needs Triage in Nomad - Community Issues Triage via automation Dec 14, 2022

tgross commented Dec 14, 2022

Hi folks, we've marked this issue for triage.


tIGO commented Jan 13, 2023

Hi, any updates here? Unfortunately, we can't use this version in production, and downgrades are not Nomad's strong suit. Do you need more details?

shoenig self-assigned this Jan 17, 2023

shoenig commented Jan 17, 2023

Hi folks, sorry for missing this issue before. I am able to reproduce it fairly easily by fudging our use of the Consul API to pretend that checks are not being registered in time. I will find a fix and backport it in the next bugfix release set.

+++ b/command/agent/consul/service_client.go
@@ -1474,7 +1474,10 @@ func (c *ServiceClient) AllocRegistrations(allocID string) (*serviceregistration
                        sreg.Service = services[serviceID]
                        for checkID := range sreg.CheckIDs {
                                if check, ok := checks[checkID]; ok {
-                                       sreg.Checks = append(sreg.Checks, check)
+
+                                       // pretend like check not in consul yet
+                                       _ = check
+                                       // sreg.Checks = append(sreg.Checks, check)
                                }
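
In other words, when Consul has not yet registered the checks, the loop above simply skips them, and the later scan for failing checks runs over an empty list. A minimal standalone sketch of that failure mode (an illustration, not Nomad's actual health-watcher code) follows, using the Consul API client's check types:

// Minimal illustration of the failure mode, not Nomad's actual code:
// scanning only the checks Consul returned means an empty response is
// "healthy" by vacuous truth.
package main

import (
	"fmt"

	consulapi "github.com/hashicorp/consul/api"
)

// allocHealthy reports whether none of the returned checks are failing.
func allocHealthy(returned []*consulapi.AgentCheck) bool {
	for _, c := range returned {
		if c.Status != consulapi.HealthPassing {
			return false
		}
	}
	// An empty slice never enters the loop, so the allocation is
	// reported healthy even though no check has actually passed.
	return true
}

func main() {
	fmt.Println(allocHealthy(nil)) // prints true
}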

shoenig added this to the 1.4.x milestone Jan 17, 2023
shoenig removed this from Needs Triage in Nomad - Community Issues Triage Jan 17, 2023
shoenig added a commit that referenced this issue Jan 19, 2023
This PR fixes a bug where Nomad assumed any registered Checks would exist
in the service registration coming back from Consul. In some cases,
Consul may be slow in processing the check registration, and the response
object would not contain checks. Nomad would then scan the empty response
looking for Checks with failing health status, finding none, and then
marking a task/alloc as healthy.

In reality, we must always use Nomad's view of what checks should exist as
the source of truth, and compare that with the response Consul gives us,
making sure they match, before scanning the Consul response for failing
check statuses.

Fixes #15536
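
A minimal sketch of the comparison the commit message describes, assuming Nomad's expected check IDs and Consul's response are available as maps; this is illustrative only, not the code merged in #15822:

// Illustrative sketch of the described fix, not the code merged in #15822:
// every check Nomad registered must be present in Consul's response and
// passing before the allocation can be considered healthy.
package main

import (
	"fmt"

	consulapi "github.com/hashicorp/consul/api"
)

func checksHealthy(expected map[string]struct{}, fromConsul map[string]*consulapi.AgentCheck) bool {
	for checkID := range expected {
		check, ok := fromConsul[checkID]
		if !ok {
			// Consul has not caught up with the registration yet:
			// a missing check counts as unhealthy instead of being skipped.
			return false
		}
		if check.Status != consulapi.HealthPassing {
			return false
		}
	}
	return true
}

func main() {
	expected := map[string]struct{}{"_nomad-check-a73262881d59e860d85f2ba206bc9adbd8c3d3eb": {}}
	// Consul has not returned the check yet, so the allocation is unhealthy.
	fmt.Println(checksHealthy(expected, nil)) // prints false
}
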
shoenig added a commit that referenced this issue Jan 19, 2023
* consul: correctly understand missing consul checks as unhealthy

This PR fixes a bug where Nomad assumed any registered Checks would exist
in the service registration coming back from Consul. In some cases,
Consul may be slow in processing the check registration, and the response
object would not contain checks. Nomad would then scan the empty response
looking for Checks with failing health status, finding none, and then
marking a task/alloc as healthy.

In reality, we must always use Nomad's view of what checks should exist as
the source of truth, and compare that with the response Consul gives us,
making sure they match, before scanning the Consul response for failing
check statuses.

Fixes #15536

* consul: minor CR refactor using maps not sets

* consul: observe transition from healthy to unhealthy checks

* consul: spell healthy correctly