Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Large number of warning logs with nomad service provider #16238

Open
chenjpu opened this issue Feb 22, 2023 · 7 comments · Fixed by #16240
Open

Large number of warning logs with nomad service provider #16238

chenjpu opened this issue Feb 22, 2023 · 7 comments · Fixed by #16240
Assignees
Labels
hcc/jira stage/accepted Confirmed, and intend to work on. No timeline committment though. theme/service-discovery/nomad type/bug

Comments

@chenjpu
Copy link

chenjpu commented Feb 22, 2023

Nomad version

1.4.4

Issue

logs( journalctl -fu nomad )

 
Feb 22 11:19:06 ai-149 nomad[10066]: 2023-02-22T11:19:06.652+0800 [WARN]  watch.checks: watched check not found: check_id=90d04bc04f044aba3fe01081bff55de1
Feb 22 11:19:07 ai-149 nomad[10066]: 2023-02-22T11:19:07.653+0800 [WARN]  watch.checks: watched check not found: check_id=90d04bc04f044aba3fe01081bff55de1
Feb 22 11:19:08 ai-149 nomad[10066]: 2023-02-22T11:19:08.653+0800 [WARN]  watch.checks: watched check not found: check_id=90d04bc04f044aba3fe01081bff55de1
Feb 22 11:19:09 ai-149 nomad[10066]: 2023-02-22T11:19:09.654+0800 [WARN]  watch.checks: watched check not found: check_id=90d04bc04f044aba3fe01081bff55de1
Feb 22 11:19:10 ai-149 nomad[10066]: 2023-02-22T11:19:10.654+0800 [WARN]  watch.checks: watched check not found: check_id=90d04bc04f044aba3fe01081bff55de1
Feb 22 11:19:11 ai-149 nomad[10066]: 2023-02-22T11:19:11.654+0800 [WARN]  watch.checks: watched check not found: check_id=90d04bc04f044aba3fe01081bff55de1
Feb 22 11:19:12 ai-149 nomad[10066]: 2023-02-22T11:19:12.655+0800 [WARN]  watch.checks: watched check not found: check_id=90d04bc04f044aba3fe01081bff55de1
Feb 22 11:19:13 ai-149 nomad[10066]: 2023-02-22T11:19:13.655+0800 [WARN]  watch.checks: watched check not found: check_id=90d04bc04f044aba3fe01081bff55de1
Feb 22 11:19:14 ai-149 nomad[10066]: 2023-02-22T11:19:14.656+0800 [WARN]  watch.checks: watched check not found: check_id=90d04bc04f044aba3fe01081bff55de1
Feb 22 11:19:15 ai-149 nomad[10066]: 2023-02-22T11:19:15.656+0800 [WARN]  watch.checks: watched check not found: check_id=90d04bc04f044aba3fe01081bff55de1
Feb 22 11:19:16 ai-149 nomad[10066]: 2023-02-22T11:19:16.657+0800 [WARN]  watch.checks: watched check not found: check_id=90d04bc04f044aba3fe01081bff55de1
Feb 22 11:19:17 ai-149 nomad[10066]: 2023-02-22T11:19:17.657+0800 [WARN]  watch.checks: watched check not found: check_id=90d04bc04f044aba3fe01081bff55de1
Feb 22 11:19:18 ai-149 nomad[10066]: 2023-02-22T11:19:18.657+0800 [WARN]  watch.checks: watched check not found: check_id=90d04bc04f044aba3fe01081bff55de1
Feb 22 11:19:19 ai-149 nomad[10066]: 2023-02-22T11:19:19.658+0800 [WARN]  watch.checks: watched check not found: check_id=90d04bc04f044aba3fe01081bff55de1
Feb 22 11:19:20 ai-149 nomad[10066]: 2023-02-22T11:19:20.658+0800 [WARN]  watch.checks: watched check not found: check_id=90d04bc04f044aba3fe01081bff55de1
Feb 22 11:19:21 ai-149 nomad[10066]: 2023-02-22T11:19:21.659+0800 [WARN]  watch.checks: watched check not found: check_id=90d04bc04f044aba3fe01081bff55de1
Feb 22 11:19:22 ai-149 nomad[10066]: 2023-02-22T11:19:22.659+0800 [WARN]  watch.checks: watched check not found: check_id=90d04bc04f044aba3fe01081bff55de1
Feb 22 11:19:23 ai-149 nomad[10066]: 2023-02-22T11:19:23.660+0800 [WARN]  watch.checks: watched check not found: check_id=90d04bc04f044aba3fe01081bff55de1
Feb 22 11:19:24 ai-149 nomad[10066]: 2023-02-22T11:19:24.660+0800 [WARN]  watch.checks: watched check not found: check_id=90d04bc04f044aba3fe01081bff55de1
Feb 22 11:19:25 ai-149 nomad[10066]: 2023-02-22T11:19:25.661+0800 [WARN]  watch.checks: watched check not found: check_id=90d04bc04f044aba3fe01081bff55de1
Feb 22 11:19:26 ai-149 nomad[10066]: 2023-02-22T11:19:26.661+0800 [WARN]  watch.checks: watched check not found: check_id=90d04bc04f044aba3fe01081bff55de1
Feb 22 11:19:27 ai-149 nomad[10066]: 2023-02-22T11:19:27.662+0800 [WARN]  watch.checks: watched check not found: check_id=90d04bc04f044aba3fe01081bff55de1
Feb 22 11:19:28 ai-149 nomad[10066]: 2023-02-22T11:19:28.662+0800 [WARN]  watch.checks: watched check not found: check_id=90d04bc04f044aba3fe01081bff55de1
Feb 22 11:19:29 ai-149 nomad[10066]: 2023-02-22T11:19:29.663+0800 [WARN]  watch.checks: watched check not found: check_id=90d04bc04f044aba3fe01081bff55de1
Feb 22 11:19:30 ai-149 nomad[10066]: 2023-02-22T11:19:30.663+0800 [WARN]  watch.checks: watched check not found: check_id=90d04bc04f044aba3fe01081bff55de1
Feb 22 11:19:31 ai-149 nomad[10066]: 2023-02-22T11:19:31.664+0800 [WARN]  watch.checks: watched check not found: check_id=90d04bc04f044aba3fe01081bff55de1
Feb 22 11:19:32 ai-149 nomad[10066]: 2023-02-22T11:19:32.664+0800 [WARN]  watch.checks: watched check not found: check_id=90d04bc04f044aba3fe01081bff55de1
Feb 22 11:19:33 ai-149 nomad[10066]: 2023-02-22T11:19:33.665+0800 [WARN]  watch.checks: watched check not found: check_id=90d04bc04f044aba3fe01081bff55de1
Feb 22 11:19:34 ai-149 nomad[10066]: 2023-02-22T11:19:34.665+0800 [WARN]  watch.checks: watched check not found: check_id=90d04bc04f044aba3fe01081bff55de1
Feb 22 11:19:35 ai-149 nomad[10066]: 2023-02-22T11:19:35.665+0800 [WARN]  watch.checks: watched check not found: check_id=90d04bc04f044aba3fe01081bff55de1
Feb 22 11:19:36 ai-149 nomad[10066]: 2023-02-22T11:19:36.666+0800 [WARN]  watch.checks: watched check not found: check_id=90d04bc04f044aba3fe01081bff55de1
Feb 22 11:19:37 ai-149 nomad[10066]: 2023-02-22T11:19:37.667+0800 [WARN]  watch.checks: watched check not found: check_id=90d04bc04f044aba3fe01081bff55de1
Feb 22 11:19:38 ai-149 nomad[10066]: 2023-02-22T11:19:38.667+0800 [WARN]  watch.checks: watched check not found: check_id=90d04bc04f044aba3fe01081bff55de1
Feb 22 11:19:39 ai-149 nomad[10066]: 2023-02-22T11:19:39.668+0800 [WARN]  watch.checks: watched check not found: check_id=90d04bc04f044aba3fe01081bff55de1
Feb 22 11:19:40 ai-149 nomad[10066]: 2023-02-22T11:19:40.668+0800 [WARN]  watch.checks: watched check not found: check_id=90d04bc04f044aba3fe01081bff55de1
Feb 22 11:19:41 ai-149 nomad[10066]: 2023-02-22T11:19:41.669+0800 [WARN]  watch.checks: watched check not found: check_id=90d04bc04f044aba3fe01081bff55de1
Feb 22 11:19:42 ai-149 nomad[10066]: 2023-02-22T11:19:42.669+0800 [WARN]  watch.checks: watched check not found: check_id=90d04bc04f044aba3fe01081bff55de1
Feb 22 11:19:43 ai-149 nomad[10066]: 2023-02-22T11:19:43.670+0800 [WARN]  watch.checks: watched check not found: check_id=90d04bc04f044aba3fe01081bff55de1
Feb 22 11:19:44 ai-149 nomad[10066]: 2023-02-22T11:19:44.670+0800 [WARN]  watch.checks: watched check not found: check_id=90d04bc04f044aba3fe01081bff55de1
Feb 22 11:19:45 ai-149 nomad[10066]: 2023-02-22T11:19:45.671+0800 [WARN]  watch.checks: watched check not found: check_id=90d04bc04f044aba3fe01081bff55de1
Feb 22 11:19:46 ai-149 nomad[10066]: 2023-02-22T11:19:46.671+0800 [WARN]  watch.checks: watched check not found: check_id=90d04bc04f044aba3fe01081bff55de1
Feb 22 11:19:47 ai-149 nomad[10066]: 2023-02-22T11:19:47.671+0800 [WARN]  watch.checks: watched check not found: check_id=90d04bc04f044aba3fe01081bff55de1
Feb 22 11:19:48 ai-149 nomad[10066]: 2023-02-22T11:19:48.672+0800 [WARN]  watch.checks: watched check not found: check_id=90d04bc04f044aba3fe01081bff55de1
Feb 22 11:19:49 ai-149 nomad[10066]: 2023-02-22T11:19:49.672+0800 [WARN]  watch.checks: watched check not found: check_id=90d04bc04f044aba3fe01081bff55de1
Feb 22 11:19:50 ai-149 nomad[10066]: 2023-02-22T11:19:50.673+0800 [WARN]  watch.checks: watched check not found: check_id=90d04bc04f044aba3fe01081bff55de1
Feb 22 11:19:51 ai-149 nomad[10066]: 2023-02-22T11:19:51.673+0800 [WARN]  watch.checks: watched check not found: check_id=90d04bc04f044aba3fe01081bff55de1
Feb 22 11:19:52 ai-149 nomad[10066]: 2023-02-22T11:19:52.673+0800 [WARN]  watch.checks: watched check not found: check_id=90d04bc04f044aba3fe01081bff55de1
Feb 22 11:19:53 ai-149 nomad[10066]: 2023-02-22T11:19:53.673+0800 [WARN]  watch.checks: watched check not found: check_id=90d04bc04f044aba3fe01081bff55de1
Feb 22 11:19:54 ai-149 nomad[10066]: 2023-02-22T11:19:54.673+0800 [WARN]  watch.checks: watched check not found: check_id=90d04bc04f044aba3fe01081bff55de1
Feb 22 11:19:55 ai-149 nomad[10066]: 2023-02-22T11:19:55.674+0800 [WARN]  watch.checks: watched check not found: check_id=90d04bc04f044aba3fe01081bff55de1
Feb 22 11:19:56 ai-149 nomad[10066]: 2023-02-22T11:19:56.674+0800 [WARN]  watch.checks: watched check not found: check_id=90d04bc04f044aba3fe01081bff55de1
Feb 22 11:19:57 ai-149 nomad[10066]: 2023-02-22T11:19:57.675+0800 [WARN]  watch.checks: watched check not found: check_id=90d04bc04f044aba3fe01081bff55de1
Feb 22 11:19:58 ai-149 nomad[10066]: 2023-02-22T11:19:58.675+0800 [WARN]  watch.checks: watched check not found: check_id=90d04bc04f044aba3fe01081bff55de1
Feb 22 11:19:59 ai-149 nomad[10066]: 2023-02-22T11:19:59.675+0800 [WARN]  watch.checks: watched check not found: check_id=90d04bc04f044aba3fe01081bff55de1
Feb 22 11:20:00 ai-149 nomad[10066]: 2023-02-22T11:20:00.676+0800 [WARN]  watch.checks: watched check not found: check_id=90d04bc04f044aba3fe01081bff55de1
Feb 22 11:20:01 ai-149 nomad[10066]: 2023-02-22T11:20:01.676+0800 [WARN]  watch.checks: watched check not found: check_id=90d04bc04f044aba3fe01081bff55de1
Feb 22 11:20:02 ai-149 nomad[10066]: 2023-02-22T11:20:02.677+0800 [WARN]  watch.checks: watched check not found: check_id=90d04bc04f044aba3fe01081bff55de1
Feb 22 11:20:03 ai-149 nomad[10066]: 2023-02-22T11:20:03.677+0800 [WARN]  watch.checks: watched check not found: check_id=90d04bc04f044aba3fe01081bff55de1
Feb 22 11:20:04 ai-149 nomad[10066]: 2023-02-22T11:20:04.678+0800 [WARN]  watch.checks: watched check not found: check_id=90d04bc04f044aba3fe01081bff55de1
Feb 22 11:20:05 ai-149 nomad[10066]: 2023-02-22T11:20:05.678+0800 [WARN]  watch.checks: watched check not found: check_id=90d04bc04f044aba3fe01081bff55de1
Feb 22 11:20:06 ai-149 nomad[10066]: 2023-02-22T11:20:06.679+0800 [WARN]  watch.checks: watched check not found: check_id=90d04bc04f044aba3fe01081bff55de1
Feb 22 11:20:07 ai-149 nomad[10066]: 2023-02-22T11:20:07.680+0800 [WARN]  watch.checks: watched check not found: check_id=90d04bc04f044aba3fe01081bff55de1
Feb 22 11:20:08 ai-149 nomad[10066]: 2023-02-22T11:20:08.680+0800 [WARN]  watch.checks: watched check not found: check_id=90d04bc04f044aba3fe01081bff55de1
Feb 22 11:20:09 ai-149 nomad[10066]: 2023-02-22T11:20:09.681+0800 [WARN]  watch.checks: watched check not found: check_id=90d04bc04f044aba3fe01081bff55de1
Feb 22 11:20:10 ai-149 nomad[10066]: 2023-02-22T11:20:10.681+0800 [WARN]  watch.checks: watched check not found: check_id=90d04bc04f044aba3fe01081bff55de1
Feb 22 11:20:11 ai-149 nomad[10066]: 2023-02-22T11:20:11.682+0800 [WARN]  watch.checks: watched check not found: check_id=90d04bc04f044aba3fe01081bff55de1
Feb 22 11:20:12 ai-149 nomad[10066]: 2023-02-22T11:20:12.682+0800 [WARN]  watch.checks: watched check not found: check_id=90d04bc04f044aba3fe01081bff55de1


job.hcl

job "demo" {
  datacenters = ["dc1"]
  type        = "service"


  update {
    max_parallel     = 1
    min_healthy_time = "10s"
    healthy_deadline = "3m"
    auto_revert      = true
    auto_promote     = true
    canary           = 1
  }

  reschedule {
    attempts       = 15
    interval       = "1h"
    delay          = "15s"
    delay_function = "exponential"
    max_delay      = "120s"
    unlimited      = false
  }

  group "service" {

    scaling {
      enabled = true
      min     = 1
      max     = 3
    }

    restart {
      interval = "3m"
      attempts = 3
      delay    = "15s"
      mode     = "delay"
    }

    network {
      port "http" { to = 8080 }
    }

    service {
      name         = "test"
      port         = "http"
      address_mode = "host"
      tags = [ ]
      provider     = "nomad"
      check {
        type           = "http"
        port           = "http"
        path           = "/"
        interval       = "12s"
        timeout        = "6s"

        check_restart {
          limit           = 3
          grace           = "10s"
        }
      }
    }

    task "app" {
      driver = "docker"
      config {
        image   = "...."
        command = "python"
        ports   = ["http"]
        args = [

        ]
      }
    }
  }
}

@chenjpu chenjpu changed the title Large number of warning logs with nomad service prrovider Large number of warning logs with nomad service provider Feb 22, 2023
@shoenig shoenig added stage/accepted Confirmed, and intend to work on. No timeline committment though. theme/service-discovery/nomad labels Feb 22, 2023
@shoenig shoenig self-assigned this Feb 22, 2023
@shoenig
Copy link
Member

shoenig commented Feb 22, 2023

Thanks for reporting @chenjpu! Indeed I was able to reproduce this, but with the service belonging to a task rather than the group - any chance your job was actually configured that way too?

Fix in #16240

@chenjpu
Copy link
Author

chenjpu commented Mar 1, 2023

Thanks for reporting @chenjpu! Indeed I was able to reproduce this, but with the service belonging to a task rather than the group - any chance your job was actually configured that way too?

Fix in #16240
Yes, my service is configured on the goup

@chenjpu
Copy link
Author

chenjpu commented Mar 1, 2023

@shoenig
When I upgraded to 1.5.0-rc1, the above problems still remained

@shoenig
Copy link
Member

shoenig commented Mar 1, 2023

Whelp I guess there was more than one bug then 😬

I'll try and reproduce again with a job structured exactly like yours @chenjpu

@shoenig shoenig reopened this Mar 1, 2023
@tgross tgross added this to Needs Roadmapping in Nomad - Community Issues Triage Mar 24, 2023
shoenig added a commit that referenced this issue May 12, 2023
This PR fixes a bug where issuing a restart to a terminal allocation
would cause the allocation to run its hooks anyway. This was particularly
apparent with group_service_hook who would then register services but
then never deregister them - as the allocation would be effectively in
a "zombie" state where it is prepped to run tasks but never will.

Fixes #17079
Fixes #16238
Fixes #14618
shoenig added a commit that referenced this issue May 12, 2023
This PR fixes a bug where issuing a restart to a terminal allocation
would cause the allocation to run its hooks anyway. This was particularly
apparent with group_service_hook who would then register services but
then never deregister them - as the allocation would be effectively in
a "zombie" state where it is prepped to run tasks but never will.

Fixes #17079
Fixes #16238
Fixes #14618
shoenig added a commit that referenced this issue May 15, 2023
This PR fixes a bug where issuing a restart to a terminal allocation
would cause the allocation to run its hooks anyway. This was particularly
apparent with group_service_hook who would then register services but
then never deregister them - as the allocation would be effectively in
a "zombie" state where it is prepped to run tasks but never will.

Fixes #17079
Fixes #16238
Fixes #14618
shoenig added a commit that referenced this issue May 15, 2023
This PR fixes a bug where issuing a restart to a terminal allocation
would cause the allocation to run its hooks anyway. This was particularly
apparent with group_service_hook who would then register services but
then never deregister them - as the allocation would be effectively in
a "zombie" state where it is prepped to run tasks but never will.

Fixes #17079
Fixes #16238
Fixes #14618
@seanamos
Copy link

seanamos commented Sep 6, 2023

I am experiencing a tsunami of these logs with Nomad 1.6.1. It's also generating a large amount of request error logs in Consul.

Example consul log:

/v1/agent/check/register ServiceID "_nomad-task-9ec1786d-e9b1-db33-a381-c748bf033163-group-xxx-xxx-xxx-xxx-xxx-xxx-80" does not exist

This seems to have started with a task that failed to start due to a missing vault secret. It's now generating hundreds of thousands of logs between nomad and consul.

The job itself has two groups, each with different vault policies. Each group contains 1 service, 1 task. Only the second group has a service check, it is this check that is causing all the errors.

The issue causing the task to fail has been resolved, however, the log tsunami persists.

In the short term, is there a workaround to get the logs to stop, short of restoring to an older snapshot?

@seanamos
Copy link

seanamos commented Sep 6, 2023

For others running into this and need a quick way to stop the influx of logs, restart the nomad client.

For us, we ran an instance refresh across the affected autoscaling group.

@lattwood
Copy link
Contributor

We're also running into this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
hcc/jira stage/accepted Confirmed, and intend to work on. No timeline committment though. theme/service-discovery/nomad type/bug
Projects
Status: Needs Roadmapping
Development

Successfully merging a pull request may close this issue.

5 participants