Tasks may fail on reconnect if alloc and task runners run before client connects to servers #15139

Closed
lgfa29 opened this issue Nov 3, 2022 · 1 comment · Fixed by #15140
Labels
stage/accepted Confirmed, and intend to work on. No timeline commitment though. theme/client type/bug

Comments

@lgfa29
Contributor

lgfa29 commented Nov 3, 2022

Nomad version

Nomad v1.4.1 (2aa7e66bdb526e25f59883952d74dad7ea9a014e)

Operating system and Environment details

macOS

Issue

Jobs that need alloc or task runner hooks that communicate with Nomad servers (such as Nomad Native Service Discovery) may fail on reconnect if these hooks run before the client establishes a connection with the server.

Reproduction steps

  1. Start a Nomad cluster with two clients
  2. Run the job below
  3. Make sure each client has one allocation, then stop one of the clients
  4. Wait for that client's allocation to become unknown
  5. Restart the client

This is a race condition problem, so you may need to try this a few times 😅
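
Since the steps may need several attempts, a small helper for watching allocation states can save some clicking. The Go sketch below lists the client status of each allocation of the `example` job through the HTTP API on the server's HTTP port (4646 in the sample configuration). It assumes the allocation list entries carry `ID` and `ClientStatus` fields; `parseAllocStatuses` and `fetchAllocStatuses` are names made up for this sketch.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
)

// allocStub holds the two fields this sketch reads from each entry of the
// job's allocation list; all other fields are ignored.
type allocStub struct {
	ID           string
	ClientStatus string
}

// parseAllocStatuses maps allocation ID to client status
// (for example "running", "unknown", or "failed").
func parseAllocStatuses(body []byte) (map[string]string, error) {
	var stubs []allocStub
	if err := json.Unmarshal(body, &stubs); err != nil {
		return nil, fmt.Errorf("decoding allocation list: %w", err)
	}
	statuses := make(map[string]string, len(stubs))
	for _, s := range stubs {
		statuses[s.ID] = s.ClientStatus
	}
	return statuses, nil
}

// fetchAllocStatuses queries /v1/job/<jobID>/allocations on the given agent.
func fetchAllocStatuses(addr, jobID string) (map[string]string, error) {
	resp, err := http.Get(addr + "/v1/job/" + jobID + "/allocations")
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("unexpected response: %s", resp.Status)
	}
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return nil, err
	}
	return parseAllocStatuses(body)
}

func main() {
	statuses, err := fetchAllocStatuses("http://127.0.0.1:4646", "example")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for id, status := range statuses {
		fmt.Printf("%s %s\n", id, status)
	}
}
```

Run it after step 3 and again after step 5: the allocation on the stopped client should show up as unknown first, and then either running (expected) or failed (the bug).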

Expected Result

Allocation reconnects properly.

Actual Result

Allocation fails on reconnect.

Job file (if appropriate)

job "example" {
  datacenters = ["dc1"]

  group "cache" {
    count = 2
    max_client_disconnect = "5m"

    network {
      port "db" {
        to = 6379
      }
    }

    service {
      name     = "redis"
      port     = "db"
      provider = "nomad"
    }

    task "redis" {
      driver = "docker"

      config {
        image          = "redis:6"
        ports          = ["db"]
        auth_soft_fail = true
      }
    }
  }
}

Sample configuration files

name      = "server1"
data_dir  = "/tmp/nomad/server"
log_level = "TRACE"

ports {
  http = 4646
  rpc  = 4647
  serf = 4648
}

server {
  enabled           = true
  bootstrap_expect  = 1
  heartbeat_grace   = "1s"
  min_heartbeat_ttl = "5s"
}

client {
  enabled = true
}

name      = "client1"
data_dir  = "/tmp/nomad/client1"
log_level = "TRACE"

ports {
  http = 4656
  rpc  = 4657
  serf = 4658
}

server {
  enabled = false
}

client {
  enabled = true

  server_join {
    retry_join = ["127.0.0.1"]
  }
}

Nomad Client logs (if appropriate)

    2022-11-03T19:03:37.363-0400 [DEBUG] client: registration waiting on servers
    2022-11-03T19:03:37.363-0400 [ERROR] client: error discovering nomad servers: error="client.consul: unable to query Consul datacenters: Get \"http://127.0.0.1:8500/v1/catalog/datacenters\": dial tcp 127.0.0.1:8500: connect: connection refused"
    2022-11-03T19:03:37.364-0400 [TRACE] client.alloc_runner.task_coordinator: state transition: alloc_id=61d26e2f-3f16-5149-614a-c23f1c0199fd from=init to=init
    2022-11-03T19:03:37.372-0400 [TRACE] client.alloc_runner.task_coordinator: state transition: alloc_id=61d26e2f-3f16-5149-614a-c23f1c0199fd from=init to=prestart
    2022-11-03T19:03:37.372-0400 [TRACE] client.alloc_runner.task_coordinator: state transition: alloc_id=61d26e2f-3f16-5149-614a-c23f1c0199fd from=prestart to=main
    2022-11-03T19:03:37.372-0400 [TRACE] client.alloc_runner.task_coordinator: state transition: alloc_id=61d26e2f-3f16-5149-614a-c23f1c0199fd from=main to=poststart
    2022-11-03T19:03:37.372-0400 [TRACE] client.alloc_runner.task_coordinator: state transition: alloc_id=61d26e2f-3f16-5149-614a-c23f1c0199fd from=poststart to=wait_alloc
    2022-11-03T19:03:37.372-0400 [INFO]  client: started client: node_id=26683602-1db8-25b6-e1d8-169a28aac063
    2022-11-03T19:03:37.372-0400 [TRACE] client.alloc_runner: running pre-run hooks: alloc_id=61d26e2f-3f16-5149-614a-c23f1c0199fd start="2022-11-03 19:03:37.372683 -0400 EDT m=+6.217725751"
    2022-11-03T19:03:37.372-0400 [TRACE] client.alloc_runner: running pre-run hook: alloc_id=61d26e2f-3f16-5149-614a-c23f1c0199fd name=alloc_dir start="2022-11-03 19:03:37.372707 -0400 EDT m=+6.217749460"
    2022-11-03T19:03:37.372-0400 [TRACE] consul.sync: commit sync operations: ops="<1, 1, 0, 0>"
    2022-11-03T19:03:37.372-0400 [WARN]  client.server_mgr: no servers available
    2022-11-03T19:03:37.372-0400 [TRACE] client.alloc_runner: finished pre-run hook: alloc_id=61d26e2f-3f16-5149-614a-c23f1c0199fd name=alloc_dir end="2022-11-03 19:03:37.37292 -0400 EDT m=+6.217962335" duration="212.875µs"
    2022-11-03T19:03:37.372-0400 [TRACE] client.alloc_runner: running pre-run hook: alloc_id=61d26e2f-3f16-5149-614a-c23f1c0199fd name=cgroup start="2022-11-03 19:03:37.372925 -0400 EDT m=+6.217967876"
    2022-11-03T19:03:37.372-0400 [TRACE] client.alloc_runner: finished pre-run hook: alloc_id=61d26e2f-3f16-5149-614a-c23f1c0199fd name=cgroup end="2022-11-03 19:03:37.372928 -0400 EDT m=+6.217970168" duration="2.292µs"
    2022-11-03T19:03:37.372-0400 [TRACE] client.alloc_runner: running pre-run hook: alloc_id=61d26e2f-3f16-5149-614a-c23f1c0199fd name=await_previous_allocations start="2022-11-03 19:03:37.37293 -0400 EDT m=+6.217972251"
    2022-11-03T19:03:37.372-0400 [TRACE] client.alloc_runner: finished pre-run hook: alloc_id=61d26e2f-3f16-5149-614a-c23f1c0199fd name=await_previous_allocations end="2022-11-03 19:03:37.372932 -0400 EDT m=+6.217974126" duration="1.875µs"
    2022-11-03T19:03:37.372-0400 [TRACE] client.alloc_runner: running pre-run hook: alloc_id=61d26e2f-3f16-5149-614a-c23f1c0199fd name=migrate_disk start="2022-11-03 19:03:37.372938 -0400 EDT m=+6.217980085"
    2022-11-03T19:03:37.372-0400 [TRACE] client.alloc_runner: finished pre-run hook: alloc_id=61d26e2f-3f16-5149-614a-c23f1c0199fd name=migrate_disk end="2022-11-03 19:03:37.372943 -0400 EDT m=+6.217985168" duration="5.083µs"
    2022-11-03T19:03:37.372-0400 [TRACE] client.alloc_runner: running pre-run hook: alloc_id=61d26e2f-3f16-5149-614a-c23f1c0199fd name=alloc_health_watcher start="2022-11-03 19:03:37.372945 -0400 EDT m=+6.217987543"
    2022-11-03T19:03:37.372-0400 [TRACE] client.alloc_runner.runner_hook.alloc_health_watcher: not watching; already has health set: alloc_id=61d26e2f-3f16-5149-614a-c23f1c0199fd
    2022-11-03T19:03:37.372-0400 [TRACE] client.alloc_runner: finished pre-run hook: alloc_id=61d26e2f-3f16-5149-614a-c23f1c0199fd name=alloc_health_watcher end="2022-11-03 19:03:37.372948 -0400 EDT m=+6.217990543" duration="3µs"
    2022-11-03T19:03:37.372-0400 [TRACE] client.alloc_runner: running pre-run hook: alloc_id=61d26e2f-3f16-5149-614a-c23f1c0199fd name=network start="2022-11-03 19:03:37.37295 -0400 EDT m=+6.217992835"
    2022-11-03T19:03:37.372-0400 [TRACE] client.alloc_runner: finished pre-run hook: alloc_id=61d26e2f-3f16-5149-614a-c23f1c0199fd name=network end="2022-11-03 19:03:37.372952 -0400 EDT m=+6.217994418" duration="1.583µs"
    2022-11-03T19:03:37.372-0400 [TRACE] client.alloc_runner: running pre-run hook: alloc_id=61d26e2f-3f16-5149-614a-c23f1c0199fd name=group_services start="2022-11-03 19:03:37.372954 -0400 EDT m=+6.217996710"
    2022-11-03T19:03:37.372-0400 [DEBUG] http: UI is enabled
    2022-11-03T19:03:37.372-0400 [WARN]  client.server_mgr: no servers available
    2022-11-03T19:03:37.372-0400 [TRACE] client.alloc_runner: finished pre-run hooks: alloc_id=61d26e2f-3f16-5149-614a-c23f1c0199fd end="2022-11-03 19:03:37.372995 -0400 EDT m=+6.218037835" duration="312.084µs"
    2022-11-03T19:03:37.372-0400 [ERROR] client.alloc_runner: prerun failed: alloc_id=61d26e2f-3f16-5149-614a-c23f1c0199fd error="pre-run hook \"group_services\" failed: no servers"
    2022-11-03T19:03:37.373-0400 [DEBUG] http: UI is enabled
    2022-11-03T19:03:37.373-0400 [INFO]  agent.joiner: starting retry join: servers=127.0.0.1
    2022-11-03T19:03:37.374-0400 [DEBUG] client.server_mgr: new server list: new_servers=[127.0.0.1:4647] old_servers=[]
    2022-11-03T19:03:37.374-0400 [INFO]  agent.joiner: retry join completed: initial_servers=1 agent_mode=client

github-actions bot commented Mar 5, 2023

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Mar 5, 2023