Nomad server doesn't accept connections after some time #8038

Closed
pznamensky opened this issue May 21, 2020 · 5 comments

@pznamensky

Nomad version

Nomad v0.11.2 (807cfeb)

Operating system and Environment details

CentOS 7.8

Issue

We've got a Nomad cluster with 3 server nodes. Every few days, one of the Nomad servers stops accepting connections:

srv2~ $ nomad status
Error querying jobs: Get "https://127.0.0.1:4646/v1/jobs": net/http: TLS handshake timeout
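
A "TLS handshake timeout" means the TCP connection was established but the TLS handshake did not complete in time. A minimal Python sketch of a probe that separates the two stages is shown here; the function name `probe_tls` and its return labels are hypothetical, and certificate verification is disabled purely so the probe exercises the handshake itself (an assumption that suits diagnostics only).

```python
import socket
import ssl


def probe_tls(host, port, timeout=5.0):
    """Classify a TLS endpoint as 'ok', 'tcp_failed',
    'tls_handshake_timeout', or 'tls_error'. Hypothetical helper."""
    try:
        # Stage 1: plain TCP connect. The kernel can complete this from the
        # listen backlog even if the process never calls accept().
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        return "tcp_failed"

    ctx = ssl.create_default_context()
    # Diagnostics only: skip certificate checks so that only the handshake
    # itself is being tested.
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE

    try:
        with sock:
            sock.settimeout(timeout)
            # Stage 2: the handshake runs inside wrap_socket().
            with ctx.wrap_socket(sock):
                return "ok"
    except socket.timeout:
        return "tls_handshake_timeout"
    except (ssl.SSLError, OSError):
        return "tls_error"
```

For example, `probe_tls("127.0.0.1", 4646)` against a server in the state described above would be expected to report a handshake timeout rather than a TCP failure, although that has not been verified against an affected server.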

And the other servers mark that node as left:

~ $ nomad server members
Name                           Address                         Port  Status  Leader  Protocol  Build   Datacenter  Region
srv1.global  <ip>  4648  alive   false   2         0.11.2  staging     global
srv2.global  <ip>  4648  left    false   2         0.11.2  staging     global
srv3.global  <ip>  4648  alive   true    2         0.11.2  staging     global
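
For monitoring, the table above can be checked automatically. The following is a rough sketch that parses text in the same column layout as the `nomad server members` output shown here and lists members whose status is not `alive`; the function name `non_alive_members` is hypothetical, and the parser assumes that no column value contains whitespace.

```python
def non_alive_members(output):
    """Return (name, status) pairs for members whose Status is not 'alive'.

    Assumes whitespace-separated columns with a header row containing
    'Name' and 'Status', as in the output quoted in this issue.
    """
    lines = [ln for ln in output.splitlines() if ln.strip()]
    if not lines:
        return []
    header = lines[0].split()
    name_idx = header.index("Name")
    status_idx = header.index("Status")

    result = []
    for ln in lines[1:]:
        cols = ln.split()
        if len(cols) <= max(name_idx, status_idx):
            continue  # skip malformed or truncated rows
        if cols[status_idx] != "alive":
            result.append((cols[name_idx], cols[status_idx]))
    return result
```

Fed the table from this report, the function returns only the `srv2.global` entry with status `left`.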

I tried to trace the unresponsive nomad process with strace, but the only system calls it made were epoll_pwait, nanosleep, sched_yield and futex.
The previous release (0.10.5) seems to work fine. Nomad client agents have worked well so far.

Reproduction steps

Set up a Nomad cluster and wait several days :)

Nomad Server config

datacenter = "staging"
data_dir = "/var/lib/nomad"
bind_addr = "::"
enable_syslog = true

server {
    enabled = true
    bootstrap_expect = 3

    retry_join = ["srv2:4648","srv3:4648"]
    retry_interval = "15s"
}

client {
    enabled = false
}

advertise {
    http = "<ip>:4646"
    rpc  = "<ip>:4647"
    serf = "<ip>:4648"
}

consul {
   server_auto_join = false
   client_auto_join = false
   token = "<token>"
}

tls {
   http = true
   rpc  = true

   ca_file   = "/etc/nomad.d/nomad-ca.crt"
   cert_file = "/etc/nomad.d/server.global.nomad.crt"
   key_file  = "/etc/nomad.d/server.global.nomad.private.key"

   verify_server_hostname = true
   verify_https_client    = false
}

acl {
   enabled = true
   token_ttl = "60s"
   policy_ttl = "60s"
}

log_level = "DEBUG"

Nomad Server logs

The last lines on the failed server:
May 21 15:13:11 srv2 nomad: 2020-05-21T15:13:11.401+0300 [DEBUG] http: request complete: method=GET path=/v1/allocation/9c9729b0-c6fd-eafb-ffce-127742366060 duration=18.652021ms
May 21 15:13:11 srv2 nomad[22722]: http: request complete: method=GET path=/v1/allocation/56d54b1f-2462-1a5f-b493-ac873d35bc92 duration=17.652458ms
May 21 15:13:11 srv2 nomad[22722]: http: request complete: method=GET path=/v1/allocation/805e9147-c708-1fed-9e10-9b31a27c7ade duration=17.441663ms
May 21 15:13:11 srv2 nomad[22722]: http: request complete: method=GET path=/v1/allocation/30d5c473-3ffc-968a-31b6-5610b1e211fc duration=17.117029ms
May 21 15:13:11 srv2 nomad: 2020-05-21T15:13:11.402+0300 [DEBUG] http: request complete: method=GET path=/v1/allocation/082342be-41ad-12cb-5933-18b85c4a8c0c duration=18.67848ms
May 21 15:13:11 srv2 nomad: 2020-05-21T15:13:11.404+0300 [DEBUG] http: request complete: method=GET path=/v1/allocation/ec8c15f8-0d02-f143-77f6-f57e963ab6b9 duration=21.744965ms
May 21 15:13:11 srv2 nomad[22722]: http: request complete: method=GET path=/v1/node/c7533df9-ed3c-de50-db4d-a12e31cfeffe duration=9.68731ms
May 21 15:13:11 srv2 nomad[22722]: http: request complete: method=GET path=/v1/node/12cb5e69-dc66-1411-2f18-048edfb0c2ac duration=16.544546ms
May 21 15:13:11 srv2 nomad[22722]: http: request complete: method=GET path=/v1/allocation/0b5dd865-4e3b-b444-f26f-b9faa8c0241e duration=18.700859ms
May 21 15:13:11 srv2 nomad[22722]: http: request complete: method=GET path=/v1/allocation/9c9729b0-c6fd-eafb-ffce-127742366060 duration=18.652021ms
May 21 15:13:11 srv2 nomad[22722]: http: request complete: method=GET path=/v1/allocation/082342be-41ad-12cb-5933-18b85c4a8c0c duration=18.67848ms
May 21 15:13:11 srv2 nomad[22722]: http: request complete: method=GET path=/v1/allocation/ec8c15f8-0d02-f143-77f6-f57e963ab6b9 duration=21.744965ms
Corresponding logs on another server:
May 21 15:13:20 srv1 nomad: 2020-05-21T15:13:20.225+0300 [DEBUG] nomad: memberlist: Failed ping: srv2.global (timeout reached)
May 21 15:13:20 srv1 nomad[22533]: nomad: memberlist: Failed ping: srv2.global (timeout reached)

I understand that this is probably not enough diagnostic information. I can provide more if you let me know what else would be useful.

@zyclonite

I'm having a similar issue: it happens randomly after a few days, and one CPU core goes up to 100%.
This happens with all 0.11.x versions, but I could not reproduce it on demand.

@pznamensky
Author

@schmichael any chance this behaviour will be fixed in 0.11.3?
I would say it's a critical bug.
We had to roll back our cluster to 0.10, but we still hope the issue will be fixed in 0.11.3.

@pznamensky
Author

After sending SIGABRT (09:03:54 in the log) to the unresponsive process, I got this log:
nomad.log
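
When a Go program receives SIGABRT, the runtime prints a stack trace for every goroutine, each introduced by a header such as `goroutine 18 [chan receive, 5 minutes]:`. Such dumps are long, so a quick tally of goroutine states can help spot where a process is stuck. The sketch below is one way to do that; the function name `summarize_goroutines` is hypothetical, and it relies only on the standard header format, not on anything specific to the attached nomad.log.

```python
import re
from collections import Counter

# Matches Go runtime goroutine headers, e.g.
#   goroutine 1 [running]:
#   goroutine 18 [chan receive, 5 minutes]:
_HEADER = re.compile(r"^goroutine (\d+) \[([^\]]+)\]:$")
_MINUTES = re.compile(r"^(\d+) minutes$")


def summarize_goroutines(dump_text):
    """Return (count per state, longest wait in minutes per state)."""
    states = Counter()
    longest = {}
    for line in dump_text.splitlines():
        m = _HEADER.match(line.strip())
        if not m:
            continue
        parts = [p.strip() for p in m.group(2).split(",")]
        state = parts[0]
        states[state] += 1
        minutes = 0
        for extra in parts[1:]:
            mm = _MINUTES.match(extra)
            if mm:
                minutes = int(mm.group(1))
        longest[state] = max(longest.get(state, 0), minutes)
    return states, longest
```

Many goroutines in the same blocked state for a long time would be a hint worth following up, though the tally alone does not identify a root cause.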

@pznamensky
Author

Can't reproduce on 0.11.3. See #8163 (comment)

@github-actions

github-actions bot commented Nov 6, 2022

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 6, 2022