Nodes sporadically marked as down, causing allocation loss #3595

Closed
lovwal opened this issue Nov 28, 2017 · 15 comments

@lovwal

lovwal commented Nov 28, 2017

Nomad version

Nomad v0.7.0

Operating system and Environment details

Centos 7.4 3.10.0-693.el7.x86_64

Issue

Nodes are sporadically marked as down, causing allocations to be lost:

nomad node-status -self
ID = 25e16dd3
Name = -
Class =
DC = thule
Drain = false
Status = ready
Drivers = docker,exec,java
Uptime = 191h9m11s

Allocated Resources
CPU Memory Disk IOPS
0/15594 MHz 0 B/20 GiB 0 B/100 GiB 0/0

Allocation Resource Utilization
CPU Memory
0/15594 MHz 0 B/20 GiB

Host Resource Utilization
CPU Memory Disk
13419/15594 MHz 18 GiB/20 GiB 4.7 GiB/100 GiB

Allocations
ID Node ID Task Group Version Desired Status Created At
9bdaf6a0 25e16dd3 cadvisor 0 run pending 11/28/17 16:04:02 UTC
feab91c0 25e16dd3 cadvisor 0 stop pending 11/28/17 16:02:47 UTC
205dafc2 25e16dd3 cadvisor 0 stop complete 11/28/17 16:01:41 UTC
f17e652b 25e16dd3 cadvisor 0 stop complete 11/28/17 16:00:46 UTC
d2c047b9 25e16dd3 cadvisor 0 stop complete 11/28/17 15:59:58 UTC
492c658e 25e16dd3 cadvisor 0 stop complete 11/28/17 15:59:18 UTC
a6b62701 25e16dd3 cadvisor 0 stop complete 11/28/17 15:58:42 UTC
214a4eb2 25e16dd3 cadvisor 0 stop complete 11/28/17 15:58:12 UTC
6de2b900 25e16dd3 cadvisor 0 stop complete 11/28/17 15:57:43 UTC
6404e852 25e16dd3 cadvisor 0 stop complete 11/28/17 15:57:17 UTC
1f5ea2c4 25e16dd3 cadvisor 0 stop complete 11/28/17 15:56:48 UTC
...

Moments later:
nomad node-status -self

error fetching node stats (HINT: ensure Client.Advertise.HTTP is set): node down
ID = 25e16dd3
Name = -
Class =
DC = thule
Drain = false
Status = down
Drivers = docker,exec,java

Allocated Resources
CPU Memory Disk IOPS
0/15594 MHz 0 B/20 GiB 0 B/100 GiB 0/0

Allocation Resource Utilization
CPU Memory
0/15594 MHz 0 B/20 GiB

error fetching node stats (HINT: ensure Client.Advertise.HTTP is set): actual resource usage not present

Allocations
ID Node ID Task Group Version Desired Status Created At
9bdaf6a0 25e16dd3 cadvisor 0 stop lost 11/28/17 16:04:02 UTC
feab91c0 25e16dd3 cadvisor 0 stop lost 11/28/17 16:02:47 UTC
205dafc2 25e16dd3 cadvisor 0 stop complete 11/28/17 16:01:41 UTC
f17e652b 25e16dd3 cadvisor 0 stop complete 11/28/17 16:00:46 UTC
d2c047b9 25e16dd3 cadvisor 0 stop complete 11/28/17 15:59:58 UTC
492c658e 25e16dd3 cadvisor 0 stop complete 11/28/17 15:59:18 UTC
a6b62701 25e16dd3 cadvisor 0 stop complete 11/28/17 15:58:42 UTC
214a4eb2 25e16dd3 cadvisor 0 stop complete 11/28/17 15:58:12 UTC
6de2b900 25e16dd3 cadvisor 0 stop complete 11/28/17 15:57:43 UTC
6404e852 25e16dd3 cadvisor 0 stop complete 11/28/17 15:57:17 UTC
1f5ea2c4 25e16dd3 cadvisor 0 stop complete 11/28/17 15:56:48 UTC
...

Worth mentioning: we saw CPU utilization on the affected node reach ~90% (this might be the actual cause) and memory utilization climb steadily until the Nomad agent was OOM-killed. We've seen this occur 2-3 times since upgrading to 0.7.0.

Reproduction steps

Unclear.

Nomad Client logs (if appropriate)

The agent was only reloaded with DEBUG-level logging while the issue was ongoing, so unfortunately I don't think the attached logs will provide much.
nomad-logs.txt

Please let me know what further information to collect and I'll grab it next time we see this occur.

@chelseakomlo
Contributor

Hi, thanks for reporting this issue. Have you seen this issue in the past only when clients are at high CPU utilization? Any more logs that you have for this issue would be helpful.

@lovwal
Author

lovwal commented Nov 29, 2017

I looked through old metrics from the last occurrence; it looks like the issue wasn't related to CPU utilization that time. Note, however, that Nomad was restarted before being OOM-killed on that occasion.
Unfortunately, those were all the logs I had from the node. One of the server members reported missed heartbeats from the affected node:
Nov 28 15:56:43 thu-ctrl5 nomad[637]: 2017/11/28 15:56:43.924767 [WARN] nomad.heartbeat: node '25e16dd3-8013-9436-0d3b-732ecb1af322' TTL expired
...
Nov 28 16:04:25 thu-ctrl5 nomad[637]: 2017/11/28 16:04:25.897615 [WARN] nomad.heartbeat: node '25e16dd3-8013-9436-0d3b-732ecb1af322' TTL expired

But that's all I got unfortunately.

@TheSpbra1n

Caught the same error.
Some logs from around that time:

Nov 29 10:49:20 docker-01 nomad[9241]:     2017/11/29 10:49:20.828705 [DEBUG] http: Request /v1/agent/health?type=client (346.809µs)
Nov 29 10:49:22 docker-01 nomad[9241]:     2017/11/29 10:49:22.049446 [DEBUG] client: RPC failed to server [xxx]:4647: rpc error: EOF
Nov 29 10:49:22 docker-01 nomad[9241]:     2017/11/29 10:49:22 [ERR] yamux: keepalive failed: session shutdown
Nov 29 10:49:22 docker-01 nomad[9241]:     2017/11/29 10:49:22.049486 [DEBUG] client: RPC failed to server [xxx]:4647: rpc error: EOF
Nov 29 10:49:22 docker-01 nomad[9241]:     2017/11/29 10:49:22.065911 [DEBUG] client: updated allocations at index 2655224 (total 98) (pulled 1) (filtered 97)
Nov 29 10:49:22 docker-01 nomad[9241]:     2017/11/29 10:49:22.066198 [DEBUG] client: allocs: (added 0) (removed 0) (updated 1) (ignore 97)

@dadgar
Contributor

dadgar commented Dec 7, 2017

Hey all,

The heartbeat loop runs contention-free in its own goroutine, so the fact that it missed its heartbeat may mean the server was so overloaded that it failed to send its response for quite a while! I added some logging that will be in 0.7.1 and will help us dig further, so please update to that release when you get a chance and report any further failures along with those logs!
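
For context, the client side is roughly a loop of the following shape. This is a minimal sketch, not the actual Nomad source; the Client type, shutdownCh field, and rpcHeartbeat helper are stand-ins invented for this example.

package main

import (
	"log"
	"time"
)

// Client is an illustrative stand-in for the Nomad client agent.
type Client struct {
	shutdownCh chan struct{}
}

// rpcHeartbeat stands in for the real heartbeat RPC to the servers; here it
// just pretends the server answered immediately and returned a new TTL.
func (c *Client) rpcHeartbeat() (time.Duration, error) {
	return 10 * time.Second, nil
}

// heartbeatLoop runs in its own goroutine, independent of allocation work.
// If more time than the server-assigned TTL passes between two successful
// heartbeats, the server has already expired the node and marked it down;
// that is the situation the "heartbeat missed" warning added in 0.7.1
// reports (see the logs later in this thread).
func (c *Client) heartbeatLoop() {
	ttl := 10 * time.Second
	last := time.Now()
	for {
		select {
		case <-c.shutdownCh:
			return
		case <-time.After(ttl / 2):
			start := time.Now()
			newTTL, err := c.rpcHeartbeat()
			took := time.Since(start)
			if err != nil {
				log.Printf("[ERR] client: heartbeat failed: %v", err)
				continue
			}
			if since := time.Since(last); since > ttl {
				log.Printf("[WARN] client: heartbeat missed (request took %s). Heartbeat TTL was %s and heartbeated after %s", took, ttl, since)
			}
			last = time.Now()
			ttl = newTTL
		}
	}
}

func main() {
	c := &Client{shutdownCh: make(chan struct{})}
	go c.heartbeatLoop()
	time.Sleep(time.Second)
	close(c.shutdownCh)
}

The point is that the loop itself is cheap; when the warning fires, the delay was in the RPC round trip or in the machine being starved or asleep, not in competing client work.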

@dadgar
Contributor

dadgar commented Mar 12, 2018

Closing due to #3890

dadgar closed this as completed Mar 12, 2018
@capone212
Contributor

capone212 commented May 22, 2018

Hi all, I have this issue on a cluster with Windows boxes.
I managed to reproduce it reliably. The simplest steps are:

  1. Run one stand-alone node (server + client, bootstrap_expect = 1)
  2. Run some tasks
  3. Hibernate (sleep mode) Windows for 30 minutes
  4. Wake Windows up

At this point the node status constantly flaps between ready and down. The issue goes away after a Nomad restart.

@capone212
Contributor

Please find the Nomad logs here:
https://gist.github.com/capone212/4a36e98aca87bfadff3b4399363ff691

The problem starts at:
2018/05/22 16:06:32.751737 [WARN] nomad.heartbeat: node '5c645ff5-528a-9e2b-744d-6e6165eedef4' TTL expired

@capone212
Contributor

@dadgar please consider reopening this issue

@capone212
Contributor

2018/05/22 16:06:42.288382 [WARN] client: heartbeat missed (request took 35.3634ms). Heartbeat TTL was 9.500984734s and heartbeated after 9.5366451s
2018/05/22 16:06:52.616929 [WARN] client: heartbeat missed (request took 31.0827ms). Heartbeat TTL was 10.296911237s and heartbeated after 10.3285475s
2018/05/22 16:07:04.891458 [WARN] client: heartbeat missed (request took 18.0339ms). Heartbeat TTL was 12.256493582s and heartbeated after 12.274529s
2018/05/22 16:07:15.195204 [WARN] client: heartbeat missed (request took 30.3297ms). Heartbeat TTL was 10.273167754s and heartbeated after 10.3037461s
2018/05/22 16:07:29.220006 [WARN] client: heartbeat missed (request took 34.3677ms). Heartbeat TTL was 13.989477386s and heartbeated after 14.0248014s
2018/05/22 16:07:46.411138 [WARN] client: heartbeat missed (request took 17.0588ms). Heartbeat TTL was 8.415857593s and heartbeated after 8.4330008s
2018/05/22 16:08:07.818020 [WARN] client: heartbeat missed (request took 24.6247ms). Heartbeat TTL was 12.407020047s and heartbeated after 12.4326358s
2018/05/22 16:08:21.051683 [WARN] client: heartbeat missed (request took 33.9317ms). Heartbeat TTL was 13.199205098s and heartbeated after 13.2336625s
2018/05/22 16:08:43.145138 [WARN] client: heartbeat missed (request took 23.296ms). Heartbeat TTL was 13.123152495s and heartbeated after 13.1473043s
2018/05/22 16:08:59.969811 [WARN] client: heartbeat missed (request took 36.6344ms). Heartbeat TTL was 8.36411564s and heartbeated after 8.4007989s
2018/05/22 16:09:22.946321 [WARN] client: heartbeat missed (request took 31.0424ms). Heartbeat TTL was 13.500516305s and heartbeated after 13.5317883s
2018/05/22 16:09:44.635384 [WARN] client: heartbeat missed (request took 27.8115ms). Heartbeat TTL was 13.884580018s and heartbeated after 13.9123942s
2018/05/22 16:09:57.672985 [WARN] client: heartbeat missed (request took 35.7857ms). Heartbeat TTL was 13.001369943s and heartbeated after 13.0376008s
2018/05/22 16:10:16.330154 [WARN] client: heartbeat missed (request took 21.0317ms). Heartbeat TTL was 11.262666605s and heartbeated after 11.2845332s

@dadgar
Contributor

dadgar commented May 22, 2018

@capone212 Would you mind trying 0.8.3? There have been quite a few improvements to the heartbeating system; for more details, see the changelog.

@capone212
Contributor

Opened pull request #4331

capone212 mentioned this issue May 24, 2018
@dadgar
Contributor

dadgar commented May 24, 2018

@capone212 With that patch applied, do you no longer see those logs with your reproduction steps?

@capone212
Contributor

capone212 commented May 24, 2018

@dadgar I can reproduce the problem on vanilla v0.8.3, but with my patch I can't reproduce it any more.
I am sure it fixes the problem.

@capone212
Contributor

capone212 commented May 24, 2018

@dadgar I see a clear race there.
Consider the following scenario:

  1. The timer expires and its handler (invalidateHeartbeat) is scheduled for execution
  2. The timer is Reset in resetHeartbeatTimerLocked after a client heartbeat
  3. invalidateHeartbeat removes the still-active timer from the heartbeatTimers map, leaking it
  4. The next client heartbeat inserts a new active timer in resetHeartbeatTimerLocked
  5. When the timer leaked in step 3 expires, it calls invalidateHeartbeat again, which leaks the active timer from step 4, and essentially we loop through steps 3 to 5 forever

The net effect is that the client status constantly flaps between down and ready.
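
As a rough illustration of that race (a simplified sketch, not the actual Nomad source: the heartbeater type, the condensed locking, and main are invented for this example; heartbeatTimers and invalidateHeartbeat mirror the names above, and resetHeartbeatTimer stands in for resetHeartbeatTimerLocked):

package main

import (
	"log"
	"sync"
	"time"
)

// heartbeater is a condensed stand-in for the server-side heartbeat tracking
// discussed above; only the map-of-timers structure matters for the race.
type heartbeater struct {
	mu              sync.Mutex
	heartbeatTimers map[string]*time.Timer
}

// resetHeartbeatTimer is called on every client heartbeat: reset the node's
// TTL timer if one is tracked, otherwise create a fresh one.
func (h *heartbeater) resetHeartbeatTimer(nodeID string, ttl time.Duration) {
	h.mu.Lock()
	defer h.mu.Unlock()
	if t, ok := h.heartbeatTimers[nodeID]; ok {
		// Step 2: the timer may already have fired and scheduled
		// invalidateHeartbeat, yet it is reset here and keeps running.
		t.Reset(ttl)
		return
	}
	// Step 4: no timer is tracked, so a brand-new one is created.
	h.heartbeatTimers[nodeID] = time.AfterFunc(ttl, func() {
		h.invalidateHeartbeat(nodeID)
	})
}

// invalidateHeartbeat runs when a TTL timer fires (step 1) and marks the
// node down.
func (h *heartbeater) invalidateHeartbeat(nodeID string) {
	h.mu.Lock()
	// Step 3: the entry is removed from the map, but the timer it points to
	// may be the freshly reset, still-active one from step 2; it keeps
	// running outside the map and fires again later, repeating steps 3 to 5.
	// The fix in #4331 is to Stop() the tracked timer before this delete.
	delete(h.heartbeatTimers, nodeID)
	h.mu.Unlock()

	log.Printf("[WARN] nomad.heartbeat: node '%s' TTL expired", nodeID)
	// ...node status is set to down here...
}

func main() {
	h := &heartbeater{heartbeatTimers: make(map[string]*time.Timer)}
	h.resetHeartbeatTimer("5c645ff5", 50*time.Millisecond)
	time.Sleep(200 * time.Millisecond) // let the timer fire once
}

With the tracked timer stopped before the delete, a timer that was reset after firing can no longer keep running outside the map, so the cycle from steps 3 to 5 cannot start; that matches the observation that the patch makes the flapping stop.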

dadgar pushed a commit that referenced this issue May 25, 2018
Stopping heartbeat timer before remove
@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked as resolved and limited conversation to collaborators Nov 30, 2022