
Nomad Server: fatal error: concurrent map read and map write #4607

Closed
jippi opened this issue Aug 22, 2018 · 8 comments

Comments

@jippi
Contributor

jippi commented Aug 22, 2018

Nomad version

Nomad v0.8.4 (dbee1d7d051619e90a809c23cf7e55750900742a)

Operating system and Environment details

No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 14.04.5 LTS
Release:	14.04
Codename:	trusty

Issue

Nomad server panics with fatal error: concurrent map read and map write

It happened on 2 out of the 3 Nomad servers in the cluster

Restarting the Nomad servers causes them to panic the same way after a short time

Clients seem unaffected

go test -race -run 'TestEval*' ./nomad/ seems to report a lot of data races in general
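
For reference, this failure mode is easy to reproduce outside Nomad. The sketch below is illustration only, not Nomad code; it shows the kind of unsynchronized map access that makes the Go runtime abort a process with exactly this fatal error:

// Minimal sketch (illustration only, not Nomad code): a map read and a map
// write racing from two goroutines makes the Go runtime abort the whole
// process with "fatal error: concurrent map read and map write".
package main

func main() {
	m := map[string]int{}

	// Writer goroutine keeps mutating the map without any synchronization.
	go func() {
		for i := 0; ; i++ {
			m["key"] = i
		}
	}()

	// Reader in the main goroutine races against the writer; the runtime's
	// built-in map race check eventually triggers the fatal error above.
	for {
		_ = m["key"]
	}
}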

Reproduction steps

Was draining a bunch of nodes; not sure if it's related or not

Nomad Server logs (if appropriate)

https://gist.github.com/jippi/7b41d871cdbda6e24d202e09f0925438

Hotfix

diff --git a/lib/delay_heap.go b/lib/delay_heap.go
index 155583198..540593e76 100644
--- a/lib/delay_heap.go
+++ b/lib/delay_heap.go
@@ -3,11 +3,14 @@ package lib
 import (
 	"container/heap"
 	"fmt"
+	"sync"
 	"time"
 
 	"github.com/hashicorp/nomad/nomad/structs"
 )
 
+var delayHeapMutex = sync.RWMutex{}
+
 // DelayHeap wraps a heap and gives operations other than Push/Pop.
 // The inner heap is sorted by the time in the WaitUntil field of delayHeapNode
 type DelayHeap struct {
@@ -91,13 +94,21 @@ func (p *DelayHeap) Push(dataNode HeapNode, next time.Time) error {
 		ID:        dataNode.ID(),
 		Namespace: dataNode.Namespace(),
 	}
+
+	delayHeapMutex.RLock()
 	if _, ok := p.index[tuple]; ok {
+		delayHeapMutex.RUnlock()
 		return fmt.Errorf("node %q (%s) already exists", dataNode.ID(), dataNode.Namespace())
 	}
+	delayHeapMutex.RUnlock()
 
 	delayHeapNode := &delayHeapNode{dataNode, next, 0}
 	p.index[tuple] = delayHeapNode
+
+	delayHeapMutex.Lock()
 	heap.Push(&p.heap, delayHeapNode)
+	delayHeapMutex.Unlock()
+
 	return nil
 }
 
@@ -120,6 +131,8 @@ func (p *DelayHeap) Peek() *delayHeapNode {
 		return nil
 	}
 
+	delayHeapMutex.RLock()
+	defer delayHeapMutex.RUnlock()
 	return p.heap[0]
 }
 
@@ -128,7 +141,9 @@ func (p *DelayHeap) Contains(heapNode HeapNode) bool {
 		ID:        heapNode.ID(),
 		Namespace: heapNode.Namespace(),
 	}
+	delayHeapMutex.RLock()
 	_, ok := p.index[tuple]
+	delayHeapMutex.RUnlock()
 	return ok
 }
 
@@ -137,14 +152,17 @@ func (p *DelayHeap) Update(heapNode HeapNode, waitUntil time.Time) error {
 		ID:        heapNode.ID(),
 		Namespace: heapNode.Namespace(),
 	}
+
+	delayHeapMutex.RLock()
 	if existingHeapNode, ok := p.index[tuple]; ok {
+		delayHeapMutex.RUnlock()
 		// Need to update the job as well because its spec can change.
 		existingHeapNode.Node = heapNode
 		existingHeapNode.WaitUntil = waitUntil
 		heap.Fix(&p.heap, existingHeapNode.index)
 		return nil
 	}
-
+	delayHeapMutex.Unlock()
 	return fmt.Errorf("heap doesn't contain object with ID %q (%s)", heapNode.ID(), heapNode.Namespace())
 }
 
@@ -153,6 +171,10 @@ func (p *DelayHeap) Remove(heapNode HeapNode) error {
 		ID:        heapNode.ID(),
 		Namespace: heapNode.Namespace(),
 	}
+
+	delayHeapMutex.Lock()
+	defer delayHeapMutex.Unlock()
+
 	if node, ok := p.index[tuple]; ok {
 		heap.Remove(&p.heap, node.index)
 		delete(p.index, tuple)
@tantra35
Copy link
Contributor

tantra35 commented Aug 23, 2018

We have the same issue, but we also still have the problem described in #4477, and a simple restart hangs the Nomad servers after they get stuck.

Here are our logs:
nomad_race.logs.txt

However, the patch you provided doesn't look right to me; IMHO DelayHeap itself should stay neutral about thread safety. A more correct patch should change the higher-level structures instead, and would look something like this:

diff --git a/nomad/eval_broker.go b/nomad/eval_broker.go
index 54e32e7a8..59fde90dc 100644
--- a/nomad/eval_broker.go
+++ b/nomad/eval_broker.go
@@ -763,8 +763,8 @@ func (b *EvalBroker) runDelayedEvalsWatcher(ctx context.Context) {
 			return
 		case <-timerChannel:
 			// remove from the heap since we can enqueue it now
-			b.delayHeap.Remove(&evalWrapper{eval})
 			b.l.Lock()
+			b.delayHeap.Remove(&evalWrapper{eval})
 			b.stats.TotalWaiting -= 1
 			b.enqueueLocked(eval, eval.Type)
 			b.l.Unlock()
@@ -777,11 +777,14 @@ func (b *EvalBroker) runDelayedEvalsWatcher(ctx context.Context) {
 // nextDelayedEval returns the next delayed eval to launch and when it should be enqueued.
 // This peeks at the heap to return the top. If the heap is empty, this returns nil and zero time.
 func (b *EvalBroker) nextDelayedEval() (*structs.Evaluation, time.Time) {
+	b.l.RLock()
 	// If there is nothing wait for an update.
 	if b.delayHeap.Length() == 0 {
+		b.l.RUnlock()
 		return nil, time.Time{}
 	}
 	nextEval := b.delayHeap.Peek()
+	b.l.RUnlock()
 
 	if nextEval == nil {
 		return nil, time.Time{} 

It would be very helpful if the folks from HashiCorp could clarify this point.
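
For illustration, here is a minimal sketch (assumed names, not the actual Nomad code) of the design being argued for above: the heap carries no locking of its own, and the owning broker serializes every heap access under its existing mutex.

package main

import (
	"sync"
	"time"
)

// delayedItem stands in for the delayed evaluation wrapper kept in the heap.
type delayedItem struct {
	id        string
	waitUntil time.Time
}

// broker owns both the lock and the delayed collection; nothing touches the
// collection without holding broker.l, so the collection itself needs no
// internal locking (mirroring how the suggested patch guards delayHeap with
// the EvalBroker's b.l).
type broker struct {
	l     sync.RWMutex
	delay []*delayedItem // stands in for the real DelayHeap
}

// peek returns the next delayed item under the read lock.
func (b *broker) peek() *delayedItem {
	b.l.RLock()
	defer b.l.RUnlock()
	if len(b.delay) == 0 {
		return nil
	}
	return b.delay[0]
}

// remove drops the first item under the write lock, just as the suggested
// patch moves delayHeap.Remove inside b.l.Lock()/b.l.Unlock().
func (b *broker) remove() {
	b.l.Lock()
	defer b.l.Unlock()
	if len(b.delay) > 0 {
		b.delay = b.delay[1:]
	}
}

func main() {
	b := &broker{delay: []*delayedItem{{id: "eval-1", waitUntil: time.Now()}}}
	if next := b.peek(); next != nil {
		b.remove()
	}
}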

@jippi
Contributor Author

jippi commented Aug 24, 2018

My patch made the cluster non-crashy, so it did what I wanted it to :)

@jippi
Contributor Author

jippi commented Aug 27, 2018

@preetapan can you or someone on the Nomad team please triage this?

@pkrolikowski

Hi,
We are also facing the same problem. Just as @jippi described, Nomad servers are throwing:
fatal error: concurrent map read and map write
and it seems that's related to draining a bunch of nodes (that's the only thing we've done).

Nomad version:
Nomad v0.8.4 (dbee1d7d051619e90a809c23cf7e55750900742a)

Operating system and Environment details

Distributor ID:	Ubuntu
Description:	Ubuntu 16.04.1 LTS
Release:	16.04
Codename:	xenial

@preetapan
Contributor

Sorry for the late response; I've been traveling this week.

This is on our radar and I'll have a PR up for it by early next week.

@pkrolikowski

@preetapan thanks for the info. Any suggestion for a hotfix? A server restart or something similar?
We are not able to interact with our production cluster at the moment, and it's crucial for us to solve this problem somehow.

@preetapan
Contributor

@pkrolikowski yes, restarting the server should work. Draining several nodes at once increases the chances of the race condition occurring again, so I would suggest avoiding that until I merge a fix to master. Sorry for the bug.

@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 28, 2022