
Nomad Server: fatal error: concurrent map read and map write #4607

Closed
jippi opened this issue Aug 22, 2018 · 8 comments

Comments

@jippi
Contributor

jippi commented Aug 22, 2018

Nomad version

Nomad v0.8.4 (dbee1d7d051619e90a809c23cf7e55750900742a)

Operating system and Environment details

No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 14.04.5 LTS
Release:	14.04
Codename:	trusty

Issue

Nomad server panics with fatal error: concurrent map read and map write

It happened on 2 out of the 3 Nomad servers in the cluster

Restarting the Nomad servers causes them to panic the same way after a short time

Clients seem unaffected

go test -race -run 'TestEval*' ./nomad/ seems to report a lot of data races in general
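
For reference, this failure mode is easy to reproduce outside Nomad. The sketch below is illustration only, not Nomad code; it shows the kind of unsynchronized map access that makes the Go runtime abort a process with exactly this fatal error:

// Minimal sketch (illustration only, not Nomad code): a map read and a map
// write racing from two goroutines makes the Go runtime abort the whole
// process with "fatal error: concurrent map read and map write".
package main

func main() {
	m := map[string]int{}

	// Writer goroutine keeps mutating the map without any synchronization.
	go func() {
		for i := 0; ; i++ {
			m["key"] = i
		}
	}()

	// Reader in the main goroutine races against the writer; the runtime's
	// built-in map race check eventually triggers the fatal error above.
	for {
		_ = m["key"]
	}
}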

Reproduction steps

Was draining a bunch of nodes; not sure if it's related or not

Nomad Server logs (if appropriate)

https://gist.github.com/jippi/7b41d871cdbda6e24d202e09f0925438

Hotfix

diff --git a/lib/delay_heap.go b/lib/delay_heap.go
index 155583198..540593e76 100644
--- a/lib/delay_heap.go
+++ b/lib/delay_heap.go
@@ -3,11 +3,14 @@ package lib
 import (
 	"container/heap"
 	"fmt"
+	"sync"
 	"time"
 
 	"github.com/hashicorp/nomad/nomad/structs"
 )
 
+var delayHeapMutex = sync.RWMutex{}
+
 // DelayHeap wraps a heap and gives operations other than Push/Pop.
 // The inner heap is sorted by the time in the WaitUntil field of delayHeapNode
 type DelayHeap struct {
@@ -91,13 +94,21 @@ func (p *DelayHeap) Push(dataNode HeapNode, next time.Time) error {
 		ID:        dataNode.ID(),
 		Namespace: dataNode.Namespace(),
 	}
+
+	delayHeapMutex.RLock()
 	if _, ok := p.index[tuple]; ok {
+		delayHeapMutex.RUnlock()
 		return fmt.Errorf("node %q (%s) already exists", dataNode.ID(), dataNode.Namespace())
 	}
+	delayHeapMutex.RUnlock()
 
 	delayHeapNode := &delayHeapNode{dataNode, next, 0}
 	p.index[tuple] = delayHeapNode
+
+	delayHeapMutex.Lock()
 	heap.Push(&p.heap, delayHeapNode)
+	delayHeapMutex.Unlock()
+
 	return nil
 }
 
@@ -120,6 +131,8 @@ func (p *DelayHeap) Peek() *delayHeapNode {
 		return nil
 	}
 
+	delayHeapMutex.RLock()
+	defer delayHeapMutex.RUnlock()
 	return p.heap[0]
 }
 
@@ -128,7 +141,9 @@ func (p *DelayHeap) Contains(heapNode HeapNode) bool {
 		ID:        heapNode.ID(),
 		Namespace: heapNode.Namespace(),
 	}
+	delayHeapMutex.RLock()
 	_, ok := p.index[tuple]
+	delayHeapMutex.RUnlock()
 	return ok
 }
 
@@ -137,14 +152,17 @@ func (p *DelayHeap) Update(heapNode HeapNode, waitUntil time.Time) error {
 		ID:        heapNode.ID(),
 		Namespace: heapNode.Namespace(),
 	}
+
+	delayHeapMutex.RLock()
 	if existingHeapNode, ok := p.index[tuple]; ok {
+		delayHeapMutex.RUnlock()
 		// Need to update the job as well because its spec can change.
 		existingHeapNode.Node = heapNode
 		existingHeapNode.WaitUntil = waitUntil
 		heap.Fix(&p.heap, existingHeapNode.index)
 		return nil
 	}
-
+	delayHeapMutex.Unlock()
 	return fmt.Errorf("heap doesn't contain object with ID %q (%s)", heapNode.ID(), heapNode.Namespace())
 }
 
@@ -153,6 +171,10 @@ func (p *DelayHeap) Remove(heapNode HeapNode) error {
 		ID:        heapNode.ID(),
 		Namespace: heapNode.Namespace(),
 	}
+
+	delayHeapMutex.Lock()
+	defer delayHeapMutex.Unlock()
+
 	if node, ok := p.index[tuple]; ok {
 		heap.Remove(&p.heap, node.index)
 		delete(p.index, tuple)
@tantra35
Copy link
Contributor

tantra35 commented Aug 23, 2018

We have the same issue, but we also still have the problem described in #4477, and a simple restart hangs the Nomad servers after they get stuck.

Here are our logs:
nomad_race.logs.txt

However, the patch you provided doesn't look right to me; IMHO DelayHeap itself should stay neutral about thread safety. A more correct patch should change the higher-level structures instead, and would look something like this:

diff --git a/nomad/eval_broker.go b/nomad/eval_broker.go
index 54e32e7a8..59fde90dc 100644
--- a/nomad/eval_broker.go
+++ b/nomad/eval_broker.go
@@ -763,8 +763,8 @@ func (b *EvalBroker) runDelayedEvalsWatcher(ctx context.Context) {
 			return
 		case <-timerChannel:
 			// remove from the heap since we can enqueue it now
-			b.delayHeap.Remove(&evalWrapper{eval})
 			b.l.Lock()
+			b.delayHeap.Remove(&evalWrapper{eval})
 			b.stats.TotalWaiting -= 1
 			b.enqueueLocked(eval, eval.Type)
 			b.l.Unlock()
@@ -777,11 +777,14 @@ func (b *EvalBroker) runDelayedEvalsWatcher(ctx context.Context) {
 // nextDelayedEval returns the next delayed eval to launch and when it should be enqueued.
 // This peeks at the heap to return the top. If the heap is empty, this returns nil and zero time.
 func (b *EvalBroker) nextDelayedEval() (*structs.Evaluation, time.Time) {
+	b.l.RLock()
 	// If there is nothing wait for an update.
 	if b.delayHeap.Length() == 0 {
+		b.l.RUnlock()
 		return nil, time.Time{}
 	}
 	nextEval := b.delayHeap.Peek()
+	b.l.RUnlock()
 
 	if nextEval == nil {
 		return nil, time.Time{} 

It would be very helpful if the folks from HashiCorp could clarify this point.
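
For illustration, here is a minimal sketch (assumed names, not the actual Nomad code) of the design being argued for above: the heap carries no locking of its own, and the owning broker serializes every heap access under its existing mutex.

package main

import (
	"sync"
	"time"
)

// delayedItem stands in for the delayed evaluation wrapper kept in the heap.
type delayedItem struct {
	id        string
	waitUntil time.Time
}

// broker owns both the lock and the delayed collection; nothing touches the
// collection without holding broker.l, so the collection itself needs no
// internal locking (mirroring how the suggested patch guards delayHeap with
// the EvalBroker's b.l).
type broker struct {
	l     sync.RWMutex
	delay []*delayedItem // stands in for the real DelayHeap
}

// peek returns the next delayed item under the read lock.
func (b *broker) peek() *delayedItem {
	b.l.RLock()
	defer b.l.RUnlock()
	if len(b.delay) == 0 {
		return nil
	}
	return b.delay[0]
}

// remove drops the first item under the write lock, just as the suggested
// patch moves delayHeap.Remove inside b.l.Lock()/b.l.Unlock().
func (b *broker) remove() {
	b.l.Lock()
	defer b.l.Unlock()
	if len(b.delay) > 0 {
		b.delay = b.delay[1:]
	}
}

func main() {
	b := &broker{delay: []*delayedItem{{id: "eval-1", waitUntil: time.Now()}}}
	if next := b.peek(); next != nil {
		b.remove()
	}
}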

@jippi
Contributor Author

jippi commented Aug 24, 2018

My patch made the cluster non-crashy, so it did what I wanted it to :)

@jippi
Contributor Author

jippi commented Aug 27, 2018

@preetapan can you or someone on the Nomad team please triage this?

@pkrolikowski

Hi,
We are also facing the same problem. Just as @jippi described, Nomad servers are throwing:
fatal error: concurrent map read and map write
and it seems that's related to draining a bunch of nodes (that's the only thing we've done).

Nomad version:
Nomad v0.8.4 (dbee1d7d051619e90a809c23cf7e55750900742a)

Operating system and Environment details

Distributor ID:	Ubuntu
Description:	Ubuntu 16.04.1 LTS
Release:	16.04
Codename:	xenial

@preetapan
Contributor

Sorry for the late response; I've been traveling this week.

This is on our radar and I'll have a PR up for it by early next week.

@pkrolikowski

@preetapan thanks for the info. Any suggestion for a hotfix? A server restart or something similar?
We are not able to interact with our production cluster at the moment, and it's crucial for us to solve this problem somehow.

@preetapan
Contributor

@pkrolikowski yes, restarting the server should work. Draining several nodes at once increases the chances of the race condition occurring again, so I would suggest avoiding that until I merge a fix to master. Sorry for the bug.

@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 28, 2022