Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

nsqd: fix lack of synchronization around ChannelStats #971

Merged
merged 1 commit into from
Jan 8, 2018
Merged

nsqd: fix lack of synchronization around ChannelStats #971

merged 1 commit into from
Jan 8, 2018

Conversation

daroot
Copy link

@daroot daroot commented Dec 8, 2017

  • create inFlightMessages and deferredMessages maps inside mutex.
  • acquire appropriate mutexes before doing len() in NewChannelStats()
  • add test that checks for passing -race when accessing stats while
    serving messages.

Fixes issue #970.

@ploxiln
Copy link
Member

ploxiln commented Dec 8, 2017

Ah, I see - we can't "atomically" get the len() of a map. This change looks good to me.

nsqd/stats.go Outdated
c.inFlightMutex.Lock()
defer c.inFlightMutex.Unlock()
c.deferredMutex.Lock()
defer c.deferredMutex.Unlock()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's tighten up the critical sections between these locks and move the len(...) reads from out of the struct instantiation to temp variables here?

@ploxiln
Copy link
Member

ploxiln commented Jan 8, 2018

Awesome. Can you squash down to one commit?

@ploxiln
Copy link
Member

ploxiln commented Jan 8, 2018

Your test did seem to expose a remaining data race in go-1.7:

WARNING: DATA RACE
Write at 0x00c4201b4630 by goroutine 178:
  runtime.mapassign1()
      /home/travis/.gimme/versions/go1.7.6.linux.amd64/src/runtime/hashmap.go:442 +0x0
  github.com/nsqio/nsq/nsqd.(*Channel).pushInFlightMessage()
      /home/travis/gopath/src/github.com/nsqio/nsq/nsqd/channel.go:451 +0x184
  github.com/nsqio/nsq/nsqd.(*Channel).StartInFlightTimeout()
      /home/travis/gopath/src/github.com/nsqio/nsq/nsqd/channel.go:424 +0x14e
  github.com/nsqio/nsq/nsqd.TestStatsChannelLocking.func1()
      /home/travis/gopath/src/github.com/nsqio/nsq/nsqd/stats_test.go:130 +0x109
Previous read at 0x00c4201b4630 by goroutine 38:
  github.com/nsqio/nsq/nsqd.NewChannelStats()
      /home/travis/gopath/src/github.com/nsqio/nsq/nsqd/stats.go:62 +0x176
  github.com/nsqio/nsq/nsqd.(*NSQD).GetStats()
      /home/travis/gopath/src/github.com/nsqio/nsq/nsqd/stats.go:164 +0xc2b
  github.com/nsqio/nsq/nsqd.TestStatsChannelLocking.func2()
      /home/travis/gopath/src/github.com/nsqio/nsq/nsqd/stats_test.go:137 +0x6d

Is it just in the test itself possibly? I may have time to take a look later.

(I manually triggered a re-test and it passed :P)

@ploxiln
Copy link
Member

ploxiln commented Jan 8, 2018

Ah, OK, obviously you need to call len() while locking the mutex.

@daroot
Copy link
Author

daroot commented Jan 8, 2018

My bad. I didn't even think about the fact that copying the map is actually a reference to the map. So using a local copy puts you right back to where you started, with the actual map object being accessed from outside the lock. Fix up in a few minutes.

* create inFlightMessages and deferredMessages maps inside mutex.
* acquire appropriate mutexes before doing len() in NewChannelStats()
* add test that checks for passing -race when accessing stats while
  serving messages.

Fixes issue #970.
@daroot
Copy link
Author

daroot commented Jan 8, 2018

Okay, fix is up, and commits squashed.

@mreiferson
Copy link
Member

thanks!

@mreiferson mreiferson merged commit cb83885 into nsqio:master Jan 8, 2018
@daroot daroot deleted the sync_stats_970 branch January 9, 2018 01:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants