raft: set n.Node and n.isMember to correct value in stop/removeNode #1288

Merged: 1 commit, Aug 6, 2016

Conversation

runshenzhu
Contributor

@runshenzhu runshenzhu commented Aug 1, 2016

n.IsStopped() reports whether the node is stopped by checking if n.Node is nil, but we didn't set n.Node correctly in n.stop(). This PR fixes that by using isMember to tell whether the node is stopped.

Also, it should fix #1272.

ping @abronan @LK4D4

Signed-off-by: Runshen Zhu <runshen.zhu@gmail.com>

@codecov-io

codecov-io commented Aug 1, 2016

Current coverage is 55.18% (diff: 67.74%)

Merging #1288 into master will decrease coverage by <.01%

@@             master      #1288   diff @@
==========================================
  Files            80         80          
  Lines         12562      12565     +3   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits           6933       6934     +1   
- Misses         4674       4681     +7   
+ Partials        955        950     -5   

Powered by Codecov. Last update e021d14...e7a941e

@LK4D4
Contributor

LK4D4 commented Aug 1, 2016

@runshenzhu can't we just set isMember to 0? Also with mutexes atomic sorta loses its original value, so isMember can be just bool.

@runshenzhu
Contributor Author

@LK4D4 I'm not sure an atomic value by itself can prevent the race without a lock.

Suppose isMember is set to 0 in stop: n.IsLeader could still call n.Node.Status() after the node has stopped if the lock is not held.

func (n *Node) IsLeader() bool {
    // n.isMember is 1, but node is stopping
    if !n.IsMember() || n.Node == nil {
        return false
    }

    // node is stopped, and n.isMember is set to 0

    if n.Node.Status().Lead == n.Config.ID {
        return true
    }
    return false
}
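
The interleaving in the snippet above is a check-then-act race: the atomic load is safe by itself, but nothing ties it to the later use of n.Node. A minimal sketch of guarding the check and the use with one RWMutex, assuming a hypothetical `node` wrapper and a stand-in `raftNode` type (neither is SwarmKit's actual code):

```go
package main

import (
	"fmt"
	"sync"
)

// raftNode is a hypothetical stand-in for the underlying raft node.
type raftNode struct{ lead uint64 }

// node is a hypothetical wrapper, not SwarmKit's actual Node type.
type node struct {
	mu       sync.RWMutex // guards raft and isMember together
	raft     *raftNode
	isMember bool
	id       uint64
}

// isLeader holds the read lock across both the check and the use,
// so stop cannot run between them.
func (n *node) isLeader() bool {
	n.mu.RLock()
	defer n.mu.RUnlock()
	if !n.isMember || n.raft == nil {
		return false
	}
	return n.raft.lead == n.id
}

// stop takes the write lock, so it waits for in-flight readers.
func (n *node) stop() {
	n.mu.Lock()
	defer n.mu.Unlock()
	n.isMember = false
	n.raft = nil
}

func main() {
	n := &node{raft: &raftNode{lead: 1}, isMember: true, id: 1}
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			n.isLeader() // never dereferences a nil raft node
		}()
	}
	n.stop()
	wg.Wait()
	fmt.Println(n.isLeader()) // prints false: the node has stopped
}
```

With the lock held for the whole method, a plain bool is enough for the flag, which is the point raised in the replies.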

@LK4D4
Contributor

LK4D4 commented Aug 1, 2016

@runshenzhu I'm pretty sure that atomic can't prevent it, that's why I propose to replace it with simple bool and use it as guard.

@runshenzhu
Contributor Author

updated: set n.isMember to 0 in n.stop() and n.applyRemoveNode()

@LK4D4 I tried to follow your suggestion of removing the atomic, but then Go complains about a race when accessing n.isMember in n.applyRemoveNode().

And we can't simply use a lock to protect it: if the leaving node is the leader, a read lock is already held in Leave(). In that case we would first have to release the read lock temporarily in n.applyRemoveNode() and then acquire the write lock.

For now I keep n.isMember as an atomic because it simplifies the code in n.applyRemoveNode(). But I could change it to a bool if you think bool is better.

@@ -486,7 +492,7 @@ func (n *Node) stop() {

 // IsLeader checks if we are the leader or not
 func (n *Node) IsLeader() bool {
-	if !n.IsMember() {
+	if !n.IsMember() || n.Node == nil {
Contributor

Why do we still need the nil check?

@runshenzhu runshenzhu changed the title raft: set n.Node to nil in stop() raft: set n.Node and n.isMember to correct value in stop/removeNode Aug 1, 2016
@runshenzhu
Contributor Author

update: remove the nil check. PTAL @LK4D4 @abronan

@LK4D4
Contributor

LK4D4 commented Aug 1, 2016

I still think those checks need some more thought.
This PR LGTM; at least it should prevent deadlocks in Status().


if n.Node != nil {
n.Stop()
n.Node = nil
Collaborator

Is setting Node to nil still necessary?

Contributor Author

@aaronlehmann I think n.IsStopped() needs it.

Collaborator

Sorry, still confused about this. Now that we set both n.Node = nil and isMember to 0, isn't IsStopped basically the same thing as IsMember? I don't understand why we need both. It looks like the only place we use only IsStopped is in LeaderAddr, but I don't understand why that couldn't use IsMember instead.

@aaronlehmann
Collaborator

As @LK4D4 mentioned earlier, I don't understand why isMember is an atomic variable. It seems like it needs a lock around it anyway so there aren't time of check/time of use races. Should it just be a regular bool, and have everything that relies on it be protected by stopMu?

@runshenzhu
Contributor Author

@aaronlehmann It would be a little complicated to use n.stopMu in n.applyRemoveNode(). If the leaving node is the leader, a read lock is already held in Leave(). In that case we would first have to release the read lock temporarily in n.applyRemoveNode() and then acquire the write lock.

Using atomic will simplify the case.

@runshenzhu
Contributor Author

update: close n.removeRaftCh in n.applyRemoveNode()

n.removeRaftOnce.Do(func() {
	atomic.StoreUint32(&n.isMember, 0)
	close(n.removeRaftCh)
})
Collaborator

We should probably factor the n.removeRaftOnce.Do call into a function, since the same code appears in two places.

@aaronlehmann
Collaborator

IsLeader can be called from outside code (see protobuf/plugin/raftproxy/raftproxy.go). Since the lock won't be held in this situation, don't we need something to protect it against a situation where isMember is set to 0 right after being checked?

@runshenzhu
Contributor Author

@aaronlehmann updated to use RLock(), and refactor it to remove redundant code.

@aaronlehmann
Collaborator

@runshenzhu: Any thoughts on this comment? #1288 (comment)

@runshenzhu
Contributor Author

runshenzhu commented Aug 2, 2016

@aaronlehmann As a quick fix, n.IsLeader() could hold n.stopMu.RLock to protect itself. This lock is re-enterable and could solve access race.

@runshenzhu
Contributor Author

update: using read lock to protect n.IsLeader.

This is a quick fix for the #1288 comment. In the long term, we should figure out a better way to make sure n.IsLeader runs in a safe context.

@aaronlehmann
Collaborator

Sounds good. You can just call it isLeader.

@runshenzhu runshenzhu force-pushed the fix-stop branch 3 times, most recently from ac00684 to dcebb13 Compare August 2, 2016 18:49
@runshenzhu
Contributor Author

update:

  • add 2 new methods: leader() and isLeader(), which don't hold any locks.
  • grab read lock in Leader() and IsLeader()
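
The split listed above can be sketched as follows. The `node` type, its fields, and the `leave` method are hypothetical stand-ins, not the merged SwarmKit code. One reason the split matters in Go: the sync.RWMutex documentation prohibits recursive read locking, because a waiting writer blocks new readers, so code that already holds the read lock should call the lock-free helpers.

```go
package main

import (
	"fmt"
	"sync"
)

// node is a hypothetical stand-in, not SwarmKit's actual Node.
type node struct {
	stopMu   sync.RWMutex
	isMember bool
	lead, id uint64
}

// leader assumes the caller already holds stopMu.
func (n *node) leader() uint64 {
	if !n.isMember {
		return 0
	}
	return n.lead
}

// isLeader assumes the caller already holds stopMu.
func (n *node) isLeader() bool {
	return n.isMember && n.leader() == n.id
}

// Leader is the entry point for callers that hold no lock.
func (n *node) Leader() uint64 {
	n.stopMu.RLock()
	defer n.stopMu.RUnlock()
	return n.leader()
}

// IsLeader is the entry point for callers that hold no lock.
func (n *node) IsLeader() bool {
	n.stopMu.RLock()
	defer n.stopMu.RUnlock()
	return n.isLeader()
}

// leave (hypothetical) already holds the read lock, so it uses the
// lock-free helper instead of re-entering through IsLeader.
func (n *node) leave() string {
	n.stopMu.RLock()
	defer n.stopMu.RUnlock()
	if n.isLeader() {
		return "leader leaving"
	}
	return "follower leaving"
}

func main() {
	n := &node{isMember: true, lead: 7, id: 7}
	fmt.Println(n.IsLeader(), n.Leader(), n.leave())
}
```

External callers such as the raft proxy would use the exported methods, while internal paths that already hold stopMu use the unexported ones.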

@aaronlehmann
Collaborator

LGTM

@aaronlehmann
Collaborator

ping @LK4D4 @abronan for more review.

Any thoughts on whether this should be included in 1.12.1?

@LK4D4
Contributor

LK4D4 commented Aug 4, 2016

It sounds like it can be a serious issue. But I don't know if it should be included in minor release.

@abronan
Contributor

abronan commented Aug 4, 2016

Let's include #1310 first and then test this one on top of it.

@LK4D4
Contributor

LK4D4 commented Aug 4, 2016

@runshenzhu would you mind rebasing?

@runshenzhu
Contributor Author

rebased.

@LK4D4 @aaronlehmann @abronan PTAL

@abronan
Contributor

abronan commented Aug 5, 2016

Do not merge yet: I'm seeing a panic when stopping nodes, probably a race on setting the node to nil (1 out of 3 crashed):

panic: runtime error: invalid memory address or nil pointer dereference
[signal 0xb code=0x1 addr=0x60 pc=0x12a520e]

goroutine 1313 [running]:
panic(0x1b17a20, 0xc82000a0d0)
    /usr/local/go/src/runtime/panic.go:481 +0x3e6
github.com/docker/swarmkit/manager/state/raft.(*Node).GetVersion(0xc820080480, 0x1)
    /go/src/github.com/docker/docker/vendor/src/github.com/docker/swarmkit/manager/state/raft/raft.go:939 +0x3e
github.com/docker/swarmkit/manager/state/store.(*MemoryStore).update(0xc821059260, 0x7f0e08303228, 0xc820080480, 0xc8213394e0, 0x0, 0x0)
    /go/src/github.com/docker/docker/vendor/src/github.com/docker/swarmkit/manager/state/store/memory.go:222 +0x9d
github.com/docker/swarmkit/manager/state/store.(*MemoryStore).Update(0xc821059260, 0xc8213394e0, 0x0, 0x0)
    /go/src/github.com/docker/docker/vendor/src/github.com/docker/swarmkit/manager/state/store/memory.go:270 +0x54
github.com/docker/swarmkit/manager/state/store.ViewAndWatch(0xc821059260, 0xc821339820, 0xc820ec3d90, 0x1, 0x1, 0x0, 0x0, 0x0, 0x0)
    /go/src/github.com/docker/docker/vendor/src/github.com/docker/swarmkit/manager/state/store/memory.go:703 +0xed
github.com/docker/swarmkit/manager/dispatcher.(*Dispatcher).Session(0xc821326000, 0xc820142088, 0x7f0e08240850, 0xc820d42a10, 0x0, 0x0)
    /go/src/github.com/docker/docker/vendor/src/github.com/docker/swarmkit/manager/dispatcher/dispatcher.go:801 +0x7b8
github.com/docker/swarmkit/api.(*authenticatedWrapperDispatcherServer).Session(0xc821308ac0, 0xc820142088, 0x7f0e08240850, 0xc820d42a10, 0x0, 0x0)
    /go/src/github.com/docker/docker/vendor/src/github.com/docker/swarmkit/api/dispatcher.pb.go:207 +0x16c
github.com/docker/swarmkit/api.(*raftProxyDispatcherServer).Session(0xc821315c40, 0xc820142088, 0x7f0e08240850, 0xc820d42a10, 0x0, 0x0)
    /go/src/github.com/docker/docker/vendor/src/github.com/docker/swarmkit/api/dispatcher.pb.go:1121 +0xc8
github.com/docker/swarmkit/api._Dispatcher_Session_Handler(0x1c32580, 0xc821315c40, 0x7f0e082407e0, 0xc821ce7880, 0x0, 0x0)
    /go/src/github.com/docker/docker/vendor/src/github.com/docker/swarmkit/api/dispatcher.pb.go:667 +0x175
google.golang.org/grpc.(*Server).processStreamingRPC(0xc821328000, 0x7f0e0bd2e5e8, 0xc821cf2240, 0xc8220b36c0, 0xc821308d60, 0x2b34440, 0xc820d3bc80, 0x0, 0x0)
    /go/src/github.com/docker/docker/vendor/src/google.golang.org/grpc/server.go:602 +0x47a
google.golang.org/grpc.(*Server).handleStream(0xc821328000, 0x7f0e0bd2e5e8, 0xc821cf2240, 0xc8220b36c0, 0xc820d3bc80)
    /go/src/github.com/docker/docker/vendor/src/google.golang.org/grpc/server.go:686 +0x114e
google.golang.org/grpc.(*Server).serveStreams.func1.1(0xc820d42630, 0xc821328000, 0x7f0e0bd2e5e8, 0xc821cf2240, 0xc8220b36c0)
    /go/src/github.com/docker/docker/vendor/src/google.golang.org/grpc/server.go:348 +0xa0
created by google.golang.org/grpc.(*Server).serveStreams.func1
    /go/src/github.com/docker/docker/vendor/src/google.golang.org/grpc/server.go:349 +0x9a

@aaronlehmann
Collaborator

The panic above is because n.Node is being set to nil, and GetVersion is not protected against this condition.

I'm not convinced that we should set n.Node to nil. I think it's safer and easier to use the atomic variable isMember instead (but I might be missing something).

@runshenzhu runshenzhu force-pushed the fix-stop branch 4 times, most recently from 2b5aca5 to 0fb4cb7 Compare August 5, 2016 17:52
@runshenzhu
Contributor Author

runshenzhu commented Aug 5, 2016

One concern with using isMember was that its value is set to 0 before the node is actually stopped.

Updated to use isMember as the indicator and to postpone setting its value until the very end of stopping the node.

But I'm wondering whether we should check if the node is stopped in GetVersion. And what's the correct return value of GetVersion() if the node is stopped: nil or &api.Version{}?

@@ -963,10 +990,8 @@ func (n *Node) IsMember() bool {

// IsStopped checks if the raft node is stopped or not
func (n *Node) IsStopped() bool {
Collaborator

Can't we remove this function completely?

Signed-off-by: Runshen Zhu <runshen.zhu@gmail.com>
@runshenzhu
Contributor Author

runshenzhu commented Aug 5, 2016

updated to remove n.IsStopped() and protect GetVersion() with lock

defer n.stopMu.RUnlock()

if !n.IsMember() {
return nil
Contributor Author

@runshenzhu runshenzhu Aug 5, 2016

Notice that if the node is stopped, nil is returned. I think this check is necessary: as @abronan's test above showed, this method could be called after the node is stopped. In that case, without the check, n.Node.Status() would be executed after the raft node is stopped.
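
A rough sketch of the guarded accessor and of a caller that handles the nil result. All names here (`node`, `version`, `getVersion`, `applied`) are hypothetical stand-ins for the real GetVersion path, and the flag is cleared as the last step of stopping, as described earlier in the thread.

```go
package main

import (
	"fmt"
	"sync"
)

// version is a hypothetical stand-in for the API's version message.
type version struct{ Index uint64 }

// node is a hypothetical stand-in, not SwarmKit's actual Node.
type node struct {
	stopMu   sync.RWMutex
	isMember bool
	applied  uint64
}

// getVersion returns nil once the node has stopped.
func (n *node) getVersion() *version {
	n.stopMu.RLock()
	defer n.stopMu.RUnlock()
	if !n.isMember {
		return nil
	}
	return &version{Index: n.applied}
}

// stop clears the flag only as its last step, under the write lock.
func (n *node) stop() {
	n.stopMu.Lock()
	defer n.stopMu.Unlock()
	// ... shutting down the underlying raft node would happen here ...
	n.isMember = false
}

func main() {
	n := &node{isMember: true, applied: 42}
	if v := n.getVersion(); v != nil {
		fmt.Println("version index:", v.Index)
	}
	n.stop()
	if v := n.getVersion(); v == nil {
		fmt.Println("node stopped, no version")
	}
}
```

Returning nil pushes a nil check onto every caller, which is the trade-off behind the earlier question about returning an empty value instead.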

Collaborator

Seems okay.

@abronan
Contributor

abronan commented Aug 6, 2016

LGTM

@aaronlehmann
Collaborator

LGTM

@runshenzhu
Contributor Author

LOL, there is a very long fix/conversation list for this PR. Thanks for everyone's patient help, comment and review.

Development

Successfully merging this pull request may close these issues.

raft: We need to ensure that all Node.Status() calls finish before Node.Stop()
6 participants