Skip to content
This repository has been archived by the owner on Jun 20, 2024. It is now read-only.

nil pointer dereference at fastdp.go:677 #1661

Closed
dpw opened this issue Nov 9, 2015 · 11 comments
Closed

nil pointer dereference at fastdp.go:677 #1661

dpw opened this issue Nov 9, 2015 · 11 comments
Labels
Milestone

Comments

@dpw
Copy link
Contributor

dpw commented Nov 9, 2015

See https://groups.google.com/a/weave.works/d/msg/weave-users/KBF7Ao_IT8Q/Nn1ZeLINCQAJ

In 1.2.0/1.2.1:

panic: runtime error: invalid memory address or nil pointer dereference
[signal 0xb code=0x1 addr=0x28 pc=0x4de519]

goroutine 62 [running]:
github.com/weaveworks/weave/router.(*fastDatapathForwarder).sendHeartbeat(0xc8212a98c0)
    /home/bryan/golang/src/github.com/weaveworks/weave/router/fastdp.go:677 +0x5f9
github.com/weaveworks/weave/router.(*fastDatapathForwarder).doHeartbeats(0xc8212a98c0)
    /home/bryan/golang/src/github.com/weaveworks/weave/router/fastdp.go:623 +0xd9
created by github.com/weaveworks/weave/router.(*fastDatapathForwarder).Confirm
    /home/bryan/golang/src/github.com/weaveworks/weave/router/fastdp.go:605 +0x428

Cause: fastDatapathForwarder Forward() can return nil (as of 35141a4) but the invocation in sendHeartbeat wasn't updated to take account of that.

@rade rade added the bug label Nov 9, 2015
@rade rade added this to the 1.2.2 milestone Nov 9, 2015
@rade
Copy link
Member

rade commented Nov 9, 2015

What circumstances trigger this? Presumably something relatively rare, since otherwise we would have seen it earlier.

@rade
Copy link
Member

rade commented Nov 9, 2015

Is there also a problem in MultiFlowOp.Add? It gets invoked with the result of Forward, and it seems a nil value would trip up a subsequent invocation of MultiFlowOp.Process.

@rade
Copy link
Member

rade commented Nov 9, 2015

and I wonder whether router.relay (or rather its maze of call sites) is also affected. It returns the result of Forward. Perhaps it should check for nil and return DiscardingFlowOp instead.

@dpw
Copy link
Contributor Author

dpw commented Nov 9, 2015

Cause: fastDatapathForwarder Forward() can return nil (as of 35141a4) but the invocation in sendHeartbeat wasn't updated to take account of that.

So this isn't quite as silly as it might first appear. If we are sending a fastdp heartbeat to a peer, we must have fastdp, and they must have fastdp, so how can HasShortID be false for either (note that HasShortID remains true even while a short id collision is being resolved)?

Answer: if we are attempting a connection to a 1.2 peer that we heard about from a 1.1 peer.

So there's a direct fix here (sendHeartbeat should tolerate a nil return from Forward). But we should also fix applyUpdate so that it corrects the short id state of the peer when appropriate.

@dpw
Copy link
Contributor Author

dpw commented Nov 9, 2015

Is there also a problem in MultiFlowOp.Add? It gets invoked with the result of Forward, and it seems a nil value would trip up a subsequent invocation of MultiFlowOp.Process.

and I wonder whether router.relay (or rather its maze of call sites) is also affected. It returns the result of Forward. Perhaps it should check for nil and return DiscardingFlowOp instead.

No and no. The fastDatapathForwarder always sits behind an overlaySwitchForwarder, and that never returns nil from its Forward. So the router never sees a nil FlowOp, it's just a special-case signal to overlaySwitchForwarder. And because that's just a compatibility thing, it should go away completely in the fullness of time, so that nil FlowOps become forbidden everywhere.

@baopx
Copy link

baopx commented Nov 10, 2015

Some more information on what happened.

I have 5 nodes to upgrade from 1.1.0 to 1.2:
The first 4 nodes upgraded to 1.2, while the last one had a network issue (aws network) and I could not access it to do the upgrade.

On 3 of the other nodes, they came up and did not join a weave network (this is a problem on my side where the 3 nodes did not know about other weave nodes, different problem).

On 4th node (this node panic), I did a weave reset, enable FastPath (enable by default), and it panics at the given stack trace.

The 5th node, is still on 1.1.0... Not upgradeable.

@rade
Copy link
Member

rade commented Nov 10, 2015

The fastDatapathForwarder always sits behind an overlaySwitchForwarder, and that never returns nil from its Forward

So a fastDatapathForwarder is not behaving as a proper forwarder. That is surprising, and needs documenting.

Or is there perhaps a better way to do this that doesn't break the Forward contract? A special FlowOp perhaps that is handled specially by the overlay switch but otherwise behaves like a DiscardingFlowOp? (I am guessing DiscardingFlowOp cannot be used directly here, since otherwise you would have done that).

@dpw
Copy link
Contributor Author

dpw commented Nov 10, 2015

Or is there perhaps a better way to do this that doesn't break the Forward contract?

The Forward contract explicitly describes the meaning of the nil value:

    // Forward a packet across the connection.  May be called as
    // soon as the forwarder is created, in particular before
    // Confirm().  The return value nil means the key could not be
    // handled by this forwarder.
    Forward(ForwardPacketKey) FlowOp

But OverlaySwitch and SleeveOverlay, which are the only two that the router is directly exposed to, support an extended contract: They can handle any key (this is part of the point of OverlaySwitch). Shockingly, go's type system cannot represent the optionality and subsumption here.

Or is there perhaps a better way to do this that doesn't break the Forward contract? A special FlowOp perhaps that is handled specially by the overlay switch but otherwise behaves like a DiscardingFlowOp?

I'm not sure that would be better. And it's overkill given the context: When 1.1 compat is dropped, the nil case will go away. Furthermore, the router will only ever be exposed to OverlaySwitch, and so the type of its Overlay field can be changed accordingly.

(I am guessing DiscardingFlowOp cannot be used directly here, since otherwise you would have done that).

When Forward returns a DiscardingFlowOp it means "here's how to forward packets with that key: discard them".

But by all means add comments as you consider appropriate.

@rade
Copy link
Member

rade commented Nov 10, 2015

The Forward contract explicitly describes the meaning of the nil value:

Given that comment, I would have expected all Forward call sites to deal with a nil return. Which is evidently not the case.

(adding a special FlowOp) is overkill given the context: When 1.1 compat is dropped, the nil case will go away. Furthermore, the router will only ever be exposed to OverlaySwitch, and so the type of its Overlay field can be changed accordingly.

That is all much more obscure than introducing a special FlowOp and prohibiting Forward from returning nil.

@dpw
Copy link
Contributor Author

dpw commented Nov 10, 2015

The bug is fairly easy to reproduce in a way that confirms my theory:

  • Launch a 1.1 peer
  • Launch a 1.2 peer, passing the address of the 1.1 peer.
  • Launch another 1.2 peer, passing the address of the 1.1 peer.

The 1.2 peers then attempt to connect to each other, triggering the bug

@dpw
Copy link
Contributor Author

dpw commented Nov 10, 2015

So there's a direct fix here (sendHeartbeat should tolerate a nil return from Forward). But we should also fix applyUpdate so that it corrects the short id state of the peer when appropriate.

On second thoughts, that latter fix would be too much complexity for too little benefit. Once all 1.1 routers are gone, connections stuck using sleeve (if they occur) can be addressed simply by restarting the routers once more.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

No branches or pull requests

4 participants