Netlink timeouts #1976
@mavenugo can you review the changes? The risk is that the new vendoring of netlink pulls in a large number of changes. We should try to reproduce the issue and then apply this patch.
Codecov Report
@@ Coverage Diff @@
## master #1976 +/- ##
=========================================
Coverage ? 38.94%
=========================================
Files ? 137
Lines ? 27390
Branches ? 0
=========================================
Hits ? 10666
Misses ? 15426
Partials ? 1298
bb38615 to 20c86db
How do I verify that this PR fixes the issue? cc @antonybichon17
@andrewhsu the deadlock condition triggered when the socket is closed is tested in the netlink library project. This commit simply uses the same timeout.
You are using the timeout in a different part of the code, triggered by different actions.
@antonybichon17 for validation, IMHO the best approach is to try to reproduce the customer deadlock.
drivers/overlay/ov_network.go
Outdated
@@ -713,14 +719,17 @@ func (n *network) watchMiss(nlSock *nl.NetlinkSocket) {
	t := time.Now()
	for {
		msgs, err := nlSock.Receive()
		if err != nil {
		// When the receive timeout expires the receive will return EAGAIN
		if err == syscall.EAGAIN {
			n.Lock()
			nlFd := nlSock.GetFd()
			n.Unlock()
			if nlFd == -1 {
				// The netlink socket got closed, simply exit to not leak this goroutine
				return
			}
Better to add a continue here to avoid spamming the logs.
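For readers following the thread, the pattern under review (receive with a timeout, check on each EAGAIN whether the socket was closed, otherwise continue without logging) can be sketched as below. This is a minimal sketch, not the code from this PR: it assumes Linux, uses a Unix socketpair with SO_RCVTIMEO instead of a netlink socket, and the names watchedSocket, watch, and demo are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
	"syscall"
	"time"
)

// watchedSocket is a hypothetical stand-in for a socket whose descriptor
// is reset to -1 under a lock when it gets closed.
type watchedSocket struct {
	mu sync.Mutex
	fd int
}

func (s *watchedSocket) getFd() int {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.fd
}

func (s *watchedSocket) markClosed() {
	s.mu.Lock()
	s.fd = -1
	s.mu.Unlock()
}

// watch blocks in recvfrom and returns once the socket has been marked
// closed. It returns the number of receive timeouts it observed.
func watch(s *watchedSocket, fd int) (int, error) {
	buf := make([]byte, 4096)
	timeouts := 0
	for {
		_, _, err := syscall.Recvfrom(fd, buf, 0)
		if err != nil {
			if err == syscall.EINTR {
				continue // interrupted by a signal, retry
			}
			if err == syscall.EAGAIN {
				// The receive timeout expired.
				timeouts++
				if s.getFd() == -1 {
					// Socket closed: exit so the goroutine does not leak.
					return timeouts, nil
				}
				continue // plain timeout: loop again without logging
			}
			return timeouts, err
		}
		// A real watcher would process the received message here.
	}
}

// demo wires the pieces together with short timings so it finishes quickly.
func demo() (int, error) {
	fds, err := syscall.Socketpair(syscall.AF_UNIX, syscall.SOCK_DGRAM, 0)
	if err != nil {
		return 0, err
	}
	defer syscall.Close(fds[0])
	defer syscall.Close(fds[1])

	tv := syscall.NsecToTimeval((50 * time.Millisecond).Nanoseconds())
	if err := syscall.SetsockoptTimeval(fds[0], syscall.SOL_SOCKET, syscall.SO_RCVTIMEO, &tv); err != nil {
		return 0, err
	}
	s := &watchedSocket{fd: fds[0]}
	time.AfterFunc(120*time.Millisecond, s.markClosed)
	return watch(s, fds[0])
}

func main() {
	n, err := demo()
	fmt.Println("timeouts observed:", n, "err:", err)
}
```

Without the receive timeout, the loop would stay blocked in recvfrom and never reach the closed check.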
a8dd1c0 to e78b4ff
@antonybichon17 test added
LGTM
drivers/overlay/ov_network.go
Outdated
@@ -713,14 +719,18 @@ func (n *network) watchMiss(nlSock *nl.NetlinkSocket) {
	t := time.Now()
	for {
		msgs, err := nlSock.Receive()
		if err != nil {
		// When the receive timeout expires the receive will return EAGAIN
		if err == syscall.EAGAIN {
minor nit
if err != nil {
	if err == syscall.EAGAIN {
		....
	}
	....
}
ipvs/ipvs.go
Outdated
	"github.com/vishvananda/netlink/nl"
	"github.com/vishvananda/netns"
)

var netlinkSocketsTimeout = 3 * time.Second
For my understanding, is this a standard timeout?
No, we chose it ourselves. The tradeoff is between waking up too often and staying blocked for too long. I don't have an opinion on what the "right" value is here.
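As a side note on how such a value reaches the socket: timeout options take a struct timeval, so the Go duration has to be converted first. A minimal sketch using the standard syscall package follows; the helper names toTimeval and timevalParts are hypothetical and not part of this PR.

```go
package main

import (
	"fmt"
	"syscall"
	"time"
)

// toTimeval converts a Go duration into the struct timeval form that
// socket timeout options expect.
func toTimeval(d time.Duration) syscall.Timeval {
	return syscall.NsecToTimeval(d.Nanoseconds())
}

// timevalParts returns the seconds and microseconds of the converted
// value as plain int64s, which is convenient for printing and checking.
func timevalParts(d time.Duration) (sec, usec int64) {
	tv := toTimeval(d)
	return int64(tv.Sec), int64(tv.Usec)
}

func main() {
	// 3 seconds is the value discussed in this thread.
	sec, usec := timevalParts(3 * time.Second)
	fmt.Printf("sec=%d usec=%d\n", sec, usec)
}
```

A shorter value makes a blocked receiver notice a closed socket sooner, at the cost of more wakeups while idle.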
ipvs/ipvs.go
Outdated
	if err := sock.SetReceiveTimeout(&tv); err != nil {
		return nil, err
	}
	if err := sock.SetReceiveTimeout(&tv); err != nil {
Why is this done twice?
Good catch, it should be ReceiveTimeout and SendTimeout.
Shouldn't netlinkSendSocketTimeout be netlinkRecvSocketsTimeout here?
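The two review points above (the receive timeout was set twice, and each direction should use its own constant) can be illustrated with a small sketch. It is an assumption-laden illustration rather than the PR's code: it sets SO_RCVTIMEO and SO_SNDTIMEO directly through the standard syscall package on a Unix socketpair, assumes Linux, and the names setTimeouts, recvTimedOut, recvTimeout, and sendTimeout are hypothetical. The durations are shortened so the demo finishes quickly.

```go
package main

import (
	"fmt"
	"syscall"
	"time"
)

// Illustrative values only; the constants in this thread are in seconds.
const (
	recvTimeout = 50 * time.Millisecond
	sendTimeout = 500 * time.Millisecond
)

// setTimeouts applies a distinct timeout to each direction, so that the
// receive value is not accidentally applied twice.
func setTimeouts(fd int, recv, send time.Duration) error {
	rtv := syscall.NsecToTimeval(recv.Nanoseconds())
	if err := syscall.SetsockoptTimeval(fd, syscall.SOL_SOCKET, syscall.SO_RCVTIMEO, &rtv); err != nil {
		return fmt.Errorf("set receive timeout: %w", err)
	}
	stv := syscall.NsecToTimeval(send.Nanoseconds())
	if err := syscall.SetsockoptTimeval(fd, syscall.SOL_SOCKET, syscall.SO_SNDTIMEO, &stv); err != nil {
		return fmt.Errorf("set send timeout: %w", err)
	}
	return nil
}

// recvTimedOut reports whether a receive on an idle socket came back
// with EAGAIN once the receive timeout expired.
func recvTimedOut() (bool, error) {
	fds, err := syscall.Socketpair(syscall.AF_UNIX, syscall.SOCK_DGRAM, 0)
	if err != nil {
		return false, err
	}
	defer syscall.Close(fds[0])
	defer syscall.Close(fds[1])

	if err := setTimeouts(fds[0], recvTimeout, sendTimeout); err != nil {
		return false, err
	}
	buf := make([]byte, 64)
	for {
		_, _, err := syscall.Recvfrom(fds[0], buf, 0)
		if err == syscall.EINTR {
			continue // interrupted by a signal, retry
		}
		return err == syscall.EAGAIN, nil
	}
}

func main() {
	ok, err := recvTimedOut()
	fmt.Println("receive timed out:", ok, "err:", err)
}
```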
ipvs/netlink.go
Outdated
	)

	if err := s.Send(req); err != nil {
	err := s.Send(req)
nit:
if err := s.Send(req); err != nil {
	.....
}
ipvs/netlink.go
Outdated
@@ -221,6 +218,12 @@ func execute(s *nl.NetlinkSocket, req *nl.NetlinkRequest, resType uint16) ([][]b
done:
	for {
		msgs, err := s.Receive()
		if err == syscall.EAGAIN {
Same as above.
if err != nil {
	if err == syscall.EAGAIN {
		....
	}
	....
}
19f93aa to fc7323d
	"github.com/vishvananda/netlink/nl"
	"github.com/vishvananda/netns"
)

const (
	netlinkRecvSocketsTimeout = 3 * time.Second
	netlinkSendSocketTimeout  = 30 * time.Second
This is set to a higher value to avoid timeout errors when the system is busy. I noticed that the error coming from a write failure to IPVS is not propagated up, so if addLBBackend fails to configure the backend, the error is not handled at all. I will follow up on this in a different PR, though, because the change is potentially non-trivial.
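To make the "error is not propagated up" point concrete, here is a minimal sketch of what propagating a send failure to the caller could look like. It is not the follow-up change mentioned above, whose shape is not given in this thread; configureBackend, failingSend, and isTimeout are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
	"syscall"
)

// configureBackend is a hypothetical stand-in for a call that writes a
// configuration request. It wraps and returns the send error instead of
// dropping it.
func configureBackend(send func() error) error {
	if err := send(); err != nil {
		return fmt.Errorf("configuring backend: %w", err)
	}
	return nil
}

// failingSend simulates a send that hits the socket send timeout.
func failingSend() error {
	return syscall.EAGAIN
}

// isTimeout lets the caller tell a timeout apart from other failures,
// even after the error has been wrapped.
func isTimeout(err error) bool {
	return errors.Is(err, syscall.EAGAIN)
}

func main() {
	err := configureBackend(failingSend)
	if isTimeout(err) {
		fmt.Println("send timed out:", err)
	}
}
```

Wrapping with %w keeps the original errno reachable, so callers further up can still decide whether to retry.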
drivers/overlay/ov_network.go
Outdated
	n.Unlock()
	if nlFd == -1 {
		// The netlink socket got closed, simply exit to not leak this goroutine
		return
Shouldn't we retain this logic for all cases?
If the file descriptor of the netlink socket is closed, recvfrom does not return. This may create deadlock conditions. The current solution is to make sure that every netlink socket in use has a proper timeout set on it, so that the call has the possibility to return.

Added test to emulate the watchMiss condition

Signed-off-by: Flavio Crisciani <flavio.crisciani@docker.com>
- needed the methods to set the proper timeout

Signed-off-by: Flavio Crisciani <flavio.crisciani@docker.com>
fc7323d to 45ea903
LGTM
Add netlink timeouts to all the netlink operations
Vendored netlink library to pick the required methods (temporary my branch)