Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Netlink timeouts #1976

Merged
merged 2 commits into from
Dec 4, 2017
Merged

Netlink timeouts #1976

merged 2 commits into from
Dec 4, 2017

Conversation

fcrisciani
Copy link

@fcrisciani fcrisciani commented Oct 10, 2017

Add netlink timeouts to all the netlink operations
Vendored netlink library to pick the required methods (temporary my branch)

@fcrisciani
Copy link
Author

@mavenugo can you give a pass to the changes, the risk is that the new vendoring of the netlink is pretty heavy in changes. We should try to reproduce the issue and then apply this patch

@codecov-io
Copy link

codecov-io commented Oct 10, 2017

Codecov Report

❗ No coverage uploaded for pull request base (master@460ac8f). Click here to learn what that means.
The diff coverage is 13.04%.

Impacted file tree graph

@@            Coverage Diff            @@
##             master    #1976   +/-   ##
=========================================
  Coverage          ?   38.94%           
=========================================
  Files             ?      137           
  Lines             ?    27390           
  Branches          ?        0           
=========================================
  Hits              ?    10666           
  Misses            ?    15426           
  Partials          ?     1298
Impacted Files Coverage Δ
ipvs/netlink.go 0% <0%> (ø)
ipvs/ipvs.go 0% <0%> (ø)
drivers/overlay/ov_network.go 2.07% <33.33%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 460ac8f...e78b4ff. Read the comment docs.

@andrewhsu
Copy link
Member

How do I make sure this PR does the fix? cc @antonybichon17

@fcrisciani
Copy link
Author

@andrewhsu the deadlock condition triggered on socket closed is tested in the netlink library project. This commit simply uses the same timeout

@antonybichon17
Copy link

You are using the timeout in a different part of the code, triggered by different actions.
So from your perspective @fcrisciani no additional tests are needed to cover the changes you made?

@fcrisciani
Copy link
Author

@antonybichon17 for the validation IMHO we should try to reproduce the customer deadlock that would be the best way to have the validation.
I will try to see if there can an easy way to add a test here to but still would not be the case experienced by the customer, but more the same unit test of the library

@@ -713,14 +719,17 @@ func (n *network) watchMiss(nlSock *nl.NetlinkSocket) {
t := time.Now()
for {
msgs, err := nlSock.Receive()
if err != nil {
// When the receive timeout expires the receive will return EAGAIN
if err == syscall.EAGAIN {
n.Lock()
nlFd := nlSock.GetFd()
n.Unlock()
if nlFd == -1 {
// The netlink socket got closed, simply exit to not leak this goroutine
return
}
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

better to add a continue here to avoid spam logs

@fcrisciani fcrisciani force-pushed the netlink-timeouts branch 2 times, most recently from a8dd1c0 to e78b4ff Compare November 17, 2017 23:16
@fcrisciani
Copy link
Author

@antonybichon17 test added

Copy link
Contributor

@vieux vieux left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@thaJeztah
Copy link
Member

ping @mavenugo @abhi PTAL

@@ -713,14 +719,18 @@ func (n *network) watchMiss(nlSock *nl.NetlinkSocket) {
t := time.Now()
for {
msgs, err := nlSock.Receive()
if err != nil {
// When the receive timeout expires the receive will return EAGAIN
if err == syscall.EAGAIN {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor nit

if err != nil{
    if err == syscall.EAGAIN{
     ....
    }
    ....
}

ipvs/ipvs.go Outdated
"github.com/vishvananda/netlink/nl"
"github.com/vishvananda/netns"
)

var netlinkSocketsTimeout = 3 * time.Second
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For my understanding is this a standard timeout ?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nope is decided from us, the tradeoff is between waking up too ofter or being block too much time. Don't have an opinion on how much is the "right" time here

ipvs/ipvs.go Outdated
if err := sock.SetReceiveTimeout(&tv); err != nil {
return nil, err
}
if err := sock.SetReceiveTimeout(&tv); err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this done twice ?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch it should be ReceiveTimeout and SendTimeout

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't netlinkSendSocketTimeout be netlinkRecvSocketsTimeout here ?

ipvs/netlink.go Outdated
)

if err := s.Send(req); err != nil {
err := s.Send(req)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit :

 if err := s.Send(req);err != nil {
 .....
}

ipvs/netlink.go Outdated
@@ -221,6 +218,12 @@ func execute(s *nl.NetlinkSocket, req *nl.NetlinkRequest, resType uint16) ([][]b
done:
for {
msgs, err := s.Receive()
if err == syscall.EAGAIN {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as above.

if err != nil {
    if err == syscall.EAGAIN {
    ....
    }
    ....
}

@fcrisciani fcrisciani force-pushed the netlink-timeouts branch 2 times, most recently from 19f93aa to fc7323d Compare November 30, 2017 20:50
"github.com/vishvananda/netlink/nl"
"github.com/vishvananda/netns"
)

const (
netlinkRecvSocketsTimeout = 3 * time.Second
netlinkSendSocketTimeout = 30 * time.Second
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is set to ah higher value to avoid timeout error in case the system is busy. I noticed that the error coming from the write failure to IPVS is not percolated up, so if the addLBBackend backend failed in the configuration, the error is not handled at all. I will follow up on this on a different PR though because the change will be potentially non trivial.

n.Unlock()
if nlFd == -1 {
// The netlink socket got closed, simply exit to not leak this goroutine
return
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldnt we retain this logic for all cases ?

Flavio Crisciani added 2 commits December 4, 2017 09:40
In case the file descriptor of the netlink socket is closed
the recvfrom is not returning. This may create deadlock conditions.
The current solution is to make sure that all the netlink socket used
have a proper timeout set on them to have the possibility to return

Added test to emulate the watchMiss condition

Signed-off-by: Flavio Crisciani <flavio.crisciani@docker.com>
- needed the methods to set the proper timeout

Signed-off-by: Flavio Crisciani <flavio.crisciani@docker.com>
Copy link
Contributor

@mavenugo mavenugo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@mavenugo mavenugo merged commit 16c634f into moby:master Dec 4, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants