Handle SetReadDeadline error for websocket connections (#1292) #1310

Merged: 7 commits, Mar 21, 2018

Conversation

@aaithal aaithal commented Mar 21, 2018

Summary

Added error handling for SetReadDeadline(). This should take care of scenarios where stale websocket connections linger (#1292)

Implementation details

  1. Forcefully terminate connection if there's any error with the
    SetReadDeadline() call. Not doing this leads to unpredictable behavior
    with stale websocket connections.

    Since SetWriteDeadline always returns nil, errors from that are not
    handled.

  2. Move WebsocketConn to its own package. This allows us to mock it
    out in the "wsclient" package and write unit tests on it.

Testing

  • Builds on Linux (make release)
  • Builds on Windows (go build -o amazon-ecs-agent.exe ./agent)
  • Unit tests on Linux (make test) pass
  • Unit tests on Windows (go test -timeout=25s ./agent/...) pass
  • Integration tests on Linux (make run-integ-tests) pass
  • Integration tests on Windows (.\scripts\run-integ-tests.ps1) pass
  • Functional tests on Linux (make run-functional-tests) pass
  • Functional tests on Windows (.\scripts\run-functional-tests.ps1) pass

New tests cover the changes: Yes

Description for the changelog

Licensing

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

4.9 reached EOL in 2016 and is not available on Dockerhub anymore.
Update the base image to 6.4 to fix this.
We're at least 2 years behind in this repo. Updated gomock to latest
commit.
[1] Forcefully terminate connection if there's any error with the
SetReadDeadline() call. Not doing this leads to unpredictable behavior
with stale websocket connections.

Since SetWriteDeadline always returns `nil`, errors from that are not
handled.

[2] Move WebsocketConn to its own package. This allows us to mock it
out in the "wsclient" package and write unit tests on it.
Invoke 'goimports' while generating sdk artifacts. Generated targets
were missing some imports and this fixes that issue.

Also, refactored the code a bit to reduce duplication.
aaithal added a commit to aaithal/amazon-ecs-agent that referenced this pull request Mar 21, 2018
Added changelog entry for stale websocket connection bug fix
@adnxn adnxn left a comment

code lgtm, just a few questions.

}
// An unhandled error has occurred while trying to extend read deadline.
// Try asynchronously closing the connection. We don't want to be blocked on stale connections
// taking too long to close. The flip side is that we might start accumulating stale connections.
Contributor

mainly for my understanding - what exactly happens if we "start accumulating stale connections"?

Contributor Author

if the issue is chronic, at some point the OS is going to stop the ECS agent from creating new connections because of file descriptor limits.

Contributor

How do we know that? Do we have a test that ensures that the expected behavior happens when we hit the limit?
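
For context on the limit being discussed: on Unix-like systems every open socket consumes a file descriptor, and the per-process ceiling can be read with getrlimit. The sketch below only shows how to inspect that ceiling from Go; it is not taken from the PR, the helper name openFileLimit is made up for this example, and it does not apply to Windows.

```go
package main

import (
	"fmt"
	"syscall"
)

// openFileLimit returns the soft and hard RLIMIT_NOFILE values for the
// current process. Hypothetical helper for illustration; Unix-like only.
func openFileLimit() (soft, hard uint64, err error) {
	var lim syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &lim); err != nil {
		return 0, 0, err
	}
	return uint64(lim.Cur), uint64(lim.Max), nil
}

func main() {
	soft, hard, err := openFileLimit()
	if err != nil {
		fmt.Println("unable to read file descriptor limit:", err)
		return
	}
	// Once the number of open descriptors reaches the soft limit, calls that
	// need a new descriptor (such as dialing a connection) start to fail.
	fmt.Printf("soft limit: %d, hard limit: %d\n", soft, hard)
}
```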

seelog.Warnf("Unable to close websocket connection: %v for %s",
closeErr, cs.URL)
}
return err
Contributor

again, mainly for my understanding. what exactly happens if we're "unable to close websocket connection" here? seems like we just move along. what side effects in the agent should we expect if we're unable to close lots of these websocket connections here?

Contributor Author

same as above. lmk if you have more questions.

@jhaynes jhaynes added this to the 1.17.3 milestone Mar 21, 2018
@@ -429,6 +429,7 @@ func newDisconnectionTimer(client wsclient.ClientServer, timeout time.Duration,
if err := client.Close(); err != nil {
seelog.Warnf("Error disconnecting: %v", err)
}
seelog.Info("Disconnected from ACS")
Contributor

nit: I think copyright line in this file needs updating too.

@@ -11,7 +11,7 @@
# express or implied. See the License for the specific language governing
# permissions and limitations under the License.

-FROM gcc:4.9
+FROM gcc:6.4

Can we just use the latest here, in case we will need to update again in the future?

Contributor Author

I don't want to use latest as it makes it hard to audit and track changes

}()
ctx, cancel := context.WithTimeout(context.TODO(), wsConnectTimeout)
defer cancel()
for {

looks like this doesn't need to be in the for loop, as both of the cases will return?

Contributor Author

you're right. will modify that

@@ -36,6 +40,11 @@ import (

const dockerEndpoint = "/var/run/docker.sock"

// Close closes the underlying connection
func (cs *ClientServerImpl) Close() error {

Why is this implemented in the test file?

Contributor Author

ClientServerImpl is a partial implementation of the interface, which lacks the Close method. If this is not there, there'll be NPEs. I can document that to make it more clear.
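
The thread does not show how the missing method turns into a nil pointer error. One common way this happens in Go, shown here purely as an assumption about the mechanism, is a struct that embeds an interface and leaves it nil: the struct satisfies the interface at compile time, but calling a method it does not define itself panics at run time. All type names below are hypothetical.

```go
package main

import "fmt"

// ClientServer is a hypothetical two-method interface for this sketch.
type ClientServer interface {
	Connect() error
	Close() error
}

// partialClient embeds the interface (left nil) and defines only Connect.
// Close is promoted from the nil embedded value.
type partialClient struct {
	ClientServer
}

func (p *partialClient) Connect() error { return nil }

// completeClient supplies the missing method, as a test file might.
type completeClient struct {
	partialClient
}

func (c *completeClient) Close() error { return nil }

// closeSafely calls Close and reports whether the call panicked.
func closeSafely(cs ClientServer) (panicked bool) {
	defer func() {
		if r := recover(); r != nil {
			panicked = true
		}
	}()
	_ = cs.Close()
	return false
}

func main() {
	fmt.Println("partial panics:", closeSafely(&partialClient{}))
	fmt.Println("complete panics:", closeSafely(&completeClient{}))
}
```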

@petderek petderek left a comment

Awesome, and good improvement to the test structure!

}()
ctx, cancel := context.WithTimeout(context.TODO(), wsConnectTimeout)
defer cancel()
for {
Contributor

Why do we need a loop if this only runs once? Select blocks forever and both of the cases return.

seelog.Errorf("Stopping redundant reads on closed network connection: %s", cs.URL)
return opErr
}
// An unhandled error has occurred while trying to extend read deadline.
Contributor

Two things stick out to me here. They're stylistic things, so FFTI:

  • This doesn't follow the standard err != nil pattern where only good things happen on the left side.
  • In my opinion, all of the second half of this should be in a separate method from SetReadDeadline.

Let's say you extracted the server close logic into closeCSAsync. I'd maybe go with a structure like this:

err := cs.Conn.SetReadDeadline()
if err != nil {
    // check for opErr
    //     return opErr
    servErr := closeCSAsync()
    // check for servErr
}

seelog.Warnf("Context canceled waiting for termination of websocket connection: %v for %s",
ctx.Err(), cs.URL)
}
return err
Contributor

nit:
It's a bit annoying to figure out what err is all the way down here. Maybe make it clear which err is getting returned in a comment? (It's the original deadline err, right?)

aaithal added a commit to aaithal/amazon-ecs-agent that referenced this pull request Mar 21, 2018
Added changelog entry for stale websocket connection bug fix
@sharanyad sharanyad left a comment

+1 for mock updates!

@@ -60,6 +63,8 @@ const (

// Default NO_PROXY env var IP addresses
defaultNoProxyIP = "169.254.169.254,169.254.170.2"

errClosed = "use of closed network connection"
Contributor

Can we have the error name more descriptive like errClosedConnection or similar?

Contributor Author

errClosed is pretty descriptive, no? I can change it if you have a strong preference

seelog.Warnf("Unable to set read deadline for websocket connection: %v for %s", err, cs.URL)
// If we get connection closed error from SetReadDeadline, break out of the for loop and
// return an error
if opErr, ok := err.(*net.OpError); ok && strings.Contains(opErr.Err.Error(), errClosed) {
Contributor

Should we do a cs.Close() somewhere in this case too?

Contributor Author

I don't think so since it's already closed.

Added missing copyright headers for a couple of files
@aaithal aaithal merged commit 3f2fecc into aws:dev Mar 21, 2018

gomock "github.com/golang/mock/gomock"
)

-// Mock of FileSystem interface
+// MockFileSystem is a mock of FileSystem interface
Contributor

Nit: // MockFileSystem is a mock of the FileSystem interface

api "github.com/aws/amazon-ecs-agent/agent/api"
ecs "github.com/aws/amazon-ecs-agent/agent/ecs_client/model/ecs"
gomock "github.com/golang/mock/gomock"
)

-// Mock of ECSSDK interface
+// MockECSSDK is a mock of ECSSDK interface
Contributor

Nit: *the ECSSDK interface
