Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add timeout for setting up pause container #1358

Merged
merged 2 commits into from
May 1, 2018

Conversation

richardpen
Copy link

@richardpen richardpen commented Apr 23, 2018

Summary

Added timeout for SetupNS.

Implementation details

Testing

  • Builds on Linux (make release)
  • Builds on Windows (go build -out amazon-ecs-agent.exe ./agent)
  • Unit tests on Linux (make test) pass
  • Unit tests on Windows (go test -timeout=25s ./agent/...) pass
  • Integration tests on Linux (make run-integ-tests) pass
  • Integration tests on Windows (.\scripts\run-integ-tests.ps1) pass
  • Functional tests on Linux (make run-functional-tests) pass
  • Functional tests on Windows (.\scripts\run-functional-tests.ps1) pass

New tests cover the changes:

Description for the changelog

Licensing

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@richardpen richardpen requested review from adnxn, aaithal and a team April 23, 2018 21:24
Copy link
Contributor

@sharanyad sharanyad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how did we come up with a timeout value of of 1 minute?


select {
case <-ctx.Done():
return nil, errors.Errorf("cni setup: setup the container namespace timedout")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this can just be errors.New?
also, message can be "cni: container namespace setup timed out"

@@ -61,6 +61,7 @@ const (
labelTaskDefinitionFamily = labelPrefix + "task-definition-family"
labelTaskDefinitionVersion = labelPrefix + "task-definition-version"
labelCluster = labelPrefix + "cluster"
cniSetupTimeoutDuration = 1 * time.Minute
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: this can just be cniSetupTimeout

@richardpen
Copy link
Author

Based on the test, it seems this process is really fast on different instance <5s. But in the cni plugin itself, there are some retry and timeout(maximum is 30s) for retrieving instance metadata and ipv6 gateway, so I doubled it to 1mins.

@richardpen richardpen added this to the 1.17.3 milestone Apr 24, 2018
Copy link
Contributor

@adnxn adnxn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changes lgtm, just minor edit to error msg wording as @sharanyad called out.

@@ -69,7 +70,24 @@ func NewClient(cfg *Config) CNIClient {

// SetupNS will set up the namespace of container, including create the bridge
// and the veth pair, move the eni to container namespace, setup the routes
func (client *cniClient) SetupNS(cfg *Config) (*current.Result, error) {
func (client *cniClient) SetupNS(cfg *Config, ctx context.Context) (*current.Result, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please make sure that ctx is the first argument here.

@@ -38,7 +39,7 @@ type CNIClient interface {
// Capabilities returns the capabilities supported by a plugin
Capabilities(string) ([]string, error)
// SetupNS sets up the namespace of container
SetupNS(*Config) (*current.Result, error)
SetupNS(*Config, context.Context) (*current.Result, error)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you make sure that CleanupNS() also has a context wired?

var err error
setupDone := make(chan struct{})
go func() {
result, err = client.setupNS(cfg)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if this actually times out to the plugin execution path? Will they still be invoked in an orphan process? Can you please add details about how you tested this? What the behavior is? Essentially, how safe is it to do this?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the timeout happens, agent will move the task to stopped but the plugin was still running until it wasn't stuck. Ideally we should stop the process, but we don't know the plugin process id, I have opened the issue in libcni for adding the functionality to stop the plugin process after a timeout. The test I did is having two awsvpc task, the first was timedout during cni setup, and the second task can still progress successfully.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we try to invoke CleanupNS() in that scenario then (if SetupNS() times out invoke CleanupNS())?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The CleanupNS is already called after the setupNS timeout: when setupNS timeout, the task desired status was moved to stop, and agent will try to stop pause container at which time the CleanupNS is called.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please create an issue in the agent repo to track this as well (and link the cni issue to that)? It doesn't seem super optimal to just leave a process hanging if we decide to give up early on it.

var err error
setupDone := make(chan struct{})
go func() {
result, err = client.setupNS(cfg)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even though it's not today, it seems like this could get racy in the future. Instead of this semi-global var pattern, can you instead pass an intermediate struct back in setupDone channel?


select {
case <-ctx.Done():
return nil, errors.New("cni setup: container namespace setup timed out")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you should also check for ctx.Err() here and wrap that if non nil instead of always assuming that context timed out.


select {
case <-ctx.Done():
return errors.New("cni cleanup: container namespace cleanup timed out")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same comment as above regarding ctx.Err()

CHANGELOG.md Outdated
@@ -1,5 +1,8 @@
# Changelog

## 1.17.4-dev
* Bug - Fixed a bug where task stuck in pending due to the cni setup process hangs [#1358](https://github.com/aws/amazon-ecs-agent/pull/1358)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about Fixed a bug where tasks could get stuck waiting for execution of CNI scripts?


select {
case <-ctx.Done():
return nil, errors.Wrap(ctx.Err(), "cni setup: container namespace setup failed")
Copy link
Contributor

@aaithal aaithal Apr 26, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you have to check if ctx.Err() is non nil. A much better model would be to create the derived context within this method and handle ctx.Err() == nil case as well.

You essentially do

 derivedCtx, cancel := context.WithTimeout(ctx, timeout)
 defer cancel()

 go func(...) {
 }

select {
 case <-ctx.Done():
  if ctc.Err() != nil {
   return nil, errors.Wrap()
 } else {
 return nil, errors.New()
 }
 ...
}

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on the document here, I don't think check ctx.Err()!=nil is required.

// If Done is not yet closed, Err returns nil.
// If Done is closed, Err returns a non-nil error explaining why:
// Canceled if the context was canceled
// or DeadlineExceeded if the context's deadline passed.
// After Err returns a non-nil error, successive calls to Err return the same error.

A much better model would be to create the derived context within this method

Yes, a timeout instead should be passed in instead of passing the context, will modify that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on the document here, I don't think check ctx.Err()!=nil is required.

Thanks! make sense.

@@ -110,7 +133,21 @@ func (client *cniClient) SetupNS(cfg *Config) (*current.Result, error) {

// CleanupNS will clean up the container namespace, including remove the veth
// pair and stop the dhclient
func (client *cniClient) CleanupNS(cfg *Config) error {
func (client *cniClient) CleanupNS(ctx context.Context, cfg *Config) error {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment about accepting timeout as a param and dealing with ctx.Err() as above.

@richardpen richardpen merged commit 6f3cf0f into aws:dev May 1, 2018
@aaithal aaithal modified the milestones: 1.17.4, 1.18.0 May 10, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants