Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

eni watcher: add backoff-retry #1148

Closed
wants to merge 2 commits into from

Conversation

aaithal
Copy link
Contributor

@aaithal aaithal commented Dec 21, 2017

Summary

eni watcher: add backoff-retry

Implementation details

  • This commit adds back-off retry logic in:
  • udev watcher: It takes a few milliseconds for the ENI's mac address to
    show up on the instance once udev watcher detects that a new device has
    been attached
  • eni state lookup: If for whatever reason, there's a delay in ACS
    sending the ENI attachment message to the agent, the agent's state
    remains unaware of the ENI
  • changelog entry

This should lead to improved task start latencies of awsvpc tasks.

Testing

  • Builds on Linux (make release)
  • Builds on Windows (go build -out amazon-ecs-agent.exe ./agent)
  • Unit tests on Linux (make test) pass
  • Unit tests on Windows (go test -timeout=25s ./agent/...) pass
  • Integration tests on Linux (make run-integ-tests) pass
  • Integration tests on Windows (.\scripts\run-integ-tests.ps1) pass
  • Functional tests on Linux (make run-functional-tests) pass
  • Functional tests on Windows (.\scripts\run-functional-tests.ps1) pass
  • Manual tests:
  • Time between AttachTaskNetworkInterfacesMessage and PayloadMessage without this change: 23s
  • Time between AttachTaskNetworkInterfacesMessage and PayloadMessage with this change: 11s

New tests cover the changes: yes

Description for the changelog

Licensing

This contribution is under the terms of the Apache 2.0 License: yes

This commit adds back-off retry logic in:
* udev watcher: It takes a few milliseconds for the ENI's mac address to
show up on the instance once udev watcher detects that a new device has
been attached
* eni state lookup: If for whatever reason, there's a delay in ACS
sending the ENI attachment message to the agent, the agent's state
remains unaware of the ENI

This should lead to improved task start latencies of awsvpc tasks.
@aaithal aaithal force-pushed the eniAttachmentLatencyImprovements branch from 1f3855f to 64377a9 Compare December 21, 2017 21:06
Copy link

@richardpen richardpen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor comments.

if err != nil {
return "", err
}
return link.Attrs().HardwareAddr.String(), err
// RetryWithBackoffCtx returns nil when the context is cancelled. Check if there was

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be better to check retriever.macAddress != "", though I don't strongly opposite this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I don't know if it gives us any additional benefits. i like dealing with errors rather than empty strings or nils. So, will keep it this way

@@ -57,6 +88,17 @@ type UdevWatcher struct {
primaryMAC string
}

// unmanagedENIError is used to indicate that the agent found an ENI, but the agent isn't

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be moved to another file?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking we may need a type.go for the struct and const, to split this file into smaller pieces which are easier to read.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'm not a fan of the type.go model. It's much more easier to navigate/organize the code where the struct and the methods are defined in the same file. We could move this into an errors.go file. But, seeing how it's just one type, I'd rather keep it here than to create a new file for it.

Copy link
Contributor

@adnxn adnxn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

code lgtm, just few questions.

)

const (
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how exactly did we determine the durations for these backoff times? same with the eni ones below.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

with experiments. It took an average of 10ms for the ENI's mac address to show up on the instance. The one for attachment message was based on my conversations with the ECS backend team.

splitDevLink := strings.SplitN(devicePath, "devices/", 2)
if len(splitDevLink) != 2 {
log.Warnf("Cannot determine device validity: %s", devicePath)
seelog.Warnf("Cannot determine device validity: %s", devicePath)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

its not clear to me why len(splitDevLink) != 2 means we cannot determine device validity. can we add some documentation? should we add the splitDevLink to log output as well? although, maybe i'm just missing context.

Copy link
Contributor Author

@aaithal aaithal Dec 22, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The example entries in the comment above are meant to illustrate that:

eth1 -> /devices/pci0000:00/0000:00:05.0/net/eth1

This means that we expect it to be of "devices/some-string" format. Is that not clear enough? It's just additional validation


// Error returns the error string for the unmanagedENIError type
func (err *unmanagedENIError) Error() string {
return fmt.Sprintf("udev watcher send ENI state change: eni not managed by ecs: %s", err.mac)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's not clear to me what we mean by udev watcher send ENI state change. the second part of the error message is fine though.

@@ -1,4 +1,4 @@
// Copyright 2014-2015 Amazon.com, Inc. or its affiliates. All Rights Reserved.
// Copyright 2014-2017 Amazon.com, Inc. or its affiliates. All Rights Reserved.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for documenting all these =]

dev := filepath.Base(retriever.dev)
link, err := retriever.netlinkClient.LinkByName(dev)
if err != nil {
return utils.NewRetriableError(utils.NewRetriable(false), err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you sure that we should never retry in this case?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, an error means a syscall failed for describing the device. at this point of time, it doesn't make sense to retry it as it really should be visible on the instance

@@ -4,6 +4,7 @@
* Feature - Support a HTTP endpoint for `awsvpc` tasks to query metadata
* Bug - Fixed a bug where `-version` fails due to its dependency on docker client. [#1118](https://github.com/aws/amazon-ecs-agent/pull/1118)
* Bug - Persist container exit code in agent state file [#1125](https://github.com/aws/amazon-ecs-agent/pull/1125)
* Bug - Fixed a bug where the agent could miss sending an ENI attachment to ECS because of address propagation delays [#1148](https://github.com/aws/amazon-ecs-agent/pull/1148)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you wrap this at 80 characters?


// macAddressBackoffJitter specifies the jitter multiple percentage when
// looking for an ENI's mac address on the host
macAddressBackoffJitter = 0.2
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think jitter is really necessary here since you're not making remote calls or accessing a resource with high concurrency. On the other hand, I don't think it really hurts either except to make the timing a bit more unpredictable.

if err != nil {
return "", err
}
return link.Attrs().HardwareAddr.String(), err
// RetryWithBackoffCtx returns nil when the context is cancelled. Check if there was
// a timeout here. TODO: Fix RetryWithBackoffCtx to return ctx.Err() on context Done()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oops, I think that's on me...

@nmeyerhans nmeyerhans mentioned this pull request Dec 27, 2017
8 tasks
@aaithal
Copy link
Contributor Author

aaithal commented Jan 5, 2018

closing this as this is already merged to the master branch with 1.16.1 release

@aaithal aaithal closed this Jan 5, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants