Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added ability to poll for container stats instead of constant stream #1646

Merged

Conversation

yhlee-aws
Copy link
Contributor

@yhlee-aws yhlee-aws commented Oct 24, 2018

Summary

Push blocked until next agent release
Allows for container stat polling instead of forcing constant stream. This allows for a significant decrease in CPU usage when a lot of containers are deployed. Duplicate/fix of #1475

addresses issue #1422

Implementation details

merging PR #1475 and fixing conflicts, and errors in unit tests.

Testing

  • Builds on Linux (make release)
  • Builds on Windows (go build -out amazon-ecs-agent.exe ./agent)
  • Unit tests on Linux (make test) pass
  • Unit tests on Windows (go test -timeout=25s ./agent/...) pass
  • Integration tests on Linux (make run-integ-tests) pass
  • Integration tests on Windows (.\scripts\run-integ-tests.ps1) pass
  • Functional tests on Linux (make run-functional-tests) pass
  • Functional tests on Windows (.\scripts\run-functional-tests.ps1) pass

New tests cover the changes: yes

Description for the changelog

Licensing

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@yhlee-aws yhlee-aws added this to the 1.22.1 milestone Oct 24, 2018
@yhlee-aws yhlee-aws requested a review from a team October 24, 2018 19:46
@yhlee-aws yhlee-aws force-pushed the inmar-add-polling-option-for-getting-metrics branch from 07d1d5d to 3a831a5 Compare October 24, 2018 20:00
Copy link
Contributor

@adnxn adnxn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changes lgtm, thanks @jcbowman

@yunhee-l: does this need an integ test to test the polling behavior?

Copy link
Contributor

@linkar-ec2 linkar-ec2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good overall.

@@ -104,6 +104,8 @@ additional details on each available environment variable.
| `ECS_UPDATES_ENABLED` | <true | false> | Whether to exit for an updater to apply updates when requested. | false | false |
| `ECS_UPDATE_DOWNLOAD_DIR` | /cache | Where to place update tarballs within the container. | | |
| `ECS_DISABLE_METRICS` | <true | false> | Whether to disable metrics gathering for tasks. | false | true |
| `ECS_POLL_METRICS` | <true | false> | Whether to poll or stream when gathering metrics for tasks. | false | false |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this an Agent-specific decision to be able to turn off metrics completely? (How does this affect TACS if two Agents in a cluster poll at different frequencies?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems that way, if this is turned off we fall back to the streaming

} else {
seelog.Infof("DockerGoClient: Starting to Poll for metrics")
//Sleep for a random period time up to the polling interval. This will help make containers ask for stats at different times
time.Sleep(time.Second * time.Duration(rand.Intn(int(dg.config.PollingMetricsWaitDuration.Seconds()))))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good intuition but I don't think this is needed. The only time all containers of a Task would poll at the same time is if Agent stops and starts while the Task is running. Otherwise, assuming go routines run at approximately the same speed (or even randomly), containers are added to the stats engine iteratively so they will already be staggered to some degree.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there is value in adding extra jitter here.

Copy link
Contributor

@sharanyad sharanyad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please add unit tests

cfg.platformOverrides()

return nil
}

func (cfg *Config) pollMetricsOverrides() {
if cfg.PollMetrics {

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: extra line

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@@ -81,6 +85,14 @@ const (
// 'stuck' in the pull / unpack step. Very small values are unsafe and lead to high failure rate.
minimumImagePullInactivityTimeout = 1 * time.Minute

// minimumPollingMetricsWaitDuration specifies the minimum duration to wait before polling for new stats
// from docker. This is only used when PollMetrics is set to true
minimumPollingMetricsWaitDuration = 1 * time.Second
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just curious, how did we come up with the min and max values?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

customer did, there are some testing/benchmarking info in the original PR

CHANGELOG.md Outdated
@@ -1,4 +1,7 @@
# Changelog
## 1.22.1-dev
* Enhancement - Configurable poll duration for container stats [@jcbowman](https://github.com/jcbowman) [#1646](https://github.com/aws/amazon-ecs-agent/pull/1646)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be a feature?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

@@ -184,6 +189,12 @@ func TestInvalidLoggingDriver(t *testing.T) {
assert.Error(t, conf.validateAndOverrideBounds(), "Should be error with invalid-logging-driver")
}

func TestDefaultPollMetricsWithoutECSDataDir(t *testing.T) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what does WithoutECSDataDir indicate? without config..?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that's what it means (see TestDefaultCheckpointWithoutECSDataDir below), if we do not have config file it should take default value.

}
}()
} else {
seelog.Infof("DockerGoClient: Starting to Poll for metrics")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we also log the container id here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added

seelog.Infof("DockerGoClient: Unable to retrieve stats for container %s: %v",
id, statsErr)
if !dg.config.PollMetrics {
seelog.Infof("DockerGoClient: Starting to Stream for metrics")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seelog.Info

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added container ID to the logging line instead

} else {
seelog.Infof("DockerGoClient: Starting to Poll for metrics")
//Sleep for a random period time up to the polling interval. This will help make containers ask for stats at different times
time.Sleep(time.Second * time.Duration(rand.Intn(int(dg.config.PollingMetricsWaitDuration.Seconds()))))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this mean this could end up sleeping for the whole wait duration, skipping a cycle of poll metrics? not sure if this could lead to any unexpected behavior

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

at this point the polling cycle has not started, this is adding jitter before starting it.

@yhlee-aws yhlee-aws force-pushed the inmar-add-polling-option-for-getting-metrics branch from 95e0023 to 9902dd9 Compare October 29, 2018 20:15
@yhlee-aws
Copy link
Contributor Author

yhlee-aws commented Oct 29, 2018

BLOCKED: Holding off on adding unit tests, as sdk migration will end up in more conflicts. I'll update this PR again after migration branch is merged.

@yhlee-aws yhlee-aws modified the milestones: 1.22.1, 1.24.0 Dec 18, 2018
@yhlee-aws yhlee-aws force-pushed the inmar-add-polling-option-for-getting-metrics branch 3 times, most recently from a7ffa9b to 5d2e394 Compare December 20, 2018 00:47
@yhlee-aws yhlee-aws closed this Dec 20, 2018
@yhlee-aws yhlee-aws reopened this Dec 20, 2018
increasing fnl test timeout by 2 mins
@yhlee-aws yhlee-aws force-pushed the inmar-add-polling-option-for-getting-metrics branch from 973d1fc to d079426 Compare December 20, 2018 17:25
@yhlee-aws
Copy link
Contributor Author

Integ and functional tests have been added.

},
}
params.StartTime = aws.Time(RoundTimeUp(time.Now(), time.Minute).UTC())
params.EndTime = aws.Time((*params.StartTime).Add(waitMetricsInCloudwatchDuration).UTC())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i don't think the params here are used?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! I shortened this test's checks and didn't remove the set up part. Removed.

func TestStatsEngineWithNewContainersWithPolling(t *testing.T) {
// additional config fields to use polling instead of stream
cfg.PollMetrics = true
cfg.PollingMetricsWaitDuration = 1 * time.Second
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the cfg object is shared by many tests in this package.. we probably should set the value back if we are changing it for this test?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to clean up how cfg is shared by all integ tests, but it's a bigger refactoring job. I'll work on it separately, but for this one I reset it to false/default duration at the end of the test.

@yumex93
Copy link
Contributor

yumex93 commented Dec 21, 2018

Is this pr for merging dev to inmar-add-polling-option-for-getting-metrics or the other way?

@yhlee-aws
Copy link
Contributor Author

merging into dev

@fenxiong fenxiong mentioned this pull request Dec 21, 2018
8 tasks
@yhlee-aws yhlee-aws merged commit 20cdb82 into aws:dev Dec 21, 2018
@yhlee-aws yhlee-aws deleted the inmar-add-polling-option-for-getting-metrics branch July 3, 2019 17:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants