Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixed https://github.com/aws/amazon-ecs-agent/issues/506 #652

Merged
merged 1 commit into from
Jan 31, 2017

Conversation

liwenwu-amazon
Copy link
Contributor

@liwenwu-amazon liwenwu-amazon commented Dec 29, 2016

Summary

Fix issue #506

Implementation details

1 . Add a new field 'lastResetTime' to stats Queue struct, Set 'lastResetTime' whenever Reseting the Queue.
2. With 1, the caller can return an error if queue has NOT been filled with expected data points after a resetThreshold period
3. define a new error code engine.EmptyMetricsError, so that the caller can ignore this error when publishing metrics immediately after agent connect to backend. And return/handle this error in other situations.

Testing

  • Builds on Linux (make release)
  • Builds on Windows (go build -out amazon-ecs-agent.exe ./agent)
  • Unit tests on Linux (make test) pass
  • Unit tests on Windows (go test -timeout=25s ./agent/...) pass
  • Integration tests on Linux (make run-integ-tests) pass
  • Integration tests on Windows (.\scripts\run-integ-tests.ps1) pass
  • Functional tests on Linux (make run-functional-tests) pass
  • Functional tests on Windows (.\scripts\run-functional-tests.ps1) pass

New tests cover the changes:

Description for the changelog

Licensing

This contribution is under the terms of the Apache 2.0 License:

@richardpen @samuelkarp @aaithal

Copy link
Contributor

@aaithal aaithal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you also check why the functional tests failed with this commit?

@@ -35,6 +36,7 @@ import (
const (
containerChangeHandler = "DockerStatsEngineDockerEventsHandler"
listContainersTimeout = 10 * time.Minute
resetThreshold = 2 * ecsengine.StatsInactivityTimeout
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you rename this to be more descriptive? queueResetThreshold perhaps?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@@ -398,6 +401,13 @@ func (engine *DockerStatsEngine) getContainerMetricsForTask(taskArn string) ([]*
engine.doRemoveContainer(container, taskArn)
continue
}

if !container.statsQueue.resetThreshholdElapsed(resetThreshold) &&
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo: resetThreshholdElapsed -> resetThresholdElapsed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@@ -133,6 +137,19 @@ func getMemoryUsagePerc(s *UsageStats) float64 {

type getUsageFunc func(*UsageStats) float64

func (queue *Queue) resetThreshholdElapsed(timeOut time.Duration) bool {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should be timeout and not timeOut.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done


queue.Reset()

threshholdElapsed := queue.resetThreshholdElapsed(2 * time.Millisecond)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: threshholdElapsed -> thresholdElapsed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

queue.Reset()

threshholdElapsed := queue.resetThreshholdElapsed(2 * time.Millisecond)
if threshholdElapsed {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you use assert.False() instead of this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

time.Sleep(3 * time.Millisecond)
threshholdElapsed = queue.resetThreshholdElapsed(2 * time.Millisecond)

if !threshholdElapsed {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here. Please use the github.com/stretchr/testify/assert package's True() method here as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

queue := NewQueue(queueLength)

enoughDataPoints := queue.enoughDatapointsInBuffer()
if enoughDataPoints {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use assert.False() here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

}

enoughDataPoints = queue.enoughDatapointsInBuffer()
if !enoughDataPoints {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use assert.True() here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@@ -164,6 +170,7 @@ func (cs *clientServer) publishMetricsOnce() {
seelog.Warnf("Error publishing metrics: %v. Request: %v", err, request)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this return the error immediately as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes. done

cs.publishMetricsOnce()
err := cs.publishMetricsOnce()
if err != nil && err != stats.EmptyMetricsError {
seelog.Warnf("Error getting instance metrics: %v", err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you rephrase the error message? The error is from the the publishMetricsOnce() method. We can't be sure here if its because of publish failing or if there was an error getting metrics.

cs.publishMetricsOnce()
err := cs.publishMetricsOnce()
if err != nil {
seelog.Warnf("Error getting instance metrics: %v", err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above, can you rephrase the error message?

case <-cs.endPublish:
return
}
}
}

// publishMetricsOnce is invoked by the ticker to periodically publish metrics to backend.
func (cs *clientServer) publishMetricsOnce() {
func (cs *clientServer) publishMetricsOnce() error {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add/update unit tests to cover this change?

Copy link
Contributor

@aaithal aaithal Jan 4, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, the functional tests for windows seem to have stuck. Can you address that and make sure those pass as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Opened issue #663 and #662 to track unit-test failure on windows platform

@liwenwu-amazon liwenwu-amazon force-pushed the issue506 branch 3 times, most recently from eb31f63 to 1508e15 Compare January 5, 2017 07:46
Copy link
Contributor

@samuelkarp samuelkarp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you fix your pull request summary formatting? The checkboxes look malformed.

@@ -519,7 +519,7 @@ func TestHandlerReconnectDelayForInactiveInstanceError(t *testing.T) {
mockWsClient.EXPECT().Connect().Do(func() {
reconnectDelay := time.Now().Sub(firstConnectionAttemptTime)
t.Logf("Delay between successive connections: %v", reconnectDelay)
if reconnectDelay < inactiveInstanceReconnectDelay {
if reconnectDelay+1 < inactiveInstanceReconnectDelay {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why +1? Do you really mean to say reconnectDelay <= inactiveInstanceReconnectDelay?

The assertion error message also looks wrong. It should probably read "Reconnect delay %v is less than inactive instance reconnect delay %v".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On Window platform, we find intermittently that time.Now().Sub(...) can returns 199.9989ms even after code waiting for time.NewTimer(200) ms

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than blindly adding 1, can you:

  • Include this information about intermittent time subtraction issues in an explanatory comment
  • Explicitly allow for some slop in the assertion here (possibly up to 1ms?) by adjusting the expected time rather than the observed time and having a named value instead of a literal (use a constant or variable to describe the modifier).

@@ -398,6 +401,13 @@ func (engine *DockerStatsEngine) getContainerMetricsForTask(taskArn string) ([]*
engine.doRemoveContainer(container, taskArn)
continue
}

if !container.statsQueue.resetThresholdElapsed(queueResetThreshold) &&
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find this conditional pretty hard to read. I think it might be easier like this:

if !container.statsQueue.enoughDatapointsInBuffer() &&
	!container.statsQueue.resetThresholdElapsed(queueResetThreshold)

That lets it read more logically as "if there are not enough datapoints and the threshold is not elapsed".

@@ -152,6 +158,18 @@ func TestPublishMetricsRequest(t *testing.T) {

cs.Close()
}
func TestPublishMetricsOnceEmptyStatsError(t *testing.T) {

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

stray empty line

@liwenwu-amazon liwenwu-amazon force-pushed the issue506 branch 3 times, most recently from 56721da to 6d41ad1 Compare January 11, 2017 01:53
@@ -519,8 +519,10 @@ func TestHandlerReconnectDelayForInactiveInstanceError(t *testing.T) {
mockWsClient.EXPECT().Connect().Do(func() {
reconnectDelay := time.Now().Sub(firstConnectionAttemptTime)
t.Logf("Delay between successive connections: %v", reconnectDelay)
if reconnectDelay < inactiveInstanceReconnectDelay {
t.Errorf("Reconnect delay %v is not less than inactive instance reconnect delay %v", reconnectDelay, inactiveInstanceReconnectDelay)
// On window platform, we found issue with time.Now().Sub(...) reporting 199.9989 even after the code has already waited for time.NewTimer(200)ms.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/window/Windows/

t.Errorf("Reconnect delay %v is not less than inactive instance reconnect delay %v", reconnectDelay, inactiveInstanceReconnectDelay)
// On window platform, we found issue with time.Now().Sub(...) reporting 199.9989 even after the code has already waited for time.NewTimer(200)ms.
timeSubFuncSlopAllowed := 1 * time.Millisecond
if reconnectDelay+timeSubFuncSlopAllowed < inactiveInstanceReconnectDelay {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you change this to reconnectDelay < inactiveInstanceReconnectDelay - timeSubFuncSlopAllowed?

}
err := cs.publishMetricsOnce()

if err == nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

assert.NoError(err)

err := cs.publishMetricsOnce()

if err == nil {
t.Error("Failed: expecting publishMerticOnce return err ")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/Mertic/Metric/

Copy link
Contributor

@aaithal aaithal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you also make sure that copyright headers are up to date with the year 2017 in all of the files?

if reconnectDelay < inactiveInstanceReconnectDelay {
t.Errorf("Reconnect delay %v is not less than inactive instance reconnect delay %v", reconnectDelay, inactiveInstanceReconnectDelay)
// On window platform, we found issue with time.Now().Sub(...) reporting 199.9989 even after the code has already waited for time.NewTimer(200)ms.
timeSubFuncSlopAllowed := 1 * time.Millisecond
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not a fan of doing this math here. Can you please use assert.WithinDuration instead? https://godoc.org/github.com/stretchr/testify/assert#Assertions.WithinDuration

Writing less code is always better :)

assert.True(t, thresholdElapsed, "Queue is expected to elapse the threshold")

}
func TestEnoughDatapointsInBuffer(t *testing.T) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: new line needed

if reconnectDelay < inactiveInstanceReconnectDelay-timeSubFuncSlopAllowed {
t.Errorf("Reconnect delay %v is 1 ms less than inactive instance reconnect delay %v", reconnectDelay, inactiveInstanceReconnectDelay)
// On window platform, we found issue with time.Now().Sub(...) reporting 199.9989 even after the code has already waited for time.NewTimer(200)ms.
if reconnectDelay < inactiveInstanceReconnectDelay {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think you need this if block when using assert.WithinDuration

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are 2 conditions here

  1. to make sure reconnectDelay is "greater" than inactiveInstanceReconnectDelay
  2. or to make sure it is within duration of the 1 milisecond, if reconnectDelay is "less" than inactiveinstanceReconnectDelay
    The code here is intended for "1"

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that's entirely correct. I think the intention is to ensure that we wait for inactiveInstanceReconnectDelay between disconnects. It's >= inactiveInstanceReconnectDelay. We don't add any jitter when the disconnection is for an inactive instance. So, assert.WithinDuration() should be good enough with a buffer of 2ms.


if !container.statsQueue.enoughDatapointsInBuffer() &&
!container.statsQueue.resetThresholdElapsed(queueResetThreshold) {
seelog.Debug("Stats not ready for container %s", dockerID)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Debugf

Copy link
Contributor

@aaithal aaithal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tests are still failing. Please ensure that they succeed with the next revision.

t.Errorf("Reconnect delay %v is not less than inactive instance reconnect delay %v", reconnectDelay, inactiveInstanceReconnectDelay)
}
timeSubFuncSlopAllowed := 2 * time.Millisecond
// On window platform, we found issue with time.Now().Sub(...) reporting 199.9989 even after the code has already waited for time.NewTimer(200)ms.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As @samuelkarp mentioned in one of the earlier comments, there's no "window platform", please modify that. Also, please split the comment across multiple lines to reduce the line width.

queue.Reset()

thresholdElapsed := queue.resetThresholdElapsed(2 * time.Millisecond)
assert.False(t, thresholdElapsed, "Queue is NOT expected to elapsed the threshold right after reset")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the phrasing is correct here. The Queue does not elapse any threshold as it's not an entity that is time-bound. I think the better message here is "Queue reset threshold is not expected to elapse right after reset"

time.Sleep(3 * time.Millisecond)
thresholdElapsed = queue.resetThresholdElapsed(2 * time.Millisecond)

assert.True(t, thresholdElapsed, "Queue is expected to elapse the threshold")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above. I think the better message here is "Queue reset threshold is expected to elapse after waiting"

queue := NewQueue(queueLength)

enoughDataPoints := queue.enoughDatapointsInBuffer()
assert.False(t, enoughDataPoints, "Queue is expected NOT having enough data points right after creation")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Grammar: "Queue is expected to not have enough data ..."


queue.Reset()
enoughDataPoints = queue.enoughDatapointsInBuffer()
assert.False(t, enoughDataPoints, "Queue is expected NOT having enough data points right after RESET")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Grammar: "Queue is expected to not have enough data ..."

// On windows platform, we found issue with time.Now().Sub(...) reporting 199.9989 even
// after the code has already waited for time.NewTimer(200)ms.
assert.WithinDuration(t, reconnectDelayTime, firstConnectionAttemptTime.Add(inactiveInstanceReconnectDelay), timeSubFuncSlopAllowed,
"Reconnect delay %v is expected within %v of inactive instance reconnect delay %v",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't need a message this detailed; the assert library will do this for you.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mean

  1. I should remove comment (line 524 to line 525)?
    or
  2. rewording remove line 527?
    thanks

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reword line 527 and remove the format string. To see what I mean, you can temporarily make the assertion fail on your local machine and observe the error message that gets printed out.

@aaithal
Copy link
Contributor

aaithal commented Jan 24, 2017

@liwenwu-amazon I think the code changes look good. Can you manually test if this fix has any side-effects on a long-running (1h) agent? We basically want to verify that:

  1. Disconnect/Reconnect to ACS and TACS keep happening on a long running ECS Agent
  2. Service/Cluster stats have a consistent value for a fixed workload on a long running ECS Agent

@liwenwu-amazon
Copy link
Contributor Author

I have this image running for past 6 days. So far, I have not seen any warning messages. Nor I have seen any abnormal stats on "Metrics" dashboard on AWS console.

The CPUUtilization is kept at 0.03% average and few spikes maxed @ 0.2% . The CPU Reservation is stabilized at 50%. The MemoryUtilization is stabilized at 6% and memoryreservation is stabilized at 27%

@liwenwu-amazon liwenwu-amazon merged commit 1be5420 into aws:dev Jan 31, 2017
@samuelkarp samuelkarp added this to the 1.14.1 milestone Feb 1, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants