-
Notifications
You must be signed in to change notification settings - Fork 618
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fixed https://github.com/aws/amazon-ecs-agent/issues/506 #652
Conversation
03fb48c
to
eb96940
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you also check why the functional tests failed with this commit?
@@ -35,6 +36,7 @@ import ( | |||
const ( | |||
containerChangeHandler = "DockerStatsEngineDockerEventsHandler" | |||
listContainersTimeout = 10 * time.Minute | |||
resetThreshold = 2 * ecsengine.StatsInactivityTimeout |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you rename this to be more descriptive? queueResetThreshold
perhaps?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
@@ -398,6 +401,13 @@ func (engine *DockerStatsEngine) getContainerMetricsForTask(taskArn string) ([]* | |||
engine.doRemoveContainer(container, taskArn) | |||
continue | |||
} | |||
|
|||
if !container.statsQueue.resetThreshholdElapsed(resetThreshold) && |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo: resetThreshholdElapsed
-> resetThresholdElapsed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
@@ -133,6 +137,19 @@ func getMemoryUsagePerc(s *UsageStats) float64 { | |||
|
|||
type getUsageFunc func(*UsageStats) float64 | |||
|
|||
func (queue *Queue) resetThreshholdElapsed(timeOut time.Duration) bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this should be timeout
and not timeOut
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
|
||
queue.Reset() | ||
|
||
threshholdElapsed := queue.resetThreshholdElapsed(2 * time.Millisecond) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo: threshholdElapsed
-> thresholdElapsed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
queue.Reset() | ||
|
||
threshholdElapsed := queue.resetThreshholdElapsed(2 * time.Millisecond) | ||
if threshholdElapsed { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you use assert.False()
instead of this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
time.Sleep(3 * time.Millisecond) | ||
threshholdElapsed = queue.resetThreshholdElapsed(2 * time.Millisecond) | ||
|
||
if !threshholdElapsed { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here. Please use the github.com/stretchr/testify/assert
package's True()
method here as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
queue := NewQueue(queueLength) | ||
|
||
enoughDataPoints := queue.enoughDatapointsInBuffer() | ||
if enoughDataPoints { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use assert.False()
here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
} | ||
|
||
enoughDataPoints = queue.enoughDatapointsInBuffer() | ||
if !enoughDataPoints { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use assert.True()
here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
@@ -164,6 +170,7 @@ func (cs *clientServer) publishMetricsOnce() { | |||
seelog.Warnf("Error publishing metrics: %v. Request: %v", err, request) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this return the error immediately as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes. done
eb96940
to
0dd2e89
Compare
cs.publishMetricsOnce() | ||
err := cs.publishMetricsOnce() | ||
if err != nil && err != stats.EmptyMetricsError { | ||
seelog.Warnf("Error getting instance metrics: %v", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you rephrase the error message? The error is from the the publishMetricsOnce()
method. We can't be sure here if its because of publish failing or if there was an error getting metrics.
cs.publishMetricsOnce() | ||
err := cs.publishMetricsOnce() | ||
if err != nil { | ||
seelog.Warnf("Error getting instance metrics: %v", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above, can you rephrase the error message?
case <-cs.endPublish: | ||
return | ||
} | ||
} | ||
} | ||
|
||
// publishMetricsOnce is invoked by the ticker to periodically publish metrics to backend. | ||
func (cs *clientServer) publishMetricsOnce() { | ||
func (cs *clientServer) publishMetricsOnce() error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add/update unit tests to cover this change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, the functional tests for windows seem to have stuck. Can you address that and make sure those pass as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
eb31f63
to
1508e15
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you fix your pull request summary formatting? The checkboxes look malformed.
@@ -519,7 +519,7 @@ func TestHandlerReconnectDelayForInactiveInstanceError(t *testing.T) { | |||
mockWsClient.EXPECT().Connect().Do(func() { | |||
reconnectDelay := time.Now().Sub(firstConnectionAttemptTime) | |||
t.Logf("Delay between successive connections: %v", reconnectDelay) | |||
if reconnectDelay < inactiveInstanceReconnectDelay { | |||
if reconnectDelay+1 < inactiveInstanceReconnectDelay { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why +1
? Do you really mean to say reconnectDelay <= inactiveInstanceReconnectDelay
?
The assertion error message also looks wrong. It should probably read "Reconnect delay %v is less than inactive instance reconnect delay %v".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On Window platform, we find intermittently that time.Now().Sub(...) can returns 199.9989ms even after code waiting for time.NewTimer(200) ms
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than blindly adding 1, can you:
- Include this information about intermittent time subtraction issues in an explanatory comment
- Explicitly allow for some slop in the assertion here (possibly up to
1ms
?) by adjusting the expected time rather than the observed time and having a named value instead of a literal (use a constant or variable to describe the modifier).
@@ -398,6 +401,13 @@ func (engine *DockerStatsEngine) getContainerMetricsForTask(taskArn string) ([]* | |||
engine.doRemoveContainer(container, taskArn) | |||
continue | |||
} | |||
|
|||
if !container.statsQueue.resetThresholdElapsed(queueResetThreshold) && |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find this conditional pretty hard to read. I think it might be easier like this:
if !container.statsQueue.enoughDatapointsInBuffer() &&
!container.statsQueue.resetThresholdElapsed(queueResetThreshold)
That lets it read more logically as "if there are not enough datapoints and the threshold is not elapsed".
@@ -152,6 +158,18 @@ func TestPublishMetricsRequest(t *testing.T) { | |||
|
|||
cs.Close() | |||
} | |||
func TestPublishMetricsOnceEmptyStatsError(t *testing.T) { | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
stray empty line
56721da
to
6d41ad1
Compare
@@ -519,8 +519,10 @@ func TestHandlerReconnectDelayForInactiveInstanceError(t *testing.T) { | |||
mockWsClient.EXPECT().Connect().Do(func() { | |||
reconnectDelay := time.Now().Sub(firstConnectionAttemptTime) | |||
t.Logf("Delay between successive connections: %v", reconnectDelay) | |||
if reconnectDelay < inactiveInstanceReconnectDelay { | |||
t.Errorf("Reconnect delay %v is not less than inactive instance reconnect delay %v", reconnectDelay, inactiveInstanceReconnectDelay) | |||
// On window platform, we found issue with time.Now().Sub(...) reporting 199.9989 even after the code has already waited for time.NewTimer(200)ms. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/window/Windows/
t.Errorf("Reconnect delay %v is not less than inactive instance reconnect delay %v", reconnectDelay, inactiveInstanceReconnectDelay) | ||
// On window platform, we found issue with time.Now().Sub(...) reporting 199.9989 even after the code has already waited for time.NewTimer(200)ms. | ||
timeSubFuncSlopAllowed := 1 * time.Millisecond | ||
if reconnectDelay+timeSubFuncSlopAllowed < inactiveInstanceReconnectDelay { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you change this to reconnectDelay < inactiveInstanceReconnectDelay - timeSubFuncSlopAllowed
?
} | ||
err := cs.publishMetricsOnce() | ||
|
||
if err == nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
assert.NoError(err)
err := cs.publishMetricsOnce() | ||
|
||
if err == nil { | ||
t.Error("Failed: expecting publishMerticOnce return err ") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/Mertic/Metric/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you also make sure that copyright headers are up to date with the year 2017
in all of the files?
if reconnectDelay < inactiveInstanceReconnectDelay { | ||
t.Errorf("Reconnect delay %v is not less than inactive instance reconnect delay %v", reconnectDelay, inactiveInstanceReconnectDelay) | ||
// On window platform, we found issue with time.Now().Sub(...) reporting 199.9989 even after the code has already waited for time.NewTimer(200)ms. | ||
timeSubFuncSlopAllowed := 1 * time.Millisecond |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not a fan of doing this math here. Can you please use assert.WithinDuration
instead? https://godoc.org/github.com/stretchr/testify/assert#Assertions.WithinDuration
Writing less code is always better :)
assert.True(t, thresholdElapsed, "Queue is expected to elapse the threshold") | ||
|
||
} | ||
func TestEnoughDatapointsInBuffer(t *testing.T) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: new line needed
if reconnectDelay < inactiveInstanceReconnectDelay-timeSubFuncSlopAllowed { | ||
t.Errorf("Reconnect delay %v is 1 ms less than inactive instance reconnect delay %v", reconnectDelay, inactiveInstanceReconnectDelay) | ||
// On window platform, we found issue with time.Now().Sub(...) reporting 199.9989 even after the code has already waited for time.NewTimer(200)ms. | ||
if reconnectDelay < inactiveInstanceReconnectDelay { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think you need this if block when using assert.WithinDuration
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are 2 conditions here
- to make sure reconnectDelay is "greater" than inactiveInstanceReconnectDelay
- or to make sure it is within duration of the 1 milisecond, if reconnectDelay is "less" than inactiveinstanceReconnectDelay
The code here is intended for "1"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think that's entirely correct. I think the intention is to ensure that we wait for inactiveInstanceReconnectDelay
between disconnects. It's >= inactiveInstanceReconnectDelay
. We don't add any jitter when the disconnection is for an inactive instance. So, assert.WithinDuration()
should be good enough with a buffer of 2ms
.
|
||
if !container.statsQueue.enoughDatapointsInBuffer() && | ||
!container.statsQueue.resetThresholdElapsed(queueResetThreshold) { | ||
seelog.Debug("Stats not ready for container %s", dockerID) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Debugf
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tests are still failing. Please ensure that they succeed with the next revision.
t.Errorf("Reconnect delay %v is not less than inactive instance reconnect delay %v", reconnectDelay, inactiveInstanceReconnectDelay) | ||
} | ||
timeSubFuncSlopAllowed := 2 * time.Millisecond | ||
// On window platform, we found issue with time.Now().Sub(...) reporting 199.9989 even after the code has already waited for time.NewTimer(200)ms. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As @samuelkarp mentioned in one of the earlier comments, there's no "window platform", please modify that. Also, please split the comment across multiple lines to reduce the line width.
queue.Reset() | ||
|
||
thresholdElapsed := queue.resetThresholdElapsed(2 * time.Millisecond) | ||
assert.False(t, thresholdElapsed, "Queue is NOT expected to elapsed the threshold right after reset") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think the phrasing is correct here. The Queue does not elapse any threshold as it's not an entity that is time-bound. I think the better message here is "Queue reset threshold is not expected to elapse right after reset"
time.Sleep(3 * time.Millisecond) | ||
thresholdElapsed = queue.resetThresholdElapsed(2 * time.Millisecond) | ||
|
||
assert.True(t, thresholdElapsed, "Queue is expected to elapse the threshold") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above. I think the better message here is "Queue reset threshold is expected to elapse after waiting"
queue := NewQueue(queueLength) | ||
|
||
enoughDataPoints := queue.enoughDatapointsInBuffer() | ||
assert.False(t, enoughDataPoints, "Queue is expected NOT having enough data points right after creation") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Grammar: "Queue is expected to not have enough data ..."
|
||
queue.Reset() | ||
enoughDataPoints = queue.enoughDatapointsInBuffer() | ||
assert.False(t, enoughDataPoints, "Queue is expected NOT having enough data points right after RESET") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Grammar: "Queue is expected to not have enough data ..."
// On windows platform, we found issue with time.Now().Sub(...) reporting 199.9989 even | ||
// after the code has already waited for time.NewTimer(200)ms. | ||
assert.WithinDuration(t, reconnectDelayTime, firstConnectionAttemptTime.Add(inactiveInstanceReconnectDelay), timeSubFuncSlopAllowed, | ||
"Reconnect delay %v is expected within %v of inactive instance reconnect delay %v", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You don't need a message this detailed; the assert
library will do this for you.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You mean
- I should remove comment (line 524 to line 525)?
or - rewording remove line 527?
thanks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reword line 527 and remove the format string. To see what I mean, you can temporarily make the assertion fail on your local machine and observe the error message that gets printed out.
@liwenwu-amazon I think the code changes look good. Can you manually test if this fix has any side-effects on a long-running (
|
I have this image running for past 6 days. So far, I have not seen any warning messages. Nor I have seen any abnormal stats on "Metrics" dashboard on AWS console. The CPUUtilization is kept at 0.03% average and few spikes maxed @ 0.2% . The CPU Reservation is stabilized at 50%. The MemoryUtilization is stabilized at 6% and memoryreservation is stabilized at 27% |
628e68c
to
1be5420
Compare
Summary
Fix issue #506
Implementation details
1 . Add a new field 'lastResetTime' to stats Queue struct, Set 'lastResetTime' whenever Reseting the Queue.
2. With 1, the caller can return an error if queue has NOT been filled with expected data points after a resetThreshold period
3. define a new error code engine.EmptyMetricsError, so that the caller can ignore this error when publishing metrics immediately after agent connect to backend. And return/handle this error in other situations.
Testing
make release
)go build -out amazon-ecs-agent.exe ./agent
)make test
) passgo test -timeout=25s ./agent/...
) passmake run-integ-tests
) pass.\scripts\run-integ-tests.ps1
) passmake run-functional-tests
) pass.\scripts\run-functional-tests.ps1
) passNew tests cover the changes:
Description for the changelog
Licensing
This contribution is under the terms of the Apache 2.0 License:
@richardpen @samuelkarp @aaithal