-
Notifications
You must be signed in to change notification settings - Fork 619
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add log message for task CNI setup, and other log mods #3621
Conversation
4f7b089
to
0a72422
Compare
- add a grace period for cloudwatch metrics warnings - remove unnecessary stat warns/errors when a task is or has exited - remove unnecessary (and somewhat misleading) "network mode initialized" log messages - dont log errors for "skippable" docker events
c9ea33b
to
a5cddf9
Compare
@@ -508,10 +508,6 @@ func (task *Task) initNetworkMode(acsTaskNetworkMode *string) { | |||
field.NetworkMode: aws.StringValue(acsTaskNetworkMode), | |||
}) | |||
} | |||
logger.Info("Task network mode initialized", logger.Fields{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This log message is misleading, the network mode is not actually "initialized" in any functional way here, this is only called when unmarshaling the task payload from ACS. Since it calls this function on every task payload, and ACS will often send redundant payloads, it also is repeated anywhere from 4-10 times for every task, for example:
% cat /var/log/ecs/ecs-agent.log | grep "Task network mode initialized"|grep "XXXXa1e7baf64457b336ddbcacdc1411"
level=info time=2023-03-24T17:10:44Z msg="Task network mode initialized" task="XXXXa1e7baf64457b336ddbcacdc1411" networkMode="awsvpc"
level=info time=2023-03-24T17:10:46Z msg="Task network mode initialized" task="XXXXa1e7baf64457b336ddbcacdc1411" networkMode="awsvpc"
level=info time=2023-03-24T17:10:48Z msg="Task network mode initialized" task="XXXXa1e7baf64457b336ddbcacdc1411" networkMode="awsvpc"
level=info time=2023-03-24T17:10:53Z msg="Task network mode initialized" task="XXXXa1e7baf64457b336ddbcacdc1411" networkMode="awsvpc"
level=info time=2023-03-24T17:10:55Z msg="Task network mode initialized" task="XXXXa1e7baf64457b336ddbcacdc1411" networkMode="awsvpc"
@@ -149,14 +149,14 @@ func NewContainerStateChangeEvent(task *apitask.Task, cont *apicontainer.Contain | |||
} | |||
contKnownStatus := cont.GetKnownStatus() | |||
if !contKnownStatus.ShouldReportToBackend(cont.GetSteadyStateStatus()) { | |||
return event, errors.Errorf( | |||
return event, ErrShouldNotSendEvent{fmt.Sprintf( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This prevents logging these bogus error messages:
level=error time=2023-03-24T17:30:35Z msg="Skipping emitting event for container due to error" task="XXXX" container="main" error="create container state change event api: status not recognized by ECS: PULLED"
level=error time=2023-03-24T17:30:35Z msg="Skipping emitting event for container due to error" task="XXXX" container="cpustress" error="create container state change event api: status not recognized by ECS: CREATED"
level=error time=2023-03-24T17:30:35Z msg="Skipping emitting event for container due to error" task="XXXX" container="main" error="create container state change event api: status not recognized by ECS: CREATED"
level=error time=2023-03-24T17:30:43Z msg="Skipping emitting event for container due to error" task="XXXX" container="cpustress" error="create container state change event api: status not recognized by ECS: NONE"
continue | ||
} | ||
// task stats were not stopped for terminal task, backoff before trying to reconnect | ||
if statStreamErr != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
backing off and logging after we determine if task has exited prevents lots of log messages like this:
level=warn time=2023-03-24T17:32:52Z msg="Error encountered processing metrics stream from host, this may affect cloudwatch metric accuracy" task="815c01e432284155bd0024be207989a7" error="failed to Statfs \"/host/proc/12554/ns/net\": no such file or directory"
} else if terminal { | ||
logger.Warn("Task is terminal, stopping stats collection", logger.Fields{ | ||
logger.Info("Task is terminal, stopping stats collection", logger.Fields{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is not really a warning as this is expected behavior
age := time.Since(container.containerMetadata.StartedAt) | ||
// gracePeriod is the time that containers are allowed to have missing metrics | ||
// without throwing/logging errors. | ||
gracePeriod := time.Second * 30 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are often logging these warning messages soon after a container starts up because we don't have enough metrics in the queue yet.
This change adds a "grace period" of 30s when we won't warn about this condition, this prevents messages like below:
level=warn time=2023-03-24T17:39:04Z msg="Error getting network stats for container" taskARN="arn:aws:ecs:us-west-2:XXXX:task/XXXX/XXXX" container="fc650d7f8817603ba602106383c7402f4c4c7b70c7fdf66be8903cc4ef169ee7" error="need at least 2 data points in queue to calculate CW stats set"
level=error time=2023-03-24T17:39:04Z msg="Error collecting cloudwatch metrics for container" container="b1bc98349145b528b58a9badb4565472f3ca81a2dd4c9e00e34b796eaab7ba85" error="need at least 2 data points in queue to calculate CW stats set"
level=warn time=2023-03-24T17:39:04Z msg="Error getting network rx bytes: need at least 2 data points in the queue to calculate int stats" module=queue.go
level=warn time=2023-03-24T17:39:04Z msg="Error getting network rx dropped: need at least 2 data points in the queue to calculate int stats" module=queue.go
level=warn time=2023-03-24T17:39:04Z msg="Error getting network rx errors: need at least 2 data points in the queue to calculate int stats" module=queue.go
level=warn time=2023-03-24T17:39:04Z msg="Error getting network rx packets: need at least 2 data points in the queue to calculate int stats" module=queue.go
level=warn time=2023-03-24T17:39:04Z msg="Error getting network tx bytes: need at least 2 data points in the queue to calculate int stats" module=queue.go
level=warn time=2023-03-24T17:39:04Z msg="Error getting network tx dropped: need at least 2 data points in the queue to calculate int stats" module=queue.go
level=warn time=2023-03-24T17:39:04Z msg="Error getting network tx errors: need at least 2 data points in the queue to calculate int stats" module=queue.go
level=warn time=2023-03-24T17:39:04Z msg="Error getting network tx packets: need at least 2 data points in the queue to calculate int stats" module=queue.go
level=warn time=2023-03-24T17:39:04Z msg="Error getting network rx bytes per second: need at least 2 data points in queue to calculate CW stats set" module=queue.go
level=warn time=2023-03-24T17:39:04Z msg="Error getting network tx bytes per second: need at least 2 data points in queue to calculate CW stats set" module=queue.go
level=warn time=2023-03-24T17:39:04Z msg="Error getting network stats for container" taskARN="arn:aws:ecs:us-west-2:XXXX:task/XXXX/XXXX" container="c36eca3d790f6d67ed0771891f0885d078961febc7769b69dd65a5f9aabc69a6" error="need at least 2 data points in queue to calculate CW stats set"
level=warn time=2023-03-24T17:39:04Z msg="Error getting network rx bytes: need at least 2 data points in the queue to calculate int stats" module=queue.go
level=warn time=2023-03-24T17:39:04Z msg="Error getting network rx dropped: need at least 2 data points in the queue to calculate int stats" module=queue.go
level=warn time=2023-03-24T17:39:04Z msg="Error getting network rx errors: need at least 2 data points in the queue to calculate int stats" module=queue.go
level=warn time=2023-03-24T17:39:04Z msg="Error getting network rx packets: need at least 2 data points in the queue to calculate int stats" module=queue.go
level=warn time=2023-03-24T17:39:04Z msg="Error getting network tx bytes: need at least 2 data points in the queue to calculate int stats" module=queue.go
level=warn time=2023-03-24T17:39:04Z msg="Error getting network tx dropped: need at least 2 data points in the queue to calculate int stats" module=queue.go
level=warn time=2023-03-24T17:39:04Z msg="Error getting network tx errors: need at least 2 data points in the queue to calculate int stats" module=queue.go
level=warn time=2023-03-24T17:39:04Z msg="Error getting network tx packets: need at least 2 data points in the queue to calculate int stats" module=queue.go
level=warn time=2023-03-24T17:39:04Z msg="Error getting network rx bytes per second: need at least 2 data points in queue to calculate CW stats set" module=queue.go
level=warn time=2023-03-24T17:39:04Z msg="Error getting network tx bytes per second: need at least 2 data points in queue to calculate CW stats set" module=queue.go
level=warn time=2023-03-24T17:39:04Z msg="Error getting network stats for container" container="8f6c44ed82edd31bd15e1a0e9fffe592929b241888db594a53dc44b4d794b26e" error="need at least 2 data points in queue to calculate CW stats set" taskARN="arn:aws:ecs:us-west-2:XXXX:task/XXXX/XXXX"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice addition!
@@ -169,62 +169,72 @@ func (queue *Queue) GetMemoryStatsSet() (*ecstcs.CWStatsSet, error) { | |||
func (queue *Queue) GetStorageStatsSet() (*ecstcs.StorageStatsSet, error) { | |||
storageStatsSet := &ecstcs.StorageStatsSet{} | |||
var err error | |||
var errStr string |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the errors below are already logged as a warning in the calling function, so removing the duplicate logging here
@@ -1752,6 +1752,15 @@ func (engine *DockerTaskEngine) provisionContainerResourcesAwsvpc(task *apitask. | |||
} | |||
} | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Below adds a new logging message like below:
level=info time=2023-03-24T17:49:44Z msg="Setting up CNI config for task" task="XXXX13d3c33f4d2ab9bacb00ef38XXXX" cniContainerID="XXXX" cniPluginPath="" cniID="02:64:22:51:ac:81" cniBridgeName="" cniContainerNetNs="/host/proc/24977/ns/net"
level=info time=2023-03-24T17:49:54Z msg="Setting up CNI config for task" cniID="02:05:6d:7e:76:ed" cniBridgeName="" cniContainerNetNs="/host/proc/25476/ns/net" task="XXXX53b676e04a85aefc7de8a4baXXXX" cniContainerID="XXXX" cniPluginPath=""
428ff01
to
a66bf31
Compare
Looks good to me. Could we squash the unit test fix commits together before merging? |
2fc2bae
to
c8e538e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the idea of explaining the details for different sections.
@@ -40,4 +40,4 @@ jobs: | |||
gcc --version | |||
$env:ZZZ_SKIP_WINDOWS_SERVER_VERSION_CHECK_NOT_SUPPORTED_IN_PRODUCTION = 'true' | |||
$packages=go list .\... | Where-Object {$_ -NotMatch 'vendor'} | |||
go test -v -tags unit -timeout=40s $packages | |||
go test -v -tags unit -timeout=120s $packages |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh finally! 🙌
Summary
Agent logs are very verbose and log quite a lot of red herring warning and error log messages. In some cases it is also logging irrelevant and duplicate messages.
In this change there is also the addition of a log message for when we are setting up the CNI config for a task.
Overall these log messages represent around 16-25% of an agent logfile (higher percent on instances with higher numbers of tasks).
Changes to logging:
Testing
Description for the changelog
Enhancement - logging cleanup for unnecessary warn/error messages
Licensing
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.