Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add log message for task CNI setup, and other log mods #3621

Merged
merged 2 commits into from
Mar 27, 2023

Conversation

sparrc
Copy link
Contributor

@sparrc sparrc commented Mar 24, 2023

Summary

Agent logs are very verbose and log quite a lot of red herring warning and error log messages. In some cases it is also logging irrelevant and duplicate messages.

In this change there is also the addition of a log message for when we are setting up the CNI config for a task.

Overall these log messages represent around 16-25% of an agent logfile (higher percent on instances with higher numbers of tasks).

Changes to logging:

  • add a grace period for cloudwatch metrics warnings
  • remove unnecessary stat warns/errors when a task is or has exited
  • remove unnecessary (and somewhat misleading) "network mode initialized" log messages
  • dont log errors for "skippable" docker events

Testing

Description for the changelog

Enhancement - logging cleanup for unnecessary warn/error messages

Licensing

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@sparrc sparrc requested a review from a team as a code owner March 24, 2023 17:07
@sparrc sparrc force-pushed the cni-setup-logging branch 2 times, most recently from 4f7b089 to 0a72422 Compare March 24, 2023 17:18
- add a grace period for cloudwatch metrics warnings
- remove unnecessary stat warns/errors when a task is or has exited
- remove unnecessary (and somewhat misleading) "network mode initialized" log messages
- dont log errors for "skippable" docker events
@@ -508,10 +508,6 @@ func (task *Task) initNetworkMode(acsTaskNetworkMode *string) {
field.NetworkMode: aws.StringValue(acsTaskNetworkMode),
})
}
logger.Info("Task network mode initialized", logger.Fields{
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This log message is misleading, the network mode is not actually "initialized" in any functional way here, this is only called when unmarshaling the task payload from ACS. Since it calls this function on every task payload, and ACS will often send redundant payloads, it also is repeated anywhere from 4-10 times for every task, for example:

% cat /var/log/ecs/ecs-agent.log | grep "Task network mode initialized"|grep "XXXXa1e7baf64457b336ddbcacdc1411"
level=info time=2023-03-24T17:10:44Z msg="Task network mode initialized" task="XXXXa1e7baf64457b336ddbcacdc1411" networkMode="awsvpc"
level=info time=2023-03-24T17:10:46Z msg="Task network mode initialized" task="XXXXa1e7baf64457b336ddbcacdc1411" networkMode="awsvpc"
level=info time=2023-03-24T17:10:48Z msg="Task network mode initialized" task="XXXXa1e7baf64457b336ddbcacdc1411" networkMode="awsvpc"
level=info time=2023-03-24T17:10:53Z msg="Task network mode initialized" task="XXXXa1e7baf64457b336ddbcacdc1411" networkMode="awsvpc"
level=info time=2023-03-24T17:10:55Z msg="Task network mode initialized" task="XXXXa1e7baf64457b336ddbcacdc1411" networkMode="awsvpc"

@@ -149,14 +149,14 @@ func NewContainerStateChangeEvent(task *apitask.Task, cont *apicontainer.Contain
}
contKnownStatus := cont.GetKnownStatus()
if !contKnownStatus.ShouldReportToBackend(cont.GetSteadyStateStatus()) {
return event, errors.Errorf(
return event, ErrShouldNotSendEvent{fmt.Sprintf(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This prevents logging these bogus error messages:

level=error time=2023-03-24T17:30:35Z msg="Skipping emitting event for container due to error" task="XXXX" container="main" error="create container state change event api: status not recognized by ECS: PULLED"
level=error time=2023-03-24T17:30:35Z msg="Skipping emitting event for container due to error" task="XXXX" container="cpustress" error="create container state change event api: status not recognized by ECS: CREATED"
level=error time=2023-03-24T17:30:35Z msg="Skipping emitting event for container due to error" task="XXXX" container="main" error="create container state change event api: status not recognized by ECS: CREATED"
level=error time=2023-03-24T17:30:43Z msg="Skipping emitting event for container due to error" task="XXXX" container="cpustress" error="create container state change event api: status not recognized by ECS: NONE"

continue
}
// task stats were not stopped for terminal task, backoff before trying to reconnect
if statStreamErr != nil {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

backing off and logging after we determine if task has exited prevents lots of log messages like this:

level=warn time=2023-03-24T17:32:52Z msg="Error encountered processing metrics stream from host, this may affect cloudwatch metric accuracy" task="815c01e432284155bd0024be207989a7" error="failed to Statfs \"/host/proc/12554/ns/net\": no such file or directory"

} else if terminal {
logger.Warn("Task is terminal, stopping stats collection", logger.Fields{
logger.Info("Task is terminal, stopping stats collection", logger.Fields{
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not really a warning as this is expected behavior

age := time.Since(container.containerMetadata.StartedAt)
// gracePeriod is the time that containers are allowed to have missing metrics
// without throwing/logging errors.
gracePeriod := time.Second * 30
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are often logging these warning messages soon after a container starts up because we don't have enough metrics in the queue yet.

This change adds a "grace period" of 30s when we won't warn about this condition, this prevents messages like below:

level=warn time=2023-03-24T17:39:04Z msg="Error getting network stats for container" taskARN="arn:aws:ecs:us-west-2:XXXX:task/XXXX/XXXX" container="fc650d7f8817603ba602106383c7402f4c4c7b70c7fdf66be8903cc4ef169ee7" error="need at least 2 data points in queue to calculate CW stats set"
level=error time=2023-03-24T17:39:04Z msg="Error collecting cloudwatch metrics for container" container="b1bc98349145b528b58a9badb4565472f3ca81a2dd4c9e00e34b796eaab7ba85" error="need at least 2 data points in queue to calculate CW stats set"
level=warn time=2023-03-24T17:39:04Z msg="Error getting network rx bytes: need at least 2 data points in the queue to calculate int stats" module=queue.go
level=warn time=2023-03-24T17:39:04Z msg="Error getting network rx dropped: need at least 2 data points in the queue to calculate int stats" module=queue.go
level=warn time=2023-03-24T17:39:04Z msg="Error getting network rx errors: need at least 2 data points in the queue to calculate int stats" module=queue.go
level=warn time=2023-03-24T17:39:04Z msg="Error getting network rx packets: need at least 2 data points in the queue to calculate int stats" module=queue.go
level=warn time=2023-03-24T17:39:04Z msg="Error getting network tx bytes: need at least 2 data points in the queue to calculate int stats" module=queue.go
level=warn time=2023-03-24T17:39:04Z msg="Error getting network tx dropped: need at least 2 data points in the queue to calculate int stats" module=queue.go
level=warn time=2023-03-24T17:39:04Z msg="Error getting network tx errors: need at least 2 data points in the queue to calculate int stats" module=queue.go
level=warn time=2023-03-24T17:39:04Z msg="Error getting network tx packets: need at least 2 data points in the queue to calculate int stats" module=queue.go
level=warn time=2023-03-24T17:39:04Z msg="Error getting network rx bytes per second: need at least 2 data points in queue to calculate CW stats set" module=queue.go
level=warn time=2023-03-24T17:39:04Z msg="Error getting network tx bytes per second: need at least 2 data points in queue to calculate CW stats set" module=queue.go
level=warn time=2023-03-24T17:39:04Z msg="Error getting network stats for container" taskARN="arn:aws:ecs:us-west-2:XXXX:task/XXXX/XXXX" container="c36eca3d790f6d67ed0771891f0885d078961febc7769b69dd65a5f9aabc69a6" error="need at least 2 data points in queue to calculate CW stats set"
level=warn time=2023-03-24T17:39:04Z msg="Error getting network rx bytes: need at least 2 data points in the queue to calculate int stats" module=queue.go
level=warn time=2023-03-24T17:39:04Z msg="Error getting network rx dropped: need at least 2 data points in the queue to calculate int stats" module=queue.go
level=warn time=2023-03-24T17:39:04Z msg="Error getting network rx errors: need at least 2 data points in the queue to calculate int stats" module=queue.go
level=warn time=2023-03-24T17:39:04Z msg="Error getting network rx packets: need at least 2 data points in the queue to calculate int stats" module=queue.go
level=warn time=2023-03-24T17:39:04Z msg="Error getting network tx bytes: need at least 2 data points in the queue to calculate int stats" module=queue.go
level=warn time=2023-03-24T17:39:04Z msg="Error getting network tx dropped: need at least 2 data points in the queue to calculate int stats" module=queue.go
level=warn time=2023-03-24T17:39:04Z msg="Error getting network tx errors: need at least 2 data points in the queue to calculate int stats" module=queue.go
level=warn time=2023-03-24T17:39:04Z msg="Error getting network tx packets: need at least 2 data points in the queue to calculate int stats" module=queue.go
level=warn time=2023-03-24T17:39:04Z msg="Error getting network rx bytes per second: need at least 2 data points in queue to calculate CW stats set" module=queue.go
level=warn time=2023-03-24T17:39:04Z msg="Error getting network tx bytes per second: need at least 2 data points in queue to calculate CW stats set" module=queue.go
level=warn time=2023-03-24T17:39:04Z msg="Error getting network stats for container" container="8f6c44ed82edd31bd15e1a0e9fffe592929b241888db594a53dc44b4d794b26e" error="need at least 2 data points in queue to calculate CW stats set" taskARN="arn:aws:ecs:us-west-2:XXXX:task/XXXX/XXXX"

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice addition!

@@ -169,62 +169,72 @@ func (queue *Queue) GetMemoryStatsSet() (*ecstcs.CWStatsSet, error) {
func (queue *Queue) GetStorageStatsSet() (*ecstcs.StorageStatsSet, error) {
storageStatsSet := &ecstcs.StorageStatsSet{}
var err error
var errStr string
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the errors below are already logged as a warning in the calling function, so removing the duplicate logging here

@@ -1752,6 +1752,15 @@ func (engine *DockerTaskEngine) provisionContainerResourcesAwsvpc(task *apitask.
}
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Below adds a new logging message like below:

level=info time=2023-03-24T17:49:44Z msg="Setting up CNI config for task" task="XXXX13d3c33f4d2ab9bacb00ef38XXXX" cniContainerID="XXXX" cniPluginPath="" cniID="02:64:22:51:ac:81" cniBridgeName="" cniContainerNetNs="/host/proc/24977/ns/net"
level=info time=2023-03-24T17:49:54Z msg="Setting up CNI config for task" cniID="02:05:6d:7e:76:ed" cniBridgeName="" cniContainerNetNs="/host/proc/25476/ns/net" task="XXXX53b676e04a85aefc7de8a4baXXXX" cniContainerID="XXXX" cniPluginPath=""

@sparrc sparrc changed the title [wip] logging changes Add log message for task CNI setup, and other log mods Mar 24, 2023
@danehlim
Copy link
Contributor

Looks good to me. Could we squash the unit test fix commits together before merging?

danehlim
danehlim previously approved these changes Mar 24, 2023
Copy link
Contributor

@SreeeS SreeeS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the idea of explaining the details for different sections.

@@ -40,4 +40,4 @@ jobs:
gcc --version
$env:ZZZ_SKIP_WINDOWS_SERVER_VERSION_CHECK_NOT_SUPPORTED_IN_PRODUCTION = 'true'
$packages=go list .\... | Where-Object {$_ -NotMatch 'vendor'}
go test -v -tags unit -timeout=40s $packages
go test -v -tags unit -timeout=120s $packages
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh finally! 🙌

@sparrc sparrc merged commit 18f2fb9 into aws:dev Mar 27, 2023
@YashdalfTheGray YashdalfTheGray mentioned this pull request Mar 28, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants