Add log message for task CNI setup, and other log mods #3621

sparrc · 2023-03-24T17:07:43Z

Summary

Agent logs are very verbose and log quite a lot of red herring warning and error log messages. In some cases it is also logging irrelevant and duplicate messages.

In this change there is also the addition of a log message for when we are setting up the CNI config for a task.

Overall these log messages represent around 16-25% of an agent logfile (higher percent on instances with higher numbers of tasks).

Changes to logging:

add a grace period for cloudwatch metrics warnings
remove unnecessary stat warns/errors when a task is or has exited
remove unnecessary (and somewhat misleading) "network mode initialized" log messages
dont log errors for "skippable" docker events

Testing

Description for the changelog

Enhancement - logging cleanup for unnecessary warn/error messages

Licensing

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

- add a grace period for cloudwatch metrics warnings - remove unnecessary stat warns/errors when a task is or has exited - remove unnecessary (and somewhat misleading) "network mode initialized" log messages - dont log errors for "skippable" docker events

sparrc · 2023-03-24T17:30:16Z

agent/api/task/task.go

@@ -508,10 +508,6 @@ func (task *Task) initNetworkMode(acsTaskNetworkMode *string) {
 			field.NetworkMode: aws.StringValue(acsTaskNetworkMode),
 		})
 	}
-	logger.Info("Task network mode initialized", logger.Fields{


This log message is misleading, the network mode is not actually "initialized" in any functional way here, this is only called when unmarshaling the task payload from ACS. Since it calls this function on every task payload, and ACS will often send redundant payloads, it also is repeated anywhere from 4-10 times for every task, for example:

% cat /var/log/ecs/ecs-agent.log | grep "Task network mode initialized"|grep "XXXXa1e7baf64457b336ddbcacdc1411" level=info time=2023-03-24T17:10:44Z msg="Task network mode initialized" task="XXXXa1e7baf64457b336ddbcacdc1411" networkMode="awsvpc" level=info time=2023-03-24T17:10:46Z msg="Task network mode initialized" task="XXXXa1e7baf64457b336ddbcacdc1411" networkMode="awsvpc" level=info time=2023-03-24T17:10:48Z msg="Task network mode initialized" task="XXXXa1e7baf64457b336ddbcacdc1411" networkMode="awsvpc" level=info time=2023-03-24T17:10:53Z msg="Task network mode initialized" task="XXXXa1e7baf64457b336ddbcacdc1411" networkMode="awsvpc" level=info time=2023-03-24T17:10:55Z msg="Task network mode initialized" task="XXXXa1e7baf64457b336ddbcacdc1411" networkMode="awsvpc"

sparrc · 2023-03-24T17:32:01Z

agent/api/statechange.go

@@ -149,14 +149,14 @@ func NewContainerStateChangeEvent(task *apitask.Task, cont *apicontainer.Contain
 	}
 	contKnownStatus := cont.GetKnownStatus()
 	if !contKnownStatus.ShouldReportToBackend(cont.GetSteadyStateStatus()) {
-		return event, errors.Errorf(
+		return event, ErrShouldNotSendEvent{fmt.Sprintf(


This prevents logging these bogus error messages:

level=error time=2023-03-24T17:30:35Z msg="Skipping emitting event for container due to error" task="XXXX" container="main" error="create container state change event api: status not recognized by ECS: PULLED" level=error time=2023-03-24T17:30:35Z msg="Skipping emitting event for container due to error" task="XXXX" container="cpustress" error="create container state change event api: status not recognized by ECS: CREATED" level=error time=2023-03-24T17:30:35Z msg="Skipping emitting event for container due to error" task="XXXX" container="main" error="create container state change event api: status not recognized by ECS: CREATED" level=error time=2023-03-24T17:30:43Z msg="Skipping emitting event for container due to error" task="XXXX" container="cpustress" error="create container state change event api: status not recognized by ECS: NONE"

sparrc · 2023-03-24T17:38:39Z

agent/stats/task.go

+				continue
+			}
+			// task stats were not stopped for terminal task, backoff before trying to reconnect
+			if statStreamErr != nil {


backing off and logging after we determine if task has exited prevents lots of log messages like this:

level=warn time=2023-03-24T17:32:52Z msg="Error encountered processing metrics stream from host, this may affect cloudwatch metric accuracy" task="815c01e432284155bd0024be207989a7" error="failed to Statfs \"/host/proc/12554/ns/net\": no such file or directory"

sparrc · 2023-03-24T17:38:55Z

agent/stats/task.go

 			} else if terminal {
-				logger.Warn("Task is terminal, stopping stats collection", logger.Fields{
+				logger.Info("Task is terminal, stopping stats collection", logger.Fields{


this is not really a warning as this is expected behavior

sparrc · 2023-03-24T17:49:37Z

agent/stats/engine.go

+		age := time.Since(container.containerMetadata.StartedAt)
+		// gracePeriod is the time that containers are allowed to have missing metrics
+		// without throwing/logging errors.
+		gracePeriod := time.Second * 30


We are often logging these warning messages soon after a container starts up because we don't have enough metrics in the queue yet.

This change adds a "grace period" of 30s when we won't warn about this condition, this prevents messages like below:

level=warn time=2023-03-24T17:39:04Z msg="Error getting network stats for container" taskARN="arn:aws:ecs:us-west-2:XXXX:task/XXXX/XXXX" container="fc650d7f8817603ba602106383c7402f4c4c7b70c7fdf66be8903cc4ef169ee7" error="need at least 2 data points in queue to calculate CW stats set" level=error time=2023-03-24T17:39:04Z msg="Error collecting cloudwatch metrics for container" container="b1bc98349145b528b58a9badb4565472f3ca81a2dd4c9e00e34b796eaab7ba85" error="need at least 2 data points in queue to calculate CW stats set" level=warn time=2023-03-24T17:39:04Z msg="Error getting network rx bytes: need at least 2 data points in the queue to calculate int stats" module=queue.go level=warn time=2023-03-24T17:39:04Z msg="Error getting network rx dropped: need at least 2 data points in the queue to calculate int stats" module=queue.go level=warn time=2023-03-24T17:39:04Z msg="Error getting network rx errors: need at least 2 data points in the queue to calculate int stats" module=queue.go level=warn time=2023-03-24T17:39:04Z msg="Error getting network rx packets: need at least 2 data points in the queue to calculate int stats" module=queue.go level=warn time=2023-03-24T17:39:04Z msg="Error getting network tx bytes: need at least 2 data points in the queue to calculate int stats" module=queue.go level=warn time=2023-03-24T17:39:04Z msg="Error getting network tx dropped: need at least 2 data points in the queue to calculate int stats" module=queue.go level=warn time=2023-03-24T17:39:04Z msg="Error getting network tx errors: need at least 2 data points in the queue to calculate int stats" module=queue.go level=warn time=2023-03-24T17:39:04Z msg="Error getting network tx packets: need at least 2 data points in the queue to calculate int stats" module=queue.go level=warn time=2023-03-24T17:39:04Z msg="Error getting network rx bytes per second: need at least 2 data points in queue to calculate CW stats set" module=queue.go level=warn time=2023-03-24T17:39:04Z msg="Error getting network tx bytes per second: need at least 2 data points in queue to calculate CW stats set" module=queue.go level=warn time=2023-03-24T17:39:04Z msg="Error getting network stats for container" taskARN="arn:aws:ecs:us-west-2:XXXX:task/XXXX/XXXX" container="c36eca3d790f6d67ed0771891f0885d078961febc7769b69dd65a5f9aabc69a6" error="need at least 2 data points in queue to calculate CW stats set" level=warn time=2023-03-24T17:39:04Z msg="Error getting network rx bytes: need at least 2 data points in the queue to calculate int stats" module=queue.go level=warn time=2023-03-24T17:39:04Z msg="Error getting network rx dropped: need at least 2 data points in the queue to calculate int stats" module=queue.go level=warn time=2023-03-24T17:39:04Z msg="Error getting network rx errors: need at least 2 data points in the queue to calculate int stats" module=queue.go level=warn time=2023-03-24T17:39:04Z msg="Error getting network rx packets: need at least 2 data points in the queue to calculate int stats" module=queue.go level=warn time=2023-03-24T17:39:04Z msg="Error getting network tx bytes: need at least 2 data points in the queue to calculate int stats" module=queue.go level=warn time=2023-03-24T17:39:04Z msg="Error getting network tx dropped: need at least 2 data points in the queue to calculate int stats" module=queue.go level=warn time=2023-03-24T17:39:04Z msg="Error getting network tx errors: need at least 2 data points in the queue to calculate int stats" module=queue.go level=warn time=2023-03-24T17:39:04Z msg="Error getting network tx packets: need at least 2 data points in the queue to calculate int stats" module=queue.go level=warn time=2023-03-24T17:39:04Z msg="Error getting network rx bytes per second: need at least 2 data points in queue to calculate CW stats set" module=queue.go level=warn time=2023-03-24T17:39:04Z msg="Error getting network tx bytes per second: need at least 2 data points in queue to calculate CW stats set" module=queue.go level=warn time=2023-03-24T17:39:04Z msg="Error getting network stats for container" container="8f6c44ed82edd31bd15e1a0e9fffe592929b241888db594a53dc44b4d794b26e" error="need at least 2 data points in queue to calculate CW stats set" taskARN="arn:aws:ecs:us-west-2:XXXX:task/XXXX/XXXX"

Nice addition!

sparrc · 2023-03-24T17:50:17Z

agent/stats/queue.go

@@ -169,62 +169,72 @@ func (queue *Queue) GetMemoryStatsSet() (*ecstcs.CWStatsSet, error) {
 func (queue *Queue) GetStorageStatsSet() (*ecstcs.StorageStatsSet, error) {
 	storageStatsSet := &ecstcs.StorageStatsSet{}
 	var err error
+	var errStr string


the errors below are already logged as a warning in the calling function, so removing the duplicate logging here

sparrc · 2023-03-24T17:53:32Z

agent/engine/docker_task_engine.go

@@ -1752,6 +1752,15 @@ func (engine *DockerTaskEngine) provisionContainerResourcesAwsvpc(task *apitask.
 		}
 	}



Below adds a new logging message like below:

level=info time=2023-03-24T17:49:44Z msg="Setting up CNI config for task" task="XXXX13d3c33f4d2ab9bacb00ef38XXXX" cniContainerID="XXXX" cniPluginPath="" cniID="02:64:22:51:ac:81" cniBridgeName="" cniContainerNetNs="/host/proc/24977/ns/net" level=info time=2023-03-24T17:49:54Z msg="Setting up CNI config for task" cniID="02:05:6d:7e:76:ed" cniBridgeName="" cniContainerNetNs="/host/proc/25476/ns/net" task="XXXX53b676e04a85aefc7de8a4baXXXX" cniContainerID="XXXX" cniPluginPath=""

danehlim · 2023-03-24T20:49:00Z

Looks good to me. Could we squash the unit test fix commits together before merging?

SreeeS

I like the idea of explaining the details for different sections.

singholt · 2023-03-27T16:17:04Z

.github/workflows/windows.yml

@@ -40,4 +40,4 @@ jobs:
          gcc --version
          $env:ZZZ_SKIP_WINDOWS_SERVER_VERSION_CHECK_NOT_SUPPORTED_IN_PRODUCTION = 'true'
          $packages=go list .\... | Where-Object {$_ -NotMatch 'vendor'}
-          go test -v -tags unit -timeout=40s $packages
+          go test -v -tags unit -timeout=120s $packages


oh finally! 🙌

sparrc requested a review from a team as a code owner March 24, 2023 17:07

sparrc force-pushed the cni-setup-logging branch 2 times, most recently from 4f7b089 to 0a72422 Compare March 24, 2023 17:18

sparrc force-pushed the cni-setup-logging branch from c9ea33b to a5cddf9 Compare March 24, 2023 17:27

sparrc commented Mar 24, 2023

View reviewed changes

sparrc force-pushed the cni-setup-logging branch from 428ff01 to a66bf31 Compare March 24, 2023 18:17

sparrc changed the title ~~[wip] logging changes~~ Add log message for task CNI setup, and other log mods Mar 24, 2023

sparrc added the bot/test label Mar 24, 2023

amazon-ecs-bot removed the bot/test label Mar 24, 2023

danehlim previously approved these changes Mar 24, 2023

View reviewed changes

sparrc dismissed danehlim’s stale review via 2fc2bae March 24, 2023 20:56

Fix unit tests relying on GetID call from log message

c8e538e

sparrc force-pushed the cni-setup-logging branch from 2fc2bae to c8e538e Compare March 24, 2023 21:47

sparrc added the bot/test label Mar 24, 2023

amazon-ecs-bot removed the bot/test label Mar 24, 2023

danehlim approved these changes Mar 24, 2023

View reviewed changes

SreeeS reviewed Mar 27, 2023

View reviewed changes

SreeeS approved these changes Mar 27, 2023

View reviewed changes

singholt reviewed Mar 27, 2023

View reviewed changes

singholt approved these changes Mar 27, 2023

View reviewed changes

sparrc merged commit 18f2fb9 into aws:dev Mar 27, 2023

YashdalfTheGray mentioned this pull request Mar 28, 2023

Release 1.70.1 #3627

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add log message for task CNI setup, and other log mods #3621

Add log message for task CNI setup, and other log mods #3621

sparrc commented Mar 24, 2023 •

edited

Loading

sparrc Mar 24, 2023

sparrc Mar 24, 2023

sparrc Mar 24, 2023

sparrc Mar 24, 2023

sparrc Mar 24, 2023

danehlim Mar 24, 2023

sparrc Mar 24, 2023

sparrc Mar 24, 2023

danehlim commented Mar 24, 2023

SreeeS left a comment

singholt Mar 27, 2023

		@@ -1752,6 +1752,15 @@ func (engine DockerTaskEngine) provisionContainerResourcesAwsvpc(task apitask.
		}
		}

Add log message for task CNI setup, and other log mods #3621

Add log message for task CNI setup, and other log mods #3621

Conversation

sparrc commented Mar 24, 2023 • edited Loading

Summary

Testing

Description for the changelog

Licensing

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

danehlim commented Mar 24, 2023

SreeeS left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

sparrc commented Mar 24, 2023 •

edited

Loading