Add log message for task CNI setup, and other log mods #3621

Merged: 2 commits, Mar 27, 2023
6 changes: 3 additions & 3 deletions Makefile
@@ -254,7 +254,7 @@ dockerfree-cni-plugins:
release-agent-internal: dockerfree-certs dockerfree-cni-plugins static
./scripts/build-agent-image

# Default Agent target to build. Pulls cni plugins, builds agent image and save it to disk
# Default Agent target to build. Pulls cni plugins, builds agent image and save it to disk
release-agent: get-cni-sources
$(MAKE) release-agent-internal

@@ -350,7 +350,7 @@ install-golang:
.get-deps-stamp:
go install github.com/golang/mock/mockgen@v1.6.0
go install golang.org/x/tools/cmd/goimports@v0.2.0
GO111MODULE=on go install github.com/fzipp/gocyclo/cmd/gocyclo@v0.3.1
GO111MODULE=on go install github.com/fzipp/gocyclo/cmd/gocyclo@v0.6.0
GO111MODULE=on go install honnef.co/go/tools/cmd/staticcheck@v0.3.2
touch .get-deps-stamp

@@ -359,7 +359,7 @@ get-deps: .get-deps-stamp
get-deps-init:
go install github.com/golang/mock/mockgen@v1.6.0
go install golang.org/x/tools/cmd/goimports@v0.2.0
GO111MODULE=on go install github.com/fzipp/gocyclo/cmd/gocyclo@v0.3.1
GO111MODULE=on go install github.com/fzipp/gocyclo/cmd/gocyclo@v0.6.0
GO111MODULE=on go install honnef.co/go/tools/cmd/staticcheck@v0.3.2

amazon-linux-sources.tgz:
8 changes: 4 additions & 4 deletions agent/api/statechange.go
@@ -149,14 +149,14 @@ func NewContainerStateChangeEvent(task *apitask.Task, cont *apicontainer.Contain
}
contKnownStatus := cont.GetKnownStatus()
if !contKnownStatus.ShouldReportToBackend(cont.GetSteadyStateStatus()) {
return event, errors.Errorf(
return event, ErrShouldNotSendEvent{fmt.Sprintf(
Contributor Author commented:

This prevents logging these bogus error messages:

level=error time=2023-03-24T17:30:35Z msg="Skipping emitting event for container due to error" task="XXXX" container="main" error="create container state change event api: status not recognized by ECS: PULLED"
level=error time=2023-03-24T17:30:35Z msg="Skipping emitting event for container due to error" task="XXXX" container="cpustress" error="create container state change event api: status not recognized by ECS: CREATED"
level=error time=2023-03-24T17:30:35Z msg="Skipping emitting event for container due to error" task="XXXX" container="main" error="create container state change event api: status not recognized by ECS: CREATED"
level=error time=2023-03-24T17:30:43Z msg="Skipping emitting event for container due to error" task="XXXX" container="cpustress" error="create container state change event api: status not recognized by ECS: NONE"

"create container state change event api: status not recognized by ECS: %v",
contKnownStatus)
contKnownStatus)}
}
if cont.GetSentStatus() >= contKnownStatus {
return event, errors.Errorf(
return event, ErrShouldNotSendEvent{fmt.Sprintf(
"create container state change event api: status [%s] already sent for container %s, task %s",
contKnownStatus.String(), cont.Name, task.Arn)
contKnownStatus.String(), cont.Name, task.Arn)}
}
if reason == "" && cont.ApplyingError != nil {
reason = cont.ApplyingError.Error()
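For context, the new return value is a typed error so that callers can recognize a "nothing to send" condition and skip or downgrade the log line instead of reporting it as an error. A minimal sketch of how such a type could be defined and checked; the agent's actual definition and call sites may differ:

```go
package api

// ErrShouldNotSendEvent is a sketch of a typed error signalling that a state
// change event should simply not be sent, rather than a real failure.
type ErrShouldNotSendEvent struct {
	msg string
}

func (e ErrShouldNotSendEvent) Error() string {
	return e.msg
}

// handleEventError is a hypothetical caller that downgrades the log severity
// when the error only means "skip this event".
func handleEventError(err error) string {
	if err == nil {
		return "none"
	}
	switch err.(type) {
	case ErrShouldNotSendEvent:
		return "debug" // expected condition, not worth an error-level log
	default:
		return "error" // genuine failure
	}
}
```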
4 changes: 0 additions & 4 deletions agent/api/task/task.go
@@ -508,10 +508,6 @@ func (task *Task) initNetworkMode(acsTaskNetworkMode *string) {
field.NetworkMode: aws.StringValue(acsTaskNetworkMode),
})
}
logger.Info("Task network mode initialized", logger.Fields{
Contributor Author commented:

This log message is misleading: the network mode is not actually "initialized" in any functional way here; this function is only called when unmarshaling the task payload from ACS. Since the function runs on every task payload, and ACS often sends redundant payloads, the message is also repeated anywhere from 4 to 10 times per task, for example:

% cat /var/log/ecs/ecs-agent.log | grep "Task network mode initialized"|grep "XXXXa1e7baf64457b336ddbcacdc1411"
level=info time=2023-03-24T17:10:44Z msg="Task network mode initialized" task="XXXXa1e7baf64457b336ddbcacdc1411" networkMode="awsvpc"
level=info time=2023-03-24T17:10:46Z msg="Task network mode initialized" task="XXXXa1e7baf64457b336ddbcacdc1411" networkMode="awsvpc"
level=info time=2023-03-24T17:10:48Z msg="Task network mode initialized" task="XXXXa1e7baf64457b336ddbcacdc1411" networkMode="awsvpc"
level=info time=2023-03-24T17:10:53Z msg="Task network mode initialized" task="XXXXa1e7baf64457b336ddbcacdc1411" networkMode="awsvpc"
level=info time=2023-03-24T17:10:55Z msg="Task network mode initialized" task="XXXXa1e7baf64457b336ddbcacdc1411" networkMode="awsvpc"

field.TaskID: task.GetID(),
field.NetworkMode: task.NetworkMode,
})
}

func (task *Task) initServiceConnectResources() error {
9 changes: 9 additions & 0 deletions agent/engine/docker_task_engine.go
@@ -1752,6 +1752,15 @@ func (engine *DockerTaskEngine) provisionContainerResourcesAwsvpc(task *apitask.
}
}

Contributor Author commented:

This adds a new log message like the following:

level=info time=2023-03-24T17:49:44Z msg="Setting up CNI config for task" task="XXXX13d3c33f4d2ab9bacb00ef38XXXX" cniContainerID="XXXX" cniPluginPath="" cniID="02:64:22:51:ac:81" cniBridgeName="" cniContainerNetNs="/host/proc/24977/ns/net"
level=info time=2023-03-24T17:49:54Z msg="Setting up CNI config for task" cniID="02:05:6d:7e:76:ed" cniBridgeName="" cniContainerNetNs="/host/proc/25476/ns/net" task="XXXX53b676e04a85aefc7de8a4baXXXX" cniContainerID="XXXX" cniPluginPath=""

logger.Info("Setting up CNI config for task", logger.Fields{
field.TaskID: task.GetID(),
"cniContainerID": cniConfig.ContainerID,
"cniPluginPath": cniConfig.PluginsPath,
"cniID": cniConfig.ID,
"cniBridgeName": cniConfig.BridgeName,
"cniContainerNetNs": cniConfig.ContainerNetNS,
})

// Invoke the libcni to config the network namespace for the container
result, err := engine.cniClient.SetupNS(engine.ctx, cniConfig, cniSetupTimeout)
if err != nil {
1 change: 1 addition & 0 deletions agent/stats/container.go
@@ -39,6 +39,7 @@ func newStatsContainer(dockerID string, client dockerapi.DockerClient, resolver
DockerID: dockerID,
Name: dockerContainer.Container.Name,
NetworkMode: dockerContainer.Container.GetNetworkMode(),
StartedAt: dockerContainer.Container.GetStartedAt(),
},
ctx: ctx,
cancel: cancel,
21 changes: 18 additions & 3 deletions agent/stats/engine.go
@@ -707,6 +707,8 @@ func newDockerContainerMetadataResolver(taskEngine ecsengine.TaskEngine) (*Docke
}

// taskContainerMetricsUnsafe gets all container metrics for a task arn.
//
//gocyclo:ignore
func (engine *DockerStatsEngine) taskContainerMetricsUnsafe(taskArn string) ([]*ecstcs.ContainerMetric, error) {
containerMap, taskExists := engine.tasksToContainers[taskArn]
if !taskExists {
@@ -723,10 +725,20 @@ func (engine *DockerStatsEngine) taskContainerMetricsUnsafe(taskArn string) ([]*
if engine.stopTrackingContainerUnsafe(container, taskArn) {
continue
}
// age is used to determine if we should or should not expect missing metrics.
// this is because recently-started containers would normally not have their metrics
// queue filled yet.
age := time.Since(container.containerMetadata.StartedAt)
// gracePeriod is the time that containers are allowed to have missing metrics
// without throwing/logging errors.
gracePeriod := time.Second * 30
Contributor Author commented:

These warning messages are often logged soon after a container starts up, because the metrics queue has not been filled yet.

This change adds a 30-second "grace period" during which we do not warn about this condition; this prevents messages like the following:

level=warn time=2023-03-24T17:39:04Z msg="Error getting network stats for container" taskARN="arn:aws:ecs:us-west-2:XXXX:task/XXXX/XXXX" container="fc650d7f8817603ba602106383c7402f4c4c7b70c7fdf66be8903cc4ef169ee7" error="need at least 2 data points in queue to calculate CW stats set"
level=error time=2023-03-24T17:39:04Z msg="Error collecting cloudwatch metrics for container" container="b1bc98349145b528b58a9badb4565472f3ca81a2dd4c9e00e34b796eaab7ba85" error="need at least 2 data points in queue to calculate CW stats set"
level=warn time=2023-03-24T17:39:04Z msg="Error getting network rx bytes: need at least 2 data points in the queue to calculate int stats" module=queue.go
level=warn time=2023-03-24T17:39:04Z msg="Error getting network rx dropped: need at least 2 data points in the queue to calculate int stats" module=queue.go
level=warn time=2023-03-24T17:39:04Z msg="Error getting network rx errors: need at least 2 data points in the queue to calculate int stats" module=queue.go
level=warn time=2023-03-24T17:39:04Z msg="Error getting network rx packets: need at least 2 data points in the queue to calculate int stats" module=queue.go
level=warn time=2023-03-24T17:39:04Z msg="Error getting network tx bytes: need at least 2 data points in the queue to calculate int stats" module=queue.go
level=warn time=2023-03-24T17:39:04Z msg="Error getting network tx dropped: need at least 2 data points in the queue to calculate int stats" module=queue.go
level=warn time=2023-03-24T17:39:04Z msg="Error getting network tx errors: need at least 2 data points in the queue to calculate int stats" module=queue.go
level=warn time=2023-03-24T17:39:04Z msg="Error getting network tx packets: need at least 2 data points in the queue to calculate int stats" module=queue.go
level=warn time=2023-03-24T17:39:04Z msg="Error getting network rx bytes per second: need at least 2 data points in queue to calculate CW stats set" module=queue.go
level=warn time=2023-03-24T17:39:04Z msg="Error getting network tx bytes per second: need at least 2 data points in queue to calculate CW stats set" module=queue.go
level=warn time=2023-03-24T17:39:04Z msg="Error getting network stats for container" taskARN="arn:aws:ecs:us-west-2:XXXX:task/XXXX/XXXX" container="c36eca3d790f6d67ed0771891f0885d078961febc7769b69dd65a5f9aabc69a6" error="need at least 2 data points in queue to calculate CW stats set"
level=warn time=2023-03-24T17:39:04Z msg="Error getting network rx bytes: need at least 2 data points in the queue to calculate int stats" module=queue.go
level=warn time=2023-03-24T17:39:04Z msg="Error getting network rx dropped: need at least 2 data points in the queue to calculate int stats" module=queue.go
level=warn time=2023-03-24T17:39:04Z msg="Error getting network rx errors: need at least 2 data points in the queue to calculate int stats" module=queue.go
level=warn time=2023-03-24T17:39:04Z msg="Error getting network rx packets: need at least 2 data points in the queue to calculate int stats" module=queue.go
level=warn time=2023-03-24T17:39:04Z msg="Error getting network tx bytes: need at least 2 data points in the queue to calculate int stats" module=queue.go
level=warn time=2023-03-24T17:39:04Z msg="Error getting network tx dropped: need at least 2 data points in the queue to calculate int stats" module=queue.go
level=warn time=2023-03-24T17:39:04Z msg="Error getting network tx errors: need at least 2 data points in the queue to calculate int stats" module=queue.go
level=warn time=2023-03-24T17:39:04Z msg="Error getting network tx packets: need at least 2 data points in the queue to calculate int stats" module=queue.go
level=warn time=2023-03-24T17:39:04Z msg="Error getting network rx bytes per second: need at least 2 data points in queue to calculate CW stats set" module=queue.go
level=warn time=2023-03-24T17:39:04Z msg="Error getting network tx bytes per second: need at least 2 data points in queue to calculate CW stats set" module=queue.go
level=warn time=2023-03-24T17:39:04Z msg="Error getting network stats for container" container="8f6c44ed82edd31bd15e1a0e9fffe592929b241888db594a53dc44b4d794b26e" error="need at least 2 data points in queue to calculate CW stats set" taskARN="arn:aws:ecs:us-west-2:XXXX:task/XXXX/XXXX"

Contributor commented:
Nice addition!


// CPU and Memory are both critical, so skip the container if either of these fail.
cpuStatsSet, err := container.statsQueue.GetCPUStatsSet()
if err != nil {
if age < gracePeriod {
continue
}
logger.Error("Error collecting cloudwatch metrics for container", logger.Fields{
field.Container: dockerID,
field.Error: err,
@@ -735,6 +747,9 @@ func (engine *DockerStatsEngine) taskContainerMetricsUnsafe(taskArn string) ([]*
}
memoryStatsSet, err := container.statsQueue.GetMemoryStatsSet()
if err != nil {
if age < gracePeriod {
continue
}
logger.Error("Error collecting cloudwatch metrics for container", logger.Fields{
field.Container: dockerID,
field.Error: err,
@@ -749,7 +764,7 @@ func (engine *DockerStatsEngine) taskContainerMetricsUnsafe(taskArn string) ([]*
}

storageStatsSet, err := container.statsQueue.GetStorageStatsSet()
if err != nil {
if err != nil && age > gracePeriod {
logger.Warn("Error getting storage stats for container", logger.Fields{
field.Container: dockerID,
field.Error: err,
@@ -777,7 +792,7 @@ func (engine *DockerStatsEngine) taskContainerMetricsUnsafe(taskArn string) ([]*
seelog.Debug("Skip adding network stats for pause container in Service Connect enabled task")
} else {
networkStatsSet, err := container.statsQueue.GetNetworkStatsSet()
if err != nil {
if err != nil && age > gracePeriod {
// we log the error and still continue to publish cpu, memory stats
logger.Warn("Error getting network stats for container", logger.Fields{
field.Container: dockerID,
@@ -795,7 +810,7 @@ func (engine *DockerStatsEngine) taskContainerMetricsUnsafe(taskArn string) ([]*
// do not add network stats for pause container
if dockerContainer.Container.Type != apicontainer.ContainerCNIPause {
networkStats, err := taskStatsMap.StatsQueue.GetNetworkStatsSet()
if err != nil {
if err != nil && age > gracePeriod {
logger.Warn("Error getting network stats for container", logger.Fields{
field.TaskARN: taskArn,
field.Container: dockerContainer.DockerID,
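To illustrate the guard added above in isolation: a short, self-contained sketch (the function and variable names here are hypothetical) of skipping missing-metrics warnings for containers younger than the grace period:

```go
package main

import (
	"fmt"
	"time"
)

// shouldReportMissingMetrics returns true only when a container has been
// running longer than the grace period, so freshly started containers with
// an underfilled stats queue do not trigger warnings.
func shouldReportMissingMetrics(startedAt time.Time, gracePeriod time.Duration) bool {
	age := time.Since(startedAt)
	return age > gracePeriod
}

func main() {
	gracePeriod := 30 * time.Second
	startedAt := time.Now().Add(-10 * time.Second) // container started 10s ago

	if !shouldReportMissingMetrics(startedAt, gracePeriod) {
		fmt.Println("within grace period: skipping missing-metrics warning")
	}
}
```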
38 changes: 24 additions & 14 deletions agent/stats/queue.go
@@ -169,62 +169,72 @@ func (queue *Queue) GetMemoryStatsSet() (*ecstcs.CWStatsSet, error) {
func (queue *Queue) GetStorageStatsSet() (*ecstcs.StorageStatsSet, error) {
storageStatsSet := &ecstcs.StorageStatsSet{}
var err error
var errStr string
Contributor Author commented:

The errors below are already logged as a warning in the calling function, so the duplicate logging here is removed.

storageStatsSet.ReadSizeBytes, err = queue.getULongStatsSet(getStorageReadBytes)
if err != nil {
seelog.Warnf("Error getting storage read size bytes: %v", err)
errStr += fmt.Sprintf("error getting storage read size bytes: %v - ", err)
}
storageStatsSet.WriteSizeBytes, err = queue.getULongStatsSet(getStorageWriteBytes)
if err != nil {
seelog.Warnf("Error getting storage write size bytes: %v", err)
errStr += fmt.Sprintf("error getting storage write size bytes: %v - ", err)
}
return storageStatsSet, err
var errOut error
if len(errStr) > 0 {
errOut = fmt.Errorf(errStr)
}
return storageStatsSet, errOut
}

// GetNetworkStatsSet gets the stats set for network metrics.
func (queue *Queue) GetNetworkStatsSet() (*ecstcs.NetworkStatsSet, error) {
networkStatsSet := &ecstcs.NetworkStatsSet{}
var err error
var errStr string
networkStatsSet.RxBytes, err = queue.getULongStatsSet(getNetworkRxBytes)
if err != nil {
seelog.Warnf("Error getting network rx bytes: %v", err)
errStr += fmt.Sprintf("error getting network rx bytes: %v - ", err)
}
networkStatsSet.RxDropped, err = queue.getULongStatsSet(getNetworkRxDropped)
if err != nil {
seelog.Warnf("Error getting network rx dropped: %v", err)
errStr += fmt.Sprintf("error getting network rx dropped: %v - ", err)
}
networkStatsSet.RxErrors, err = queue.getULongStatsSet(getNetworkRxErrors)
if err != nil {
seelog.Warnf("Error getting network rx errors: %v", err)
errStr += fmt.Sprintf("error getting network rx errors: %v - ", err)
}
networkStatsSet.RxPackets, err = queue.getULongStatsSet(getNetworkRxPackets)
if err != nil {
seelog.Warnf("Error getting network rx packets: %v", err)
errStr += fmt.Sprintf("error getting network rx packets: %v - ", err)
}
networkStatsSet.TxBytes, err = queue.getULongStatsSet(getNetworkTxBytes)
if err != nil {
seelog.Warnf("Error getting network tx bytes: %v", err)
errStr += fmt.Sprintf("error getting network tx bytes: %v - ", err)
}
networkStatsSet.TxDropped, err = queue.getULongStatsSet(getNetworkTxDropped)
if err != nil {
seelog.Warnf("Error getting network tx dropped: %v", err)
errStr += fmt.Sprintf("error getting network tx dropped: %v - ", err)
}
networkStatsSet.TxErrors, err = queue.getULongStatsSet(getNetworkTxErrors)
if err != nil {
seelog.Warnf("Error getting network tx errors: %v", err)
errStr += fmt.Sprintf("error getting network tx errors: %v - ", err)
}
networkStatsSet.TxPackets, err = queue.getULongStatsSet(getNetworkTxPackets)
if err != nil {
seelog.Warnf("Error getting network tx packets: %v", err)
errStr += fmt.Sprintf("error getting network tx packets: %v - ", err)
}
networkStatsSet.RxBytesPerSecond, err = queue.getUDoubleCWStatsSet(getNetworkRxPacketsPerSecond)
if err != nil {
seelog.Warnf("Error getting network rx bytes per second: %v", err)
errStr += fmt.Sprintf("error getting network rx bytes per second: %v - ", err)
}
networkStatsSet.TxBytesPerSecond, err = queue.getUDoubleCWStatsSet(getNetworkTxPacketsPerSecond)
if err != nil {
seelog.Warnf("Error getting network tx bytes per second: %v", err)
errStr += fmt.Sprintf("error getting network tx bytes per second: %v - ", err)
}
var errOut error
if len(errStr) > 0 {
errOut = fmt.Errorf(errStr)
}
return networkStatsSet, err
return networkStatsSet, errOut
}

func getNetworkRxBytes(s *UsageStats) uint64 {
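The change above collects the per-metric errors into one combined error and leaves the decision to log (and at what level) to the caller. A standalone sketch of that aggregation pattern, with hypothetical metric names and a stubbed lookup function:

```go
package main

import (
	"fmt"
	"strings"
)

// aggregateStats gathers several metrics; instead of logging each failure
// here, it returns one combined error so the caller decides how to report it.
func aggregateStats(get func(name string) (uint64, error)) (map[string]uint64, error) {
	stats := map[string]uint64{}
	var errs []string
	for _, name := range []string{"rx_bytes", "tx_bytes", "rx_packets", "tx_packets"} {
		v, err := get(name)
		if err != nil {
			errs = append(errs, fmt.Sprintf("error getting %s: %v", name, err))
			continue
		}
		stats[name] = v
	}
	if len(errs) > 0 {
		return stats, fmt.Errorf("%s", strings.Join(errs, " - "))
	}
	return stats, nil
}

func main() {
	// Example: every metric fails, so one combined error comes back.
	_, err := aggregateStats(func(name string) (uint64, error) {
		return 0, fmt.Errorf("need at least 2 data points in the queue")
	})
	fmt.Println(err)
}
```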
23 changes: 13 additions & 10 deletions agent/stats/task.go
@@ -57,22 +57,14 @@ func (taskStat *StatsTask) collect() {
backoff := retry.NewExponentialBackoff(time.Second*1, time.Second*10, 0.5, 2)

for {
err := taskStat.processStatsStream()
statStreamErr := taskStat.processStatsStream()
select {
case <-taskStat.Ctx.Done():
logger.Debug("Stopping stats collection for taskStat", logger.Fields{
field.TaskID: taskId,
})
return
default:
if err != nil {
d := backoff.Duration()
time.Sleep(d)
logger.Debug("Error querying stats for task", logger.Fields{
field.TaskID: taskId,
field.Error: err,
})
}
// We were disconnected from the stats stream.
// Check if the task is terminal. If it is, stop collecting metrics.
terminal, err := taskStat.terminal()
@@ -83,11 +75,22 @@ func (taskStat *StatsTask) collect() {
field.Error: err,
})
taskStat.StopStatsCollection()
continue
} else if terminal {
logger.Warn("Task is terminal, stopping stats collection", logger.Fields{
logger.Info("Task is terminal, stopping stats collection", logger.Fields{
Contributor Author commented:

This is not really a warning, since it is expected behavior.

field.TaskID: taskId,
})
taskStat.StopStatsCollection()
continue
}
// task stats were not stopped for terminal task, backoff before trying to reconnect
if statStreamErr != nil {
Contributor Author commented:

Backing off and logging only after we determine whether the task has exited prevents a flood of log messages like this:

level=warn time=2023-03-24T17:32:52Z msg="Error encountered processing metrics stream from host, this may affect cloudwatch metric accuracy" task="815c01e432284155bd0024be207989a7" error="failed to Statfs \"/host/proc/12554/ns/net\": no such file or directory"

d := backoff.Duration()
time.Sleep(d)
logger.Debug("Error querying stats for task", logger.Fields{
field.TaskID: taskId,
field.Error: err,
})
}
}
}
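The restructured loop above checks whether the task is terminal before backing off and logging the stream error. A simplified, hypothetical sketch of that ordering (the real loop also watches the context and uses continue rather than returning after stopping collection):

```go
package main

import "time"

// collect retries a stats stream, but consults the task state first so a
// terminal task stops collection instead of generating backoff log noise.
func collect(processStream func() error, isTerminal func() (bool, error), nextBackoff func() time.Duration, stopCollection func()) {
	for {
		streamErr := processStream()

		terminal, err := isTerminal()
		if err != nil || terminal {
			// Task can no longer be resolved, or it has exited:
			// stop collecting rather than retrying and logging.
			stopCollection()
			return
		}

		if streamErr != nil {
			// Task is still running: back off before reconnecting.
			time.Sleep(nextBackoff())
		}
	}
}

func main() {
	// Example wiring with stub functions; a real caller would pass the
	// agent's stats-stream and task-state lookups.
	collect(
		func() error { return nil },
		func() (bool, error) { return true, nil }, // task already terminal
		func() time.Duration { return time.Second },
		func() {},
	)
}
```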
7 changes: 4 additions & 3 deletions agent/stats/types.go
@@ -62,9 +62,10 @@ type UsageStats struct {

// ContainerMetadata contains meta-data information for a container.
type ContainerMetadata struct {
DockerID string `json:"-"`
Name string `json:"-"`
NetworkMode string `json:"-"`
DockerID string `json:"-"`
Name string `json:"-"`
NetworkMode string `json:"-"`
StartedAt time.Time `json:"-"`
}

// TaskMetadata contains meta-data information for a task.