
Non essential container state change bug fix #1026

Merged: 7 commits merged into aws:dev from nonEssentialContainerStateChangeBug on Oct 27, 2017

Conversation

aaithal
Contributor

@aaithal aaithal commented Oct 20, 2017

Summary

Fixes the bug wherein non-essential container state changes were not being submitted to ECS

Implementation details

Added a ticker to drain the event queue every minute. If
there are any container state changes not submitted,
the ticker ensures that they do get submitted.

Testing

  • Builds on Linux (make release)
  • Builds on Windows (go build -o amazon-ecs-agent.exe ./agent)
  • Unit tests on Linux (make test) pass
  • Unit tests on Windows (go test -timeout=25s ./agent/...) pass
  • Integration tests on Linux (make run-integ-tests) pass
  • Integration tests on Windows (.\scripts\run-integ-tests.ps1) pass
  • Functional tests on Linux (make run-functional-tests) pass
  • Functional tests on Windows (.\scripts\run-functional-tests.ps1) pass

New tests cover the changes: Yes

Description for the changelog

Will add an entry in CHANGELOG.md

Licensing

This contribution is under the terms of the Apache 2.0 License: Yes

@aaithal aaithal changed the title Non essential container state change bug Non essential container state change bug fix Oct 20, 2017
@richardpen richardpen left a comment

code flow looks good, some comments below.

event.taskChange.Task.SetSentStatus(event.taskChange.Status)
}
for _, container := range event.taskChange.Containers {
container.Container.SetSentStatus(event.taskChange.Status.ContainerStatus(container.Container.GetSteadyStateStatus()))


Why not just container.Status instead of event.taskChange.Status.ContainerStatus(container.Container.GetSteadyStateStatus())?

Contributor Author

I missed that we have this in ContainerStateChange. I've modified this.


Looks like this is still not fixed, did you miss this in the commit?

Contributor Author

Seems so. Working on it.

return false // redundant event
}
if tevent.Task != nil && tevent.Task.GetSentStatus() >= tevent.Status {
for _, containerStateChange := range tevent.Containers {
container := containerStateChange.Container


Please add some comments here to make it easier to read. This part should also have a unit test covering it.

Contributor Author

@@ -188,57 +254,14 @@ func (handler *TaskHandler) SubmitTaskEvents(taskEvents *eventList, client api.E
event := eventToSubmit.Value.(*sendableEvent)

if event.containerShouldBeSent() {


We probably don't need this part of the code now; that can be addressed in another PR.


var events []api.TaskStateChange
for taskARN, _ := range handler.tasksToContainerStates {
if task, ok := handler.state.TaskByArn(taskARN); ok {


I'm not sure the else case can ever happen, but since you're checking for it here, we may want to remove the entry from the handler.tasksToContainerStates map when it does. A log line would probably help too.

// getBatchedContainerEvents gets a list of task state changes for container events that
// have been batched and not sent beyond the drainEventsFrequency threshold
func (handler *TaskHandler) getBatchedContainerEvents() []api.TaskStateChange {
handler.taskHandlerLock.RLock()


There may be a race condition where the ticker is sending while a task state change is also being received. Can you move this lock inside the ticker (line 110)?

Contributor Author

Good catch. As per offline discussion, I don't think moving the lock itself resolves the race condition. However, moving https://github.com/aws/amazon-ecs-agent/blob/dev/agent/eventhandler/task_handler.go#L114 to https://github.com/aws/amazon-ecs-agent/blob/dev/agent/eventhandler/task_handler.go#L154 does resolve it I think. I'll modify that.

@@ -58,16 +67,73 @@ type TaskHandler struct {
// stateSaver is a statemanager which may be used to save any
// changes to a task or container's SentStatus
stateSaver statemanager.Saver

drainEventsFrequency time.Duration
Contributor

Looks like this needs gofmt.

Contributor Author

Why do you think so? It is formatted code.

Contributor

I...have no idea what I was looking at. Please ignore.

ticker.Stop()
return
case <-ticker.C:
seelog.Info("TaskHandler tick")
Contributor

Should this be Debug level? (and potentially line 105 too?) If we keep it at Info, I think we should make it more descriptive.

Contributor Author

That's a miss on my part. It should be removed.

case <-ticker.C:
seelog.Info("TaskHandler tick")
for _, event := range handler.getBatchedContainerEvents() {
seelog.Infof(
Contributor

Same question as above. Should this be Debug?

Contributor Author

It makes sense for this to be Info, as it's a trigger for submitting state changes to the backend and we'd want to know about it without having to elevate the log level to debug.


// Extract the wrapped event from the list element
event := eventToSubmit.Value.(*sendableEvent)
seelog.Infof("TaskHandler, Sending %s change: %s", eventType, event.String())
Contributor

This changed from Debug in the prior if blocks to Info. I like this increase in verbosity better than some of the other Info-level additions. However, I might flip this to Debug and log only the success message (line 309) as Info.

@aaithal aaithal force-pushed the nonEssentialContainerStateChangeBug branch from d5c41a2 to b96d157 Compare October 24, 2017 21:09
defer handler.lock.RUnlock()

var events []api.TaskStateChange
for taskARN, _ := range handler.tasksToContainerStates {


If only the keys are required, you can simply use: for taskARN := range handler.tasksToContainerStates.


Do you want to modify this?

}
go taskHandler.startDrainEventsTicker()


Can we explicitly call this when starting the event handler in agent/app/agent.go? Calling it inside the New function seems confusing.

Contributor Author

I think we have this pattern elsewhere as well. An object creation method doing all the things for completing the object creation doesn't seem like an unwanted thing to me. I don't have strong opinions here. I can change it if you really want me to.

@@ -117,142 +198,218 @@ func (handler *TaskHandler) flushBatch(event *api.TaskStateChange) {
// addEvent prepares a given event to be sent by adding it to the handler's appropriate
// eventList and remove the entry in tasksToEvents map
func (handler *TaskHandler) addEvent(change *sendableEvent, client api.ECSClient) {
handler.taskHandlerLock.Lock()


I think there is another race here; we may want to move this lock into AddStateChangeEvent. But I'd like to confirm with you.

Contributor Author

Yeah, that's a valid race. Please look at the latest revision where this has been fixed as well.

event.taskChange.Task.SetSentStatus(event.taskChange.Status)
}
for _, container := range event.taskChange.Containers {
container.Container.SetSentStatus(event.taskChange.Status.ContainerStatus(container.Container.GetSteadyStateStatus()))


Looks like this is still not fixed, did you miss this in the commit?

@aaithal aaithal force-pushed the nonEssentialContainerStateChangeBug branch 2 times, most recently from e04e7f9 to 814b098 Compare October 25, 2017 21:16
@richardpen richardpen left a comment

LGTM as long as all the tests pass. Some minor comments on the structure.

defer handler.taskHandlerLock.Unlock()
// startDrainEventsTicker starts a ticker that periodically drains the events queue
// by submitting state change events to the ECS backend
func (handler *TaskHandler) startDrainEventsTicker() {


nit: I'd love to move this function before AddStateChangeEvent.

Contributor Author

I wanted the exported functions to be at the top of the file as that's a reasonable way to organize the code.

defer handler.lock.RUnlock()

var events []api.TaskStateChange
for taskARN, _ := range handler.tasksToContainerStates {


Do you want to modify this?

// sendChange adds the change to the sendable events list. If the
// event for the task is in the process of being sent, it doesn't trigger
// the async submit state api method
func (taskEvents *taskSendableEvents) sendChange(change *sendableEvent,


Same as above: can you move this above submitTaskEvents?

Contributor Author

The order is TaskHandler methods and then taskEvents methods (exported first, non-exported afterwards), which is again an organizational structure that we've followed everywhere in the code base.

@samuelkarp samuelkarp added this to the 1.15.0 milestone Oct 26, 2017
This commit refactors the 'SubmitTaskEvents' method to
consolidate the duplicate code in event submission logic
into a single method.
Fixed a bug where upon submitting the task state change,
the container sent state was not being updated. This
would cause spurious container state changes in the
submission queue of task handler upon agent restart.
Added a ticker to drain the event queue every minute. If
there are any container state changes not submitted,
the ticker ensures that they do get submitted.
This commit fixes a bunch of things in the task handler code:
1. Breaking up the really long submitTaskEvents method: The callback
for the backoff has been moved into its own named method from the
in-place anonymous method (send())
2. The backoff callback method now returns an error when there's an
error submitting state change to ECS backend. This fixes a bug with
the old version of the code where the backoff would never happen on
errors because the error was never propagated all the way up
3. The race-condition between startDrainEventsTicker() and
AddStateChangeEvent() when both are trying to submit changes to the
same task is fixed as the tasksToEvents is used to group events
belonging to the same task and only cleaned up when events are submitted
instead of being cleaned before invoking the submitTaskEvents() routine
@aaithal aaithal force-pushed the nonEssentialContainerStateChangeBug branch from 814b098 to 2f3ce92 Compare October 26, 2017 15:11
There was another race condition where, if an event was already
being marked for submission, the batched container changes would
be lost if another submission trigger came in for the same task
from the ticker. This commit makes it so that the two maps
in task handler are updated in a single lock scope to avoid such
a race.

This commit also refactors the code to get rid of most methods
that were not tied to any struct. Instead, each struct
has its own methods to deal with sending state changes to ECS
@aaithal aaithal force-pushed the nonEssentialContainerStateChangeBug branch from 2f3ce92 to d959f31 Compare October 26, 2017 15:21
Contributor

@adnxn adnxn left a comment

Code LGTM, thanks for cleaning this up. Just a few questions for my understanding.

//eventsListLock locks both the list and sending bool
lock sync.Mutex
// createdAt is a timestamp for when the event list was created
createdAt time.Time
Contributor

why do we need a createdAt field?

Contributor Author

Both the createdAt and taskARN fields are there to help with debugging/logging.

func (handler *TaskHandler) batchContainerEvent(event api.ContainerStateChange) {
handler.taskHandlerLock.Lock()
defer handler.taskHandlerLock.Unlock()
// startDrainEventsTicker starts a ticker that periodically drains the events queue
Contributor

nit: maybe add to the comment that the intended use is to call this during object creation.

func (handler *TaskHandler) flushBatch(event *api.TaskStateChange) {
handler.taskHandlerLock.Lock()
defer handler.taskHandlerLock.Unlock()
// batchContainerEventUnsafe collects container state change events for a given task arn
Contributor

So I understand: why are we getting rid of the locks in batchContainerEventUnsafe and flushBatchUnsafe?

Contributor Author

We're moving it up to AddStateChangeEvent. More details in the commit message: d959f31.

Contributor

@vsiddharth vsiddharth left a comment

LGTM 🚀

}
// drainEventsFrequency is the frequency at which unsent events batched
// by the task handler are sent to the backend
drainEventsFrequency = time.Minute
Contributor

Should we go faster than a minute? A minute seems pretty slow to me.

Contributor Author

As per offline conversation, I'll change this to 20s.

// sendChange adds the change to the sendable events list. If the
// event for the task is in the process of being sent, it doesn't trigger
// the async submit state api method
func (taskEvents *taskSendableEvents) sendChange(change *sendableEvent,
Contributor

Can you add a bit to this comment here explaining that the actual task submission (API request) occurs asynchronously, so that this function doesn't end up causing the lock to be held for a long time?

}
go taskHandler.startDrainEventsTicker()
Contributor

I'd like a comment somewhere explaining the flow for batched events. This is roughly what I got through reading this:

startDrainEventsTicker->
	getBatchedContainerEvents->
		read from tasksToContainerStates and returns TaskEvent
	AddStateChangeEvent->
		flushBatchUnsafe

The part that took me a while is that AddStateChangeEvent also calls batchContainerUnsafe for ContainerEvents and the key difference is that getBatchedContainerEvents returns a TaskEvent even though it has "ContainerEvents" in the name.

Contributor Author

I'll rename getBatchedContainerEvents -> taskStateChangesToSend. Will add some comments in startDrainEventsTicker as well.


// Mirror events.sending, but without the need to lock since this is local
// to our goroutine
done := false

// TODO: wire in the context here. Else, we have go routine leaks in tests
Contributor

Is this TODO expected to be fixed before merging? goroutine leaks in tests can cause flaky tests.

Contributor Author

Not with this PR :(

taskEvents.sending = true
go handler.submitTaskEvents(taskEvents, client, change.taskArn())
} else {
seelog.Debugf("TaskHandler: Not submitting change as the task is already being sent: %s", change.toString())
Contributor

So I think there's a race here, but the impact of the race is probably only a delay of drainEventsFrequency. If we evaluate taskEvents.sending as true we won't cause a new goroutine to be created for handler.submitTaskEvents. However, if the existing goroutine is just about done sending, it's possible it won't pick up the new event that we just added on line 263. I think this is a reasonable tradeoff here, but it would be worth recording it somewhere that the maximum length of time we can have an unsubmitted change is really effectively 2*drainEventsFrequency.

@aaithal
Contributor Author

aaithal commented Oct 27, 2017

@samuelkarp can you take a look at the latest commit? I think I've addressed your comments. Thanks!

@aaithal aaithal merged commit 4d9cdf2 into aws:dev Oct 27, 2017
@jhaynes jhaynes mentioned this pull request Oct 31, 2017
6 participants