
Make the progresscontainer independent of each other #1306

Merged: richardpen merged 3 commits into aws:dev on Mar 27, 2018

Conversation

@richardpen richardpen commented Mar 19, 2018

Summary

Change waitForContainerTransitions to wait for only one transition instead of all of them, so that each container can move forward on its own without waiting for the other containers' transitions.

Implementation details

Removed the logic of waiting for all transitions in waitForContainerTransitions, and added a transitioning status to indicate that a container is in the middle of being processed, to avoid duplicate actions on the same container.
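
Below is a minimal, illustrative Go sketch of the behavioral difference (not the agent's actual code: the channel wiring and the names waitForAllTransitions/waitForOneTransition are simplified assumptions):

```go
package main

import (
	"fmt"
	"time"
)

// Simplified stand-ins for the agent's types: the real code lives on
// managedTask and uses api.ContainerStatus plus the task's transition and
// docker-event channels. Names here are illustrative only.

// waitForAllTransitions mirrors the old behavior: block until every pending
// transition has reported back before any container can be progressed again.
func waitForAllTransitions(pending int, done <-chan string) {
	for i := 0; i < pending; i++ {
		fmt.Println("transition finished:", <-done)
	}
}

// waitForOneTransition mirrors the new behavior: return as soon as a single
// transition finishes, so the next progressContainers pass can move that
// container forward without waiting for the slower ones.
func waitForOneTransition(done <-chan string) {
	fmt.Println("transition finished:", <-done)
}

func main() {
	done := make(chan string, 2)

	// Two containers transitioning concurrently: one fast, one slow
	// (for example, a large image pull).
	go func() { time.Sleep(10 * time.Millisecond); done <- "fast-container" }()
	go func() { time.Sleep(300 * time.Millisecond); done <- "slow-container" }()

	// New behavior: wait for the first transition only, so "fast-container"
	// can be progressed immediately instead of being held up by "slow-container".
	waitForOneTransition(done)
}
```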

Testing

  • Builds on Linux (make release)
  • Builds on Windows (go build -out amazon-ecs-agent.exe ./agent)
  • Unit tests on Linux (make test) pass
  • Unit tests on Windows (go test -timeout=25s ./agent/...) pass
  • Integration tests on Linux (make run-integ-tests) pass
  • Integration tests on Windows (.\scripts\run-integ-tests.ps1) pass
  • Functional tests on Linux (make run-functional-tests) pass
  • Functional tests on Windows (.\scripts\run-functional-tests.ps1) pass
  • Manual tests with tasks that have multiple containers, verifying that the containers progress independently:
    • Task with multiple containers
    • Task that uses volume (including empty volume and non-empty volume)
    • Task that uses volumesFrom
    • Task that uses links
    • Set a cluster to periodically run a bunch of tasks

New tests cover the changes:

Description for the changelog

Licensing

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@adnxn adnxn left a comment

took a first pass and i have a few questions.

  • since this is on a critical code path, how do we plan on ensuring it doesn't introduce any regressions? are we doing anything else outside scope of existing functional tests? i'm mainly wondering what kind of additional validation we should be doing for changes like this.

  • what happens if the agent restarts? we don't care about the state of the container's ContainerTransitioningStatus when we synchronize state? mtasks will just continue to progress containers based on KnownStatus/DesiredStatus?

@@ -76,6 +106,14 @@ func (healthStatus ContainerHealthStatus) BackendStatus() string {
}
}

// Done returns whether the transitioing is done base don the container known status
Contributor

spelling: "....based on the container..."


// Check if the container transition has already finished
if containerTransitioningStatusMapping[c.transitioningStatus] <= knownStatus {
c.transitioningStatus = ContainerNotTransition
Contributor

do we need a lock for this? also should we just be using the setter here?

Author

This method is only called from within SetKnownStatus, which already holds the lock. The public setter would take the lock again, which we don't need here; that's why the name carries the Unsafe suffix.
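
As an aside for readers following the thread, here is a small hypothetical sketch of that locking convention (simplified types; the real code operates on api.Container and, after the later revision, on AppliedStatus):

```go
package container

import "sync"

// Illustrative sketch of the locking convention described above; the real
// type is api.Container and carries much more state.
type containerStatus int32

const statusNone containerStatus = 0

type container struct {
	lock          sync.RWMutex
	knownStatus   containerStatus
	appliedStatus containerStatus
}

// SetKnownStatus takes the container lock and then calls the *Unsafe helper,
// which assumes the lock is already held. That is why the helper carries the
// Unsafe suffix and does no locking of its own; calling the public setter
// from here would try to take the lock twice.
func (c *container) SetKnownStatus(status containerStatus) {
	c.lock.Lock()
	defer c.lock.Unlock()

	c.knownStatus = status
	c.updateAppliedStatusUnsafe(status)
}

// updateAppliedStatusUnsafe must only be called while c.lock is held.
func (c *container) updateAppliedStatusUnsafe(knownStatus containerStatus) {
	if c.appliedStatus <= knownStatus {
		// The in-flight transition has completed, so the container is free
		// for its next transition.
		c.appliedStatus = statusNone
	}
}
```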

@richardpen

@adnxn For question 1: the functional tests only cover the happy path, so I will also need to test the failure cases where a container fails at different stages:

  • Task with multiple containers
  • Task that uses volume (including empty volume and non-empty volume)
  • Task that uses volumesFrom
  • Task that uses links
  • Set a cluster to periodically run a bunch of tasks

Let me know if you can think of any other test scenarios that were missed here.

For 2, we don't need to save the state, as the agent already has logic to handle duplicate create/start; for pull/stop, the agent will have to pull/stop again after an agent restart.

@richardpen richardpen added this to the 1.18.0 milestone Mar 21, 2018
@@ -548,3 +553,38 @@ func (c *Container) GetHealthStatus() HealthStatus {

return copyHealth
}

// UpdateTransitioningStatusUnsafe updates the container transitioning status
func (c *Container) UpdateTransitioningStatusUnsafe(knownStatus ContainerStatus) {
Contributor

This seems like it should be scoped to the package: updateTransitioningStatusUnsafe

@@ -619,7 +619,7 @@ func (mtask *managedTask) progressContainers() {
// complete, but keep reading events as we do. in fact, we have to for
// transitions to complete
mtask.waitForContainerTransitions(transitions, transitionChange, transitionChangeContainer)
seelog.Debugf("Managed task [%s]: done transitioning all containers", mtask.Arn)
seelog.Debugf("Managed task [%s]: done transitioning container", mtask.Arn)
Contributor

Can you also log the name of the container here?

@@ -166,6 +166,9 @@ type Container struct {
// handled properly so that the state storage continues to work.
SentStatusUnsafe ContainerStatus `json:"SentStatus"`

// transitioningStatus is the status of the container
transitioningStatus ContainerTransitioningStatus
Contributor

how is this different from AppliedStatus? Also, can you add more details in the documentation about what this field is supposed to do? Why do we need it? How will it be used, etc.?

Another question is: do we need to persist it? If not, why not?

Author

Thanks for pointing this out, I didn't know there was already an AppliedStatus serving the same purpose. After some investigation, I think we can just use AppliedStatus, and the transitioning-status code should be removed. I will update the PR.

}

// GetTransitioning returns the transitioning status of container
func (c *Container) GetTransitioning() ContainerTransitioningStatus {
Contributor

can you please rename this to TransitionStatus()?

@@ -49,9 +49,28 @@ const (
ContainerUnhealthy
)

const (
// ContainerNotTransition means the container isn't in the middle of a transition
ContainerNotTransition ContainerTransitioningStatus = iota
Contributor

Can this be ContainerTransitionNone?

// ContainerNotTransition means the container isn't in the middle of a transition
ContainerNotTransition ContainerTransitioningStatus = iota
// ContainerPulling means the container is in the process of pulling
ContainerPulling
Contributor

Can you make sure that ContainerTransition is a prefix for all of these? Example: ContainerTransitionPulling

// ContainerStatus is an enumeration of valid states in the container lifecycle
type ContainerStatus int32

// ContainerTransitioningStatus is an enumeration of valid container processing
// status which indicates which status the container is being processed
type ContainerTransitioningStatus int32
Contributor

Can you move all of Transition statuses and methods to a separate file?


// SetTransitioningStatus sets the transitioning status of container and returns whether
// the container is already in a transition
func (c *Container) SetTransitioningStatus(status ContainerTransitioningStatus) bool {
Contributor

UpdateTransitioningTo might be a better name here.

Status: transition.nextState,
},
})
go func(cont *api.Container, status api.ContainerStatus) {
Contributor

why was this logic changed? Can you please add documentation/comments for that as well?

break
}
}
// Wait for one transition/acs/docker message
Contributor

Same comment here. Why was this changed? What's the impact? Can we please add documentation explaining the change?

@sharanyad sharanyad left a comment

Did we manually test this when two or more containers are present in the task?

@@ -241,12 +243,14 @@ func (c *Container) GetKnownStatus() ContainerStatus {
return c.KnownStatusUnsafe
}

// SetKnownStatus sets the known status of the container
// SetKnownStatus sets the known status of the container and update the container
// transitioning status
Contributor

nit: container applied status

}

// Check if the container transition has already finished
if c.AppliedStatus <= knownStatus {
Contributor

Would the condition c.AppliedStatus > knownStatus ever be true here? We set the knownStatus only when a transition is complete, right?

Author

If the agent receives duplicate docker events for some reason, this could happen, as an event can carry a past status of the container.
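
For illustration, a hedged sketch of that scenario with made-up status values standing in for api.ContainerStatus, showing why the check is `<=` rather than `==`:

```go
package container

// Illustrative status values only; the agent's real enumeration is
// api.ContainerStatus.
type containerStatus int32

const (
	statusNone containerStatus = iota
	statusPulled
	statusCreated
	statusRunning
)

// nextAppliedStatus sketches the reset logic: appliedStatus records the
// transition currently in flight, and knownStatus is the status reported by
// the latest docker event.
func nextAppliedStatus(appliedStatus, knownStatus containerStatus) containerStatus {
	if appliedStatus <= knownStatus {
		// The in-flight transition has finished (or been overtaken), so the
		// container is free for its next transition.
		return statusNone
	}
	// A stale or duplicate event can carry a *past* status, e.g. a repeated
	// "created" event (knownStatus == statusCreated) arriving while a start
	// is in flight (appliedStatus == statusRunning). In that case the marker
	// must be kept, otherwise the container would look free for another
	// transition while one is still in progress.
	return appliedStatus
}
```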

@@ -763,39 +776,23 @@ func (mtask *managedTask) onContainersUnableToTransitionState() {
mtask.emitTaskEvent(mtask.Task, taskUnableToTransitionToStoppedReason)
// TODO we should probably panic here
} else {
seelog.Criticalf("Managed task [%s]: voving task to stopped due to bad state", mtask.Arn)
seelog.Criticalf("Managed task [%s]: moving task to stopped due to bad state", mtask.Arn)
mtask.handleDesiredStatusChange(api.TaskStopped, 0)
}
}

func (mtask *managedTask) waitForContainerTransitions(transitions map[string]api.ContainerStatus,
Contributor

Can we rename this method, since it does not wait for all the container transitions?

Author

But it's still waiting for a container transition. Let me know if you have a better suggestion.

Contributor

oh, it just makes it look like it is waiting for ALL the containers' transitions, while it is waiting for just one to finish. Maybe just call it waitForContainerTransition?

// a goroutine so that it won't block here before the waitForContainerTransitions
// was called after this function.
go func(cont *api.Container, status api.ContainerStatus) {
mtask.dockerMessages <- dockerContainerChange{
Contributor

For my understanding - how do we still handle the container status changes here that handleContainerChange does when no action is required for the transition?

Author

All the events sent to mtask.dockerMessage will be handled by handleContainerChange which works the same as before except that we don't call handleContainerChange directly now. Is that what you are asking?
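
A rough, self-contained sketch of that flow, using simplified stand-ins for api.Container, dockerContainerChange, and the managed task's event loop (not the actual agent code):

```go
package main

import "fmt"

// Simplified stand-ins for api.Container, api.ContainerStatus,
// dockerContainerChange, and managedTask.
type container struct{ name string }
type containerStatus int

type dockerContainerChange struct {
	container *container
	status    containerStatus
}

type managedTask struct {
	dockerMessages chan dockerContainerChange
}

// For transitions that need no actual work, the status change is pushed onto
// dockerMessages from a goroutine instead of calling handleContainerChange
// directly. The task's event loop then handles it exactly like a real docker
// event, and the goroutine keeps this function from blocking before the wait
// on transitions starts draining the channel.
func (mtask *managedTask) emitSyntheticChange(cont *container, status containerStatus) {
	go func(cont *container, status containerStatus) {
		mtask.dockerMessages <- dockerContainerChange{container: cont, status: status}
	}(cont, status)
}

func main() {
	mtask := &managedTask{dockerMessages: make(chan dockerContainerChange)}
	mtask.emitSyntheticChange(&container{name: "app"}, 3)

	// Stand-in for the event loop calling handleContainerChange.
	change := <-mtask.dockerMessages
	fmt.Printf("handled change for %q -> status %d\n", change.container.name, change.status)
}
```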

Contributor

All the events sent to mtask.dockerMessage will be handled by handleContainerChange which works the same as before except that we don't call handleContainerChange directly now

It'd be awesome if you added that as a code comment

Contributor

@richardpen Got it, thanks

@aaithal commented Mar 21, 2018

This requires a CHANGELOG entry as well. Also, should this be tagged for 1.17.3 milestone?

@aaithal aaithal closed this Mar 21, 2018
@aaithal aaithal reopened this Mar 21, 2018

@@ -60,7 +60,8 @@ const (
// 9) Add 'ipToTask' map to state file
// 10) Add 'healthCheckType' field in 'api.Container'
// 11) Add 'PrivateDNSName' field to 'api.ENI'
ECSDataVersion = 11
// 12) Remove `AppliedStatus` field form 'api.Container'
ECSDataVersion = 12
Contributor

This should still be 11. 10 is what's released out there.

@richardpen richardpen modified the milestones: 1.18.0, 1.17.3 Mar 21, 2018
// There could be multiple transitions, but we just need to wait for one of them
// to ensure that there is at least one container can be processed in the next
// progressContainers. This is done by waiting for one transition/acs/docker message.
if mtask.waitEvent(transitionChange) {
@sharanyad sharanyad Mar 22, 2018

If we do not wait for a container transition here, the next call of progressContainers would happen, which may be a no-op. Is that the reason why we are waiting here?

Author

Yes, otherwise it would exhaust CPU with an empty for loop.

@richardpen richardpen force-pushed the parallel_container_process branch 3 times, most recently from 144c3c2 to 51bcc58 Compare March 22, 2018 18:50
Previously, waitForContainerTransition would wait for all the transitions to finish before a container could move to its next state. This PR changes waitForContainerTransition to wait for at most one transition, so the first container to finish its transition doesn't need to wait for the other containers.
@adnxn adnxn left a comment

thanks for answering my questions above. I've got one more minor comment. otherwise code lgtm.

also lol that we were writing AppliedStatus to the state file this whole time too. nice that it's seeing some use now 😁

defer c.lock.Unlock()

if c.AppliedStatus != ContainerStatusNone {
// return false to indicate the set operation failed
Contributor

it's not clear to me why returning false here will indicate the set operation failed. do you mean the SetKnownStatus and then updateAppliedStatusUnsafe path has failed?

Author

No, failed means the container is already in a transition (i.e., the applied status isn't none). This ensures that the agent won't call the same API (pull/create/start/stop) twice for the same container.
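
To make the "return false" semantics concrete, a minimal hypothetical sketch (simplified types; the real method lives on api.Container):

```go
package container

import "sync"

type containerStatus int32

const statusNone containerStatus = 0

type container struct {
	lock          sync.Mutex
	appliedStatus containerStatus
}

// SetAppliedStatus records that a transition (pull/create/start/stop) is
// about to be issued for this container. It returns false when another
// transition is already in flight, which is what prevents the agent from
// calling the same Docker API twice for the same container.
func (c *container) SetAppliedStatus(status containerStatus) bool {
	c.lock.Lock()
	defer c.lock.Unlock()

	if c.appliedStatus != statusNone {
		// Already transitioning: the caller must not start another transition.
		return false
	}

	c.appliedStatus = status
	return true
}
```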

allWaitingOnPulled = false
}
}
if allWaitingOnPulled {
@sharanyad sharanyad Mar 23, 2018

How is this condition handled in the new change?

Author

Previously the pull would block all the transitions, which is why we wanted to break out of the wait when allWaitingOnPulled was true. The new change makes the wait non-blocking, so that case is handled by default.

@aaithal aaithal left a comment

I have some minor comments. Also, can you make sure that we have a unit test that's testing the behavior where the order of events is being inspected? For example, a task with two containers, where one container (c1) takes a long time to pull or get created vs another one that doesn't (c2) and ensuring that c2 transitions to RUNNING before c1?
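
A rough sketch of the kind of ordering test being asked for, using plain goroutines and sleeps as stand-ins for the task engine and mocked Docker calls (the real test would drive managedTask with a mocked Docker client):

```go
package container

import (
	"testing"
	"time"
)

// TestFastContainerRunsBeforeSlowOne sketches the ordering check: two
// containers transition concurrently, one with a slow pull, and the test
// asserts that the fast one reaches RUNNING first.
func TestFastContainerRunsBeforeSlowOne(t *testing.T) {
	running := make(chan string, 2)

	start := func(name string, pull time.Duration) {
		go func() {
			time.Sleep(pull) // stand-in for PULLED -> CREATED -> RUNNING
			running <- name
		}()
	}

	start("c1-slow-pull", 200*time.Millisecond)
	start("c2-fast", 10*time.Millisecond)

	if first := <-running; first != "c2-fast" {
		t.Errorf("expected c2-fast to reach RUNNING first, got %s", first)
	}
	if second := <-running; second != "c1-slow-pull" {
		t.Errorf("expected c1-slow-pull to reach RUNNING second, got %s", second)
	}
}
```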

// to ensure that there is at least one container can be processed in the next
// progressContainers. This is done by waiting for one transition/acs/docker message.
if mtask.waitEvent(transitionChange) {
changedContainer := <-transitionChangeContainer
Contributor

minor: transitionedContainer is a better name for this

// There could be multiple transitions, but we just need to wait for one of them
// to ensure that there is at least one container can be processed in the next
// progressContainers. This is done by waiting for one transition/acs/docker message.
if mtask.waitEvent(transitionChange) {
Contributor

minor: transition is a better name for this

mtask.waitForContainerTransitions(transitions, transitionChange, transitionChangeContainer)
seelog.Debugf("Managed task [%s]: done transitioning all containers", mtask.Arn)
mtask.waitForContainerTransition(transitions, transitionChange, transitionChangeContainer)
seelog.Debugf("Managed task [%s]: wait for container transition done", mtask.Arn)
Contributor

I don't think we need this line. You're logging in waitForContainerTransition() anyway

// There could be multiple transitions, but we just need to wait for one of them
// to ensure that there is at least one container can be processed in the next
// progressContainers. This is done by waiting for one transition/acs/docker message.
if mtask.waitEvent(transitionChange) {
Contributor

This also needs an else part, where we can log that we were interrupted or the transition did not finish or something like that.

@richardpen richardpen merged commit 2ee395f into aws:dev Mar 27, 2018