
Merge branch 'dev' into eni-trunking #1915

Merged
fenxiong merged 75 commits into aws:eni-trunking from dev on Mar 4, 2019

Conversation

fenxiong
Contributor

Summary

Merge branch 'dev' into eni-trunking. The only conflict is the data file version in agent/statemanager/statemanager.go, where I adjusted the version accordingly.

Implementation details

Testing

New tests cover the changes:

Description for the changelog

Licensing

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

ubhattacharjya and others added 30 commits January 30, 2019 13:31
Merge 'dev' branch into 'container-ordering' branch
* start timeout is governed by the agent and is the context timeout for
the StartContainer API call

* stop timeout is the parameter passed to StopContainer and is the time the
Docker daemon waits after a StopContainer call before issuing a SIGKILL to the
container
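
To illustrate the distinction, here is a minimal Go sketch; the interface and names below are stand-ins for the agent's Docker client wrapper, not its actual API:

```go
package engine

import (
	"context"
	"time"
)

// dockerClient is a stand-in for the agent's Docker client wrapper; these
// signatures are assumptions for illustration.
type dockerClient interface {
	StartContainer(ctx context.Context, id string) error
	// stopTimeout is forwarded to the Docker daemon, which waits that long
	// after the stop request before sending SIGKILL to the container.
	StopContainer(ctx context.Context, id string, stopTimeout time.Duration) error
}

// startContainer enforces the start timeout agent-side, as a context
// deadline on the StartContainer API call itself.
func startContainer(client dockerClient, id string, startTimeout time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), startTimeout)
	defer cancel()
	return client.StartContainer(ctx, id)
}
```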
go vet fixes for unkeyed fields and copying lock value
Mongo images are replaced with redis and crux images. Using Mongo images was leading to unpredictable results in functional and integration tests every time the Mongo images were updated on Docker Hub.
'Success' condition on a dependency container will allow a target container to start only when the dependency container has exited successfully with exit code 0.
'Complete' condition on a dependency container will allow a target container to start only when the dependency container has exited.
'Healthy' condition on a dependency container will allow a target container to start only when the dependency container is reported to be healthy.
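
A minimal sketch of these three conditions in Go; the type, field, and function names are assumptions for illustration, not the agent's identifiers:

```go
package ordering

// DependencyCondition mirrors the ordering conditions described above.
type DependencyCondition string

const (
	ConditionSuccess  DependencyCondition = "SUCCESS"
	ConditionComplete DependencyCondition = "COMPLETE"
	ConditionHealthy  DependencyCondition = "HEALTHY"
)

// containerState is the observed state of a dependency container.
type containerState struct {
	Exited   bool
	ExitCode int
	Healthy  bool
}

// resolved reports whether a dependency container's observed state
// satisfies the given condition, per the semantics above.
func resolved(cond DependencyCondition, dep containerState) bool {
	switch cond {
	case ConditionSuccess:
		return dep.Exited && dep.ExitCode == 0
	case ConditionComplete:
		return dep.Exited
	case ConditionHealthy:
		return dep.Healthy
	default:
		return false
	}
}
```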
Prior to this commit, we only tracked state we explicitly tried to
change when the task was starting. We did not respond to the event
stream or any other source of information from Docker. This means
that when we are waiting for certain dependency conditions
("SUCCESS", "COMPLETE", or "HEALTHY") the task progression logic
does not update the agent-internal model of container state.

Since we rely on that state for determining when conditions are
met, tasks would get stuck in infinite startup loops. This change
adds a call to engine.checkTaskState(), which explicitly updates
any changed container state. We only call this function if we know
that we are waiting on the aforementioned subset of dependency
conditions.

Co-authored-by: Utsa Bhattacharjya <utsa@amazon.com>
engine: adding poll function during progressTask
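
Roughly, the shape of this change looks like the sketch below; aside from checkTaskState, the names are assumptions:

```go
package engine

// managedTask is a stub standing in for the agent's per-task state machine.
type managedTask struct {
	// true while blocked on a SUCCESS, COMPLETE, or HEALTHY dependency
	waitingOnStateCondition bool
}

type DockerTaskEngine struct{}

// checkTaskState reconciles the agent-internal container model with the
// state Docker currently reports (stubbed here).
func (e *DockerTaskEngine) checkTaskState(t *managedTask) {}

func (e *DockerTaskEngine) progressTask(t *managedTask) {
	// Normal transition handling would happen here; additionally, poll
	// Docker directly, but only when the task waits on a state-based
	// condition that the transition path alone will not observe.
	if t.waitingOnStateCondition {
		e.checkTaskState(t)
	}
}
```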
We now apply shutdown order in any dependency case, including
dependsOn directives, links, or volumes. What this means is that
the agent will now make a best-effort attempt to shut down containers
in the inverse order they were created.

For example, a container using a link for communication will wait
until the linked container has terminated before terminating
itself. Likewise, a container named in another container's
dependsOn directive will wait until that dependent container
terminates.

One note about the current implementation is that the dependencies
aren't assumed to be transitive, so if a chain exists such that:

A -> B -> C

Container "C" will shutdown before "B", but it won't check status
against container "A" explicity. If A depends on C, we expect:

A -> B -> C
     A -> C

The lack of transitive dependency logic is consistent with
startup order as of this commit.
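
A sketch of the best-effort reverse-order check in Go; all names are illustrative, not the agent's identifiers:

```go
package ordering

// container models the shutdown-order bookkeeping described above.
type container struct {
	Name      string
	DependsOn []string // direct dependencies only; not treated as transitive
	Stopped   bool
}

// stopResolved reports whether target may be stopped: true once no still
// running container names target as a dependency, so containers stop in
// roughly the inverse order they were created.
func stopResolved(target *container, all []*container) bool {
	for _, c := range all {
		if c.Stopped {
			continue
		}
		for _, dep := range c.DependsOn {
			if dep == target.Name {
				return false
			}
		}
	}
	return true
}
```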
The link / volume dependency tests are now affected by shutdown
order, so the tests now take longer. Previously, they would take a
max of 30s (the default Docker stop timeout for the agent). Now, since
the containers stop in order, it will take a max of 30s * n, where
n is the number of containers.

Increasing the test timeout is a short term fix until we have
granular start/stop timeouts plumbed through the ecs service.
Instead of explicitly checking against many conditions, we now
validate that the expected condition has progressed beyond the started state.

This mirrors prior behavior in the codebase and reduces cyclomatic
complexity.
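
A minimal sketch of the simplification, assuming an ordered status type like the agent's container statuses (names are illustrative):

```go
package ordering

// ContainerStatus is an ordered stand-in for the agent's container status
// values; the names and ordering here are assumptions for illustration.
type ContainerStatus int

const (
	StatusNone ContainerStatus = iota
	StatusCreated
	StatusStarted
	StatusStopped
)

// progressedBeyondStarted replaces a case-by-case check of terminal
// conditions with a single ordered comparison.
func progressedBeyondStarted(s ContainerStatus) bool {
	return s > StatusStarted
}
```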
The ‘StartTimeout’ will now only serve as the time duration after which, if a container has a dependency on another container and the condition is ‘SUCCESS’, ‘HEALTHY’, or ‘COMPLETE’, the dependency will not be resolved. For example:

•	If container A has a dependency on container B and the condition is ‘SUCCESS’, the StartTimeout for container B will roughly be the time required for it to exit successfully with exit code 0.
•	If container A has a dependency on container B and the condition is ‘COMPLETE’, the StartTimeout for container B will roughly be the time required for it to exit.
•	If container A has a dependency on container B and the condition is ‘HEALTHY’, the StartTimeout for container B will roughly be the time required for it to emit a ‘HEALTHY’ status.

If the StartTimeout is exceeded in any of the above cases, container A will not be able to transition to ‘CREATED’ status.

It effectively reverts the implementation of StartTimeout in commit: 79bd517
This is the first batch of integration tests for container ordering. The
tests handle the basic use cases for each of the conditions that
introduce new behavior into the agent (HEALTHY, COMPLETE, SUCCESS).
* "START" Dependency condition has been changed to "CREATE" as it waits for the dependency to atleast get created
* "RUNNING" Dependency Condition has been changed to "START" as it waits for the dependency to get started.
Here, whether the time duration (StartTimeout) specified by the user for a container has expired is checked before resolving the dependency for the target container.

For example:
* If a target container 'A' has a dependency on container 'B' and the dependency condition is 'SUCCESS', the dependency will not be resolved if B times out before exiting successfully with exit code 0.
* If a target container 'A' has a dependency on container 'B' and the dependency condition is 'COMPLETE', the dependency will not be resolved if B times out before exiting.
* If a target container 'A' has a dependency on container 'B' and the dependency condition is 'HEALTHY', the dependency will not be resolved if B times out before emitting a 'Healthy' status.

The advantage of this is that the user will get to know that something is wrong with the task if it is stuck in PENDING.
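
A minimal sketch of this expiry check, assuming a per-container StartTimeout and start time (names are illustrative, not the agent's identifiers):

```go
package ordering

import "time"

// dependencyContainer carries the timing fields the expiry check needs.
type dependencyContainer struct {
	StartedAt    time.Time
	StartTimeout time.Duration
}

// timedOut reports whether the dependency's StartTimeout has elapsed since
// it started; once true, a SUCCESS, COMPLETE, or HEALTHY dependency on this
// container can no longer be resolved, so the task surfaces as stuck rather
// than waiting forever.
func (d *dependencyContainer) timedOut(now time.Time) bool {
	return d.StartTimeout > 0 && now.Sub(d.StartedAt) >= d.StartTimeout
}
```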
Dependency Condition Naming change:
Remove the functionality of StartTimeout as Docker API Start Timeout
* Remove need to pull 'latest' server core

By removing the :latest tag from all windowsservercore containers,
we will have the tests use the container that's already baked into
the AMI.

* Remove dependency on golang and python containers

We are removing the need to use any containers other than servercore
and nanoserver. This reduces the number of downloads needed and the
number of builds that happen before the tests start running.

* Explicit timeouts on order tests

The ordering tests are broken at the moment, so we are capping them
with a fixed timeout.
yumex93 and others added 15 commits February 27, 2019 22:30
Some tests use "0" cpu and this value makes it possible for us to test
them.
aws#1890 was merged using rebase and merge, which is wrong; dev is still "56 commits ahead, 1 commit behind master." Fixing by merging again.
Fix merge master to dev "update to 1.25.3"
We only need to verify the sha for the vpc cni plugins, because amazon-vpc-cni-plugins
does not have the same version setup as amazon-ecs-cni-plugins.
Fix amazon-vpc-cni-plugins sha check.
Adding functional tests for container ordering
increase linux integ tests timeout
update amazon-vpc-cni-plugins sha
@suneyz
Contributor

suneyz commented Feb 28, 2019

looks like you have conflicts to resolve

@fenxiong
Contributor Author

fenxiong commented Mar 1, 2019

#1918

fenxiong and others added 6 commits March 1, 2019 07:48
TestOOMContainer runs a task that's expected to fail fast because the container tries to use too much memory. The test had a check where it waited for the task to reach RUNNING, but since the task stops quickly, it's possible that when the test describes the task, it has already stopped and the test fails. Fixing by removing the wait-for-running check.
This timeout was fine when working on a dev box, but it's not quite as
happy in automated testing.
There was a race condition in setting up credential port forwarding, in which the adapter has multiple addresses before the new IP is assigned. Allowing this race condition can cause the netsh interface portproxy setup to leave the network adapter in an unrecoverable bad state, where the proxy is unable to listen on port 80 because the port is in use by a non-functional proxy rule.

This race condition happened rarely before, but started becoming a problem when changes were made to let the Windows tests run faster (aws#1886), which caused TestV3TaskEndpointDefaultNetworkMode and TestV3TaskEndpointTags to fail at a high rate because they need the credential port forwarding.
This invocation was removed by accident but didn't break anything, since hostsetup.ps1 was invoked elsewhere. Still, we should add it back; otherwise we are assuming hostsetup.ps1 will be invoked elsewhere beforehand.
Fix Windows Functional Tests
@fenxiong
Contributor Author

fenxiong commented Mar 4, 2019

Re-pushing to bring in test fix from #1919

@fenxiong fenxiong merged commit 5f65e9c into aws:eni-trunking Mar 4, 2019