Implement transition function for MANIFEST_PULLED state #4152

Merged
merged 4 commits into from
Apr 29, 2024

@amogh09 amogh09 commented Apr 23, 2024

Summary

This PR implements a transition function for container MANIFEST_PULLED internal state. This state was added in #4137 but the transition function was left unimplemented. This PR fills that gap.

The transition function for MANIFEST_PULLED state tries to resolve the image manifest digest for the container image using the following logic and sets the container.ImageDigest field with it.

  1. If the container is an internally managed container, skip digest resolution: a digest exists only for images pulled from a registry, and images used by internally managed containers are not pulled from a registry.
  2. If a digest exists in the image reference in task payload then use that.
  3. Otherwise, if image pull is not required (due to the configured image pull behavior setting) then inspect the locally available image and get its manifest digest (aka repo digest).
  4. Otherwise, if the Docker engine version on the host supports API version 1.35 or later, then pull the image manifest from the image registry and get its digest. The dependency on API version 1.35 exists because the Distribution API used for fetching manifests from registries is supported only on API version 1.30 and later, and it does not support v1 registries. Support for v1 registries was removed in Docker Engine v17.12, which was released with API version 1.35. So, we are restricting this feature to Docker engines 17.12 and later to ensure that we can reliably get manifests from registries. A new timeout that can be configured with ECS_MANIFEST_PULL_TIMEOUT is used when calling the registry.
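The four steps above can be sketched as a single decision function. This is only an illustration of the ordering, not the code in this PR: `containerInfo`, `digestSource`, `resolveDigestSource`, and `supportsManifestPull` are hypothetical names, and detecting a digest by looking for `@sha256:` in the image reference is a simplification.

```go
package main

import (
	"fmt"
	"strings"
)

// digestSource names where the manifest digest would come from.
// All identifiers in this sketch are hypothetical, not Agent code.
type digestSource string

const (
	sourceSkipInternal    digestSource = "skip: internally managed container"
	sourceImageReference  digestSource = "use digest from image reference"
	sourceLocalImage      digestSource = "inspect locally available image"
	sourceRegistry        digestSource = "pull manifest from registry"
	sourceSkipUnsupported digestSource = "skip: Docker API older than 1.35"
)

// containerInfo holds only the inputs that the four steps depend on.
type containerInfo struct {
	Internal     bool   // internally managed container
	Image        string // image reference from the task payload
	PullRequired bool   // result of the image pull behavior check
	APIMajor     int    // Docker API version supported by the host
	APIMinor     int
}

// supportsManifestPull reports whether the API version is 1.35 or later.
func supportsManifestPull(major, minor int) bool {
	return major > 1 || (major == 1 && minor >= 35)
}

// resolveDigestSource applies the steps in the order listed in the summary.
func resolveDigestSource(c containerInfo) digestSource {
	switch {
	case c.Internal:
		return sourceSkipInternal
	case strings.Contains(c.Image, "@sha256:"): // simplified digest check
		return sourceImageReference
	case !c.PullRequired:
		return sourceLocalImage
	case supportsManifestPull(c.APIMajor, c.APIMinor):
		return sourceRegistry
	default:
		return sourceSkipUnsupported
	}
}

func main() {
	c := containerInfo{Image: "example/app:latest", PullRequired: true, APIMajor: 1, APIMinor: 35}
	fmt.Println(resolveDigestSource(c))
}
```

The ordering matters: a digest already present in the image reference wins over both the local inspect and the registry call, so the network is touched only in the last supported case.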

In case of a failure during the transition to MANIFEST_PULLED state, a suitable error is generated, which becomes the container's ApplyingError, as is the case with any other state transition failure. Except for timeout and authentication errors, CannotPullImageManifestError is used to wrap errors during the transition to MANIFEST_PULLED state. This aligns with the use of CannotPullContainerError to wrap errors during the transition to PULLED state. For example, "CannotPullImageManifestError: Error response from daemon: manifest unknown: Requested image not found" would be the reason for the container transition failure if the transition to MANIFEST_PULLED state failed because the image reference in the task payload was invalid.
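A minimal sketch of this wrapping rule follows. Only the error name and the example message come from the description above; the struct layout, the `wrapManifestPullError` helper, and the sentinel errors standing in for timeout and authentication failures are assumptions made for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// CannotPullImageManifestError wraps a manifest pull failure.
// The field and methods are hypothetical; only the name is from the PR.
type CannotPullImageManifestError struct {
	Cause error
}

// Error prefixes the cause with the error name, matching the example reason.
func (e CannotPullImageManifestError) Error() string {
	return "CannotPullImageManifestError: " + e.Cause.Error()
}

// Unwrap lets errors.Is and errors.As reach the underlying cause.
func (e CannotPullImageManifestError) Unwrap() error { return e.Cause }

// Hypothetical sentinels standing in for timeout and authentication errors.
var (
	errManifestPullTimeout = errors.New("manifest pull timed out")
	errRegistryAuth        = errors.New("registry authentication failed")
	errManifestUnknown     = errors.New("Error response from daemon: manifest unknown: Requested image not found")
)

// wrapManifestPullError passes timeout and auth errors through unchanged
// and wraps every other failure in CannotPullImageManifestError.
func wrapManifestPullError(err error) error {
	if err == nil {
		return nil
	}
	if errors.Is(err, errManifestPullTimeout) || errors.Is(err, errRegistryAuth) {
		return err
	}
	return CannotPullImageManifestError{Cause: err}
}

func main() {
	fmt.Println(wrapManifestPullError(errManifestUnknown))
	fmt.Println(wrapManifestPullError(errManifestPullTimeout))
}
```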

As noted above, a new configuration setting ECS_MANIFEST_PULL_TIMEOUT is introduced in this PR that allows users to configure a timeout to be used when calling image registries to pull image manifests. The default value for this timeout is 1 minute and the minimum configurable value is 30 seconds.
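The default and minimum can be sketched as below. Two things here are assumptions rather than statements from this PR: that the value is written as a Go duration string such as `2m`, and that an unparsable or too-small value falls back to the default. The function and constant names are hypothetical.

```go
package main

import (
	"fmt"
	"os"
	"time"
)

const (
	// Default and minimum as stated in the PR description.
	defaultManifestPullTimeout = 1 * time.Minute
	minManifestPullTimeout     = 30 * time.Second
)

// parseManifestPullTimeout turns the raw setting into a duration.
// Assumption: empty, invalid, or below-minimum values yield the default.
func parseManifestPullTimeout(raw string) time.Duration {
	if raw == "" {
		return defaultManifestPullTimeout
	}
	d, err := time.ParseDuration(raw)
	if err != nil || d < minManifestPullTimeout {
		return defaultManifestPullTimeout
	}
	return d
}

func main() {
	timeout := parseManifestPullTimeout(os.Getenv("ECS_MANIFEST_PULL_TIMEOUT"))
	fmt.Println("manifest pull timeout:", timeout)
}
```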

This change adds time to task starts because the transition to MANIFEST_PULLED state can make an additional blocking call to the image registry. In my test environment the added time ranges from 300ms (ECR) to 900ms (Dockerhub). This is expected. In an upcoming PR we will change image pulls to use the resolved digest, which will make image pulls a bit faster, and the long-term impact on task start will be cut to 150ms (ECR) to 500ms (Dockerhub). For this reason, this PR is being merged to a short-lived feature branch so that we merge both changes to the dev branch as a single commit.

Implementation details

  • Implement *DockerTaskEngine.pullContainerManifest which is the transition function for MANIFEST_PULLED state as described above. A new helper method *DockerTaskEngine.setRegistryCredentials is added which is used to resolve private registry credentials. Code for this method is taken from the existing pullAndUpdateContainerReference method that is also refactored to call setRegistryCredentials method.
  • Update *dockerImageManager.RecordContainerReference method that is responsible for setting ImageDigest on containers. Now the method sets ImageDigest only if it's not pre-populated by the transition to MANIFEST_PULLED state.
  • Introduce a new ManifestPullTimeout setting in Config and parse its value from ECS_MANIFEST_PULL_TIMEOUT environment variable if it is available. Update relevant tests.
  • Existing engine tests are updated to take into account the new interactions with docker during transition to MANIFEST_PULLED state.
  • A new PutASMDockerAuthConfig method is added to *ASMAuthResource so that the resource may be set up with fake docker auth config for testing purposes. The method is not used outside of testing.
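The set-then-clear pattern behind setRegistryCredentials, which the TestSetRegistryCredentials description says returns a function for clearing the credentials, might look roughly like this. It is a simplified stand-in and not the method from this PR: the `container` and `registryAuth` types and the `lookup` callback are hypothetical.

```go
package main

import "fmt"

// registryAuth and container are hypothetical, trimmed-down types.
type registryAuth struct {
	Username string
	Password string
}

type container struct {
	Name string
	auth *registryAuth
}

// setRegistryCredentials resolves credentials through lookup, stores them
// on the container, and returns a function that clears them again.
func setRegistryCredentials(c *container, lookup func(name string) (registryAuth, error)) (func(), error) {
	creds, err := lookup(c.Name)
	if err != nil {
		return nil, fmt.Errorf("resolving registry credentials for %s: %w", c.Name, err)
	}
	c.auth = &creds
	return func() { c.auth = nil }, nil
}

func main() {
	c := &container{Name: "app"}
	cleanup, err := setRegistryCredentials(c, func(string) (registryAuth, error) {
		return registryAuth{Username: "user", Password: "secret"}, nil
	})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// Credentials are held only for the duration of the registry call.
	defer cleanup()
	fmt.Println("credentials set:", c.auth != nil)
}
```

Returning the cleanup function lets both the manifest pull and the image pull share one helper while keeping credentials on the container only as long as they are needed.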

Testing

Many new unit and integration tests are added.

Unit tests -

  • TestPullContainerManifest - Tests *DockerTaskEngine.pullContainerManifest method for all image pull behavior and container type cases. Checks all interactions with docker.
  • TestManifestPullTaskShouldContinue - Tests task engine's handling of tasks with a focus on transition to MANIFEST_PULLED state. This test covers success and failure cases for the transition when the task should continue its lifecycle. If image pull behavior is default or prefer-cached then the task should continue its lifecycle even if there was an error during transition to MANIFEST_PULLED state which is the same behavior as transition to PULLED state.
  • TestManifestPullFailuresTaskShouldStop - Similar to TestManifestPullTaskShouldContinue above but covers failure cases for the transition when the task should be stopped due to the transition failure. If image pull behavior is always or once then the task should be stopped if there was an error during transition to MANIFEST_PULLED state which is the same behavior as transition to PULLED state.
  • TestImagePullRequired - Tests *DockerTaskEngine.imagePullRequired method, which previously had no unit tests.
  • TestSetRegistryCredentials - Tests the newly added *DockerTaskEngine.setRegistryCredentials method. Ensures that the method sets registry auth credentials on the container and returns a function that can be called to clear the credentials when they are no longer needed.
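The continue-or-stop rule that TestManifestPullTaskShouldContinue and TestManifestPullFailuresTaskShouldStop exercise can be summarized in a few lines. The type and function names are hypothetical, and treating an unrecognized behavior like default is an assumption of this sketch.

```go
package main

import "fmt"

// imagePullBehavior mirrors the four behaviors named in this PR.
type imagePullBehavior string

const (
	pullDefault      imagePullBehavior = "default"
	pullAlways       imagePullBehavior = "always"
	pullOnce         imagePullBehavior = "once"
	pullPreferCached imagePullBehavior = "prefer-cached"
)

// stopTaskOnManifestPullFailure reports whether a failed transition to
// MANIFEST_PULLED should stop the task for the given pull behavior.
func stopTaskOnManifestPullFailure(b imagePullBehavior) bool {
	switch b {
	case pullAlways, pullOnce:
		return true
	case pullDefault, pullPreferCached:
		return false
	default:
		// Assumption: unknown values behave like default.
		return false
	}
}

func main() {
	for _, b := range []imagePullBehavior{pullDefault, pullAlways, pullOnce, pullPreferCached} {
		fmt.Printf("%s: stop task on failure = %v\n", b, stopTaskOnManifestPullFailure(b))
	}
}
```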

Integration tests -

  • TestManifestPulledDoesNotDependOnContainerOrdering - This existing test for MANIFEST_PULLED state is updated to check that ImageDigest field is set for containers after they have transitioned to MANIFEST_PULLED state.
  • TestPullContainerManifestInteg - This new test checks that *DockerTaskEngine.pullContainerManifest method works as expected for all image pull behaviors. That is, the method sets ImageDigest if success is expected and returns the right error if failure is expected.

I also performed the following manual tests.

  • Ran tasks with images in public ECR, private ECR, public dockerhub, and private dockerhub registries and checked that Agent is able to fetch image manifest digest for the images during transition to MANIFEST_PULLED state.
  • Ran tasks on an Agent configured with once, default, always, and prefer-cached image pull behaviors. Verified in the logs that the Agent contacts the registry for default and always image pull behaviors and gets the digest from the local image for once and prefer-cached image pull behaviors if the image is cached.
  • Ran ten tasks with an ECR image and then with a Dockerhub image, on an Agent with the changes in this PR and on an Agent without them, and computed the average task start latency for the tasks. Both Agents were configured to use the always image pull behavior. Observed an increase in task start time ranging from 300ms (for ECR) to 900ms (for Dockerhub) owing to the new image registry call during the transition to MANIFEST_PULLED state. This increase in task start time is expected given that a new blocking network call is made to fetch the image manifest before the container can be started.
  • Ran a task in an environment with Docker 17.03. The transition to MANIFEST_PULLED skipped digest resolution, and the warning message "Failed to find a supported API version that supports manifest pulls. Skipping digest resolution." was printed in the logs as expected.

New tests cover the changes: yes

Description for the changelog

Feature - Resolve image manifest digest for task containers during transition to MANIFEST_PULLED state before container image pulls. In the future, this early resolution of the manifest digest will be used to expedite the reporting of manifest digests to the ECS backend.

Does this PR include breaking model changes? If so, Have you added transformation functions?

Licensing

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@amogh09 amogh09 force-pushed the manifest-transition branch 2 times, most recently from 3bb91b8 to 3d299fb Compare April 23, 2024 04:06
@amogh09 amogh09 changed the title Manifest transition Implement transiton function for MANIFEST_PULLED state Apr 23, 2024
@amogh09 amogh09 changed the title Implement transiton function for MANIFEST_PULLED state Implement transition function for MANIFEST_PULLED state Apr 23, 2024
@amogh09 amogh09 marked this pull request as ready for review April 23, 2024 21:46
@amogh09 amogh09 requested a review from a team as a code owner April 23, 2024 21:46
@amogh09 amogh09 changed the base branch from dev to feature/digest-resolution April 24, 2024 17:55
@amogh09 amogh09 merged commit 9accad7 into aws:feature/digest-resolution Apr 29, 2024
40 checks passed

4 participants