Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

engine: introduce a new env var to distinct pull image behavior #1348

Merged
merged 1 commit into from
May 4, 2018

Conversation

haikuoliu
Copy link
Contributor

@haikuoliu haikuoliu commented Apr 17, 2018

Summary

This PR is to fix issue #413, it introduces a new environment variable ECS_AGENT_PULL_BEHAVIOR to allow users to define different pull image behaviors, please see the design doc in the comments of issue #413, and usage of the ECS_AGENT_PULL_BEHAVIOR in README.md.

Implementation details

  1. Add an enum environment variable to config file.
  2. Add a new variable PullSucceeded in ImageState, it will be set to true when the pull succeeds.
  3. For prefer-cached behavior, check if there is cached image by inspecting image, if inspecting image fails, then pull image, otherwise just update container reference. Make task fail when pull fails.
  4. For never behavior, do nothing but update container reference when pulling container.
  5. For once behavior, check if the image has been pulled by inspecting the IsPullSuccess of the image state, do nothing when the image has been pulled, otherwise pull as usual. If the pull image API call fails, make the task fail.
  6. Abstract the update container reference out in docker_task_engine.go.
  7. Some other refactor/bug fix.

Testing

  • Builds on Linux (make release)
  • Builds on Windows (go build -out amazon-ecs-agent.exe ./agent)
  • Unit tests on Linux (make test) pass
  • Unit tests on Windows (go test -timeout=25s ./agent/...) pass
  • Integration tests on Linux (make run-integ-tests) pass
  • Integration tests on Windows (.\scripts\run-integ-tests.ps1) pass
  • Functional tests on Linux (make run-functional-tests) pass
  • Functional tests on Windows (.\scripts\run-functional-tests.ps1) pass

New tests cover the changes: yes, unit tests, integ tests and some manual tests.

Description for the changelog

Feature - Introduce a new environment variable ECS_AGENT_PULL_BEHAVIOR to make agent pull behavior configurable

Licensing

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@haikuoliu haikuoliu added this to the 1.18.0 milestone Apr 17, 2018
CHANGELOG.md Outdated
@@ -1,5 +1,8 @@
# Changelog

## 1.18.0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The convention is to use 1.18.0-dev before release

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok.

mtask.cfg.AgentPullBehavior == config.OnceAgentPullBehavior {
seelog.Criticalf("Managed task [%s]: Error while pulling container %s, task will fail: %v",
mtask.Arn, container.Name, event.Error)
mtask.SetDesiredStatus(api.TaskStopped)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here, the task will be moved to STOPPED even if the container is not essential. Is that what we are implying by using the config variable? If yes, we might want to document that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Essential or non-essential is only effective for RUNNING containers. It shouldn't have a bearing on Pulled transition failing

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The task will stop for both essential or non-essential containers. I will add some comments for that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Today, when a container fails with desired CREATED , we move the task to STOPPED only if it is an essential container. We should do the same for PULLED too.
If not, we might want to change the logic to stop the task for failures during CREATED, irrespective of the essential status of containers.

@aaithal
Copy link
Contributor

aaithal commented Apr 18, 2018

@haikuoliu can you please fix failing tests?

@haikuoliu
Copy link
Contributor Author

@aaithal It's the timeout test TestStartContainerTimeout and not related to this PR. The timeout tests have a small change to fail, because the timeout value is too small and some methods don't get to execute.

We talked about this in the wire context PR #1329: to fix it either use a mock context (implemented in the initial PR), or use a cancel context to simulate a timeout context (in one of the revised PR), or don't fix it at all (in final PR), we choose the last method.

@aaithal
Copy link
Contributor

aaithal commented Apr 18, 2018

or don't fix it at all (in final PR), we choose the last method.

That doesn't seem right. I think you 'fix' it by adding MaxTimes(1) to the mock docker client call. Can you please make that change it it hasn't already been made somewhere else?

CHANGELOG.md Outdated
@@ -1,5 +1,8 @@
# Changelog

## 1.18.0
* Enhancement - Introduce a new environment variable `ECS_AGENT_PULL_BEHAVIOR` to make agent pull behavior configurable [#to be added](to be added)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can actually add the to be added info now. Also, this falls under the Feature bucket, not Enhancement

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK

README.md Outdated
@@ -169,6 +169,7 @@ additional details on each available environment variable.
| `ECS_IMAGE_CLEANUP_INTERVAL` | 30m | The time interval between automated image cleanup cycles. If set to less than 10 minutes, the value is ignored. | 30m | 30m |
| `ECS_IMAGE_MINIMUM_CLEANUP_AGE` | 30m | The minimum time interval between when an image is pulled and when it can be considered for automated image cleanup. | 1h | 1h |
| `ECS_NUM_IMAGES_DELETE_PER_CYCLE` | 5 | The maximum number of images to delete in a single automated image cleanup cycle. If set to less than 1, the value is ignored. | 5 | 5 |
| `ECS_AGENT_PULL_BEHAVIOR` | <default | always | never | once> | Behavior used to customize the pull image process. default: pull image remotely, use cached image if the pull fails; always: only pull image remotely, the task fails when the pull fails; never: only use cached image; once: pull image remotely only if this image has never been pulled before, use cached image otherwise. | default | default |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you please get @nrdlngr to review this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, will ask review from him.

@@ -86,6 +86,23 @@ const (
// performing image cleanup.
minimumNumImagesToDeletePerCycle = 1

// DefaultAgentPullBehavior specficies the behavior that if an image pull API call fails,
// agent tries to start from the Docker image cache anyway, assuming that the image has not changed.
DefaultAgentPullBehavior AgentPullBehaviorType = 0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please modify this be to in its own const block and use iota to be more idiomatic: https://golang.org/ref/spec#Iota

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, sure.

@@ -20,6 +20,8 @@ import (
cnitypes "github.com/containernetworking/cni/pkg/types"
)

type AgentPullBehaviorType int8
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this needs a comment

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK

default:
// If the agent pull behavior is never, we don't try to pull the image,
// try to use local image cache instead.
if engine.cfg.AgentPullBehavior == config.NeverAgentPullBehavior {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you please refactor these into a method? May be something that returns a boolean to indicate if you should return dockerapi.DockerContainerMetadata{Error: nil}?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, will abstract it out.

Containers []*api.Container `json:"-"`
PulledAt time.Time
LastUsedAt time.Time
IsPulledSuccess bool
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PullSucceeded seems like a better name for this. Also, please add documentation for this and all other fields.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, sure.

mtask.cfg.AgentPullBehavior == config.OnceAgentPullBehavior {
seelog.Criticalf("Managed task [%s]: Error while pulling container %s, task will fail: %v",
mtask.Arn, container.Name, event.Error)
mtask.SetDesiredStatus(api.TaskStopped)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Essential or non-essential is only effective for RUNNING containers. It shouldn't have a bearing on Pulled transition failing

@aaithal
Copy link
Contributor

aaithal commented Apr 18, 2018

@haikuoliu you also need to update the version and documentation for the agent's state file in

@haikuoliu
Copy link
Contributor Author

haikuoliu commented Apr 18, 2018

@aaithal My bad, thanks for pointing out. I will fix it. Will also update the version and documentation for state_manager.go

@haikuoliu
Copy link
Contributor Author

Update: fix the timeout test, add more comments, update change log, and some other refactor.

Still waiting for review of README.md.

CHANGELOG.md Outdated
@@ -1,5 +1,8 @@
# Changelog

## 1.18.0-dev
* Feature - Introduce a new environment variable `ECS_AGENT_PULL_BEHAVIOR` to make agent pull behavior configurable [#1348](https://github.com/aws/amazon-ecs-agent/pull/1348)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Feature - Configurable agent pull behavior [#1348].. seems like a better description for this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK

@@ -325,6 +344,7 @@ func environmentConfig() (Config, error) {
imageCleanupInterval := parseEnvVariableDuration("ECS_IMAGE_CLEANUP_INTERVAL")
numImagesToDeletePerCycleEnvVal := os.Getenv("ECS_NUM_IMAGES_DELETE_PER_CYCLE")
numImagesToDeletePerCycle, err := strconv.Atoi(numImagesToDeletePerCycleEnvVal)
agentPullBehavior := getAgentPullBehavior()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you don't need this variable. Just set AgentPullBehavior to getAgentPullBehavior().

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted to be consistent with other code. I just saw your refactor config commit, will change accordingly.

@@ -455,6 +476,24 @@ func getContainerStartTimeout() time.Duration {
return containerStartTimeout
}

func getAgentPullBehavior() AgentPullBehaviorType {
var agentPullBehavior AgentPullBehaviorType
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can get rid of this variable as well:

func getAgentPullBehavior() AgentPullBehaviorType {
 switch agentPullBehaviorString {
  case "always":
   return AlwaysAgentPullBehavior
  ...
  }
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK

@@ -369,6 +371,14 @@ func TestImageCleanupMinimumNumImagesToDeletePerCycle(t *testing.T) {
assert.Equal(t, cfg.NumImagesToDeletePerCycle, DefaultNumImagesToDeletePerCycle, "Wrong value for NumImagesToDeletePerCycle")
}

func TestInvalidAgentPullBehavior(t *testing.T) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a test for just getAgentPullBehavior() as well? Make it a table driven test so that you can test that all 4 valid values and an invalid value for ECS_AGENT_PULL_BEHAVIOR return the expected output.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK

@@ -741,16 +776,23 @@ func (engine *DockerTaskEngine) pullAndUpdateContainerReference(task *api.Task,
if container.IsInternal() {
return metadata
}
PullSucceeded := metadata.Error == nil
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this PullSucceeded and not pullSucceeded?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it should be pullSucceeded.

// (the image can be prepopulated with the AMI and never be pulled).
imageState := engine.imageManager.GetImageStateFromImageName(container.Image)
if imageState != nil && imageState.PullSucceeded {
seelog.Infof("Task engine [%s]: image %s has been pulled, use it directly for container %s",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'd suggest changing this slightly to :image %s has been pulled once, not pulling it again

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK

taskArn string) bool {
// If the agent pull behavior is never, we don't try to pull the image,
// try to use local image cache instead.
if agentPullBehavior == config.NeverAgentPullBehavior {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor: this'd be more readable with a switch and cases.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK

// otherwise pull the image as usual, regardless whether the image exists or not
// (the image can be prepopulated with the AMI and never be pulled).
imageState := engine.imageManager.GetImageStateFromImageName(container.Image)
if imageState != nil && imageState.PullSucceeded {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think all references to PullSucceeded need to be protected with a lock. Please add setters and getters in image state struct for that use those methods everywhere.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK

// where there is no local image cache will be handled when creating container.
if mtask.cfg.AgentPullBehavior == config.AlwaysAgentPullBehavior ||
mtask.cfg.AgentPullBehavior == config.OnceAgentPullBehavior {
seelog.Criticalf("Managed task [%s]: Error while pulling container %s, task will fail: %v",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you should probably log the container image here as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK

@@ -455,6 +476,24 @@ func getContainerStartTimeout() time.Duration {
return containerStartTimeout
}

func getAgentPullBehavior() AgentPullBehaviorType {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you should rebase on top my pr #1353 once it's merged so that this gets moved to parse.go.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK

const (
// DefaultAgentPullBehavior specficies the behavior that if an image pull API call fails,
// agent tries to start from the Docker image cache anyway, assuming that the image has not changed.
DefaultAgentPullBehavior AgentPullBehaviorType = iota

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ImagePullDefault ImagePullAlwasy ImagePullNever ImagePullOnce seems to be better than this, can you change the name to be more readable? Also, can you move this into the types.go?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I can change the name, but I think they are more of constants like DefaultClusterName and DefaultTaskCleanupWaitDuration, so they should be kept together with other constants defined in config.go.

func getAgentPullBehavior() AgentPullBehaviorType {
var agentPullBehavior AgentPullBehaviorType
agentPullBehaviorString := os.Getenv("ECS_AGENT_PULL_BEHAVIOR")
switch agentPullBehaviorString {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You may need to format the string a little bit before comparing, like trim space, convert all character to lowercase.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, Good idea!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi I think this has already been done by func (cfg *Config) trimWhitespace()

@@ -154,6 +154,9 @@ func (imageManager *dockerImageManager) addContainerReferenceToNewImageState(con
Image: sourceImage,
PulledAt: time.Now(),
LastUsedAt: time.Now(),
// The PullSucceeded filed is false by default,
// one has to explicitly set it to be true when the pull image succeeds.
PullSucceeded: false,

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please help me understand: the addContainerReferenceToNewImageState is called after the image is pulled succeed, why PullSucceeded was set to false, what will happen if it was set to true here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is because even if the image pull fails, the container reference will still be updated, so we set it to be false here. We check if there is no error with image pull, we then set it to be true.

seelog.Infof("Task engine [%s]: use cached image directly for container %s",
taskArn, container.Name)
// It's not an internal container, so it's safe to add container reference.
engine.updateContainerReference(false, container, taskArn)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case, do we even need to track the image state? If it's not pulled by agent, we shouldn't remove it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@richardpen and I discussed about the problem we may face about image cleaning. If we use never or once agent behavior setup, the image cleanup process could remove the image, and I think for never and once we should keep the image cache, right?

We think what we can do:

  1. Let customer know that there may be conflicts between image cleaning and agent pull behavior setup, and suggest them when using never or once, set NumImagesToDeletePerCycle to be 0 (or some other image cleanup settings that can prevent image from being removed).

  2. Disable image cleanup internally when using never or once, add some documentation on that so that user knows.

@sharanyad @richardpen @aaithal what do you think about this?

mtask.Arn, container.Name, event.Error)
// The task should be stopped regardless of whether this container is
// essential or non-essential.
mtask.SetDesiredStatus(api.TaskStopped)
Copy link

@richardpen richardpen Apr 19, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are you emitting the STOPPED event, the task may not be STOPPED at this point.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, don't notify ACS right now is better.

// useLocalImageCache returns true if local image cache should be used, or return false to continue
// pulling image, by inspecting the agent pull behavior variable defined in config. The caller has
// to make sure the container passed in is not an internal container.
func (engine *DockerTaskEngine) useLocalImageCache(agentPullBehavior config.AgentPullBehaviorType,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renaming this to ShouldUseLocalImage might be more readable.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK

// (the image can be prepopulated with the AMI and never be pulled).
imageState := engine.imageManager.GetImageStateFromImageName(container.Image)
if imageState != nil && imageState.PullSucceeded {
seelog.Infof("Task engine [%s]: image %s has been pulled, use it directly for container %s",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't we call updateContainerReference in this case?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah you are right, we should.

@@ -741,16 +776,23 @@ func (engine *DockerTaskEngine) pullAndUpdateContainerReference(task *api.Task,
if container.IsInternal() {
return metadata
}
PullSucceeded := metadata.Error == nil
engine.updateContainerReference(PullSucceeded, container, task.Arn)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this logic be handled here -


i.e, depending on InspectImage, set PullSucceeded value?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think inspect image and pull image are two different things, when you inspect an image, the image can be there because it' reconfigured with the AMI, instead of being pulled from user's ECR, this matters when the setting is once.

assert.Equal(t, dockerapi.DockerContainerMetadata{}, metadata, "expected empty metadata")
}

func TestPullImageWithOnceAgentPullBehavior(t *testing.T) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you also add a test where config is once and PullSucceeded is false?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK

@haikuoliu
Copy link
Contributor Author

Update: address the comments. Introduce a new behavior "prefer_cached". Will update README, add comments and tests for the new behavior and remove "never" behavior due to the change of the requirement.

@haikuoliu
Copy link
Contributor Author

Update: Update README, add comments and tests for "prefer_cached" behavior and remove "never" behavior due to the change of the requirement. Also do a rebase.

}

func TestParseImagePullBehavior(t *testing.T) {
testcases := []struct {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@@ -117,6 +117,22 @@ func parseNumImagesToDeletePerCycle() int {
return numImagesToDeletePerCycle
}

func parseImagePullBehavior() ImagePullBehaviorType {
ImagePullBehaviorString := os.Getenv("ECS_IMAGE_PULL_BEHAVIOR")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we .ToLowerCase() the string?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we use trimWhitespace to format every env var string and we should not do extra work outside of this method:

func (cfg *Config) trimWhitespace() {

@@ -20,6 +20,10 @@ import (
cnitypes "github.com/containernetworking/cni/pkg/types"
)

// ImagePullBehaviorType is an enum variable type corresponding to different agent pull
// behaviors including default, always, never and once.
type ImagePullBehaviorType int8
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why an int8 over an int?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because it's just an enum variable, which doesn't require a large range, there is usage of int8 elsewhere in agent:

// If the image pull behavior is prefer cached, don't clean up the image,
// because the cached image is needed.
if imageManager.imagePullBehavior != config.ImagePullPreferCachedBehavior {
// passing the cleanup interval as argument which would help during testing
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[do this!]
log that you are disabling this.

[nit]
I think I'd prefer positive comparison instead of !=.

if imageManager.imagePullBehavior == config.ImagePullPreferCachedBehavior {
    seelog.Info("Pull behavior is set to always use cache. Disabling cleanup")
    return
}
imageManager.performPeriodicImageCleanup(ctx, imageManager.imageCleanupTimeInterval) 

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, sure.

case config.ImagePullPreferCachedBehavior:
// If the behavior is prefer_cached, don't pull if we found cached image
// by inspecting the image.
_, err := engine.client.InspectImage(container.Image)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this need a timeout?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, the docker client API does not need a timeout for inspecting an image.

Copy link
Contributor

@aaithal aaithal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please remove this from the 1.17.4 milestone? We can sync offline about reasons to do so.

return ImagePullAlwaysBehavior
case "once":
return ImagePullOnceBehavior
case "prefer_cached":
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please change this to preferCached.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found there are Example values: ["awslogs","fluentd","gelf","json-file","journald","splunk"] and Example values: {"custom attribute": "custom_attribute_value"} for a var with multiple words from official docs, but I did not find a camelCase one, I will double check with Joel about this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those are examples. These are key words that we're defining. All of our APIs are camlCased and don't have underscroes

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For this, I don't think the APIs matter because this is a value rather than a parameter. My opinion is to use "prefer-cached".

func (engine *DockerTaskEngine) shouldUseLocalImage(ImagePullBehavior config.ImagePullBehaviorType,
container *api.Container,
taskArn string) bool {
switch ImagePullBehavior {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

function param names should never begin with upper case letters. For a second, I got freaked out that you were comparing a const here. Please modify that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK

// otherwise pull the image as usual, regardless whether the image exists or not
// (the image can be prepopulated with the AMI and never be pulled).
imageState := engine.imageManager.GetImageStateFromImageName(container.Image)
if imageState != nil && imageState.GetPullSucceeded() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nil check for existence of something in a map somewhere isn't a good paradigm. Please change the return type of GetImageStateFromImageName to be (ImageState, bool) instead of just *ImageState and use the bool here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK.

// shouldUseLocalImage returns true if local image cache should be used, or return false to continue
// pulling image, by inspecting the agent pull behavior variable defined in config. The caller has
// to make sure the container passed in is not an internal container.
func (engine *DockerTaskEngine) shouldUseLocalImage(ImagePullBehavior config.ImagePullBehaviorType,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The selective action that a caller needs to perform when calling this method is to pull the image if this returns a negative value. Essentially !if shouldUseLocalImage() { // pull.. }. It's much more intuitive if we operated in the reverse manner. Instead of shouldUseLocalImage, please make this imagePullRequired so that you can do if imagePullRequired() { // pull.. }

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok

}
imageState := engine.imageManager.GetImageStateFromImageName(container.Image)
if imageState != nil && pullSucceeded {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here too, as per above comment, please use the boolean returned from GetImageStateFromImageName to do this

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK

// PullSucceeded defines whether this image has been pulled successfully before,
// this should be set to true when one of the pull image call succeeds.
PullSucceeded bool
updateLock sync.RWMutex
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please rename this to just lock.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

const (
// ImagePullDefaultBehavior specifies the behavior that if an image pull API call fails,
// agent tries to start from the Docker image cache anyway, assuming that the image has not changed.
ImagePullDefaultBehavior ImagePullBehaviorType = iota
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be just ImagePullDefault

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think ImagePullDefault is more general, it describes the ImagePull but does not specify which aspect it's describing. I wouldn't know if it's ImagePull default timeout, or ImagePull default behavior, etc. So ImagePullDefaultBehavior makes more sense to me. Let me know if you have more comments on this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ImagePullDefault would be of the type ImagePullBehaviorType which would convey the meaning. Behavior suffix can be removed. Also the type can be just ImagePullType
something like:

ImagePullDefault ImagePullType = iota
ImagePullAlways

and the environment variable can be ECS_IMAGE_PULL_TYPE. But I would suggest consulting Joel for this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah my concern is:
If I name it ImagePullDefault, although it's of ImagePullBehaviorType , but if some day in the future, we have something like ImagePullTimeoutType (just for example) as an enum type, and it also has a default value, so it should be named as ImagePullDefault to be consistent, which conflicts what we have now, does this make sense to you?


// ImagePullAlwaysBehavior specifies the behavior that if an image pull API call fails,
// the task fails instead of using cached image.
ImagePullAlwaysBehavior
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be just ImagePullAlways

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comment above.

// ImagePullOnceBehavior specifies the behavior that agent will only attempt to pull
// the same image once, once an image is pulled, local image cache will be used
// for all the containers.
ImagePullOnceBehavior
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be just ImagePullOnce

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comment above.


// ImagePullPreferCachedBehavior specifies the behavior that agent will only attempt to pull
// the image if there is no cached image.
ImagePullPreferCachedBehavior
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be just ImagePullPreferCached

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comment above.

@haikuoliu haikuoliu removed this from the 1.17.4 milestone Apr 27, 2018
@@ -155,6 +156,9 @@ func (imageManager *dockerImageManager) addContainerReferenceToNewImageState(con
Image: sourceImage,
PulledAt: time.Now(),
LastUsedAt: time.Now(),
// The PullSucceeded filed is false by default,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: field

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK

@@ -155,6 +156,9 @@ func (imageManager *dockerImageManager) addContainerReferenceToNewImageState(con
Image: sourceImage,
PulledAt: time.Now(),
LastUsedAt: time.Now(),
// The PullSucceeded filed is false by default,
// one has to explicitly set it to be true when the pull image succeeds.
PullSucceeded: false,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bool is false by default. We need not explicitly set it here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK

// don't want to use cached image for both cases.
if mtask.cfg.ImagePullBehavior == config.ImagePullAlwaysBehavior ||
mtask.cfg.ImagePullBehavior == config.ImagePullOnceBehavior {
seelog.Criticalf("Managed task [%s]: Error while pulling container %s and image %s, task will fail: %v",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be Errorf and the message can be Managed task [%s]: Error while pulling image %s for container %s , moving task to STOPPED:

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK

@haikuoliu
Copy link
Contributor Author

Update: rebase and refactor to address the comments. Currently I am not sure about when this feature should be released, so I keep it 1.17.4 in changelog, will update the changelog before merged.

README.md Outdated
@@ -169,6 +169,7 @@ additional details on each available environment variable.
| `ECS_IMAGE_CLEANUP_INTERVAL` | 30m | The time interval between automated image cleanup cycles. If set to less than 10 minutes, the value is ignored. | 30m | 30m |
| `ECS_IMAGE_MINIMUM_CLEANUP_AGE` | 30m | The minimum time interval between when an image is pulled and when it can be considered for automated image cleanup. | 1h | 1h |
| `ECS_NUM_IMAGES_DELETE_PER_CYCLE` | 5 | The maximum number of images to delete in a single automated image cleanup cycle. If set to less than 1, the value is ignored. | 5 | 5 |
| `ECS_IMAGE_PULL_BEHAVIOR` | <default | always | once | prefer-cached > | The behavior used to customize the pull image process. If `default` is specified, the image will be pulled remotely, if the pull fails then the cached image will be used. If `always` is specified, the image will be pulled remotely, if the pull fails then the task will fail. If `once` is specified, the image will be pulled remotely if it has not been pulled before, otherwise the cached image will be used. If `prefer-cached` is specified, the image will be pulled remotely if there is no cached image, otherwise the cached image will be used. | default | default |

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this reviewed by @nrdlngr? Also for once it should be: the image will be pulled remotely if it has not been pulled or it has been removed by image cleanup.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The README and env vars is reviewed by @joelbrandenburg. I think the image cleanup related stuff should be reflected in the tech doc, I will let @joelbrandenburg know about this.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should also be reflected here, otherwise the description here is not correct and could cause confusion.

// If the image pull behavior is prefer cached, don't clean up the image,
// because the cached image is needed.
if imageManager.imagePullBehavior == config.ImagePullPreferCachedBehavior {
seelog.Info("Pull behavior is set to always use cache. Disabling cleanup")

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we point this out in the readme file?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I already told @joelbrandenburg about this, this should be also included in tech doc I think.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it might be valuable to add a link to the docs in README.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will ask @joelbrandenburg if it's ok to have a link in readme. I will attach the link in readme if he has the link ready before I merge, otherwise I will attach it some time after I merge.

} else {
t.Logf("Found image state for %s", test1Image2Name)
t.Fatalf("Could not find image state for %s", test1Image1Name)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since you are here, can you change these to use assert or require?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

return engine.serialPull(task, container)
if engine.imagePullRequired(engine.cfg.ImagePullBehavior, container, task.Arn) {
// Record the pullStoppedAt timestamp
defer func() {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case what's the value of task PullStartedAt and pullStoppedAt?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't quite get the question, the logic in imagePullRequired is to pull image, instead of using local cache, hence recording PullStartedAt and pullStoppedAt is necessary. (previously it's shouldUseLocalImage, and this is changed to imagePullRequired as per @aaithal's comments).

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So if engine.imagePullRequired is false, then the image pullstartedAt and pullStoppedAt won't be set, right? Can you run a task in this scenario and check the pullstoppedAt and pullstartedat from describetask API that it isn't set to something unexpected?

Copy link
Contributor Author

@haikuoliu haikuoliu May 2, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah I see what you mean, I saw the comment here:

// it won't be set if the pull never happens

And I test the describe task API (which you have seen), it works as expected, the pull started at and pull stopped at will not show up.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the pull started at and pull stopped at will now show up.

If image is not pulled, these values shouldn't show up right?

Copy link
Contributor Author

@haikuoliu haikuoliu May 2, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah it's not show up, typed wrong word.


// GetPullSucceeded safely returns the PullSucceeded of the imageState
func (imageState *ImageState) GetPullSucceeded() bool {
imageState.lock.Lock()

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be a read lock.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK

@haikuoliu
Copy link
Contributor Author

haikuoliu commented May 2, 2018

Update: Address @richardpen 's comments, and squash the commits.

Copy link
Contributor

@sharanyad sharanyad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

almost there!

README.md Outdated
@@ -169,6 +169,7 @@ additional details on each available environment variable.
| `ECS_IMAGE_CLEANUP_INTERVAL` | 30m | The time interval between automated image cleanup cycles. If set to less than 10 minutes, the value is ignored. | 30m | 30m |
| `ECS_IMAGE_MINIMUM_CLEANUP_AGE` | 30m | The minimum time interval between when an image is pulled and when it can be considered for automated image cleanup. | 1h | 1h |
| `ECS_NUM_IMAGES_DELETE_PER_CYCLE` | 5 | The maximum number of images to delete in a single automated image cleanup cycle. If set to less than 1, the value is ignored. | 5 | 5 |
| `ECS_IMAGE_PULL_BEHAVIOR` | <default | always | once | prefer-cached > | The behavior used to customize the pull image process. If `default` is specified, the image will be pulled remotely, if the pull fails then the cached image will be used. If `always` is specified, the image will be pulled remotely, if the pull fails then the task will fail. If `once` is specified, the image will be pulled remotely if it has not been pulled before, otherwise the cached image will be used. If `prefer-cached` is specified, the image will be pulled remotely if there is no cached image, otherwise the cached image will be used. | default | default |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be good to specify that the cached image in the instance will be used.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK

// If the image pull behavior is prefer cached, don't clean up the image,
// because the cached image is needed.
if imageManager.imagePullBehavior == config.ImagePullPreferCachedBehavior {
seelog.Info("Pull behavior is set to always use cache. Disabling cleanup")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it might be valuable to add a link to the docs in README.

return engine.serialPull(task, container)
if engine.imagePullRequired(engine.cfg.ImagePullBehavior, container, task.Arn) {
// Record the pullStoppedAt timestamp
defer func() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the pull started at and pull stopped at will now show up.

If image is not pulled, these values shouldn't show up right?

// don't want to use cached image for both cases.
if mtask.cfg.ImagePullBehavior == config.ImagePullAlwaysBehavior ||
mtask.cfg.ImagePullBehavior == config.ImagePullOnceBehavior {
seelog.Errorf("Managed task [%s]: Error while pulling image %s for container %s , moving task to STOPPED: %v",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Managed task [%s]: error while
same for below

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK

CHANGELOG.md Outdated
@@ -1,6 +1,7 @@
# Changelog

## 1.17.4-dev
* Feature - Configurable agent pull behavior [#1348](https://github.com/aws/amazon-ecs-agent/pull/1348)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please change this to Configurable container image pull behavior

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK

Copy link

@richardpen richardpen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor comments, otherwise lgtm. Also please make sure all these behaviors are manually tested.

README.md Outdated
@@ -169,6 +169,7 @@ additional details on each available environment variable.
| `ECS_IMAGE_CLEANUP_INTERVAL` | 30m | The time interval between automated image cleanup cycles. If set to less than 10 minutes, the value is ignored. | 30m | 30m |
| `ECS_IMAGE_MINIMUM_CLEANUP_AGE` | 30m | The minimum time interval between when an image is pulled and when it can be considered for automated image cleanup. | 1h | 1h |
| `ECS_NUM_IMAGES_DELETE_PER_CYCLE` | 5 | The maximum number of images to delete in a single automated image cleanup cycle. If set to less than 1, the value is ignored. | 5 | 5 |
| `ECS_IMAGE_PULL_BEHAVIOR` | <default | always | once | prefer-cached > | The behavior used to customize the pull image process. If `default` is specified, the image will be pulled remotely, if the pull fails then the cached image will be used. If `always` is specified, the image will be pulled remotely, if the pull fails then the task will fail. If `once` is specified, the image will be pulled remotely if it has not been pulled before, otherwise the cached image will be used. If `prefer-cached` is specified, the image will be pulled remotely if there is no cached image, otherwise the cached image will be used. | default | default |

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should also be reflected here, otherwise the description here is not correct and could cause confusion.

@haikuoliu
Copy link
Contributor Author

Update: address @richardpen and @sharanyad 's comments.

@@ -155,6 +156,8 @@ func (imageManager *dockerImageManager) addContainerReferenceToNewImageState(con
Image: sourceImage,
PulledAt: time.Now(),
LastUsedAt: time.Now(),
// PullSucceeded should be set to true when the pull image succeeds.
PullSucceeded: false,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bool value is false by default. This can be removed.

mtask.SetDesiredStatus(api.TaskStopped)
return false
}
// If the agent pull behavior is prefer_cached, we receive the error because
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: prefer-cached

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants