Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

swarm: Add update/rollback order #30261

Merged
merged 2 commits into from
Apr 7, 2017
Merged

Conversation

aaronlehmann
Copy link
Contributor

@aaronlehmann aaronlehmann commented Jan 18, 2017

Update order is a new parameter to control the order of operations when rolling out an updated task. Either the old task is stopped before starting the new one, or the new task is started first, and the running tasks will briefly overlap.

Previously, stop_first was the only supported behavior. The behavior of start_first could be emulated by temporarily scaling up the service, so for awhile we hesitated to add it. However, there are some cases where it is awkward to change the number of replicas in a service just to support temporary overlap during updates. It doesn't work for some new features we have in mind like automatic preemption.

This PR adds Order to the API, and --update-order / --rollback-order flags to the CLI. It adds an integration test that performs a rolling update in the new start_first mode.

cc @aluzzardi @dongluochen

@LK4D4
Copy link
Contributor

LK4D4 commented Jan 27, 2017

Still depends on swarmkit PR

@aaronlehmann
Copy link
Contributor Author

Rebased. This is still waiting for the swarmkit PR, but I've updated it to match a change in the swarmkit PR. --update-rollout was renamed to --update-order.

An additional option called --rollback-order will be needed. Where this is done will depend on whether this PR or #31108 is merged first.

@aaronlehmann
Copy link
Contributor Author

Rebased and updated to support the rollback case. Still waiting for the swarmkit PR.

@aaronlehmann aaronlehmann changed the title [WIP] swarm: Add rollout mode swarm: Add update/rollback order Mar 21, 2017
@aaronlehmann
Copy link
Contributor Author

Swarmkit PR was merged

Updated this one to bring it in line, such as Order replacing RolloutMode.

This should finally be ready to go now.

@thaJeztah
Copy link
Member

Looks like CI is failing @aaronlehmann

@aaronlehmann aaronlehmann added the status/failing-ci Indicates that the PR in its current state fails the test suite label Mar 21, 2017
@aaronlehmann
Copy link
Contributor Author

CI will be fixed when #31714 is merged. Until then, vendoring swarmkit is blocked. Let's try to make that happen.

@aluzzardi
Copy link
Member

nit: Underscores vs dashes. I believe in the CLI we favor dashes (e.g. on-failure, unless-stopped, ...) while this PR is using underscores (start_first).

@aaronlehmann
Copy link
Contributor Author

@aluzzardi: I think you're right about dashes vs. underscores. @thaJeztah can you please confirm?

@thaJeztah
Copy link
Member

Oh, correct, yes. And, actually; the same is used for the API; https://github.com/docker/docker/blob/b36ce6f2f6cfdee4d376b58137dde2bc20fc4312/api/swagger.yaml#L304-L313

@aaronlehmann
Copy link
Contributor Author

Updated to use start-first and stop-first in the API and CLI

@@ -494,6 +496,8 @@ func addServiceFlags(flags *pflag.FlagSet, opts *serviceOptions) {
flags.StringVar(&opts.update.onFailure, flagUpdateFailureAction, "pause", `Action on update failure ("pause"|"continue"|"rollback")`)
flags.Var(&opts.update.maxFailureRatio, flagUpdateMaxFailureRatio, "Failure rate to tolerate during an update")
flags.SetAnnotation(flagUpdateMaxFailureRatio, "version", []string{"1.25"})
flags.StringVar(&opts.update.order, flagUpdateOrder, "start-first", `Update order ("start-first"|"stop-first")`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Default should be stop-first. stop-first is usually safe as tasks would not compete for resources, like CPU/mem/TCP ports, while start-first is not.

https://github.com/docker/swarmkit/blob/master/cmd/swarmctl/service/flagparser/flags.go#L39

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, that was a mistake.

Copy link
Contributor

@dongluochen dongluochen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sgtm

@@ -74,6 +74,7 @@ Options:
--rollback-max-failure-ratio float Failure rate to tolerate during a rollback
--rollback-monitor duration Duration after each task rollback to monitor for failure
(ns|us|ms|s|m|h) (default 0s)
--rollback-order string Rollback order ("start-first"|"stop-first") (default "stop-first")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stating default in service update command is a little misleading. If you don't specify the flag, it keeps existing value.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, however the other --update-* flags have the same problem.

      --update-delay duration              Delay between updates (ns|us|ms|s|m|h) (default 0s)
      --update-failure-action string       Action on update failure ("pause"|"continue"|"rollback") (default "pause")
      --update-monitor duration            Duration after each task update to monitor for failure (ns|us|ms|s|m|h)
      --update-parallelism uint            Maximum number of tasks updated simultaneously (0 to update all at once) (default 1)

Let's deal with this in a separate PR. Would you mind filing an issue?

@@ -175,6 +175,116 @@ func (s *DockerSwarmSuite) TestAPISwarmServicesUpdate(c *check.C) {
map[string]int{image1: instances})
}

func (s *DockerSwarmSuite) TestAPISwarmServicesUpdateStartFirst(c *check.C) {
const nodeCount = 3
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nodeCount is not used here.

@aaronlehmann
Copy link
Contributor Author

I've tried to rebuild rebuild/windowsRS1 several times and it always fails. Is this a known issue?

@aaronlehmann
Copy link
Contributor Author

ping @docker/core-swarmkit-maintainers

Any chance of getting this in before the freeze?

@dperny
Copy link
Contributor

dperny commented Apr 6, 2017

I am not a maintainer, but it LGTM.

@vieux
Copy link
Contributor

vieux commented Apr 7, 2017

LGTM ping @aluzzardi

case types.UpdateOrderStartFirst:
converted.Order = swarmapi.UpdateConfig_START_FIRST
default:
return nil, fmt.Errorf("unrecongized update order %s", updateConfig.Order)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: typo unrecongized

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Fixed.

Copy link
Contributor

@nishanttotla nishanttotla left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM (IANAM)

This parameter controls the order of operations when rolling out an
update task. Either the old task is stopped before starting the new one,
or the new task is started first, and the running tasks will briefly
overlap.

This commit adds Rollout to the API, and --update-order / --rollback-order
flags to the CLI.

Signed-off-by: Aaron Lehmann <aaron.lehmann@docker.com>
Signed-off-by: Aaron Lehmann <aaron.lehmann@docker.com>
Copy link
Member

@cpuguy83 cpuguy83 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.