
Rolling-apply to instances, or individual instance application #2896

Closed
mwitkow opened this issue Jul 30, 2015 · 24 comments

Comments

@mwitkow commented Jul 30, 2015

I've been experimenting with applying changes to individual instances in Terraform, à la:

$ ~/bin/terraform plan -module-depth=2 -target=module.services.google_compute_instance.frontend_production_instances.3 -out increase.plan
There are warnings and/or errors related to your configuration. Please
fix these before continuing.

Errors:

  * Unexpected value for InstanceType field: "3"

The goal would be to implement a rolling update/reprovision for certain types of changes (e.g. changing machine size).

Is that something you are planning, or would you possibly accept patches for?

@mwitkow (Author) commented Aug 4, 2015

Actually, it is entirely possible to apply to a single instance. Unfortunately, the address syntax that -target expects is different from the addresses output by plan.

For example:

plan outputs:
module.eu1-staging.google_compute_instance.core_instances.1

however, for -target you need to pass
module.eu1-staging.google_compute_instance.core_instances[1]
or
module.eu1-staging.google_compute_instance.core_instances.primary[1]
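
For example, a single-instance change could then be planned and applied like this (the plan file name is illustrative):

$ terraform plan -target='module.eu1-staging.google_compute_instance.core_instances[1]' -out=core-1.plan
$ terraform apply core-1.plan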

Still not sure whether Terraform could limit itself to applying changes to at most X nodes of a given type in parallel.

@mwitkow (Author) commented Aug 4, 2015

It seems that Terraform's Context has support for a Parallelism option.

par := opts.Parallelism

Is this settable somehow on the command line? @mitchellh

@phinze (Contributor) commented Oct 12, 2015

Hi @mwitkow-io,

Parallelism is now exposed via the UI in #3365
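
For reference, it is a plain command-line flag; the value shown here is illustrative:

$ terraform apply -parallelism=1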

We'd definitely like to explore support for rolling update behavior - there's some related conversation in #1552

@mwitkow (Author) commented Oct 24, 2015

@phinze thanks for getting back to me. The use case we have is a rolling update of the instances that host an etcd cluster.

The problem with parallelism is that it first destroys all the instances (granted, 1 at a time), and only then creates them again (again, 1 at a time). This means that no sufficient etcd quorum is left :(

What we expected parallelism to do was apply complete (down+up) changes to the resources of a given target.

In the meantime we wrote a wrapper around terraform that reads the resource paths from the plan and invokes terraform per resource with the explicit primary[id]. However, it's flimsy and we'd really prefer a more streamlined solution.
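
A rough sketch of that kind of wrapper, assuming a fixed resource address and instance count (both illustrative), looks like this:

#!/usr/bin/env bash
# Hypothetical wrapper: plan and apply one instance index at a time.
set -euo pipefail

RESOURCE="module.eu1-staging.google_compute_instance.core_instances"  # illustrative address
COUNT=3                                                               # illustrative instance count

for i in $(seq 0 $((COUNT - 1))); do
  terraform plan -target="${RESOURCE}[${i}]" -out="instance-${i}.plan"
  terraform apply "instance-${i}.plan"
done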

The discussion in #1552 seems to be purely around AWS autoscaling, and it isn't clear whether it'd be applicable to groups of normal instances (in our case GCE). Could you clarify?

@pmoust (Contributor) commented Nov 9, 2015

@mwitkow I am in the same boat. I resorted to building plumbing around our etcd cluster to ensure that quorum is always met.
Perhaps off-topic, but try to decouple your etcd cluster from your 'application' instances. It will help you maintain sanity.
My 2c.

@apparentlymart (Contributor)

@mwitkow if you mark your instances as create_before_destroy, does setting the parallelism work better then? Still just a workaround of course, but maybe it helps.
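
A minimal sketch of that lifecycle setting (the resource name and count are illustrative):

resource "google_compute_instance" "frontend" {
  count = 3
  # ... machine_type, disk, network_interface, etc.

  lifecycle {
    create_before_destroy = true
  }
}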

For a similar problem (rolling upgrades to a Consul cluster) I've sadly been just creating an entirely new Terraform resource alongside the old, applying to create both of them, manually fixing up the cluster, and then removing the old ones from the Terraform config. It's clunky to have to work in multiple steps like that, so I'd love a better solution.

@mwitkow (Author) commented Nov 10, 2015

@pmoust thanks for the tip. It would have been viable if etcd were the only thing that needed to survive a rolling restart. We also have API servers and monitoring instances that require at least one instance to be up during an update.

@apparentlymart, we'll take a look as to whether create_before_destroy helps.

@phinze, is a rolling update mechanism something that is considered a feature for Terraform?

@phinze (Contributor) commented Nov 25, 2015

@mwitkow we'd definitely love to figure out some sort of story for rolling updates in Terraform; it's just a matter of figuring out how to model the feature in a clean way that preserves our declarative model. That part is anything but straightforward.

Tagging this thread as "thinking", since it's definitely something we think and talk about on a regular basis; the path is simply unclear.

@discordianfish

@mwitkow Even if it were possible to create a new instance and then delete the old one, you would still need to make sure the cluster is in sync before terminating the old instance and proceeding to the next one.
On AWS all of this is supported (I blogged about it), but I wanted to give GCP a spin and am now also wondering how to do that there. Apparently the deployment-manager doesn't support it, and apparently terraform can't help either.

@phinze (Contributor) commented Dec 4, 2015

Nice blog post @discordianfish!

Any rolling update facility we bake into Terraform wouldn't be ASG specific, so I'm going to merge this thread back down with #1552 and shift that issue to cover the general concept of rolling applies. 👍

@phinze closed this as completed Dec 4, 2015
@phinze (Contributor) commented Dec 4, 2015

Ah, I just reviewed that issue once more and realized that because we'd need to interact with the instances inside the ASG, it might actually be reasonable to expect an ASG-specific feature there rather than one generic feature. So we'll use this thread to track "generic rolling updates to resources with count > 1" and leave that issue to the ASG story.

@phinze reopened this Dec 4, 2015
@discordianfish

Just looked into this again, and what makes most sense to me (although I'm not very experienced with terraform, so let me know if this doesn't make sense) would be to create a terraform-side success condition. Every resource could have an attribute which defines a condition, built on top of 'remote-exec', which needs to return true for a resource update to succeed:

resource "TYPE" "NAME" {
  success_on {
    inline = [
      "IP=$(ip addr show dev eth0|awk '/inet /{print $2}'|cut -d/ -f1)",
      "while ! curl -s http://localhost:8500/v1/status/peers | grep -q $IP:; do echo Waiting for consul; sleep 1; done",
    ]
  }
  # ...
}

terraform apply should block until the specified block returns without errors.
If I understand everything correctly, that should solve my use case as well as @mwitkow's.

@discordianfish

Actually, now I'm wondering if we could already use provisioner "remote-exec" for this. Will terraform block while the provisioner is running? And can I make it abort if the provisioner fails? Then we wouldn't need any changes.
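
For reference, the kind of provisioner block being asked about would look roughly like this (the resource name and health check are illustrative; whether a failure here halts the rest of the run is exactly the open question):

resource "google_compute_instance" "core" {
  count = 3
  # ... machine_type, disk, network_interface, etc.

  provisioner "remote-exec" {
    inline = [
      # Wait until the local consul agent reports peers before considering this node done.
      "while ! curl -s http://localhost:8500/v1/status/peers | grep -q .; do echo Waiting for consul; sleep 1; done",
    ]
  }
}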

@sgotti commented Mar 2, 2016

If anyone is interested, I blogged about how we do immutable infrastructure and rolling upgrades (also of stateful services) here. There's also an example of immutable infrastructure with consul and rolling upgrades in this github repository.

It can obviously be improved, but we are quite happy with it. Between instance upgrades we have to do various pre and post actions and tests (for example, to correctly handle #2957), so I'm not sure how all the different needs will fit inside terraform without breaking its declarative model.

@discordianfish

FYI: I've tried to use the remote-exec provisioner and it doesn't work. Even when executed with --parallelism=1 it isn't enough, because terraform doesn't abort if the provisioner fails, so it continues and brings down the whole cluster if something isn't working.

@sgotti Have you seen my blog article? https://5pi.de/2015/04/27/cloudformation-driven-consul-in-autoscalinggroup/

CloudFormation is declarative as well, but it supports a "WaitCondition"[1]. It's pretty straightforward: you tell an ASG it should only take down one instance at a time and wait for a success signal from the replacement instance before it continues with replacing the next instance.

Similarly, an on_success attribute could mean: wait for that shell fragment to return. If it exits non-zero, abort TF. If it exits zero, continue with whatever is next in the TF plan. Seems straightforward to me.
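
As a sketch, such a (hypothetical) on_success attribute might look like this; the resource type, name, and check are all illustrative:

resource "digitalocean_droplet" "consul" {
  count = 3
  # ... image, region, size, etc.

  on_success {
    inline = [
      # Hypothetical: succeed only once this node sees three alive cluster members.
      "test \"$(consul members | grep -c alive)\" -eq 3",
    ]
  }
}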

@phinze: Is this something you would consider having in TF, whoever ends up implementing it?

  1. http://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-properties-waitcondition.html

@sgotti commented Mar 2, 2016

@discordianfish Yeah, I read it when you posted your comment (since I'm subscribed to this issue). Very interesting. We took another road due to different ideas and requirements.
I think everyone wants to use terraform as the global infrastructure orchestrator instead of relying on something on top of it, like we have to do right now. I'm just not sure whether all the use cases can be covered without hooking deeply into its execution plan (or even mutating it based on some conditions).

@discordianfish @phinze In the end, isn't on_success a subset of pre/post tasks at different points of the lifecycle (something like #649, but not only related to provisioners)?

@discordianfish

@sgotti It definitely sounds similar - if it can be made sure that terraform actually aborts when such a hook fails.

@mitchellh (Contributor)

This is a fairly open-ended issue, so I'm going to make some closing remarks and then close it.

I'm supportive of Terraform having some sort of functionality to assist with rolling deploys. In many cases, rolling deploys make sense at a higher layer (such as a scheduler, autoscaling group, etc.). However, some of the examples shown here, such as targeting a specific instance, should be completely possible.

@discordianfish

@mitchellh Not sure I understand your answer: Are you saying that implementing the desired behavior is already possible or that you prefer a more specific discussion?

Specifically, I'm thinking about cloud providers like DigitalOcean which don't support something like ASGs to coordinate rolling upgrades. Without additional wrapper scripts, I don't see a way to do rolling upgrades there (*).

But even if you have a scheduler or autoscaling group to do rolling deploys, you might still want TF to apply changes in a rolling way. That is, of course, only if you actually use TF to update your infrastructure and don't just keep it for disaster recovery and initial setup, which it seems a lot of people are doing.

*) I ended up with a wrapper script that applies changes to each instance individually, but that defeats the purpose of TF's dependency management and rollout planning, as described here already.

@mitchellh (Contributor)

I'm saying that I prefer a more specific feature request/discussion if there is one. We've had the "rolling deploy" conversation a lot over the past couple of years, and the vast majority of the time the answer is: use a scheduler.

However, I'm also open to partial implementations in Terraform: things like -target help with this.

But it may be the case that scripting Terraform is realistically the best approach. Terraform's goal is: current state to desired state. As long as your desired state is the next step in a rolling deploy, Terraform can already do it (with scripts above Terraform orchestrating the move of the desired state to the next step).

@tad-lispy

@mitchellh What about rolling updates to schedulers, like Nomad servers?

I think the examples of Consul or Etcd (mentioned above) are also pretty compelling. If I run Nomad backed by Consul, what scheduler should I run it on? It feels like my stack is high enough already 🤕

Quick idea: maybe a parallelism option could be added to the lifecycle stanza to control how many instances of a given resource are destroyed/created at once? Then for resources with count > 1 we could use provisioners (including destroy-time provisioners) to make sure the update goes smoothly. This could give us zero-downtime upgrades of infra, even with stateful services like Consul or Nomad.
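
A sketch of that (hypothetical) lifecycle option, combined with a destroy-time provisioner doing a graceful drain; the resource, count, and commands are illustrative:

resource "google_compute_instance" "nomad_server" {
  count = 3
  # ... machine_type, disk, network_interface, etc.

  lifecycle {
    create_before_destroy = true
    parallelism           = 1   # hypothetical attribute, as proposed above
  }

  provisioner "remote-exec" {
    when = "destroy"
    inline = [
      # Drain this node before Terraform replaces it.
      "nomad node drain -self -enable",
    ]
  }
}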

@eliaoggian

In my case I am trying to achieve a rolling upgrade of the hosts when a new template has to be used to create the VMs. The problem is exactly what @mwitkow described in his Oct 2015 comment above.

Is there any update on this?
Thanks

@discordianfish

Just to provide some context: this is one of the reasons why I still much prefer the cloud provider's native declarative infrastructure tooling (e.g. CloudFormation). I know it's not trivial to get right, but for me the lack of control over the instance lifecycle during updates is the reason I have only ever used TF partially.

@ghost commented Apr 4, 2020

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@ghost locked and limited conversation to collaborators Apr 4, 2020