
Rolling-apply to instances, or individual instance application #2896

Closed
mwitkow opened this issue Jul 30, 2015 · 24 comments

Comments

@mwitkow commented Jul 30, 2015

I've been experimenting with applying changes to individual instances in Terraform, à la:

$ ~/bin/terraform plan -module-depth=2 -target=module.services.google_compute_instance.frontend_production_instances.3 -out increase.plan
There are warnings and/or errors related to your configuration. Please
fix these before continuing.

Errors:

  * Unexpected value for InstanceType field: "3"

The goal would be to implement a rolling update/reprovision for certain types of changes (e.g. changing machine size).

Is that something you are planning, or would you possibly accept patches for?

@mwitkow (Author) commented Aug 4, 2015

Actually, it is entirely possible to apply to a single instance. Unfortunately, the address syntax that -target expects is different from the addresses output by plan.

For example:

plan outputs:
module.eu1-staging.google_compute_instance.core_instances.1

however, for -target you need to pass
module.eu1-staging.google_compute_instance.core_instances[1]
or
module.eu1-staging.google_compute_instance.core_instances.primary[1]
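
For example, a single-instance change could then be planned and applied like this (the plan file name is illustrative):

$ terraform plan -target='module.eu1-staging.google_compute_instance.core_instances[1]' -out=core-1.plan
$ terraform apply core-1.plan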

Still not sure whether Terraform could limit itself to applying changes to at most X nodes of a given type in parallel.

@mwitkow (Author) commented Aug 4, 2015

It seems that Terraform's Context has support for a Parallelism option.

par := opts.Parallelism

Is this settable somehow on the command line? @mitchellh

@phinze (Contributor) commented Oct 12, 2015

Hi @mwitkow-io,

Parallelism is now exposed via the UI in #3365
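
For reference, it is a plain command-line flag; the value shown here is illustrative:

$ terraform apply -parallelism=1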

We'd definitely like to explore support for rolling update behavior - there's some related conversation in #1552

@mwitkow (Author) commented Oct 24, 2015

@phinze thanks for getting back to me. The use case we have is a rolling update of the instances that host an etcd cluster.

The problem with parallelism is that it first destroys all the instances (granted, 1 at a time), and only then creates them again (again, 1 at a time). This means that no sufficient etcd quorum is left :(

What we expected parallelism to do was apply complete (down+up) changes to the resources of a given target.

In the meantime we wrote a wrapper around terraform that reads the resource paths from the plan and invokes terraform per resource with the explicit primary[id]. However, it's flimsy and we'd really prefer a more streamlined solution.
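
A rough sketch of that kind of wrapper, assuming a fixed resource address and instance count (both illustrative), looks like this:

#!/usr/bin/env bash
# Hypothetical wrapper: plan and apply one instance index at a time.
set -euo pipefail

RESOURCE="module.eu1-staging.google_compute_instance.core_instances"  # illustrative address
COUNT=3                                                               # illustrative instance count

for i in $(seq 0 $((COUNT - 1))); do
  terraform plan -target="${RESOURCE}[${i}]" -out="instance-${i}.plan"
  terraform apply "instance-${i}.plan"
done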

The discussion in #1552 seems to be purely around AWS autoscaling, and it isn't clear whether it'd be applicable to groups of normal instances (in our case GCE). Could you clarify?

@pmoust (Contributor) commented Nov 9, 2015

@mwitkow I am in the same boat. I resorted to building plumbing around our etcd cluster to ensure that quorum is always met.
Perhaps off-topic, but try to decouple your etcd cluster from your 'application' instances. It will help you maintain sanity.
My 2c.

@apparentlymart (Contributor)

@mwitkow if you mark your instances as create_before_destroy, does setting the parallelism work better then? Still just a workaround of course, but maybe it helps.
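
A minimal sketch of that lifecycle setting (the resource name and count are illustrative):

resource "google_compute_instance" "frontend" {
  count = 3
  # ... machine_type, disk, network_interface, etc.

  lifecycle {
    create_before_destroy = true
  }
}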

For a similar problem (rolling upgrades to a Consul cluster) I've sadly been just creating an entirely new Terraform resource alongside the old, applying to create both of them, manually fixing up the cluster, and then removing the old ones from the Terraform config. It's clunky to have to work in multiple steps like that, so I'd love a better solution.

@mwitkow (Author) commented Nov 10, 2015

@pmoust thanks for the tip. It would have been viable if etcd were the only thing that needed to survive a rolling restart. We also have API servers and monitoring instances that require at least one instance to be up during an update.

@apparentlymart, we'll take a look as to whether create_before_destroy helps.

@phinze, is a rolling update mechanism something that is considered a feature for Terraform?

@phinze (Contributor) commented Nov 25, 2015

@mwitkow we'd definitely love to figure out some sort of story for rolling updates in Terraform; it's just a matter of figuring out how to model the feature in a clean way that preserves our declarative model. That part is anything but straightforward.

Tagging this thread as "thinking", since it's definitely something we think and talk about on a regular basis; the path is simply unclear.

@discordianfish

@mwitkow Even if it were possible to create a new instance and then delete the old one, you would still need to make sure the cluster is in sync before terminating the old instance and proceeding to the next one.
On AWS all of this is supported (I blogged about it), but I wanted to give GCP a spin and am now also wondering how to do that there. Apparently the deployment-manager doesn't support it, and apparently terraform can't help either.

@phinze (Contributor) commented Dec 4, 2015

Nice blog post @discordianfish!

Any rolling update facility we bake into Terraform wouldn't be ASG specific, so I'm going to merge this thread back down with #1552 and shift that issue to cover the general concept of rolling applies. 👍

@phinze closed this as completed Dec 4, 2015
@phinze (Contributor) commented Dec 4, 2015

Ah, I just reviewed that issue once more and realized that because we'd need to interact with the instances inside the ASG, it might actually be reasonable to expect an ASG-specific feature there rather than one generic feature. So we'll use this thread to track "generic rolling updates to resources with count > 1" and leave that issue to the ASG story.

@phinze reopened this Dec 4, 2015
@discordianfish

Just looked into this again, and what makes most sense to me (although I'm not very experienced with terraform, so let me know if this doesn't make sense) would be to create a terraform-side success condition. Every resource could have an attribute which defines a condition, built on top of 'remote-exec', which needs to return true for a resource update to succeed:

resource "TYPE" "NAME" {
  success_on {
    inline = [
      "IP=$(ip addr show dev eth0|awk '/inet /{print $2}'|cut -d/ -f1)",
      "while ! curl -s http://localhost:8500/v1/status/peers | grep -q $IP:; do echo Waiting for consul; sleep 1; done",
    ]
  }
  # ...
}

terraform apply should block until the specified block returns without errors.
If I understand everything correctly, that should solve my use case as well as @mwitkow's.

@discordianfish

Actually, now I'm wondering if we could already use provisioner "remote-exec" for this. Will terraform block while the provisioner is running? And can I make it abort if the provisioner fails? Then we wouldn't need any changes.
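
For reference, the kind of provisioner block being asked about would look roughly like this (the resource name and health check are illustrative; whether a failure here halts the rest of the run is exactly the open question):

resource "google_compute_instance" "core" {
  count = 3
  # ... machine_type, disk, network_interface, etc.

  provisioner "remote-exec" {
    inline = [
      # Wait until the local consul agent reports peers before considering this node done.
      "while ! curl -s http://localhost:8500/v1/status/peers | grep -q .; do echo Waiting for consul; sleep 1; done",
    ]
  }
}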

@sgotti commented Mar 2, 2016

If anyone is interested, I blogged about how we do immutable infrastructure and rolling upgrades (also of stateful services) here. There's also an example of immutable infrastructure with consul and rolling upgrades in this github repository.

It can obviously be improved, but we are quite happy with it. Between instance upgrades we have to do various pre and post actions and tests (for example, to correctly handle #2957), so I'm not sure how all the different needs will fit inside terraform without breaking its declarative model.

@discordianfish

FYI: I've tried to use the remote-exec provisioner and it doesn't work. Even when executed with --parallelism=1 it isn't enough, because terraform doesn't abort if the provisioner fails, so it continues and brings down the whole cluster if something isn't working.

@sgotti Have you seen my blog article? https://5pi.de/2015/04/27/cloudformation-driven-consul-in-autoscalinggroup/

CloudFormation is declarative as well, but it supports a "WaitCondition"[1]. It's pretty straightforward: you tell an ASG it should only take down one instance at a time and wait for a success signal from the replacement instance before it continues with replacing the next instance.

Similarly, an on_success attribute could mean: wait for that shell fragment to return. If it exits non-zero, abort TF. If it exits zero, continue with whatever is next in the TF plan. Seems straightforward to me.
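
As a sketch, such a (hypothetical) on_success attribute might look like this; the resource type, name, and check are all illustrative:

resource "digitalocean_droplet" "consul" {
  count = 3
  # ... image, region, size, etc.

  on_success {
    inline = [
      # Hypothetical: succeed only once this node sees three alive cluster members.
      "test \"$(consul members | grep -c alive)\" -eq 3",
    ]
  }
}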

@phinze: Is this something you would consider having in TF, whoever ends up implementing it?

  1. http://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-properties-waitcondition.html

@sgotti commented Mar 2, 2016

@discordianfish Yeah, I read it when you posted your comment (since I'm subscribed to this issue). Very interesting. We took another road due to different ideas and requirements.
I think everyone wants to use terraform as the global infrastructure orchestrator instead of relying on something on top of it, like we have to do right now. I'm just not sure whether all the use cases can be covered without hooking deeply into its execution plan (or even mutating it based on some conditions).

@discordianfish @phinze In the end, isn't on_success a subset of pre/post tasks at different points of the lifecycle (something like #649, but not only related to provisioners)?

@discordianfish

@sgotti It definitely sounds similar - if it can be made sure that terraform actually aborts when such a hook fails.

@mitchellh (Contributor)

This is a fairly open-ended issue, so I'm going to make some closing remarks and then close it.

I'm supportive of Terraform having some sort of functionality to assist with rolling deploys. In many cases, rolling deploys make sense at a higher layer (such as a scheduler, autoscaling group, etc.). However, some of the examples shown here, such as targeting a specific instance, should be completely possible.

@discordianfish

@mitchellh Not sure I understand your answer: Are you saying that implementing the desired behavior is already possible or that you prefer a more specific discussion?

Specifically, I'm thinking about cloud providers like DigitalOcean which don't support something like ASGs to coordinate rolling upgrades. Without additional wrapper scripts, I don't see a way to do rolling upgrades there (*).

But even if you have a scheduler or autoscaling group to do rolling deploys, you might still want TF to apply changes in a rolling way. That is, of course, only if you actually use TF to update your infrastructure and don't just keep it for disaster recovery and initial setup, which it seems a lot of people are doing.

*) I ended up with a wrapper script that applies changes to each instance individually, but that defeats the purpose of TF's dependency management and rollout planning, as described here already.

@mitchellh (Contributor)

I'm saying that I prefer a more specific feature request/discussion if there is one. We've had the "rolling deploy" conversation a lot over the past couple of years, and the vast majority of the time the answer is: use a scheduler.

However, I'm also open to partial implementations in Terraform: things like -target help with this.

But it may be the case that scripting Terraform is realistically the best approach. Terraform's goal is: current state to desired state. As long as your desired state is the next step in a rolling deploy, Terraform can already do it (with scripts above Terraform orchestrating the move of the desired state to the next step).

@tad-lispy

@mitchellh What about rolling updates to schedulers, like Nomad servers?

I think the examples of Consul or Etcd (mentioned above) are also pretty compelling. If I run Nomad backed by Consul, what scheduler should I run it on? It feels like my stack is high enough already 🤕

Quick idea: maybe a parallelism option could be added to the lifecycle stanza to control how many instances of a given resource are destroyed/created at once? Then for resources with count > 1 we could use provisioners (including destroy-time provisioners) to make sure the update goes smoothly. This could give us zero-downtime upgrades of infra, even with stateful services like Consul or Nomad.
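
A sketch of that (hypothetical) lifecycle option, combined with a destroy-time provisioner doing a graceful drain; the resource, count, and commands are illustrative:

resource "google_compute_instance" "nomad_server" {
  count = 3
  # ... machine_type, disk, network_interface, etc.

  lifecycle {
    create_before_destroy = true
    parallelism           = 1   # hypothetical attribute, as proposed above
  }

  provisioner "remote-exec" {
    when = "destroy"
    inline = [
      # Drain this node before Terraform replaces it.
      "nomad node drain -self -enable",
    ]
  }
}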

@eliaoggian

In my case I am trying to achieve a rolling upgrade of the hosts when a new template has to be used to create the VMs. The problem is exactly what @mwitkow described in his Oct 2015 comment above.

Is there any update on this?
Thanks

@discordianfish

Just to provide some context: this is one of the reasons why I still much prefer the cloud provider's native declarative infrastructure tooling (e.g. CloudFormation). I know it's not trivial to get right, but for me the lack of control over the instance lifecycle during updates is the reason I have only ever used TF partially.

@ghost commented Apr 4, 2020

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@ghost locked and limited conversation to collaborators Apr 4, 2020