*: Add `ignore_changes` to lifecycle block for OS images #126

justaugustus · 2018-02-14T12:54:25Z

Configures TF lifecycle block to ignore changes to the OS image on each platform.
Should fix the problem referenced here: #76 (comment)

This wasn't extensively tested, outside of the fact that I've used the lifecycle block and ignore_changes before in other projects.

@dghubble let me know if this looks sane to you.

dghubble · 2018-02-14T19:48:28Z

I think this is only needed on AWS, but I'm not sure yet. I don't want to add sections that aren't fully understood.

AWS

I suspect it is needed on AWS (alluded to in https://github.com/poseidon/terraform-render-bootkube/issues/37#issuecomment-358354328) because the AMI is always dynamically resolved in Terraform. A user chooses a channel (such as the default "stable"), deploys a cluster, retains the cluster until a new CL release occurs, and re-runs terraform apply. At that point Terraform re-evaluates the mapping from channel to actual AMI id, detects a change, and decides the instances should be replaced. This is undesired since controller nodes should not be replaced (destroys etcd data), plus node upgrades can already be handled by the CLUO addon.

"ami-65269d1d" => "ami-fc01ba84"

I've left an AWS cluster running with CL alpha to be able to verify this buggy behavior and test just the AMI fix.

Others

On Digital Ocean, images like "coreos-stable" are created via the DO API (no dynamic resolution) so I've yet to see a case where DO will destroy existing nodes upon new images becoming available in a channel.

On Google Cloud, docs currently recommend choosing exact images (gcloud compute images list | grep coreos) which avoids any issues. Its possible to specify "coreos-stable" to the GCP API for use in the controller image and worker instance template. I'm testing this out. I've yet to see GCP destroy existing nodes upon new images becoming available in a channel, but will report if its safe to switch from exact images to "coreos-stable", etc.

I suppose there is a case we could ignore every field that can't safely be changed in place. But that's rather a lot of fields (most of them).

dghubble · 2018-02-16T04:22:51Z

There is indeed a bug on AWS (regression, since old controller ASGs had the ignore_changes). The AMI id changes since the channel is evaluated client-side. I've taken just the AWS controller fix in c831375.

I'm not able to demonstrate this fixes the issue because going from a hash without the ignore_change to a hash with the ignore_changes after the AMI change still shows a diff that causes controller replacement. I suspect ignore_changes has to be initially present. I've deployed a fresh test cluster to catch the next alpha release cycle to verify.

dghubble · 2018-02-16T04:23:20Z

Merged c831375

*: Add ignore_changes to lifecycle block for OS images

4f9b3f6

dghubble closed this Feb 16, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

*: Add `ignore_changes` to lifecycle block for OS images #126

*: Add `ignore_changes` to lifecycle block for OS images #126

justaugustus commented Feb 14, 2018 •

edited

Loading

dghubble commented Feb 14, 2018 •

edited

Loading

dghubble commented Feb 16, 2018

dghubble commented Feb 16, 2018

*: Add ignore_changes to lifecycle block for OS images #126

*: Add ignore_changes to lifecycle block for OS images #126

Conversation

justaugustus commented Feb 14, 2018 • edited Loading

dghubble commented Feb 14, 2018 • edited Loading

AWS

Others

dghubble commented Feb 16, 2018

dghubble commented Feb 16, 2018

*: Add `ignore_changes` to lifecycle block for OS images #126

*: Add `ignore_changes` to lifecycle block for OS images #126

justaugustus commented Feb 14, 2018 •

edited

Loading

dghubble commented Feb 14, 2018 •

edited

Loading