Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

*: Add ignore_changes to lifecycle block for OS images #126

Closed
wants to merge 1 commit into from

Conversation

justaugustus
Copy link
Contributor

@justaugustus justaugustus commented Feb 14, 2018

Configures TF lifecycle block to ignore changes to the OS image on each platform.
Should fix the problem referenced here: #76 (comment)

This wasn't extensively tested, outside of the fact that I've used the lifecycle block and ignore_changes before in other projects.

@dghubble let me know if this looks sane to you.

@dghubble
Copy link
Member

dghubble commented Feb 14, 2018

I think this is only needed on AWS, but I'm not sure yet. I don't want to add sections that aren't fully understood.

AWS

I suspect it is needed on AWS (alluded to in https://github.com/poseidon/terraform-render-bootkube/issues/37#issuecomment-358354328) because the AMI is always dynamically resolved in Terraform. A user chooses a channel (such as the default "stable"), deploys a cluster, retains the cluster until a new CL release occurs, and re-runs terraform apply. At that point Terraform re-evaluates the mapping from channel to actual AMI id, detects a change, and decides the instances should be replaced. This is undesired since controller nodes should not be replaced (destroys etcd data), plus node upgrades can already be handled by the CLUO addon.

"ami-65269d1d" => "ami-fc01ba84"

I've left an AWS cluster running with CL alpha to be able to verify this buggy behavior and test just the AMI fix.

Others

On Digital Ocean, images like "coreos-stable" are created via the DO API (no dynamic resolution) so I've yet to see a case where DO will destroy existing nodes upon new images becoming available in a channel.

On Google Cloud, docs currently recommend choosing exact images (gcloud compute images list | grep coreos) which avoids any issues. Its possible to specify "coreos-stable" to the GCP API for use in the controller image and worker instance template. I'm testing this out. I've yet to see GCP destroy existing nodes upon new images becoming available in a channel, but will report if its safe to switch from exact images to "coreos-stable", etc.

I suppose there is a case we could ignore every field that can't safely be changed in place. But that's rather a lot of fields (most of them).

@dghubble
Copy link
Member

There is indeed a bug on AWS (regression, since old controller ASGs had the ignore_changes). The AMI id changes since the channel is evaluated client-side. I've taken just the AWS controller fix in c831375.

I'm not able to demonstrate this fixes the issue because going from a hash without the ignore_change to a hash with the ignore_changes after the AMI change still shows a diff that causes controller replacement. I suspect ignore_changes has to be initially present. I've deployed a fresh test cluster to catch the next alpha release cycle to verify.

@dghubble dghubble closed this Feb 16, 2018
@dghubble
Copy link
Member

Merged c831375

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants