
instance_group_manager marked tainted if healthcheck failing #9631

Closed
ghost opened this issue Jul 25, 2021 · 23 comments · Fixed by GoogleCloudPlatform/magic-modules#5057, hashicorp/terraform-provider-google-beta#3531 or #9832
Assignees: edwardmedia
Labels: forward/review (In review; remove label to forward), persistent-bug (Hard to diagnose or long lived bugs for which resolutions are more like feature work than bug work), service/compute-managed, size/s

Comments

@ghost

ghost commented Jul 25, 2021

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request.
  • Please do not leave +1 or me too comments, they generate extra noise for issue followers and do not help prioritize the request.
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment.
  • If an issue is assigned to the modular-magician user, it is either in the process of being autogenerated, or is planned to be autogenerated soon. If an issue is assigned to a user, that user is claiming responsibility for the issue. If an issue is assigned to hashibot, a community member has claimed the issue already.

Terraform Version

Terraform v1.0.3

Affected Resource(s)

  • google_compute_region_instance_group_manager
  • google_compute_instance_group_manager

Terraform Configuration Files

I'm deploying a typical MIG, but with wait_for_instances = true:

resource "google_compute_instance_template" "my_app" {
  project = google_compute_subnetwork.primary_region.project
  region  = google_compute_subnetwork.primary_region.region

  name_prefix  = "my-app-"
  machine_type = "n1-standard-1"

  disk {
    boot         = true
    source_image = "cos-cloud/cos-stable"
    disk_type    = "pd-ssd"
    disk_size_gb = 40
  }

  network_interface {
    subnetwork = google_compute_subnetwork.primary_region.self_link
  }

  lifecycle {
    create_before_destroy = true
  }
}

resource "google_compute_health_check" "my_app" {
  project = google_compute_instance_template.my_app.project
  name    = "my-app"

  check_interval_sec  = 10
  timeout_sec         = 5
  unhealthy_threshold = 5

  http_health_check {
    port         = 80
    request_path = "/-/health"
  }
}

resource "google_compute_region_instance_group_manager" "my_app" {
  project = google_compute_instance_template.my_app.project
  region  = google_compute_instance_template.my_app.region

  name               = "my-app"
  base_instance_name = "my-app"

  version {
    instance_template = google_compute_instance_template.my_app.id
  }

  target_size        = 1
  wait_for_instances = false

  named_port {
    name = "http"
    port = 80
  }

  auto_healing_policies {
    health_check      = google_compute_health_check.my_app.self_link
    initial_delay_sec = 30
  }
}

Debug Output

https://gist.github.com/dv-stephen/610fafba3eddd0de9e941ee6fa7e13bd

Expected Behavior

If there is an issue with the MIG such as a bad health check or faulty instance config that prevents the MIG from reaching a healthy state, terraform should be able to refresh the resource and allow code changes to fix the MIG.

Actual Behavior

Terraform hangs on the refresh phase of the MIG resource, waiting for the MIG to become healthy which never happens. The only solution is manual intervention, preventing a GitOps model with changes being made through code.

Steps to Reproduce

  1. Deploy a MIG with wait_for_instances = true and a health check that will fail (see the sketch after these steps for the attributes that matter)
  2. The terraform run will timeout waiting for the MIG to become healthy
  3. Run terraform apply again which will timeout during the refresh phase
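For step 1, a minimal sketch of the attributes that matter, assuming the rest of the configuration above is unchanged; the health check fails simply because nothing is listening on port 80:

resource "google_compute_region_instance_group_manager" "my_app" {
  # ... same configuration as above ...

  # Block until the MIG reports a stable state.
  wait_for_instances = true

  auto_healing_policies {
    # With nothing answering on port 80, the check keeps failing and the
    # MIG never becomes healthy.
    health_check      = google_compute_health_check.my_app.self_link
    initial_delay_sec = 30
  }
}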
@ghost ghost added the bug label Jul 25, 2021
@edwardmedia edwardmedia self-assigned this Jul 26, 2021
@edwardmedia
Contributor

@dv-stephen do you have the debug log available?

@ghost
Author

ghost commented Jul 27, 2021

@ghost
Author

ghost commented Jul 27, 2021

It appears it's the same case with or without wait_for_instances. I was hoping I could write a custom provisioner to query the state of the MIG myself and let the terraform resource itself succeed so it can be refreshed, but it still doesn't get past the refresh phase. Any ideas or workarounds?

@edwardmedia
Contributor

@dv-stephen from the log you provided, I found the line below, which doesn't seem right: it contains projects twice when it should appear only once. How did you provide the value for project in the config? When you say you cannot refresh, I'm not sure whether that is related.

GET /compute/beta/projects/projects/strong-bad-o1xy/regions/us-central1/instanceGroupManagers/my-app?alt=json&prettyPrint=false HTTP/1.1

@ghost
Author

ghost commented Jul 28, 2021

@edward2a Oh interesting, I didn't catch that. I updated the issue with the actual test code and not just a trivial example. I excluded the project/network resources as they're just boilerplate but can add them if needed.

I didn't completely track it down, but the compute client appends projects to its base URL and I'm guessing the resource(s) are also adding projects.

computeBetaClientBasePath := c.ComputeBetaBasePath + "projects/"

@edwardmedia
Contributor

@dv-stephen once you fix the data for the project field, is this still an issue?

@ghost
Author

ghost commented Jul 29, 2021

@edward2a A couple of updates:

  1. I was wrong about wait_for_instances = false having the same behavior. With it set to false, Terraform does ignore the state of the MIG, and I can write a custom shell provisioner to query the MIG's health myself (see the sketch after this list). It's still a problem that the wait_for_instances = true option prevents the resource from being refreshed and requires manual intervention.

  2. The projects/projects path appears to be a bug in the provider. I hard-coded the project field of all the resources, but the debug log still shows API requests with projects/projects. I believe it's related to the code in my previous comment -- duplicated logic that adds projects to the URL. I suspect this is happening to a lot more resources than the ones I'm using, perhaps all of the compute resources.
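A minimal sketch of the shell-provisioner workaround mentioned in item 1, assuming gcloud is installed and authenticated where Terraform runs; the null_resource name, the trigger, and the --timeout flag/value are illustrative assumptions rather than anything from this thread:

resource "null_resource" "wait_for_mig_stable" {
  # Re-run the check whenever the MIG rolls onto a new instance template.
  triggers = {
    instance_template = google_compute_instance_template.my_app.id
  }

  # Polling outside of Terraform lets the MIG resource itself succeed and
  # stay refreshable even while instances are unhealthy.
  provisioner "local-exec" {
    command = <<-EOT
      gcloud compute instance-groups managed wait-until \
        ${google_compute_region_instance_group_manager.my_app.name} \
        --stable \
        --region ${google_compute_region_instance_group_manager.my_app.region} \
        --project ${google_compute_region_instance_group_manager.my_app.project} \
        --timeout 600
    EOT
  }
}

Because the command references the MIG resource's attributes, Terraform orders it after the MIG without needing an explicit depends_on.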

@edwardmedia
Contributor

@dv-stephen how did you hard-code it? I assume you were working directly with the provider resources (not modules), right? Can you share your code?

hard-coded the project field of all the resources but the debug log still shows API requests with projects/projects

@ghost
Author

ghost commented Jul 30, 2021

@edward2a Sure, I set the project field of each resource to a static string:

resource "google_compute_instance_template" "my_app" {
  project = "strong-bad-6nx4"
  region  = google_compute_subnetwork.primary_region.region

  name_prefix  = "my-app-"
  machine_type = "n1-standard-1"

  disk {
    boot         = true
    source_image = "cos-cloud/cos-stable"
    disk_type    = "pd-ssd"
    disk_size_gb = 40
  }

  network_interface {
    subnetwork = google_compute_subnetwork.primary_region.self_link
  }

  lifecycle {
    create_before_destroy = true
  }
}

resource "google_compute_health_check" "my_app" {
  project = "strong-bad-6nx4"
  name    = "my-app"

  check_interval_sec  = 10
  timeout_sec         = 5
  unhealthy_threshold = 5

  http_health_check {
    port         = 80
    request_path = "/-/health"
  }
}

resource "google_compute_region_instance_group_manager" "my_app" {
  project = "strong-bad-6nx4"
  region  = google_compute_instance_template.my_app.region

  name               = "my-app"
  base_instance_name = "my-app"

  version {
    instance_template = google_compute_instance_template.my_app.id
  }

  wait_for_instances = true
  target_size        = 1

  named_port {
    name = "http"
    port = 80
  }

  auto_healing_policies {
    health_check      = google_compute_health_check.my_app.self_link
    initial_delay_sec = 30
  }
}

There is no reusable or child Terraform module, just the root module. The provider is not configured with a project -- I specify the project on each resource.
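For completeness, a sketch of the provider block as described here; the region value is illustrative (it matches the region in the request URL from the debug log) and no project is set at the provider level:

provider "google" {
  # No default project; every resource sets project explicitly.
  region = "us-central1"
}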

@edwardmedia
Contributor

@dv-stephen can you post the debug log?

@ghost
Author

ghost commented Jul 30, 2021

@edward2a I'll work on getting you a test project for the MIG issues, but wanted to note that projects/projects appears to be a known bug? #9657 (comment)

@nphilbrook
Contributor

@edward2a I'll work on getting you a test project for the MIG issues, but wanted to note that projects/projects appears to be a known bug? #9657 (comment)

Based on my experience filing the bug you referenced here, there is a fairly common pattern: if you access the project or project_id attribute of a TF google resource, it is returned in the format projects/{{ project_id }}, but that format does NOT work as an input for most resources.
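Where a value does come back in that form, one defensive way to normalize it on the consuming side is Terraform's built-in trimprefix function; the google_some_resource reference below is purely hypothetical:

locals {
  # Accepts either "my-project" or "projects/my-project" and yields the bare project ID.
  project_id = trimprefix(google_some_resource.example.project, "projects/")
}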

@edwardmedia
Contributor

edwardmedia commented Aug 3, 2021

projects/projects is not what is causing this issue. If you look at the result, all of the resources are in fact created in the end. But, likely due to the failing health check, google_compute_region_instance_group_manager is marked as tainted, which leads to the timeout.

@edwardmedia edwardmedia changed the title Cannot refresh instance_group_manager if operation in progress instance_group_manager marked tainted if healthcheck failing Aug 3, 2021
@ghost
Author

ghost commented Aug 4, 2021

@edward2a Tainting would be the expected behavior here as the MIG should not be considered successfully deployed if health checks are failing. Consider the following two scenarios:

Failure on initial deployment

  1. I deploy the MIG with a bad health check
  2. The resource fails because the MIG does not reach a STABLE state
  3. Terraform taints the resource
  4. I correct the code to fix the failing health check
  5. When I run terraform plan or terraform apply, terraform can never "refresh" the MIG because it wants it to be in a healthy state. I am stuck as I cannot run terraform any longer. I must fix the MIG and/or health check manually.

Successful deployment, unhealthy later

  1. I deploy the MIG successfully and everything is healthy
  2. Something happens in production and the cluster becomes unhealthy
  3. I fix the problem in the code
  4. When I run terraform plan or terraform apply, terraform can never "refresh" the MIG because it wants it to be in a healthy state.

The main issue is that terraform plan|apply cannot be executed at all if the MIG is not in a healthy state when using the wait_for_instances option.

Do you still need a new debug log or other information?

@rileykarson
Collaborator

@dv-stephen: While #9657 isn't directly related (the problem there is that the user specified the wrong format and the API is behaving badly), you're correct that Terraform is incorrectly sending requests to projects/projects/{{project}} despite the correct value being in use in your config.

I've spun out #9722 to cover investigating that. We hadn't noticed the issue because the API was behaving correctly despite that; I suspect a change to the client library we use is the root cause. That said, I don't believe that error has an effect on the instance group manager behaviour here, so we can probably isolate the two discussions / fixes.

@edwardmedia
Contributor

@ScottSuarez how do you want to resolve this?

@ScottSuarez
Collaborator

A revision of my previous ask, after reading more into the resource.

Could you help me understand how to know when an instance is unhealthy and will never reach a stable state? I'm not very familiar with the health checks, but they seem to be part of autoHealingPolicies[].healthCheck, which seems to imply some automatic resolution here, unless a healthCheck is visible somewhere else.

If we did preempt early with a warning or otherwise, I would want to know for certain that the MIG will never reach a stable state. How can I tell that from the resource? Otherwise, if I stop polling while STABLE might still eventually be reached, wait_for_status seems to be a pointless flag.

@ScottSuarez
Collaborator

If I understand correctly

I would need to iterate over each managed instance and verify that every single instance is in an unhealthy state.
https://cloud.google.com/compute/docs/reference/rest/v1/instanceGroupManagers/listManagedInstances
https://cloud.google.com/compute/docs/instance-groups/autohealing-instances-in-migs

Would there need to be a buffer here between making the change to the IGM and verifying that all downstream instances are unhealthy? This isn't really tenable, and I would recommend asking the API to add such a feature if there are indeed scenarios where all downstream VMs will never reach a healthy state based on an IGM configuration (which, from your ask, it seems there are). I do not believe it should be Terraform's responsibility to make such a complicated analysis.

I believe the instanceGroupManagers API should surface such a status:
https://cloud.google.com/compute/docs/reference/rest/v1/instanceGroupManagers

@ScottSuarez ScottSuarez added the persistent-bug label and removed the bug label Aug 6, 2021
@ghost
Author

ghost commented Aug 6, 2021

@ScottSuarez I think the problem is that the create function invokes the read function -- they share the same logic. I believe the create logic works as expected: the resource should time out if the desired MIG state is never reached (in my case, STABLE). However, the read function should not require the MIG to be in any particular state, but should just read the current state of the MIG.

Here's where create calls read: https://github.com/hashicorp/terraform-provider-google/blob/master/google/resource_compute_instance_group_manager.go#L462.

I think a sensible solution is to simply change the resourceComputeInstanceGroupManagerRead function to read the current state rather than waiting for in-progress operations to complete.

@ScottSuarez
Collaborator

ScottSuarez commented Aug 6, 2021

read normally doesn't require a particular state. wait_for_instances and wait_for_instances_status are what enable it to wait for the state to be STABLE or UPDATED. If you don't use wait_for_instances (or set it to false), it should just shallowly apply the op and read the resource without waiting for the downstream manager to hit any particular state.

Reading the resource after create is a normal pattern for Terraform resources. It's how we refresh the state of the resource to reconcile what we did with what is present. The operation polling is separate: it is about the actual apply of the initial HTTP call, ensuring that the resource was created, since this is an asynchronous action.
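For reference, a minimal sketch of those two attributes set together on the MIG resource, using only the names and values mentioned in this comment:

resource "google_compute_region_instance_group_manager" "my_app" {
  # ... other arguments as in the configuration earlier in this issue ...

  # The provider only waits for the MIG when wait_for_instances is true;
  # wait_for_instances_status picks which state it waits for.
  wait_for_instances        = true
  wait_for_instances_status = "STABLE" # or "UPDATED"
}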

@ghost
Author

ghost commented Aug 6, 2021

@ScottSuarez Yeah, that mostly makes sense; however, read should never prevent a terraform resource from being refreshed. Even with wait_for_instances, read shouldn't be polling and waiting for a resource state -- it should read the current state, regardless of what it is. I agree the reading/polling is necessary for create and update. Does that make more sense?

@rileykarson rileykarson added this to the Sprint 16 (2021) milestone Aug 9, 2021
@ScottSuarez
Collaborator

@dv-stephen agreed. We shouldn't be doing this polling during read, since it can result in a broken refresh. I'll make the change to do this polling during create/update instead, with an increased timeout.
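On the configuration side, the knob for giving that create/update polling more headroom would be the resource's standard timeouts block; a sketch with illustrative durations:

resource "google_compute_region_instance_group_manager" "my_app" {
  # ... other arguments as above ...

  wait_for_instances = true

  # Illustrative durations; pick values long enough for the MIG to stabilize.
  timeouts {
    create = "20m"
    update = "20m"
  }
}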

@github-actions

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 17, 2021
@github-actions github-actions bot added the forward/review and service/compute-managed labels Jan 14, 2025