
instance_group_manager marked tainted if healthcheck failing #9631

Closed
ghost opened this issue Jul 25, 2021 · 23 comments · Fixed by GoogleCloudPlatform/magic-modules#5057, hashicorp/terraform-provider-google-beta#3531 or #9832
Assignees: edwardmedia
Labels: forward/review (In review; remove label to forward), persistent-bug (Hard to diagnose or long lived bugs for which resolutions are more like feature work than bug work), service/compute-managed, size/s

Comments

@ghost

ghost commented Jul 25, 2021

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request.
  • Please do not leave +1 or me too comments, they generate extra noise for issue followers and do not help prioritize the request.
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment.
  • If an issue is assigned to the modular-magician user, it is either in the process of being autogenerated, or is planned to be autogenerated soon. If an issue is assigned to a user, that user is claiming responsibility for the issue. If an issue is assigned to hashibot, a community member has claimed the issue already.

Terraform Version

Terraform v1.0.3

Affected Resource(s)

  • google_compute_region_instance_group_manager
  • google_compute_instance_group_manager

Terraform Configuration Files

I'm deploying a typical MIG, but with wait_for_instances = true:

resource "google_compute_instance_template" "my_app" {
  project = google_compute_subnetwork.primary_region.project
  region  = google_compute_subnetwork.primary_region.region

  name_prefix  = "my-app-"
  machine_type = "n1-standard-1"

  disk {
    boot         = true
    source_image = "cos-cloud/cos-stable"
    disk_type    = "pd-ssd"
    disk_size_gb = 40
  }

  network_interface {
    subnetwork = google_compute_subnetwork.primary_region.self_link
  }

  lifecycle {
    create_before_destroy = true
  }
}

resource "google_compute_health_check" "my_app" {
  project = google_compute_instance_template.my_app.project
  name    = "my-app"

  check_interval_sec  = 10
  timeout_sec         = 5
  unhealthy_threshold = 5

  http_health_check {
    port         = 80
    request_path = "/-/health"
  }
}

resource "google_compute_region_instance_group_manager" "my_app" {
  project = google_compute_instance_template.my_app.project
  region  = google_compute_instance_template.my_app.region

  name               = "my-app"
  base_instance_name = "my-app"

  version {
    instance_template = google_compute_instance_template.my_app.id
  }

  target_size        = 1
  wait_for_instances = false

  named_port {
    name = "http"
    port = 80
  }

  auto_healing_policies {
    health_check      = google_compute_health_check.my_app.self_link
    initial_delay_sec = 30
  }
}

Debug Output

https://gist.github.com/dv-stephen/610fafba3eddd0de9e941ee6fa7e13bd

Expected Behavior

If there is an issue with the MIG such as a bad health check or faulty instance config that prevents the MIG from reaching a healthy state, terraform should be able to refresh the resource and allow code changes to fix the MIG.

Actual Behavior

Terraform hangs on the refresh phase of the MIG resource, waiting for the MIG to become healthy which never happens. The only solution is manual intervention, preventing a GitOps model with changes being made through code.

Steps to Reproduce

  1. Deploy a MIG with wait_for_instances = true and a health check that will fail (see the sketch after these steps for the attributes that matter)
  2. The terraform run will timeout waiting for the MIG to become healthy
  3. Run terraform apply again which will timeout during the refresh phase
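For step 1, a minimal sketch of the attributes that matter, assuming the rest of the configuration above is unchanged; the health check fails simply because nothing is listening on port 80:

resource "google_compute_region_instance_group_manager" "my_app" {
  # ... same configuration as above ...

  # Block until the MIG reports a stable state.
  wait_for_instances = true

  auto_healing_policies {
    # With nothing answering on port 80, the check keeps failing and the
    # MIG never becomes healthy.
    health_check      = google_compute_health_check.my_app.self_link
    initial_delay_sec = 30
  }
}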
@ghost ghost added the bug label Jul 25, 2021
@edwardmedia edwardmedia self-assigned this Jul 26, 2021
@edwardmedia
Contributor

@dv-stephen do you have the debug log available?

@ghost
Author

ghost commented Jul 27, 2021

@ghost
Author

ghost commented Jul 27, 2021

It appears it's the same case with or without wait_for_instances. I was hoping I could write a custom provisioner to query the state of the MIG myself and let the terraform resource itself succeed so it can be refreshed, but it still doesn't get past the refresh phase. Any ideas or workarounds?

@edwardmedia
Contributor

@dv-stephen from the log you provided, I found the line below, which doesn't seem right: it contains projects twice when it should appear only once. How did you provide the value for project in the config? When you say you cannot refresh, I'm not sure whether that is related.

GET /compute/beta/projects/projects/strong-bad-o1xy/regions/us-central1/instanceGroupManagers/my-app?alt=json&prettyPrint=false HTTP/1.1

@ghost
Author

ghost commented Jul 28, 2021

@edward2a Oh interesting, I didn't catch that. I updated the issue with the actual test code and not just a trivial example. I excluded the project/network resources as they're just boilerplate but can add them if needed.

I didn't completely track it down, but the compute client appends projects to its base URL and I'm guessing the resource(s) are also adding projects.

computeBetaClientBasePath := c.ComputeBetaBasePath + "projects/"

@edwardmedia
Contributor

@dv-stephen once you fix the data for the project field, is this still an issue?

@ghost
Author

ghost commented Jul 29, 2021

@edward2a A couple of updates:

  1. I was wrong about wait_for_instances = false having the same behavior. With it set to false, Terraform does ignore the state of the MIG, and I can write a custom shell provisioner to query the MIG's health myself (see the sketch after this list). It's still a problem that the wait_for_instances = true option prevents the resource from being refreshed and requires manual intervention.

  2. The projects/projects path appears to be a bug in the provider. I hard-coded the project field of all the resources, but the debug log still shows API requests with projects/projects. I believe it's related to the code in my previous comment -- duplicated logic that adds projects to the URL. I suspect this is happening to a lot more resources than the ones I'm using, perhaps all of the compute resources.
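A minimal sketch of the shell-provisioner workaround mentioned in item 1, assuming gcloud is installed and authenticated where Terraform runs; the null_resource name, the trigger, and the --timeout flag/value are illustrative assumptions rather than anything from this thread:

resource "null_resource" "wait_for_mig_stable" {
  # Re-run the check whenever the MIG rolls onto a new instance template.
  triggers = {
    instance_template = google_compute_instance_template.my_app.id
  }

  # Polling outside of Terraform lets the MIG resource itself succeed and
  # stay refreshable even while instances are unhealthy.
  provisioner "local-exec" {
    command = <<-EOT
      gcloud compute instance-groups managed wait-until \
        ${google_compute_region_instance_group_manager.my_app.name} \
        --stable \
        --region ${google_compute_region_instance_group_manager.my_app.region} \
        --project ${google_compute_region_instance_group_manager.my_app.project} \
        --timeout 600
    EOT
  }
}

Because the command references the MIG resource's attributes, Terraform orders it after the MIG without needing an explicit depends_on.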

@edwardmedia
Contributor

@dv-stephen how did you hard-code it? I assume you were working directly with the provider resources (not modules), right? Can you share your code?

hard-coded the project field of all the resources but the debug log still shows API requests with projects/projects

@ghost
Author

ghost commented Jul 30, 2021

@edward2a Sure, I set the project field of each resource to a static string:

resource "google_compute_instance_template" "my_app" {
  project = "strong-bad-6nx4"
  region  = google_compute_subnetwork.primary_region.region

  name_prefix  = "my-app-"
  machine_type = "n1-standard-1"

  disk {
    boot         = true
    source_image = "cos-cloud/cos-stable"
    disk_type    = "pd-ssd"
    disk_size_gb = 40
  }

  network_interface {
    subnetwork = google_compute_subnetwork.primary_region.self_link
  }

  lifecycle {
    create_before_destroy = true
  }
}

resource "google_compute_health_check" "my_app" {
  project = "strong-bad-6nx4"
  name    = "my-app"

  check_interval_sec  = 10
  timeout_sec         = 5
  unhealthy_threshold = 5

  http_health_check {
    port         = 80
    request_path = "/-/health"
  }
}

resource "google_compute_region_instance_group_manager" "my_app" {
  project = "strong-bad-6nx4"
  region  = google_compute_instance_template.my_app.region

  name               = "my-app"
  base_instance_name = "my-app"

  version {
    instance_template = google_compute_instance_template.my_app.id
  }

  wait_for_instances = true
  target_size        = 1

  named_port {
    name = "http"
    port = 80
  }

  auto_healing_policies {
    health_check      = google_compute_health_check.my_app.self_link
    initial_delay_sec = 30
  }
}

There is no reusable or child Terraform module, just the root module. The provider is not configured with a project -- I specify the project on each resource.
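For completeness, a sketch of the provider block as described here; the region value is illustrative (it matches the region in the request URL from the debug log) and no project is set at the provider level:

provider "google" {
  # No default project; every resource sets project explicitly.
  region = "us-central1"
}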

@edwardmedia
Contributor

@dv-stephen can you post the debug log?

@ghost
Author

ghost commented Jul 30, 2021

@edward2a I'll work on getting you a test project for the MIG issues, but wanted to note that projects/projects appears to be a known bug? #9657 (comment)

@nphilbrook
Contributor

@edward2a I'll work on getting you a test project for the MIG issues, but wanted to note that projects/projects appears to be a known bug? #9657 (comment)

Based on my experience filing the bug you referenced here, there is a fairly common pattern: if you access the project or project_id attribute of a TF google resource, it is returned in the format projects/{{ project_id }}, but that format does NOT work as an input for most resources.
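Where a value does come back in that form, one defensive way to normalize it on the consuming side is Terraform's built-in trimprefix function; the google_some_resource reference below is purely hypothetical:

locals {
  # Accepts either "my-project" or "projects/my-project" and yields the bare project ID.
  project_id = trimprefix(google_some_resource.example.project, "projects/")
}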

@edwardmedia
Contributor

edwardmedia commented Aug 3, 2021

projects/projects is not what is causing this issue. If you look at the result, all of the resources are in fact created in the end. But, likely due to the failing health check, google_compute_region_instance_group_manager is marked as tainted, which leads to the timeout.

@edwardmedia edwardmedia changed the title Cannot refresh instance_group_manager if operation in progress instance_group_manager marked tainted if healthcheck failing Aug 3, 2021
@ghost
Author

ghost commented Aug 4, 2021

@edward2a Tainting would be the expected behavior here as the MIG should not be considered successfully deployed if health checks are failing. Consider the following two scenarios:

Failure on initial deployment

  1. I deploy the MIG with a bad health check
  2. The resource fails because the MIG does not reach a STABLE state
  3. Terraform taints the resource
  4. I correct the code to fix the failing health check
  5. When I run terraform plan or terraform apply, terraform can never "refresh" the MIG because it wants it to be in a healthy state. I am stuck as I cannot run terraform any longer. I must fix the MIG and/or health check manually.

Successful deployment, unhealthy later

  1. I deploy the MIG successfully and everything is healthy
  2. Something happens in production and the cluster becomes unhealthy
  3. I fix the problem in the code
  4. When I run terraform plan or terraform apply, terraform can never "refresh" the MIG because it wants it to be in a healthy state.

The main issue is that terraform plan|apply cannot be executed at all if the MIG is not in a healthy state when using the wait_for_instances option.

Do you still need a new debug log or other information?

@rileykarson
Collaborator

@dv-stephen: While #9657 isn't directly related (the problem there is that the user specified the wrong format and the API is behaving badly), you're correct that Terraform is incorrectly sending requests to projects/projects/{{project}} despite the correct value being in use in your config.

I've spun out #9722 to cover investigating that. We hadn't noticed the issue because the API was behaving correctly despite that; I suspect a change to the client library we use is the root cause. That said, I don't believe that error has an effect on the instance group manager behaviour here, so we can probably isolate the two discussions / fixes.

@edwardmedia
Contributor

@ScottSuarez how do you want to resolve this?

@ScottSuarez
Collaborator

A revision of my previous ask, after reading more into the resource.

Could you help me understand how to know when an instance is unhealthy and will never reach a stable state? I'm not very familiar with the health checks, but they seem to be part of autoHealingPolicies[].healthCheck, which seems to imply some automatic resolution here, unless a healthCheck is visible somewhere else.

If we did preempt early with a warning or otherwise, I would want to know for certain that the MIG will never reach a stable state. How can I tell that from the resource? Otherwise, if I stop polling while STABLE might still eventually be reached, wait_for_status seems to be a pointless flag.

@ScottSuarez
Collaborator

If I understand correctly

I would need to iterate over each managed instance and verify that every single instance is in an unhealthy state.
https://cloud.google.com/compute/docs/reference/rest/v1/instanceGroupManagers/listManagedInstances
https://cloud.google.com/compute/docs/instance-groups/autohealing-instances-in-migs

Would there need to be a buffer here between making the change to the IGM and verifying that all downstream instances are unhealthy? This isn't really tenable, and I would recommend asking the API to add such a feature if there are indeed scenarios where all downstream VMs will never reach a healthy state based on an IGM configuration (which, from your ask, it seems there are). I do not believe it should be Terraform's responsibility to make such a complicated analysis.

I believe the instanceGroupManagers API should surface such a status:
https://cloud.google.com/compute/docs/reference/rest/v1/instanceGroupManagers

@ScottSuarez ScottSuarez added the persistent-bug label and removed the bug label Aug 6, 2021
@ghost
Author

ghost commented Aug 6, 2021

@ScottSuarez I think the problem is that the create function invokes the read function -- they share the same logic. I believe the create logic works as expected: the resource should time out if the desired MIG state is never reached (in my case, STABLE). However, the read function should not require the MIG to be in any particular state, but should just read the current state of the MIG.

Here's where create calls read: https://github.com/hashicorp/terraform-provider-google/blob/master/google/resource_compute_instance_group_manager.go#L462.

I think a sensible solution is to simply change the resourceComputeInstanceGroupManagerRead function to read the current state rather than waiting for in-progress operations to complete.

@ScottSuarez
Collaborator

ScottSuarez commented Aug 6, 2021

read normally doesn't require a particular state. wait_for_instances and wait_for_instances_status are what enable it to wait for the state to be STABLE or UPDATED. If you don't use wait_for_instances (or set it to false), it should just shallowly apply the op and read the resource without waiting for the downstream manager to hit any particular state.

Reading the resource after create is a normal pattern for Terraform resources. It's how we refresh the state of the resource to reconcile what we did with what is present. The operation polling is separate: it is about the actual apply of the initial HTTP call, ensuring that the resource was created, since this is an asynchronous action.
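For reference, a minimal sketch of those two attributes set together on the MIG resource, using only the names and values mentioned in this comment:

resource "google_compute_region_instance_group_manager" "my_app" {
  # ... other arguments as in the configuration earlier in this issue ...

  # The provider only waits for the MIG when wait_for_instances is true;
  # wait_for_instances_status picks which state it waits for.
  wait_for_instances        = true
  wait_for_instances_status = "STABLE" # or "UPDATED"
}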

@ghost
Author

ghost commented Aug 6, 2021

@ScottSuarez Yeah, that mostly makes sense; however, read should never prevent a terraform resource from being refreshed. Even with wait_for_instances, read shouldn't be polling and waiting for a resource state -- it should read the current state, regardless of what it is. I agree the reading/polling is necessary for create and update. Does that make more sense?

@rileykarson rileykarson added this to the Sprint 16 (2021) milestone Aug 9, 2021
@ScottSuarez
Collaborator

@dv-stephen agreed. We shouldn't be doing this polling during read, since it can result in a broken refresh. I'll make the change to do this polling during create/update instead, with an increased timeout.
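On the configuration side, the knob for giving that create/update polling more headroom would be the resource's standard timeouts block; a sketch with illustrative durations:

resource "google_compute_region_instance_group_manager" "my_app" {
  # ... other arguments as above ...

  wait_for_instances = true

  # Illustrative durations; pick values long enough for the MIG to stabilize.
  timeouts {
    create = "20m"
    update = "20m"
  }
}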

@github-actions

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 17, 2021
@github-actions github-actions bot added the forward/review and service/compute-managed labels Jan 14, 2025