instance_group_manager marked tainted if healthcheck failing #9631
Comments
@dv-stephen do you have the debug log available?
@edwardmedia Yes, here's the debug log: https://gist.github.com/dv-stephen/610fafba3eddd0de9e941ee6fa7e13bd
It appears it's the same case with or without …
@dv-stephen from the log you provided, I found the line below, which seems not right. It contains 2 …
@edward2a Oh interesting, I didn't catch that. I updated the issue with the actual test code and not just a trivial example. I excluded the project/network resources as they're just boilerplate, but can add them if needed. I didn't completely track it down, but the compute client appends … (see terraform-provider-google/google/config.go, line 517 at bafb3b9).
@dv-stephen once you fix the data for the …
@edward2a Couple of updates: …
@dv-stephen how did you hard-code? I assume you were running directly on the provider resources (not modules), right? Can you share your code?
@edward2a Sure, I set the …:

resource "google_compute_instance_template" "my_app" {
  project      = "strong-bad-6nx4"
  region       = google_compute_subnetwork.primary_region.region
  name_prefix  = "my-app-"
  machine_type = "n1-standard-1"

  disk {
    boot         = true
    source_image = "cos-cloud/cos-stable"
    disk_type    = "pd-ssd"
    disk_size_gb = 40
  }

  network_interface {
    subnetwork = google_compute_subnetwork.primary_region.self_link
  }

  lifecycle {
    create_before_destroy = true
  }
}

resource "google_compute_health_check" "my_app" {
  project             = "strong-bad-6nx4"
  name                = "my-app"
  check_interval_sec  = 10
  timeout_sec         = 5
  unhealthy_threshold = 5

  http_health_check {
    port         = 80
    request_path = "/-/health"
  }
}

resource "google_compute_region_instance_group_manager" "my_app" {
  project            = "strong-bad-6nx4"
  region             = google_compute_instance_template.my_app.region
  name               = "my-app"
  base_instance_name = "my-app"

  version {
    instance_template = google_compute_instance_template.my_app.id
  }

  wait_for_instances = true
  target_size        = 1

  named_port {
    name = "http"
    port = 80
  }

  auto_healing_policies {
    health_check      = google_compute_health_check.my_app.self_link
    initial_delay_sec = 30
  }
}

There is no terraform reusable or child module, just the root. The provider is not configured with a project -- I specify the …
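For context, a minimal sketch of the provider setup being described: no project is configured at the provider level, and each resource sets project explicitly instead. The region value below is made up for illustration and is not from the original configuration.

provider "google" {
  # No default project is set here; each resource above specifies
  # project = "strong-bad-6nx4" explicitly instead.
  region = "us-central1" # illustrative value
}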
@dv-stephen can you post the debug log?
@edward2a I'll work on getting you a test project for the MIG issues, but wanted to note that …
Based on my experience filing the bug you referenced here, there is a pretty common pattern that if you access the …
@edward2a Tainting would be the expected behavior here, as the MIG should not be considered successfully deployed if health checks are failing. Consider the following two scenarios:
1. Failure on initial deployment: …
2. Successful deployment, unhealthy later: …
The main issue is that … Do you still need a new debug log or other information?
@dv-stephen: While #9657 isn't directly related (the problem there is that the user specified the wrong format and the API is behaving badly), you're correct that Terraform is incorrectly sending requests to … I've spun out #9722 to cover investigating that. We hadn't noticed the issue because the API was behaving correctly despite that; I suspect a change to the client library we use is the root cause. That said, I don't believe that error has an effect on the instance group manager behaviour here, so we can probably isolate the two discussions / fixes.
@ScottSuarez how do you want to resolve this?
Revision of my previous ask after reading more into the resource: could you help me understand how to know when an instance is unhealthy and will never reach a stable state? I'm not very familiar with the health checks, but it seems to be a part of … If we did do such an early preemption, with a warning or otherwise, I would want to know for certain that the instance will never reach a stable state. How can I tell that from the resource? Otherwise, if I stop the poll early, a stable state may eventually be reached.
If I understand correctly, I would need to iterate over each managed instance and verify that every single instance is in an unhealthy state. Would there need to be a buffer between making the change to the IGM and verifying that all downstream instances are unhealthy? This isn't really tenable, and I would recommend asking the API to add such a feature if there are indeed scenarios where all downstream VMs will never reach a healthy state based on an IGM configuration (which it seems like there are, from your ask). I do not believe it should be Terraform's responsibility to make such a complicated analysis; I believe the instanceGroupManager API should surface such a status.
@ScottSuarez I think the problem is that the … Here's where I think a sensible solution is to simply change the …
Reading the resource after create is a normal pattern for Terraform resources; it's how we refresh the state of the resource to reconcile what we did with what is present. The operation polling is separate: it is about the actual apply of the initial HTTP call, ensuring that the resource was created, since this is an asynchronous action.
@ScottSuarez Yeah, that mostly makes sense; however, …
@dv-stephen agreed. We shouldn't be doing this polling during read since it can result in a broken refresh. I'll make the change to do this polling during create/update with an increased timeout.
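A minimal sketch of what raising the timeouts could look like from the configuration side, assuming the resource supports the standard timeouts block. This is illustrative only and is not the provider-side change being discussed; the region and timeout values are made up:

resource "google_compute_region_instance_group_manager" "my_app" {
  project            = "strong-bad-6nx4"
  region             = "us-central1" # illustrative; the original config derives this from the template
  name               = "my-app"
  base_instance_name = "my-app"

  version {
    instance_template = google_compute_instance_template.my_app.id
  }

  target_size        = 1
  wait_for_instances = true

  auto_healing_policies {
    health_check      = google_compute_health_check.my_app.self_link
    initial_delay_sec = 30
  }

  # Standard Terraform resource operation timeouts (assumed supported by this
  # resource); longer values give create/update more time to wait for
  # instances to become healthy.
  timeouts {
    create = "30m" # illustrative value
    update = "30m" # illustrative value
  }
}

The idea is that once the polling happens during create/update, a timeout would surface as a failed apply rather than as a hung refresh.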
I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.
Terraform Version
Terraform v1.0.3
Affected Resource(s)
google_compute_region_instance_group_manager
google_compute_instance_group_manager
Terraform Configuration Files
I'm deploying a typical MIG, but with wait_for_instances = true:
Debug Output
https://gist.github.com/dv-stephen/610fafba3eddd0de9e941ee6fa7e13bd
Expected Behavior
If there is an issue with the MIG (such as a bad health check or a faulty instance config) that prevents it from reaching a healthy state, Terraform should be able to refresh the resource and allow code changes to fix the MIG.
Actual Behavior
Terraform hangs on the refresh phase of the MIG resource, waiting for the MIG to become healthy, which never happens. The only solution is manual intervention, which prevents a GitOps model where changes are made through code.
Steps to Reproduce
1. Deploy the configuration with wait_for_instances = true and a health check that will fail (a sketch of such a health check follows these steps)
2. Run terraform apply again, which will time out during the refresh phase
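For step 1, a minimal sketch of a health check that is effectively guaranteed to keep the MIG unhealthy; the port and path are illustrative and intentionally point at something the instances do not serve:

resource "google_compute_health_check" "always_failing" {
  project             = "strong-bad-6nx4"
  name                = "always-failing"
  check_interval_sec  = 10
  timeout_sec         = 5
  unhealthy_threshold = 5

  http_health_check {
    # Probing a port/path that nothing listens on keeps every managed
    # instance unhealthy, which reproduces the refresh hang described above.
    port         = 81                # illustrative
    request_path = "/does-not-exist" # illustrative
  }
}

Any configuration where the instances can never pass the health check should work equally well; the request path above is just one way to force that.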