
plugin.terraform-provider-google_v2.11.0_x4: panic: runtime error: invalid memory address or nil pointer dereference #5018

Closed
hallvors opened this issue Nov 28, 2019 · 19 comments · Fixed by GoogleCloudPlatform/magic-modules#3194, #5808 or hashicorp/terraform-provider-google-beta#1812

hallvors commented Nov 28, 2019

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
  • If an issue is assigned to the "modular-magician" user, it is either in the process of being autogenerated, or is planned to be autogenerated soon. If an issue is assigned to a user, that user is claiming responsibility for the issue. If an issue is assigned to "hashibot", a community member has claimed the issue already.

Terraform Version

Terraform v0.12.16

  • provider.google v2.11.0
  • provider.google-beta v3.0.0-beta.1

Affected Resource(s)

  • google_v2.11.0_x4

Terraform Configuration Files

The crash appears to have something to do with this module: if my root module no longer references it, the crash goes away. Creating two managed instance groups from the same template should be fine, right?

resource "google_compute_instance_template" "default" {
  name_prefix = "${var.project_appname}-${var.target_environment}-instance-"
  description = "This template is used to create app server instances in a managed instance group. Managed by Terraform."

  tags = ["ssl", "http"]
  labels = {
    environment = var.target_environment
  }

  instance_description = "${var.project_appname}-${var.target_environment} instance. Managed by Terraform."
  machine_type         = "n1-standard-1"
  project              = var.google_project_name
  region               = var.google_region

  scheduling {
    automatic_restart   = true
    on_host_maintenance = "MIGRATE"
  }

  // Create a new boot disk from an image
  disk {
    source_image = var.img_link
    auto_delete  = true
    boot         = true
  }

  network_interface {
    network = "default"
    access_config {}
  }
  # TODO: not sure if these env vars are useful
  metadata_startup_script = "export APP=${var.project_appname}\nexport REPO=${var.project_repository}"
}

resource "google_compute_instance_group_manager" "webservers_backend" {
  provider    = google-beta
  name        = "${var.project_appname}-${var.target_environment}-backend"
  description = "Instance group, backend servers. Managed by Terraform."

  base_instance_name = "${var.project_appname}-${var.target_environment}-backend"
  zone               = var.google_zone

  version {
    name              = "app_instance_group"
    instance_template = google_compute_instance_template.default.self_link
  }


  target_size = 1

  named_port {
    name = "http"
    port = "8080"
  }


  auto_healing_policies {
    health_check      = google_compute_health_check.autohealing.self_link
    initial_delay_sec = 300
  }

  lifecycle {
    create_before_destroy = true
  }
}

resource "google_compute_instance_group_manager" "webservers_frontend" {
  provider    = google-beta
  name        = "${var.project_appname}-${var.target_environment}-frontend"
  description = "Instance group, frontend servers. Managed by Terraform."

  base_instance_name = "${var.project_appname}-${var.target_environment}-frontend"
  zone               = var.google_zone

  version {
    name              = "app_instance_group"
    instance_template = google_compute_instance_template.default.self_link
  }

  named_port {
    name = "http"
    port = "8080"
  }

  auto_healing_policies {
    health_check      = google_compute_health_check.autohealing.self_link
    initial_delay_sec = 300
  }

  lifecycle {
    create_before_destroy = true
  }
}

# some infrastructure-y things: health check, autoscaler

resource "google_compute_health_check" "autohealing" {
  provider            = google-beta
  name                = "${var.project_appname}-${var.target_environment}-autohealing-health-check"
  check_interval_sec  = 15
  timeout_sec         = 10
  healthy_threshold   = 2
  unhealthy_threshold = 10 # 50 seconds

  http_health_check {
    request_path = "/gcp_healtcheck"
    port         = "8080"
  }
}

resource "google_compute_autoscaler" "default" {
  provider = google-beta

  name   = "${var.project_appname}-${var.target_environment}-frontend-autoscaler"
  zone   = var.google_zone
  target = google_compute_instance_group_manager.webservers_frontend.self_link

  autoscaling_policy {
    max_replicas    = 5
    min_replicas    = 1
    cooldown_period = 60
  }
}

Debug Output

Console output: https://gist.github.com/hallvors/c41d3ca7bcc19fd1090f993ae25ee01a

Panic Output

https://gist.github.com/hallvors/95553bc0ac2cca81eae2f03f88d25262

Expected Behavior

No crash; resource setup completes.

Actual Behavior

It crashes consistently on both plan and apply when the run is nearly finished. Deleting the state (and the generated resources) makes plan work again, AFAIK.

Steps to Reproduce

I mostly just run

  1. terraform apply

with lots of -var arguments. The backend is remote on GCP.

Important Factoids

The config uses one plain google provider, one google-beta provider, and one google provider with an alias (to use a different service account auth file).
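
For reference, a minimal sketch of the provider setup described above; the alias name and the exact arguments are illustrative assumptions, not copied from the actual config (the variable names are borrowed from the reproduction command later in this thread):

provider "google" {
  credentials = file(var.service_account_file)
  project     = var.google_project_name
  region      = var.google_region
}

provider "google-beta" {
  credentials = file(var.service_account_file)
  project     = var.google_project_name
  region      = var.google_region
}

# Aliased google provider using a different service account auth file,
# selected in resources/modules via provider = google.dns (hypothetical alias).
provider "google" {
  alias       = "dns"
  credentials = file(var.service_account_file_dns)
  project     = var.google_dns_project_name
  region      = var.google_region
}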

References

None

@ghost ghost added the bug label Nov 28, 2019
hallvors (Author) commented:

Maybe this is related to the state JSON trying to list the "instances", while Google Compute Engine is basically starting and stopping them at will? So when Terraform runs and tries to compare its state with reality, there will likely be some inconsistencies?

plan and apply crash. destroy works, and the next plan/apply does not crash and appears to succeed. However, it soon starts crashing again. (FWIW, I'm deploying an app that is a work in progress and crashes a lot, so GCE is going to consider the instances unhealthy much of the time.)

hallvors (Author) commented:

I hacked the Terraform state and emptied the instances: [ ... ] arrays of all google_compute_instance_template entries. This prevented the crash, so I think I'm making some useful assumptions here.

Chupaka (Contributor) commented Nov 29, 2019

Can you check with a more recent version of the provider? 2.11 is quite old.

hallvors (Author) commented:

Hi @Chupaka, thanks. I'm a newbie, so I expected terraform init -upgrade to get the latest version(s) of things, but thanks to your comment I noticed that some copy-pasted code indeed pins the Google provider version to 2.11. I'll have to run some more deploys to be certain the issue is gone, but thanks a lot for the response. I will follow up after more testing.
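
For context, a hypothetical example of the kind of copy-pasted pin described above (Terraform 0.12 syntax); this is not the reporter's actual code. terraform init -upgrade only upgrades within the stated constraint:

# Pinned: terraform init -upgrade will not move past the 2.11.x series.
provider "google" {
  version = "~> 2.11.0"
}

# Relaxed/bumped constraint: lets terraform init -upgrade fetch newer releases.
# provider "google" {
#   version = "~> 3.1"
# }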

hallvors (Author) commented Dec 9, 2019

I tried with 3.1.0 now and it still seems to crash at the "Refreshing state..." step for google_compute_instance_template. I'm not sure if it is exactly the same problem, but it seems very similar:

module.slipway-servers.google_compute_instance_template.default: Refreshing state... [id=capua-staging-instance-20191209141641307900000001]

Error: rpc error: code = Unavailable desc = transport is closing

panic: runtime error: invalid memory address or nil pointer dereference
2019-12-09T16:19:51.044+0100 [DEBUG] plugin.terraform-provider-google_v3.1.0_x5: [signal SIGSEGV: segmentation violation code=0x1 addr=0x28 pc=0xd56d47]
2019-12-09T16:19:51.044+0100 [DEBUG] plugin.terraform-provider-google_v3.1.0_x5:
2019-12-09T16:19:51.044+0100 [DEBUG] plugin.terraform-provider-google_v3.1.0_x5: goroutine 71 [running]:
2019-12-09T16:19:51.044+0100 [DEBUG] plugin.terraform-provider-google_v3.1.0_x5: github.com/hashicorp/terraform-plugin-sdk/internal/helper/

etc.

@edwardmedia edwardmedia self-assigned this Jan 7, 2020
edwardmedia (Contributor) commented Jan 7, 2020

@hallvors Can you post the latest config file that you tested with the updated version 3.1.0? The provider was google-beta. If you still see the crash, please post both logs again. Thanks.

hallvors (Author) commented:

@edwardmedia Can I share a demo project with you so you can try to reproduce it?

@ghost ghost removed the waiting-response label Jan 23, 2020
edwardmedia (Contributor) commented:

@hallvors Yes, please. Let me know how I can repro this.

hallvors (Author) commented Jan 24, 2020

@edwardmedia I have attempted to create a demo and zipped it up here:
https://drive.google.com/file/d/1cuZkPmsWRjYxK-aZUyzMtXtK70dhllL4/view?usp=sharing

I have tried to remove secrets and set up a dedicated GCP project with a separate service account. Hopefully I have not published anything risky; if you spot anything, let me know. Also, kindly let me know when you have downloaded the zip so I can remove it, just in case.

(I suppose this zip now also contains some local terraform state and such.)

I first tried to write a demo from scratch to keep it as simple as possible, but that approach failed to reproduce the problem. So this is probably a bit more complex than it needs to be, but I've commented out some Ansible stuff and things that are probably not relevant. As I'm relatively new to both GCP and Terraform, the approaches may not make sense :) but I hope you'll be able to reproduce the crash.

Here's how to reproduce:

cd google-terraform-provider-crash/slipway-test-master
chmod +x ./slipway/init.sh
./slipway/init.sh

Terraform usually crashes on the second run of init.sh.

@ghost ghost removed the waiting-response label Jan 24, 2020
hallvors (Author) commented:

(I should also, of course, remove that project when you're done, since Google is probably charging for the test VMs :) )

edwardmedia (Contributor) commented Jan 24, 2020

@hallvors Got the zip file. Please remove it.

c2thorn (Collaborator) commented Feb 7, 2020

Hi @hallvors. Thanks for the effort in creating a demo for us to reproduce. However, would you mind uploading your demo to a GitHub repository or something a bit more transparent? Also, the simpler the demo, the easier it will be for us to pinpoint the root issue. Is there a way we can repro with a terraform apply instead of a shell script?

At first glance, it looks like an issue outside of our provider (the panic output shows a nil pointer on this line), but it will be hard to prove either way without being able to easily repro.

hallvors (Author) commented Feb 8, 2020

Hi @c2thorn, thanks for following up. As far as I remember, I set up the demo to actually create infrastructure in a (test) project on my employer's Google account, and I'm a bit worried about abuse if it's kept online. I can share the file if you email hallvord at minus dot no. The shell script runs a few Terraform init and apply commands; if you add set -x (or whatever it is) at the start you will see them, and it is possible to repro by just re-running the last one once everything is set up. Sorry about the complexity; I tried writing a very minimal demo but did not get to a point where it reproduced the problem.

Finally, I asked the Terraform devs first and they sent me here :)

@ghost ghost removed the waiting-response label Feb 8, 2020
c2thorn (Collaborator) commented Feb 13, 2020

Hey @hallvors, sorry for the delay. To help resolve your issue, let's figure out a way to get your configuration in a more transparent manner. Posting a Terraform config will not give other users access to your employer's infrastructure; if you remove the project/orgId, there shouldn't be any chance for malicious use. If you are still concerned, feel free to join our Slack channel and we can discuss details/configurations over direct messages.

hallvors (Author) commented Feb 14, 2020

OK @c2thorn, please clone this:
https://github.com/hallvors/google-terraform-provider-crash

Running terraform apply in slipway-test-master/slipway/terraform/rollout/ tends to cause the crash. But of course you need a bit of state first, and I call this from a shell script that sets plenty of required variables, some of them output from terraform commands in other subfolders, so the actual command is somewhat like this (lightly censored):

terraform apply -auto-approve -var service_account_file=/home/hallvord/repos/capua/config/local-secrets/<censored>-projects-38a4a139c534.json -var project_appname=capua -var google_project_name=<censored> -var service_account_file_dns=/home/hallvord/repos/capua/slipway/config/local-secrets/infrastructure-dns-key.json -var google_dns_project_name=minfrastructure -var google_dns_zone=test-no -var top_level_domain= -var public_server_name=example.no -var internal_server_name=example.no -var admin_server_name=admin.example.no -var google_region=europe-north1 -var google_zone=europe-north1-a -var project_repository=git@github.com:test/example.git -var branch=master -var update_disk_link=https://www.googleapis.com/compute/v1/projects/<censored>/zones/europe-north1-a/disks/capua-production-updatevm-disk -var img_name=capua-master-20200214-2135

@ghost ghost removed the waiting-response label Feb 14, 2020
c2thorn (Collaborator) commented Feb 20, 2020

Thanks for being so cooperative, @hallvors! Just wanted to provide an update: I've been able to repro the crash while limiting the scope down to just the servers and diskimage modules (servers alone didn't seem to do it). While a bit busy at the moment, I should be able to look for the root cause in the next couple of days. Thanks for your patience.

c2thorn (Collaborator) commented Feb 27, 2020

Another update here @hallvors. This is definitely a bug, but there may be a workaround before the fix is finished.

The relevant parts of the crash you are facing have to do with your google_compute_instance_template referencing your google_compute_image self_link in source_image. What I believe is happening is that your google_compute_image is created the first time your script calls terraform apply, but is then planned for recreation on the second terraform apply. This creates a situation where the google_compute_instance_template source_image reference is planned to change to a value that will not be known until after the apply is finished.

Normally, Terraform would handle this situation well, but unfortunately there is some custom code written for google_compute_instance_template that does not handle this edge case, resulting in the crash.

While this most certainly should be fixed, I don't think you actually intend to create your google_compute_image resource first and then modify it later, causing it to be destroyed/recreated. If you modify your setup to create the image resource once in its final state, I think that will prevent you from seeing the crash.
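
To illustrate the pattern described above, here is a rough sketch; the resource names and image arguments are assumptions based on this thread (var.img_name comes from the reproduction command), not the actual demo code:

# The image name changes on every build (it embeds a timestamp), so the second
# apply plans to destroy and recreate the image.
resource "google_compute_image" "app" {
  name = var.img_name # e.g. "capua-master-20200214-2135", a new value each run

  raw_disk {
    source = var.image_source_url # hypothetical; any change forcing recreation triggers the issue
  }
}

resource "google_compute_instance_template" "default" {
  # ...
  disk {
    # On the second apply this self_link is only "known after apply", and the
    # provider's custom instance-template code did not handle that, causing the panic.
    source_image = google_compute_image.app.self_link
    auto_delete  = true
    boot         = true
  }
}

The suggested workaround is to create the image once in its final state (for example, with a stable name) so the template's source_image reference does not become an unknown value on subsequent applies.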

hallvors (Author) commented Mar 3, 2020

Thanks a lot for your work on this, @c2thorn ! 🙇‍♂️

ghost commented Apr 2, 2020

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you feel this issue should be reopened, we encourage creating a new issue linking back to this one for added context. If you feel I made an error 🤖 🙉 , please reach out to my human friends 👉 hashibot-feedback@hashicorp.com. Thanks!

@ghost ghost locked and limited conversation to collaborators Apr 2, 2020