
Terraform: Helm module should remove itself at destroy time before cluster deletion #1422

Closed
markmandel opened this issue Mar 19, 2020 · 12 comments
Labels
- area/operations: Installation, updating, metrics etc.
- kind/feature: New features for Agones
- obsolete
- wontfix: Sorry, but we're not going to do that.

Comments

@markmandel
Collaborator

Is your feature request related to a problem? Please describe.
This is particularly frustrating with GKE; I'm not sure how it is with other providers.

When you delete a GKE cluster, if there is a Service set up, the firewall rules and load balancers are left in place and aren't deleted. So you can sometimes hit quota limits and/or incur extra charges for load balancers, external IPs, etc. that you aren't using.

Describe the solution you'd like

What I would suggest is that when a destroy event occurs on a cluster, the Terraform Helm module should run a helm delete --purge on the installed chart before the cluster is removed, to ensure this gets cleaned up.

See https://www.terraform.io/docs/provisioners/index.html#destroy-time-provisioners for the hooks to implement this.
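A minimal sketch of what such a destroy-time hook could look like (resource and release names are illustrative, Helm 2 syntax as was current for this issue):

```hcl
resource "helm_release" "agones" {
  # ... chart, repository, and values as in the existing module ...

  # Destroy-time provisioner: runs before Terraform destroys this
  # resource, giving the GCE controller a chance to tear down the
  # load balancers and firewall rules the chart's Services created.
  provisioner "local-exec" {
    when    = destroy
    command = "helm delete --purge agones" # Helm 2 syntax
  }
}
```

As the comments below show, this naive form can race with the provider's own destroy of the release.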

Describe alternatives you've considered

Writing a bash script to clean up orphaned resources, but load balancers in GCP are a combination of various other resources, and it gets complicated 😕

Additional context
https://github.com/pantheon-systems/kube-gce-cleanup
kubernetes/ingress-gce#136

@markmandel markmandel added kind/feature New features for Agones area/operations Installation, updating, metrics etc labels Mar 19, 2020
@aLekSer
Collaborator

aLekSer commented Mar 20, 2020

Relates to #1403 .

@aLekSer
Collaborator

aLekSer commented Mar 23, 2020

First of all, the solution is already mentioned in this repo:
https://github.com/GoogleCloudPlatform/terraform-google-examples/tree/master/example-gke-k8s-helm#cleanup

Delete the nginx-ingress helm release first so that the forwarding rule and firewall rule are cleaned up by the GCE controller

I have added this into my PR #1375 and tested that no leftovers could be found in the Firewall rules and Forwarding rules (load balancer) tabs.
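In shell terms, the documented cleanup order from the linked example is simply (Helm 2 syntax, nginx-ingress release name per that example):

```shell
# Delete the chart's release first, so the GCE controller removes the
# forwarding rule and firewall rule it created for the Service...
helm delete --purge nginx-ingress
# ...then destroy the cluster and the rest of the infrastructure.
terraform destroy
```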

@aLekSer
Collaborator

aLekSer commented Mar 23, 2020

I have tried using a provisioner:

resource "helm_release" "agones" {
  name = "agones"
  ...
  provisioner "local-exec" {
    when    = destroy
    command = "helm delete --purge agones"
  }
}

And it does run helm delete, but it then fails: after the provisioner succeeds, Terraform goes on to destroy the actual helm_release resource and can no longer find the agones release:

module.helm_agones.helm_release.agones: Refreshing state... [id=agones]
module.gke_cluster.null_resource.test-setting-variables: Destroying... [id=2666900790194419227]                                                                                                          
module.gke_cluster.null_resource.test-setting-variables: Destruction complete after 0s                                                                                                                   
module.gke_cluster.google_compute_firewall.default: Destroying... [id=game-server-firewall-firewall-test-c]                                                                                              
module.helm_agones.helm_release.agones: Destroying... [id=agones]                                                                                                                                        
module.helm_agones.helm_release.agones: Provisioning with 'local-exec'...                                                                                                                                
module.helm_agones.helm_release.agones (local-exec): Executing: ["/bin/sh" "-c" "echo 'Destroy' && helm delete --purge agones"]                                                                          
module.helm_agones.helm_release.agones (local-exec): Destroy                                                                                                                                             
module.gke_cluster.google_compute_firewall.default: Destruction complete after 9s                                                                                                                        
module.helm_agones.helm_release.agones: Still destroying... [id=agones, 10s elapsed]                                                                                                                     
module.helm_agones.helm_release.agones (local-exec): release "agones" deleted                                                                                                                            
module.helm_agones.helm_release.agones: Still destroying... [id=agones, 20s elapsed]                                                                                                                     
                                                                                                                                                                                                         
Error: rpc error: code = Unknown desc = release: "agones" not found   

@markmandel
Collaborator Author

/cc @chrisst any thoughts on this?

Is there something erroneous with our helm approach?

@markmandel
Collaborator Author

markmandel commented Mar 30, 2020

Weird that it fails when you do that, @aLekSer -- if I run a script to delete all Helm releases before destroying everything else, it's fine.

A couple of theories:

  1. The helm delete operation is async by default, so it's actually running, but Terraform isn't waiting for it to finish before deleting the cluster?
  2. As a thought: if you run command = "helm delete agones" (no --purge) on destroy like above, would it leave behind an empty Helm release that is then still available for TF to delete? Might be worth a shot.

@chrisst
Contributor

chrisst commented Mar 30, 2020

Unfortunately I'm not very experienced with mixing Helm and Terraform, so my thoughts are more educated guesses at this point. I don't think using a local-exec provisioner is heading down the correct path. If a Terraform resource, in this case the Helm release, is removed or modified by an external process (local-exec), it is almost always going to be problematic for Terraform. Cleaning up after a resource should be handled by the resource's destroy call, so in this case the helm_release should be cleaning up the resources it created. Otherwise, to me, it's the equivalent of shelling out to gcloud to delete a GCP resource because the resource's destroy is buggy.

It does look like helm should be doing an uninstall on destroy, so if there are dangling resources after that call happens, I would think it's a bug in the cleanup of the helm_release.

I suspect it's failing because Terraform is trying to delete the Helm release, which has already been removed through the provisioner call. You can try looking at the debug logs for more information.
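For those debug logs, a sketch using Terraform's standard logging environment variables:

```shell
# Capture provider-level detail about the failing destroy to a file.
TF_LOG=DEBUG TF_LOG_PATH=./destroy-debug.log terraform destroy
```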

@akremsa
Contributor

akremsa commented Apr 7, 2020

@aLekSer and I tried the following approaches:

  1. Adding
  provisioner "local-exec" {
    when    = "destroy"
    command = "helm delete agones"
  }

to resource "helm_release" "agones"

Result:
There is the following error during the destroy step:

module.helm_agones.helm_release.agones: Provisioning with 'local-exec'...
module.helm_agones.helm_release.agones (local-exec): Executing: ["/bin/sh" "-c" "helm delete agones"]
module.helm_agones.helm_release.agones (local-exec): Error: Get https://35.228.125.186/api/v1/namespaces/kube-system/pods?labelSelector=app%3Dhelm%2Cname%3Dtiller: error executing access token command "/opt/google-cloud-sdk/bin/gcloud config config-helper --format=json": err=fork/exec /opt/google-cloud-sdk/bin/gcloud: no such file or directory output= stderr=
module.gke_cluster.google_compute_firewall.default: Destruction complete after 8s
Warning: Quoted keywords are deprecated
on ../../../install/terraform/modules/gke/cluster.tf line 134, in resource "google_container_cluster" "primary":
134: when = "destroy"

  2. Adding
  provisioner "local-exec" {
    when    = "destroy"
    command = "echo 'Destroy-time provisioner' && sleep 60 && echo 'done'"
  }

to resource "google_container_cluster" "primary" in install/terraform/modules/gke/cluster.tf

Result:
Everything is deleted except the ingress.

@aLekSer aLekSer self-assigned this Aug 18, 2020
@advissor

Hi,
I got a similar issue with an EKS cluster and randomly found this topic.
For me, everything provisioned inside the Kubernetes cluster was only attempted to be destroyed after the cluster was removed.
This left ingresses, security groups, and helm_releases yelling.

So the fix was, prior to destroy, to run: terraform refresh
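The workaround sketched here, assuming state has drifted after the cluster was removed, is just:

```shell
# Reconcile state with reality first, so resources that already
# disappeared along with the cluster are dropped from state...
terraform refresh
# ...then destroy what actually remains.
terraform destroy
```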

@charliesmith-mindera

Hi,

I also found a similar issue, where the helm resource deletion was not completing before the namespace was removed. In my case the only thing being deployed to the namespace was the helm chart handled by the helm_release resource, so I added the below, which then meant it deleted cleanly, rather than creating the namespace separately.
resource "helm_release" "prometheus_install" {
  name             = "prometheus-for-amp"
  repository       = "https://prometheus-community.github.io/helm-charts"
  chart            = "prometheus"
  create_namespace = true
  namespace        = var.prometheus_namespace
}

@github-actions

github-actions bot commented Jun 1, 2023

This issue is marked as Stale due to inactivity for more than 30 days. To avoid being marked as 'stale', please add the 'awaiting-maintainer' label or add a comment. Thank you for your contributions.

@github-actions github-actions bot added the stale Pending closure unless there is a strong objection. label Jun 1, 2023
@github-actions

This issue is marked as obsolete due to inactivity for the last 60 days. To avoid the issue getting closed in the next 30 days, please add a comment or add the 'awaiting-maintainer' label. Thank you for your contributions.

@github-actions github-actions bot added obsolete and removed stale Pending closure unless there is a strong objection. labels Jul 15, 2023
@github-actions github-actions bot added the wontfix Sorry, but we're not going to do that. label Aug 15, 2023
@github-actions

We are closing this as there was no activity in this issue for the last 90 days. Please reopen if you’d like to discuss anything further.

@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Aug 15, 2023