
[Helm/Release] uninstall should be able to wait #1739

Closed
gitfool opened this issue Sep 30, 2021 · 1 comment · Fixed by #1742
Labels: area/helm · helm-release-ga-blockers (Items blocking Helm Release GA) · kind/bug (Some behavior is incorrect or out of spec) · resolution/fixed (This issue was fixed)
Milestone: 0.63

Comments


gitfool commented Sep 30, 2021

Issue details

When installing a series of Helm charts using the preview Helm Release feature, with a dependency chain between these Helm releases and other resources, destroying the stack does not wait for a Helm release's uninstall to complete before continuing to the resources it depends on. Dependent resources with finalizers (owned by the Helm release) are therefore orphaned, and the destroy eventually times out and fails.

Steps to reproduce

The following is a .NET (C#) extract of a very specific case that captures the gist of the issue:

// aws load balancer controller; https://github.com/aws/eks-charts/tree/master/stable/aws-load-balancer-controller
Logger.LogDebug("Installing aws load balancer controller");
var awsLbcRole = new RoleX($"{k8sPrefix}-aws-load-balancer-controller",
    new RoleXArgs
    {
        AssumeRolePolicy = IamHelpers.AssumeRoleForServiceAccount(oidcArn, oidcUrl, "kube-system", "aws-load-balancer-controller", awsProvider),
        InlinePolicies = { ["policy"] = ReadResource("AwsLoadBalancerPolicy.json") }
    },
    new ComponentResourceOptions { Provider = awsProvider });

var awsLbcCrds = new ConfigGroup("aws-load-balancer-controller-crds",
    new ConfigGroupArgs { Yaml = ReadResource("AwsLoadBalancerCrds.yaml") },
    new ComponentResourceOptions { Provider = k8sProvider });

// workaround https://github.com/pulumi/pulumi-kubernetes/issues/1725
var _clusterName = OutputUtilities.GetValueAsync(clusterName).GetAwaiter().GetResult();
var _awsLbcRoleArn = OutputUtilities.GetValueAsync(awsLbcRole.Arn).GetAwaiter().GetResult();

var awsLbcValues =
    new Dictionary<string, object>
    {
        ["clusterName"] = _clusterName,
        ["enableCertManager"] = true,
        ["serviceAccount"] = new { annotations = new Dictionary<string, string> { ["eks.amazonaws.com/role-arn"] = _awsLbcRoleArn } }
    }.ToDictionary(); // workaround https://github.com/pulumi/pulumi/issues/8013

var awsLbcRelease = new Release("aws-load-balancer-controller", // ingress records with alb.ingress.kubernetes.io annotations depend on chart finalizers
    new ReleaseArgs
    {
        Namespace = "kube-system",
        Name = "aws-load-balancer-controller",
        RepositoryOpts = new RepositoryOptsArgs { Repo = "https://aws.github.io/eks-charts" },
        Chart = "aws-load-balancer-controller",
        Version = K8sConfig.AwsLbcChartVersion,
        Values = awsLbcValues,
        Atomic = true,
        SkipCrds = true
    },
    new CustomResourceOptions { DependsOn = { awsLbcCrds, certManagerRelease }, Provider = k8sProvider });

...

// kube prometheus stack; https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack
Logger.LogDebug("Installing kube prometheus stack");
var monitoringNs = new Namespace("monitoring",
    new NamespaceArgs { Metadata = new ObjectMetaArgs { Name = "monitoring" } },
    new CustomResourceOptions { Provider = k8sProvider });

GrafanaPassword = string.IsNullOrEmpty(K8sConfig.GrafanaPassword)
    ? new RandomPassword("kps-grafana-password", new RandomPasswordArgs { Length = 16, Special = true }).Result
    : Output.CreateSecret(K8sConfig.GrafanaPassword);

// workaround https://github.com/pulumi/pulumi-kubernetes/issues/1725
var _grafanaPassword = OutputUtilities.GetValueAsync(GrafanaPassword).GetAwaiter().GetResult();

var kubePrometheusStackValues =
    new Dictionary<string, object>
    {
        ["nameOverride"] = "kps",
        ["fullnameOverride"] = "kps",
        ["kubeVersionOverride"] = K8sConfig.Version,
        ["kubeTargetVersionOverride"] = K8sConfig.Version,
        ["alertmanager"] = new
        {
            alertmanagerSpec = new
            {
                storage = new
                {
                    volumeClaimTemplate = new
                    {
                        metadata = new { name = "alertmanager-db" },
                        spec = new
                        {
                            storageClassName = "efs-sc",
                            accessModes = new[] { "ReadWriteOnce" },
                            resources = new { requests = new { storage = "10Gi" } } // mandatory but ignored by efs
                        }
                    }
                },
                nodeSelector = new { role = "monitoring" },
                tolerations = new[] { new { key = "role", @operator = "Exists" } }
            },
            ingress = new
            {
                enabled = true,
                hosts = new[] { $"alertmanager.{AwsConfig.Route53.Internal.Domain}" },
                pathType = "Prefix",
                annotations = new Dictionary<string, string>
                {
                    ["alb.ingress.kubernetes.io/load-balancer-name"] = $"{EnvName}-{AwsConfig.Alb.Internal.Default.Group}",
                    ["alb.ingress.kubernetes.io/scheme"] = AwsConfig.Alb.Internal.Default.Scheme,
                    ["alb.ingress.kubernetes.io/group.name"] = AwsConfig.Alb.Internal.Default.Group,
                    ["alb.ingress.kubernetes.io/target-type"] = AwsConfig.Alb.Internal.Default.TargetType,
                    ["alb.ingress.kubernetes.io/healthcheck-path"] = "/-/healthy",
                    ["external-dns.alpha.kubernetes.io/hostname"] = $"alertmanager.{AwsConfig.Route53.Internal.Domain}",
                    ["kubernetes.io/ingress.class"] = AwsConfig.Alb.Internal.Default.Class
                }
            }
        },
        ["grafana"] = new
        {
            nameOverride = "kps-grafana",
            fullnameOverride = "kps-grafana",
            adminPassword = _grafanaPassword,
            nodeSelector = new { role = "monitoring" },
            tolerations = new[] { new { key = "role", @operator = "Exists" } },
            ingress = new
            {
                enabled = true,
                hosts = new[] { $"grafana.{AwsConfig.Route53.Internal.Domain}" },
                pathType = "Prefix",
                annotations = new Dictionary<string, string>
                {
                    ["alb.ingress.kubernetes.io/load-balancer-name"] = $"{EnvName}-{AwsConfig.Alb.Internal.Default.Group}",
                    ["alb.ingress.kubernetes.io/scheme"] = AwsConfig.Alb.Internal.Default.Scheme,
                    ["alb.ingress.kubernetes.io/group.name"] = AwsConfig.Alb.Internal.Default.Group,
                    ["alb.ingress.kubernetes.io/target-type"] = AwsConfig.Alb.Internal.Default.TargetType,
                    ["alb.ingress.kubernetes.io/healthcheck-path"] = "/api/health",
                    ["external-dns.alpha.kubernetes.io/hostname"] = $"grafana.{AwsConfig.Route53.Internal.Domain}",
                    ["kubernetes.io/ingress.class"] = AwsConfig.Alb.Internal.Default.Class
                }
            }
        },
        ["kubeControllerManager"] = new { enabled = false },
        ["kubeEtcd"] = new { enabled = false },
        //["kubelet"] = new
        //{
        //    serviceMonitor = new
        //    {
        //        metricRelabelings = new[]
        //        {
        //            new
        //            {
        //                sourceLabels = new[] { "node" },
        //                targetLabel = "instance"
        //            }
        //        }
        //    }
        //},
        ["kubeScheduler"] = new { enabled = false },
        ["kube-state-metrics"] = new
        {
            nameOverride = "kps-kube-state-metrics",
            fullnameOverride = "kps-kube-state-metrics",
            nodeSelector = new { role = "monitoring" },
            tolerations = new[] { new { key = "role", @operator = "Exists" } }
        },
        ["prometheus"] = new
        {
            prometheusSpec = new
            {
                podMonitorSelectorNilUsesHelmValues = false,
                probeSelectorNilUsesHelmValues = false,
                ruleSelectorNilUsesHelmValues = false,
                serviceMonitorSelectorNilUsesHelmValues = false,
                retention = "30d",
                storageSpec = new
                {
                    volumeClaimTemplate = new
                    {
                        metadata = new { name = "prometheus-db" },
                        spec = new
                        {
                            storageClassName = "efs-sc",
                            accessModes = new[] { "ReadWriteOnce" },
                            resources = new { requests = new { storage = "150Gi" } } // mandatory but ignored by efs
                        }
                    }
                },
                nodeSelector = new { role = "monitoring" },
                tolerations = new[] { new { key = "role", @operator = "Exists" } }
            },
            ingress = new
            {
                enabled = true,
                hosts = new[] { $"prometheus.{AwsConfig.Route53.Internal.Domain}" },
                pathType = "Prefix",
                annotations = new Dictionary<string, string>
                {
                    ["alb.ingress.kubernetes.io/load-balancer-name"] = $"{EnvName}-{AwsConfig.Alb.Internal.Default.Group}",
                    ["alb.ingress.kubernetes.io/scheme"] = AwsConfig.Alb.Internal.Default.Scheme,
                    ["alb.ingress.kubernetes.io/group.name"] = AwsConfig.Alb.Internal.Default.Group,
                    ["alb.ingress.kubernetes.io/target-type"] = AwsConfig.Alb.Internal.Default.TargetType,
                    ["alb.ingress.kubernetes.io/healthcheck-path"] = "/-/healthy",
                    ["external-dns.alpha.kubernetes.io/hostname"] = $"prometheus.{AwsConfig.Route53.Internal.Domain}",
                    ["kubernetes.io/ingress.class"] = AwsConfig.Alb.Internal.Default.Class
                }
            }
        },
        //["nodeExporter"] = new
        //{
        //    serviceMonitor = new
        //    {
        //        relabelings = new[]
        //        {
        //            new
        //            {
        //                sourceLabels = new[] { "__meta_kubernetes_pod_node_name" },
        //                targetLabel = "kubernetes_node"
        //            }
        //        }
        //    }
        //},
        ["prometheus-node-exporter"] = new
        {
            nameOverride = "kps-prometheus-node-exporter",
            fullnameOverride = "kps-prometheus-node-exporter",
            tolerations = new[] { new { key = "role", @operator = "Exists" } }
        },
        ["prometheusOperator"] = new
        {
            admissionWebhooks = new
            {
                certManager = new { enabled = true },
                patch = new
                {
                    nodeSelector = new { role = "monitoring" },
                    tolerations = new[] { new { key = "role", @operator = "Exists" } }
                }
            },
            nodeSelector = new { role = "monitoring" },
            tolerations = new[] { new { key = "role", @operator = "Exists" } }
        }
    }.ToDictionary(); // workaround https://github.com/pulumi/pulumi/issues/8013

var kubePrometheusStackRelease = new Release("kube-prometheus-stack",
    new ReleaseArgs
    {
        Namespace = "monitoring",
        Name = "kube-prometheus-stack",
        RepositoryOpts = new RepositoryOptsArgs { Repo = "https://prometheus-community.github.io/helm-charts" },
        Chart = "kube-prometheus-stack",
        Version = K8sConfig.KubePrometheusStackChartVersion,
        Values = kubePrometheusStackValues,
        Atomic = true,
        SkipCrds = true
    },
    new CustomResourceOptions { DependsOn = { awsLbcRelease /* finalizers */, certManagerRelease, efsDriverRelease /* finalizers */, kubePrometheusStackCrds, monitoringNs }, Provider = k8sProvider });

TL;DR: the kps Helm release depends on the aws lbc Helm release since it creates Ingress records that are annotated for the aws lbc, which has the side effect of creating an AWS ALB for the given group. The dependency in this case matters not so much for creation as for deletion: the aws lbc injects a finalizer into each Ingress record so that it can tear down the AWS ALB when the reference count drops to zero. If the aws lbc is deleted first, the Ingress records are orphaned waiting on the finalizer, and consequently the monitoring namespace containing the Ingress records cannot be deleted either.

It is obviously important to respect the dependency chain, specifically tearing down resources in the chain in reverse order, but as part of that process the destroy needs to wait synchronously for dependent resources to complete their teardown before continuing.

I'm not sure whether it's better to wait only when configured to do so, or based on tracked dependencies, or to always wait, which would simplify the implementation, assuming it would not hurt performance by blocking concurrent destruction of resources, and given that destroying a stack should ultimately wait for all resources to be destroyed anyway.
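
For illustration only, such an opt-in might be surfaced on the Release resource itself. The property name in the sketch below is hypothetical (it is not part of the current ReleaseArgs API) and only shows the shape the option could take:

// Hypothetical sketch only: "WaitForUninstall" is not an existing ReleaseArgs option.
// The intent is that, on destroy, Pulumi would block until this release's resources
// (e.g. the Ingress records carrying aws lbc finalizers) are fully deleted before
// moving on to uninstall the releases it depends on.
var kubePrometheusStackRelease = new Release("kube-prometheus-stack",
    new ReleaseArgs
    {
        Namespace = "monitoring",
        Chart = "kube-prometheus-stack",
        RepositoryOpts = new RepositoryOptsArgs { Repo = "https://prometheus-community.github.io/helm-charts" },
        Values = kubePrometheusStackValues,
        Atomic = true
        // WaitForUninstall = true // hypothetical opt-in; not valid against today's API
    },
    new CustomResourceOptions { DependsOn = { awsLbcRelease }, Provider = k8sProvider });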

References:

  • The helm uninstall CLI has --wait and --timeout options (see the example below).
  • Presumably the Helm SDK used by pulumi-kubernetes supports these options too?
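
For reference, the equivalent behavior at the Helm CLI level (with a recent Helm version that supports waiting on uninstall) looks something like this; the release name, namespace, and timeout are illustrative:

helm uninstall aws-load-balancer-controller --namespace kube-system --wait --timeout 5m
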
@gitfool added the kind/bug (Some behavior is incorrect or out of spec) label on Sep 30, 2021
@viveklak added the helm-release-ga-blockers (Items blocking Helm Release GA) label on Sep 30, 2021
@mikhailshilkov added this to the 0.63 milestone on Oct 6, 2021
@pulumi-bot added the resolution/fixed (This issue was fixed) label on Oct 6, 2021

gitfool commented Oct 7, 2021

@viveklak since you didn't have a test case, I'm confirming that #1742 fixes this issue. Thanks! 🎉
