Cloudwatch Log Group created with options does not get auto destroyed #920

Closed
dwaiba opened this issue Jun 15, 2020 · 49 comments · Fixed by #1594

Comments

@dwaiba

dwaiba commented Jun 15, 2020

I have issues

I'm submitting a...

  • bug report
  • feature request
  • support request - read the FAQ first!
  • kudos, thank you, warm fuzzy

What is the current behavior?

When creating with the following options, the cluster gets created nicely, but on destroy and re-create it throws:

Error: Creating CloudWatch Log Group failed: ResourceAlreadyExistsException: The specified log group already exists: The CloudWatch Log Group '/aws/eks/test-eks/cluster' already exists.

  on .terraform/modules/eks-cluster/terraform-aws-eks-12.1.0/cluster.tf line 1, in resource "aws_cloudwatch_log_group" "this":
   1: resource "aws_cloudwatch_log_group" "this" {

module "eks-cluster" {
  source                    = "terraform-aws-modules/eks/aws"
  create_eks                = lookup(var.eks_params, "createeks")
  cluster_name              = lookup(var.eks_params, "cluster_name")
  cluster_version           = lookup(var.eks_params, "cluster_version")
  subnets                   = module.vpc.public_subnets
  vpc_id                    = module.vpc.vpc_id
  cluster_enabled_log_types = ["api", "audit", "authenticator", "controllerManager", "scheduler"]

  enable_irsa = lookup(var.eks_params, "enable_irsa")

  worker_groups = [
    {
      name                 = "worker-group-1"
      instance_type        = "t2.medium"
      asg_desired_capacity = 1
      asg_max_size         = 5
      tags = [
        {
          "key"                 = "k8s.io/cluster-autoscaler/enabled"
          "propagate_at_launch" = "true"
          "value"               = "true"
        },
        {
          "key"                 = "k8s.io/cluster-autoscaler/${lookup(var.eks_params, "cluster_name")}"
          "propagate_at_launch" = "true"
          "value"               = "true"
        }
      ]
    },
    {
      name                 = "worker-group-2"
      instance_type        = "t2.medium"
      asg_desired_capacity = 2
      asg_max_size         = 5
      tags = [
        {
          "key"                 = "k8s.io/cluster-autoscaler/enabled"
          "propagate_at_launch" = "true"
          "value"               = "true"
        },
        {
          "key"                 = "k8s.io/cluster-autoscaler/${lookup(var.eks_params, "cluster_name")}"
          "propagate_at_launch" = "true"
          "value"               = "true"
        }
      ]
    }
  ]
}

If this is a bug, how to reproduce? Please include a code sample if relevant.

Create with the above options and destroy

What's the expected behavior?

The Destroy should take care of the Cloudwatch Log Group created with the options.

Are you able to fix this problem and submit a PR? Link here if you have already.

NA

Environment details

  • Affected module version: v12.1.0
  • OS:
  • Terraform version:
    Terraform v0.12.24
    provider.aws v2.66.0
    provider.kubernetes v1.11.3
    provider.local v1.4.0
    provider.null v2.1.2
    provider.random v2.2.1
    provider.template v2.1.2

Any other relevant info

@milesarmstrong

milesarmstrong commented Jun 18, 2020

We're seeing this issue too. It seems that Terraform does correctly destroy the log group, but logs are still shipped to CloudWatch for some time after AWS says an EKS cluster has been deleted, causing CloudWatch to recreate the log group.

@milesarmstrong

milesarmstrong commented Jun 18, 2020

We're doing terraform destroy followed by terraform apply in two CI stages, and we've added a stage in-between the two that blocks until the log group is really gone.

We've seen it take ~3 minutes to stop being recreated.

function get_log_group {
  aws logs describe-log-groups --log-group-name-prefix ${LOG_GROUP_NAME} --query 'length(logGroups)'
}

REALLY_GONE=0

while [[ $REALLY_GONE -eq 0 ]]; do
  while [[ $(get_log_group) -gt 0 ]]; do
    echo "${LOG_GROUP_NAME} still exists, deleting it..."
    aws logs delete-log-group --log-group-name ${LOG_GROUP_NAME}
    sleep 5
  done
  echo "${LOG_GROUP_NAME} has been deleted. Waiting 300 seconds to validate it is not recreated..."
  sleep 300
  if [[ $(get_log_group) -eq 0 ]]; then
    REALLY_GONE=1
  fi
done

@maiconbaumx

This is happening because the EKS cluster gets destroyed after Terraform deletes the CloudWatch Log Group. The AmazonEKSServicePolicy IAM policy (which this module assigns to the EKS cluster role by default) has permission to CreateLogGroup and everything else needed to keep logging correctly. When Terraform destroys the CloudWatch Log Group, the still-running EKS cluster creates it again. Then, when you run terraform apply again, the CloudWatch Log Group no longer exists in your state (because Terraform really did destroy it), and Terraform doesn't know about the resource that was created outside of it.

@stevehipwell
Contributor

Could the Terraform config be changed to make sure that the EKS cluster has been destroyed before the CloudWatch log group is destroyed?

@barryib
Member

barryib commented Jun 24, 2020

This is happening because the EKS cluster gets destroyed after Terraform deletes the CloudWatch Log Group. The AmazonEKSServicePolicy IAM policy (which this module assigns to the EKS cluster role by default) has permission to CreateLogGroup and everything else needed to keep logging correctly. When Terraform destroys the CloudWatch Log Group, the still-running EKS cluster creates it again. Then, when you run terraform apply again, the CloudWatch Log Group no longer exists in your state (because Terraform really did destroy it), and Terraform doesn't know about the resource that was created outside of it.

Hmm, are you sure about that @maiconbaumx? There is an explicit dependency between the cluster resource and the CloudWatch log group. Technically, Terraform should destroy aws_eks_cluster.this before aws_cloudwatch_log_group.this.

@dwaiba
Author

dwaiba commented Jun 26, 2020

@barryib yep... that seems to be the case as in terraform-aws-modules/terraform-aws-vpc#435 and hashicorp/terraform#14750 too

@stale

stale bot commented Sep 24, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Sep 24, 2020
@stevehipwell
Contributor

Is there any news on this?

@stale stale bot removed the stale label Sep 24, 2020
@RavindraSinghKhichi

I am facing the same issue. Any updates on this? I'm expecting auto-deletion of the log group once the VPC is destroyed.

@ulfox

ulfox commented Nov 2, 2020

Fixed this by adding a depends_on to the cloudwatch log resource

@barryib
Member

barryib commented Nov 2, 2020

@ulfox can you please share your code snippet?

@ulfox

ulfox commented Nov 2, 2020

@ulfox can you please share your code snippet?

Sure

// ------------------------------------------------------------
// EKS Cluster Cloudwatch Group
// ------------------------------------------------------------
resource "aws_cloudwatch_log_group" "eks_cluster" {
      count             = local.cluster_log_types_enabled ? 1 : 0
      name              = local.eks_aws_cloudwatch_log_group
      retention_in_days = local.eks_log_retention_in_days

      tags = local.eks_cloudwatch_tags
}

// ------------------------------------------------------------
// EKS Cluster
// ------------------------------------------------------------
resource "aws_eks_cluster" "main" {
      name                      = var.eks_cluster_name
      role_arn                  = aws_iam_role.eks_cluster_role.arn
      enabled_cluster_log_types = local.cluster_log_types

      version = local.eks_version

      vpc_config {
            subnet_ids = local.eks_subnet_ids
            security_group_ids = [
                  aws_security_group.ControlPlaneSecurityGroup.id,
            ]
            endpoint_private_access = var.eks_endpoint_private_access
            endpoint_public_access  = var.eks_endpoint_public_access
      }

      timeouts {
            delete = var.eks_delete_timeout
      }

      depends_on = [
            aws_iam_role_policy_attachment.AmazonEKSClusterPolicy,
            aws_iam_role_policy_attachment.AmazonEKSServicePolicy,
            aws_cloudwatch_log_group.eks_cluster.0
      ]
}

@barryib
Member

barryib commented Nov 3, 2020

We already have that depends_on in this module. See https://github.com/terraform-aws-modules/terraform-aws-eks/blob/v13.1.0/cluster.tf#L47

@ulfox

ulfox commented Nov 6, 2020

We already have that depends_on in this module. See https://github.com/terraform-aws-modules/terraform-aws-eks/blob/v13.1.0/cluster.tf#L47

Apologies, I missed the module part. I thought the comment was about the resource directly.

I switched my code to use the module, cheers.

@barryib
Member

barryib commented Nov 14, 2020

FWIW, I think this issue is not related to Terraform or this module; it's more about how AWS buffers logs and flushes them into the CloudWatch log group. See also:

@barryib
Member

barryib commented Nov 15, 2020

I was wondering if we shouldn't deny logs:CreateLogGroup for the cluster if we create the CloudWatch log group ourselves with Terraform.

@max-rocket-internet @antonbabenko Any thoughts? I think this is also true for the VPC module's flow logs log group.
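
For illustration, a minimal sketch of that idea outside the module, assuming a standalone cluster IAM role (aws_iam_role.cluster) and a Terraform-managed log group; the names are hypothetical and this is not the module's actual code:

resource "aws_cloudwatch_log_group" "eks" {
  name              = "/aws/eks/example/cluster"
  retention_in_days = 90
}

# Deny the cluster role the ability to (re)create the Terraform-managed log group,
# so logs flushed after destroy cannot bring it back outside of Terraform.
resource "aws_iam_role_policy" "deny_create_log_group" {
  name = "deny-create-log-group"
  role = aws_iam_role.cluster.id # assumed pre-existing cluster role

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect   = "Deny"
        Action   = ["logs:CreateLogGroup"]
        Resource = aws_cloudwatch_log_group.eks.arn
      },
    ]
  })
}

An explicit Deny takes precedence over the Allow in the AWS-managed policies, which is what makes this kind of overlay work.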

@antonbabenko
Member

Yes, I think you are right.

If I understand correctly, the VPC flow logs permissions currently allow logs:CreateLogGroup so the log group can be created if necessary (https://github.com/terraform-aws-modules/terraform-aws-vpc/blob/master/vpc-flow-logs.tf#L5). I think it should be reduced from * to just the necessary log group name.

There is related discussion (hashicorp/terraform-provider-aws#902) and a note in terraform-aws-provider/ROADMAP.md.

@barryib
Member

barryib commented Nov 15, 2020

But will reducing the resource from * to the necessary log group name prevent log group recreation by the VPC flow log service (vpc-flow-logs.amazonaws.com) or the EKS cluster?

Unless you mean to say that we should deny logs:CreateLogGroup for the necessary log group name.

@antonbabenko
Member

It will probably work fine if permissions are scoped to more precise resources instead of *, but I would recommend checking the AWS docs for that.

@mikehodgk

mikehodgk commented Nov 20, 2020

One thing that may be considered here is to allow the creation of the log group externally.

For my use case I have a cluster that I want to destroy every night and recreate every morning (development environment), however keeping the logs from previous days could be useful. Perhaps using 0.13's module-level depends_on, users could make sure the log group is at the top of the food chain?
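
For illustration, a rough sketch of that shape, assuming the module gained a flag to skip creating its own log group (create_cloudwatch_log_group is a hypothetical input name here) and using Terraform 0.13's module-level depends_on; values are placeholders:

# Log group owned outside the module, so it survives the nightly destroy/recreate.
resource "aws_cloudwatch_log_group" "eks" {
  name              = "/aws/eks/dev-cluster/cluster"
  retention_in_days = 30
}

module "eks" {
  source = "terraform-aws-modules/eks/aws"

  cluster_name              = "dev-cluster"
  cluster_version           = "1.18"
  vpc_id                    = module.vpc.vpc_id
  subnets                   = module.vpc.public_subnets
  cluster_enabled_log_types = ["api", "audit"]

  # Hypothetical flag: tell the module not to manage the log group itself.
  create_cloudwatch_log_group = false

  # Terraform >= 0.13: keep the externally managed log group at the top of the food chain.
  depends_on = [aws_cloudwatch_log_group.eks]
}

With the log group owned outside the module, a nightly terraform destroy of the cluster leaves the previous days' logs in place.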

@barryib
Member

barryib commented Nov 20, 2020

One thing that may be considered here is to allow the creation of the log group externally.

For my use case I have a cluster that I want to destroy every night and recreate every morning (development environment), however keeping the logs from previous days could be useful. Perhaps using 0.13's module-level depends_on, users could make sure the log group is at the top of the food chain?

@ostmike Yes, we can add an option to let users create and manage their own aws_cloudwatch_log_group. Please open a feature request issue for this.

I'll be happy to review any PR for that.

But to be clear, that wouldn't make any difference in a busy EKS environment, because EKS will recreate the CloudWatch log group just after Terraform destroys it if there are logs left.

@den-is

den-is commented Jan 25, 2021

That's still an issue.
You have to "manually" destroy the CloudWatch Log Group if you want to destroy an existing cluster and its resources and recreate the cluster with exactly the same name, rather than creating a cluster with some random new name.

@stale

stale bot commented Apr 25, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Apr 25, 2021
@navaati

navaati commented May 3, 2021

Wowowow, hol’up Mr. Bot, don’t stale, this is still an issue! So yes, as @barryib said, the solution, same as for the VPC module which had the same problem, is to deny the logs:CreateLogGroup permission in the service role: it is not needed since we create the log group ourselves with Terraform, and it is causing trouble. See the implementation for the VPC module: terraform-aws-modules/terraform-aws-vpc#550. @antonbabenko you looked at it the other day, so you may remember the details.

Now the situation here is a tad more complex because we don’t create the policy ourselves; we use the AWS-provided AmazonEKSClusterPolicy, so it’s not like we can apply the exact same patch as for the VPC module. The solution might take the form of adding a specific deny on top of the AWS policy (which we still want to use, to benefit from it always being up to date with EKS). Not so sure yet how to do that. @barryib, is this what you were thinking about? Do you know how to do it?

(EDIT: note for later, this is where the magic happens https://github.com/terraform-aws-modules/terraform-aws-eks/blob/master/cluster.tf#L119 and the three resources below that)

@stale stale bot removed the stale label May 3, 2021
@haarchri
Contributor

We tested it with #1594, denying the create-log-group permission, and we managed to get the auto destroy working. Please have a look.
Thanks guys!

@stale

stale bot commented Oct 21, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Oct 21, 2021
@navaati

navaati commented Oct 21, 2021

@haarchri Oh yeah! I knew something was possible, I just didn’t know what, and this is it! Overlaying a "Deny" policy over the existing Amazon policy so as not to diverge from it, that’s very good. And you did all the docs as well. I’ll leave a comment on your PR, but I’m not a maintainer so I can’t approve it officially.

@ Mr stale bot: not stale :)

@antonbabenko
Member

This issue has been resolved in version 17.24.0 🎉

@mikeroll

mikeroll commented Feb 8, 2022

Having the same symptoms when destroying and recreating the cluster on v18.4.1. I can't seem to find the IAM hack introduced by #1594 anymore - was it lost during the v18 update or removed intentionally for some reason?

@bryantbiggs
Member

Having the same symptoms when destroying and recreating the cluster on v18.4.1. I can't seem to find the IAM hack introduced by #1594 anymore - was it lost during the v18 update or removed intentionally for some reason?

This was tested, deemed no longer required, and therefore removed in v18. I test this module quite often throughout the week for various PRs and do not see any issues with CloudWatch logs not deleting. Please feel free to open a new issue with reproduction steps if necessary.

@daroga0002
Contributor

This was tested, deemed no longer required, and therefore removed in v18. I test this module quite often throughout the week for various PRs and do not see any issues with CloudWatch logs not deleting. Please feel free to open a new issue with reproduction steps if necessary.

@bryantbiggs you cannot replicate this because you don't have a workload on the cluster. This is caused by the general latency of CloudWatch log processing under heavy load. The EKS control plane produces logs and is destroyed, the CloudWatch log group is also destroyed, and then CloudWatch's internal AWS queue recreates the log group (after it has already been removed by Terraform).

@bryantbiggs
Member

This was tested, deemed no longer required, and therefore removed in v18. I test this module quite often throughout the week for various PRs and do not see any issues with CloudWatch logs not deleting. Please feel free to open a new issue with reproduction steps if necessary.

@bryantbiggs you cannot replicate this because you don't have a workload on the cluster. This is caused by the general latency of CloudWatch log processing under heavy load. The EKS control plane produces logs and is destroyed, the CloudWatch log group is also destroyed, and then CloudWatch's internal AWS queue recreates the log group (after it has already been removed by Terraform).

If that's the case, it's still not the concern of the module. It sounds like users need to delete/stop workloads before destroying the cluster.

@daroga0002
Contributor

If that's the case, it's still not the concern of the module. It sounds like users need to delete/stop workloads before destroying the cluster.

I don't think it makes sense to put such a requirement on the destroy operation.
If I create a development cluster, deploy something, test my apps, and then delete the whole cluster, I don't want to waste time removing deployments that will be destroyed during cluster removal anyway.

The other case is that AWS doesn't guarantee how fast logs will be written to the CloudWatch log group. So if I enable audit logging on the cluster, even with no workloads there are a lot of API requests being logged. So again, the EKS control plane is removed, logs are still flowing, but Terraform has already removed the CloudWatch log group, so it gets recreated.

@haarchri
Contributor

haarchri commented Feb 9, 2022

@bryantbiggs why then was the overlay deny removed in v18, when the only option now is to say we need to delete/stop workloads before destroying? We delivered that fix precisely so this would work.

@bryantbiggs
Member

I believe I answered this above. Users can attach additional policies as needed, so feel free to do so in your configuration if that is what you need.
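
For anyone wanting to reinstate that guard themselves on v18, a possible sketch, assuming the module version in use exposes an iam_role_additional_policies input for extra policy ARNs (check the variables for your exact version); the policy name and values are illustrative:

# Illustrative standalone Deny policy; attaching it to the cluster role blocks
# the control plane from recreating the log group after Terraform deletes it.
resource "aws_iam_policy" "deny_create_log_group" {
  name = "eks-deny-create-log-group"

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect   = "Deny"
        Action   = ["logs:CreateLogGroup"]
        Resource = "*"
      },
    ]
  })
}

module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 18.0"

  # ... other cluster inputs ...

  # Assumed v18.x input that attaches extra policy ARNs to the cluster IAM role.
  iam_role_additional_policies = [aws_iam_policy.deny_create_log_group.arn]
}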

@timblaktu

timblaktu commented Feb 22, 2022

I hit this recently on a simple EKS cluster with virtually no workload running when I terraform destroyed it. I was reading through the issues on this: this, this, and this, none of which really seem to completely fix it, so I'm just going to manually delete the already-existing resources for now and move on. FWIW, in my case, I reproduced this while creating/destroying the cluster in fairly rapid succession.

@trallnag

@timblaktu, why is the explicit deny "hack" not working for you?

@timblaktu

I don't know. We're using v18.x so we should have the deny hack (you mean this, right?) in place, since it was introduced in 17.24.0. In my case, this is not easily reproducible, and I imagine it happens in extraordinary circumstances, like maybe if a terraform process (apply or destroy) aborted ungracefully or something...

@timblaktu

timblaktu commented Apr 3, 2022

@bryantbiggs I upgraded our module to 18.17.1 and will start a tight loop creating/destroying our infra. We don't run many workloads yet either, which might be why this is so rare. I ran into it again today, and suspect it happens in cases where we destroy immediately after create finishes.

@bryantbiggs with module v18.17.1 (and tf v1.1.7) we're still seeing the issue with the CloudWatch log group not getting destroyed. It's quite reproducible with my script, and now appears to happen on the very second iteration, so:

create-eks
destroy-eks
create-eks <-- error happens here.

I added the following workaround after the destroy operation in my Makefile:

CLOUDWATCH_LOG_GROUP_NAME="/aws/eks/devops-wheel-${ENVIRONMENT}/cluster"
if aws logs describe-log-groups | jq ".logGroups[].logGroupName" | grep --quiet "${CLOUDWATCH_LOG_GROUP_NAME}" ; then
  echo "WARNING: CloudWatch Log Group still exists! Deleting..."
  echo "  (See module issue for details: https://github.com/terraform-aws-modules/terraform-aws-eks/issues/920)"
  if aws logs delete-log-group --log-group-name "${CLOUDWATCH_LOG_GROUP_NAME}" ; then
     echo "Successfully deleted log group ${CLOUDWATCH_LOG_GROUP_NAME}"
  else
     echo "ERROR deleting log group ${CLOUDWATCH_LOG_GROUP_NAME}"
     exit 1
  fi
fi

and now, still using module v18.17.1 (and tf v1.1.7), I'm STILL SEEING THE ERROR when running my "create/destroy in a loop" script, even in cases where I can see in my test output that it detected the log group still existed and deleted it.

Is there a "grace period" after CloudWatch log groups are deleted during which they cannot be re-created? (Kind of like how Secrets Manager secrets have a recovery_window?)

EDIT: when looking at the log groups in the AWS Console, I see that the problem ones have their "Retention" setting set to "Never Expire", while others are set to "3 months". The latter comes from the fact that the cloudwatch_log_group_retention_in_days input defaults to 90 days, but I have no idea how/why the others are set to "Never Expire".

@pradeepnnv

Still facing this issue using terraform v1.2.9 and eks module "18.26.6".

@sirex99

sirex99 commented Oct 11, 2022

@bryantbiggs just to confirm, I'm also seeing this issue on the latest version. Could we get the code from #1594 re-added?

@bryantbiggs
Member

terraform-aws-eks/main.tf

Lines 266 to 287 in 7f90184

# https://github.com/terraform-aws-modules/terraform-aws-eks/issues/920
# Resources running on the cluster are still generating logs when destroying the module resources
# which results in the log group being re-created even after Terraform destroys it. Removing the
# ability for the cluster role to create the log group prevents this log group from being re-created
# outside of Terraform due to services still generating logs during destroy process
dynamic "inline_policy" {
for_each = var.create_cloudwatch_log_group ? [1] : []
content {
name = local.iam_role_name
policy = jsonencode({
Version = "2012-10-17"
Statement = [
{
Action = ["logs:CreateLogGroup"]
Effect = "Deny"
Resource = aws_cloudwatch_log_group.this[0].arn
},
]
})
}
}

@bryantbiggs
Member

Perhaps the ARN for the CloudWatch log group is not sufficient for this type of deny policy. This has been changed back to what the original PR had, with a wildcard "*", and is available in v18.30.1 (#2267).

@github-actions

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 11, 2022