Error: NodePool was created in the error state RUNNING_WITH_ERROR #10823

Closed
smmnazar opened this issue Jan 4, 2022 · 5 comments

Comments

smmnazar commented Jan 4, 2022

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request.
  • Please do not leave +1 or me too comments, they generate extra noise for issue followers and do not help prioritize the request.
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment.
  • If an issue is assigned to the modular-magician user, it is either in the process of being autogenerated, or is planned to be autogenerated soon. If an issue is assigned to a user, that user is claiming responsibility for the issue. If an issue is assigned to hashibot, a community member has claimed the issue already.

Terraform Version

1.1.2

Affected Resource(s)

  • google_container_node_pool

Terraform Configuration Files

# GKE cluster
resource "google_container_cluster" "primary" {
  #count    = var.destroy_infra ? 1 : 0
  name     = var.clustername
  location = var.regionname
  remove_default_node_pool = var.remove_defaultnode
  initial_node_count       = var.initialnode_count

  network    = google_compute_network.vpc.name
  subnetwork = google_compute_subnetwork.subnet.name
}

# Separately Managed Node Pool
resource "google_container_node_pool" "primary_nodes" {
  #count    	 = var.destroy_infra ? 1 : 0
  name       = "${google_container_cluster.primary.name}-node"
  location   = var.regionname
  cluster    = google_container_cluster.primary.name
  node_count = var.gke_num_nodes

  node_config {
    oauth_scopes = [
      "https://www.googleapis.com/auth/logging.write",
      "https://www.googleapis.com/auth/monitoring",
    ]

    labels = {
      env = var.project_id
    }

    # preemptible  = true
    machine_type = "g1-small"
    tags         = ["gke-node", var.clustername]
    metadata = {
      disable-legacy-endpoints = "true"
    }
  }

  lifecycle {
    ignore_changes = [
      initial_node_count
    ]
  }
}

Debug Output

Panic Output

Error: NodePool cp-ofs-poc-gke-cluster-node was created in the error state "RUNNING_WITH_ERROR"

Expected Behavior

The node pool should be created and its status should be RUNNING.

Actual Behavior

The node pool was created in the error state "RUNNING_WITH_ERROR".

Steps to Reproduce

  1. terraform apply
@smmnazar smmnazar added the bug label Jan 4, 2022

bon77 commented Feb 14, 2022

Not sure if it is related, but I get a similar problem in asia-northeast1 and asia-northeast2, while the same code works fine in asia-northeast3.

rileykarson (Collaborator) commented

If you can capture debug logs with export TF_LOG=DEBUG, that would help! There are entirely valid reasons for a node pool to be in an error state, so this may indicate a GCP problem rather than a provider one (e.g. regional differences, as @bon77 pointed out).

There's probably room to improve the error message at least, provided the API returns a useful one.

haggishunk commented

Can you check the node pool in the GCP console?

I recently ran into this issue, and the console gave more information about the error: IP exhaustion in the secondary IP range. I was using /24 blocks for the secondary IP range (set by another Terraform module or via the console), and the cluster was being created with the default 110 pods per node. This article helped me quite a bit in understanding how that exhaustion can happen:

https://cloud.google.com/kubernetes-engine/docs/how-to/multi-pod-cidr

I changed the default pods per node to a modest 16 and voilà.
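
For anyone hitting the same wall, here is a minimal sketch of how the pod density can be lowered in Terraform, assuming a VPC-native cluster. As far as I can tell the relevant arguments are default_max_pods_per_node on google_container_cluster and max_pods_per_node on google_container_node_pool; the secondary range names and the values below are illustrative only, not taken from the original config.

# Hedged sketch: lower pod density so a /24 secondary range is not exhausted.
# Assumes a VPC-native (alias IP) cluster; range names and values are illustrative.
resource "google_container_cluster" "primary" {
  name     = var.clustername
  location = var.regionname

  network    = google_compute_network.vpc.name
  subnetwork = google_compute_subnetwork.subnet.name

  # VPC-native networking is required for per-node pod limits.
  ip_allocation_policy {
    cluster_secondary_range_name  = "pods"     # hypothetical secondary range name
    services_secondary_range_name = "services" # hypothetical secondary range name
  }

  # Cluster-wide default; with the default of 110 pods per node, GKE reserves
  # a full /24 of pod IPs per node, which exhausts a small secondary range fast.
  default_max_pods_per_node = 16

  remove_default_node_pool = true
  initial_node_count       = 1
}

resource "google_container_node_pool" "primary_nodes" {
  name       = "${google_container_cluster.primary.name}-node"
  location   = var.regionname
  cluster    = google_container_cluster.primary.name
  node_count = var.gke_num_nodes

  # Optional per-pool override of the cluster-wide default.
  max_pods_per_node = 16

  node_config {
    machine_type = "g1-small"
  }
}

If I read the linked doc correctly, capping pods at 16 per node means GKE reserves a /27 of pod IPs per node instead of a /24, so the same secondary range supports many more nodes.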

smmnazar (Author) commented

> If you can capture debug logs with export TF_LOG=DEBUG, that would help! There are entirely valid reasons for a node pool to be in an error state, so this may indicate a GCP problem rather than a provider one (e.g. regional differences, as @bon77 pointed out).
>
> There's probably room to improve the error message at least, provided the API returns a useful one.

This helped to resolve the issue. Thanks

github-actions (bot) commented

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked this issue as resolved and limited conversation to collaborators on Sep 25, 2022