
Autoscaling doesn't work #1265

Closed
homergleason opened this issue Mar 7, 2024 · 1 comment

homergleason commented Mar 7, 2024

Description

Hello, I’ve been struggling with this all day and I just can’t get autoscaling to start.

The log shows "Set node group draining-node-pool size from 0 to 0, expected delta 0", although the configuration says something completely different.

I0307 22:19:05.344291       1 static_autoscaler.go:290] Starting main loop
2024-03-07T22:19:05.344485619Z I0307 22:19:05.344378       1 hetzner_servers_cache.go:116] Current serversCache len: 1
2024-03-07T22:19:05.344500697Z I0307 22:19:05.344393       1 hetzner_node_group.go:559] Set node group draining-node-pool size from 0 to 0, expected delta 0
I0307 22:19:05.344399       1 hetzner_servers_cache.go:116] Current serversCache len: 1
2024-03-07T22:19:05.344520926Z I0307 22:19:05.344403       1 hetzner_node_group.go:559] Set node group prod-autoscaled-peak-load size from 0 to 0, expected delta 0
I0307 22:19:05.344559       1 hetzner_servers_cache.go:116] Current serversCache len: 1
I0307 22:19:05.344570       1 hetzner_servers_cache.go:116] Current serversCache len: 1
2024-03-07T22:19:05.344854887Z I0307 22:19:05.344578       1 hetzner_server_type_cache.go:115] Current serverTypeCache len: 0
W0307 22:19:05.344582       1 hetzner_server_type_cache.go:93] Fetching server types from Hetzner API
I0307 22:19:05.473846       1 hetzner_server_type_cache.go:115] Current serverTypeCache len: 1
2024-03-07T22:19:05.473989232Z I0307 22:19:05.473867       1 hetzner_node_group.go:340] Build node group label for draining-node-pool
2024-03-07T22:19:05.474000223Z I0307 22:19:05.473875       1 hetzner_node_group.go:354] draining-node-pool nodegroup labels: map[beta.kubernetes.io/instance-type:cx11 csi.hetzner.cloud/location:fsn1 hcloud/node-group:draining-node-pool kubernetes.io/arch:amd64 topology.kubernetes.io/region:fsn1]
2024-03-07T22:19:05.474085815Z I0307 22:19:05.473995       1 hetzner_server_type_cache.go:115] Current serverTypeCache len: 1
2024-03-07T22:19:05.474099982Z I0307 22:19:05.474036       1 hetzner_server_type_cache.go:115] Current serverTypeCache len: 1
I0307 22:19:05.474041       1 hetzner_node_group.go:340] Build node group label for prod-autoscaled-peak-load
I0307 22:19:05.474047       1 hetzner_node_group.go:354] prod-autoscaled-peak-load nodegroup labels: map[beta.kubernetes.io/instance-type:cpx21 csi.hetzner.cloud/location:hel1 hcloud/node-group:prod-autoscaled-peak-load kubernetes.io/arch:amd64 topology.kubernetes.io/region:hel1]
I0307 22:19:05.474137       1 node_instances_cache.go:132] Get cached cloud provider node instances for prod-autoscaled-peak-load
I0307 22:19:05.474175       1 node_instances_cache.go:132] Get cached cloud provider node instances for draining-node-pool
I0307 22:19:05.474202       1 hetzner_servers_cache.go:116] Current serversCache len: 1
2024-03-07T22:19:05.475152902Z I0307 22:19:05.474212       1 hetzner_servers_cache.go:116] Current serversCache len: 1
2024-03-07T22:19:05.475176358Z I0307 22:19:05.474258       1 filter_out_schedulable.go:63] Filtering out schedulables
2024-03-07T22:19:05.475185675Z I0307 22:19:05.474267       1 klogx.go:87] Looking for place for default/solverimage-756ff448b8-8tch8
2024-03-07T22:19:05.475192758Z I0307 22:19:05.474338       1 klogx.go:87] failed to find place for default/solverimage-756ff448b8-8tch8: cannot put pod solverimage-756ff448b8-8tch8 on any node
2024-03-07T22:19:05.475217144Z I0307 22:19:05.474346       1 klogx.go:87] Looking for place for default/solverimage-756ff448b8-8xbt4
2024-03-07T22:19:05.475225159Z I0307 22:19:05.474418       1 klogx.go:87] failed to find place for default/solverimage-756ff448b8-8xbt4 based on similar pods scheduling
2024-03-07T22:19:05.475231842Z I0307 22:19:05.474426       1 klogx.go:87] Looking for place for default/solverimage-756ff448b8-dr88l
2024-03-07T22:19:05.475238415Z I0307 22:19:05.474468       1 klogx.go:87] failed to find place for default/solverimage-756ff448b8-dr88l based on similar pods scheduling
2024-03-07T22:19:05.475244917Z I0307 22:19:05.474478       1 filter_out_schedulable.go:120] 0 pods marked as unschedulable can be scheduled.
2024-03-07T22:19:05.475251529Z I0307 22:19:05.474487       1 filter_out_schedulable.go:83] No schedulable pods
2024-03-07T22:19:05.475258122Z I0307 22:19:05.474492       1 filter_out_daemon_sets.go:40] Filtering out daemon set pods
2024-03-07T22:19:05.475264774Z I0307 22:19:05.474497       1 filter_out_daemon_sets.go:49] Filtered out 0 daemon set pods, 3 unschedulable pods left
2024-03-07T22:19:05.475271276Z I0307 22:19:05.474504       1 klogx.go:87] Pod default/solverimage-756ff448b8-8tch8 is unschedulable
2024-03-07T22:19:05.475277728Z I0307 22:19:05.474507       1 klogx.go:87] Pod default/solverimage-756ff448b8-8xbt4 is unschedulable
2024-03-07T22:19:05.475284190Z I0307 22:19:05.474510       1 klogx.go:87] Pod default/solverimage-756ff448b8-dr88l is unschedulable
2024-03-07T22:19:05.475291745Z I0307 22:19:05.474617       1 orchestrator.go:108] Upcoming 0 nodes
2024-03-07T22:19:05.475298509Z I0307 22:19:05.474629       1 hetzner_servers_cache.go:116] Current serversCache len: 1
2024-03-07T22:19:05.475305021Z I0307 22:19:05.474634       1 hetzner_servers_cache.go:116] Current serversCache len: 1
2024-03-07T22:19:05.475311533Z E0307 22:19:05.474650       1 orchestrator.go:446] Couldn't get autoscaling options for ng: prod-autoscaled-peak-load
2024-03-07T22:19:05.475318035Z I0307 22:19:05.474656       1 orchestrator.go:440] Skipping node group draining-node-pool - max size reached
2024-03-07T22:19:05.475324848Z I0307 22:19:05.474727       1 orchestrator.go:542] Pod default/solverimage-756ff448b8-8tch8 can't be scheduled on prod-autoscaled-peak-load, predicate checking error: Insufficient cpu; predicateName=NodeResourcesFit; reasons: Insufficient cpu; debugInfo=
2024-03-07T22:19:05.475331590Z I0307 22:19:05.474734       1 orchestrator.go:544] 2 other pods similar to solverimage-756ff448b8-8tch8 can't be scheduled on prod-autoscaled-peak-load
2024-03-07T22:19:05.475338173Z I0307 22:19:05.474742       1 orchestrator.go:150] No pod can fit to prod-autoscaled-peak-load
2024-03-07T22:19:05.475344665Z I0307 22:19:05.474749       1 orchestrator.go:164] No expansion options
2024-03-07T22:19:05.475351187Z I0307 22:19:05.474791       1 static_autoscaler.go:570] Calculating unneeded nodes
2024-03-07T22:19:05.475358011Z I0307 22:19:05.474797       1 hetzner_servers_cache.go:116] Current serversCache len: 1
2024-03-07T22:19:05.475364533Z I0307 22:19:05.474804       1 pre_filtering_processor.go:57] Node prod-agent-heavy-bih should not be processed by cluster autoscaler (no node group config)
2024-03-07T22:19:05.475371086Z I0307 22:19:05.474808       1 hetzner_servers_cache.go:116] Current serversCache len: 1
2024-03-07T22:19:05.475377658Z I0307 22:19:05.474812       1 pre_filtering_processor.go:57] Node prod-control-plane-hel1-bpe should not be processed by cluster autoscaler (no node group config)
2024-03-07T22:19:05.475384641Z I0307 22:19:05.474842       1 static_autoscaler.go:617] Scale down status: lastScaleUpTime=2024-03-07 21:08:47.628117895 +0000 UTC m=-3582.817780634 lastScaleDownDeleteTime=2024-03-07 21:08:47.628117895 +0000 UTC m=-3582.817780634 lastScaleDownFailTime=2024-03-07 21:08:47.628117895 +0000 UTC m=-3582.817780634 scaleDownForbidden=false scaleDownInCooldown=false
2024-03-07T22:19:05.475399349Z I0307 22:19:05.474868       1 static_autoscaler.go:642] Starting scale down
2024-03-07T22:19:05.475409187Z I0307 22:19:05.475048       1 event_sink_logging_wrapper.go:48] Event(v1.ObjectReference{Kind:"Pod", Namespace:"default", Name:"solverimage-756ff448b8-8tch8", UID:"b52bdd9c-0173-4296-a60e-ddef94b86e36", APIVersion:"v1", ResourceVersion:"214800", FieldPath:""}): type: 'Normal' reason: 'NotTriggerScaleUp' pod didn't trigger scale-up: 1 Insufficient cpu, 1 max node group size reached
I0307 22:19:05.479852       1 event_sink_logging_wrapper.go:48] Event(v1.ObjectReference{Kind:"Pod", Namespace:"default", Name:"solverimage-756ff448b8-8xbt4", UID:"e5a472c3-e8c1-470d-9e4c-9d18c8bfc55c", APIVersion:"v1", ResourceVersion:"214794", FieldPath:""}): type: 'Normal' reason: 'NotTriggerScaleUp' pod didn't trigger scale-up: 1 Insufficient cpu, 1 max node group size reached
I0307 22:19:05.482897       1 event_sink_logging_wrapper.go:48] Event(v1.ObjectReference{Kind:"Pod", Namespace:"default", Name:"solverimage-756ff448b8-dr88l", UID:"91e963c9-3978-4137-acc3-42ff3fffe27f", APIVersion:"v1", ResourceVersion:"214785", FieldPath:""}): type: 'Normal' reason: 'NotTriggerScaleUp' pod didn't trigger scale-up: 1 Insufficient cpu, 1 max node group size reached
I0307 22:19:05.881370       1 leaderelection.go:281] successfully renewed lease kube-system/cluster-autoscaler
I0307 22:19:07.886002       1 leaderelection.go:281] successfully renewed lease kube-system/cluster-autoscaler
I0307 22:19:09.890437       1 leaderelection.go:281] successfully renewed lease kube-system/cluster-autoscaler
I0307 22:19:11.894979       1 leaderelection.go:281] successfully renewed lease kube-system/cluster-autoscaler
I0307 22:19:13.899045       1 leaderelection.go:281] successfully renewed lease kube-system/cluster-autoscaler
I0307 22:19:15.479890       1 static_autoscaler.go:290] Starting main loop

Kube.tf file

locals {
  hcloud_token = ""
}

module "kube-hetzner" {
  providers = {
    hcloud = hcloud
  }
  hcloud_token = var.hcloud_token != "" ? var.hcloud_token : local.hcloud_token

  source = "kube-hetzner/kube-hetzner/hcloud"

  cluster_name = "prod"

  automatically_upgrade_os = false
  automatically_upgrade_k3s = false


  ssh_public_key  = file("")
  ssh_private_key = file("")

  network_region = "eu-central"

  enable_rancher = true
  rancher_hostname = ""
  rancher_install_channel = "stable"
  rancher_bootstrap_password = ""

  initial_k3s_channel = "v1.25"

  control_plane_nodepools = [
    {
      name        = "control-plane-hel1",
      server_type = "cpx31",
      location    = "hel1",
      labels      = [],
      taints      = [],
      count       = 1
      backups     = true
    }
  ]

  agent_nodepools = [
    {
      name        = "agent-heavy",
      server_type = "cpx31",
      location    = "hel1",
      labels      = [],
      taints      = [],
      count       = 1
      backups     = true
    }
  ]

  autoscaler_nodepools = [
  {
    name        = "autoscaled-peak-load"
    server_type = "cpx21"           // Choose a server type with sufficient resources
    location    = "hel1"            // Specify the desired location
    min_nodes   = 0                 // Set a minimum number of nodes
    max_nodes   = 5                // Set a maximum number to scale up to during peak load
    #labels      = {
    #  "node.kubernetes.io/role": "peak-workloads"
    #}
    #taints      = [{
    #  key: "node.kubernetes.io/role"
    #  value: "peak-workloads"
    #  effect: "NoExecute"
    #}]
  }
]

  #load_balancer_type     = "lb11"
  #load_balancer_location = "hel1"

  cluster_autoscaler_image   = "registry.k8s.io/autoscaling/cluster-autoscaler"
  cluster_autoscaler_version = "v1.29.0"
  cluster_autoscaler_log_level = 5
  cluster_autoscaler_log_to_stderr = true
  cluster_autoscaler_stderr_threshold = "INFO"
  #cluster_autoscaler_extra_args = [
  #  "--ignore-daemonsets-utilization=true",
  #  "--enforce-node-group-min-size=true",
  #]


  cni_plugin = "cilium"
  cilium_version = "v1.14.0"
  cilium_routing_mode = "native"
  cilium_egress_gateway_enabled = true

  
}

provider "hcloud" {
  token = var.hcloud_token != "" ? var.hcloud_token : local.hcloud_token
}

terraform {
  required_version = ">= 1.5.0"
  required_providers {
    hcloud = {
      source  = "hetznercloud/hcloud"
      version = ">= 1.43.0"
    }
  }
}

output "kubeconfig" {
  value     = module.kube-hetzner.kubeconfig
  sensitive = true
}

variable "hcloud_token" {
  sensitive = true
  default   = ""
}

Screenshots

No response

Platform

Linux

homergleason added the bug label Mar 7, 2024
@mysticaltech (Collaborator) commented:

@homergleason Remove those; we maintain a fork until the Hetzner-related fixes are shipped in a future upstream release. So just delete these lines to use our own fork:

  cluster_autoscaler_image   = "registry.k8s.io/autoscaling/cluster-autoscaler"
  cluster_autoscaler_version = "v1.29.0"
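
For reference, the relevant part of your kube.tf would then look roughly like this (a sketch, not a definitive config): the image and version overrides are commented out so the module falls back to its own fork, and the remaining autoscaler settings from your file stay as they are:

  # cluster_autoscaler_image   = "registry.k8s.io/autoscaling/cluster-autoscaler"  # removed so the module's maintained fork is used
  # cluster_autoscaler_version = "v1.29.0"                                         # removed
  cluster_autoscaler_log_level        = 5
  cluster_autoscaler_log_to_stderr    = true
  cluster_autoscaler_stderr_threshold = "INFO"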

Then just make sure to give it enough load. With such powerful nodes, you may need to set much bigger resource requests than in the example below:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: force-scale-up
spec:
  replicas: 1
  selector:
    matchLabels:
      app: force-scale-up
  template:
    metadata:
      labels:
        app: force-scale-up
    spec:
      containers:
      - name: busybox
        image: busybox
        args:
        - /bin/sh
        - -c
        - "while true; do echo 'Forcing scale up...'; sleep 60; done"
        resources:
          requests:
            cpu: 2000m # Requesting a high amount of CPU to force scale up
            memory: 4Gi # Requesting a high amount of memory to force scale up
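
If a single replica with these requests still fits on your existing nodes, raise the requests or scale the deployment out until pods stay Pending (for example, save the manifest as force-scale-up.yaml, run kubectl apply -f force-scale-up.yaml, then kubectl scale deployment force-scale-up --replicas=5); the autoscaler should then bring up nodes in the autoscaled pool.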

mysticaltech removed the bug label Mar 8, 2024
mysticaltech changed the title from "[Bug]: Autoscaling doesn't work" to "Autoscaling doesn't work" Mar 8, 2024
kube-hetzner locked and limited conversation to collaborators Mar 8, 2024
mysticaltech converted this issue into discussion #1266 Mar 8, 2024

This issue was moved to discussion #1266. You can continue the conversation there.
