Recommended way to change agents_size without downtime? #559

Open
1 task done
Israphel opened this issue Jun 5, 2024 · 3 comments

Comments

@Israphel

Israphel commented Jun 5, 2024

Is there an existing issue for this?

  • I have searched the existing issues

Description

We deploy our clusters with a default node_pool, using:

agents_pool_name            = "default"
agents_pool_max_surge       = try(each.value.max_surge, "10%")
agents_availability_zones   = ["1", "2", "3"]
agents_type                 = "VirtualMachineScaleSets"
agents_size                 = try(each.value.agents_size, "Standard_D2s_v3")
temporary_name_for_rotation = "tmp"

We're replacing agents_size with the ARM equivalent. We can see the "tmp" node_pool being created, but then all the default nodes are drained at once, without respecting PodDisruptionBudgets (PDBs), which essentially takes down every service:

1s          Normal   Drain             node/aks-default-15731243-vmss000009      Draining node: aks-default-15731243-vmss000009
2s          Normal   Drain             node/aks-default-15731243-vmss00000x      Draining node: aks-default-15731243-vmss00000x
2s          Normal   Drain             node/aks-default-15731243-vmss00000e      Draining node: aks-default-15731243-vmss00000e

Are we doing it the wrong way? How can we change the agents size without such drastic draining?

New or Affected Resource(s)/Data Source(s)

azurerm_kubernetes_cluster

@zioproto
Collaborator

zioproto commented Jun 7, 2024

@Israphel could you please confirm which version of the module you are using?

@zioproto
Collaborator

zioproto commented Jun 7, 2024

@Israphel I understand you are trying to change the "agents_size" of the system node pool. In terms of the provider documentation, this changes the default_node_pool block of the azurerm_kubernetes_cluster resource.
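
As a rough sketch of that mapping, the module inputs from the description would correspond to something like the following at the provider level. The resource label, cluster name, location, resource group, DNS prefix, node count, and identity block are placeholders that do not come from this thread, and the input-to-argument mapping is an assumption about how the module wires its inputs, not a confirmed reference.

```hcl
# Illustrative only: placeholder names and values, not the module's actual source.
resource "azurerm_kubernetes_cluster" "example" {
  name                = "example-aks" # placeholder
  location            = "westeurope"  # placeholder
  resource_group_name = "example-rg"  # placeholder
  dns_prefix          = "exampleaks"  # placeholder

  default_node_pool {
    name       = "default"         # agents_pool_name (assumed mapping)
    vm_size    = "Standard_D2s_v3" # agents_size (assumed mapping)
    node_count = 3                 # placeholder
    zones      = ["1", "2", "3"]   # agents_availability_zones (assumed mapping)
    type       = "VirtualMachineScaleSets"

    # Name of the temporary pool the provider creates while it cycles
    # the default pool after a vm_size change.
    temporary_name_for_rotation = "tmp"

    upgrade_settings {
      max_surge = "10%" # agents_pool_max_surge (assumed mapping)
    }
  }

  identity {
    type = "SystemAssigned" # placeholder
  }
}
```

Changing vm_size in this block is what triggers the rotation through the "tmp" pool described in the issue.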

Please check this documentation:

https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/kubernetes_cluster

[Screenshot, 2024-06-07 08:30:20]

The behaviour you see is expected, and I don't think this is something we can work around in the module.

I found this related provider issue:

Feel free to open a new issue upstream at https://github.com/hashicorp/terraform-provider-azurerm/issues if you would like this behaviour to change.

I will keep this issue open in case you have additional questions.

Thanks

@Israphel
Author

Israphel commented Jun 7, 2024

I use 8.0.0.

The only way we found was to create a new node_pool, drain all the default nodes, change agents_size, and then drain the temporary node_pool once more. Is this what everyone is doing to prevent downtime?

The problem we see is that this doesn't happen when you upgrade Kubernetes: everything goes smoothly and the PDBs are respected. But changing the instance type just drains all the nodes at once, which is too aggressive.
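
The manual workaround described above could be sketched roughly as follows. This is only an illustration under assumptions: the module is taken to be instantiated as `module.aks` and to expose an `aks_id` output, and the pool name, VM size, and node count are placeholders, not values from this thread.

```hcl
# Hypothetical extra pool that holds workloads while the default pool is resized.
# Assumes the module instance is named "aks" and exposes an aks_id output.
resource "azurerm_kubernetes_cluster_node_pool" "bridge" {
  name                  = "bridge"           # placeholder name
  kubernetes_cluster_id = module.aks.aks_id  # assumed output name
  vm_size               = "Standard_D2ps_v5" # assumed Arm64 size; use your target size
  node_count            = 3                  # placeholder; size it to fit the workloads
  zones                 = ["1", "2", "3"]
}

# Manual steps around it (run outside Terraform):
# 1. Apply this pool, then drain the default nodes one at a time, e.g.
#    kubectl drain <node> --ignore-daemonsets --delete-emptydir-data
#    kubectl drain evicts pods through the Eviction API, so PDBs are honoured.
# 2. Change agents_size and apply; the default pool is rotated while the
#    workloads are already running on the bridge pool.
# 3. Drain the bridge pool nodes the same way, then remove this resource.
```

If the target size is Arm-based, the container images also need to be available for arm64 before pods can be scheduled onto the new nodes.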
