Support for AKS node pool cordonDrainTimeoutInMinutes parameter #13066
@SteveLeach-Keytree Do you know if this feature has made it into the AKS API already? I can't find it. If not, I recommend first filing this issue upstream for the AKS team to expose it in the API.
Is that not what the issue I linked in the reference was? I'm not entirely sure I fully understand what all the different projects are.
@SteveLeach-Keytree Don't worry, there is a lot indeed 😃 aks-engine features might end up in AKS, but in general a feature in aks-engine is not exposed in AKS by default. If there is no documentation or API feature we can use, you should ask the AKS team first.
That's what I'm saying - I linked the AKS team's change that makes this possible.
I understand what you are saying, but you are pointing towards the change in aks-engine. The Terraform provider for Azure interacts with the Azure AKS APIs, which are a wrapper around the service, which in turn probably uses some version of the aks-engine you mentioned. We have to work with that API, as does the Azure CLI. The functionality you are interested in is, AFAIK, not exposed via the Azure AKS API, and I cannot find it in the Azure CLI docs either. To get more technical: this link points towards the Go models we can use to configure the AKS node pool for Terraform, and I cannot find it there either. To get the feature exposed, or for more information on how it could be done, it would be best to file an issue here.
Ah, got you now. I'd assumed that the AKS team were the people maintaining aks-engine and was getting very confused.
Just FYI: the issue was closed on the AKS side due to inactivity 🤷
Regardless of the AKS issue, it appears that at some point this arrived in the REST API: https://learn.microsoft.com/en-us/rest/api/aks/agent-pools/get?view=rest-aks-2023-08-01&tabs=HTTP#agentpoolupgradesettings
@stephybun @katbyte @aristosvo apologies for the mentions, but I think this issue could progress now, given that the API appears to support it.
Code in the original submission was written before the API came out. This needs to be under `upgrade_settings` with `max_surge`, to match the Azure settings logically for node pools:

```hcl
resource "azurerm_kubernetes_cluster" "aks" {
  upgrade_settings {
    # max_surge and the proposed drain timeout setting would go here
  }
}
```
This value will also need to be respected when setting a temporary node pool name for an upgrade. Currently, if you change the VM size of an AKS cluster in Terraform, you give it a temporary node pool name; a temporary pool is created, your workload is moved over, and then the default node pool is re-created. The drain timeout for the temporary pool and the re-created default pool should match the original custom setting, rather than being reset to the default 30 minutes.
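For illustration, a minimal sketch of the rotation scenario described above, using the provider's `temporary_name_for_rotation` argument and a hypothetical `drain_timeout_in_minutes` attribute under `upgrade_settings` (that attribute name is an assumption mirroring the API's `drainTimeoutInMinutes`, not confirmed provider schema in this thread):

```hcl
resource "azurerm_kubernetes_cluster" "example" {
  name                = "example-aks"
  location            = "westeurope"
  resource_group_name = "example-rg"
  dns_prefix          = "example"

  default_node_pool {
    name       = "default"
    vm_size    = "Standard_D4s_v5"
    node_count = 4

    # Used when a change such as vm_size forces the pool to be rotated:
    # a temporary pool is created, workloads move over, and the default
    # pool is re-created.
    temporary_name_for_rotation = "defaulttmp"

    upgrade_settings {
      max_surge = "33%"
      # Hypothetical attribute: per the comment above, this should carry
      # over to the temporary pool and the re-created default pool,
      # rather than being reset to the 30-minute default.
      drain_timeout_in_minutes = 60
    }
  }

  identity {
    type = "SystemAssigned"
  }
}
```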
I'll take a quick look at how hard this will be to implement and will let you know ASAP. From the first look of things it seems we need to update…
@aristosvo It looks like the dependencies listed in the PR have potentially been resolved now. Is there anything blocking the attached PR from moving forward?
Could you add support for the soak setting also? So all…
This is possible on AKS today and the features are all GA: Generally Available: AKS support for node soak duration for upgrades. @stephybun @ms-henglu we need to have this in the provider. We already have an issue for the AKS module: Azure/terraform-azurerm-aks#530
Without the drain timeout and node soak duration parameters, we are currently experiencing downtime during automatic node upgrades. Nodes are upgraded in very rapid succession, before the services inside have had a chance to recover. I would call this a really high-priority problem, as it genuinely leaves services offline for a while. Could you please focus some efforts on this?
Hahah yeah... come on IBM, let's get this done; it should have been updated in November 2023
I'll take a quick look. Last time it was blocked on Microsoft's API behavior, which was reported here but has triggered no response thus far. @norm-dole-dbtlabs Not cool; I'm an independent contributor, and nobody on the Terraform or HashiCorp side can be blamed for anything in this regard.
@zioproto is this something you could help with by bringing it to the attention of the AKS team?
I've brought the PR up to date and completed things like docs. The issue is still there; implicitly resetting the upgrade settings to the defaults is therefore not included in the current implementation. Whether that is a problem is up to the reviewers. Support for…
@aristosvo that IBM shot wasn't aimed at you; I think it's noble that you're doing this work, but it's also the responsibility of HashiCorp/Microsoft, through whatever partnership they have, to keep this up to date in real time rather than hoping community members shore up their enterprise support on a best-effort basis
I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.
Description
AKS feature aks-engine#1276 made the cordon drain timeout configurable, but this attribute does not yet appear to be exposed through the azurerm provider.
We have some pods in a deployment that take a long time to start up, as they load all the data they need. We generally have 4 pods in the ReplicaSet and a minimum of 2 (defined in a pod disruption budget). During an AKS upgrade we need to allow time for 2 new pods to start up on new nodes before terminating the last 2 on the old nodes, but the cordon & drain times out before they are ready.
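For context, a minimal sketch of the pod disruption budget described above, written with the Terraform `kubernetes` provider; the resource name and labels are hypothetical:

```hcl
resource "kubernetes_pod_disruption_budget_v1" "slow_starters" {
  metadata {
    name      = "slow-starters-pdb" # hypothetical name
    namespace = "default"
  }

  spec {
    # Keep at least 2 of the 4 replicas available during a node drain.
    min_available = 2

    selector {
      match_labels = {
        app = "slow-starter" # hypothetical label
      }
    }
  }
}
```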
New or Affected Resource(s)
azurerm_kubernetes_cluster_node_pool
Potential Terraform Configuration
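A minimal sketch of what the requested setting could look like on the affected resource, assuming a hypothetical `drain_timeout_in_minutes` attribute under `upgrade_settings` (the name mirrors the REST API's `drainTimeoutInMinutes` and is an assumption, not confirmed provider schema):

```hcl
resource "azurerm_kubernetes_cluster_node_pool" "example" {
  name                  = "workload"
  kubernetes_cluster_id = azurerm_kubernetes_cluster.example.id
  vm_size               = "Standard_D4s_v5"
  node_count            = 4

  upgrade_settings {
    max_surge = "33%"
    # Hypothetical attribute exposing cordonDrainTimeoutInMinutes.
    drain_timeout_in_minutes = 60
  }
}
```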
References