
azurerm_kubernetes_cluster_node_pool - support for the source_snapshot_id property #21511

Merged
4 commits merged into hashicorp:main on May 3, 2023

Conversation

ms-henglu
Contributor

Fixes #21442

Member

@stephybun stephybun left a comment


Thanks @ms-henglu. Overall this is off to a good start - once the comments left in-line have been addressed we can take another look through.

@ms-henglu
Contributor Author

Thanks @stephybun for the code review. I've updated this PR as suggested; would you please take another look? Thanks!

@zioproto
Contributor

I am testing like this:

provider "azurerm" {
  features {}
}

resource "azurerm_resource_group" "this" {
  name     = "PR21511"
  location = "eastus"
}

resource "azurerm_kubernetes_cluster" "aks" {
  name                = "aks"
  location            = "eastus"
  resource_group_name = azurerm_resource_group.this.name
  dns_prefix          = "aks"
  kubernetes_version  = "1.26.3"

  default_node_pool {
    name       = "default"
    node_count = 1
    vm_size    = "Standard_D3_v2"
  }

  identity {
    type = "SystemAssigned"
  }
}

resource "azurerm_kubernetes_cluster_node_pool" "aks" {
  name                  = "fromsnapshot"
  kubernetes_cluster_id = azurerm_kubernetes_cluster.aks.id
  vm_size               = "Standard_D3_v2"
  node_count            = 1
  snapshot_id           = "/subscriptions/<redacted>/resourceGroups/snapshots/providers/Microsoft.ContainerService/snapshots/testedsnap"
}

This is the snapshot I am using:

{
    "creationData": {
      "sourceResourceId": "/subscriptions/<redacted>/resourcegroups/azureservicemesh/providers/Microsoft.ContainerService/managedClusters/azureservicemesh/agentPools/nodepool1"
    },
    "enableFips": null,
    "id": "/subscriptions/<redacted>/resourceGroups/snapshots/providers/Microsoft.ContainerService/snapshots/testedsnap",
    "kubernetesVersion": "1.26.3",
    "location": "eastus",
    "name": "testedsnap",
    "nodeImageVersion": "AKSUbuntu-2204gen2containerd-202304.10.0",
    "osSku": "Ubuntu",
    "osType": "Linux",
    "resourceGroup": "snapshots",
    "snapshotType": "NodePool",
    "systemData": {
      "createdAt": "2023-04-26T06:42:32.795219+00:00",
      "createdBy": "<redacted>",
      "createdByType": "User",
      "lastModifiedAt": "2023-04-26T06:42:32.795219+00:00",
      "lastModifiedBy": "<redacted>",
      "lastModifiedByType": "User"
    },
    "tags": null,
    "type": "Microsoft.ContainerService/Snapshots",
    "vmSize": "Standard_DS3_v2"
  }

I am failing with the following error:

azurerm_kubernetes_cluster_node_pool.aks: Creating...
╷
│ Error: creating Agent Pool (Subscription: "<redacted>"
│ Resource Group Name: "PR21511"
│ Managed Cluster Name: "aks"
│ Agent Pool Name: "fromsnapshot"): agentpools.AgentPoolsClient#CreateOrUpdate: Failure sending request: StatusCode=0 -- Original Error: Code="BadRequest" Message="The target nodepool and source snapshot should have same Distro."
│
│   with azurerm_kubernetes_cluster_node_pool.aks,
│   on main.tf line 30, in resource "azurerm_kubernetes_cluster_node_pool" "aks":
│   30: resource azurerm_kubernetes_cluster_node_pool "aks" {
│

What does "The target nodepool and source snapshot should have same Distro." mean? The azurerm_kubernetes_cluster_node_pool resource does not have any "distro" parameter.

@ms-henglu
Contributor Author

Hi @zioproto ,

Thank you for testing this feature. I've confirmed it with the service team: it seems the snapshot was taken while the node pool was running Ubuntu 18.04, but the new node pool uses 20.04, so creating the node pool failed. Would you please try creating a new snapshot and then restoring from it?
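One quick way to check what distro and node image a snapshot actually carries is to query it with the Azure CLI (a sketch using the snapshot name and resource group from the config above; the fields queried match the JSON shown earlier in this thread):

```shell
# Inspect the snapshot's OS SKU, OS type, and node image version
az aks nodepool snapshot show \
  --name testedsnap \
  --resource-group snapshots \
  --query "{osSku: osSku, osType: osType, nodeImageVersion: nodeImageVersion}" \
  -o table
```

Comparing `nodeImageVersion` here against the image used by the target node pool should reveal any distro mismatch before Terraform hits the API error.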

Member

@stephybun stephybun left a comment


Thanks for making those changes @ms-henglu, I left a few more comments but once they're resolved and the tests are passing this should be good to go 👍

@zioproto
Contributor

Thank you for testing this feature. I've confirmed it with the service team: it seems the snapshot was taken while the node pool was running Ubuntu 18.04, but the new node pool uses 20.04, so creating the node pool failed. Would you please try creating a new snapshot and then restoring from it?

@ms-henglu I will test again, but I am confident the snapshot was taken from a cluster running AKS 1.26.3.

You can see it in the output I attached in the previous comment from az aks nodepool snapshot list; the output shows the nodeImageVersion of the snapshot:

"nodeImageVersion": "AKSUbuntu-2204gen2containerd-202304.10.0",

There is no Ubuntu 18 involved here.

Member

@stephybun stephybun left a comment


Thanks for this @ms-henglu LGTM 🍉

@stephybun stephybun merged commit 18d72e1 into hashicorp:main May 3, 2023
@github-actions github-actions bot added this to the v3.55.0 milestone May 3, 2023
stephybun added a commit that referenced this pull request May 3, 2023
@zioproto
Contributor

zioproto commented May 3, 2023

@stephybun @ms-henglu I tested again and this is still broken for me. Here are my repro steps:

Create a cluster with azcli:

az aks create \
 --location eastus \
 --name cheshire-cat \
 --enable-addons monitoring \
 --resource-group cheshire-cat \
 --network-plugin azure  \
 --kubernetes-version 1.25.6  \
 --node-vm-size Standard_DS3_v2 \
 --node-count 2 \
 --auto-upgrade-channel rapid \
 --node-os-upgrade-channel  NodeImage \
 --enable-asm

Get the nodepool_id

NODEPOOL_ID=$(az aks nodepool show --name nodepool1 --cluster-name cheshire-cat --resource-group cheshire-cat --query id -o tsv)

Create a snapshot:

 az aks nodepool snapshot create --name catsnapshot --resource-group snapshots --nodepool-id $NODEPOOL_ID --location eastus

Get the snapshot id

SNAPSHOT_ID=$(az aks nodepool snapshot show --name catsnapshot --resource-group snapshots --query id -o tsv)

Run the following Terraform code, where I pasted the $SNAPSHOT_ID value into azurerm_kubernetes_cluster_node_pool by hand.

provider "azurerm" {
  features {}
}

resource "azurerm_resource_group" "this" {
  name     = "PR21511"
  location = "eastus"
}

resource "azurerm_kubernetes_cluster" "aks" {
  name                = "aks"
  location            = "eastus"
  resource_group_name = azurerm_resource_group.this.name
  dns_prefix          = "aks"
  kubernetes_version  = "1.25.6"

  default_node_pool {
    name       = "default"
    node_count = 1
    vm_size    = "Standard_D3_v2"
  }

  identity {
    type = "SystemAssigned"
  }
}

resource "azurerm_kubernetes_cluster_node_pool" "aks" {
  name                  = "npfromsnap"
  kubernetes_cluster_id = azurerm_kubernetes_cluster.aks.id
  vm_size               = "Standard_D3_v2"
  node_count            = 1
  snapshot_id           = "/subscriptions/<redacted>/resourceGroups/snapshots/providers/Microsoft.ContainerService/snapshots/catsnapshot"
}

I run terraform init and terraform apply.

Once the AKS cluster resource is created, the node pool resource fails immediately with this error:

azurerm_kubernetes_cluster.aks: Creation complete after 4m50s [id=/subscriptions/<redacted>/resourceGroups/PR21511/providers/Microsoft.ContainerService/managedClusters/aks]
azurerm_kubernetes_cluster_node_pool.aks: Creating...
╷
│ Error: creating Agent Pool (Subscription: "<redacted>"
│ Resource Group Name: "PR21511"
│ Managed Cluster Name: "aks"
│ Agent Pool Name: "npfromsnap"): agentpools.AgentPoolsClient#CreateOrUpdate: Failure sending request: StatusCode=0 -- Original Error: Code="BadRequest" Message="The target nodepool and source snapshot should have same Distro."
│
│   with azurerm_kubernetes_cluster_node_pool.aks,
│   on main.tf line 31, in resource "azurerm_kubernetes_cluster_node_pool" "aks":
│   31: resource azurerm_kubernetes_cluster_node_pool "aks" {

This should work. After Terraform fails, I still have the AKS cluster that Terraform created successfully. If I use azcli to create an additional node pool using that same snapshot ID, it works as expected:

az aks nodepool add  --name nodepool2 --cluster-name aks --resource-group PR21511 --snapshot-id $SNAPSHOT_ID

Do we need to open a new PR now? Do you want me to open a new GitHub issue?

Thanks

@zioproto
Contributor

zioproto commented May 3, 2023

@ms-henglu I am confused; I don't understand which of az aks snapshot and az aks nodepool snapshot you used in the code. These are two different APIs.

Can you confirm you are using the nodepool snapshot API?
https://learn.microsoft.com/en-us/cli/azure/aks/nodepool/snapshot?view=azure-cli-latest

(screenshot attached: the az aks nodepool snapshot CLI documentation page)

@zioproto
Contributor

zioproto commented May 4, 2023

Kudos to @stephybun for figuring out I am mixing Standard_DS3_v2 and Standard_D3_v2 SKUs in my testing repro steps.

@zioproto
Contributor

zioproto commented May 4, 2023

Kudos to @ms-henglu for explaining to me that there are two CLI commands:

  • az aks snapshot
  • az aks nodepool snapshot

They use the same API, "Microsoft.ContainerService/snapshots@2023-02-02-preview"; a field called snapshotType, with values "NodePool" or "ManagedCluster", differentiates the two snapshot types.
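This distinction can be checked directly with the CLI. A sketch, assuming both command groups are available (az aks snapshot for managed-cluster snapshots may require the aks-preview extension; the snapshot names come from this thread, except "clustersnap", which is hypothetical):

```shell
# Both command groups manage the same ARM resource type
# (Microsoft.ContainerService/snapshots); snapshotType tells them apart.
az aks nodepool snapshot show \
  --name catsnapshot --resource-group snapshots \
  --query snapshotType -o tsv   # expected: NodePool

az aks snapshot show \
  --name clustersnap --resource-group snapshots \
  --query snapshotType -o tsv   # expected: ManagedCluster
```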

@zioproto
Contributor

zioproto commented May 4, 2023

Kudos to @stephybun for figuring out I am mixing Standard_DS3_v2 and Standard_D3_v2 SKUs in my testing repro steps.

If the vm_size is enforced by the snapshot, it should not be mandatory in the azurerm_kubernetes_cluster_node_pool resource, correct?

@tombuildsstuff
Contributor

@zioproto based on the above, it appears that your original issue was that you were mixing different generations of Virtual Machine - you can't restore a snapshot from a Gen2 compute node onto a Gen1 compute node - unfortunately this is a misleading error message coming back from the API.

The vm_size field should remain Required, since you can change the VM Size but not the VM Family, per the documentation:

Any node pool or cluster created from a snapshot must use a VM from the same virtual machine family as the snapshot, for example, you can't create a new N-Series node pool based on a snapshot captured from a D-Series node pool because the node images in those cases are structurally different.
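A minimal HCL sketch of what this constraint implies (VM sizes are illustrative, and the snapshot_id is a placeholder, not the one from the repro above):

```hcl
# Sketch only: a node pool restored from a snapshot taken on a
# Standard_DS3_v2 pool (DSv2 family, Gen2-capable). vm_size remains
# Required, but must stay within a compatible VM family/generation:
# Standard_DS4_v2 would be accepted, an N-Series size would not.
resource "azurerm_kubernetes_cluster_node_pool" "from_snapshot" {
  name                  = "npfromsnap"
  kubernetes_cluster_id = azurerm_kubernetes_cluster.aks.id
  vm_size               = "Standard_DS4_v2" # same family as the snapshot's DS3_v2
  node_count            = 1
  snapshot_id           = "/subscriptions/<redacted>/resourceGroups/snapshots/providers/Microsoft.ContainerService/snapshots/catsnapshot"
}
```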

@github-actions

github-actions bot commented May 5, 2023

This functionality has been released in v3.55.0 of the Terraform Provider. Please see the Terraform documentation on provider versioning or reach out if you need any assistance upgrading.

For further feature requests or bug reports with this functionality, please create a new GitHub issue following the template. Thank you!


github-actions bot commented Jun 1, 2024

I'm going to lock this pull request because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jun 1, 2024