This module was created and tested on Linux and macOS. It provisions:
- Azure Resource Group
- AKS Cluster
- 1x CPU node pool (defaults to 1x CPU node -- Standard_D16_v5)
- 1x GPU node pool (defaults to 2x V100 nodes -- Standard_NC6s_v3)
- Installs the NVIDIA GPU Operator (defaults to v24.9.0)
Prerequisites:

- Kubectl
- Azure CLI
- Azure Account & Subscription where you are permitted to create cloud resources
- Terraform (CLI)
- Azure Kubelogin
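A quick way to confirm these tools are installed and on your PATH (a sketch; exact version output varies, and older Azure CLI releases use `az --version` instead of `az version`):

```bash
terraform version
az version
kubectl version --client
kubelogin --version
```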
This module assumes that you have a working Terraform binary and active Azure credentials.
No Terraform backend is configured for remote state management, but one can be added. We strongly encourage you to configure remote state before running in production.
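If you do add one, a minimal sketch of an `azurerm` backend follows; the resource group, storage account, and container names are placeholders for state resources you must create beforehand:

```hcl
terraform {
  backend "azurerm" {
    resource_group_name  = "tfstate-rg"     # placeholder: pre-created resource group
    storage_account_name = "tfstatestore"   # placeholder: pre-created storage account
    container_name       = "tfstate"        # placeholder: pre-created blob container
    key                  = "aks/terraform.tfstate"
  }
}
```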
- Clone the repo and change into the `aks` directory:

  ```bash
  git clone https://github.com/NVIDIA/nvidia-terraform-modules.git
  cd nvidia-terraform-modules/aks
  ```
- Log in to Azure via the CLI. The command below authenticates you to your Azure account:

  ```bash
  az login
  ```
- Update the `terraform.tfvars` file. To customize a parameter from its default value, uncomment the line and change its content:
  - Update `cluster_name`, if needed.
  - Update `location`, if needed.
  - Add the IDs of the users or groups who should have cluster access to the `admin_group_object_ids` variable. The GUID can be retrieved in the Azure portal by searching for the desired user or group; for more information, refer to Find Object Id. (A CLI lookup sketch follows the example below.)
  - Set `install_nim_operator` to `true` if you want to install the NIM Operator.

  ```hcl
  admin_group_object_ids = ["xxxxxxx-xxx-xxxx-xxxx-xxxxxxxxxx"]
  cluster_name           = "aks-cluster"
  # cpu_machine_type             = "Standard_D16_v5"
  # cpu_node_pool_count          = 1
  # cpu_node_pool_disk_size      = 100
  # cpu_node_pool_max_count      = 5
  # cpu_node_pool_min_count      = 1
  # cpu_os_sku                   = "Ubuntu"
  # existing_resource_group_name = ""
  # gpu_machine_type             = "Standard_NC6s_v3"
  # gpu_node_pool_count          = 2
  # gpu_node_pool_disk_size      = 100
  # gpu_node_pool_max_count      = 5
  # gpu_node_pool_min_count      = 1
  install_gpu_operator = "true"
  # gpu_operator_namespace       = "gpu-operator"
  # gpu_operator_version         = "v24.9.0"
  # gpu_operator_driver_version  = "550.127.05"
  # install_nim_operator         = "false"
  # nim_operator_version         = "v1.0.0"
  # nim_operator_namespace       = "nim-operator"
  # gpu_os_sku                   = "Ubuntu"
  # kubernetes_version           = "1.30"
  location = "westus2"
  ```
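  If you prefer the CLI, one way to look up a group's object ID is sketched below (assumes you are logged in with `az login`; the group display name is a placeholder, and older Azure CLI versions expose the GUID as `objectId` rather than `id`):

  ```bash
  az ad group show --group "my-aks-admins" --query id --output tsv
  ```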
- Initialize the module with the command below:

  ```bash
  terraform init
  ```
- Run the command below to view the proposed changes:

  ```bash
  terraform plan -out tfplan
  ```
- Run the command below to apply the configuration:

  ```bash
  terraform apply tfplan
  ```
- Once the cluster is created, run the command below with your AKS cluster name and resource group name to fetch the kubeconfig, so you are able to run `kubectl` commands:

  ```bash
  az aks get-credentials --resource-group aks-cluster-rg --name aks-cluster
  ```
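  To verify access (assuming the default names above and `install_gpu_operator` left at its default of `"true"`), list the nodes and check that the GPU Operator pods are running:

  ```bash
  kubectl get nodes
  kubectl get pods -n gpu-operator
  ```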
- Run the commands below to delete all remaining Azure resources created by this module. You should see a `Destroy complete!` message after a few minutes:

  ```bash
  terraform state rm kubernetes_namespace_v1.gpu-operator
  terraform state rm kubernetes_namespace_v1.nim-operator
  terraform destroy --auto-approve
  ```
Call the AKS module by adding this to an existing Terraform file:
module "nvidia-aks" {
source = "git::github.com/NVIDIA/nvidia-terraform-modules/aks"
cluster_name = "nvidia-aks"
admin_group_object_ids = [] # See below for the value of this variable
}
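If you want to consume the module's outputs elsewhere in your configuration, a minimal sketch (output names are taken from the Outputs table below; the kubeconfig is marked sensitive because it contains credentials):

```hcl
output "aks_cluster_name" {
  value = module.nvidia-aks.kubernetes_cluster_name
}

output "aks_kube_config" {
  value     = module.nvidia-aks.kube_config
  sensitive = true
}
```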
All configurable options for this module are listed below. If you need additional values added, please open a pull request.
Known issues:

- None. If you do encounter an issue, please file a GitHub issue.
- New Azure accounts which have not enabled VMs or GPU VMs in any region will need to request quota in that region. During installation, if you see a quota-related error, click the link in the error message to be redirected to the Azure console with a prepopulated quota request. Re-run `terraform apply` once the quota request is complete. This will take ~5 minutes per quota request. (See the quota-check sketch after this list.)
- When using Azure Cloud Shell during installation, if you see an `MSI` or `Bad Request(400)` error, it is a known issue with Azure Cloud Shell. There are two workarounds:
  - Use the Azure CLI on a local machine.
  - In Cloud Shell, run `az login` and re-run `terraform apply`.
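To see how much of your vCPU quota you are already using in a region before applying, one option is the command below (the region here is illustrative):

```bash
az vm list-usage --location westus2 --output table
```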
Requirements:

Name | Version |
---|---|
terraform | >= 1.3.4 |
azurerm | ~>3.48.0 |
kubernetes | ~>2.19.0 |
Providers:

Name | Version |
---|---|
azurerm | ~>3.48.0 |
helm | n/a |
kubernetes | ~>2.19.0 |
Modules:

No modules.

Resources:

Name | Type |
---|---|
azurerm_kubernetes_cluster.aks | resource |
azurerm_kubernetes_cluster_node_pool.aks | resource |
azurerm_resource_group.aks | resource |
helm_release.gpu-operator | resource |
helm_release.nim_operator | resource |
kubernetes_namespace_v1.gpu-operator | resource |
kubernetes_namespace_v1.nim-operator | resource |
azurerm_kubernetes_cluster.akscluster | data source |
azurerm_resource_group.existing | data source |
Inputs:

Name | Description | Type | Default | Required |
---|---|---|---|---|
admin_group_object_ids | (Required) A list of Object IDs (GUIDs) of Azure Active Directory groups which should have the Owner role on the cluster. This is not the email address of the group; the GUID can be found in the Azure portal by searching for the AD group. NOTE: You will need the Azure "Owner" role (not "Contributor") to attach an AD role to the Kubernetes cluster. | list(any) | n/a | yes |
cluster_name | The name of the AKS cluster to be created | string | "aks-cluster" | no |
cpu_machine_type | Machine instance type of the AKS CPU node pool | string | "Standard_D16_v5" | no |
cpu_node_pool_count | Count of nodes in the default CPU pool | number | 1 | no |
cpu_node_pool_disk_size | Disk size in GB of nodes in the default CPU pool | number | 100 | no |
cpu_node_pool_max_count | Max count of nodes in the default CPU pool | number | 5 | no |
cpu_node_pool_min_count | Min count of nodes in the default CPU pool | number | 1 | no |
cpu_os_sku | Specifies the OS SKU used by the agent pool. Possible values include: Ubuntu, CBLMariner, Mariner, Windows2019, Windows2022 | string | "Ubuntu" | no |
existing_resource_group_name | The name of an existing resource group the Kubernetes cluster should be deployed into. Defaults to the name of the cluster + "-rg" if none is specified | string | null | no |
gpu_machine_type | Machine instance type of the AKS GPU node pool | string | "Standard_NC6s_v3" | no |
gpu_node_pool_count | Count of nodes in the default GPU pool | number | 2 | no |
gpu_node_pool_disk_size | Disk size in GB of nodes in the default GPU pool | number | 100 | no |
gpu_node_pool_max_count | Max count of nodes in the default GPU pool | number | 5 | no |
gpu_node_pool_min_count | Min count of nodes in the default GPU pool | number | 2 | no |
gpu_operator_driver_version | The NVIDIA driver version deployed with the GPU Operator | string | "550.127.05" | no |
gpu_operator_namespace | The namespace to deploy the NVIDIA GPU Operator into | string | "gpu-operator" | no |
gpu_operator_version | Version of the GPU Operator to be installed | string | "v24.9.0" | no |
gpu_os_sku | Specifies the OS SKU used by the agent pool. Possible values include: Ubuntu, CBLMariner, Mariner, Windows2019, Windows2022 | string | "Ubuntu" | no |
install_gpu_operator | Whether to install the GPU Operator | string | "true" | no |
install_nim_operator | Whether to install the NIM Operator | string | "false" | no |
kubernetes_version | Version of Kubernetes to use. Run 'az aks get-versions --location <location> --output table' to view all available versions | string | "1.30" | no |
location | The region to create resources in | any | n/a | yes |
nim_operator_namespace | The namespace for the NIM Operator deployment | string | "nim-operator" | no |
nim_operator_version | Version of the NIM Operator to deploy | string | "v1.0.0" | no |
Outputs:

Name | Description |
---|---|
client_certificate | n/a |
kube_config | n/a |
kubernetes_cluster_name | n/a |
location | n/a |
resource_group_name | n/a |
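After `terraform apply`, individual outputs can also be read from the CLI (assuming string-valued outputs), for example:

```bash
terraform output -raw kubernetes_cluster_name
terraform output -raw resource_group_name
```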