NVIDIA AKS cluster

Tested on

This module was created and tested on Linux and macOS.

Resources Created

- Azure Resource Group
- AKS Cluster
- 1x CPU nodepool (defaults to 1x CPU node -- Standard_D16_v5)
- 1x GPU nodepool (defaults to 2x GPU nodes -- Standard_NC6s_v3, 1x V100 each)
- Installs the NVIDIA GPU Operator (defaults to v24.9.0)

Prerequisites

- Kubectl
- Azure CLI
- Azure Account & Subscription where you are permitted to create cloud resources
- Terraform (CLI)
- Azure Kubelogin
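
A quick way to confirm that the command-line prerequisites are installed is to look for each binary on your PATH. The sketch below assumes the tools are installed under their usual binary names (kubectl, az, terraform, kubelogin); the `missing_tools` helper is a hypothetical name used only for illustration.

```shell
# Print the name of every listed tool that is not found on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

missing="$(missing_tools kubectl az terraform kubelogin)"
if [ -n "$missing" ]; then
  echo "Missing prerequisites:"
  echo "$missing"
else
  echo "All prerequisites found."
fi
```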

Usage

This module assumes that you have a working terraform binary and active Azure credentials.

No Terraform backend is configured for remote state management, but one can be added. We strongly encourage you to configure remote state before running in production.
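
As one option, remote state can live in an Azure Storage account through Terraform's azurerm backend. The sketch below writes a backend.tf file into the current directory (run it from the aks directory). The resource group, storage account, container, and key names are placeholders, not values from this repo, and the storage account and container are assumed to exist already.

```shell
# Write a backend configuration for Terraform's azurerm backend.
# All four values are placeholders; replace them with your own.
cat > backend.tf <<'EOF'
terraform {
  backend "azurerm" {
    resource_group_name  = "tfstate-rg"
    storage_account_name = "tfstateexample"
    container_name       = "tfstate"
    key                  = "aks.terraform.tfstate"
  }
}
EOF

# Afterwards, run `terraform init` so Terraform picks up the backend.
echo "Wrote backend.tf"
```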

Clone the repo

git clone https://github.com/NVIDIA/nvidia-terraform-modules.git

cd nvidia-terraform-modules/aks

Logging in to Azure via the CLI
- Run the command below to authenticate to your Azure account:
```
az login
```
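If your account has access to more than one subscription, it may help to confirm which one is active before running Terraform. The helper name `use_subscription` below is hypothetical; it only wraps the standard `az account set` and `az account show` commands.

```shell
# Select a subscription by name or ID, then print the name of the active one.
use_subscription() {
  az account set --subscription "$1" &&
    az account show --query name --output tsv
}

# Usage (replace with your own subscription name or ID):
#   use_subscription "My Subscription"
```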

Update the terraform.tfvars file. To change a parameter from its default value, uncomment its line and edit the value.

- Update cluster_name, if needed
- Update location, if needed
- Add the IDs of the members or groups who should have cluster access to the variable admin_group_object_ids.

The GUID can be retrieved in the Azure portal by searching for the desired user or group. For more information, refer to Find Object Id.
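
The same GUID can usually be looked up from the Azure CLI instead of the portal. This is a minimal sketch, assuming a recent Azure CLI where the group's GUID is returned in the `id` field (older releases may name it `objectId`); `group_object_id` is a hypothetical helper name and "aks-admins" is a placeholder group.

```shell
# Print the object ID (GUID) of a group, looked up by display name.
group_object_id() {
  az ad group show --group "$1" --query id --output tsv
}

# Usage (the group name is a placeholder):
#   group_object_id "aks-admins"
```

The printed value is what goes into the admin_group_object_ids list.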

Set install_nim_operator to true if you want to install the NIM Operator.

admin_group_object_ids       = ["xxxxxxx-xxx-xxxx-xxxx-xxxxxxxxxx"]
cluster_name                 = "aks-cluster"
# cpu_machine_type             = "Standard_D16_v5"
# cpu_node_pool_count          = 1
# cpu_node_pool_disk_size      = 100
# cpu_node_pool_max_count      = 5
# cpu_node_pool_min_count      = 1
# cpu_os_sku                   = "Ubuntu"
# existing_resource_group_name = ""
# gpu_machine_type             = "Standard_NC6s_v3"
# gpu_node_pool_count          = 2
# gpu_node_pool_disk_size      = 100
# gpu_node_pool_max_count      = 5
# gpu_node_pool_min_count      = 1 
install_gpu_operator         = "true"
# gpu_operator_namespace       = "gpu-operator"
# gpu_operator_version         = "v24.9.0"
# gpu_operator_driver_version  = "550.127.05"
# install_nim_operator         = "false"
# nim_operator_version         = "v1.0.0"
# nim_operator_namespace       = "nim-operator"
# gpu_os_sku                   = "Ubuntu"
# kubernetes_version           = "1.30"
location                     = "westus2"

Initialize the module with the command below:
```
terraform init
```
Run the command below to view the proposed changes:
```
terraform plan -out tfplan
```
Run the command below to apply the configuration:
```
terraform apply tfplan
```
Once the cluster is created, run the command below with your AKS cluster name and resource group name to fetch the kubeconfig so that you can run kubectl commands:
```
az aks get-credentials --resource-group aks-cluster-rg --name aks-cluster
```
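After fetching credentials, a few read-only checks can confirm that the nodes joined and that the GPU Operator is running. This sketch assumes the default gpu_operator_namespace ("gpu-operator") and an existing Azure CLI login that kubelogin can reuse; the `check_cluster` function name is hypothetical.

```shell
# Convert the kubeconfig to reuse the Azure CLI login, then list the nodes,
# the allocatable GPUs per node, and the pods in the GPU Operator namespace.
check_cluster() {
  ns="${1:-gpu-operator}"
  kubelogin convert-kubeconfig -l azurecli &&
    kubectl get nodes &&
    kubectl get nodes -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu' &&
    kubectl get pods -n "$ns"
}

# Usage:
#   check_cluster               # uses the default namespace "gpu-operator"
#   check_cluster my-namespace  # if gpu_operator_namespace was changed
```

GPU nodes should show a non-empty GPU column once the operator's device plugin is ready, which can take a few minutes after the cluster comes up.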

Cleaning up / Deleting resources

Run the commands below to delete all remaining Azure resources created by this module. You should see a `Destroy complete!` message after a few minutes.

terraform state rm kubernetes_namespace_v1.gpu-operator

terraform state rm kubernetes_namespace_v1.nim-operator

terraform destroy --auto-approve

Running as a module

Call the AKS module by adding this to an existing Terraform file:

module "nvidia-aks" {
  source                 = "git::github.com/NVIDIA/nvidia-terraform-modules/aks"
  cluster_name           = "nvidia-aks"
  location               = "westus2" # Required input; use your own region
  admin_group_object_ids = [] # See the Inputs table below for the value of this variable
}

All configurable options for this module are listed below. If you need additional values added, please open a pull request.

Issues

None. If you do encounter an issue, please file a GitHub issue.

Troubleshooting

Quota Errors

New Azure accounts that have not yet used VMs or GPU VMs in a region will need to request quota in that region. If you see a quota-related error during installation, click the link in the error message to be redirected to the Azure console with a prepopulated quota request. Re-run terraform apply once the quota request is complete. Each quota request takes about 5 minutes.
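
To check quota before running terraform apply, the Azure CLI can list current usage and limits per VM family in a region. This is a sketch only: `vm_quota` is a hypothetical helper, and the "NCSv3" filter is an assumption about how Azure labels the quota family for Standard_NC6s_v3.

```shell
# List compute quota usage in a region, filtered to one VM family.
vm_quota() {
  az vm list-usage --location "$1" --output table | grep -i "$2"
}

# Usage (region and family filter are examples):
#   vm_quota westus2 NCSv3
```

If the limit shown is lower than the vCPUs the GPU node pool needs, request more quota before applying.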

Azure Cloudshell Errors

When using Azure Cloud Shell during installation, you may see an MSI or Bad Request (400) error. This is a known issue with Azure Cloud Shell. There are two workarounds:
- Use Azure CLI on a local machine
- In Cloud Shell, run az login and re-run terraform apply

Requirements

| Name | Version |
|------|---------|
| terraform | >= 1.3.4 |
| azurerm | ~>3.48.0 |
| kubernetes | ~>2.19.0 |

Providers

| Name | Version |
|------|---------|
| azurerm | ~>3.48.0 |
| helm | n/a |
| kubernetes | ~>2.19.0 |

Modules

No modules.

Resources

| Name | Type |
|------|------|
| azurerm_kubernetes_cluster.aks | resource |
| azurerm_kubernetes_cluster_node_pool.aks | resource |
| azurerm_resource_group.aks | resource |
| helm_release.gpu-operator | resource |
| helm_release.nim_operator | resource |
| kubernetes_namespace_v1.gpu-operator | resource |
| kubernetes_namespace_v1.nim-operator | resource |
| azurerm_kubernetes_cluster.akscluster | data source |
| azurerm_resource_group.existing | data source |

Inputs

| Name | Description | Type | Default | Required |
|------|-------------|------|---------|:--------:|
| admin_group_object_ids | (Required) A list of Object IDs (GUIDs) of Azure Active Directory Groups which should have Owner Role on the Cluster. This is not the email address of the group. The GUID can be found in the Azure panel by searching for the AD Group. NOTE: You will need Azure "Owner" role (not "Contributor") to attach an AD role to the Kubernetes cluster. | `list(any)` | n/a | yes |
| cluster_name | The name of the AKS Cluster to be created | `string` | `"aks-cluster"` | no |
| cpu_machine_type | Machine instance type of the AKS CPU node pool | `string` | `"Standard_D16_v5"` | no |
| cpu_node_pool_count | Count of nodes in Default CPU pool | `number` | `1` | no |
| cpu_node_pool_disk_size | Disk size in GB of nodes in the Default CPU pool | `number` | `100` | no |
| cpu_node_pool_max_count | Max count of nodes in Default CPU pool | `number` | `5` | no |
| cpu_node_pool_min_count | Min count of nodes in Default CPU pool | `number` | `1` | no |
| cpu_os_sku | Specifies the OS SKU used by the agent pool. Possible values include: Ubuntu, CBLMariner, Mariner, Windows2019, Windows2022 | `string` | `"Ubuntu"` | no |
| existing_resource_group_name | The name of an existing resource group the Kubernetes cluster should be deployed into. Defaults to the name of the cluster + `-rg` if none is specified | `string` | `null` | no |
| gpu_machine_type | Machine instance type of the AKS GPU node pool | `string` | `"Standard_NC6s_v3"` | no |
| gpu_node_pool_count | Count of nodes in Default GPU pool | `number` | `2` | no |
| gpu_node_pool_disk_size | Disk size in GB of nodes in the Default GPU pool | `number` | `100` | no |
| gpu_node_pool_max_count | Max count of nodes in Default GPU pool | `number` | `5` | no |
| gpu_node_pool_min_count | Min count of nodes in Default GPU pool | `number` | `2` | no |
| gpu_operator_driver_version | The NVIDIA Driver version deployed with GPU Operator | `string` | `"550.127.05"` | no |
| gpu_operator_namespace | The namespace to deploy the NVIDIA GPU operator into | `string` | `"gpu-operator"` | no |
| gpu_operator_version | Version of the GPU operator to be installed | `string` | `"v24.9.0"` | no |
| gpu_os_sku | Specifies the OS SKU used by the agent pool. Possible values include: Ubuntu, CBLMariner, Mariner, Windows2019, Windows2022 | `string` | `"Ubuntu"` | no |
| install_gpu_operator | Whether to install GPU Operator | `string` | `"true"` | no |
| install_nim_operator | Whether to install NIM Operator | `string` | `"false"` | no |
| kubernetes_version | Version of Kubernetes to turn on. Run `az aks get-versions --location <location> --output table` to view all available versions | `string` | `"1.30"` | no |
| location | The region to create resources in | `any` | n/a | yes |
| nim_operator_namespace | The namespace for the NIM Operator deployment | `string` | `"nim-operator"` | no |
| nim_operator_version | Version of the NIM Operator to deploy | `string` | `"v1.0.0"` | no |

Outputs

| Name | Description |
|------|-------------|
| client_certificate | n/a |
| kube_config | n/a |
| kubernetes_cluster_name | n/a |
| location | n/a |
| resource_group_name | n/a |
