NVIDIA AKS cluster

Tested on

This module was created and tested on Linux and macOS.

Resources Created

  • Azure Resource Group
  • AKS Cluster
  • 1x CPU node pool (defaults to 1x CPU node -- Standard_D16_v5)
  • 1x GPU node pool (defaults to 2x GPU nodes -- Standard_NC6s_v3, one NVIDIA V100 each)
  • NVIDIA GPU Operator (defaults to v24.9.0)

Prerequisites

  1. Kubectl
  2. Azure CLI
  3. Azure Account & Subscription where you are permitted to create cloud resources
  4. Terraform (CLI)
  5. Azure Kubelogin
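
As a quick sanity check before starting, the prerequisite CLIs listed above can be probed from the shell. This is a minimal sketch; it assumes the tools are installed under their usual binary names (`kubectl`, `az`, `terraform`, `kubelogin`), and the helper name `missing_tools` is made up for this example.

```shell
# Print the name of every tool from the argument list that is not on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Check the CLIs this README lists as prerequisites.
missing=$(missing_tools kubectl az terraform kubelogin)
if [ -n "$missing" ]; then
  echo "Missing prerequisites:" $missing
else
  echo "All prerequisite CLIs found."
fi
```

The check only confirms that the binaries are present; it does not verify versions or that you are logged in to Azure.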

Usage

This module assumes that you have a working terraform binary and active Azure credentials.

No Terraform backend is configured for remote state management, but one can be added. We strongly encourage you to configure remote state before running in production.
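
One possible way to add remote state is Terraform's `azurerm` backend, which keeps the state file in an Azure Storage container. The sketch below writes a `backend.tf` into a directory of your choice. The helper name `write_backend_tf` and all four values inside the backend block are hypothetical placeholders, and the resource group, storage account, and container are assumed to exist already.

```shell
# Write a backend.tf that stores Terraform state in an Azure Storage container.
# Every value in the backend block is a placeholder; replace it with the name
# of a resource group, storage account, and container that already exist.
write_backend_tf() {
  cat > "$1/backend.tf" <<'EOF'
terraform {
  backend "azurerm" {
    resource_group_name  = "tfstate-rg"
    storage_account_name = "tfstateexample"
    container_name       = "tfstate"
    key                  = "aks.terraform.tfstate"
  }
}
EOF
}

# Usage (from the aks directory, before running `terraform init`):
#   write_backend_tf .
```

Once the file is in place, `terraform init` picks up the backend configuration.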

  1. Clone the repo

    git clone https://github.com/NVIDIA/nvidia-terraform-modules.git
    
    cd nvidia-terraform-modules/aks
    
  2. Log in to Azure via the CLI

    • Run the command below to authenticate to your Azure account
    az login
    
  3. Update the terraform.tfvars file. To change a parameter from its default value, uncomment its line and edit the value

    • Update cluster_name, if needed

    • Update location, if needed

    • Add the IDs of the members or groups who should have cluster access to the variable admin_group_object_ids.

      The GUID can be retrieved in the Azure portal by searching for the desired user or group. For more information, refer to Find Object Id

    • Set install_nim_operator to "true" if you want to install NIM Operator

      admin_group_object_ids       = ["xxxxxxx-xxx-xxxx-xxxx-xxxxxxxxxx"]
      cluster_name                 = "aks-cluster"
      # cpu_machine_type             = "Standard_D16_v5"
      # cpu_node_pool_count          = 1
      # cpu_node_pool_disk_size      = 100
      # cpu_node_pool_max_count      = 5
      # cpu_node_pool_min_count      = 1
      # cpu_os_sku                   = "Ubuntu"
      # existing_resource_group_name = ""
      # gpu_machine_type             = "Standard_NC6s_v3"
      # gpu_node_pool_count          = 2
      # gpu_node_pool_disk_size      = 100
      # gpu_node_pool_max_count      = 5
      # gpu_node_pool_min_count      = 1 
      install_gpu_operator         = "true"
      # gpu_operator_namespace       = "gpu-operator"
      # gpu_operator_version         = "v24.9.0"
      # gpu_operator_driver_version  = "550.127.05"
      # install_nim_operator         = "false"
      # nim_operator_version         = "v1.0.0"
      # nim_operator_namespace       = "nim-operator"
      # gpu_os_sku                   = "Ubuntu"
      # kubernetes_version           = "1.30"
      location                     = "westus2"
      
  4. Initialize the module with the command below

    terraform init
    
  5. Run the command below to view the proposed changes

    terraform plan -out tfplan
    
  6. Run the command below to apply the configuration

    terraform apply tfplan
    
  7. Once the cluster is created, run the command below with your AKS cluster name and resource group name to fetch the kubeconfig so that you can run kubectl commands

    az aks get-credentials --resource-group aks-cluster-rg --name aks-cluster
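
Instead of typing the cluster and resource group names by hand in the last step, they can be read from the module's `kubernetes_cluster_name` and `resource_group_name` outputs. The following is a sketch only: the function names `fetch_kubeconfig` and `verify_cluster` are invented for this example, and the namespace check assumes the default `gpu_operator_namespace` of `gpu-operator`.

```shell
# Fetch kubeconfig using the module's own Terraform outputs.
# Run from the directory where `terraform apply` was executed.
fetch_kubeconfig() {
  rg=$(terraform output -raw resource_group_name) || return 1
  name=$(terraform output -raw kubernetes_cluster_name) || return 1
  az aks get-credentials --resource-group "$rg" --name "$name"
}

# Basic checks once kubectl points at the new cluster.
# $1 is the GPU Operator namespace; it falls back to the module default.
verify_cluster() {
  kubectl get nodes -o wide
  kubectl get pods -n "${1:-gpu-operator}"
}

# Usage:
#   fetch_kubeconfig && verify_cluster
```

If the GPU Operator installed correctly, the second command should list its pods in the chosen namespace.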
    

Cleaning up / Deleting resources

  1. Run the commands below to delete all remaining Azure resources created by this module. You should see a "Destroy complete!" message after a few minutes.

    terraform state rm kubernetes_namespace_v1.gpu-operator
    
    terraform state rm kubernetes_namespace_v1.nim-operator
    
    terraform destroy --auto-approve
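
`terraform state rm` exits with an error when the given address matches nothing in state, which may happen for the nim-operator namespace if the NIM Operator was never installed. A hedged sketch of a guard that skips addresses Terraform is not tracking is shown below; the helper names `state_has` and `rm_if_tracked` are made up for this example.

```shell
# Succeed if the address (or an indexed instance of it) appears in state.
state_has() {
  found=$(terraform state list 2>/dev/null | while IFS= read -r addr; do
    case "$addr" in
      "$1" | "$1["*) echo yes ;;
    esac
  done)
  [ -n "$found" ]
}

# Remove the address from state only when Terraform is tracking it.
rm_if_tracked() {
  if state_has "$1"; then
    terraform state rm "$1"
  else
    echo "skip: $1 not in state"
  fi
}

# Usage:
#   rm_if_tracked kubernetes_namespace_v1.gpu-operator
#   rm_if_tracked kubernetes_namespace_v1.nim-operator
#   terraform destroy --auto-approve
```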
    

Running as a module

Call the AKS module by adding this to an existing Terraform file:

module "nvidia-aks" {
  source                 = "git::https://github.com/NVIDIA/nvidia-terraform-modules.git//aks"
  cluster_name           = "nvidia-aks"
  location               = "westus2" # Required input; see Inputs below
  admin_group_object_ids = [] # See below for the value of this variable
}

All configurable options for this module are listed below. If you need additional values added, please open a pull request.
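
The `admin_group_object_ids` value expects object IDs (GUIDs), not group names or email addresses. The sketch below shows one way to look up a group's object ID with the Azure CLI and to sanity-check that a value has GUID shape. The function names and the group name `aks-admins` are hypothetical, and the `id` field in the query is what recent Azure CLI versions return (older releases may expose it as `objectId`).

```shell
# Succeed if the argument looks like a GUID (8-4-4-4-12 hex digits).
is_guid() {
  printf '%s\n' "$1" | grep -Eq '^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$'
}

# Look up a group's object ID by display name (requires `az login`).
group_object_id() {
  az ad group show --group "$1" --query id --output tsv
}

# Usage ("aks-admins" is a placeholder group name):
#   id=$(group_object_id "aks-admins")
#   is_guid "$id" && echo "admin_group_object_ids = [\"$id\"]"
```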

Issues

  • None. If you do encounter an issue, please file a GitHub issue.

Troubleshooting

  • Quota Errors

    New Azure accounts that have not yet used VMs or GPU VMs in a region will need to request quota in that region. During installation, if you see a quota-related error, click the link in the error message to be redirected to the Azure portal with a prepopulated quota request. Re-run terraform apply once the quota request is complete. Each quota request takes about 5 minutes.

  • Azure Cloud Shell Errors

    When using Azure Cloud Shell during installation, if you see an MSI or Bad Request (400) error, it is a known issue with Azure Cloud Shell. There are 2 workarounds:

    • Use Azure CLI on a local machine
    • In Cloud Shell, run az login and re-run terraform apply
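
For the quota errors described above, it can help to estimate how many vCPUs the node pools need before applying. The arithmetic is node count times vCPUs per node; Standard_NC6s_v3 has 6 vCPUs and Standard_D16_v5 has 16. This is a sketch with hypothetical helper names (`required_vcpus`, `show_quota`), and `westus2` is only the example region used in terraform.tfvars.

```shell
# vCPU quota needed = node count x vCPUs per node.
required_vcpus() {
  echo $(( $1 * $2 ))
}

# Module defaults: GPU pool of Standard_NC6s_v3, CPU pool of Standard_D16_v5.
echo "GPU pool at default count (2): $(required_vcpus 2 6) vCPUs"
echo "GPU pool at max count (5):     $(required_vcpus 5 6) vCPUs"
echo "CPU pool at max count (5):     $(required_vcpus 5 16) vCPUs"

# Show current usage and limits for a region (requires `az login`).
show_quota() {
  az vm list-usage --location "${1:-westus2}" --output table
}

# Usage:
#   show_quota westus2
```

Comparing these numbers against the limits in the `az vm list-usage` table shows whether a quota request is likely to be needed.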

Requirements

| Name | Version |
|------|---------|
| terraform | >= 1.3.4 |
| azurerm | ~>3.48.0 |
| kubernetes | ~>2.19.0 |

Providers

| Name | Version |
|------|---------|
| azurerm | ~>3.48.0 |
| helm | n/a |
| kubernetes | ~>2.19.0 |

Modules

No modules.

Resources

| Name | Type |
|------|------|
| azurerm_kubernetes_cluster.aks | resource |
| azurerm_kubernetes_cluster_node_pool.aks | resource |
| azurerm_resource_group.aks | resource |
| helm_release.gpu-operator | resource |
| helm_release.nim_operator | resource |
| kubernetes_namespace_v1.gpu-operator | resource |
| kubernetes_namespace_v1.nim-operator | resource |
| azurerm_kubernetes_cluster.akscluster | data source |
| azurerm_resource_group.existing | data source |

Inputs

| Name | Description | Type | Default | Required |
|------|-------------|------|---------|----------|
| admin_group_object_ids | (Required) A list of Object IDs (GUIDs) of Azure Active Directory Groups which should have Owner Role on the Cluster. This is not the email address of the group; the GUID can be found in the Azure panel by searching for the AD Group. NOTE: You will need Azure "Owner" role (not "Contributor") to attach an AD role to the Kubernetes cluster. | list(any) | n/a | yes |
| cluster_name | The name of the AKS Cluster to be created | string | "aks-cluster" | no |
| cpu_machine_type | Machine instance type of the AKS CPU node pool | string | "Standard_D16_v5" | no |
| cpu_node_pool_count | Count of nodes in Default CPU pool | number | 1 | no |
| cpu_node_pool_disk_size | Disk size in GB of nodes in the Default CPU pool | number | 100 | no |
| cpu_node_pool_max_count | Max count of nodes in Default CPU pool | number | 5 | no |
| cpu_node_pool_min_count | Min count of nodes in Default CPU pool | number | 1 | no |
| cpu_os_sku | Specifies the OS SKU used by the agent pool. Possible values include: Ubuntu, CBLMariner, Mariner, Windows2019, Windows2022 | string | "Ubuntu" | no |
| existing_resource_group_name | The name of an existing resource group the Kubernetes cluster should be deployed into. Defaults to the cluster name followed by -rg if none is specified | string | null | no |
| gpu_machine_type | Machine instance type of the AKS GPU node pool | string | "Standard_NC6s_v3" | no |
| gpu_node_pool_count | Count of nodes in Default GPU pool | number | 2 | no |
| gpu_node_pool_disk_size | Disk size in GB of nodes in the Default GPU pool | number | 100 | no |
| gpu_node_pool_max_count | Max count of nodes in Default GPU pool | number | 5 | no |
| gpu_node_pool_min_count | Min count of nodes in Default GPU pool | number | 2 | no |
| gpu_operator_driver_version | The NVIDIA driver version deployed with GPU Operator | string | "550.127.05" | no |
| gpu_operator_namespace | The namespace to deploy the NVIDIA GPU Operator into | string | "gpu-operator" | no |
| gpu_operator_version | Version of the GPU Operator to be installed | string | "v24.9.0" | no |
| gpu_os_sku | Specifies the OS SKU used by the agent pool. Possible values include: Ubuntu, CBLMariner, Mariner, Windows2019, Windows2022 | string | "Ubuntu" | no |
| install_gpu_operator | Whether to install GPU Operator. Defaults to "true" | string | "true" | no |
| install_nim_operator | Whether to install NIM Operator. Defaults to "false" | string | "false" | no |
| kubernetes_version | Version of Kubernetes to turn on. Run 'az aks get-versions --location <location> --output table' to view all available versions | string | "1.30" | no |
| location | The region to create resources in | any | n/a | yes |
| nim_operator_namespace | The namespace for the NIM Operator deployment | string | "nim-operator" | no |
| nim_operator_version | Version of the NIM Operator to deploy | string | "v1.0.0" | no |
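
Any of the inputs above can be overridden in terraform.tfvars by uncommenting its line and editing the value. As an illustration, the sketch below does this with `sed` in a way that should work on both Linux and macOS. The helper name `set_tfvar` is hypothetical, and it assumes the value contains no `|` or `&` characters.

```shell
# Uncomment (if needed) and set one variable in a tfvars file.
# $1 = file, $2 = variable name, $3 = value exactly as it should appear in HCL.
set_tfvar() {
  file=$1 name=$2 value=$3
  tmp=$(mktemp)
  sed -E "s|^[[:space:]]*#?[[:space:]]*${name}[[:space:]]*=.*|${name} = ${value}|" "$file" > "$tmp" &&
    mv "$tmp" "$file"
}

# Usage (string values need their own double quotes):
#   set_tfvar terraform.tfvars gpu_node_pool_count 1
#   set_tfvar terraform.tfvars install_nim_operator '"true"'
```

Lines for other variables are left untouched, so the file can be edited one input at a time.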

Outputs

| Name | Description |
|------|-------------|
| client_certificate | n/a |
| kube_config | n/a |
| kubernetes_cluster_name | n/a |
| location | n/a |
| resource_group_name | n/a |