Commit

Merge pull request #23 from NVIDIA/magzhang/update-to-k8s-1.28
Updated kubernetes cluster version to 1.28 and update GPU operator to newest
MaggieXJZhang authored Jan 11, 2024
2 parents 53bfca6 + 26b7a3c commit b388bcd
Showing 16 changed files with 73 additions and 87 deletions.
6 changes: 4 additions & 2 deletions CONTRIBUTING.md
@@ -9,9 +9,11 @@

 #### Coding Guidelines

-- All source code contributions must be formatted prior to checkin by running `terraform fmt`, and changes should not break validation checks provided by the `terraform validate` command.
+- All source code contributions must be formatted prior to checkin by running `terraform fmt -recursive`, and changes should not break validation checks provided by the `terraform validate` command.

-- When updating a variable, output, or provider, update the documentation accordingly. We use [terraform-docs](https://terraform-docs.io/) to generate the Terraform documentation in each README. Run `terraform-docs markdown .` to generate the documentation.
+- When updating a variable, output, or provider, update the documentation accordingly. We use [terraform-docs](https://terraform-docs.io/) to generate the Terraform documentation.
+- Run `terraform-docs markdown .` to generate the documentation and replace the bottom of the README section.
+- Run `terraform-docs tfvars hcl .` to generate the tfvar defaults, replace the existing comments of `terraform.tfvars` file.

 - In addition, please follow the existing conventions in the relevant file, submodule, module, and project when you add new code or when you extend/fix existing functionality.
 - Avoid introducing unnecessary complexity into existing code so that maintainability and readability are preserved.
1 change: 1 addition & 0 deletions README.md
@@ -24,6 +24,7 @@ Each CSP has its own end of life date for the versions of Kubernetes they suppor

| Version | Release Date | Kubernetes Versions | NVIDIA GPU Operator | NVIDIA Data Center Driver* | End of Life |
| :--- | :--- | :--- | :--- | :--- | :--- |
+| 0.6.0 | January 2024 | EKS - 1.28 <br> GKE - 1.28 <br> AKS - 1.28 | 23.9.1 (Default); 23.9.0 (NV AI E) | 535.129.03 (EKS & GKE Default); 535.129.03 (NV AI E version for GKE & EKS) | EKS - Nov 2024 <br> GKE - Nov 2024 <br> AKS - Nov 2024 |
| 0.5.0 | November 2023 | EKS - 1.27 <br> GKE - 1.27 <br> AKS - 1.27 | 23.6.1 (Default); 23.3.2 (NV AI E) | 535.104.05 (EKS & GKE Default); 525.125.06 (NV AI E version for GKE & EKS) | EKS - July 2024 <br> GKE - August 2024 <br> AKS - July 2024 |
| 0.4.0 | October 2023 | EKS - 1.27 <br> GKE - 1.27 <br> AKS - 1.27 | 23.6.1 (Default); 23.3.2 (NV AI E) | 535.104.05 (EKS & GKE Default); 525.125.06 (NV AI E version for GKE & EKS) | EKS - July 2024 <br> GKE - August 2024 <br> AKS - July 2024 |
| 0.3.0 | September 2023 | EKS - 1.26 <br> GKE - 1.26 <br> AKS - 1.26 | 23.6.1 (Default); 23.3.2 (NV AI E) | 535.54.03 (EKS & GKE Default); 525.125.06 (NV AI E version for GKE & EKS) | EKS - June 2024 <br> GKE - June 2024 <br> AKS - March 2024 |
7 changes: 3 additions & 4 deletions aks/README.md
@@ -79,7 +79,6 @@ To create a cluster with everything needed to run the Cloud Native Service Add-o
- In Cloud Shell, run `az login` and re-run `terraform apply`



## Requirements

| Name | Version |
@@ -131,12 +130,12 @@ No modules.
| <a name="input_gpu_node_pool_max_count"></a> [gpu\_node\_pool\_max\_count](#input\_gpu\_node\_pool\_max\_count) | Max count of nodes in Default GPU pool | `number` | `5` | no |
| <a name="input_gpu_node_pool_min_count"></a> [gpu\_node\_pool\_min\_count](#input\_gpu\_node\_pool\_min\_count) | Min count of number of nodes in Default GPU pool | `number` | `2` | no |
| <a name="input_gpu_operator_namespace"></a> [gpu\_operator\_namespace](#input\_gpu\_operator\_namespace) | The namespace to deploy the NVIDIA GPU operator into | `string` | `"gpu-operator"` | no |
-| <a name="input_gpu_operator_version"></a> [gpu\_operator\_version](#input\_gpu\_operator\_version) | Version of the GPU operator to be installed | `string` | `"v23.6.1"` | no |
+| <a name="input_gpu_operator_version"></a> [gpu\_operator\_version](#input\_gpu\_operator\_version) | Version of the GPU operator to be installed | `string` | `"v23.9.1"` | no |
| <a name="input_gpu_os_sku"></a> [gpu\_os\_sku](#input\_gpu\_os\_sku) | Specifies the OS SKU used by the agent pool. Possible values include: Ubuntu, CBLMariner, Mariner, Windows2019, Windows2022 | `string` | `"Ubuntu"` | no |
-| <a name="input_kubernetes_version"></a> [kubernetes\_version](#input\_kubernetes\_version) | Version of Kubernetes to turn on. Run 'az aks get-versions --location <location> --output table' to view all available versions | `string` | `"1.27"` | no |
+| <a name="input_kubernetes_version"></a> [kubernetes\_version](#input\_kubernetes\_version) | Version of Kubernetes to turn on. Run 'az aks get-versions --location <location> --output table' to view all available versions | `string` | `"1.28"` | no |
| <a name="input_location"></a> [location](#input\_location) | The region to create resources in | `any` | n/a | yes |
| <a name="input_nvaie"></a> [nvaie](#input\_nvaie) | To use the versions of GPU operator and drivers specified as part of NVIDIA AI Enterprise, set this to true. More information at https://www.nvidia.com/en-us/data-center/products/ai-enterprise | `bool` | `false` | no |
-| <a name="input_nvaie_gpu_operator_version"></a> [nvaie\_gpu\_operator\_version](#input\_nvaie\_gpu\_operator\_version) | The NVIDIA Driver version of GPU Operator. Overrides `gpu_operator_version` when `nvaie` is set to `true` | `string` | `"v23.3.2"` | no |
+| <a name="input_nvaie_gpu_operator_version"></a> [nvaie\_gpu\_operator\_version](#input\_nvaie\_gpu\_operator\_version) | The NVIDIA Driver version of GPU Operator. Overrides `gpu_operator_version` when `nvaie` is set to `true` | `string` | `"v23.9.0"` | no |

## Outputs

8 changes: 4 additions & 4 deletions aks/terraform.tfvars
@@ -1,4 +1,4 @@
-# SPDX-FileCopyrightText: Copyright (c) 2022-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-FileCopyrightText: Copyright (c) 2022-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 # SPDX-License-Identifier: Apache-2.0

 # Sample tfvars file. Uncomment out values to use
@@ -19,9 +19,9 @@
 # gpu_node_pool_max_count = 5
 # gpu_node_pool_min_count = 2
 # gpu_operator_namespace = "gpu-operator"
-# gpu_operator_version = "v23.6.1"
+# gpu_operator_version = "v23.9.1"
 # gpu_os_sku = "Ubuntu"
-# kubernetes_version = "1.26.3"
+# kubernetes_version = "1.28"
 # location = ""
 # nvaie = false
-# nvaie_gpu_operator_version = "v23.3.2"
+# nvaie_gpu_operator_version = "v23.9.0"
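
As an illustration of how the updated defaults might be overridden, a `terraform.tfvars` for the AKS module could look like the sketch below. The `cluster_name` and `location` values are placeholders rather than values from this commit, and the module may require further inputs that this diff does not show.

```hcl
# Hypothetical terraform.tfvars for the aks module.
# "example-aks" and "westus2" are placeholder values, not taken from this commit.
cluster_name = "example-aks"
location     = "westus2"

# Matches the new default. Valid values can be listed with
# `az aks get-versions --location <location> --output table`.
kubernetes_version = "1.28"

# Opt in to the NVIDIA AI Enterprise versions. Per the variable description,
# nvaie_gpu_operator_version then overrides gpu_operator_version.
nvaie                      = true
nvaie_gpu_operator_version = "v23.9.0"
```

Leaving `nvaie` at its default of `false` would instead install the GPU operator version given by `gpu_operator_version` (`v23.9.1` after this change).
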
8 changes: 4 additions & 4 deletions aks/variables.tf
@@ -1,4 +1,4 @@
-# SPDX-FileCopyrightText: Copyright (c) 2022-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-FileCopyrightText: Copyright (c) 2022-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 # SPDX-License-Identifier: Apache-2.0

 /****************************
@@ -25,7 +25,7 @@ variable "cluster_name" {
 }

 variable "kubernetes_version" {
-  default     = "1.27"
+  default     = "1.28"
   description = "Version of Kubernetes to turn on. Run 'az aks get-versions --location <location> --output table' to view all available versions "
 }

@@ -87,7 +87,7 @@ variable "gpu_os_sku" {
 GPU Operator Variables
 ****************************/
 variable "gpu_operator_version" {
-  default     = "v23.6.1"
+  default     = "v23.9.1"
   description = "Version of the GPU operator to be installed"
 }

@@ -105,7 +105,7 @@ variable "nvaie" {

 variable "nvaie_gpu_operator_version" {
   type        = string
-  default     = "v23.3.2"
+  default     = "v23.9.0"
   description = "The NVIDIA Driver version of GPU Operator. Overrides `gpu_operator_version` when `nvaie` is set to `true`"
 }

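The description of `nvaie_gpu_operator_version` says it overrides `gpu_operator_version` when `nvaie` is `true`. The module code that applies this override is not part of this diff. A minimal sketch of one way to express the rule, assuming a hypothetical local named `effective_gpu_operator_version`, might be:

```hcl
# Sketch only: the local name is hypothetical, and the module's real
# implementation is not shown in this commit.
locals {
  # Pick the NVIDIA AI Enterprise version when nvaie is true,
  # otherwise fall back to the regular GPU operator version.
  effective_gpu_operator_version = var.nvaie ? var.nvaie_gpu_operator_version : var.gpu_operator_version
}
```

With the defaults introduced here, this expression would evaluate to `v23.9.1` when `nvaie` is `false` and to `v23.9.0` when it is `true`.
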
20 changes: 0 additions & 20 deletions eks/.terraform.lock.hcl

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

16 changes: 8 additions & 8 deletions eks/README.md
@@ -118,7 +118,7 @@ To create a cluster with everything needed to run the Cloud Native Service Add-o
| <a name="input_aws_profile"></a> [aws\_profile](#input\_aws\_profile) | n/a | `string` | `"development"` | no |
| <a name="input_cidr_block"></a> [cidr\_block](#input\_cidr\_block) | CIDR for VPC | `string` | `"10.0.0.0/16"` | no |
| <a name="input_cluster_name"></a> [cluster\_name](#input\_cluster\_name) | n/a | `string` | n/a | yes |
-| <a name="input_cluster_version"></a> [cluster\_version](#input\_cluster\_version) | Version of EKS to install on the control plane (Major and Minor version only, do not include the patch) | `string` | `"1.27"` | no |
+| <a name="input_cluster_version"></a> [cluster\_version](#input\_cluster\_version) | Version of EKS to install on the control plane (Major and Minor version only, do not include the patch) | `string` | `"1.28"` | no |
| <a name="input_cpu_instance_type"></a> [cpu\_instance\_type](#input\_cpu\_instance\_type) | CPU EC2 worker node instance type | `string` | `"t2.xlarge"` | no |
| <a name="input_cpu_node_pool_additional_user_data"></a> [cpu\_node\_pool\_additional\_user\_data](#input\_cpu\_node\_pool\_additional\_user\_data) | User data that is appended to the user data script after of the EKS bootstrap script on EKS-managed GPU node pool. | `string` | `""` | no |
| <a name="input_cpu_node_pool_delete_on_termination"></a> [cpu\_node\_pool\_delete\_on\_termination](#input\_cpu\_node\_pool\_delete\_on\_termination) | Delete the VM nodes root filesystem on each node of the instance type. This is set to true by default, but can be changed when desired when using the 'local-storage provisioner' and are keeping important application data on the nodes | `bool` | `true` | no |
@@ -136,18 +136,18 @@ To create a cluster with everything needed to run the Cloud Native Service Add-o
| <a name="input_gpu_node_pool_delete_on_termination"></a> [gpu\_node\_pool\_delete\_on\_termination](#input\_gpu\_node\_pool\_delete\_on\_termination) | Delete the VM nodes root filesystem on each node of the instance type. This is set to true by default, but can be changed when desired when using the 'local-storage provisioner' and are keeping important application data on the nodes | `bool` | `true` | no |
| <a name="input_gpu_node_pool_root_disk_size_gb"></a> [gpu\_node\_pool\_root\_disk\_size\_gb](#input\_gpu\_node\_pool\_root\_disk\_size\_gb) | The size of the root disk on all GPU nodes in the EKS-managed GPU-only Node Pool. This is primarily for container image storage on the node | `number` | `512` | no |
| <a name="input_gpu_node_pool_root_volume_type"></a> [gpu\_node\_pool\_root\_volume\_type](#input\_gpu\_node\_pool\_root\_volume\_type) | The type of disk to use for the GPU node pool root disk (eg. gp2, gp3). Note, this is different from the type of disk used by applications via EKS Storage classes/PVs & PVCs | `string` | `"gp2"` | no |
-| <a name="input_gpu_operator_driver_version"></a> [gpu\_operator\_driver\_version](#input\_gpu\_operator\_driver\_version) | The NVIDIA Driver version deployed with GPU Operator. Defaults to latest available. Not set when `nvaie` is set to true | `string` | `"535.104.05"` | no |
+| <a name="input_gpu_operator_driver_version"></a> [gpu\_operator\_driver\_version](#input\_gpu\_operator\_driver\_version) | The NVIDIA Driver version deployed with GPU Operator. Defaults to latest available. Not set when `nvaie` is set to true | `string` | `"535.129.03"` | no |
| <a name="input_gpu_operator_namespace"></a> [gpu\_operator\_namespace](#input\_gpu\_operator\_namespace) | The namespace for the GPU operator deployment | `string` | `"gpu-operator"` | no |
-| <a name="input_gpu_operator_version"></a> [gpu\_operator\_version](#input\_gpu\_operator\_version) | Version of the GPU Operator to deploy. Defaults to latest available. Not set when `nvaie` is set to `true` | `string` | `"v23.6.1"` | no |
+| <a name="input_gpu_operator_version"></a> [gpu\_operator\_version](#input\_gpu\_operator\_version) | Version of the GPU Operator to deploy. Defaults to latest available. Not set when `nvaie` is set to `true` | `string` | `"v23.9.1"` | no |
| <a name="input_max_cpu_nodes"></a> [max\_cpu\_nodes](#input\_max\_cpu\_nodes) | Maximum number of CPU nodes in the Autoscaling Group | `string` | `"2"` | no |
| <a name="input_max_gpu_nodes"></a> [max\_gpu\_nodes](#input\_max\_gpu\_nodes) | Maximum number of GPU nodes in the Autoscaling Group | `string` | `"5"` | no |
| <a name="input_min_cpu_nodes"></a> [min\_cpu\_nodes](#input\_min\_cpu\_nodes) | Minimum number of CPU nodes in the Autoscaling Group | `string` | `"0"` | no |
| <a name="input_min_gpu_nodes"></a> [min\_gpu\_nodes](#input\_min\_gpu\_nodes) | Minimum number of GPU nodes in the Autoscaling Group | `string` | `"2"` | no |
| <a name="input_nvaie"></a> [nvaie](#input\_nvaie) | To use the versions of GPU operator and drivers specified as part of NVIDIA AI Enterprise, set this to true. More information at https://www.nvidia.com/en-us/data-center/products/ai-enterprise | `bool` | `false` | no |
-| <a name="input_nvaie_gpu_operator_driver_version"></a> [nvaie\_gpu\_operator\_driver\_version](#input\_nvaie\_gpu\_operator\_driver\_version) | The NVIDIA AI Enterprise version of the NVIDIA driver to be installed with the GPU operator. Overrides `gpu_operator_driver_version` when `nvaie` is set to `true` | `string` | `"525.125.06"` | no |
-| <a name="input_nvaie_gpu_operator_version"></a> [nvaie\_gpu\_operator\_version](#input\_nvaie\_gpu\_operator\_version) | The NVIDIA Driver version of GPU Operator. Overrides `gpu_operator_version` when `nvaie` is set to `true` | `string` | `"v23.3.2"` | no |
-| <a name="input_private_subnets"></a> [private\_subnets](#input\_private\_subnets) | List of subnet ranges for the Holoscan VPC | `list(any)` | <pre>[<br> "10.0.1.0/24",<br> "10.0.2.0/24",<br> "10.0.3.0/24"<br>]</pre> | no |
-| <a name="input_public_subnets"></a> [public\_subnets](#input\_public\_subnets) | List of subnet ranges for the Holoscan VPC | `list(any)` | <pre>[<br> "10.0.4.0/24",<br> "10.0.5.0/24",<br> "10.0.6.0/24"<br>]</pre> | no |
+| <a name="input_nvaie_gpu_operator_driver_version"></a> [nvaie\_gpu\_operator\_driver\_version](#input\_nvaie\_gpu\_operator\_driver\_version) | The NVIDIA AI Enterprise version of the NVIDIA driver to be installed with the GPU operator. Overrides `gpu_operator_driver_version` when `nvaie` is set to `true` | `string` | `"535.129.03"` | no |
+| <a name="input_nvaie_gpu_operator_version"></a> [nvaie\_gpu\_operator\_version](#input\_nvaie\_gpu\_operator\_version) | The NVIDIA Driver version of GPU Operator. Overrides `gpu_operator_version` when `nvaie` is set to `true` | `string` | `"v23.9.0"` | no |
+| <a name="input_private_subnets"></a> [private\_subnets](#input\_private\_subnets) | List of subnet ranges for the Holoscan VPC | `list(any)` | <pre>[<br> "10.0.0.0/19",<br> "10.0.32.0/19",<br> "10.0.64.0/19"<br>]</pre> | no |
+| <a name="input_public_subnets"></a> [public\_subnets](#input\_public\_subnets) | List of subnet ranges for the Holoscan VPC | `list(any)` | <pre>[<br> "10.0.96.0/19",<br> "10.0.128.0/19",<br> "10.0.160.0/19"<br>]</pre> | no |
| <a name="input_region"></a> [region](#input\_region) | AWS region to provision the Holoscan Compliant Kubernetes Cluster | `string` | `"us-west-2"` | no |
| <a name="input_single_nat_gateway"></a> [single\_nat\_gateway](#input\_single\_nat\_gateway) | Should be true if you want to provision a single shared NAT Gateway across all of your private networks | `bool` | `false` | no |
| <a name="input_ssh_key"></a> [ssh\_key](#input\_ssh\_key) | n/a | `string` | `""` | no |
@@ -166,4 +166,4 @@ To create a cluster with everything needed to run the Cloud Native Service Add-o
| <a name="output_nodes"></a> [nodes](#output\_nodes) | n/a |
| <a name="output_oidc_endpoint"></a> [oidc\_endpoint](#output\_oidc\_endpoint) | n/a |
| <a name="output_private_subnet_ids"></a> [private\_subnet\_ids](#output\_private\_subnet\_ids) | n/a |
-| <a name="output_public_subnet_ids"></a> [public\_subnet\_ids](#output\_public\_subnet\_ids) | n/a |
+| <a name="output_public_subnet_ids"></a> [public\_subnet\_ids](#output\_public\_subnet\_ids) | n/a |
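
The new default subnets are six consecutive /19 blocks carved out of the default `cidr_block` of `10.0.0.0/16`. The commit hard-codes them as literals. Purely as an illustration, the same ranges can be derived with Terraform's built-in `cidrsubnets` and `slice` functions. The local names below are hypothetical and do not appear in the module.

```hcl
locals {
  # Same value as the default of the cidr_block variable.
  vpc_cidr = "10.0.0.0/16"

  # Adding 3 bits to a /16 prefix yields /19 blocks; six are requested,
  # allocated consecutively from the start of the VPC range.
  subnets = cidrsubnets(local.vpc_cidr, 3, 3, 3, 3, 3, 3)

  # First three blocks: 10.0.0.0/19, 10.0.32.0/19, 10.0.64.0/19
  private_subnets = slice(local.subnets, 0, 3)

  # Next three blocks: 10.0.96.0/19, 10.0.128.0/19, 10.0.160.0/19
  public_subnets = slice(local.subnets, 3, 6)
}
```

Each /19 block spans 8,192 addresses, compared with 256 for each of the previous /24 defaults.
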
4 changes: 2 additions & 2 deletions eks/examples/cnpack/aws-pca.tf
@@ -7,8 +7,8 @@ AWS Private Cert Authority Config

 // Create AWS Private Cert Authority
 resource "aws_acmpca_certificate_authority" "cnpack-pca" {
-  count = var.pca_enabled ? 1 : 0
-  type = "ROOT"
+  count      = var.pca_enabled ? 1 : 0
+  type       = "ROOT"
   usage_mode = var.pca_short_lived ? "SHORT_LIVED_CERTIFICATE" : "GENERAL_PURPOSE"
   certificate_authority_configuration {
     key_algorithm = "RSA_4096"
31 changes: 12 additions & 19 deletions eks/terraform.tfvars
@@ -1,11 +1,4 @@
-# SPDX-FileCopyrightText: Copyright (c) 2022-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-
-# Sample tfvars file. Uncomment out values to use
-# Do not commit this file to Git with sensitive values
-
-
-# SPDX-FileCopyrightText: Copyright (c) 2022-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-FileCopyrightText: Copyright (c) 2022-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 # SPDX-License-Identifier: Apache-2.0

 # Sample tfvars file. Uncomment out values to use
@@ -17,7 +10,7 @@
 # aws_profile = "development"
 # cidr_block = "10.0.0.0/16"
 # cluster_name = ""
-# cluster_version = "1.26"
+# cluster_version = "1.28"
 # cpu_instance_type = "t2.xlarge"
 # cpu_node_pool_additional_user_data = ""
 # cpu_node_pool_delete_on_termination = true
@@ -35,25 +28,25 @@
 # gpu_node_pool_delete_on_termination = true
 # gpu_node_pool_root_disk_size_gb = 512
 # gpu_node_pool_root_volume_type = "gp2"
-# gpu_operator_driver_version = "535.104.05"
+# gpu_operator_driver_version = "535.129.03"
 # gpu_operator_namespace = "gpu-operator"
-# gpu_operator_version = "v23.6.1"
+# gpu_operator_version = "v23.9.1"
 # max_cpu_nodes = "2"
 # max_gpu_nodes = "5"
 # min_cpu_nodes = "0"
 # min_gpu_nodes = "2"
 # nvaie = false
-# nvaie_gpu_operator_driver_version = "525.125.06"
-# nvaie_gpu_operator_version = "v23.3.2"
+# nvaie_gpu_operator_driver_version = "535.129.03"
+# nvaie_gpu_operator_version = "v23.9.0"
 # private_subnets = [
-# "10.0.1.0/24",
-# "10.0.2.0/24",
-# "10.0.3.0/24"
+# "10.0.0.0/19",
+# "10.0.32.0/19",
+# "10.0.64.0/19"
 # ]
 # public_subnets = [
-# "10.0.4.0/24",
-# "10.0.5.0/24",
-# "10.0.6.0/24"
+# "10.0.96.0/19",
+# "10.0.128.0/19",
+# "10.0.160.0/19"
 # ]
 # region = "us-west-2"
 # single_nat_gateway = false