Release v1.45.0
Highlights:
- A3 Ultra GKE blueprints updated to use Kueue 0.10.0 and Jobset 0.7.2, which are now supported.
- Module improvements to support GKE cluster deletion protection, default node pools with shielded instances, the latest GKE version in the Rapid channel for A3 Ultra clusters, configurable upgrade settings for node pools, and managed hyperdisk.
- Example for running NVIDIA NeMo on a3-ultragpu-8g Slurm clusters.
What's Changed
Key New Features 🎉
- Integrating kueue v0.10.0 to enable TAS with rank ordering support by @ighosh98 in #3417
- Add max_distance variable by @alyssa-sm in #3413
- Enable hierarchical namespace support in cloud-storage-bucket module by @SwarnaBharathiMantena in #3513
- Add NeMo framework example to a3-Ultra by @akiki-liang0 in #3477
- Remove Slurm-GCP v5 modules from Cluster Toolkit (refer to modules/README.md) in #3497
Module Improvements 🔨
- Expose cluster deletion protection by @annuay-google in #3392
- Parallelstore striping config by @dgouju in #3333
- Make upgrade settings configurable for gke-cluster by @ighosh98 in #3462
- Add kubectl provider in root module for blueprint with GKE cluster module setup by @mohitchaurasia91 in #3406
- Add GKE support for managed hyperdisk by @chengcongdu in #3476
- Add shielded instance config to default pool by @annuay-google in #3507
- Add support for Redhat 7, 8 and 9 to startup-script Ansible by @wiktorn in #3487
Improvements 🛠
- Enable reservations support for kueue integration tests by @ighosh98 in #3424
- Update Kueue TAS Test Definition and add Kueue v0.10.0 toleration by @ighosh98 in #3425
- Update README with GKE parallelstore related example blueprint details by @mohitchaurasia91 in #3409
- Upgrade a3-ultra to use kueue v0.10.0 by @ighosh98 in #3438
- A3 Ultra Integration tests by @ighosh98 in #3453
- Update A3U blueprint to remove commit refs and remove hardcoded network names by @ighosh98 in #3456
- Add compact placement validations by @parulbajaj01 in #3439
- Allow customization of Parallelstore mounts by @wiktorn in #3144
- Add lifecycle rule to ignore local SSDs by @chajath in #3450
- Update a3mega nccl plugin to 1.0.7 and rxdm to 1.0.13_1 by @chengcongdu in #3466
- Enable optional creation of cloud router/nat for vpcs by @abbas1902 in #3499
- Use version prefix in conjunction with release channels by @annuay-google in #3520
- Bump jobset version to 0.7.2 and remove v0.7.1 as valid version by @ankitkinra in #3517
- Include MemSpecLimit when calculating defmem by @wiktorn in #3300
Deprecations 💤
- Remove slurm-gcp v5 tests by @harshthakkar01 in #3493
- Remove slurm-gcp v5 examples and update documentation by @harshthakkar01 in #3494
- Remove Slurm-gcp v5 modules and update documentation by @harshthakkar01 in #3497
Bug fixes 🐞
- Fix GKE parallelstore blueprint name going beyond network char limit by @mohitchaurasia91 in #3432
- Updated ansible playbook test file name by @mohitchaurasia91 in #3433
- TAS Plugin Bug fix by @ighosh98 in #3449
- Update mount-daos.sh by @samskillman in #3457
- Placement policy null condition checks added by @ighosh98 in #3459
New Contributors
- @SwarnaBharathiMantena made their first contribution in #3377
- @chajath made their first contribution in #3451
- @parulbajaj01 made their first contribution in #3439
Full Changelog: v1.44.2...v1.45.0