Add machine type availability checks to slurm-gcp-v6-nodeset #2962


@annuay-google annuay-google commented Aug 21, 2024

Issue

Blueprint YAML files allow users to specify machine type and zone combinations that do not exist. Terraform provisions the infrastructure during ./ghpc deploy, but autoscaling may fail later if the bulk insert APIs cannot find capacity in the specified zones.

Approach

Created a Terraform precondition to verify that the machine type is available in at least one of the specified zones. If it is not, bulk insert has no feasible way to allocate machines, and Terraform exits during plan/apply.
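A minimal sketch of such a precondition follows. Only the resource name `terraform_data.machine_type_zone_validation`, the local `zones_with_machine_type`, and the condition itself appear in the error output in this PR; the variable names, the use of the google provider's `google_compute_machine_types` data source, and the filter string are assumptions made for illustration, and `var.zones` is assumed to already include the deployment's default zone.

```hcl
# Hypothetical inputs; the module's real variable names may differ.
variable "project_id" { type = string }
variable "machine_type" { type = string }
variable "zones" { type = set(string) }

# Look up the machine type in each candidate zone.
# Assumption: the google provider's google_compute_machine_types data source.
data "google_compute_machine_types" "by_zone" {
  for_each = var.zones
  project  = var.project_id
  zone     = each.value
  filter   = "name = \"${var.machine_type}\""
}

locals {
  # Zones whose lookup returned at least one matching machine type.
  zones_with_machine_type = [
    for zone, result in data.google_compute_machine_types.by_zone :
    zone if length(result.machine_types) > 0
  ]
}

resource "terraform_data" "machine_type_zone_validation" {
  lifecycle {
    # Fails plan/apply when no candidate zone offers the machine type.
    precondition {
      condition     = length(local.zones_with_machine_type) > 0
      error_message = "machine type ${var.machine_type} is not available in any of the zones ${jsonencode(var.zones)}"
    }
  }
}
```

The check passes as soon as one zone offers the machine type, so a zone list that mixes valid and invalid zones is still accepted.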

Testing

Ran ./ghpc create and ./ghpc deploy on this configuration. The specified machine type does not exist in either zone, us-central1-b or us-central1-c. Verified that Terraform exits and that the error message is as expected.

config:

vars:
  deployment_name: hpc-slurm
  region: us-central1
  zone: us-central1-b

deployment_groups:
- group: primary
  modules:
  - id: h3_nodeset
    source: community/modules/compute/schedmd-slurm-gcp-v6-nodeset
    use: [network]
    settings:
      node_count_dynamic_max: 20
      # Note that H3 is available in only specific zones. https://cloud.google.com/compute/docs/regions-zones
      machine_type: h3-standard-88
      # H3 does not support pd-ssd and pd-standard
      # https://cloud.google.com/compute/docs/compute-optimized-machines#h3_disks
      disk_type: pd-balanced
      bandwidth_tier: gvnic_enabled
      allow_automatic_updates: false
      zones:
      - us-central1-c

output:

Testing if deployment group hpc-slurm/primary requires adding or changing cloud infrastructure
Error: exit status 1

Error: Resource precondition failed

  on modules/embedded/community/modules/compute/schedmd-slurm-gcp-v6-nodeset/main.tf line 193, in resource "terraform_data" "machine_type_zone_validation":
 193:       condition     = length(local.zones_with_machine_type) > 0
    ├────────────────
    │ local.zones_with_machine_type is empty tuple

machine type h3-standard-88 is not available in any of the zones
["us-central1-b","us-central1-c"]. To list zones in which it is available, run:

gcloud compute machine-types list --filter="name=h3-standard-88"

Hint: terraform plan for deployment group hpc-slurm/primary failed

Added one valid zone (us-central1-a) and verified that the infrastructure is provisioned correctly.

Submission Checklist

Please take the following actions before submitting this pull request.

  • Fork your PR branch from the Toolkit "develop" branch (not main)
  • Test all changes with pre-commit in a local branch
  • Confirm that "make tests" passes all tests
  • Add or modify unit tests to cover code changes
  • Ensure that unit test coverage remains above 80%
  • Update all applicable documentation
  • Follow Cluster Toolkit Contribution guidelines

@annuay-google annuay-google added the release-improvements, release-module-improvements, and release-chore labels and removed the release-improvements and release-module-improvements labels Aug 21, 2024
@annuay-google annuay-google marked this pull request as draft August 21, 2024 15:18
@nick-stroud nick-stroud assigned mr0re1 and unassigned nick-stroud Aug 21, 2024
@annuay-google annuay-google added the release-module-improvements label ("Added to release notes under the "Module Improvements" heading.") and removed the release-chore label Aug 21, 2024
@annuay-google annuay-google marked this pull request as ready for review August 21, 2024 16:35
@mr0re1 mr0re1 assigned annuay-google and unassigned mr0re1 Aug 21, 2024
mr0re1
mr0re1 previously approved these changes Aug 22, 2024
@tpdownes tpdownes dismissed mr0re1’s stale review August 26, 2024 14:02

The boolean logic in the check block does not enforce the correct behavior.

@tpdownes tpdownes assigned annuay-google and unassigned tpdownes Aug 26, 2024
@annuay-google annuay-google requested a review from tpdownes August 28, 2024 13:00

@tpdownes tpdownes left a comment

To make the terraform_data resource and precondition useful, it should be inserted into the terraform resource graph.
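One possible way to do this, sketched under assumptions, is an explicit `depends_on` from something other modules consume. The output name `nodeset`, `local.nodeset`, and the stand-in values below are hypothetical, and the conversation does not state which mechanism the merged change uses.

```hcl
# Hypothetical stand-ins so the fragment is self-contained.
locals {
  zones_with_machine_type = ["us-central1-a"]
  nodeset                 = { machine_type = "h3-standard-88" }
}

resource "terraform_data" "machine_type_zone_validation" {
  lifecycle {
    precondition {
      condition     = length(local.zones_with_machine_type) > 0
      error_message = "machine type is not available in any of the requested zones"
    }
  }
}

# depends_on makes the output, and anything that consumes it, depend on the
# validation resource, so downstream resources are ordered after the check.
output "nodeset" {
  value      = local.nodeset
  depends_on = [terraform_data.machine_type_zone_validation]
}
```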

@tpdownes tpdownes assigned annuay-google and unassigned tpdownes Aug 29, 2024

@tpdownes tpdownes left a comment

Please also change versions.tf to require terraform 1.4
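For context, `terraform_data` is a built-in resource introduced in Terraform 1.4, so older versions cannot evaluate the new check. A minimal sketch of the corresponding constraint is shown here; the exact constraint string used in the module's versions.tf is an assumption.

```hcl
terraform {
  # terraform_data requires Terraform 1.4 or newer.
  required_version = ">= 1.4"
}
```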

@annuay-google

Please also change versions.tf to require terraform 1.4

Done


@tpdownes tpdownes left a comment

Please add one final helpful hint to the user. Then squash your commits and this should be ready!

@tpdownes tpdownes assigned annuay-google and unassigned tpdownes Aug 29, 2024
@annuay-google

Please add one final helpful hint to the user. Then squash your commits and this should be ready!

Done. Verified the error as well

machine type h3-standard-88 is not available in any of the zones
["us-central1-b"]". To list zones in which it is available, run:

gcloud compute machine-types list --filter="name=h3-standard-88"

@annuay-google annuay-google force-pushed the annuay/add-machine-type-availability-checks branch from 6c9bcfc to dcfd1ce Compare August 30, 2024 13:24
@annuay-google annuay-google requested a review from tpdownes August 30, 2024 13:25

@tpdownes tpdownes left a comment

This is a big improvement in user experience. Thank you for the submission!

@tpdownes tpdownes assigned annuay-google and unassigned tpdownes Aug 30, 2024
@annuay-google annuay-google merged commit 715f9f2 into GoogleCloudPlatform:develop Aug 30, 2024
10 of 53 checks passed
@annuay-google annuay-google deleted the annuay/add-machine-type-availability-checks branch August 30, 2024 17:50
@annuay-google annuay-google restored the annuay/add-machine-type-availability-checks branch September 2, 2024 08:03
@rohitramu rohitramu mentioned this pull request Sep 12, 2024