Slurm6. Fix race condition between GCS config files and instances #1932

Merged
mr0re1 merged 1 commit into GoogleCloudPlatform:develop from broken_fs on Nov 9, 2023


@mr0re1 mr0re1 commented Nov 5, 2023

There was a race between the config files stored in GCS and the instances using those files. Fixed by no longer using the SchedMD `slurm_cluster` module and instead replicating (unwrapping) its content within the Toolkit's `schedmd-slurm-gcp-v6-controller` module. This lets us establish a proper `depends_on = [slurm_files]` for the controller instance resource.

**NOTE:** This change made us temporarily remove support for using CloudSQL as the DB backend, since it was a source of circular dependencies with the updated `depends_on`. To be addressed later.

Added Filestore usage to `community/examples/hpc-slurm6.yaml`, as it should work after the fix.
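
For readers unfamiliar with the pattern, here is a minimal sketch of what such an explicit dependency could look like in Terraform. It is not the actual module code: the module path, the `bucket_name` input, and the instance settings are all hypothetical, and it assumes `slurm_files` is a module, so the reference is written as `module.slurm_files`.

```hcl
# Hypothetical sketch: names, paths, and inputs are illustrative assumptions,
# not the contents of schedmd-slurm-gcp-v6-controller.

variable "bucket_name" {
  type        = string
  description = "GCS bucket that holds the Slurm config files (assumed input)."
}

# Writes the Slurm config files to the GCS bucket.
module "slurm_files" {
  source      = "./modules/slurm_files" # assumed local path
  bucket_name = var.bucket_name         # assumed input name
}

# Controller VM that reads those config files at startup.
resource "google_compute_instance" "controller" {
  name         = "slurm-controller"
  machine_type = "n2-standard-4"
  zone         = "us-central1-a"

  boot_disk {
    initialize_params {
      image = "debian-cloud/debian-12"
    }
  }

  network_interface {
    network = "default"
  }

  # Explicit ordering: Terraform will not create the VM until every
  # resource inside module.slurm_files has been created.
  depends_on = [module.slurm_files]
}
```

Without the `depends_on`, Terraform only orders resources by the attribute references between them, so an instance that merely reads the bucket at boot could be created before the files are uploaded.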

@mr0re1 mr0re1 added the release-module-improvements Added to release notes under the "Module Improvements" heading. label Nov 5, 2023
@mr0re1 mr0re1 requested a review from harshthakkar01 November 5, 2023 05:01
@mr0re1 mr0re1 requested review from cboneti and nick-stroud and removed request for harshthakkar01 November 5, 2023 05:14
@mr0re1 mr0re1 assigned nick-stroud and unassigned harshthakkar01 Nov 5, 2023
@nick-stroud nick-stroud assigned mr0re1 and unassigned nick-stroud Nov 9, 2023
@mr0re1 mr0re1 enabled auto-merge November 9, 2023 19:50
@mr0re1 mr0re1 disabled auto-merge November 9, 2023 19:50
@mr0re1 mr0re1 merged commit 52d274b into GoogleCloudPlatform:develop Nov 9, 2023
@mr0re1 mr0re1 deleted the broken_fs branch November 9, 2023 20:56
@mr0re1 mr0re1 restored the broken_fs branch November 9, 2023 20:56