Commit 9a772f0: Custom Slurm GCP image example

wiktorn committed Jan 6, 2025
1 parent 6b28536 commit 9a772f0

Showing 2 changed files with 320 additions and 2 deletions.
88 changes: 86 additions & 2 deletions examples/README.md
@@ -29,6 +29,7 @@ md_toc github examples/README.md | sed -e "s/\s-\s/ * /"
* [ml-slurm.yaml](#ml-slurmyaml-) ![core-badge]
* [image-builder-v5-legacy.yaml](#image-builder-v5-legacyyaml--) ![core-badge] ![deprecated-badge]
* [image-builder.yaml](#image-builderyaml-) ![core-badge]
  * [image-custom-slurm.yaml](#image-custom-slurmyaml-) ![community-badge]
* [serverless-batch.yaml](#serverless-batchyaml-) ![core-badge]
* [serverless-batch-mpi.yaml](#serverless-batch-mpiyaml-) ![core-badge]
* [pfs-lustre.yaml](#pfs-lustreyaml-) ![core-badge]
@@ -670,7 +671,7 @@ Create the deployment folder from the blueprint:

```text
./gcluster create examples/image-builder-v5-legacy.yaml --vars "project_id=${GOOGLE_CLOUD_PROJECT}"
./gcluster deploy image-builder-001"
./gcluster deploy image-builder-001
```

Follow the on-screen prompts to approve the creation of each deployment group.
@@ -795,7 +796,7 @@ Create the deployment folder from the blueprint:

```text
./gcluster create examples/image-builder.yaml --vars "project_id=${GOOGLE_CLOUD_PROJECT}"
./gcluster deploy image-builder-v6-001"
./gcluster deploy image-builder-v6-001
```

Follow the on-screen prompts to approve the creation of each deployment group.
@@ -885,6 +886,89 @@ partition is using the custom image. Each compute node should contain the

For this example the following is needed in the selected region:

* Compute Engine API: Images (global, not regional quota): 1 image per invocation of `packer build`
* Compute Engine API: Persistent Disk SSD (GB): **~50 GB**
* Compute Engine API: Persistent Disk Standard (GB): **~64 GB static + 32
  GB/node**, up to 704 GB at the default maximum of 20 compute nodes
* Compute Engine API: N2 CPUs: **4** (for the short-lived Packer VM and the Slurm login node)
* Compute Engine API: C2 CPUs: **4** for the controller node and **60/node** active
  in the `compute` partition, up to 1,204 (4 + 60 × 20)
* Compute Engine API: Affinity Groups: **one for each job in parallel** - _only
  needed for the `compute` partition_
* Compute Engine API: Resource policies: **one for each job in parallel** -
  _only needed for the `compute` partition_

### [image-custom-slurm.yaml] ![community-badge]

This blueprint uses the [Packer template module][pkr] to create a custom VM
image and uses it to provision an HPC cluster using the Slurm scheduler. It
is a variation of the [image-builder.yaml] blueprint that allows building on
top of a different base image (for example, Red Hat Enterprise Linux; see the
snippet after the list below).

It differs from [image-builder.yaml] in the following aspects:

1. It creates an intermediary image with updated packages (see
   [Intermediary Image](#intermediary-image-deployment-group-2)).
2. It runs the [slurm-gcp] Ansible playbooks (see
   [Custom Image](#custom-image-deployment-group-3)).
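
To build on a different base image, override the `base_image` variables in the
blueprint's `vars` block, which already carries commented-out alternatives. A
minimal sketch for Red Hat Enterprise Linux 9, using the `rhel-cloud` project
and `rhel-9` family values that appear commented out in the blueprint:

```yaml
vars:
  base_image:
    # vendor- or organization-provided image to build on top of
    project: rhel-cloud
    family: rhel-9
  # pick a slurm-gcp git ref compatible with the chosen OS family
  build_slurm_from_git_ref: rockylinux_9
```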

#### Building and using the custom image

Create the deployment folder from the blueprint:

```text
./gcluster create examples/image-custom-slurm.yaml --vars "project_id=${GOOGLE_CLOUD_PROJECT}"
./gcluster deploy image-custom-slurm-001
```

Follow the on-screen prompts to approve the creation of each deployment group.
For example, the network is created in the first deployment group, the
intermediary and custom VM images are created in the second and third groups,
and the fourth group uses the custom image to create an HPC cluster using the
Slurm scheduler.

When you are done, clean up the resources in reverse order of creation:

```text
terraform -chdir=image-custom-slurm-001/cluster destroy --auto-approve
terraform -chdir=image-custom-slurm-001/primary destroy --auto-approve
```

Finally, browse to the [Cloud Console][console-images] to delete your custom
images. They will be named beginning with `my-slurm-image` and
`base-image-updated`, followed by a date and timestamp for uniqueness.
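
Alternatively, the images can be listed and removed from the command line (a
sketch, assuming the `gcloud` CLI is authenticated against the same project;
`IMAGE_NAME` is a placeholder for a name returned by the list command):

```text
gcloud compute images list --project "${GOOGLE_CLOUD_PROJECT}" \
    --filter="family=my-slurm-image OR family=base-image-updated"
gcloud compute images delete IMAGE_NAME --project "${GOOGLE_CLOUD_PROJECT}"
```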

[console-images]: https://console.cloud.google.com/compute/images

#### Why use a different base image?

This is targeted at users who are required to use images provided by their
organization for Compute Engine. The base image must have the
[Google Guest Environment] installed and [Guest OS Features] enabled that
match your use case.
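
To check a candidate base image up front (a hedged example using the
blueprint's default public image family; substitute your organization's image
project and family), inspect its guest OS features with `gcloud`:

```text
gcloud compute images describe-from-family hpc-rocky-linux-8 \
    --project cloud-hpc-image-public --format="yaml(name,guestOsFeatures)"
```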

[image-custom-slurm.yaml]: ./image-custom-slurm.yaml
[slurm-gcp]: https://github.com/GoogleCloudPlatform/slurm-gcp
[Google Guest Environment]: https://cloud.google.com/compute/docs/images/install-guest-environment#installing_guest_environment
[Guest OS Features]: https://cloud.google.com/compute/docs/images/create-custom#guest-os-features

#### Intermediary image (deployment group 2)

The Packer module uses the startup-script module from the first deployment group
and executes the script to produce an intermediary image with all packages
updated and [git](https://git-scm.com/) installed, which is required to run the
[slurm-gcp] Ansible playbooks.

This step ensures that the [slurm-gcp] playbooks run against the current
kernel version, so DKMS modules can be installed without a reboot.
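
Once the second deployment group completes, the intermediary image can be
confirmed from the command line (a sketch using the image family defined by
the blueprint's `intermediary_image` variable):

```text
gcloud compute images describe-from-family base-image-updated \
    --project "${GOOGLE_CLOUD_PROJECT}"
```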

#### Custom image (deployment group 3)

The Packer module uses the startup-script module from the first deployment group
and executes the script to produce a final image with [slurm-gcp] installed.
This script can be further extended to add customizations as required, as
sketched below.
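
For example, a hypothetical extra runner appended to the `build-image` module's
`runners` list could bake additional site packages into the final image (the
runner name and package list here are only illustrative, not part of the
blueprint):

```yaml
- type: shell
  destination: install_site_tools.sh
  content: |
    #!/bin/bash
    set -e -o pipefail
    # illustrative site customization; adjust packages to your needs
    dnf install -y htop tmux
```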

#### Quota Requirements for image-custom-slurm.yaml

For this example the following is needed in the selected region:

* Compute Engine API: Images (global, not regional quota): 1 image per invocation of `packer build`
* Compute Engine API: Persistent Disk SSD (GB): **~50 GB**
* Compute Engine API: Persistent Disk Standard (GB): **~64 GB static + 32
  GB/node** up to 704 GB
234 changes: 234 additions & 0 deletions examples/image-custom-slurm.yaml
@@ -0,0 +1,234 @@
# Copyright 2024 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#      http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

---

# See instructions at
# https://github.com/GoogleCloudPlatform/hpc-toolkit/tree/main/examples#image-custom-slurmyaml-

blueprint_name: image-custom-slurm

vars:
  project_id:  # ## Set GCP Project ID Here ##
  deployment_name: image-custom-slurm-001
  region: us-central1
  zone: us-central1-c
  # Image settings
  base_image:
    # project: rhel-cloud
    # family: rhel-9
    # project: rocky-linux-cloud
    # family: rocky-linux-9
    family: hpc-rocky-linux-8
    project: cloud-hpc-image-public
  image_build_machine_type: n2-standard-16
  build_slurm_from_git_ref: rockylinux_9
  intermediary_image:
    family: base-image-updated
  custom_image:
    family: my-slurm-image
    project: $(vars.project_id)
  disk_size_gb: 32

# Documentation for each of the modules used below can be found at
# https://github.com/GoogleCloudPlatform/hpc-toolkit/blob/main/modules/README.md

deployment_groups:
- group: primary
  modules:
  - id: slurm-image-network
    source: modules/network/vpc

  - id: imagebuilder_sa
    source: community/modules/project/service-account
    settings:
      name: builder
      display_name: image builder
      project_roles:
      - compute.instanceAdmin.v1
      - iam.serviceAccountUser
      - logging.logWriter
      - monitoring.metricWriter
      - storage.objectViewer

  - id: prepare-image
    source: modules/scripts/startup-script
    settings:
      install_ansible: true
      runners:
      - type: ansible-local
        destination: update_os.yml
        content: |
          ---
          - name: Update all packages
            hosts: all
            become: true
            gather_facts: true
            tasks:
            - name: Update all packages
              ansible.builtin.package:
                name: "*"
                state: latest
            - name: Install git
              ansible.builtin.package:
                name: git
                state: latest
            - name: Install selinux python package
              ansible.builtin.pip:
                name: selinux
                executable: "{{ ansible_python.executable | dirname }}/pip3"

  - id: build-image
    source: modules/scripts/startup-script
    settings:
      runners:
      - type: data
        destination: /tmp/slurm_vars.json
        content: |
          {
            "reboot": false,
            "install_cuda": true,
            "install_gcsfuse": true,
            "install_lustre": true,
            "install_ompi": true,
            "update_kernel": false,
            "monitoring_agent": "cloud-ops"
          }
      - type: shell
        destination: install_slurm.sh
        content: |
          #!/bin/bash
          set -e -o pipefail
          ansible-pull \
            -U https://github.com/wiktorn/slurm-gcp/ -C $(vars.build_slurm_from_git_ref) \
            -i localhost, --limit localhost --connection=local \
            -e @/tmp/slurm_vars.json \
            ansible/playbook.yml

# Create intermediary image with all packages updated, so the slurm-gcp build
# will run on the current kernel and there will be no issues installing DKMS
# modules. Hence, we can set `reboot` and `update_kernel` to false.
- group: image-prepare
  modules:
  - id: base-image-updated
    source: modules/packer/custom-image
    kind: packer
    settings:
      disk_size: $(vars.disk_size_gb)
      machine_type: $(vars.image_build_machine_type)
      source_image_family: $(vars.base_image.family)
      source_image_project_id: [$(vars.base_image.project)]
      image_family: $(vars.intermediary_image.family)
      service_account_email: $(imagebuilder_sa.service_account_email)
      service_account_scopes: ["https://www.googleapis.com/auth/cloud-platform"]
    use:
    - slurm-image-network
    - prepare-image

- group: image
  modules:
  - id: slurm-image
    source: modules/packer/custom-image
    kind: packer
    settings:
      disk_size: $(vars.disk_size_gb)
      machine_type: $(vars.image_build_machine_type)
      source_image_family: $(vars.intermediary_image.family)
      source_image_project_id: [$(vars.project_id)]
      image_family: $(vars.custom_image.family)
      service_account_email: $(imagebuilder_sa.service_account_email)
      service_account_scopes: ["https://www.googleapis.com/auth/cloud-platform"]
    use:
    - slurm-image-network
    - build-image

- group: cluster
  modules:
  - id: controller_sa
    source: community/modules/project/service-account
    settings:
      name: ctrlr
      display_name: Slurm Controller
      project_roles:
      - compute.instanceAdmin.v1
      - iam.serviceAccountUser
      - logging.logWriter
      - monitoring.metricWriter
      - pubsub.admin
      - storage.objectViewer

  - id: login_sa
    source: community/modules/project/service-account
    settings:
      name: login
      display_name: Slurm Login
      project_roles:
      - logging.logWriter
      - monitoring.metricWriter
      - storage.objectViewer
      - storage.objectUser

  - id: compute_sa
    source: community/modules/project/service-account
    settings:
      name: compute
      display_name: Slurm Compute
      project_roles:
      - logging.logWriter
      - monitoring.metricWriter
      - storage.objectUser
      - storage.objectViewer

  - id: compute_nodeset
    source: community/modules/compute/schedmd-slurm-gcp-v6-nodeset
    use: [slurm-image-network]
    settings:
      node_count_dynamic_max: 20
      disk_size_gb: $(vars.disk_size_gb)
      instance_image: $(vars.custom_image)
      instance_image_custom: true
      bandwidth_tier: gvnic_enabled
      service_account_email: $(compute_sa.service_account_email)
      service_account_scopes: ["https://www.googleapis.com/auth/cloud-platform"]

  - id: compute_partition
    source: community/modules/compute/schedmd-slurm-gcp-v6-partition
    use: [compute_nodeset]
    settings:
      partition_name: compute
      is_default: true

  - id: slurm_login
    source: community/modules/scheduler/schedmd-slurm-gcp-v6-login
    use: [slurm-image-network]
    settings:
      disk_size_gb: $(vars.disk_size_gb)
      instance_image: $(vars.custom_image)
      instance_image_custom: true
      service_account_email: $(login_sa.service_account_email)
      service_account_scopes: ["https://www.googleapis.com/auth/cloud-platform"]

  - id: slurm_controller
    source: community/modules/scheduler/schedmd-slurm-gcp-v6-controller
    use:
    - slurm-image-network
    - compute_partition
    - slurm_login
    settings:
      disk_size_gb: $(vars.disk_size_gb)
      instance_image: $(vars.custom_image)
      instance_image_custom: true
      service_account_email: $(controller_sa.service_account_email)
      service_account_scopes: ["https://www.googleapis.com/auth/cloud-platform"]