Commit

Add enable-maintenance-reservation flag in slurm to control reservation for scheduled maintenance
harshthakkar01 committed Sep 4, 2024
1 parent c42a4f9 commit 22c0d78
Showing 7 changed files with 71 additions and 40 deletions.
@@ -168,6 +168,7 @@ No modules.
| <a name="input_disk_size_gb"></a> [disk\_size\_gb](#input\_disk\_size\_gb) | Size of boot disk to create for the partition compute nodes. | `number` | `50` | no |
| <a name="input_disk_type"></a> [disk\_type](#input\_disk\_type) | Boot disk type, can be either hyperdisk-balanced, pd-ssd, pd-standard, pd-balanced, or pd-extreme. | `string` | `"pd-standard"` | no |
| <a name="input_enable_confidential_vm"></a> [enable\_confidential\_vm](#input\_enable\_confidential\_vm) | Enable the Confidential VM configuration. Note: the instance image must support option. | `bool` | `false` | no |
| <a name="input_enable_maintenance_reservation"></a> [enable\_maintenance\_reservation](#input\_enable\_maintenance\_reservation) | Enables slurm reservation for scheduled maintenance. | `bool` | `true` | no |
| <a name="input_enable_oslogin"></a> [enable\_oslogin](#input\_enable\_oslogin) | Enables Google Cloud os-login for user login and authentication for VMs.<br>See https://cloud.google.com/compute/docs/oslogin | `bool` | `true` | no |
| <a name="input_enable_placement"></a> [enable\_placement](#input\_enable\_placement) | Enable placement groups. | `bool` | `true` | no |
| <a name="input_enable_public_ips"></a> [enable\_public\_ips](#input\_enable\_public\_ips) | If set to true. The node group VMs will have a random public IP assigned to it. Ignored if access\_config is set. | `bool` | `false` | no |

@@ -105,6 +105,8 @@ locals {

startup_script = local.ghpc_startup_script
network_storage = var.network_storage

enable_maintenance_reservation = var.enable_maintenance_reservation
}
}

@@ -505,3 +505,10 @@ variable "instance_properties" {
type = any
default = null
}


variable "enable_maintenance_reservation" {
type = bool
description = "Enables slurm reservation for scheduled maintenance."
default = true
}

@@ -271,7 +271,7 @@ limitations under the License.
| <a name="input_metadata"></a> [metadata](#input\_metadata) | Metadata, provided as a map. | `map(string)` | `{}` | no |
| <a name="input_min_cpu_platform"></a> [min\_cpu\_platform](#input\_min\_cpu\_platform) | Specifies a minimum CPU platform. Applicable values are the friendly names of<br>CPU platforms, such as Intel Haswell or Intel Skylake. See the complete list:<br>https://cloud.google.com/compute/docs/instances/specify-min-cpu-platform | `string` | `null` | no |
| <a name="input_network_storage"></a> [network\_storage](#input\_network\_storage) | An array of network attached storage mounts to be configured on all instances. | <pre>list(object({<br> server_ip = string,<br> remote_mount = string,<br> local_mount = string,<br> fs_type = string,<br> mount_options = string,<br> client_install_runner = optional(map(string))<br> mount_runner = optional(map(string))<br> }))</pre> | `[]` | no |
| <a name="input_nodeset"></a> [nodeset](#input\_nodeset) | Define nodesets, as a list. | <pre>list(object({<br> node_count_static = optional(number, 0)<br> node_count_dynamic_max = optional(number, 1)<br> node_conf = optional(map(string), {})<br> nodeset_name = string<br> additional_disks = optional(list(object({<br> disk_name = optional(string)<br> device_name = optional(string)<br> disk_size_gb = optional(number)<br> disk_type = optional(string)<br> disk_labels = optional(map(string), {})<br> auto_delete = optional(bool, true)<br> boot = optional(bool, false)<br> })), [])<br> bandwidth_tier = optional(string, "platform_default")<br> can_ip_forward = optional(bool, false)<br> disable_smt = optional(bool, false)<br> disk_auto_delete = optional(bool, true)<br> disk_labels = optional(map(string), {})<br> disk_size_gb = optional(number)<br> disk_type = optional(string)<br> enable_confidential_vm = optional(bool, false)<br> enable_placement = optional(bool, false)<br> enable_oslogin = optional(bool, true)<br> enable_shielded_vm = optional(bool, false)<br> gpu = optional(object({<br> count = number<br> type = string<br> }))<br> labels = optional(map(string), {})<br> machine_type = optional(string)<br> maintenance_interval = optional(string)<br> instance_properties_json = string<br> metadata = optional(map(string), {})<br> min_cpu_platform = optional(string)<br> network_tier = optional(string, "STANDARD")<br> network_storage = optional(list(object({<br> server_ip = string<br> remote_mount = string<br> local_mount = string<br> fs_type = string<br> mount_options = string<br> client_install_runner = optional(map(string))<br> mount_runner = optional(map(string))<br> })), [])<br> on_host_maintenance = optional(string)<br> preemptible = optional(bool, false)<br> region = optional(string)<br> service_account = optional(object({<br> email = optional(string)<br> scopes = optional(list(string), ["https://www.googleapis.com/auth/cloud-platform"])<br> }))<br> 
shielded_instance_config = optional(object({<br> enable_integrity_monitoring = optional(bool, true)<br> enable_secure_boot = optional(bool, true)<br> enable_vtpm = optional(bool, true)<br> }))<br> source_image_family = optional(string)<br> source_image_project = optional(string)<br> source_image = optional(string)<br> subnetwork_self_link = string<br> additional_networks = optional(list(object({<br> network = string<br> subnetwork = string<br> subnetwork_project = string<br> network_ip = string<br> nic_type = string<br> stack_type = string<br> queue_count = number<br> access_config = list(object({<br> nat_ip = string<br> network_tier = string<br> }))<br> ipv6_access_config = list(object({<br> network_tier = string<br> }))<br> alias_ip_range = list(object({<br> ip_cidr_range = string<br> subnetwork_range_name = string<br> }))<br> })))<br> access_config = optional(list(object({<br> nat_ip = string<br> network_tier = string<br> })))<br> spot = optional(bool, false)<br> tags = optional(list(string), [])<br> termination_action = optional(string)<br> reservation_name = optional(string)<br> startup_script = optional(list(object({<br> filename = string<br> content = string })), [])<br><br> zone_target_shape = string<br> zone_policy_allow = set(string)<br> zone_policy_deny = set(string)<br> }))</pre> | `[]` | no |
| <a name="input_nodeset"></a> [nodeset](#input\_nodeset) | Define nodesets, as a list. | <pre>list(object({<br> node_count_static = optional(number, 0)<br> node_count_dynamic_max = optional(number, 1)<br> node_conf = optional(map(string), {})<br> nodeset_name = string<br> additional_disks = optional(list(object({<br> disk_name = optional(string)<br> device_name = optional(string)<br> disk_size_gb = optional(number)<br> disk_type = optional(string)<br> disk_labels = optional(map(string), {})<br> auto_delete = optional(bool, true)<br> boot = optional(bool, false)<br> })), [])<br> bandwidth_tier = optional(string, "platform_default")<br> can_ip_forward = optional(bool, false)<br> disable_smt = optional(bool, false)<br> disk_auto_delete = optional(bool, true)<br> disk_labels = optional(map(string), {})<br> disk_size_gb = optional(number)<br> disk_type = optional(string)<br> enable_confidential_vm = optional(bool, false)<br> enable_placement = optional(bool, false)<br> enable_oslogin = optional(bool, true)<br> enable_shielded_vm = optional(bool, false)<br> enable_maintenance_reservation = optional(bool, true)<br> gpu = optional(object({<br> count = number<br> type = string<br> }))<br> labels = optional(map(string), {})<br> machine_type = optional(string)<br> maintenance_interval = optional(string)<br> instance_properties_json = string<br> metadata = optional(map(string), {})<br> min_cpu_platform = optional(string)<br> network_tier = optional(string, "STANDARD")<br> network_storage = optional(list(object({<br> server_ip = string<br> remote_mount = string<br> local_mount = string<br> fs_type = string<br> mount_options = string<br> client_install_runner = optional(map(string))<br> mount_runner = optional(map(string))<br> })), [])<br> on_host_maintenance = optional(string)<br> preemptible = optional(bool, false)<br> region = optional(string)<br> service_account = optional(object({<br> email = optional(string)<br> scopes = optional(list(string), 
["https://www.googleapis.com/auth/cloud-platform"])<br> }))<br> shielded_instance_config = optional(object({<br> enable_integrity_monitoring = optional(bool, true)<br> enable_secure_boot = optional(bool, true)<br> enable_vtpm = optional(bool, true)<br> }))<br> source_image_family = optional(string)<br> source_image_project = optional(string)<br> source_image = optional(string)<br> subnetwork_self_link = string<br> additional_networks = optional(list(object({<br> network = string<br> subnetwork = string<br> subnetwork_project = string<br> network_ip = string<br> nic_type = string<br> stack_type = string<br> queue_count = number<br> access_config = list(object({<br> nat_ip = string<br> network_tier = string<br> }))<br> ipv6_access_config = list(object({<br> network_tier = string<br> }))<br> alias_ip_range = list(object({<br> ip_cidr_range = string<br> subnetwork_range_name = string<br> }))<br> })))<br> access_config = optional(list(object({<br> nat_ip = string<br> network_tier = string<br> })))<br> spot = optional(bool, false)<br> tags = optional(list(string), [])<br> termination_action = optional(string)<br> reservation_name = optional(string)<br> startup_script = optional(list(object({<br> filename = string<br> content = string })), [])<br><br> zone_target_shape = string<br> zone_policy_allow = set(string)<br> zone_policy_deny = set(string)<br> }))</pre> | `[]` | no |
| <a name="input_nodeset_dyn"></a> [nodeset\_dyn](#input\_nodeset\_dyn) | Defines dynamic nodesets, as a list. | <pre>list(object({<br> nodeset_name = string<br> nodeset_feature = string<br> }))</pre> | `[]` | no |
| <a name="input_nodeset_tpu"></a> [nodeset\_tpu](#input\_nodeset\_tpu) | Define TPU nodesets, as a list. | <pre>list(object({<br> node_count_static = optional(number, 0)<br> node_count_dynamic_max = optional(number, 5)<br> nodeset_name = string<br> enable_public_ip = optional(bool, false)<br> node_type = string<br> accelerator_config = optional(object({<br> topology = string<br> version = string<br> }), {<br> topology = ""<br> version = ""<br> })<br> tf_version = string<br> preemptible = optional(bool, false)<br> preserve_tpu = optional(bool, false)<br> zone = string<br> data_disks = optional(list(string), [])<br> docker_image = optional(string, "")<br> network_storage = optional(list(object({<br> server_ip = string<br> remote_mount = string<br> local_mount = string<br> fs_type = string<br> mount_options = string<br> client_install_runner = optional(map(string))<br> mount_runner = optional(map(string))<br> })), [])<br> subnetwork = string<br> service_account = optional(object({<br> email = optional(string)<br> scopes = optional(list(string), ["https://www.googleapis.com/auth/cloud-platform"])<br> }))<br> project_id = string<br> reserved = optional(string, false)<br> }))</pre> | `[]` | no |
| <a name="input_on_host_maintenance"></a> [on\_host\_maintenance](#input\_on\_host\_maintenance) | Instance availability Policy. | `string` | `"MIGRATE"` | no |

@@ -27,7 +27,7 @@
import yaml
import datetime as dt
from datetime import datetime
from typing import Dict, Tuple
from typing import Dict, Tuple, Set

import util
from util import (
@@ -409,14 +409,14 @@ def reconfigure_slurm():
if lookup().cfg.hybrid:
# terraform handles generating the config.yaml, don't do it here
return

upd, cfg_new = util.fetch_config()
if not upd:
log.debug("No changes in config detected.")
return
log.debug("Changes in config detected. Reconfiguring Slurm now.")
util.update_config(cfg_new)

if lookup().is_controller:
conf.gen_controller_configs(lookup())
log.info("Restarting slurmctld to make changes take effect.")
@@ -454,7 +454,7 @@ def create_reservation(lkp: util.Lookup, reservation_name: str, node: str, start
util.run(f"{lkp.scontrol} create reservation user=slurm starttime={formatted_start_time} duration=180 nodes={node} reservationname={reservation_name}")


def get_slurm_reservation_maintenance(lkp: util.Lookup) -> Dict[str, datetime]:
def get_slurm_reservation_maintenance(lkp: util.Lookup, nodeset_enable_reservation: Set[str]) -> Dict[str, datetime]:
res = util.run(f"{lkp.scontrol} show reservation --json")
all_reservations = json.loads(res.stdout)
reservation_map = {}
@@ -473,27 +473,47 @@ def get_slurm_reservation_maintenance(lkp: util.Lookup) -> Dict[str, datetime]:
if name != f"{nodes}_maintenance":
continue

if nodes not in nodeset_enable_reservation:
continue

reservation_map[name] = datetime.fromtimestamp(time_epoch)

return reservation_map


def get_upcoming_maintenance(lkp: util.Lookup) -> Dict[str, Tuple[str, datetime]]:
def get_upcoming_maintenance(lkp: util.Lookup, nodeset_enable_reservation: Set[str]) -> Dict[str, Tuple[str, datetime]]:
upc_maint_map = {}

for node, properties in lkp.instances().items():
if 'upcomingMaintenance' in properties:
if 'upcomingMaintenance' in properties and node in nodeset_enable_reservation:
start_time = datetime.strptime(properties['upcomingMaintenance']['startTimeWindow']['earliest'], '%Y-%m-%dT%H:%M:%S%z')
upc_maint_map[node + "_maintenance"] = (node, start_time)

return upc_maint_map


def get_nodeset_enable_reservation(lkp: util.Lookup) -> Set[str]:
nodeset_enable_reservation = set()
for nodeset in lkp.cfg.nodeset.values():
if nodeset.enable_maintenance_reservation and nodeset.node_count_static:
static, _ = lkp.nodenames(nodeset)
nodeset_enable_reservation.update(static)

return nodeset_enable_reservation


def sync_maintenance_reservation(lkp: util.Lookup) -> None:
upc_maint_map = get_upcoming_maintenance(lkp) # map reservation_name -> (node_name, time)
nodeset_enable_reservation = get_nodeset_enable_reservation(lkp)
log.info(f"nodeset enabled for reservation for scheduled maintenance: {nodeset_enable_reservation}")

if not nodeset_enable_reservation:
log.debug("no nodesets are enabled for reservation for scheduled maintenance.")
return

upc_maint_map = get_upcoming_maintenance(lkp, nodeset_enable_reservation) # map reservation_name -> (node_name, time)
log.debug(f"upcoming-maintenance-vms: {upc_maint_map}")

curr_reservation_map = get_slurm_reservation_maintenance(lkp) # map reservation_name -> time
curr_reservation_map = get_slurm_reservation_maintenance(lkp, nodeset_enable_reservation) # map reservation_name -> time
log.debug(f"curr-reservation-map: {curr_reservation_map}")

del_reservation = set(curr_reservation_map.keys() - upc_maint_map.keys())
@@ -541,14 +561,13 @@ def main():
except Exception:
log.exception("failed to update topology")

## TODO: Enable reservation for scheduled maintenance.
# try:
# sync_maintenance_reservation(lookup())
# except Exception:
# log.exception("failed to sync slurm reservation for scheduled maintenance")
try:
sync_maintenance_reservation(lookup())
except Exception:
log.exception("failed to sync slurm reservation for scheduled maintenance")

try:
# TODO: it performs 1 to 4 GCS list requests,
# TODO: it performs 1 to 4 GCS list requests,
# use cached version, combine with `_list_config_blobs`
install_custom_scripts(check_hash=True)
except Exception:
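
The reservation sync in this file can be illustrated with a small standalone sketch. It is not the commit's code: the `util.Lookup` object, `scontrol`, and the GCE API are replaced by plain dictionaries, and the names `eligible_nodes`, `upcoming_maintenance`, `plan_reservations`, the `static_nodes` key, and the sample node names are all hypothetical. The "create" side of the set difference is an assumption, since only the delete side is visible in the hunk above.

```python
from datetime import datetime
from typing import Dict, List, Set, Tuple

# Same timestamp format the diff passes to datetime.strptime.
TIME_FORMAT = "%Y-%m-%dT%H:%M:%S%z"


def eligible_nodes(nodesets: List[dict]) -> Set[str]:
    """Collect static nodes of nodesets that opted in to maintenance reservations."""
    nodes: Set[str] = set()
    for ns in nodesets:
        # Mirrors the diff: the flag must be set and the nodeset must have static nodes.
        if ns["enable_maintenance_reservation"] and ns["node_count_static"]:
            nodes.update(ns["static_nodes"])  # hypothetical stand-in for lkp.nodenames()
    return nodes


def upcoming_maintenance(
    instances: Dict[str, dict], enabled: Set[str]
) -> Dict[str, Tuple[str, datetime]]:
    """Map reservation name -> (node, earliest start) for eligible nodes only."""
    result: Dict[str, Tuple[str, datetime]] = {}
    for node, props in instances.items():
        if "upcomingMaintenance" in props and node in enabled:
            earliest = props["upcomingMaintenance"]["startTimeWindow"]["earliest"]
            result[f"{node}_maintenance"] = (node, datetime.strptime(earliest, TIME_FORMAT))
    return result


def plan_reservations(
    current: Dict[str, datetime], upcoming: Dict[str, Tuple[str, datetime]]
) -> Tuple[Set[str], Set[str]]:
    """Return (reservations to delete, reservations to create)."""
    to_delete = set(current.keys() - upcoming.keys())
    to_create = set(upcoming.keys() - current.keys())  # assumed counterpart of the delete set
    return to_delete, to_create


# Made-up sample data: nodeset "a" opts in, nodeset "b" opts out.
window = {"startTimeWindow": {"earliest": "2024-09-10T07:00:00-07:00"}}
nodesets = [
    {"enable_maintenance_reservation": True, "node_count_static": 2,
     "static_nodes": ["a-0", "a-1"]},
    {"enable_maintenance_reservation": False, "node_count_static": 1,
     "static_nodes": ["b-0"]},
]
instances = {
    "a-0": {"upcomingMaintenance": window},
    "a-1": {},
    "b-0": {"upcomingMaintenance": window},
}
current = {"a-1_maintenance": datetime(2024, 9, 1)}

enabled = eligible_nodes(nodesets)
upcoming = upcoming_maintenance(instances, enabled)
to_delete, to_create = plan_reservations(current, upcoming)
```

With this sample data, `b-0` is ignored even though it has pending maintenance, because its nodeset sets `enable_maintenance_reservation` to false. The stale `a-1_maintenance` reservation is scheduled for deletion and `a-0_maintenance` for creation.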

@@ -81,20 +81,21 @@ module "nodeset_cleanup" {

locals {
nodesets = [for name, ns in local.nodeset_map : {
nodeset_name = ns.nodeset_name
node_conf = ns.node_conf
instance_template = module.slurm_nodeset_template[ns.nodeset_name].self_link
node_count_dynamic_max = ns.node_count_dynamic_max
node_count_static = ns.node_count_static
subnetwork = ns.subnetwork_self_link
reservation_name = ns.reservation_name
maintenance_interval = ns.maintenance_interval
instance_properties_json = ns.instance_properties_json
enable_placement = ns.enable_placement
network_storage = ns.network_storage
zone_target_shape = ns.zone_target_shape
zone_policy_allow = ns.zone_policy_allow
zone_policy_deny = ns.zone_policy_deny
nodeset_name = ns.nodeset_name
node_conf = ns.node_conf
instance_template = module.slurm_nodeset_template[ns.nodeset_name].self_link
node_count_dynamic_max = ns.node_count_dynamic_max
node_count_static = ns.node_count_static
subnetwork = ns.subnetwork_self_link
reservation_name = ns.reservation_name
maintenance_interval = ns.maintenance_interval
instance_properties_json = ns.instance_properties_json
enable_placement = ns.enable_placement
network_storage = ns.network_storage
zone_target_shape = ns.zone_target_shape
zone_policy_allow = ns.zone_policy_allow
zone_policy_deny = ns.zone_policy_deny
enable_maintenance_reservation = ns.enable_maintenance_reservation
}]
}
