From 81b1a43ab25d4737b4ff4f2e942241f780bb3b15 Mon Sep 17 00:00:00 2001
From: Nick Stroud
Date: Wed, 16 Feb 2022 14:03:19 -0800
Subject: [PATCH] IN WORK: Documenting quota requirements

---
 README.md          | 14 +++++++++++++-
 examples/README.md | 23 +++++++++++++++++++++++
 2 files changed, 36 insertions(+), 1 deletion(-)

diff --git a/README.md b/README.md
index df4640b3f2..c5484acac0 100644
--- a/README.md
+++ b/README.md
@@ -102,7 +102,9 @@ the `-o` flag as shown in the following example.
 To deploy the blueprint, use terraform in the resource group directory:
 
 > **_NOTE:_** Before you run this for the first time you may need to enable some
-> APIs. See [Enable GCP APIs](#enable-gcp-apis).
+> APIs and possibly request additional quotas. See
+> [Enable GCP APIs](#enable-gcp-apis) and
+> [Small Example Quotas](examples/README.md#hpc-cluster-smallyaml).
 
 ```shell
 cd hpc-cluster-small/primary # From hpc-cluster-small.yaml example
@@ -153,6 +155,16 @@ List of APIs to enable ([instructions](https://cloud.google.com/apis/docs/gettin
 * Cloud Filestore API
 * Cloud Runtime Configuration API - _needed for `high-io` example_
 
+## GCP Quotas
+
+You may need to request additional quota to be able to deploy and use your HPC
+cluster. For example, by default the `SchedMD-slurm-on-gcp-partition` resource
+uses `c2-standard-60` VMs for compute nodes. Default quota for C2 CPUs may be as
+low as 8, which would prevent even a single node from being started.
+
+Required quotas will be based on your custom HPC configuration. Minimum quotas
+have been [documented](examples/README.md#example-configs) for the provided examples.
+
 ## Inspecting the Blueprint
 
 The blueprint is created in the directory matching the provided blueprint_name
diff --git a/examples/README.md b/examples/README.md
index aa26c6d706..847b33d954 100644
--- a/examples/README.md
+++ b/examples/README.md
@@ -34,6 +34,15 @@ uses `c2-standard-60` VMs with placement groups enabled. You may need to
 request additional quota for `C2 CPUs` in the region you are deploying in. You
 can select the compute partition using the `srun -p compute` argument.
 
+Quota required for this example:
+
+* Cloud Filestore API: Basic SSD (Premium) capacity (GB) per region: **3 TB**
+* Compute Engine API: N2 CPUs: **12** - _should be granted by default_
+* Compute Engine API: C2 CPUs: **1200** - _only needed to run in the `compute`
+  partition_
+* Compute Engine API: Affinity Groups: **10** - _only needed to run in the
+  `compute` partition_
+
 ### hpc-cluster-high-io.yaml
 
 Creates a slurm cluster with tiered file systems for higher performance. It
@@ -58,6 +67,20 @@ Similar to the small example, there is a
 [compute partition](#compute-partition) that should be used for any performance
 analysis.
 
+Quota required for this example:
+
+* Cloud Filestore API: Basic SSD (Premium) capacity (GB) per region: **2660 GB**
+* Cloud Filestore API: High Scale SSD capacity (GB) per region: **10240 GiB** - _min
+  quota request is 61440 GiB_
+* Compute Engine API: Persistent Disk SSD (GB): **~14000 GB**
+* Compute Engine API: N2 CPUs: **126**
+* Compute Engine API: C2 CPUs: **12,000** - _only needed to max out the
+  `compute` partition_
+* Compute Engine API: Affinity Groups: **one for each job in parallel** - _only
+  needed to max out the `compute` partition_
+* Compute Engine API: Resource policies: **one for each job in parallel** -
+  _only needed to max out the `compute` partition_
+
 ### Experimental
 
 **omnia-cluster-simple.yaml**: Creates a simple omnia cluster, with an
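
As a quick sanity check before deploying, current regional quota can be inspected with `gcloud`. The commands below are a minimal sketch, assuming the Google Cloud SDK is installed and authenticated against the intended project; `us-central1` is only a placeholder for the region you plan to deploy in.

```shell
# List quota metrics, usage, and limits for a region; look for C2_CPUS,
# N2_CPUS, and SSD_TOTAL_GB when comparing against the minimums above.
# us-central1 is a placeholder region.
gcloud compute regions describe us-central1 --format="yaml(quotas)"

# Enable the Compute Engine and Cloud Filestore APIs for the active project
# before the first deployment (see the API list in the README).
gcloud services enable compute.googleapis.com file.googleapis.com
```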