IN WORK: Documenting quota requirements
nick-stroud committed Feb 18, 2022
1 parent d2c6699 commit 81b1a43
Showing 2 changed files with 36 additions and 1 deletion.
14 changes: 13 additions & 1 deletion README.md
@@ -102,7 +102,9 @@ the `-o` flag as shown in the following example.
To deploy the blueprint, use terraform in the resource group directory:

> **_NOTE:_** Before you run this for the first time, you may need to enable
> some APIs and possibly request additional quotas. See
> [Enable GCP APIs](#enable-gcp-apis) and
> [Small Example Quotas](examples/README.md#hpc-cluster-smallyaml).
```shell
cd hpc-cluster-small/primary # From hpc-cluster-small.yaml example
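# From here the standard Terraform workflow applies (a sketch; extra
# flags or steps may be needed depending on your configuration):
terraform init
terraform apply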
@@ -153,6 +155,16 @@ List of APIs to enable ([instructions](https://cloud.google.com/apis/docs/gettin
* Cloud Filestore API
* Cloud Runtime Configuration API - _needed for `high-io` example_
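
These APIs can also be enabled from the command line with the gcloud CLI. A
minimal sketch; the service names are assumptions based on the APIs listed
above, so verify them on each API's documentation page:

```shell
# Enable the Filestore and Runtime Configurator services for the
# active project. Service names are assumptions -- verify them first.
gcloud services enable file.googleapis.com runtimeconfig.googleapis.com
```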

## GCP Quotas

You may need to request additional quota before you can deploy and use your HPC
cluster. For example, the `SchedMD-slurm-on-gcp-partition` resource uses
`c2-standard-60` VMs for compute nodes by default. The default quota for C2
CPUs may be as low as 8, which would prevent even a single node from starting.
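
You can check how many vCPUs a machine type consumes, and therefore how much
CPU quota each node requires, with the gcloud CLI. A sketch, where the zone is
only a placeholder:

```shell
# Describe c2-standard-60; the guestCpus field in the output shows the
# per-node CPU quota consumption. The zone is an arbitrary placeholder.
gcloud compute machine-types describe c2-standard-60 --zone=us-central1-a
```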

The quotas you need depend on your HPC configuration. Minimum quotas are
[documented](examples/README.md#example-configs) for each of the provided
examples.
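
To check your current quota and usage in a region before deploying, one option
is the gcloud CLI (a sketch; `us-central1` is a placeholder for your region):

```shell
# List quota metrics, limits, and usage for a region; look for the
# C2_CPUS entry. Replace us-central1 with your deployment region.
gcloud compute regions describe us-central1 --format="yaml(quotas)"
```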

## Inspecting the Blueprint

The blueprint is created in the directory matching the provided blueprint_name
23 changes: 23 additions & 0 deletions examples/README.md
@@ -34,6 +34,15 @@ uses `c2-standard-60` VMs with placement groups enabled. You may need to request
additional quota for `C2 CPUs` in the region you are deploying in. You can
select the compute partition using the `srun -p compute` argument.
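
For example, a minimal smoke test of the compute partition (assuming the
cluster is deployed and you are on the login node) might be:

```shell
# Run hostname on two nodes of the compute partition.
srun -p compute -N 2 hostname
```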

Quota required for this example:

* Cloud Filestore API: Basic SSD (Premium) capacity (GB) per region: **3 TB**
* Compute Engine API: N2 CPUs: **12** - _should be granted by default_
* Compute Engine API: C2 CPUs: **1,200** - _only needed to run in the `compute`
partition_
* Compute Engine API: Affinity Groups: **10** - _only needed to run in the
`compute` partition_

### hpc-cluster-high-io.yaml

Creates a Slurm cluster with tiered file systems for higher performance. It
@@ -58,6 +67,20 @@ Similar to the small example, there is a
[compute partition](#compute-partition) that should be used for any performance
analysis.

Quota required for this example:

* Cloud Filestore API: Basic SSD (Premium) capacity (GB) per region: **2,660 GB**
* Cloud Filestore API: High Scale SSD capacity (GB) per region: **10,240 GiB** -
  _the minimum quota request is 61,440 GiB_
* Compute Engine API: Persistent Disk SSD (GB): **~14,000 GB**
* Compute Engine API: N2 CPUs: **126**
* Compute Engine API: C2 CPUs: **12,000** - _only needed to max out the
`compute` partition_
* Compute Engine API: Affinity Groups: **one per concurrently running job** -
  _only needed to max out the `compute` partition_
* Compute Engine API: Resource policies: **one per concurrently running job** -
  _only needed to max out the `compute` partition_

### Experimental

**omnia-cluster-simple.yaml**: Creates a simple omnia cluster, with an
