IN WORK: Documenting quota requirements
nick-stroud committed Feb 18, 2022
1 parent d2c6699 commit 81b1a43
Showing 2 changed files with 36 additions and 1 deletion.
14 changes: 13 additions & 1 deletion README.md
@@ -102,7 +102,9 @@ the `-o` flag as shown in the following example.
To deploy the blueprint, use terraform in the resource group directory:

> **_NOTE:_** Before you run this for the first time, you may need to enable
> some APIs and possibly request additional quotas. See
> [Enable GCP APIs](#enable-gcp-apis) and
> [Small Example Quotas](examples/README.md#hpc-cluster-smallyaml).
```shell
cd hpc-cluster-small/primary # From hpc-cluster-small.yaml example
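# From here the standard Terraform workflow applies (a sketch; extra
# flags or steps may be needed depending on your configuration):
terraform init
terraform apply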
@@ -153,6 +155,16 @@ List of APIs to enable ([instructions](https://cloud.google.com/apis/docs/gettin
* Cloud Filestore API
* Cloud Runtime Configuration API - _needed for `high-io` example_
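
These APIs can also be enabled from the command line with the gcloud CLI. A
minimal sketch; the service names are assumptions based on the APIs listed
above, so verify them on each API's documentation page:

```shell
# Enable the Filestore and Runtime Configurator services for the
# active project. Service names are assumptions -- verify them first.
gcloud services enable file.googleapis.com runtimeconfig.googleapis.com
```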

## GCP Quotas

You may need to request additional quota before you can deploy and use your HPC
cluster. For example, the `SchedMD-slurm-on-gcp-partition` resource uses
`c2-standard-60` VMs for compute nodes by default. The default quota for C2
CPUs may be as low as 8, which would prevent even a single node from starting.
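
You can check how many vCPUs a machine type consumes, and therefore how much
CPU quota each node requires, with the gcloud CLI. A sketch, where the zone is
only a placeholder:

```shell
# Describe c2-standard-60; the guestCpus field in the output shows the
# per-node CPU quota consumption. The zone is an arbitrary placeholder.
gcloud compute machine-types describe c2-standard-60 --zone=us-central1-a
```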

The quotas you need depend on your HPC configuration. Minimum quotas are
[documented](examples/README.md#example-configs) for each of the provided
examples.
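
To check your current quota and usage in a region before deploying, one option
is the gcloud CLI (a sketch; `us-central1` is a placeholder for your region):

```shell
# List quota metrics, limits, and usage for a region; look for the
# C2_CPUS entry. Replace us-central1 with your deployment region.
gcloud compute regions describe us-central1 --format="yaml(quotas)"
```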

## Inspecting the Blueprint

The blueprint is created in the directory matching the provided blueprint_name
23 changes: 23 additions & 0 deletions examples/README.md
@@ -34,6 +34,15 @@ uses `c2-standard-60` VMs with placement groups enabled. You may need to request
additional quota for `C2 CPUs` in the region you are deploying in. You can
select the compute partition using the `srun -p compute` argument.
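
For example, a minimal smoke test of the compute partition (assuming the
cluster is deployed and you are on the login node) might be:

```shell
# Run hostname on two nodes of the compute partition.
srun -p compute -N 2 hostname
```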

Quota required for this example:

* Cloud Filestore API: Basic SSD (Premium) capacity (GB) per region: **3 TB**
* Compute Engine API: N2 CPUs: **12** - _should be granted by default_
* Compute Engine API: C2 CPUs: **1,200** - _only needed to run in the `compute`
partition_
* Compute Engine API: Affinity Groups: **10** - _only needed to run in the
`compute` partition_

### hpc-cluster-high-io.yaml

Creates a Slurm cluster with tiered file systems for higher performance. It
@@ -58,6 +67,20 @@ Similar to the small example, there is a
[compute partition](#compute-partition) that should be used for any performance
analysis.

Quota required for this example:

* Cloud Filestore API: Basic SSD (Premium) capacity (GB) per region: **2,660 GB**
* Cloud Filestore API: High Scale SSD capacity (GB) per region: **10,240 GiB** -
  _the minimum quota request is 61,440 GiB_
* Compute Engine API: Persistent Disk SSD (GB): **~14,000 GB**
* Compute Engine API: N2 CPUs: **126**
* Compute Engine API: C2 CPUs: **12,000** - _only needed to max out the
`compute` partition_
* Compute Engine API: Affinity Groups: **one per concurrently running job** -
  _only needed to max out the `compute` partition_
* Compute Engine API: Resource policies: **one per concurrently running job** -
  _only needed to max out the `compute` partition_

### Experimental

**omnia-cluster-simple.yaml**: Creates a simple omnia cluster, with an
