Update docs to discuss debug & compute partitions
nick-stroud committed Feb 16, 2022
1 parent 4b23ce0 commit de0433b
Showing 2 changed files with 57 additions and 6 deletions.
32 changes: 32 additions & 0 deletions README.md
@@ -101,12 +101,32 @@ the `-o` flag as shown in the following example.

To deploy the blueprint, use terraform in the resource group directory:

> **_NOTE:_** Before you run this for the first time you may need to enable some
> APIs. See [Enable GCP APIs](#enable-gcp-apis).
```shell
cd hpc-cluster-small/primary # From hpc-cluster-small.yaml example
terraform init
terraform apply
```

Once the blueprint has successfully been deployed, take the following steps to run a job:

* First, navigate to `Compute Engine` > `VM instances` in the Google Cloud Console.
* Next, click on the `SSH` button associated with the `slurm-hpc-small-login0` instance.
* Finally, run the `hostname` command on 3 nodes by entering the following command in the SSH shell popup:

```shell
$ srun -N 3 hostname
slurm-hpc-slurm-small-debug-0-0
slurm-hpc-slurm-small-debug-0-1
slurm-hpc-slurm-small-debug-0-2
```

By default, this runs the job on the `debug` partition. See details in
[examples/](examples/README.md#hpc-cluster-smallyaml) for how to run on the more
performant `compute` partition.
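
For example, a minimal sketch of targeting the `compute` partition instead
(assuming your project already has quota for that partition's VM family):

```shell
# Same job as above, submitted to the compute partition; node names in the
# output will differ from the debug partition example.
srun -N 3 -p compute hostname
```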

> **_NOTE:_** Cloud Shell times out after 20 minutes of inactivity. This example
> deploys in about 5 minutes but for more complex deployments it may be
> necessary to deploy (`terraform apply`) from a cloud VM. The same process
@@ -121,6 +141,18 @@ cd <blueprint-directory>/<packer-group>/<custom-vm-image>
packer build .
```

## Enable GCP APIs

In a new GCP project, several APIs must be enabled before you can deploy your
HPC cluster. Missing APIs will surface as errors when you run `terraform apply`,
but you can save time by enabling them upfront, for example with the `gcloud`
command sketched after the list below.

List of APIs to enable ([instructions](https://cloud.google.com/apis/docs/getting-started#enabling_apis)):

* Compute Engine API
* Cloud Filestore API
* Cloud Runtime Configuration API - _needed for `high-io` example_
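
One way to enable them from the command line is with `gcloud` (a sketch; it
assumes `gcloud` is authenticated and pointed at your HPC project, and uses the
standard service names for the APIs listed above):

```shell
# Enable the Compute Engine and Cloud Filestore APIs in the active project.
gcloud services enable compute.googleapis.com file.googleapis.com

# Cloud Runtime Configuration API, only needed for the high-io example.
gcloud services enable runtimeconfig.googleapis.com
```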

## Inspecting the Blueprint

The blueprint is created in the directory matching the provided blueprint_name
31 changes: 25 additions & 6 deletions examples/README.md
@@ -14,13 +14,29 @@ passed to resources if the resources have an input that matches the variable name

## Config Descriptions

### hpc-cluster-small.yaml

Creates a basic auto-scaling SLURM cluster with mostly default settings. The
blueprint also creates a new VPC network and a Filestore instance mounted to
`/home`.
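
Once the cluster is up, a quick sanity check that the Filestore share is
mounted (a sketch; run it from any cluster node after SSH-ing in):

```shell
# /home should show up as an NFS mount backed by the Filestore instance.
df -h /home
```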

There are two partitions in this example: `debug` and `compute`. The `debug`
partition uses `n2-standard-2` VMs, which should work out of the box without
needing to request additional quota. The purpose of the `debug` partition is to
make sure that first-time users are not immediately blocked by quota
limitations.

The `compute` partition is far more performant than `debug`, so any performance
analysis should be done there. By default it uses `c2-standard-60` VMs with
placement groups enabled. You may need to request additional quota for `C2 CPUs`
in the region you are deploying in. You can select the `compute` partition by
passing `-p compute` to `srun`.
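
As a rough sketch of checking that quota before deploying (the region
`us-central1` is just an example; substitute the region from your blueprint):

```shell
# Print the quotas section for the region; look for the C2_CPUS metric and
# compare its limit against the cores your compute partition will need.
gcloud compute regions describe us-central1 --format="yaml(quotas)"
```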

### hpc-cluster-high-io.yaml

Creates a SLURM cluster with tiered file systems for higher performance. It
connects to the default VPC of the project and creates two partitions and a
login node.

File systems:

@@ -32,6 +48,9 @@ File systems:
[DDN Exascaler Lustre](../resources/third-party/file-system/DDN-EXAScaler/README.md)
file system designed for high IO performance. The capacity is ~10TiB.

Similar to the [small example](#hpc-cluster-smallyaml), there is a `debug`
partition, which should require less quota to get running.
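
A minimal sketch for confirming the partitions once the cluster is up (run it
from the login node):

```shell
# List partitions, node counts, and node states; the default partition is
# marked with an asterisk in the PARTITION column.
sinfo
```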

### Experimental

**omnia-cluster-simple.yaml**: Creates a simple omnia cluster, with an
