Update docs to discuss debug & compute partitions
nick-stroud committed Feb 16, 2022
1 parent 4b23ce0 commit de0433b
Showing 2 changed files with 57 additions and 6 deletions.
32 changes: 32 additions & 0 deletions README.md
@@ -101,12 +101,32 @@ the `-o` flag as shown in the following example.

To deploy the blueprint, use terraform in the resource group directory:

> **_NOTE:_** Before you run this for the first time you may need to enable some
> APIs. See [Enable GCP APIs](#enable-gcp-apis).
```shell
cd hpc-cluster-small/primary # From hpc-cluster-small.yaml example
terraform init
terraform apply
```

Once the blueprint has successfully been deployed, take the following steps to run a job:

* First, navigate to `Compute Engine` > `VM instances` in the Google Cloud Console.
* Next, click on the `SSH` button associated with the `slurm-hpc-small-login0` instance.
* Finally, run the `hostname` command on 3 nodes by entering the following command in the SSH shell popup:

```shell
$ srun -N 3 hostname
slurm-hpc-slurm-small-debug-0-0
slurm-hpc-slurm-small-debug-0-1
slurm-hpc-slurm-small-debug-0-2
```

By default, this runs the job on the `debug` partition. See details in
[examples/](examples/README.md#hpc-cluster-smallyaml) for how to run on the more
performant `compute` partition.
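
For example, a minimal sketch of targeting the `compute` partition instead
(assuming your project already has quota for that partition's VM family):

```shell
# Same job as above, submitted to the compute partition; node names in the
# output will differ from the debug partition example.
srun -N 3 -p compute hostname
```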

> **_NOTE:_** Cloud Shell times out after 20 minutes of inactivity. This example
> deploys in about 5 minutes but for more complex deployments it may be
> necessary to deploy (`terraform apply`) from a cloud VM. The same process
@@ -121,6 +141,18 @@ cd <blueprint-directory>/<packer-group>/<custom-vm-image>
packer build .
```

## Enable GCP APIs

In a new GCP project, several APIs must be enabled before you can deploy your
HPC cluster. Missing APIs will surface as errors when you run `terraform apply`,
but you can save time by enabling them upfront, for example with the `gcloud`
command sketched after the list below.

List of APIs to enable ([instructions](https://cloud.google.com/apis/docs/getting-started#enabling_apis)):

* Compute Engine API
* Cloud Filestore API
* Cloud Runtime Configuration API - _needed for `high-io` example_
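
One way to enable them from the command line is with `gcloud` (a sketch; it
assumes `gcloud` is authenticated and pointed at your HPC project, and uses the
standard service names for the APIs listed above):

```shell
# Enable the Compute Engine and Cloud Filestore APIs in the active project.
gcloud services enable compute.googleapis.com file.googleapis.com

# Cloud Runtime Configuration API, only needed for the high-io example.
gcloud services enable runtimeconfig.googleapis.com
```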

## Inspecting the Blueprint

The blueprint is created in the directory matching the provided blueprint_name
31 changes: 25 additions & 6 deletions examples/README.md
@@ -14,13 +14,29 @@ passed to resources if the resources have an input that matches the variable name

## Config Descriptions

### hpc-cluster-small.yaml

Creates a basic auto-scaling SLURM cluster with mostly default settings. The
blueprint also creates a new VPC network and a Filestore instance mounted to
`/home`.
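
Once the cluster is up, a quick sanity check that the Filestore share is
mounted (a sketch; run it from any cluster node after SSH-ing in):

```shell
# /home should show up as an NFS mount backed by the Filestore instance.
df -h /home
```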

There are two partitions in this example: `debug` and `compute`. The `debug`
partition uses `n2-standard-2` VMs, which should work out of the box without
needing to request additional quota. The purpose of the `debug` partition is to
make sure that first-time users are not immediately blocked by quota
limitations.

The `compute` partition is far more performant than `debug`, so any performance
analysis should be done there. By default it uses `c2-standard-60` VMs with
placement groups enabled. You may need to request additional quota for `C2 CPUs`
in the region you are deploying in. You can select the `compute` partition by
passing `-p compute` to `srun`.
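
As a rough sketch of checking that quota before deploying (the region
`us-central1` is just an example; substitute the region from your blueprint):

```shell
# Print the quotas section for the region; look for the C2_CPUS metric and
# compare its limit against the cores your compute partition will need.
gcloud compute regions describe us-central1 --format="yaml(quotas)"
```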

### hpc-cluster-high-io.yaml

Creates a SLURM cluster with tiered file systems for higher performance. It
connects to the default VPC of the project and creates two partitions and a
login node.

File systems:

@@ -32,6 +48,9 @@ File systems:
[DDN Exascaler Lustre](../resources/third-party/file-system/DDN-EXAScaler/README.md)
file system designed for high IO performance. The capacity is ~10TiB.

Similar to the [small example](#hpc-cluster-smallyaml), there is a `debug`
partition, which should require less quota to get running.
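
A minimal sketch for confirming the partitions once the cluster is up (run it
from the login node):

```shell
# List partitions, node counts, and node states; the default partition is
# marked with an asterisk in the PARTITION column.
sinfo
```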

### Experimental

**omnia-cluster-simple.yaml**: Creates a simple omnia cluster, with an
