Add multi-AZ and multi-region support
Add Regions and AZs to the InstanceConfig

The code has been updated to support multiple regions.
The instance types that are available and their pricing vary by region, so
all instance type info must be maintained per region.

Spot pricing additionally varies by instance type and by AZ, so this
commit adds an updated EC2InstanceTypeInfoPkg package that looks up
the spot price for each instance type in each AZ and region.
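
For reference, the per-AZ spot price that the package looks up can also be checked manually with the AWS CLI; the region, AZ, and instance type below are only placeholders:

```
# Query the most recent Linux spot price for one instance type in one AZ.
# Region, AZ, and instance type are placeholders; substitute your own values.
aws ec2 describe-spot-price-history \
    --region us-east-1 \
    --availability-zone us-east-1a \
    --instance-types c5.2xlarge \
    --product-descriptions "Linux/UNIX" \
    --start-time "$(date -u +%Y-%m-%dT%H:%M:%S)" \
    --query 'SpotPriceHistory[0].SpotPrice' \
    --output text
```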

The Region/AZ configuration is added to the InstanceConfig section of the config file.
Each region requires the VpcId, CIDR, and SshKeyPair.
Each AZ requires the subnet ID and priority.
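
A hypothetical sketch of what that InstanceConfig section could look like is shown below; the key names and values are placeholders, and the authoritative schema is in source/cdk/config_schema.py:

```
# Hypothetical Region/AZ section of the config file; key names and values
# are placeholders; see source/cdk/config_schema.py for the real schema.
cat >> my-cluster-config.yml <<'EOF'
InstanceConfig:
  Regions:
    us-east-1:
      VpcId: vpc-0123456789abcdef0
      CIDR: 10.1.0.0/16
      SshKeyPair: my-ec2-keypair
      AZs:
        - Priority: 10
          Subnet: subnet-0123456789abcdef0
        - Priority: 5
          Subnet: subnet-0fedcba9876543210
EOF
```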

The Slurm node configuration has been updated to add the AZ id to all compute nodes
and the AZ name to all partitions.

Users can specify multiple partitions with sbatch if they want jobs to
be spread across multiple AZs.
The modulefile has been updated to set the partition to the list of
all regional/AZ partitions so that all nodes are available to jobs
in the priority order configured in the config file.
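
For example, a comma-separated partition list can be passed directly to sbatch, or exported once through sbatch's SBATCH_PARTITION environment variable; the partition names below are placeholders:

```
# Submit a job that may run in any of the listed per-AZ partitions.
# Partition names are placeholders for the per-AZ partitions created by the cluster.
sbatch -p eda-us-east-1a,eda-us-east-1b,eda-us-east-1c job.sh

# A modulefile could set the same default for all submissions:
export SBATCH_PARTITION=eda-us-east-1a,eda-us-east-1b,eda-us-east-1c
```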

Create compute node security groups for other regions using a custom resource.
Save the regional security group ids in SSM Parameter Store.
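
The saved ids can then be read back from any region, for example with the AWS CLI; the parameter name below is a placeholder:

```
# Read a regional compute node security group id from SSM Parameter Store.
# The parameter name is a placeholder; use the name written by the custom resource.
aws ssm get-parameter \
    --region us-west-2 \
    --name "/MyCluster/SlurmNodeSecurityGroupId" \
    --query 'Parameter.Value' \
    --output text
```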

Update the multi-region Route53 hosted zone

Fix IAM permissions to handle multiple regions

Decode IAM permissions messages
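
For reference, encoded authorization failure messages can also be decoded manually with the AWS CLI; the encoded message below is a placeholder:

```
# Decode an "Encoded authorization failure message" from an API error.
# Requires the sts:DecodeAuthorizationMessage permission.
aws sts decode-authorization-message \
    --encoded-message "<encoded-message-from-the-error>" \
    --query 'DecodedMessage' \
    --output text
```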

Update security groups with remote region CIDRs
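
This is roughly equivalent to adding ingress rules like the following for each remote region's VPC CIDR; the group id, port range, and CIDR below are placeholders:

```
# Allow Slurm traffic from a remote region's VPC CIDR.
# Group id, port range, and CIDR are placeholders.
aws ec2 authorize-security-group-ingress \
    --region us-east-1 \
    --group-id sg-0123456789abcdef0 \
    --protocol tcp \
    --port 6817-6819 \
    --cidr 10.2.0.0/16
```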

Create a slurmfs ARecord for use in other regions.
This required adding a Lambda function to do DNS lookups.

Add custom resource to add regional VPCs to the Route53 hosted zone.
This is required for now because of a CDK bug:

aws/aws-cdk#20496

The PR for the above bug is:

aws/aws-cdk#20530

Update github-pages to use mkdocs
Add github-docs target to Makefile

Update to cdk@2.28.1

Create AZ and interactive partitions, set default partitions

Resolves [FEATURE #22: Support multiple availability zones and regions](#2)
cartalla committed Jul 7, 2022
1 parent e294a16 commit 13601d3
Showing 51 changed files with 2,560 additions and 787 deletions.
3 changes: 3 additions & 0 deletions .gitignore
@@ -6,4 +6,7 @@ site/
# Jekyll
Gemfile.lock
.jekyll-cache
.mkdocs_venv/
_site
site/
.vscode/
21 changes: 15 additions & 6 deletions Makefile
@@ -1,15 +1,24 @@

.PHONY: help local-docs test clean

help:
@echo "Usage: make [ help | clean ]"
@echo "Usage: make [ help | local-docs | github-docs | clean ]"

.mkdocs_venv/bin/activate:
rm -rf .mkdocs_venv
python3 -m venv .mkdocs_venv
source .mkdocs_venv/bin/activate; pip install mkdocs

local-docs: .mkdocs_venv/bin/activate
source .mkdocs_venv/bin/activate; mkdocs serve&
firefox http://127.0.0.1:8000/

github-docs: .mkdocs_venv/bin/activate
source .mkdocs_venv/bin/activate; mkdocs gh-deploy --strict

test:
pytest -x -v tests

jekyll:
gem install jekyll bundler
bundler install
bundle exec jekyll serve

clean:
git clean -d -f -x
# -d: Recurse into directories
22 changes: 14 additions & 8 deletions README.md
@@ -1,8 +1,6 @@
# AWS EDA Slurm Cluster

[View on GitHub Pages](https://aws-samples.github.io/aws-eda-slurm-cluster/)

This repository contains an AWS Cloud Development Kit (CDK) application that creates a SLURM cluster that is suitable for running production EDA workloads on AWS.
This repository contains an AWS Cloud Development Kit (CDK) application that creates a Slurm cluster that is suitable for running production EDA workloads on AWS.
Key features are:

* Automatic scaling of AWS EC2 instances based on demand
@@ -11,7 +9,7 @@ Key features are:
* Batch and interactive partitions (queues)
* Managed tool licenses as a consumable resource
* User and group fair share scheduling
* SLURM accounting database
* Slurm accounting database
* CloudWatch dashboard
* Job preemption
* Multi-cluster federation
@@ -21,7 +19,7 @@ Key features are:

## Operating System and Processor Architecture Support

This SLURM cluster supports the following OSes:
This Slurm cluster supports the following OSes:

* Alma Linux 8
* Amazon Linux 2
@@ -32,7 +30,7 @@ This SLURM cluster supports the following OSes:
RedHat stopped supporting CentOS 8, so for a similar RedHat 8 binary compatible distribution we support Alma Linux and
Rocky Linux as replacements for CentOS.

This SLURM cluster supports both Intel/AMD (x86_64) based instances and ARM Graviton2 (arm64/aarch64) based instances.
This Slurm cluster supports both Intel/AMD (x86_64) based instances and ARM Graviton2 (arm64/aarch64) based instances.

[Graviton 2 instances require](https://github.com/aws/aws-graviton-getting-started/blob/main/os.md) Amazon Linux 2, RedHat 8, AlmaLinux 8, or RockyLinux 8 operating systems.
RedHat 7 and CentOS 7 do not support Graviton 2.
@@ -52,7 +50,9 @@ This provides the following different combinations of OS and processor architect

## Documentation

To view the docs, clone the repository and run mkdocs:
[View on GitHub Pages](https://aws-samples.github.io/aws-eda-slurm-cluster/)

To view the docs locally, clone the repository and run mkdocs:

The docs are in the docs directory. You can view them in an editor or using the mkdocs tool.

@@ -74,10 +74,16 @@ firefox http://127.0.0.1:8000/ &

Open a browser to: http://127.0.0.1:8000/

Or you can simply let make do this for you.

```
make local-docs
```

## Security

See [CONTRIBUTING](CONTRIBUTING.md#security-issue-notifications) for more information.

## License

This library is licensed under the MIT-0 License. See the LICENSE file.
This library is licensed under the MIT-0 License. See the [LICENSE](https://github.com/aws-samples/aws-eda-slurm-cluster/blob/main/LICENSE) file.
4 changes: 0 additions & 4 deletions _config.yml

This file was deleted.

1 change: 0 additions & 1 deletion docs/_config.yml

This file was deleted.

40 changes: 19 additions & 21 deletions docs/deploy.md
@@ -75,17 +75,15 @@ Add the nodjs bin directory to your path.
Note that the version of aws-cdk changes frequently.
The version that has been tested is in the CDK_VERSION variable in the install script.

```
The install script will try to install the prerequisites if they aren't already installed.
```

## Configuration File

The first step in deploying your cluster is to create a configuration file.
A default configuration file is found in [source/resources/config/default_config.yml](source/config/default_config.yml).
A default configuration file is found in [source/resources/config/default_config.yml](https://github.com/aws-samples/aws-eda-slurm-cluster/blob/main/source/resources/config/default_config.yml).
You should create a new config file and update the parameters for your cluster.

The schema for the config file along with its default values can be found in [source/cdk/config_schema.py](source/cdk/config_schema.py).
The schema for the config file along with its default values can be found in [source/cdk/config_schema.py](https://github.com/aws-samples/aws-eda-slurm-cluster/blob/main/source/cdk/config_schema.py).
The schema is defined in python, but the actual config file should be in yaml format.

The following are key parameters that you will need to update.
@@ -115,7 +113,7 @@ The defaults for the following parameters are generally acceptable, but may be m
## Configure the Compute Instances

The InstanceConfig configuration parameter configures the base operating systems, CPU architectures, instance families,
and instance types that the SLURM cluster should support.
and instance types that the Slurm cluster should support.
The supported OSes and CPU architectures are:

| Base OS | CPU Architectures
@@ -204,7 +202,7 @@ If you want to use the latest base OS AMIs, then configure your AWS cli credenti
the tested version.

```
source/create-ami-map.py > source/resources/config/ami_map.yml
./source/create-ami-map.py > source/resources/config/ami_map.yml
```

## Use Your Own AMIs (Optional)
@@ -240,13 +238,13 @@ This is useful if the root volume needs additional space to install additional p

## Configure Fair Share Scheduling (Optional)

SLURM supports [fair share scheduling](https://slurm.schedmd.com/fair_tree.html), but it requires the fair share policy to be configured.
Slurm supports [fair share scheduling](https://slurm.schedmd.com/fair_tree.html), but it requires the fair share policy to be configured.
By default, all users will be put into a default group that has a low fair share.
The configuration file is at **source/resources/playbooks/roles/SlurmCtl/templates/tools/slurm/etc/accounts.yml.example**
The configuration file is at [source/resources/playbooks/roles/SlurmCtl/templates/opt/slurm/cluster/etc/accounts.yml.example](https://github.com/aws-samples/aws-eda-slurm-cluster/blob/main/source/resources/playbooks/roles/SlurmCtl/templates/opt/slurm/cluster/etc/accounts.yml.example)
in the repository and is deployed to **/opt/slurm/{{ClusterName}}/conf/accounts.yml**.

The file is a simple yaml file that allows you to configure groups, the users that belong to the group, and a fair share weight for the group.
Refer to the SLURM documentation for details on how the fair share weight is calculated.
Refer to the Slurm documentation for details on how the fair share weight is calculated.
The scheduler can be configured so that users who aren't getting their fair share of resources get
higher priority.
The following shows 3 top level groups.
@@ -322,13 +320,13 @@ These weights can be adjusted based on your needs to control job priorities.

## Configure Licenses

SLURM supports [configuring licenses as a consumable resource](https://slurm.schedmd.com/licenses.html).
Slurm supports [configuring licenses as a consumable resource](https://slurm.schedmd.com/licenses.html).
It will keep track of how many running jobs are using a license and when no more licenses are available
then jobs will stay pending in the queue until a job completes and frees up a license.
Combined with the fairshare algorithm, this can prevent users from monopolizing licenses and preventing others from
being able to run their jobs.

The configuration file is at **source/resources/playbooks/roles/SlurmCtl/templates/tools/slurm/etc/accounts.yml.example**
The configuration file is at [source/resources/playbooks/roles/SlurmCtl/templates/tools/slurm/etc/slurm_licenses.conf.example](https://github.com/aws-samples/aws-eda-slurm-cluster/blob/main/source/resources/playbooks/roles/SlurmCtl/templates/opt/slurm/cluster/etc/slurm_licenses.conf.example)
in the repository and is deployed to **/opt/slurm/{{ClusterName}}/conf/accounts.yml**.

The example configuration shows how the number of licenses can be configured as just a comma separated list.
@@ -351,11 +349,11 @@ with command line arguments, however it is better to specify all of the paramete
## Use the Cluster

Configuring your environment for users requires root privileges.
The configuration commands are found in the outputs of the SLURM cloudformation stack.
The configuration commands are found in the outputs of the Slurm cloudformation stack.

### Configure SLURM Users and Groups
### Configure Slurm Users and Groups

The SLURM cluster needs to configure the users and groups of your environment.
The Slurm cluster needs to configure the users and groups of your environment.
For efficiency, it does this by capturing the users and groups from your environment
and saves them in a json file.
When the compute nodes start they create local unix users and groups using this json file.
@@ -364,18 +362,18 @@ Choose a single instance in your VPC that will always be running and that is joi
so that it can list all users and groups.
For SOCA this would be the Scheduler instance.
Connect to that instance and run the commands in the **MountCommand** and **ConfigureSyncSlurmUsersGroups** outputs
of the SLURM stack.
These commands will mount the SLURM file system at **/opt/slurm/{{ClusterName}}** and then create
of the Slurm stack.
These commands will mount the Slurm file system at **/opt/slurm/{{ClusterName}}** and then create
a cron job that runs every 5 minutes and updates **/opt/slurm/{{ClusterName}}/config/users_groups.json**.

### Configure SLURM Submitter Instances
### Configure Slurm Submitter Instances

Instances that need to submit to SLURM need to have their security group IDs in the **SubmitterSecurityGroupIds** configuration parameter
so that the security groups allow communication between the submitter instances and the SLURM cluster.
They also need to be configured by mounting the file system with the SLURM tools and
Instances that need to submit to Slurm need to have their security group IDs in the **SubmitterSecurityGroupIds** configuration parameter
so that the security groups allow communication between the submitter instances and the Slurm cluster.
They also need to be configured by mounting the file system with the Slurm tools and
configuring their environment.
Connect to the submitter instance and run the commands in the **MountCommand** and **ConfigureSubmitterCommand** outputs
of the SLURM stack.
of the Slurm stack.
If all users need to use the cluster then it is probably best to create a custom AMI that is configured with the configuration
commands.

6 changes: 3 additions & 3 deletions docs/federation.md
@@ -5,9 +5,9 @@ If you need to run jobs in more than one AZ then you can use the [federation fea

The config directory has example configuration files that demonstrate how deploy federated cluster into 3 AZs.

* [source/config/slurm_eda_az1.yml](source/config/slurm_eda_az1.yml)
* [source/config/slurm_eda_az2.yml](source/config/slurm_eda_az2.yml)
* [source/config/slurm_eda_az3.yml](source/config/slurm_eda_az3.yml)
* [source/config/slurm_eda_az1.yml](https://github.com/aws-samples/aws-eda-slurm-cluster/blob/main/source/config/slurm_eda_az1.yml)
* [source/config/slurm_eda_az2.yml](https://github.com/aws-samples/aws-eda-slurm-cluster/blob/main/source/config/slurm_eda_az2.yml)
* [source/config/slurm_eda_az3.yml](https://github.com/aws-samples/aws-eda-slurm-cluster/blob/main/source/config/slurm_eda_az3.yml)

These clusters should be deployed sequentially.
The first cluster creates a cluster and a slurmdbd instance.
17 changes: 0 additions & 17 deletions docs/mkdocs.md

This file was deleted.
