These templates build a compute grid made of a single master VM running the management services, multiple VM Scale Sets for deploying compute nodes, and optionally a set of nodes running BeeGFS as a parallel shared file system. Ganglia is available as an option for monitoring the cluster load, and PBS Pro can optionally be set up for job scheduling.
The following diagram shows the overall compute, storage and network infrastructure that is provisioned within Azure to support running HPC applications.
A single VNET (grid-vnet) is used, in which four subnets are created: one for the infrastructure (infra-subnet), one for the compute nodes (compute-subnet), one for the storage (storage-subnet) and one for the VPN Gateway (GatewaySubnet). The following address ranges are used:
- grid-vnet 10.0.0.0/20 allowing 4091 private IPs from 10.0.0.4 to 10.0.15.254
- compute-subnet 10.0.0.0/21 allowing 2043 private IPs from 10.0.0.4 to 10.0.7.254
- infra-subnet 10.0.8.0/28 allowing 11 private IPs from 10.0.8.4 to 10.0.8.14
- gatewaysubnet 10.0.9.0/29 allowing 3 private IPs from 10.0.9.4 to 10.0.9.6
- storage-subnet 10.0.10.0/25 allowing 123 private IPs from 10.0.10.4 to 10.0.10.126
Note that Azure reserves five addresses in every subnet: the first four (x.x.x.0 to x.x.x.3), so usable addresses start at x.x.x.4, and the last one (the broadcast address). Take this into account when designing your virtual network architecture. InfiniBand is automatically provided when Azure HPC nodes are provisioned.
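The subnet sizes can be checked with a little shell arithmetic. The sketch below assumes only that Azure reserves five addresses per subnet; the helper name usable_ips is illustrative and not part of the templates.

```shell
#!/bin/sh
# Number of usable private IPs in an Azure subnet of the given prefix length.
# Azure reserves 5 addresses per subnet: the first four and the last one.
usable_ips() {
  local prefix="$1"
  echo $(( (1 << (32 - prefix)) - 5 ))
}

# Subnets and prefix lengths taken from the address plan.
for entry in "compute-subnet:21" "infra-subnet:28" "GatewaySubnet:29" "storage-subnet:25"; do
  name="${entry%%:*}"
  prefix="${entry##*:}"
  printf '%-16s /%s -> %s usable IPs\n' "$name" "$prefix" "$(usable_ips "$prefix")"
done
```

The same function can be used to size a subnet before changing the address plan, for example to confirm that a /21 leaves room for the number of compute nodes you intend to deploy.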
For DNS, the Azure DNS is used for name resolution on the private IPs.
Compute nodes are deployed through VM Scale Sets with Managed Disks, each scale set containing up to 100 VM instances. They are all inside the compute-subnet.
Depending on the workload to run on the cluster, a scalable file system may be needed. BeeGFS is proposed as an option; each storage node hosts both the storage and metadata services. Several Premium Disks are configured in RAID0 to store the metadata, in addition to the disks used for the data store. For a small cluster, the master machine can instead be used as an NFS server, with its data disks attached in a RAID0 volume.
A dedicated VM (the master node) is used as a jumpbox, exposing an SSH endpoint, and hosting these services :
- Ganglia metadata service and monitoring web site
- PBS Pro job scheduler
- BeeGFS management services
To build the compute grid, three main steps need to be executed :
- Create the networking infrastructure and the jumpbox
- Optionally deploy the BeeGFS nodes
- Provision the compute nodes
The OS for this solution is CentOS 7.2. All scripts have been tested only for that version. SLES 12 can be used for a plain raw cluster, without Ganglia, PBS Pro and BeeGFS.
Starting on February 22, 2017, the master, compute nodes and BeeGFS nodes are all provisioned using Managed Disks.
Azure CLI 2.0 setup instructions can be found here.
Below is an example of how to provision the templates. First, log in with your credentials. If you have several subscriptions, make the one you want to deploy into the default. Then create a resource group, providing a region and a name for it, and finally invoke the template, passing your local parameter file. In the template URI, make sure to use the raw URI https://raw.githubusercontent.com/eewolfe/azureHPC/master/*** and not the GitHub HTML link.
az login
az account set --subscription [subscriptionId]
az group create -l "SouthCentralUS" -n rg-master
az group deployment create -g rg-master --template-uri https://raw.githubusercontent.com/eewolfe/azureHPC/master/Compute-Grid-Infra/deploy-master.json --parameters @myparams.json
- upload.sh - This script uploads all of the templates to a storage account and returns the URI and key to support ARM deployment directly from the storage account. This eliminates the need to host this repository in a public location.
After running upload.sh, export SCRIPT_URL and SCRIPT_SASKEY so that the following scripts pick up the new location. Otherwise, they default to this GitHub repository.
- deploy.master.sh - Creates the resource group and deploys the master node
- deploy.beegfs.sh - Deploys a VM scale set and a BeeGFS cluster
- deploy.nodes.sh - Deploys the compute cluster
- deploy.usernodes.sh - Deploys several nodes for users to log into and run the UI for the applications
The template deploy-master.json will provision the networking infrastructure as well as a master VM exposing an SSH endpoint for remote connection.
You have to provide these parameters to the template :
- vmPrefix : an 8-character prefix used to name your objects. The master VM will be named [prefix]master
- sharedStorage : to specify the shared storage to use. Allowed values are : none, beegfs, nfsonmaster.
- scheduler : the job scheduler to be setup. Allowed values are : none, pbspro
- monitoring : the monitoring tools to be setup. Allowed values are : none, ganglia
- masterImage : the OS to be used. Should be CentOS_7.2
- dataDiskSize : the size of the data disks to attach. Allowed values are : none, P10 (128GB), P20 (512GB), P30 (1023GB)
- nbDataDisks : Number of data disks to attach. Default is 2, maximum is 16.
- VMSku : This is to specify the instance size of the master VM. For example Standard_DS3_v2
- adminUsername : This is the name of the administrator account to create on the VM
- adminPassword : Password to associate with the administrator account. It is highly encouraged to use passwordless SSH key authentication instead.
- sshKeyData : The public SSH key to associate with the administrator user. Format has to be on a single line 'ssh-rsa key'
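A parameter file for deploy-master.json could be generated as in the sketch below. It uses the standard ARM deployment parameter file layout; every value is a placeholder chosen for illustration, the function name write_master_params is hypothetical, and the exact types the template expects (for example whether nbDataDisks is an integer) are an assumption to verify against the template itself.

```shell
#!/bin/sh
# Write an ARM parameter file for deploy-master.json.
# $1: output file, $2: public SSH key on a single line ('ssh-rsa key').
# All values are placeholders; the key is assumed to contain no double quotes.
write_master_params() {
  local out_file="$1" ssh_key="$2"
  cat > "$out_file" <<EOF
{
  "\$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentParameters.json#",
  "contentVersion": "1.0.0.0",
  "parameters": {
    "vmPrefix":      { "value": "hpcgrid1" },
    "sharedStorage": { "value": "nfsonmaster" },
    "scheduler":     { "value": "pbspro" },
    "monitoring":    { "value": "ganglia" },
    "masterImage":   { "value": "CentOS_7.2" },
    "dataDiskSize":  { "value": "P10" },
    "nbDataDisks":   { "value": 2 },
    "VMSku":         { "value": "Standard_DS3_v2" },
    "adminUsername": { "value": "hpcadmin" },
    "adminPassword": { "value": "REPLACE_WITH_A_STRONG_PASSWORD" },
    "sshKeyData":    { "value": "${ssh_key}" }
  }
}
EOF
}

# Example: write the file referenced by --parameters @myparams.json.
write_master_params myparams.json "ssh-rsa REPLACE_WITH_PUBLIC_KEY"
```

Because the file may contain a password, keep it out of source control and restrict its permissions.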
Once the deployment succeeds, use the output masterFQDN to retrieve the master name and SSH to it. The output GangliaURI contains the URI of the Ganglia monitoring page, which should display graphs of the current load after a few minutes.
To check that PBS Pro is installed, run the command pbsnodes -a. It should return no available nodes, but the command should run successfully.
If nfsonmaster is chosen, an NFS mount point named /data will be created.
BeeGFS will be checked later, once the storage nodes are deployed.
If your compute cluster requires scalable shared file storage, you can deploy BeeGFS nodes to create a unique namespace. Before deploying, decide how many storage nodes you need and, for each node, how many data disks to provide for the storage and metadata services. Data disks are based on Premium Storage and come in three sizes:
- P10 : 128 GB
- P20 : 512 GB
- P30 : 1023 GB
The storage nodes will be included in the VNET created in the previous step, all inside the storage-subnet.
The template BeeGFS/deploy-beegfs-vmss.json will provision the storage nodes with CentOS 7.2 and BeeGFS version 6.
You have to provide these parameters to the template :
- nodeType : Default value is both and should be kept as is. Other values meta and storage are allowed for advanced scenarios in which meta data services and storage services are deployed on dedicated nodes.
- nodeCount : Total number of storage nodes to deploy. Maximum is 100.
- VMsku : The VM instance type to be used in the Standard_DSx_v2 series. Default is Standard_DS3_v2.
- RGvnetName : The name of the Resource Group used to deploy the Master VM and the VNET.
- adminUsername : This is the name of the administrator account to create on the VM. It is recommended to use the same one as for the Master VM.
- sshKeyData : The public SSH key to associate with the administrator user. Format has to be on a single line 'ssh-rsa key'
- masterName : The short name of the Master VM, on which the BeeGFS management service is installed
- storageDiskSize : Size of the Data Disk to be used for the storage service (P10, P20, P30). Default is P10.
- nbStorageDisks : Number of data disks to be attached to a single VM. Min is 2, Max is 8, Default is 2.
- metaDiskSize : Size of the Data Disk to be used for the metadata service (P10, P20, P30). Default is P10.
- nbMetaDisks : Number of data disks to be attached to a single VM. Min is 2, Max is 8, Default is 2.
- customDomain : If the VNET is configured to use a custom domain, specify the name of that custom domain
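When choosing nodeCount, nbStorageDisks and storageDiskSize, it can help to estimate the resulting raw capacity. The sketch below assumes the storage disks are striped without redundancy, so capacity is simply the sum of all disks; the function names are illustrative, and usable space after file system overhead will be lower.

```shell
#!/bin/sh
# Size in GB of a Premium disk type, using the sizes listed above.
disk_size_gb() {
  case "$1" in
    P10) echo 128 ;;
    P20) echo 512 ;;
    P30) echo 1023 ;;
    *) echo "unknown disk size: $1" >&2; return 1 ;;
  esac
}

# Raw BeeGFS storage capacity in GB.
# $1: nodeCount, $2: nbStorageDisks per node, $3: storageDiskSize (P10, P20, P30).
beegfs_raw_capacity_gb() {
  local node_count="$1" disks_per_node="$2" size
  size=$(disk_size_gb "$3") || return 1
  echo $(( node_count * disks_per_node * size ))
}

# Example: 4 storage nodes, each with 2 P10 disks.
echo "raw capacity: $(beegfs_raw_capacity_gb 4 2 P10) GB"
```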
Storage nodes will be named beegfs000000, beegfs000001, and so on. After a few minutes, they should appear on the Ganglia monitoring web page, if it is set up.
To check that the nodes are correctly registered with the BeeGFS management service, SSH to the master VM and run these commands:
- to list the storage nodes : beegfs-ctl --listnodes --nodetype=storage
- to list the metadata nodes : beegfs-ctl --listnodes --nodetype=metadata
- to display the BeeGFS file system : beegfs-df
- to display the BeeGFS network : beegfs-net
The mount point to use is /share/scratch, and it should already be mounted on the master VM.
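A quick way to confirm the shared file systems are mounted is to look them up in the kernel mount table. The sketch below assumes a Linux host exposing /proc/mounts; the helper is_mounted is illustrative, and /data only applies when nfsonmaster was selected.

```shell
#!/bin/sh
# Succeeds if the given path is a mount point.
# $1: path to check, $2: mount table to read (defaults to /proc/mounts).
is_mounted() {
  local path="$1" mounts_file="${2:-/proc/mounts}"
  awk -v p="$path" '$2 == p { found = 1 } END { exit !found }' "$mounts_file"
}

# /share/scratch is the BeeGFS mount point, /data the NFS one.
for mp in /share/scratch /data; do
  if is_mounted "$mp" 2>/dev/null; then
    echo "$mp is mounted"
  else
    echo "$mp is NOT mounted"
  fi
done
```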
Compute nodes are provisioned using VM Scale Sets; each set can have up to 100 VMs. You have to provide the number of VMs per scale set and how many sets you want to create. All scale sets will contain identical VM instances.
You have to provide these parameters to the template :
- VMsku : Instance type to provision. Default is Standard_D3_v2
- sharedStorage : default is none. Allowed values are (nfsonmaster, beegfs, none)
- scheduler : default is none. Allowed values are (pbspro, none)
- monitoring : default is ganglia. Allowed values are (ganglia, none)
- computeNodeImage : OS to use for compute nodes. Default and recommended value is CentOS_7.2
- vmSSPrefix : 8-character prefix used to name the compute nodes. The naming pattern is prefixAABBBBBB, where AA is the two-digit number of the scale set and BBBBBB is the 6-digit hexadecimal value of the instance inside the scale set
- instanceCountPerVMSS : number of VM instances inside a single scale set. Default is 2, maximum is 100
- numberOfVMSS : number of VM scale sets to create. Default is 1, maximum is 100
- RGvnetName : The name of the Resource Group used to deploy the Master VM and the VNET.
- adminUsername : This is the name of the administrator account to create on the VM. It is recommended to use the same one as for the Master VM.
- adminPassword : Password to associate with the administrator account. It is highly encouraged to use passwordless SSH key authentication instead.
- sshKeyData : The public SSH key to associate with the administrator user. Format has to be on a single line 'ssh-rsa key'
- masterName : The short name of the Master VM
- postInstallCommand : a post-installation command to launch after provisioning. This command needs to be enclosed in quotes, for example 'bash /data/postinstall.sh'.
- imageId : Specify the resource ID of the image to be used, in the format /subscriptions/{SubscriptionId}/resourceGroups/{ResourceGroup}/providers/Microsoft.Compute/images/{ImageName}. This value is only used when computeNodeImage is set to CustomLinux or CustomWindows
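Since every scale set holds the same number of VMs and at most 100 of them, a target node count has to be split into numberOfVMSS and instanceCountPerVMSS. The sketch below shows one way to do that split; the function plan_vmss is illustrative, and rounding up to equal-sized sets (which may slightly overshoot the target) is a design choice, not something the templates require.

```shell
#!/bin/sh
# Split a target number of compute nodes into equally sized scale sets.
# Limits from the parameter list: at most 100 VMs per set and 100 sets.
plan_vmss() {
  local target="$1" max_per_set=100 max_sets=100 sets per_set
  if [ "$target" -lt 1 ] || [ "$target" -gt $(( max_per_set * max_sets )) ]; then
    echo "target must be between 1 and $(( max_per_set * max_sets ))" >&2
    return 1
  fi
  # Fewest sets that can hold the target, then an even share per set (rounded up).
  sets=$(( (target + max_per_set - 1) / max_per_set ))
  per_set=$(( (target + sets - 1) / sets ))
  echo "numberOfVMSS=${sets} instanceCountPerVMSS=${per_set}"
}

# Example: 250 nodes need 3 scale sets of 84 VMs (252 VMs in total).
plan_vmss 250
```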
After a few minutes, once the provisioning succeeds, you should see the new hosts added on the Ganglia monitoring page, if it is set up.
If PBS Pro is used, SSH to the master and run the pbsnodes -a command to list all the registered nodes.
If nfsonmaster is chosen, the NFS mount point /data is automatically mounted.
Your cluster is now ready to host applications and run jobs.
Intel MPI and InfiniBand are only available for A8/A9 and H16r instances. A default user named hpcsvc has been created on the compute nodes and on the master node with passwordless access, so it can be used immediately to run MPI across nodes.
To begin, first SSH to the master and switch to the hpcsvc user. From there, SSH to one of the compute nodes and configure MPI by following the instructions from here.
To run the two-node pingpong test, execute the following commands:
impi_version=`ls /opt/intel/impi`
source /opt/intel/impi/${impi_version}/bin64/mpivars.sh
mpirun -hosts <host1>,<host2> -ppn 1 -n 2 -env I_MPI_FABRICS=shm:dapl -env I_MPI_DAPL_PROVIDER=ofa-v2-ib0 -env I_MPI_DYNAMIC_CONNECTION=0 -env I_MPI_FALLBACK_DEVICE=0 IMB-MPI1 pingpong
You should see output similar to the following:
#------------------------------------------------------------
# Intel (R) MPI Benchmarks 4.1 Update 1, MPI-1 part
#------------------------------------------------------------
# Date : Thu Jan 26 02:16:14 2017
# Machine : x86_64
# System : Linux
# Release : 3.10.0-229.20.1.el7.x86_64
# Version : #1 SMP Tue Nov 3 19:10:07 UTC 2015
# MPI Version : 3.0
# MPI Thread Environment:
# New default behavior from Version 3.2 on:
# the number of iterations per message size is cut down
# dynamically when a certain run time (per message size sample)
# is expected to be exceeded. Time limit is defined by variable
# "SECS_PER_SAMPLE" (=> IMB_settings.h)
# or through the flag => -time
# Calling sequence was:
# IMB-MPI1 pingpong
# Minimum message length in bytes: 0
# Maximum message length in bytes: 4194304
#
# MPI_Datatype : MPI_BYTE
# MPI_Datatype for reductions : MPI_FLOAT
# MPI_Op : MPI_SUM
#
#
# List of Benchmarks to run:
# PingPong
#---------------------------------------------------
# Benchmarking PingPong
# #processes = 2
#---------------------------------------------------
#bytes #repetitions t[usec] Mbytes/sec
0 1000 3.37 0.00
1 1000 3.40 0.28
2 1000 3.69 0.52
4 1000 3.39 1.13
8 1000 3.41 2.24
16 1000 3.38 4.51
32 1000 2.78 10.99
64 1000 2.79 21.90
128 1000 3.12 39.09
256 1000 3.34 73.13
512 1000 3.79 128.87
1024 1000 4.85 201.48
2048 1000 5.74 340.21
4096 1000 7.06 552.98
8192 1000 8.51 917.87
16384 1000 10.86 1438.11
32768 1000 16.55 1888.21
65536 640 28.15 2220.37
131072 320 53.47 2337.75
262144 160 84.07 2973.66
524288 80 148.77 3360.92
1048576 40 284.91 3509.84
2097152 20 546.43 3660.15
4194304 10 1077.75 3711.45
# All processes entering MPI_Finalize
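To compare runs, the two most interesting figures in this output are the zero-byte latency and the peak bandwidth. The sketch below extracts them with awk, assuming the four-column row layout shown above; the function name summarize_pingpong and its output format are illustrative.

```shell
#!/bin/sh
# Summarize IMB PingPong output read from the given files (or stdin).
# Data rows have four numeric columns: bytes, repetitions, t[usec], Mbytes/sec.
summarize_pingpong() {
  awk '
    NF == 4 && $1 ~ /^[0-9]+$/ && $2 ~ /^[0-9]+$/ && $3 ~ /^[0-9.]+$/ && $4 ~ /^[0-9.]+$/ {
      if ($1 == 0) latency = $3                      # latency of the 0-byte message
      if ($4 + 0 > peak) { peak = $4 + 0; peak_bytes = $1 }
    }
    END {
      if (latency == "") exit 1                      # no data rows found
      printf "latency_0B_usec=%s peak_MBps=%.2f at_bytes=%s\n", latency, peak, peak_bytes
    }
  ' "$@"
}

# Example using three rows copied from the output above.
summarize_pingpong <<'EOF'
#bytes #repetitions t[usec] Mbytes/sec
0 1000 3.37 0.00
1024 1000 4.85 201.48
4194304 10 1077.75 3711.45
EOF
```

On a real run, pipe the mpirun output into the function or pass it a saved output file as an argument.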
SSH to the master node and switch to the hpcsvc user. Then change to the home directory:
sudo su hpcsvc
cd
Create a shell script named pingpong.sh with the content listed below:
#!/bin/bash
# set the number of nodes and processes per node
#PBS -l nodes=2:ppn=1
# set name of job
#PBS -N mpi-pingpong
source /opt/intel/impi/5.1.3.181/bin64/mpivars.sh
mpirun -env I_MPI_FABRICS=dapl -env I_MPI_DAPL_PROVIDER=ofa-v2-ib0 -env I_MPI_DYNAMIC_CONNECTION=0 IMB-MPI1 pingpong
Then submit the job:
qsub pingpong.sh
The job output is written to the current directory, in files named mpi-pingpong.e* and mpi-pingpong.o*.
The mpi-pingpong.o* file should contain the MPI pingpong output, as shown above for the manual test.
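Once the job has finished, the output file can be checked automatically. The sketch below treats a run as complete when the file contains both the benchmark banner and the final MPI_Finalize line seen in the manual test; this criterion and the function name check_pingpong_output are assumptions for illustration.

```shell
#!/bin/sh
# Succeeds if the given PBS output file holds a complete PingPong run.
check_pingpong_output() {
  local file="$1"
  [ -s "$file" ] || { echo "missing or empty: $file" >&2; return 1; }
  grep -q 'Benchmarking PingPong' "$file" &&
    grep -q 'All processes entering MPI_Finalize' "$file"
}

# Check every job output file in the current directory, if any exist.
for f in mpi-pingpong.o*; do
  [ -e "$f" ] || continue
  if check_pingpong_output "$f"; then
    echo "$f: benchmark completed"
  else
    echo "$f: incomplete or failed"
  fi
done
```

If a file is reported as incomplete, the matching mpi-pingpong.e* file is the first place to look for error messages.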
Please report bugs by opening an issue in the GitHub Issue Tracker
This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.