-
Notifications
You must be signed in to change notification settings - Fork 433
Build and run ROCm UCX OpenMPI
The LargeBar feature must be enabled on the AMD GPU, like the AMD Radeon Instinct (MI series) of the GPU cards:
The physical address of all system memory and PCI exported GPU device memory must be under (1<<44) for GFX9 GPUs like Vega10 (MI25) and Vega20 (MI50 or MI60). This can be set in the system BIOS according to this page:
You can use the code below as a sanity check for BAR setting. If the Large BAR is not enabled, running this example would cause a segmentation fault. The setting is usually enabled in the BIOS. If it is not set up correctly, segmentation fault in ROCm UCT would show up in UCX (or when UCX is used in Open MPI).
#include <stdio.h>
#include "hip/hip_runtime.h"
int main(int argc, char ** argv) {
int * buf;
hipMalloc((void**)& buf, 100);
printf("address buf %p \n", buf);
printf("Buf[0] = %d\n", buf[0]);
buf[0] = 1;
printf("Buf[0] = %d\n", buf[0]);
return 0;
}
Compile the check_large_bar.c example:
$ /opt/rocm/bin/hipcc $(/opt/rocm/bin/hipconfig --cpp_config) -L/opt/rocm/lib/ -lamdhip64 check_large_bar.c -o check_large_bar
For example, output on a system with LargeBar enabled (no issue):
$ ./check_large_bar
address buf 0x14642b400000
Buf[0] = -1094795586
Buf[0] = 1
For example, output on a system with BIOS that is not set up for LargeBar, you would get a segmentation fault:
$ ./check_large_bar
address buf 0x7efa41c00000
Segmentation fault
To resolve the BAR memory issue mentioned above, you should check the following BIOS settings.
- For typical scenario relates to BAR, please visit BAR Memory Overview in the ROCm documentation
- In the System BIOS, look for the "PCIe/PCI/PnP Configuration"
- For example on the Supermicro server (SYS-4029GP-TRT2), you should look into the following:
Advanced->PCIe/PCI/PnP configuration-> Above 4G Decoding = Enabled
Advanced->PCIe/PCI/PnP Configuration->MMIOH Base = 512G
Advanced->PCIe/PCI/PnP Configuration->MMIO High Size = 256G
For descriptions to all relevant BIOS settings for AMD EPYC Rome CPUs:
- Please read "Chapter 3.4 HPC and Telco Settings" on pages 23-25 of the Workload Tuning Guide for AMD EPYC™ 7002 Series Processor Based Servers for the recommended BIOS settings for HPC. The non-default options in the guide are highlighted at the bottom of this wiki page.
- AMD I/O Power Management Utility
There are occasion where we notice system hang when IOMMU is enabled. If such case happens, you can try to following workarounds:
-
There is a BIOS option should be called the “AMD I/O Virtualization Technology”, you can set the option to disable to disable IOMMU. You may get a prompt that shows disabling AMD I/O Virtualization Technology will disable both SMT and set x2apic to auto, which is fine.
-
The above BIOS option is more preferred, but if you do have a use case that requires IOMMU to be enabled, the alternate approach would be the use the boot params:
iommu=pt amd_iommu=on
Append the above to GRUB_CMDLINE_LINUX in /etc/default/grub and "update-grub2" to boot with the params, and use "cat /proc/cmdline" to verify the params were being used to boot.
Install instruction is here: https://rocmdocs.amd.com/en/latest/Installation_Guide/Installation_new.html Note: For multi-node with InfiniBand, please install MLNX_OFED first, reboot, then install ROCm next.
sudo apt install git m4 autoconf automake libtool flex
export INSTALL_DIR=/path/to/install
export UCX_DIR=$INSTALL_DIR/ucx
export OMPI_DIR=$INSTALL_DIR/ompi
git clone https://github.com/openucx/ucx.git
cd ucx
# git checkout v1.10.x # optional to use v1.10.x branch
./autogen.sh
mkdir build
cd build
../contrib/configure-opt --prefix=$UCX_DIR --with-rocm=/opt/rocm --without-knem --without-cuda --enable-gtest --enable-examples
make
make install
git clone --recursive -b v4.1.x https://github.com/open-mpi/ompi.git
cd ompi
./autogen.pl
mkdir build
cd build
../configure --prefix=$OMPI_DIR --with-ucx=$UCX_DIR --without-verbs
make
make install
The OSU Micro Benchmarks v5.7 (12/11/2020) (link) added to evaluate the performance of various primitives with AMD GPU device and ROCm support. This functionality is exposed when configured with --enable-rocm option. We can use the following steps to compile OMB:
wget http://mvapich.cse.ohio-state.edu/download/mvapich/osu-micro-benchmarks-5.7.tar.gz
tar xvf osu-micro-benchmarks-5.7.tar.gz
mv osu-micro-benchmarks-5.7 osu
cd osu
./configure --enable-rocm --with-rocm=/opt/rocm CC=$OMPI_DIR/bin/mpicc CXX=$OMPI_DIR/bin/mpicxx LDFLAGS="-L$OMPI_DIR/lib/ -lmpi -L/opt/rocm/lib/ $(hipconfig -C) -lamdhip64" CPPFLAGS="-std=c++11"
Use UCX_RNDV_THRESH
environment variable to lower the rendezvous threshold from the 8KB default to 128B to improve the midrange performance.
cd osu
$OMPI_DIR/bin/mpirun -np 2 --mca btl '^openib' -x UCX_TLS=sm,self,rocm_copy,rocm_ipc --mca pml ucx -x UCX_RNDV_THRESH=128 osu_bw -d rocm D D
For GPUs that are connected via PCIe (and not XGMI), use the UCX_RNDV_PIPELINE_SEND_THRESH
and UCX_RNDV_FRAG_SIZE
environment variables to improve the large messages performance.
Note: UCX v1.12 introduced some new rendezvous protocols that may cause some issues for PCIe (non-XGMI) systems.
- Setting this UCX rendezvous scheme to put_zcopy with this environment variable to mpirun with
-x UCX_RNDV_SCHEME=put_zcopy
could avoid such issue. - The
UCX_RNDV_FRAG_SIZE
option was changed fromUCX_RNDV_FRAG_SIZE=4m
toUCX_RNDV_FRAG_SIZE=rocm:4m
in UCX v1.12.
cd osu
$OMPI_DIR/bin/mpirun -np 2 -mca btl '^openib' --mca pml ucx -x UCX_RNDV_SCHEME=put_zcopy -x UCX_RNDV_PIPELINE_SEND_THRESH=256k -x UCX_RNDV_FRAG_SIZE=4m mpi/pt2pt/osu_bw -d rocm D D
osu_bw example for GPUs to communicate over XGMI:
The environment variable HSA_ENABLE_SDMA=0
is used for XGMI interconnect to show the effective transfer bandwidth for inter-die data transfer between GPU device 2 and 3 (same MI250/MI250X OAM). For messages larger than 64MB, an effective utilization of about 150GB/s is achieved, which corresponds to 75% of the peak transfer bandwidth of 200GB/s for that connection. See GPU-Enabled MPI in ROCm docs for reference.
mpirun -np 2 \
--mca osc ucx -mca pml ucx -x UCX_TLS=sm,self,rocm_copy,rocm_ipc \
-x HSA_ENABLE_SDMA=0 -x UCX_RNDV_THRESH=256k -x LD_LIBRARY_PATH \
-x ROCR_VISIBLE_DEVICES=0,1 \
osu_bw -m 2:268435456 -d rocm D D
# OSU MPI-ROCM Bandwidth Test v6.1
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
# Size Bandwidth (MB/s)
2 0.97
4 1.96
8 3.93
16 7.92
32 15.78
64 26.90
128 61.11
256 92.35
512 160.11
1024 167.99
2048 217.86
4096 240.78
8192 317.69
16384 1472.15
32768 2824.13
65536 5545.13
131072 10417.34
262144 19003.88
524288 31896.64
1048576 51322.15
2097152 75204.86
4194304 114297.43
8388608 129588.08
16777216 140224.55
33554432 142330.08
67108864 149904.54
134217728 151245.19
268435456 151106.14
osu_bw example for GPUs to communicate over PCIe (non-XGMI):
/opt/mpi/ompi/bin/mpirun -np 2 -x UCX_RNDV_PIPELINE_SEND_THRESH=256k -x UCX_RNDV_FRAG_SIZE=4m -x UCX_RNDV_THRESH=128 --mca osc ucx --mca spml ucx -x LD_LIBRARY_PATH -x UCX_LOG_LEVEL=TRACE_DATA --allow-run-as-root -mca pml ucx -x UCX_TLS=sm,self,rocm_copy,rocm_ipc osu/mpi/pt2pt/osu_bw -d rocm D D
# OSU MPI-ROCM Bandwidth Test v5.7
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
# Size Bandwidth (MB/s)
1 0.74
2 1.49
4 2.98
8 5.96
16 9.41
32 10.14
64 11.98
128 14.03
256 28.42
512 58.70
1024 117.59
2048 234.75
4096 463.15
8192 866.08
16384 1575.19
32768 3130.02
65536 3625.22
131072 4729.94
262144 15873.00
524288 20361.94
1048576 22413.33
2097152 25365.81
4194304 25456.85
osu_latency example for GPUs to communicate over PCIe (non-XGMI):
/opt/mpi/ompi/bin/mpirun -np 2 -x UCX_RNDV_PIPELINE_SEND_THRESH=256k -x UCX_RNDV_FRAG_SIZE=4m -x UCX_RNDV_THRESH=128 --mca osc ucx --mca spml ucx -x LD_LIBRARY_PATH -x UCX_LOG_LEVEL=TRACE_DATA --allow-run-as-root -mca pml ucx -x UCX_TLS=sm,self,rocm_copy,rocm_ipc osu/mpi/pt2pt/osu_latency -d rocm D D
# OSU MPI-ROCM Latency Test v5.7
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
# Size Latency (us)
0 0.19
1 1.76
2 1.75
4 1.76
8 1.78
16 2.18
32 3.65
64 5.82
128 8.41
256 8.34
512 8.09
1024 8.69
2048 8.11
4096 8.17
8192 8.69
16384 8.54
32768 9.80
65536 17.43
131072 24.93
262144 16.02
524288 25.19
1048576 56.21
2097152 97.90
4194304 173.49
Use the UCX_RNDV_PIPELINE_SEND_THRESH
and UCX_RNDV_FRAG_SIZE
environment variables to improve the large messages performance between 2 nodes.
cd osu
$OMPI_DIR/bin/mpirun -np 2 -host <host1 ip>,<host2 ip> --mca btl '^openib' --mca pml ucx -x UCX_RNDV_PIPELINE_SEND_THRESH=256k -x UCX_RNDV_FRAG_SIZE=4m mpi/pt2pt/osu_bw -d rocm D D
If you see OSU hang with error message:
WARNING: There was an error initializing an OpenFabrics device.
rc_iface.c:778 UCX ERROR error modifying QP to RTR: Invalid argument
you are running on GID-based multi-host setup, try again with UCX_IB_ADDR_TYPE=ib_global
.
To troubleshoot performance issue, you may also want to check the system BIOS settings.
For the descriptions of the BIOS settings, they can be found in the "Chapter 3.4 HPC and Telco Settings" on pages 23-25 of the Workload Tuning Guide for AMD EPYC™ 7002 Series Processor Based Servers
The recommended options for AMD EPYC Rome CPU with the Instinct GPUs are shown below:
PCIe Settings | Options |
---|---|
Advanced > PCIe > Above 4G Decoding | Enable |
SMT Settings | Options |
---|---|
Advanced > AMD CBS > CPU Common Options > Performance > CCD/Core/Thread Enablement | Accept |
Advanced > AMD CBS > CPU Common Options > Performance > CCD/Core/Thread Enablement > SMT Control | Disable |
Global Core C-States | Options |
---|---|
Advanced > AMD CBS > CPU Common Options > Global C-state Control | Auto |
NUMA Nodes (NPS) | Options |
---|---|
Advanced > AMD CBS > DF Common Options > Memory Addressing > NUMA nodes per socket | NPS4 |
Memory Interleaving | Options |
---|---|
Advanced > AMD CBS > DF Common Options > Memory Addressing > Memory Interleaving | Auto |
IOMMU | Options |
---|---|
Advanced > AMD CBS > NBIO Common Options > IOMMU | Disabled |
PCIe Ten Bit Tag Support | Options |
---|---|
Advanced > AMD CBS > NBIO Common Options > PCIe Ten Bit Tag Support | Enable |
Determinism Slider | Options |
---|---|
Advanced > AMD CBS > NBIO Common Options > SMU Common Options > Determinism Control | Manual |
Advanced > AMD CBS > NBIO Common Options > SMU Common Options > Determinism Slider | Power |
Set cTDP | Options |
---|---|
Advanced > AMD CBS > NBIO Common Options > SMU Common Options > cTDP Control | Manual |
Advanced > AMD CBS > NBIO Common Options > SMU Common Options > cTDP | 240 (EPYC 7742) |
Package Power Limit | Options |
---|---|
Advanced > AMD CBS > NBIO Common Options > SMU Common Options > Package Power Limit Control | Manual |
Advanced > AMD CBS > NBIO Common Options > SMU Common Options > Package Power Limit | 240 (EPYC 7742) |
AMD CPU xGMI | Options |
---|---|
Set AMD CPU xGMI width to 16 bits and speed to 18 Gbps if CPU xGMI is supported by Chassis design | |
Advanced > AMD CBS > NBIO Common Options > SMU Common Options > xGMI Link Width Control | Manual |
Advanced > AMD CBS > NBIO Common Options > SMU Common Options > xGMI Force Link Width | 2 |
Advanced > AMD CBS > NBIO Common Options > SMU Common Options > xGMI Force Link Width Control | Force |
Advanced > AMD CBS > DF Common Options > Link > 4-Link xGMI Max Speed | 18Gbps |
Advanced > AMD CBS > DF Common Options > Link > 3-Link xGMI Max Speed | 18Gbps |
APBDIS | Options |
---|---|
Advanced > AMD CBS > NBIO Common Options > SMU Common Options > APBDIS | 1 |
DF C-States | Options |
---|---|
Advanced > AMD CBS > NBIO Common Options > SMU Common Options > DF Cstates | Auto |
Fixed SoC P State | Options |
---|---|
Advanced > AMD CBS > NBIO Common Options > SMU Common Options > Fixed SOC Pstate | P0 |
Memory Speed | Options |
---|---|
Set to max Memory Speed, if using 3200MHz DIMMs | 1600MHz |
Advanced > AMD CBS > UMC Common Options > DDR4 Common Options > DRAM Timing Configuration | Accept |
Advanced > AMD CBS > UMC Common Options > DDR4 Common Options > DRAM Timing Configuration > Overclock | Enabled |
Advanced > AMD CBS > UMC Common Options > DDR4 Common Options > DRAM Timing Configuration > Memory Clock Speed | 1600MHz (for DDR4 3200MHz) |
RAM Power Down | Options |
---|---|
Advanced > AMD CBS > UMC Common Options > DDR4 Common Options > DRAM Controller Configuration > DRAM Power Options > Power Down Enable | Disabled |
SME | Options |
---|---|
Advanced > AMD CBS > UMC Common Options > DDR4 Common Options > Security > TSME | Disabled |