UTPX (Userspace Transparent Paging Extension) is a proof-of-concept LD_PRELOAD
library that
accelerates HIP
managed allocations (e.g hipMallocManaged
) on systems without XNACK or with XNACK disabled.
Without XNACK, the GPU device cannot handle pagefaults and ROCm degrades the allocation to host-resident memory where all access from the device must cross the host-device interconnect (e.g., PCIe). In this degraded mode, no page migration occurs, so performance for memory-bandwidth bound applications will be limited to interconnect performance. For example, a MI100 operating at PCIe 4.0 x16 without XNACK will see application performance capped at 31.5GB/s when device-resident memory should be capable of 1228.8 GB/s, a 40x difference.
UTPX solves this by shifting pagefault handling to the host where it can be easily accomplished in
the userspace.
The implementation performs userspace page migration using mprotect
and signal handlers.
Paging is done at the granularity of the allocation itself using a Mirror-on-Access (MoA) scheme.
With MoA, initial allocations are resident on the host, and a separate device-resident allocation is
made whenever a kernel is launched that has dependency on the allocation.
A device to host write-back is triggered by a mprotect
induced pagefault, this happens if the host
mode of the device-resident memory is accessed in any way (e.g., through pointer dereference).
hipMemAdvise
and hipMemPrefetchAsync
may be a much better alternative than the MoA scheme which
is quite brittle and consumes extra host and device resource for keeping the allocations alive.
However, preliminary experiments show these two APIs providing only very limited performance
improvements (~2x from host-resident).
A simpler alternative would be to always use device-resident memory (e.g., hipMalloc
).
Notice that hipMalloc
return a host-accessible memory on the device, which is different
from cudaMalloc
this API is mirroring from.
This alternative only works well for explicit USM models like SYCL (using
only sycl::shared_malloc
) and for applications where the algorithm is completely offloaded.
StdPar models that interpose malloc
/free
would see all originally host-allocated memory become
remote which adds catastrophic latency even if bandwidth isn't a limiting factor, not to mention
that device allocations are physical and immediate and not virtual.
UTPX provides support for these two scenarios as a special mode by replacing hipMallocManaged
with hipMalloc
or by invoking hipMemPrefetchAsync
on the dependent memory before each kernel
invocation.
UTPX requires certain HIP memory APIs to be intercepted for the mirrored allocation to work correctly. A minimal set are implemented such that HIP, roc-stdpar, ICPX w/ Codeplay plugin, and hipSYCL (StdPar or SYCL using Integrated SCMP mode only) work:
<<<>>>
/hipLaunchKernelGGL
/hipModuleLaunchKernel
hipDeviceSynchronize
hipMallocManaged
hipFree
hipMemcpy
hipMemset
hipPointerGetAttributes
(partial, only works in roc-stdpar)- Any device query, event, or stream API, those do not require special handling.
The environment variable UTPX_MODE
supports:
MIRROR
for MoA, this is the default ifUTPX_MODE
is not setDEVICE
for always device resident but host-accessible (hipMalloc
) allocationADVISE
for coarse grainedhipMallocManaged
and dynamichipMemPrefetchAsync
calls on kernel submission
# On RadeonVII
# without UTPX:
> ./std-indices-stream
BabelStream
Version: 4.0
Implementation: STD (index-oriented)
Running kernels 100 times
Precision: double
Array size: 268.4 MB (=0.3 GB)
Total size: 805.3 MB (=0.8 GB)
Backing storage typeid: Pd
Function MBytes/sec Min (sec) Max Average
Copy 21989.273 0.02442 0.02462 0.02450
Mul 22020.178 0.02438 0.02460 0.02449
Add 19225.037 0.04189 0.04197 0.04193
Triad 19229.798 0.04188 0.04204 0.04193
Dot 13468.631 0.03986 0.04003 0.03994
# With UTPX:
> LD_PRELOAD="$HOME/utpx/build/libutpx.so" ./std-indices-stream
BabelStream
Version: 4.0
Implementation: STD (index-oriented)
Running kernels 100 times
Precision: double
Array size: 268.4 MB (=0.3 GB)
Total size: 805.3 MB (=0.8 GB)
Backing storage typeid: Pd
Function MBytes/sec Min (sec) Max Average
Copy 845300.937 0.00064 0.00064 0.00064
Mul 842076.480 0.00064 0.00070 0.00065
Add 835445.048 0.00096 0.00103 0.00097
Triad 833085.608 0.00097 0.00101 0.00097
Dot 851026.490 0.00063 0.00065 0.00064
This work is part of an evaluation of multiple StdPar implementations on AMD platforms. The full report is available at https://arxiv.org/abs/2401.02680, please cite this working using:
@misc{lin2024preliminary,
title={Preliminary report: Initial evaluation of StdPar implementations on AMD GPUs for HPC},
author={Wei-Chen Lin and Simon McIntosh-Smith and Tom Deakin},
year={2024},
eprint={2401.02680},
archivePrefix={arXiv},
primaryClass={cs.DC}
}
- CMake >= 3.14
- C++17-capable compiler
- x86_64 Linux (ROCm is only supported on x86_64: Linux on another architecture is not supported)
- No extra library is required beyond libstdc++; a ROCm (HIP/HSA) installation is not required and will not be used
cd utpx
cmake -Bbuild -H. -DCMAKE_BUILD_TYPE=Debug # use -DCMAKE_BUILD_TYPE=Release for benchmarks
cmake --build build -j
# library available at build/libutpx.so