Documentation for rocWMMA is available at https://rocm.docs.amd.com/projects/rocWMMA/en/latest.
- Added OCP f8/bf8 datatype support
- Added support for gfx12 arch targets
- Optimized some aos<->soa transforms with half-rotation offsets
- Refactored rocBLAS reference entrypoint for validation and benchmarking
- ROCWMMA_* preprocessor configurations are now all assigned values
- Updated default arch targets for ASAN builds
- Updated actor-critic implementation
- Fixed a bug in f64 validation due to faulty typecasting
- Fixed a bug causing runtime compilation errors with hipRTC
- Various documentation updates and fixes
- Added internal utilities for element-wise vector transforms
- Added internal utilities for cross-lane vector transforms
- Implemented internal aos<->soa transforms for block sizes of 16, 32, 64, 128 and 256 and vector widths of 2, 4, 8 and 16
- Added tests for new internal transforms
- Improved loading layouts by increasing vector width for fragments with blockDim > 32
- API applyDataLayout transform now accepts WaveCount template argument for cooperative fragments
- API applyDataLayout transform now physically applies aos<->soa transform as necessary
- Refactored entry-point of std library usage to improve hipRTC support
- Documentation updates for installation, programmer's guide and API reference
- Fixed some header includes ordering to improve portability
- Added bf16 support for hipRTC sample
- Changed Clang C++ version to C++17
- Updated rocwmma_coop API
- Linked rocWMMA to hiprtc
- Fixed compile/runtime arch checks
- Built all test in large code model
- Removed inefficient branching in layout loop unrolling
- Support for gfx940, gfx941, and gfx942 targets
- Support for f8, bf8, and xfloat32 data types
- support for
HIP_NO_HALF
,__ HIP_NO_HALF_CONVERSIONS__
, and__ HIP_NO_HALF_OPERATORS__
(e.g., PyTorch environment)
- rocWMMA with hipRTC now supports
bfloat16_t
data type - gfx11 WMMA now uses lane swap instead of broadcast for layout adjustment
- Updated samples GEMM parameter validation on host arch
- Disabled GoogleTest static library deployment
- Extended tests now build in large code model
- Fixed a synchronization bug
- Updated rocWMMA CMake versioning
- Cross-lane operation backends (Blend, Permute, Swizzle, and Dpp)
- GPU kernels for rocWMMA unit test pre-process and post-process operations (fill, validation)
- Performance GEMM samples for half, single, and double precision
- rocWMMA CMake versioning
- Vectorized support in coordinate transforms
- Included ROCm SMI for runtime clock rate detection
- Fragment transforms for transpose and change data layout
- Default to GPU rocBLAS validation against rocWMMA
- Re-enabled int8 GEMM tests on gfx9
- Upgraded to C++17
- Restructured the unit test folder for consistency
- Consolidated rocWMMA samples common code
- Support for Wave32 on gfx11+
- Infrastructure changes to support hipRTC
- Performance tracking system
- Library config to support multiple architectures
- Vector cross-lane operations support
- Modified the assignment of hardware information
- Modified data access for unsigned data types
- Refactored vector backend to be compatible with
HIP_vector_type
- GEMM driver APIs for flow control built-ins
- benchmark logging systems
- Restructured tests to follow naming convention; added macros for test generation
- Changed CMake to accommodate the modified test infrastructure
- Fine-tuned the multi-block kernels with and without lDs
- Adjusted maximum vector width to dWordx4
- Updated efficiencies to display as whole number percentages
- Updated throughput from GFlops/s to TFlops/s
- Reset the ad-hoc tests to use smaller sizes
- Modified the output validation to use CPU-based implementation against rocWMMA
- Modified the extended vector test to return error codes for memory allocation failures
- Runtime checks to disable tests on non-target GPUS
- Workgroup-aware GEMM kernels
- Workgroup-aware validation and benchmark test suite
- Warm-up run to existing tests
- Refactored
lds_mapping_util
into GEMM global, local mapping, GEMM driver, GEMM config, and scheduling classes - Modified resource allocation and tracking of GEMM and DLRM buffers
- Improved low-level data loading patterns
- Reduced branching on cooperative load and store
- Updated GEMV sample
- Updated GEMM sample
- Unit tests for DLRM kernels
- GEMM sample
- DLRM sample
- SGEMV sample
- Unit tests for cooperative WMMA load and stores
- Unit tests for
IOBarrier.h
- WMMA load and store tests for different matrix types (A, B, and Accumulator)
- More block sizes (1, 2, 4, 8) to test
MmaSyncMultiTest
- Block sizes 4, 8 to test
MmaSynMultiLdsTest
- Support for WMMA load and store layouts with a block dimension greater than 64
- IOShape structure to define the attributes of mapping and layouts for all WMMA matrix types
- CI testing for rocWMMA
- Renamed WMMA to rocWMMA in CMake, header files, and documentation
- Renamed library files
- Modified
Layout.h
to use different matrix offset calculations (base offset, incremental offset, and cumulative offset) - Opaque load and store continue to use incremental offsets as they fill the entire block
- Cooperative load and store use cumulative offsets as they fill only small portions for the entire block
- Increased max split counts to 64 for cooperative load and store
- Moved all the WMMA definitions and API headers to the rocWMMA namespace
- Modified WMMA fill unit tests to validate all matrix types (A, B, Accumulator)
- Unit tests for
MappingUtil.h
- Unit tests for
Layout.h
- Unit tests for non-native vector class in
Types.h
- Unit tests for WMMA load and store contamination check
- Doxygen support for rocWMMA documentation
- MFMA barrier in
IOBarrier.h
- A CMake flag to support WMMA kernel assembly code generation
- MMA sync test WMMA operation with LDS usage
- A script to generate the plots of different WMMA benchmarks
- Multi-block kernels with LDS usage
- Unit tests for multi-block WMMA kernels
- Modified GLlops calculation to accommodate multiple devices
- Removed half-types packing quirk with col major output
- Moved HIP resource management to
HipResource
class - Fixed NaN errors during output comparison
- Templatization for the
amdgcn_convert
class - WMMA load, store, and fill support for integral data types and float64
- MFMA support for i8
- Support for
bf16_1k
MFMA instructions - Code to identify the card type and its support during runtime
- Refactored and simplified
IOBroadcast.h
- Modified the fragment interface compatible with NVIDIA's definition
- Modified CMake to create a lean build of the rocWMMA library
- CMake support for the library and unit tests
- Integrated unit test with GoogleTest and OpenMP
- Host overload operators for
hfloat16_t
- Relative error calculation for non-integral data comparison
- Assembly generation of cooperative load and store code
- Compiler issues with new versions of ROCm
- Sped up compilation time by moving thread block sizes to function arguments instead of template parameters
- Moved all the existing unit tests to a
test
folder - Moved all the header files to
library/include
- Modified
Layout.h
to use RowNT/ColNT to eliminate LDS usage inmma_sync
- Deprecated buffer load/store and local load/store
- support for the bfloat16 compute type
- Renamed
__half
tohfloat_16
for consistency - Modified
Convert.h
to support native to bfloat16 conversion and vice versa - Modified
IOBroadCast.h
to incorporate bfloat16 data packing - Modified
IOTraits.h
to add bfloat16 packing traits - Modified
MFMA.h
to add MFMA invocation calls to bfloat16 data - Modified WMMA types to include
bfloat16_t
- Modified the WMMA load, store, and MMA unit tests to validate bfloat16
- Support for fp16 compute type
- Direct MFMA support for non-native
__half
data type
- Adjusted the vector storage to accommodate non-native types
- Fixed data comparison operators for fp16 data types
- Modified
Convert.h
to support native to__half
conversion and vice versa - Modified
IOBroadCast.h
to incorporate__half
data packing - Modified
IOTraits.h
to add__half
packing traits - Modified
MFMA.h
to add MFMA invocation calls to__half
data - Modified WMMA Types to include
__half _t
- Modified the WMMA load, store, and MMA unit tests to validate
__half
- Defined a WMMA namespace with the supported matrix types, memory, and layouts
- Defined a fragment datatype to control the data transfer between HIP and MFMA
- Implemented the rocWMMA functions :
load_matrix_sync
,load_matrix_coop_sync
,store_matrix_sync
,fill_fragment
, andmma_sync
- Implemented
Types.h
to define the supported data types - Implemented the class
IOTraits
to define packing traits for the defined types as WMMA works on the packed registers - Buffer load, store to support LLVM data instructions
- Opaque load, store
- Cooperative load, store to optimize the memory overhead
- Local load, store to perform register packing
- Implemented
Convert.h
to perform non-native data type conversion to native types and vice versa IOBroadcast
class to perform packing for all input data (multiple registers)- Implemented
IOConfig
to set the optimal input/output configurations for rocWMMA matrix types - Implemented
IOPack
andIOUnpack
to convert the unpacked device memory into packed registers and vice versa Layout
class to define the data layout in matrix space- MFMA to call the low-level MFMA hardware instructions
- Implemented the
MappingUtil
class to map from workgroup configurations to functional wave units Performance.h
to compute GFLOPS based on hardware configurationsReference.h
to implement the CPU GEMM operationUtils.h
to implement matrix data operationsrocBLASReference.h
to invoke the rocBLAS GEMM function- Unit tests to validate WMMA APIs (
load
,store
,fill
, andmma
) - Makefile support to build library and tests