[XeVM] Add first integration tests #425
Conversation
Force-pushed from 2fed342 to b656a80.
Nice, thanks! Looks good overall, some minor comments inlined.
GPUCLQUEUE *getOrCreateStaticQueue() {
  // Reuse the static queue if it already exists; otherwise create one.
  if (!lastQueue) {
    return mgpuStreamCreate();
  }
  return lastQueue;
}
I think we already had it somewhere? @AndreyPavlenko
This is in the GpuOclRuntime.
@akroviakov could we use the gpu runner for this? It will create the queue on startup. It also does not depend on the shared lib with wrappers.
The gpu-runner turned out to be a bit problematic.
Upstream, gpu-to-llvm does the type conversion, including memrefs (e.g., for kernel arguments), and lowers runtime calls before gpu-module-to-binary runs (this ensures uniformly lowered kernel arguments/parameters). However, gpu-to-gpuocl seems to do the work of gpu-to-llvm after gpu-module-to-binary.
Assuming we need to lower memrefs for gpu-module-to-binary to succeed, how can the ops that consume memrefs in the subsequent gpu-to-gpuocl pass (e.g., allocOp) remain legal?
gpu-to-gpuocl seems to be built mainly around how IMEX handles GPUX with its related custom passes, but I'm not sure how relevant that is when we aim for the upstream flow.
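To make the ordering concrete, here is a rough sketch as textual pass pipelines (device-side lowering passes omitted; these are just the pass names discussed here, not a verified pipeline):

// Upstream-style ordering: gpu-to-llvm lowers memrefs/kernel arguments and
// runtime calls first, then gpu-module-to-binary produces the binary.
static const char *upstreamOrdering =
    "builtin.module(gpu-kernel-outlining,gpu-to-llvm,gpu-module-to-binary)";

// gpu-to-gpuocl as it stands: the binary is expected to already exist while
// memref-consuming ops such as gpu.alloc are still present.
static const char *gpuOclOrdering =
    "builtin.module(gpu-module-to-binary,gpu-to-gpuocl)";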
gpu-to-gpuocl converts a subset of gpu dialect ops (launch, alloc, dealloc, memcpy) to runtime function calls.
Correct. The upstream gpu-to-llvm, however, only legalizes the launchOp arguments but does not lower the op itself; the launchOp is lowered during translation to LLVM IR (e.g., the mlir-translate test).
Normally, the binary string is generated somewhere between gpu-to-llvm and translation, by the gpu-module-to-binary pass (e.g., in the nvvm pipeline).
Hence it is unusual from the upstream perspective to expect a binary string to be available before gpu-to-llvm. In the case of gpu-to-gpuocl, both the binary and the memrefs (required for allocOp legality) are expected to exist simultaneously. However, in the upstream flow, how can we generate a binary for a kernel that has memref arguments if the memrefs themselves have not yet been lowered?
Yes, a "second pass" would be a solution. Maybe we could add an option flag for gpu-to-gpuocl to indicate the current mode (or create a new separate pass). More changes would be needed, though: for example, gpu-to-gpuocl relies on querying an imex-specific binary attribute of a GPU module, whereas upstream uses gpu.binary ops. The required changes could be outside the scope of this PR.
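To illustrate the difference, a rough sketch only; the attribute name on the imex path below is a placeholder, not taken from the actual pass:

#include "mlir/Dialect/GPU/IR/GPUDialect.h"
#include "mlir/IR/BuiltinAttributes.h"
#include "mlir/IR/BuiltinOps.h"

// Upstream: gpu-module-to-binary materializes gpu.binary ops, which a
// conversion pass can simply walk.
static void walkUpstreamBinaries(mlir::ModuleOp module) {
  module.walk([](mlir::gpu::BinaryOp binary) {
    // the serialized objects live on this op
  });
}

// imex-style: the serialized binary is read from an attribute on the
// gpu.module (the attribute name here is a placeholder).
static mlir::StringAttr getImexStyleBinary(mlir::gpu::GPUModuleOp gpuModule) {
  return gpuModule->getAttrOfType<mlir::StringAttr>("imex.spirv.binary");
}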
Sure, we could add the required changes. I'm not sure, though, if the option flag is really required. On the first run there will not be any gpu.launch_func ops, so only the allocs would need to be lowered.
So the integration test would need to have a gpu.launch (with an inlined kernel) instead of gpu.launch_func and do the following:
- gpu-to-gpuocl to convert gpu.alloc, etc.
- outlining pass to create launchFuncOp from the launchOp
- gpu-module-to-binary
- gpu-to-gpuocl to convert launchFuncOp and the binary to a runtime call.
Do I understand the intent correctly?
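If so, written as one textual pipeline it would roughly be (a sketch only; intermediate device-lowering passes omitted):

// Proposed flow: the first gpu-to-gpuocl run handles gpu.alloc/dealloc/memcpy,
// outlining creates gpu.launch_func from gpu.launch, gpu-module-to-binary
// emits the binary, and the second gpu-to-gpuocl run lowers gpu.launch_func
// (plus the binary) to runtime calls.
static const char *proposedPipeline =
    "builtin.module(gpu-to-gpuocl,gpu-kernel-outlining,"
    "gpu-module-to-binary,gpu-to-gpuocl)";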
Perhaps, something like that.
Added gc-gpu-runner support for the xevm tests; the support relies on the outlining pass so that gpu-to-gpuocl can run twice. I will stick to the upstream flow using wrappers in the follow-up work for testing convenience, so it would be nice for the wrapper changes to remain (no other GPU tests use them anyway).
  lastQueue = queue;
  if (ptr) {
-   deallocDeviceMemory(queue, ptr);
+   deallocDeviceMemory(queue ? queue : getOrCreateStaticQueue(), ptr);
This looks a bit suspicious. What's the scenario where we have to create a queue in order to deallocate memory?
if (serializedSPIRVBinary->size() % 4) {
  getOperation().emitError() << "SPIRV code size must be a multiple of 4.";
  return std::nullopt;
}
Is there a silent failure in the translator that can result in this check being true?
This is a requirement for the resulting SPIR-V to be valid: a SPIR-V module is a stream of 32-bit words, so its byte size must be a multiple of 4.
The previous serializedSPIRVBinary->size() + 1 was the source of the "Code size must be a multiple of 4 but is X" kind of error, hence the check explicitly verifies the resulting code size.
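As a standalone illustration of the invariant (hypothetical values, not code from this PR):

#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// A SPIR-V module is a sequence of 32-bit words, so its serialized byte size
// is always 4 * wordCount; appending a single extra byte (the former
// size() + 1) can never satisfy the multiple-of-4 check.
int main() {
  std::vector<uint32_t> words(5); // 5 words -> 20 bytes
  std::size_t byteSize = words.size() * sizeof(uint32_t);
  assert(byteSize % 4 == 0);       // valid module size
  assert((byteSize + 1) % 4 != 0); // the off-by-one always fails
  return 0;
}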
The previous serializedSPIRVBinary->size() + 1 was the source
Was the size calculation incorrect, or are there cases where the returned size is not a multiple of 4?
Force-pushed from 6f9f446 to 1be438f.
Force-pushed from ceb79ba to 5540c6f.
Say we want to start upstreaming the xeVM dialect in its current form: which changes exactly should we take from this repo and move to llvm?
The upstream-relevant xevm is almost entirely covered by:
Integration tests are not very relevant for upstreaming. The miscellaneous changes to wrappers and
Force-pushed from fede4a8 to b10a889.
Force-pushed from 7257221 to c9576cb.
It seems like all of the crucial points were addressed and CI is green. Shall we merge? @AndreyPavlenko @kurapov-peter
Force-pushed from 317174b to 512a8d4.
Force-pushed from 512a8d4 to 6f65ea6.
This PR contains the necessary changes to launch an XeVM integration test via gc-cpu-runner.
Changes to OpenCLRuntimeWrappers improve error decoding and establish one static queue (this ensures no implicit creation of new queues, which turned out to be problematic).
Load/store intrinsics only work with the maximum sub-group size (not the active sub-group size, which can be smaller), as noted in the restrictions. For example, on GPU Max 1100 the test runs as expected when launched with 32 threads. Although a full sub-group is still formed when launched with 16 threads, it is not of the maximum size and hence the result is not as expected (UB according to the docs). It is possible to select the maximum sub-group size: in MLIR this is supported via the intel_reqd_sub_group_size attribute of a kernel.
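For illustration, a minimal sketch of attaching it to an outlined kernel (where exactly the attribute belongs and how it is spelled in the final lowering is an assumption here):

#include "mlir/Dialect/GPU/IR/GPUDialect.h"
#include "mlir/IR/Builders.h"

// Sketch: request the maximum sub-group size (e.g., 32 on GPU Max 1100) for a
// kernel so the load/store intrinsics see a full-width sub-group. The
// attribute name comes from the description above; its placement on the
// gpu.func is assumed.
static void requireMaxSubGroupSize(mlir::gpu::GPUFuncOp kernel, unsigned size) {
  mlir::Builder builder(kernel.getContext());
  kernel->setAttr("intel_reqd_sub_group_size", builder.getI32IntegerAttr(size));
}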