
[XeVM] Add first integration tests #425

Merged 3 commits into main on Jan 21, 2025

Conversation

@akroviakov (Contributor) commented Jan 5, 2025

This PR contains the necessary changes to launch an XeVM integration test via gc-cpu-runner.

  • Changes to OpenCLRuntimeWrappers improve error decoding and establish a single static queue (this ensures no implicit creation of new queues, which had turned out to be problematic).

  • Load/store intrinsics only work with the maximum sub-group size (not the active sub-group size, which can be smaller), as noted in the restrictions.
    For example, on GPU Max 1100 the test runs as expected when launched with 32 threads. Although a full sub-group is still formed when launched with 16 threads, it is not of the maximum size, so the result is not as expected (undefined behavior according to the docs). It is possible to require the maximum sub-group size:

The optional __attribute__((intel_reqd_sub_group_size(<int>))) can be used to indicate that the kernel must be compiled and executed with the specified sub-group size. When this attribute is present, get_max_sub_group_size() is guaranteed to return the specified integer value.

In MLIR this is supported via the intel_reqd_sub_group_size attribute of a kernel (see the sketch after this list).

  • The set of available matrix-multiplication intrinsics can differ depending on the minimum supported sub-group size; such intrinsics should therefore be called using the minimum supported sub-group size.
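
For illustration, a kernel pinned to a required sub-group size might look roughly like this in MLIR (a minimal sketch; the kernel body and the exact attribute spelling are illustrative and not taken from this PR's tests):

    gpu.module @kernels {
      // Hypothetical kernel: the attribute requests compilation and execution
      // with a sub-group size of 32, so get_max_sub_group_size() returns 32.
      gpu.func @copy(%src: memref<32xf32>, %dst: memref<32xf32>) kernel
          attributes {intel_reqd_sub_group_size = 32 : i32} {
        %tid = gpu.thread_id x
        %v = memref.load %src[%tid] : memref<32xf32>
        memref.store %v, %dst[%tid] : memref<32xf32>
        gpu.return
      }
    }

Launching such a kernel with 32 threads along x then matches the required sub-group size described above.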

@akroviakov force-pushed the akroviak/xevm-integration-test branch 3 times, most recently from 2fed342 to b656a80 on January 5, 2025 at 14:56

@kurapov-peter (Contributor) left a comment

Nice, thanks! Looks good overall; some minor comments inline.

include/gc/Conversion/Passes.h (review comments resolved)
include/gc/Dialect/LLVMIR/XeVMOps.td (review comments resolved)
lib/gc/Conversion/XeVMToLLVM/XeVMToLLVM.cpp (review comments resolved)
Comment on lines +539 to +542
GPUCLQUEUE *getOrCreateStaticQueue() {
  // Reuse the cached queue if one exists; otherwise create a new one.
  if (!lastQueue) {
    return mgpuStreamCreate();
  }
  return lastQueue;
}

Contributor

I think we already had it somewhere? @AndreyPavlenko

Contributor

@AndreyPavlenko commented Jan 13, 2025

This is in the GpuOclRuntime.
@akroviakov could we use the gpu runner for this? It will create the queue on startup. It also does not depend on the shared lib with wrappers.

Contributor Author

The gpu-runner turned out to be a bit problematic.
In the upstream flow, gpu-to-llvm does the type conversion, including memrefs (e.g., for kernel arguments), and lowers runtime calls before gpu-module-to-binary runs (this ensures uniformly lowered kernel arguments/parameters). However, gpu-to-gpuocl seems to do the work of gpu-to-llvm after gpu-module-to-binary.
Assuming we need to lower memref for gpu-module-to-binary to succeed, how can the ops that consume memref in the subsequent gpu-to-gpuocl pass (e.g., allocOp) remain legal?

gpu-to-gpuocl seems to be built mainly around how IMEX handles GPUX with its related custom passes, but I'm not sure how relevant that is when we aim for the upstream flow.

Contributor

gpu-to-gpuocl converts a subset of gpu dialect ops (launch, alloc, dealloc, memcpy) to the runtime function calls.
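
For illustration, the host-side IR consumed by gpu-to-gpuocl looks roughly like this (a hedged sketch, not taken from this repo's tests; the names of the runtime functions it lowers to are omitted here):

    module attributes {gpu.container_module} {
      gpu.module @kernels {
        gpu.func @k(%arg0: memref<32xf32>) kernel {
          gpu.return
        }
      }
      func.func @host(%host_buf: memref<32xf32>) {
        %c1 = arith.constant 1 : index
        %c32 = arith.constant 32 : index
        // Each of the following gpu ops becomes a runtime function call.
        %dev_buf = gpu.alloc () : memref<32xf32>
        gpu.memcpy %dev_buf, %host_buf : memref<32xf32>, memref<32xf32>
        gpu.launch_func @kernels::@k
            blocks in (%c1, %c1, %c1) threads in (%c32, %c1, %c1)
            args(%dev_buf : memref<32xf32>)
        gpu.dealloc %dev_buf : memref<32xf32>
        return
      }
    }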

Contributor Author

Correct. The upstream gpu-to-llvm, however, only legalizes the launchOp arguments but does not lower the op itself; the launchOp is lowered during translation to LLVM IR (e.g., in the mlir-translate tests).

Normally, the binary string is generated by the gpu-module-to-binary pass somewhere between gpu-to-llvm and translation (e.g., in the nvvm pipeline).

Hence it is unusual from the upstream perspective to expect a binary string available before gpu-to-llvm. In the case of gpu-to-gpuocl, both the binary and the memrefs (required due to allocOp legality) are expected to exist simultaneously. However, in the upstream flow, how can we generate a binary for a kernel that has memref arguments if the memrefs themselves have not yet been lowered?
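
As a point of reference, the upstream ordering described here can be sketched in lit-test form (pass names are the upstream ones; device-lowering and target-attachment passes are omitted, so this only illustrates the ordering, not a runnable test for this repo):

    // RUN: mlir-opt %s --gpu-kernel-outlining --gpu-to-llvm --gpu-module-to-binary \
    // RUN: | mlir-translate --mlir-to-llvmir
    // gpu-to-llvm legalizes host types (memrefs) and launch arguments first;
    // gpu-module-to-binary then serializes the gpu.module, and the launch op
    // itself is only lowered during translation to LLVM IR.
    func.func @main() {
      %c1 = arith.constant 1 : index
      gpu.launch blocks(%bx, %by, %bz) in (%gx = %c1, %gy = %c1, %gz = %c1)
                 threads(%tx, %ty, %tz) in (%sx = %c1, %sy = %c1, %sz = %c1) {
        gpu.terminator
      }
      return
    }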

Contributor Author

@akroviakov commented Jan 14, 2025

Yes, a "second pass" would be a solution. Maybe we could add some option flag for gpu-to-gpuocl to indicate the current mode (or create a new separate pass). There would be more changes needed though, for example, gpu-to-gpuocl relies on querying imex-specific binary attribute of a GPU module, whereas the upstream uses gpu.binary ops. Required changes could be outside of the scope of this PR.

Contributor

Sure, we could add the required changes. I'm not sure, though, that the option flag is really required. On the first run there will not be any gpu.launch_func ops, so only the allocs would be lowered.

Contributor Author

@akroviakov commented Jan 14, 2025

So the integration test would need to have a gpu.launch (with an inlined kernel) instead of gpu.launch_func and do the following (sketched after the list):

  1. gpu-to-gpuocl to convert gpu.alloc, etc.
  2. outlining pass to create launchFuncOp from the launchOp
  3. gpu-module-to-binary
  4. gpu-to-gpuocl to convert launchFuncOp and the binary to a runtime call.
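
A rough pipeline sketch of these four steps (the gc-opt driver name is assumed here, the pass flags are spelled as the passes are named in this thread, and the options are illustrative):

    // RUN: gc-opt %s --gpu-to-gpuocl \
    // RUN:   --gpu-kernel-outlining \
    // RUN:   --gpu-module-to-binary \
    // RUN:   --gpu-to-gpuocl
    // 1st gpu-to-gpuocl run: lowers gpu.alloc / gpu.dealloc / gpu.memcpy only.
    // gpu-kernel-outlining:  turns gpu.launch (inlined kernel) into gpu.launch_func.
    // gpu-module-to-binary:  serializes the outlined gpu.module.
    // 2nd gpu-to-gpuocl run: lowers gpu.launch_func plus the binary to a runtime call.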

Do I understand the intent correctly?

Contributor

Perhaps something like that.

Contributor Author

@akroviakov commented Jan 20, 2025

Added gc-gpu-runner support for the XeVM tests; the support relies on the outlining pass so that gpu-to-gpuocl can run twice. I will stick to the upstream flow using wrappers in the follow-up work for testing convenience, so it would be nice for the wrapper changes to remain (no other GPU tests use them anyway).

Comment on lines -470 to +557
 lastQueue = queue;
 if (ptr) {
-  deallocDeviceMemory(queue, ptr);
+  deallocDeviceMemory(queue ? queue : getOrCreateStaticQueue(), ptr);

Contributor

This looks a bit suspicious; what's the scenario in which we have to create a queue in order to deallocate memory?

Comment on lines +147 to +150
if (serializedSPIRVBinary->size() % 4) {
  getOperation().emitError() << "SPIRV code size must be a multiple of 4.";
  return std::nullopt;
}

Contributor

Is this a translator's silent failure that can result in this check being true?

Contributor Author

This is a requirement for the resulting SPIRV binary to be valid.
The previous serializedSPIRVBinary->size() + 1 was the source of the "Code size must be a multiple of 4 but is X" type of error, hence the check explicitly verifies the resulting code size.

Contributor

> The previous serializedSPIRVBinary->size() + 1 was the source

Was the size calculation incorrect, or are there cases where the returned size is not a multiple of 4?

src/gc-cpu-runner/CMakeLists.txt (review comment resolved)
@akroviakov force-pushed the akroviak/xevm-integration-test branch from 6f9f446 to 1be438f on January 13, 2025 at 19:02
@akroviakov force-pushed the akroviak/xevm-integration-test branch 4 times, most recently from ceb79ba to 5540c6f on January 16, 2025 at 13:03
@Garra1980

Say, we want to start upstreaming the XeVM dialect in its current form: what changes exactly should we take from this repo and move to upstream LLVM?

@akroviakov (Contributor Author)

The upstream-relevant XeVM code is almost entirely covered by:

include/gc/Transforms/Passes.td
test/mlir/test/gc/Transforms/GPU/xevm-attach-target.mlir
test/mlir/test/gc/Transforms/GPU/module-to-binary-xevm.mlir
test/mlir/test/gc/Conversion/GPU/XeVMToLLVM/*
XeVMToLLVMIRTranslation.cpp
XeVMToLLVM.cpp
XeVMOps.td
XeVMAttachTarget.cpp
XeVMDialect.cpp
+ relevant cmake files

Integration tests are not very relevant for upstreaming.

The miscellaneous changes to wrappers and gpu-runner are gc-specific.

@akroviakov force-pushed the akroviak/xevm-integration-test branch 2 times, most recently from fede4a8 to b10a889 on January 20, 2025 at 10:23
@akroviakov force-pushed the akroviak/xevm-integration-test branch from 7257221 to c9576cb on January 20, 2025 at 15:01
@akroviakov (Contributor Author)

It seems that all of the crucial points have been addressed and CI is green. Shall we merge? @AndreyPavlenko @kurapov-peter

@akroviakov force-pushed the akroviak/xevm-integration-test branch 2 times, most recently from 317174b to 512a8d4 on January 20, 2025 at 17:25
@akroviakov force-pushed the akroviak/xevm-integration-test branch from 512a8d4 to 6f65ea6 on January 20, 2025 at 17:37
@akroviakov merged commit d5e6a56 into main on Jan 21, 2025
6 checks passed