[XeVM] Add first integration tests #425
Conversation
Force-pushed from 2fed342 to b656a80.
Nice, thanks! Looks good overall, some minor comments inlined.
GPUCLQUEUE *getOrCreateStaticQueue() {
  // Reuse the static queue if it already exists; otherwise create one.
  if (!lastQueue) {
    return mgpuStreamCreate();
  }
  return lastQueue;
}
I think we already had it somewhere? @AndreyPavlenko
This is in the GpuOclRuntime.
@akroviakov could we use the gpu runner for this? It will create the queue on startup. It also does not depend on the shared lib with wrappers.
The gpu-runner turned out to be a bit problematic.
Upstream, gpu-to-llvm does the type conversion, including memrefs (e.g., for kernel arguments), and lowers runtime calls before gpu-module-to-binary runs (this ensures uniformly lowered kernel arguments/parameters). However, gpu-to-gpuocl seems to do the work of gpu-to-llvm after gpu-module-to-binary.
Assuming we need to lower memrefs for gpu-module-to-binary to succeed, how can the ops that consume memrefs in the subsequent gpu-to-gpuocl pass (e.g., allocOp) remain legal?
gpu-to-gpuocl seems to be built mainly around how IMEX handles GPUX with its related custom passes, but I'm not sure how relevant that is when we aim for the upstream flow.
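To make the ordering concrete, here is a rough sketch as textual pass pipelines (device-side lowering passes omitted; these are just the pass names discussed here, not a verified pipeline):

// Upstream-style ordering: gpu-to-llvm lowers memrefs/kernel arguments and
// runtime calls first, then gpu-module-to-binary produces the binary.
static const char *upstreamOrdering =
    "builtin.module(gpu-kernel-outlining,gpu-to-llvm,gpu-module-to-binary)";

// gpu-to-gpuocl as it stands: the binary is expected to already exist while
// memref-consuming ops such as gpu.alloc are still present.
static const char *gpuOclOrdering =
    "builtin.module(gpu-module-to-binary,gpu-to-gpuocl)";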
gpu-to-gpuocl converts a subset of gpu dialect ops (launch, alloc, dealloc, memcpy) to runtime function calls.
Correct. The upstream gpu-to-llvm, however, only legalizes the launchOp arguments but does not lower the op itself; the launchOp is lowered during translation to LLVM IR (e.g., the mlir-translate test).
Normally, the binary string is generated somewhere between gpu-to-llvm and translation, by the gpu-module-to-binary pass (e.g., in the nvvm pipeline).
Hence it is unusual from the upstream perspective to expect a binary string to be available before gpu-to-llvm. In the case of gpu-to-gpuocl, both the binary and the memrefs (required for allocOp legality) are expected to exist simultaneously. However, in the upstream flow, how can we generate a binary for a kernel that has memref arguments if the memrefs themselves have not yet been lowered?
Yes, a "second pass" would be a solution. Maybe we could add an option flag for gpu-to-gpuocl to indicate the current mode (or create a new separate pass). More changes would be needed, though: for example, gpu-to-gpuocl relies on querying an imex-specific binary attribute of a GPU module, whereas upstream uses gpu.binary ops. The required changes could be outside the scope of this PR.
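To illustrate the difference, a rough sketch only; the attribute name on the imex path below is a placeholder, not taken from the actual pass:

#include "mlir/Dialect/GPU/IR/GPUDialect.h"
#include "mlir/IR/BuiltinAttributes.h"
#include "mlir/IR/BuiltinOps.h"

// Upstream: gpu-module-to-binary materializes gpu.binary ops, which a
// conversion pass can simply walk.
static void walkUpstreamBinaries(mlir::ModuleOp module) {
  module.walk([](mlir::gpu::BinaryOp binary) {
    // the serialized objects live on this op
  });
}

// imex-style: the serialized binary is read from an attribute on the
// gpu.module (the attribute name here is a placeholder).
static mlir::StringAttr getImexStyleBinary(mlir::gpu::GPUModuleOp gpuModule) {
  return gpuModule->getAttrOfType<mlir::StringAttr>("imex.spirv.binary");
}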
Sure, we could add the required changes. I'm not sure, though, if the option flag is really required. On the first run there will not be any gpu.launch_func ops, so only the allocs would need to be lowered.
So the integration test would need to have a gpu.launch (with an inlined kernel) instead of gpu.launch_func and do the following:
- gpu-to-gpuocl to convert gpu.alloc, etc.
- outlining pass to create launchFuncOp from the launchOp
- gpu-module-to-binary
- gpu-to-gpuocl to convert launchFuncOp and the binary to a runtime call.
Do I understand the intent correctly?
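If so, written as one textual pipeline it would roughly be (a sketch only; intermediate device-lowering passes omitted):

// Proposed flow: the first gpu-to-gpuocl run handles gpu.alloc/dealloc/memcpy,
// outlining creates gpu.launch_func from gpu.launch, gpu-module-to-binary
// emits the binary, and the second gpu-to-gpuocl run lowers gpu.launch_func
// (plus the binary) to runtime calls.
static const char *proposedPipeline =
    "builtin.module(gpu-to-gpuocl,gpu-kernel-outlining,"
    "gpu-module-to-binary,gpu-to-gpuocl)";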
Perhaps, something like that.
Added gc-gpu-runner support for the xevm tests; the support relies on the outlining pass so that gpu-to-gpuocl can run twice. I will stick to the upstream flow using wrappers in the follow-up work for testing convenience, so it would be nice for the wrapper changes to remain (no other GPU tests use them anyway).
  lastQueue = queue;
  if (ptr) {
-   deallocDeviceMemory(queue, ptr);
+   deallocDeviceMemory(queue ? queue : getOrCreateStaticQueue(), ptr);
This looks a bit suspicious. What's the scenario where we have to create a queue in order to deallocate memory?
if (serializedSPIRVBinary->size() % 4) {
  getOperation().emitError() << "SPIRV code size must be a multiple of 4.";
  return std::nullopt;
}
Is there a silent failure in the translator that can result in this check being true?
This is a requirement for the resulting SPIR-V to be valid: a SPIR-V module is a stream of 32-bit words, so its byte size must be a multiple of 4.
The previous serializedSPIRVBinary->size() + 1 was the source of the "Code size must be a multiple of 4 but is X" kind of error, hence the check explicitly verifies the resulting code size.
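As a standalone illustration of the invariant (hypothetical values, not code from this PR):

#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// A SPIR-V module is a sequence of 32-bit words, so its serialized byte size
// is always 4 * wordCount; appending a single extra byte (the former
// size() + 1) can never satisfy the multiple-of-4 check.
int main() {
  std::vector<uint32_t> words(5); // 5 words -> 20 bytes
  std::size_t byteSize = words.size() * sizeof(uint32_t);
  assert(byteSize % 4 == 0);       // valid module size
  assert((byteSize + 1) % 4 != 0); // the off-by-one always fails
  return 0;
}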
The previous serializedSPIRVBinary->size() + 1 was the source
Was the size calculation incorrect, or are there cases where the returned size is not a multiple of 4?
Force-pushed from 6f9f446 to 1be438f.
Force-pushed from ceb79ba to 5540c6f.
Say we want to start upstreaming the xeVM dialect in its current form: which changes exactly should we take from this repo and move to llvm?
The upstream-relevant xevm is almost entirely covered by:
Integration tests are not very relevant for upstreaming. The miscellaneous changes to wrappers and
Force-pushed from fede4a8 to b10a889.
Force-pushed from 7257221 to c9576cb.
It seems like all of the crucial points were addressed and CI is green. Shall we merge? @AndreyPavlenko @kurapov-peter
Force-pushed from 317174b to 512a8d4.
Force-pushed from 512a8d4 to 6f65ea6.
This PR contains the necessary changes to launch an XeVM integration test via gc-cpu-runner.
Changes to OpenCLRuntimeWrappers improve error decoding and establish one static queue (this ensures no implicit creation of new queues, which turned out to be problematic).
Load/store intrinsics only work with the maximum sub-group size (not the active sub-group size, which can be smaller), as noted in the restrictions. For example, on GPU Max 1100 the test runs as expected when launched with 32 threads. Although a full sub-group is still formed when launched with 16 threads, it is not of the maximum size and hence the result is not as expected (UB according to the docs). It is possible to select the maximum sub-group size: in MLIR this is supported via the intel_reqd_sub_group_size attribute of a kernel.
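For illustration, a minimal sketch of attaching it to an outlined kernel (where exactly the attribute belongs and how it is spelled in the final lowering is an assumption here):

#include "mlir/Dialect/GPU/IR/GPUDialect.h"
#include "mlir/IR/Builders.h"

// Sketch: request the maximum sub-group size (e.g., 32 on GPU Max 1100) for a
// kernel so the load/store intrinsics see a full-width sub-group. The
// attribute name comes from the description above; its placement on the
// gpu.func is assumed.
static void requireMaxSubGroupSize(mlir::gpu::GPUFuncOp kernel, unsigned size) {
  mlir::Builder builder(kernel.getContext());
  kernel->setAttr("intel_reqd_sub_group_size", builder.getI32IntegerAttr(size));
}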