Multiple compiler/runtime issues on Mac M4 #19873
Comments
#19881 fixes the JIT issue, but I'm not sure if it fixes the runtime issue or not. Would you like to give it a shot? |
Indeed, the runtime segfaults at null or near-null addresses have to be caused by some null dereference, which must be something else, unrelated to the CPU features aspect. |
To debug the runtime issue, get a better call stack in LLDB in a non-optimized debug build (CMAKE_BUILD_TYPE=Debug), get the callee function name (this is crashing while trying to call a module function) by inspecting local variables around |
A step towards #19873 --------- Signed-off-by: hanhanW <hanhan0912@gmail.com>
Thanks a lot for the help. The JIT issue is resolved :) The call stack is already from a debug build. I tried inspecting the variables, but it's hard for me to tell for which values it's okay to be null. @bjacob The only thing I could tell is that the dispatch is the Also, I tried compiling the same module but with f32 values, and it runs. The segfault only appears to happen in the f16 case. I could screenshot all the variables and send them if that's helpful, but there are a lot of them :D |
Yes, exactly, thanks! Very interesting too that it is working with f32 and only crashing with f16. Thanks also for the dump of variables. This needs to be further debugged in LLDB to pin down where exactly the null pointer is coming from. |
I reproduced the compilation locally (on x86, cross-compiling with It all looks normal -- the This might be a problem rather at the level of the call mechanism itself. While in LLDB, could you dump the full stack trace? To make it go faster, you could try |
So I just found out that everything works when I set --iree-llvmcpu-link-embedded=false; if I set it to true, the crash happens. Here is the full stack trace:
|
Thanks a lot! Could you also share more disassembly around the crash instruction? The above snippets of disassembly are interesting because they are in contradiction with the symbol being
Here is the disassembly for main$async_dispatch_0_pack_f16:
.Lfunc_begin0:
.file 1 "exported_model_4096x4096.mlir"
.loc 1 1 0
.cfi_startproc
stp x29, x30, [sp, #-16]!
mov x29, sp
.cfi_def_cfa w29, 16
.cfi_offset w30, -8
.cfi_offset w29, -16
mov x8, #0
.Ltmp9:
.loc 1 3 19 prologue_end
ldr x9, [x1, #32]
ldp x10, x9, [x9]
.loc 1 9 10
ldp w11, w12, [x2]
lsl x12, x12, #19
add x13, x12, x11, lsl #10
add x9, x9, x13
add x11, x12, x11, lsl #7
add x10, x10, x11
.LBB0_1:
.loc 1 0 10 is_stmt 0
mov x11, #0
mov x12, x10
mov x13, x9
.LBB0_2:
mov x14, #0
mov x15, #0
.LBB0_3:
.loc 1 9 10 is_stmt 1
ldr h0, [x12, x14]
str h0, [x13, x15]
add x15, x15, #2
add x14, x14, #2, lsl #12
cmp x15, #16
b.ne .LBB0_3
add x11, x11, #1
add x13, x13, #16
add x12, x12, #2
cmp x11, #64
b.ne .LBB0_2
add x8, x8, #1
add x9, x9, #16, lsl #12
add x10, x10, #16, lsl #12
cmp x8, #8
b.ne .LBB0_1
mov w0, #0
.loc 1 9 10 epilogue_begin is_stmt 0
ldp x29, x30, [sp], #16
ret
The four instructions ( |
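For readers following the listing, here is a rough Python sketch of what the three nested loops in the disassembly above appear to do, expressed in f16 element indices rather than byte offsets. This is only a reading of the assembly: the function name `pack_tile` and its parameters are hypothetical, and the base offsets that the real code derives from the workgroup IDs (`w11`, `w12`) are left as plain arguments.

```python
def pack_tile(src, dst, src_base=0, dst_base=0):
    """Mirror the loop nest .LBB0_1 / .LBB0_2 / .LBB0_3 using f16 element indices.

    Assumption: x10 (binding 0) is the source and x9 (binding 1) is the
    destination, as suggested by the ldr/str pair in the innermost loop.
    """
    ROW_STRIDE = 4096  # `add x14, x14, #2, lsl #12` = 8192 bytes = 4096 f16 elements
    OUTER_STEP = 8 * ROW_STRIDE  # `add x9/x10, ..., #16, lsl #12` = 65536 bytes

    for outer in range(8):        # x8 counts 0..7
        for col in range(64):     # x11 counts 0..63
            for k in range(8):    # x15 counts 0, 2, ..., 14 bytes
                s = src_base + outer * OUTER_STEP + col + k * ROW_STRIDE
                d = dst_base + outer * OUTER_STEP + col * 8 + k
                dst[d] = src[s]
    return dst
```

Under this reading, each group of 8 consecutive destination elements gathers one column from 8 consecutive source rows, which is consistent with a pack operation using an 8x1 inner tile.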
add a printf in iree_elf_call_i_ppp and see if a0/a1/a2 are ok - if they are, it may be a calling convention change or something with memory protection - see runtime/src/iree/hal/local/elf/platform/apple.c |
Definitely looks like an issue with the call mechanics! Note that above we were suspecting that somehow we are calling the wrong target function, as #19873 (comment) was saying that we should be calling
|
more disassembly if it helps:
|
Thanks . The a1/a2 values are normal-looking non-null pointers. The The disassembly up to this point confirms it really is the dispatch_2 function, not dispatch_0:
After this point, the rest of the disassembly is a new function,
Which means that it hasn't properly picked up the CPU features from |
This reproduces on my M2 mac -- it's not M4-specific. The intended callee really is
Going to finish debugging this now that I can reproduce. |
Oh I know! (If you're curious --- the Aarch64 PCS reserves |
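Since `x18` is a platform-reserved register on Apple targets, one quick way to check a generated dispatch for this class of problem is to scan its assembly text for any use of `x18` or its 32-bit view `w18`. The helper below is a hypothetical sketch, not part of IREE; it assumes the assembly is available as plain text and that comments, if any, start with `//`.

```python
import re

# Matches x18 or w18 as a whole register name (not x180, not the literal 0x18).
_X18 = re.compile(r"\b[xw]18\b")


def find_x18_uses(asm_text):
    """Return (line_number, stripped_line) for every line that mentions x18/w18."""
    hits = []
    for lineno, line in enumerate(asm_text.splitlines(), start=1):
        code = line.split("//", 1)[0]  # ignore trailing comments
        if _X18.search(code):
            hits.append((lineno, line.strip()))
    return hits
```

An empty result for every dispatch function would suggest the register allocator is leaving `x18` alone; any hit is worth inspecting by hand.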
ah hah! nice find! |
The code was intending to add the `reserve-x18` flag, and it was being executed... but the `hasFlag` method that it was calling wasn't doing what it thought it was. That `hasFlag` method is just a static member that returns `true` if its string argument starts with a `+`. There isn't an actual method on `SubtargetFeatures` to check if we already have a given feature. Anyway, it doesn't matter - in this case, we don't already have the feature, and even if we did, multiply specified features are not a problem. But it really, really matters that we don't accidentally allocate `x18` again :-) Fixes #19873. --------- Signed-off-by: Benoit Jacob <jacob.benoit.1@gmail.com>
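The pitfall described in this commit message can be illustrated with a small, self-contained Python model. This is not the IREE or LLVM code: `has_flag` is a simplified stand-in that follows the description above (it only looks at the string's first character), and the two `add_reserve_x18_*` functions are hypothetical guesses at what the misuse and the fix might look like.

```python
def has_flag(feature):
    """Simplified stand-in: True if the string starts with '+'.

    It inspects only the argument itself and knows nothing about any
    feature list, so it cannot answer "is this feature already enabled?".
    """
    return feature.startswith("+")


def add_reserve_x18_buggy(features):
    # Intended meaning: "add the feature unless it is already present".
    # Actual behavior: has_flag("+reserve-x18") is always True, so the
    # feature is never appended.
    if not has_flag("+reserve-x18"):
        features.append("+reserve-x18")
    return features


def add_reserve_x18_fixed(features):
    # Append unconditionally; a feature specified more than once is harmless.
    features.append("+reserve-x18")
    return features
```

With a starting list of `["+neon"]`, the buggy variant returns the list unchanged, while the fixed variant returns it with `+reserve-x18` appended.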
Thank you, learned something new :) |
Same here, I picked up the issue for learning something. And I learned more from Benoit today, thanks Benoit! |
As newer aarch64 targets increasingly support SVE and SME, this clause was preventing ukernels from being used in cases where they do speed things up. The reason why this logic was out of place here is that what it controls here is the enablement of ukernels, which are a detail of lowering an already tiled workload. If we wanted to use SVE with a variable vector length, or with a fixed vector length different from NEON's 128-bit, that decision needed to be made earlier; conversely, if the workload at this point already has the right shape to be matched to a NEON ukernel, then SVE is not relevant to it anymore. FYI @ziereis , this results in substantially faster code in your test case from #19873. Signed-off-by: Benoit Jacob <jacob.benoit.1@gmail.com>
What happened?
There are multiple issues using the IREE compiler and runtime on a Mac M4.
Compiling with --iree-consteval-jit-target-device=vmvx works; however, it's not a real solution, as it's not possible to handle all datatypes.
Running from the python bindings works:
Steps to reproduce your issue
To generate the input IR to reproduce the problems:
What component(s) does this issue relate to?
Compiler, Runtime
Version information
4693b1c
Additional context
No response