
Multiple compiler/runtime issues on Mac M4 #19873

Closed
ziereis opened this issue Feb 1, 2025 · 17 comments · Fixed by #19895
Labels: bug 🐞 Something isn't working

Comments

@ziereis
Contributor

ziereis commented Feb 1, 2025

What happened?

There are multiple issues when using the IREE compiler and runtime on a Mac M4.

  1. host does not seem to be a valid target CPU on this architecture, so the JIT eval does not really work:
./build/tools/iree-compile --iree-hal-target-device=llvm-cpu --iree-llvmcpu-target-cpu=host exported_model_4096x4096.mlir    

Internal error while creating host target: Resolution of CPU to CPU-features is not implemented on this target architecture. Pass explicit CPU-features, or implement the missing mapping.

exported_model_4096x4096.mlir:7:10: error: failed to legalize operation 'stream.executable' that was explicitly marked illegal
    %1 = torch.aten.linear %arg0, %0, %none : !torch.vtensor<[4096,4096],f32>, !torch.vtensor<[4096,4096],f32>, !torch.none -> !torch.vtensor<[4096,4096],f32>
         ^
exported_model_4096x4096.mlir:7:10: note: called from
    %1 = torch.aten.linear %arg0, %0, %none : !torch.vtensor<[4096,4096],f32>, !torch.vtensor<[4096,4096],f32>, !torch.none -> !torch.vtensor<[4096,4096],f32>
         ^
exported_model_4096x4096.mlir:7:10: note: see current operation: 
"stream.executable"() <{sym_name = "jit_eval_dispatch_0", sym_visibility = "private"}> ({
  "stream.executable.export"() <{function_ref = @jit_eval_dispatch_0_pack_f32, sym_name = "jit_eval_dispatch_0_pack_f32"}> ({
    %6:3 = "flow.dispatch.workgroup_count_from_slice"() : () -> (index, index, index)
    "stream.return"(%6#0, %6#1, %6#2) : (index, index, index) -> ()
  }) : () -> ()
  "builtin.module"() ({
    "func.func"() <{arg_attrs = [{stream.alignment = 64 : index}, {stream.alignment = 64 : index}], function_type = (!stream.binding, !stream.binding) -> (), sym_name = "jit_eval_dispatch_0_pack_f32"}> ({
    ^bb0(%arg0: !stream.binding, %arg1: !stream.binding):
      %0 = "arith.constant"() <{value = 0 : index}> : () -> index
      %1 = "stream.binding.subspan"(%arg0, %0) : (!stream.binding, index) -> !flow.dispatch.tensor<readonly:tensor<4096x4096xf32>>
      %2 = "stream.binding.subspan"(%arg1, %0) : (!stream.binding, index) -> !flow.dispatch.tensor<writeonly:tensor<512x4096x8x1xf32>>
      %3 = "flow.dispatch.tensor.load"(%1) <{operandSegmentSizes = array<i32: 1, 0, 0, 0, 0>, static_offsets = array<i64: 0, 0>, static_sizes = array<i64: 4096, 4096>, static_strides = array<i64: 1, 1>}> : (!flow.dispatch.tensor<readonly:tensor<4096x4096xf32>>) -> tensor<4096x4096xf32>
      %4 = "tensor.empty"() : () -> tensor<512x4096x8x1xf32>
      %5 = "tensor.pack"(%3, %4) <{inner_dims_pos = array<i64: 0, 1>, operandSegmentSizes = array<i32: 1, 1, 0, 0>, outer_dims_perm = array<i64: 0, 1>, static_inner_tiles = array<i64: 8, 1>}> : (tensor<4096x4096xf32>, tensor<512x4096x8x1xf32>) -> tensor<512x4096x8x1xf32>
      "flow.dispatch.tensor.store"(%5, %2) <{operandSegmentSizes = array<i32: 1, 1, 0, 0, 0, 0>, static_offsets = array<i64: 0, 0, 0, 0>, static_sizes = array<i64: 512, 4096, 8, 1>, static_strides = array<i64: 1, 1, 1, 1>}> : (tensor<512x4096x8x1xf32>, !flow.dispatch.tensor<writeonly:tensor<512x4096x8x1xf32>>) -> ()
      "func.return"() : () -> ()
    }) : () -> ()
  }) : () -> ()
  "stream.executable.end"() : () -> ()
}) : () -> ()
    %1 = torch.aten.linear %arg0, %0, %none : !torch.vtensor<[4096,4096],f32>, !torch.vtensor<[4096,4096],f32>, !torch.none -> !torch.vtensor<[4096,4096],f32>

Compiling with --iree-consteval-jit-target-device=vmvx works; however, it is not a real solution, since it cannot handle all data types.

./build/tools/iree-compile --iree-hal-target-device=llvm-cpu --iree-llvmcpu-target-triple="arm64-apple-darwin" --iree-llvmcpu-target-cpu-features="+neon,+fp-armv8,+lse,+lse128,+fullfp16,+fp16fml,+dotprod,+i8mm,+bf16" --iree-consteval-jit-debug --iree-consteval-jit-target-device=vmvx exported_model_4096x4096.mlir -o out.vmfb
exported_model_4096x4096.mlir:7:10: warning: skipping consteval initializer: unsupported type for current jit configuration: 'tensor<4096x4096xf16>'
    %1 = torch.aten.linear %arg0, %0, %none : !torch.vtensor<[4096,4096],f16>, !torch.vtensor<[4096,4096],f16>, !torch.none -> !torch.vtensor<[4096,4096],f16>
         ^
::: Rejected consteval initializer:
util.initializer attributes {stream.affinity.default = #hal.device.affinity<@__device_0>} {
  %cst = arith.constant dense_resource<__auto.constant_4096_4096_torch.float16> : tensor<4096x4096xf16>
  %0 = tensor.empty() : tensor<512x4096x8x1xf16>
  %pack = tensor.pack %cst outer_dims_perm = [0, 1] inner_dims_pos = [0, 1] inner_tiles = [8, 1] into %0 : tensor<4096x4096xf16> -> tensor<512x4096x8x1xf16>
  util.global.store %pack, @__hoisted_tensor_512x4096x8x1xf16 : tensor<512x4096x8x1xf16>
  util.return
}
  2. When running the compiled module with iree-run-module or iree-benchmark-module, the program crashes:
Process 63095 launched: '/Users/ziereis/projects/iree/build/tools/iree-benchmark-module' (arm64)
Unable to determine clock rate from sysctl: hw.cpufrequency: No such file or directory
This does not affect benchmark measurements, only the metadata output.
***WARNING*** Failed to set thread affinity. Estimated CPU frequency may be incorrect.
2025-02-01T12:02:31+01:00
Running /Users/ziereis/projects/iree/build/tools/iree-benchmark-module
Run on (10 X 24 MHz CPU s)
CPU Caches:
  L1 Data 64 KiB
  L1 Instruction 128 KiB
  L2 Unified 4096 KiB (x10)
Load Average: 4.42, 4.20, 4.19
***WARNING*** Library was built as DEBUG. Timings may be affected.
Process 63095 stopped
* thread #3, name = 'iree-worker-0', stop reason = EXC_BAD_ACCESS (code=1, address=0x0)
    frame #0: 0x00000001004549d4
->  0x1004549d4: str    q17, [x18, x4]
    0x1004549d8: str    q0, [x0, x4]
    0x1004549dc: str    q18, [x1, x4]
    0x1004549e0: str    q4, [x2, x4]
  thread #4, name = 'iree-worker-1', stop reason = EXC_BAD_ACCESS (code=1, address=0x400)
    frame #0: 0x00000001004549d4
->  0x1004549d4: str    q17, [x18, x4]
    0x1004549d8: str    q0, [x0, x4]
    0x1004549dc: str    q18, [x1, x4]
    0x1004549e0: str    q4, [x2, x4]
  thread #5, name = 'iree-worker-2', stop reason = EXC_BAD_ACCESS (code=1, address=0x800)
    frame #0: 0x00000001004549d4
->  0x1004549d4: str    q17, [x18, x4]
    0x1004549d8: str    q0, [x0, x4]
    0x1004549dc: str    q18, [x1, x4]
    0x1004549e0: str    q4, [x2, x4]
  thread #6, name = 'iree-worker-3', stop reason = EXC_BAD_ACCESS (code=1, address=0xc10)
    frame #0: 0x00000001004549d4
->  0x1004549d4: str    q17, [x18, x4]
    0x1004549d8: str    q0, [x0, x4]
    0x1004549dc: str    q18, [x1, x4]
    0x1004549e0: str    q4, [x2, x4]
(lldb) bt
* thread #3, name = 'iree-worker-0', stop reason = EXC_BAD_ACCESS (code=1, address=0x0)
  * frame #0: 0x00000001004549d4
    frame #1: 0x00000001001042ec iree-benchmark-module`iree_elf_call_i_ppp(symbol_ptr=0x000000010045489c, a0=<unavailable>, a1=<unavailable>, a2=<unavailable>) at arm_64.c:145:10
    frame #2: 0x00000001000fcc1c iree-benchmark-module`iree_hal_elf_executable_issue_call(base_executable=0x0000000145904eb0, ordinal=2, dispatch_state=0x000000016feb2d00, workgroup_state=0x000000016feb2cc0, worker_id=0) at embedded_elf_loader.c:200:13
    frame #3: 0x0000000100109774 iree-benchmark-module`iree_hal_local_executable_issue_call(executable=0x0000000145904eb0, ordinal=2, dispatch_state=0x000000016feb2d00, workgroup_state=0x000000016feb2cc0, worker_id=0) at local_executable.c:42:10
    frame #4: 0x00000001000bced4 iree-benchmark-module`iree_hal_task_cmd_dispatch_tile(user_context=0x0000000146012830, tile_context=0x000000016feb2e00, pending_submission=0x000000016feb2f10) at task_command_buffer.c:776:26
    frame #5: 0x00000001000ca168 iree-benchmark-module`iree_task_dispatch_shard_execute(task=0x0000000146008b80, processor_id=0, worker_id=0, worker_local_memory=(data = "", data_length = 16777216), pending_submission=0x000000016feb2f10) at task.c:795:11
    frame #6: 0x00000001000cc10c iree-benchmark-module`iree_task_worker_execute(worker=0x000000014000c1c0, task=0x0000000146008b80, pending_submission=0x000000016feb2f10) at worker.c:218:7
    frame #7: 0x00000001000cbf70 iree-benchmark-module`iree_task_worker_pump_once(worker=0x000000014000c1c0, pending_submission=0x000000016feb2f10) at worker.c:279:3
    frame #8: 0x00000001000cbd2c iree-benchmark-module`iree_task_worker_pump_until_exit(worker=0x000000014000c1c0) at worker.c:327:12
    frame #9: 0x00000001000cb814 iree-benchmark-module`iree_task_worker_main(worker=0x000000014000c1c0) at worker.c:403:5
    frame #10: 0x00000001000cee3c iree-benchmark-module`iree_thread_start_routine(param=0x00000001459040e0) at threading_darwin.c:83:29
    frame #11: 0x000000018f1f82e4 libsystem_pthread.dylib`_pthread_start + 136
  3. Running iree-benchmark-module/iree-run-module from the Python package does not appear to crash; however, it just exits and doesn't report anything:
Unable to determine clock rate from sysctl: hw.cpufrequency: No such file or directory
This does not affect benchmark measurements, only the metadata output.
***WARNING*** Failed to set thread affinity. Estimated CPU frequency may be incorrect.
2025-02-01T12:04:49+01:00
Running iree-benchmark-module
Run on (10 X 24 MHz CPU s)
CPU Caches:
  L1 Data 64 KiB
  L1 Instruction 128 KiB
  L2 Unified 4096 KiB (x10)
Load Average: 3.71, 4.04, 4.12

Running from the Python bindings works:

import iree.runtime as rt
import numpy as np

# Load the compiled module with the local-task (multi-threaded CPU) driver.
config = rt.Config("local-task")
vmm = rt.load_vm_module(rt.VmModule.mmap(config.vm_instance, "out.vmfb"))

x = np.ones((4096, 4096), dtype=np.float16)
y = vmm.main(x)
print(y.to_host())

Steps to reproduce your issue

To generate the input IR to reproduce the problems:

import torch
import torch.nn as nn

import iree.turbine.aot as aot


# Single f16 linear layer with no bias.
class Linear(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(4096, 4096, bias=False, dtype=torch.float16)

    def forward(self, x: torch.Tensor):
        return self.layer(x)


# Export the module to MLIR with iree-turbine AOT.
model = Linear()
example_x = torch.empty(4096, 4096, dtype=torch.float16)
exported = aot.export(model, example_x)
exported.save_mlir("exported_model_4096x4096.mlir")

What component(s) does this issue relate to?

Compiler, Runtime

Version information

4693b1c

Additional context

No response

@ziereis ziereis added the bug 🐞 Something isn't working label Feb 1, 2025
@ziereis ziereis changed the title Multiple issues on Mac M4 Multiple compiler/runtime issues on Mac M4 Feb 1, 2025
@hanhanW
Contributor

hanhanW commented Feb 3, 2025

#19881 fixes the JIT issue, but I'm not sure if it fixes the runtime issue or not. Would you like to give it a shot?

@hanhanW hanhanW self-assigned this Feb 3, 2025
@bjacob
Contributor

bjacob commented Feb 3, 2025

Indeed, the runtime segfaults at null or near-null addresses have to be caused by some null dereference, which must be something else entirely, unrelated to the CPU-features aspect.

@bjacob
Contributor

bjacob commented Feb 3, 2025

To debug the runtime issue: get a better call stack in LLDB from a non-optimized debug build (CMAKE_BUILD_TYPE=Debug), get the callee function name (this is crashing while trying to call a module function) by inspecting local variables around iree_hal_elf_executable_issue_call, and work out from the parameters there where that null pointer is coming from. Is the crash already inside the module function, or while trying to call it?
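
For example, once stopped at the crash, a session along these lines can surface the relevant locals (standard LLDB commands; the frame index for iree_hal_elf_executable_issue_call depends on the build):

(lldb) bt all
(lldb) frame select 2
(lldb) frame variable
(lldb) disassemble --pc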

hanhanW added a commit that referenced this issue Feb 3, 2025
A step towards #19873

---------

Signed-off-by: hanhanW <hanhan0912@gmail.com>
@ziereis
Contributor Author

ziereis commented Feb 3, 2025

Thanks a lot for the help. The JIT issue is resolved :)

The call stack is already from a debug build. I tried inspecting the variables, but it's hard for me to tell which values are okay to be null.

@bjacob The only thing I could tell is that the dispatch is main$async_dispatch_0_pack_f16; is this what you mean by the callee function name?

Also, I tried compiling the same module with f32 values and it runs. The segfault only appears to happen in the f16 case.

I could screenshot all the variables and send them if it's helpful, but there are a lot of them :D

[screenshot of debugger variables]

@bjacob
Contributor

bjacob commented Feb 3, 2025

The only thing I could tell is that the dispatch is main$async_dispatch_0_pack_f16; is this what you mean by the callee function name?

Yes, exactly, thanks!

Very interesting too that it is working with f32 and only crashing with f16.

Thanks also for the dump of variables.

This needs to be further debugged in LLDB to pin down where exactly the null pointer is coming from.

@bjacob
Contributor

bjacob commented Feb 3, 2025

I reproduced the compilation locally (on x86, cross-compiling with --iree-llvmcpu-target-cpu=apple-m4 exported_model_4096x4096.mlir --iree-llvmcpu-target-triple=aarch64-none-elf).

It all looks normal: the main$async_dispatch_0_pack_f16 function looks fine in IR and then in aarch64 assembly. It is compiled to a simple scalar loop; it's going to be slow, but it shouldn't crash when called on valid arguments.

This might be a problem rather at the level of the call mechanism itself. While in LLDB, could you dump the full stack trace?

To make it go faster, you could try --iree-llvmcpu-enable-ukernels=all, but I expect it won't make a difference to the crash you're experiencing here, again because the problem seems to happen before we even reach that point.
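
For example (a hypothetical invocation, reusing flags from earlier in this thread):

./build/tools/iree-compile --iree-hal-target-device=llvm-cpu --iree-llvmcpu-target-triple="arm64-apple-darwin" --iree-llvmcpu-enable-ukernels=all exported_model_4096x4096.mlir -o out.vmfb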

@ziereis
Contributor Author

ziereis commented Feb 3, 2025

So I just found out that everything works when I set --iree-llvmcpu-link-embedded=false; if I set it to true, the crash happens.

Here is the full stack trace:

(lldb) bt all
  thread #1, queue = 'com.apple.main-thread'
    frame #0: 0x000000018f1bf830 libsystem_kernel.dylib`poll + 8
    frame #1: 0x00000001000933ac iree-run-module`iree_syscall_poll(fds=0x000000016fdf9d50, nfds=1, deadline_ns=9223372036854775807, out_signaled_count=0x000000016fdf9d48) at wait_handle_poll.c:40:10
    frame #2: 0x000000010009381c iree-run-module`iree_wait_one(handle=0x000000016fdf9dc8, deadline_ns=9223372036854775807) at wait_handle_poll.c:400:3
    frame #3: 0x00000001000854b0 iree-run-module`iree_hal_task_semaphore_wait(base_semaphore=0x0000600002c44420, value=1, timeout=(type = IREE_TIMEOUT_ABSOLUTE, nanos = 9223372036854775807)) at task_semaphore.c:315:12
    frame #4: 0x0000000100024814 iree-run-module`iree_hal_semaphore_wait(semaphore=0x0000600002c44420, value=1, timeout=(type = IREE_TIMEOUT_ABSOLUTE, nanos = 9223372036854775807)) at semaphore.c:77:7
    frame #5: 0x0000000100024d00 iree-run-module`iree_hal_semaphore_list_wait(semaphore_list=iree_hal_semaphore_list_t @ 0x000000016fdf9eb8, timeout=(type = IREE_TIMEOUT_ABSOLUTE, nanos = 9223372036854775807)) at semaphore.c:201:14
    frame #6: 0x0000000100023a58 iree-run-module`iree_hal_fence_wait(fence=0x00006000006442d0, timeout=(type = IREE_TIMEOUT_ABSOLUTE, nanos = 9223372036854775807)) at fence.c:251:26
    frame #7: 0x0000000100165ec8 iree-run-module`iree_hal_module_fence_await(stack=0x000000016fdfd0c8, module=0x0000000140f05fb0, state=0x0000600002c44360, args=0x000000016fdfa370, rets=0x000000016fdfa360) at module.c:1777:25
    frame #8: 0x0000000100189280 iree-run-module`iree_vm_shim_iCrD_i(stack=0x000000016fdfd0c8, flags=1, args_storage=(data = "\xff\xff\xff\xff\U00000001", data_length = 24), rets_storage=(data = "", data_length = 4), target_fn=(iree-run-module`iree_hal_module_fence_await at module.c:1742), module=0x0000000140f05fb0, module_state=0x0000600002c44360) at shims.c:83:1
    frame #9: 0x0000000100181824 iree-run-module`iree_vm_native_module_issue_call(module=0x0000000140f05fb0, stack=0x000000016fdfd0c8, callee_frame=0x000000016fdfd1f8, flags=1, args_storage=(data = "\xff\xff\xff\xff\U00000001", data_length = 24), rets_storage=(data = "", data_length = 4)) at native_module.c:364:7
    frame #10: 0x00000001001813a8 iree-run-module`iree_vm_native_module_begin_call(self=0x0000000140f05fb0, stack=0x000000016fdfd0c8, call=iree_vm_function_call_t @ 0x000000016fdfa2c8) at native_module.c:420:10
    frame #11: 0x00000001000fc62c iree-run-module`iree_vm_bytecode_issue_import_call(stack=0x000000016fdfd0c8, call=iree_vm_function_call_t @ 0x000000016fdfa3f8, cconv_results=(data = "i", size = 1), dst_reg_list=0x0000000141802330, out_caller_frame=0x000000016fdfcb60, out_caller_registers=0x000000016fdfcb88) at dispatch.c:452:7
    frame #12: 0x00000001000fa234 iree-run-module`iree_vm_bytecode_call_import_variadic(stack=0x000000016fdfd0c8, module_state=0x000000014100b800, import_ordinal=2147483670, caller_registers=iree_vm_registers_t @ 0x000000016fdfa4b0, segment_size_list=0x0000000141802324, src_reg_list=0x000000014180232a, dst_reg_list=0x0000000141802330, out_caller_frame=0x000000016fdfcb60, out_caller_registers=0x000000016fdfcb88) at dispatch.c:609:10
    frame #13: 0x00000001000ef8e0 iree-run-module`iree_vm_bytecode_dispatch(stack=0x000000016fdfd0c8, module=0x0000000140f05800, current_frame=0x000000016fdfd138, regs=iree_vm_registers_t @ 0x000000016fdfcb88, call_results=(data = "", data_length = 16)) at dispatch.c:1721:5
    frame #14: 0x00000001000e4e34 iree-run-module`iree_vm_bytecode_dispatch_begin(stack=0x000000016fdfd0c8, module=0x0000000140f05800, call=iree_vm_function_call_t @ 0x000000016fdfccb0, cconv_arguments=(data = "r_r", size = 1), cconv_results=(data = "r", size = 1)) at dispatch.c:636:26
    frame #15: 0x00000001000ff164 iree-run-module`iree_vm_bytecode_module_begin_call(self=0x0000000140f05800, stack=0x000000016fdfd0c8, call=iree_vm_function_call_t @ 0x000000016fdfcdf0) at module.c:845:10
    frame #16: 0x00000001001783d8 iree-run-module`iree_vm_begin_invoke(state=0x000000016fdfd080, context=0x0000600002344000, function=iree_vm_function_t @ 0x000000016fdfcf50, flags=0, policy=0x0000000000000000, inputs=0x0000600002b44780, host_allocator=iree_allocator_t @ 0x000000016fdfcf40) at invocation.c:504:7
    frame #17: 0x0000000100177b10 iree-run-module`iree_vm_invoke(context=0x0000600002344000, function=iree_vm_function_t @ 0x000000016fdfd070, flags=0, policy=0x0000000000000000, inputs=0x0000600002b44780, outputs=0x0000600002b44820, host_allocator=iree_allocator_t @ 0x000000016fdfd060) at invocation.c:302:26
    frame #18: 0x000000010002bda4 iree-run-module`iree_tooling_run_function(context=0x0000600002344000, function=iree_vm_function_t @ 0x000000016fdff238, device=0x0000000140f05df0, device_allocator=0x0000600002644080, host_allocator=iree_allocator_t @ 0x000000016fdff228, out_exit_code=0x000000016fdff3c4) at run_module.c:253:9
    frame #19: 0x000000010002b65c iree-run-module`iree_tooling_run_module_with_data(instance=0x0000000140f055e0, default_device_uri=(data = 0x0000000000000000, size = 0), module_contents=(data = 0x0000000000000000, data_length = 0), host_allocator=iree_allocator_t @ 0x000000016fdff2f8, out_exit_code=0x000000016fdff3c4) at run_module.c:413:7
    frame #20: 0x000000010002b54c iree-run-module`iree_tooling_run_module_from_flags(instance=0x0000000140f055e0, host_allocator=iree_allocator_t @ 0x000000016fdff380, out_exit_code=0x000000016fdff3c4) at run_module.c:387:10
    frame #21: 0x00000001000036d4 iree-run-module`main(argc=1, argv=0x000000016fdff660) at iree-run-module-main.c:43:14
    frame #22: 0x000000018ee78274 dyld`start + 2840
  thread #2, name = 'iree-poller'
    frame #0: 0x000000018f1bf830 libsystem_kernel.dylib`poll + 8
    frame #1: 0x00000001000933ac iree-run-module`iree_syscall_poll(fds=0x000000014100cc20, nfds=2, deadline_ns=9223372036854775807, out_signaled_count=0x000000016fe86ea8) at wait_handle_poll.c:40:10
    frame #2: 0x0000000100093668 iree-run-module`iree_wait_any(set=0x000000014100c800, deadline_ns=9223372036854775807, out_wake_handle=0x000000016fe86ee8) at wait_handle_poll.c:354:3
    frame #3: 0x000000010008b1dc iree-run-module`iree_task_poller_commit_wait(poller=0x0000000148008098, deadline_ns=9223372036854775807) at poller.c:443:7
    frame #4: 0x000000010008b038 iree-run-module`iree_task_poller_pump_until_exit(poller=0x0000000148008098) at poller.c:521:5
    frame #5: 0x000000010008ab08 iree-run-module`iree_task_poller_main(poller=0x0000000148008098) at poller.c:544:5
    frame #6: 0x0000000100094b98 iree-run-module`iree_thread_start_routine(param=0x0000600002c442a0) at threading_darwin.c:83:29
    frame #7: 0x000000018f1f82e4 libsystem_pthread.dylib`_pthread_start + 136
* thread #3, name = 'iree-worker-0', stop reason = EXC_BAD_ACCESS (code=1, address=0x0)
  * frame #0: 0x00000001002a492c
    frame #1: 0x000000010007c30c iree-run-module`iree_elf_call_i_ppp(symbol_ptr=0x00000001002a4808, a0=<unavailable>, a1=<unavailable>, a2=<unavailable>) at arm_64.c:145:10
    frame #2: 0x0000000100042308 iree-run-module`iree_hal_elf_executable_issue_call(base_executable=0x0000600003740000, ordinal=2, dispatch_state=0x000000016feb2d00, workgroup_state=0x000000016feb2cc0, worker_id=0) at embedded_elf_loader.c:200:13
    frame #3: 0x000000010007e9b0 iree-run-module`iree_hal_local_executable_issue_call(executable=0x0000600003740000, ordinal=2, dispatch_state=0x000000016feb2d00, workgroup_state=0x000000016feb2cc0, worker_id=0) at local_executable.c:42:10
    frame #4: 0x00000001000809e0 iree-run-module`iree_hal_task_cmd_dispatch_tile(user_context=0x0000000137010430, tile_context=0x000000016feb2e00, pending_submission=0x000000016feb2f10) at task_command_buffer.c:776:26
    frame #5: 0x000000010008f27c iree-run-module`iree_task_dispatch_shard_execute(task=0x000000014100a8b0, processor_id=0, worker_id=0, worker_local_memory=(data = "", data_length = 131072), pending_submission=0x000000016feb2f10) at task.c:795:11
    frame #6: 0x0000000100091220 iree-run-module`iree_task_worker_execute(worker=0x00000001480081c0, task=0x000000014100a8b0, pending_submission=0x000000016feb2f10) at worker.c:218:7
    frame #7: 0x0000000100091084 iree-run-module`iree_task_worker_pump_once(worker=0x00000001480081c0, pending_submission=0x000000016feb2f10) at worker.c:279:3
    frame #8: 0x0000000100090e40 iree-run-module`iree_task_worker_pump_until_exit(worker=0x00000001480081c0) at worker.c:327:12
    frame #9: 0x0000000100090928 iree-run-module`iree_task_worker_main(worker=0x00000001480081c0) at worker.c:403:5
    frame #10: 0x0000000100094b98 iree-run-module`iree_thread_start_routine(param=0x0000600002c44240) at threading_darwin.c:83:29
    frame #11: 0x000000018f1f82e4 libsystem_pthread.dylib`_pthread_start + 136

@bjacob
Contributor

bjacob commented Feb 3, 2025

Thanks a lot! Could you also share more disassembly around the crash instruction? The above snippets of disassembly are interesting because they contradict the symbol being async_dispatch_0_pack_f16. That function simply does not contain instructions such as these:

    frame #0: 0x00000001004549d4
->  0x1004549d4: str    q17, [x18, x4]
    0x1004549d8: str    q0, [x0, x4]
    0x1004549dc: str    q18, [x1, x4]
    0x1004549e0: str    q4, [x2, x4]

Here is the disassembly for async_dispatch_0_pack_f16:

main$async_dispatch_0_pack_f16:
.Lfunc_begin0:
	.file	1 "exported_model_4096x4096.mlir"
	.loc	1 1 0
	.cfi_startproc
	stp	x29, x30, [sp, #-16]!
	mov	x29, sp
	.cfi_def_cfa w29, 16
	.cfi_offset w30, -8
	.cfi_offset w29, -16
	mov	x8, #0
.Ltmp9:
	.loc	1 3 19 prologue_end
	ldr	x9, [x1, #32]
	ldp	x10, x9, [x9]
	.loc	1 9 10
	ldp	w11, w12, [x2]
	lsl	x12, x12, #19
	add	x13, x12, x11, lsl #10
	add	x9, x9, x13
	add	x11, x12, x11, lsl #7
	add	x10, x10, x11
.LBB0_1:
	.loc	1 0 10 is_stmt 0
	mov	x11, #0
	mov	x12, x10
	mov	x13, x9
.LBB0_2:
	mov	x14, #0
	mov	x15, #0
.LBB0_3:
	.loc	1 9 10 is_stmt 1
	ldr	h0, [x12, x14]
	str	h0, [x13, x15]
	add	x15, x15, #2
	add	x14, x14, #2, lsl #12
	cmp	x15, #16
	b.ne	.LBB0_3
	add	x11, x11, #1
	add	x13, x13, #16
	add	x12, x12, #2
	cmp	x11, #64
	b.ne	.LBB0_2
	add	x8, x8, #1
	add	x9, x9, #16, lsl #12
	add	x10, x10, #16, lsl #12
	cmp	x8, #8
	b.ne	.LBB0_1
	mov	w0, #0
	.loc	1 9 10 epilogue_begin is_stmt 0
	ldp	x29, x30, [sp], #16
	ret

The four instructions (str instructions with x-register offset) in the snippet can instead be found in another dispatch function, main$async_dispatch_2_unpack_elementwise_4096x4096_f32xf16.

@benvanik
Collaborator

benvanik commented Feb 3, 2025

Add a printf in iree_elf_call_i_ppp and see if a0/a1/a2 are OK. If they are, it may be a calling-convention change or something with memory protection; see runtime/src/iree/hal/local/elf/platform/apple.c.
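
A minimal sketch of that instrumentation (a hypothetical shim; the function name and signature are inferred from the stack traces above, so treat this as illustrative rather than the actual runtime code):

#include <stdio.h>

// Declared by the embedded ELF loader; signature inferred from the traces.
int iree_elf_call_i_ppp(const void* symbol_ptr, void* a0, void* a1, void* a2);

// Hypothetical debug shim: dump the arguments just before forwarding the call.
static int debug_elf_call_i_ppp(const void* symbol_ptr, void* a0, void* a1,
                                void* a2) {
  fprintf(stderr, "a0: %p\na1: %p\na2: %p\n", a0, a1, a2);
  return iree_elf_call_i_ppp(symbol_ptr, a0, a1, a2);
}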

@bjacob
Contributor

bjacob commented Feb 3, 2025

Definitely looks like an issue with the call mechanics! Note that above we were suspecting that we were somehow calling the wrong target function: #19873 (comment) said we should be calling dispatch_0, but the actual disassembly snippet at the crash looks more like dispatch_2, trying to store to a destination buffer with a null base pointer, which could be explained by having called the wrong function. Also note, from above:

So I just found out that everything works when I set --iree-llvmcpu-link-embedded=false; if I set it to true, the crash happens.

@ziereis
Contributor Author

ziereis commented Feb 3, 2025

a0: 0x600001300028
a1: 0x16f31ed00
a2: 0x16f31ecc0
a0: 0x600001300028
a1: 0x16f31ed00
a2: 0x16f31ecc0
a0: 0x600001300028
a1: 0x16f31ed00
a2: 0x16f31ecc0
fish: Job 2, './build/tools/iree-run-module -…' terminated by signal SIGSEGV (Address boundary error)

More disassembly, if it helps:

->  0x1002a492c: str    q6, [x18, x4]
    0x1002a4930: str    q7, [x0, x4]
    0x1002a4934: str    q16, [x1, x4]
    0x1002a4938: str    q0, [x2, x4]
    0x1002a493c: add    x4, x13, #0x8
    0x1002a4940: add    x3, x3, #0x100
    0x1002a4944: cmp    x13, #0x38
    0x1002a4948: mov    x13, x4
    0x1002a494c: b.lo   0x1002a4874
    0x1002a4950: add    x13, x8, #0x8
    0x1002a4954: add    x12, x12, #0x20, lsl #12 ; =0x20000 
    0x1002a4958: cmp    x8, #0x38
    0x1002a495c: mov    x8, x13
    0x1002a4960: b.lo   0x1002a4848
    0x1002a4964: mov    w0, #0x0 ; =0 
    0x1002a4968: mov    sp, x29
    0x1002a496c: ldr    x19, [sp, #0x10]
    0x1002a4970: ldp    x29, x30, [sp], #0x20
    0x1002a4974: ret    
    0x1002a4978: nop    
    0x1002a497c: adr    x8, 0x1002b4c60
    0x1002a4980: cmp    w0, #0x5
    0x1002a4984: csel   x0, x8, xzr, eq
    0x1002a4988: ret    
    0x1002a498c: movi.2d v0, #0000000000000000
    0x1002a4990: movi.2d v1, #0000000000000000
    0x1002a4994: ldrb   w8, [x3, #0x6d]
    0x1002a4998: movi.2d v2, #0000000000000000
    0x1002a499c: movi.2d v3, #0000000000000000
    0x1002a49a0: movi.2d v4, #0000000000000000
    0x1002a49a4: movi.2d v5, #0000000000000000
    0x1002a49a8: movi.2d v6, #0000000000000000
    0x1002a49ac: movi.2d v7, #0000000000000000
    0x1002a49b0: movi.2d v16, #0000000000000000
    0x1002a49b4: movi.2d v17, #0000000000000000
    0x1002a49b8: movi.2d v18, #0000000000000000
    0x1002a49bc: movi.2d v19, #0000000000000000
    0x1002a49c0: movi.2d v20, #0000000000000000
    0x1002a49c4: movi.2d v21, #0000000000000000
    0x1002a49c8: movi.2d v22, #0000000000000000
    0x1002a49cc: movi.2d v23, #0000000000000000
    0x1002a49d0: tbz    w8, #0x0, 0x1002a49f4
    0x1002a49d4: ldp    q19, q18, [x0]
    0x1002a49d8: ldp    q17, q16, [x0, #0x20]
    0x1002a49dc: ldp    q7, q6, [x0, #0x40]
    0x1002a49e0: ldp    q5, q4, [x0, #0x60]
    0x1002a49e4: ldp    q3, q2, [x0, #0x80]
    0x1002a49e8: ldp    q1, q0, [x0, #0xa0]
    0x1002a49ec: ldp    q20, q21, [x0, #0xc0]
    0x1002a49f0: ldp    q22, q23, [x0, #0xe0]
    0x1002a49f4: ldr    x8, [x3, #0x58]
    0x1002a49f8: cmp    x8, #0x1
    0x1002a49fc: b.lt   0x1002a4a60
    0x1002a4a00: ldp    d24, d25, [x2], #0x10
    0x1002a4a04: ldp    d26, d27, [x1], #0x10
    0x1002a4a08: subs   x8, x8, #0x1
    0x1002a4a0c: fcvtl  v24.4s, v24.4h
    0x1002a4a10: fcvtl  v25.4s, v25.4h
    0x1002a4a14: fcvtl  v26.4s, v26.4h
    0x1002a4a18: fcvtl  v27.4s, v27.4h
    0x1002a4a1c: fmla.4s v19, v24, v26[0]
    0x1002a4a20: fmla.4s v18, v25, v26[0]
    0x1002a4a24: fmla.4s v17, v24, v26[1]
    0x1002a4a28: fmla.4s v16, v25, v26[1]
    0x1002a4a2c: fmla.4s v7, v24, v26[2]
    0x1002a4a30: fmla.4s v6, v25, v26[2]
    0x1002a4a34: fmla.4s v5, v24, v26[3]
    0x1002a4a38: fmla.4s v4, v25, v26[3]
    0x1002a4a3c: fmla.4s v3, v24, v27[0]
    0x1002a4a40: fmla.4s v2, v25, v27[0]
    0x1002a4a44: fmla.4s v1, v24, v27[1]
    0x1002a4a48: fmla.4s v0, v25, v27[1]
    0x1002a4a4c: fmla.4s v20, v24, v27[2]
    0x1002a4a50: fmla.4s v21, v25, v27[2]
    0x1002a4a54: fmla.4s v22, v24, v27[3]
    0x1002a4a58: fmla.4s v23, v25, v27[3]
    0x1002a4a5c: b.ne   0x1002a4a00
    0x1002a4a60: stp    q19, q18, [x0]
    0x1002a4a64: stp    q17, q16, [x0, #0x20]
    0x1002a4a68: stp    q7, q6, [x0, #0x40]
    0x1002a4a6c: stp    q5, q4, [x0, #0x60]
    0x1002a4a70: stp    q3, q2, [x0, #0x80]
    0x1002a4a74: stp    q1, q0, [x0, #0xa0]
    0x1002a4a78: stp    q20, q21, [x0, #0xc0]
    0x1002a4a7c: stp    q22, q23, [x0, #0xe0]
    0x1002a4a80: ret    
    0x1002a4a84: movi.2d v0, #0000000000000000
    0x1002a4a88: movi.2d v1, #0000000000000000
    0x1002a4a8c: ldrb   w8, [x3, #0x6d]
    0x1002a4a90: movi.2d v2, #0000000000000000
    0x1002a4a94: movi.2d v3, #0000000000000000
    0x1002a4a98: movi.2d v4, #0000000000000000
    0x1002a4a9c: movi.2d v5, #0000000000000000
    0x1002a4aa0: movi.2d v6, #0000000000000000
    0x1002a4aa4: movi.2d v7, #0000000000000000
    0x1002a4aa8: movi.2d v16, #0000000000000000
    0x1002a4aac: movi.2d v17, #0000000000000000
    0x1002a4ab0: movi.2d v18, #0000000000000000
    0x1002a4ab4: movi.2d v19, #0000000000000000
    0x1002a4ab8: movi.2d v20, #0000000000000000

@bjacob
Contributor

bjacob commented Feb 4, 2025

Thanks. The a1/a2 values are normal-looking non-null pointers. As for a0: 0x600001300028, I don't know; maybe it's a stack pointer or a global? I don't know Mac address-space layouts. @benvanik, do we expect a stack pointer or a global for a0?

The disassembly up to this point confirms it really is the dispatch_2 function, not dispatch_0:

->  0x1002a492c: str    q6, [x18, x4]
    0x1002a4930: str    q7, [x0, x4]
    0x1002a4934: str    q16, [x1, x4]
    0x1002a4938: str    q0, [x2, x4]
    0x1002a493c: add    x4, x13, #0x8
    0x1002a4940: add    x3, x3, #0x100
    0x1002a4944: cmp    x13, #0x38
    0x1002a4948: mov    x13, x4
    0x1002a494c: b.lo   0x1002a4874
    0x1002a4950: add    x13, x8, #0x8
    0x1002a4954: add    x12, x12, #0x20, lsl #12 ; =0x20000 
    0x1002a4958: cmp    x8, #0x38
    0x1002a495c: mov    x8, x13
    0x1002a4960: b.lo   0x1002a4848
    0x1002a4964: mov    w0, #0x0 ; =0 
    0x1002a4968: mov    sp, x29
    0x1002a496c: ldr    x19, [sp, #0x10]
    0x1002a4970: ldp    x29, x30, [sp], #0x20
    0x1002a4974: ret 

After this point, the rest of the disassembly is a new function, dispatch_3, the matrix multiplication kernel. We can see it converting f16 to f32 and performing the multiplications in f32:

    0x1002a4a0c: fcvtl  v24.4s, v24.4h
    0x1002a4a10: fcvtl  v25.4s, v25.4h
    0x1002a4a14: fcvtl  v26.4s, v26.4h
    0x1002a4a18: fcvtl  v27.4s, v27.4h
    0x1002a4a1c: fmla.4s v19, v24, v26[0]
    0x1002a4a20: fmla.4s v18, v25, v26[0]
    0x1002a4a24: fmla.4s v17, v24, v26[1]
...

Which means that it hasn't properly picked up the CPU features from host. Apple CPUs should definitely support f16 arithmetic natively, so we'll look into that after we've solved the crash. Meanwhile, a higher-level point: even with native support enabled, f16 on arm64-architecture CPUs still won't be much faster than f32 (and on x86 it's far worse still), while bf16 would be 2x faster, so consider using that if you can; the only reason to use f16 on CPU is when the exact workload must match a preexisting GPU workload, which is more typically f16.
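
(In the reproducer above, that would mean, for instance, constructing the layer and the example input with dtype=torch.bfloat16 instead of torch.float16.)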

@bjacob
Contributor

bjacob commented Feb 4, 2025

This reproduces on my M2 Mac; it's not M4-specific.

The intended callee really is dispatch_2, so there isn't a mismatch there after all.

(lldb) p library->exports.names[ordinal]
(const char *const) 0x0000000100380592 "main$async_dispatch_2_unpack_elementwise_4096x4096_f32xf16"

Going to finish debugging this now that I can reproduce.

@bjacob
Contributor

bjacob commented Feb 4, 2025

Oh, I know! x18 strikes again 🤦 It was right there in the disassembly in the original issue description above... my CPU-fu is getting rusty. I finally got it after stepping instruction by instruction through the dispatch until it crashed and asking, "wait, I thought that register was a valid pointer value".

(If you're curious: the AArch64 PCS reserves x18 for the operating system. Application code is not supposed to use it at all; each OS uses it for a different purpose, which generally leads to x18 magically changing value under the feet of a userspace program that uses it. For that reason, application code needs to be compiled with an LLVM CPU feature flag, reserve-x18, that prevents allocating that register.)
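
For reference, and to the best of my recollection: Clang exposes the same reservation as -ffixed-x18, and in LLVM target-feature strings it is spelled "+reserve-x18".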

Fix coming... FYI @benvanik, @hanhanW

@benvanik
Collaborator

benvanik commented Feb 4, 2025

ah hah! nice find!

bjacob added a commit that referenced this issue Feb 4, 2025
The code was intending to add the `reserve-x18` flag, and it was being
executed... but the `hasFlag` method that it was calling wasn't doing
what it thought it was.

That `hasFlag` method is just a static member that returns `true` if its
string argument starts with a `+`.

There isn't an actual method on `SubtargetFeatures` to check if we
already have a given feature. Anyway, it doesn't matter - in this case,
we don't already have the feature, and even if we did, multiply
specified features are not a problem.

But it really, really matters that we don't accidentally allocate `x18`
again :-)

Fixes #19873.

---------

Signed-off-by: Benoit Jacob <jacob.benoit.1@gmail.com>
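
In other words, the pitfall looked roughly like this (a minimal sketch: the guard shape and names are assumptions, while the hasFlag behavior is per llvm/TargetParser/SubtargetFeature.h):

#include "llvm/TargetParser/SubtargetFeature.h"

void addReserveX18(llvm::SubtargetFeatures &features) {
  // BUG (sketch): SubtargetFeatures::hasFlag() is a static string helper
  // that only checks whether its argument starts with '+' or '-'; it never
  // queries the feature set. It returns true for "+reserve-x18", so this
  // condition is always false and AddFeature() is never reached.
  if (!llvm::SubtargetFeatures::hasFlag("+reserve-x18"))
    features.AddFeature("reserve-x18");

  // FIX (sketch): add the feature unconditionally; multiply specified
  // features are not a problem.
  features.AddFeature("reserve-x18");
}
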
@ziereis
Contributor Author

ziereis commented Feb 4, 2025

Thank you, learned something new :)

@hanhanW
Contributor

hanhanW commented Feb 4, 2025

Same here; I picked up the issue to learn something, and I learned even more from Benoit today. Thanks, Benoit!

ita9naiwa pushed a commit to ita9naiwa/iree that referenced this issue Feb 4, 2025
A step towards iree-org#19873

---------

Signed-off-by: hanhanW <hanhan0912@gmail.com>
Signed-off-by: Hyunsung Lee <ita9naiwa@gmail.com>
ita9naiwa pushed a commit to ita9naiwa/iree that referenced this issue Feb 4, 2025
Fixes iree-org#19873.

---------

Signed-off-by: Benoit Jacob <jacob.benoit.1@gmail.com>
Signed-off-by: Hyunsung Lee <ita9naiwa@gmail.com>
bjacob added a commit that referenced this issue Feb 4, 2025
As newer aarch64 targets increasingly support SVE and SME, this clause
was preventing ukernels from being used in cases where they do speed
things up. The reason this logic was out of place here is that what it
controls is the enablement of ukernels, which are a detail of lowering
an already-tiled workload. If we wanted to use SVE with a variable
vector length, or with a fixed vector length different from NEON's
128-bit, that decision needed to be made earlier; conversely, if the
workload at this point already has the right shape to be matched to a
NEON ukernel, then SVE is no longer relevant to it.

FYI @ziereis, this results in substantially faster code in your test
case from #19873.

Signed-off-by: Benoit Jacob <jacob.benoit.1@gmail.com>