C++ exception when reshaping after transposing on GPU #104

rounak · 2024-06-24T07:34:41Z

The following code (or this sample project):

var arr = MLXArray([0,1,2,3,4,5,6,7,8,9,10,11]).reshaped([1, 2, 2, 3])
print(arr)

arr = arr.transposed(0, 2, 1, 3)
print(arr)
print(arr.reshaped([1, 4, 3]))

results in a C++ vector index out of bounds exception on the last reshape operation, with the following backtrace:


* thread #5, stop reason = signal SIGABRT
    frame #0: 0x0000000186ff15e0 libsystem_kernel.dylib`__pthread_kill + 8
    frame #1: 0x0000000103e3bfa8 libsystem_pthread.dylib`pthread_kill + 288
    frame #2: 0x0000000186f36908 libsystem_c.dylib`abort + 128
  * frame #3: 0x0000000100075cfc MLXPlayground`std::__1::vector<unsigned long, std::__1::allocator<unsigned long>>::operator[][abi:de180100](this=0x00006000024fb3d8 size=3, __n=3) const at vector:1400:3
    frame #4: 0x0000000100558778 MLXPlayground`std::__1::tuple<std::__1::vector<int, std::__1::allocator<int>>, std::__1::vector<std::__1::vector<unsigned long, std::__1::allocator<unsigned long>>, std::__1::allocator<std::__1::vector<unsigned long, std::__1::allocator<unsigned long>>>>> mlx::core::collapse_contiguous_dims<unsigned long>(shape=size=4, strides=size=2) at utils.h:76:32
    frame #5: 0x0000000100b38ec0 MLXPlayground`void mlx::core::copy_gpu_inplace<unsigned long>(in=0x00006000028a8400, out=0x00006000028b4140, data_shape=size=4, strides_in_pre=size=4, strides_out_pre=size=3, inp_offset=0, out_offset=0, ctype=General, s=0x00006000009a76d0) at copy.cpp:59:27
    frame #6: 0x0000000100b38da0 MLXPlayground`mlx::core::copy_gpu_inplace(in=0x00006000028a8400, out=0x00006000028b4140, ctype=General, s=0x00006000009a76d0) at copy.cpp:147:10
    frame #7: 0x0000000100b38ccc MLXPlayground`mlx::core::copy_gpu(in=0x00006000028a8400, out=0x00006000028b4140, ctype=General, s=0x00006000009a76d0) at copy.cpp:40:3
    frame #8: 0x0000000100b38dfc MLXPlayground`mlx::core::copy_gpu(in=0x00006000028a8400, out=0x00006000028b4140, ctype=General) at copy.cpp:44:3
    frame #9: 0x0000000100ba81ac MLXPlayground`mlx::core::Reshape::eval_gpu(this=0x00006000009a76c8, inputs=size=1, out=0x00006000028b4140) at primitives.cpp:823:5
    frame #10: 0x00000001004293a4 MLXPlayground`mlx::core::UnaryPrimitive::eval_gpu(this=0x00006000009a76c8, inputs=size=1, outputs=size=1) at primitives.h:145:5
    frame #11: 0x0000000100b88448 MLXPlayground`mlx::core::metal::make_task(mlx::core::array, bool)::$_0::operator()(this=0x0000600002ac4608) at metal.cpp:81:23
    frame #12: 0x0000000100b881a4 MLXPlayground`decltype(std::declval<mlx::core::metal::make_task(mlx::core::array, bool)::$_0&>()()) std::__1::__invoke[abi:de180100]<mlx::core::metal::make_task(mlx::core::array, bool)::$_0&>(__f=0x0000600002ac4608) at invoke.h:344:25
    frame #13: 0x0000000100b8815c MLXPlayground`void std::__1::__invoke_void_return_wrapper<void, true>::__call[abi:de180100]<mlx::core::metal::make_task(mlx::core::array, bool)::$_0&>(__args=0x0000600002ac4608) at invoke.h:419:5
    frame #14: 0x0000000100b88138 MLXPlayground`std::__1::__function::__alloc_func<mlx::core::metal::make_task(mlx::core::array, bool)::$_0, std::__1::allocator<mlx::core::metal::make_task(mlx::core::array, bool)::$_0>, void ()>::operator()[abi:de180100](this=0x0000600002ac4608) at function.h:169:12
    frame #15: 0x0000000100b86f80 MLXPlayground`std::__1::__function::__func<mlx::core::metal::make_task(mlx::core::array, bool)::$_0, std::__1::allocator<mlx::core::metal::make_task(mlx::core::array, bool)::$_0>, void ()>::operator()(this=0x0000600002ac4600) at function.h:311:10
    frame #16: 0x000000010046b6f0 MLXPlayground`std::__1::__function::__value_func<void ()>::operator()[abi:de180100](this=0x000000017002aee8) const at function.h:428:12
    frame #17: 0x000000010046af90 MLXPlayground`std::__1::function<void ()>::operator()(this=0x000000017002aee8) const at function.h:981:10
    frame #18: 0x0000000100d30828 MLXPlayground`mlx::core::scheduler::StreamThread::thread_fn(this=0x0000600001fa0000) at scheduler.h:54:7
    frame #19: 0x0000000100d30f80 MLXPlayground`decltype(*std::declval<mlx::core::scheduler::StreamThread*>().*std::declval<void (mlx::core::scheduler::StreamThread::*)()>()()) std::__1::__invoke[abi:de180100]<void (mlx::core::scheduler::StreamThread::*)(), mlx::core::scheduler::StreamThread*, void>(__f=0x0000600002ab7da8, __a0=0x0000600002ab7db8) at invoke.h:312:25
    frame #20: 0x0000000100d30ef0 MLXPlayground`void std::__1::__thread_execute[abi:de180100]<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct>>, void (mlx::core::scheduler::StreamThread::*)(), mlx::core::scheduler::StreamThread*, 2ul>(__t=size=3, (null)=__tuple_indices<2UL> @ 0x000000017002af7f) at thread.h:199:3
    frame #21: 0x0000000100d30b9c MLXPlayground`void* std::__1::__thread_proxy[abi:de180100]<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct>>, void (mlx::core::scheduler::StreamThread::*)(), mlx::core::scheduler::StreamThread*>>(__vp=0x0000600002ab7da0) at thread.h:208:3
    frame #22: 0x0000000103e3a9ac libsystem_pthread.dylib`_pthread_start + 136

Similar code in mlx python doesn't crash. When I do the same operation on the CPU, it doesn't crash.

I'm running this on Xcode 16 b1 with macOS Sequoia b1.


davidkoski · 2024-06-24T15:31:30Z

It doesn't reproduce for me on macOS Sonoma 14.4:

array([[[0, 1, 2],
        [6, 7, 8],
        [3, 4, 5],
        [9, 10, 11]]], dtype=int32)

I will have to try it on Sequoia
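For reference, the ordering in that output follows directly from stride arithmetic. The sketch below is plain C++, not MLX code; `contiguous_strides`, `permute`, and `gather` are hypothetical helper names introduced only for illustration. It assumes a row-major layout, where the original [1, 2, 2, 3] array has strides [12, 6, 3, 1], so that transposing axes (0, 2, 1, 3) gives strides [12, 3, 6, 1]. Because those strides are no longer contiguous, reshaping to [1, 4, 3] needs a copy that walks the view in logical order.

```cpp
#include <cstddef>
#include <vector>

// Row-major (contiguous) strides, in elements, for a given shape.
std::vector<std::size_t> contiguous_strides(const std::vector<int>& shape) {
  std::vector<std::size_t> strides(shape.size(), 1);
  for (int i = static_cast<int>(shape.size()) - 2; i >= 0; --i) {
    strides[i] = strides[i + 1] * static_cast<std::size_t>(shape[i + 1]);
  }
  return strides;
}

// Reorder a per-dimension vector (shape or strides) by a transpose permutation.
template <typename T>
std::vector<T> permute(const std::vector<T>& v, const std::vector<int>& axes) {
  std::vector<T> out;
  out.reserve(axes.size());
  for (int a : axes) {
    out.push_back(v.at(static_cast<std::size_t>(a)));
  }
  return out;
}

// Copy a strided view into a flat buffer, visiting indices in row-major order.
std::vector<int> gather(const std::vector<int>& data,
                        const std::vector<int>& shape,
                        const std::vector<std::size_t>& strides) {
  std::size_t total = 1;
  for (int s : shape) total *= static_cast<std::size_t>(s);

  std::vector<int> out;
  out.reserve(total);
  std::vector<int> idx(shape.size(), 0);
  for (std::size_t n = 0; n < total; ++n) {
    // Element offset = sum over dimensions of index * stride.
    std::size_t offset = 0;
    for (std::size_t d = 0; d < shape.size(); ++d) {
      offset += static_cast<std::size_t>(idx[d]) * strides.at(d);
    }
    out.push_back(data.at(offset));
    // Advance the multi-dimensional index like an odometer.
    for (int d = static_cast<int>(shape.size()) - 1; d >= 0; --d) {
      if (++idx[d] < shape[d]) break;
      idx[d] = 0;
    }
  }
  return out;
}
```

Calling `gather` on the values 0 through 11 with the permuted shape and strides yields 0, 1, 2, 6, 7, 8, 3, 4, 5, 9, 10, 11, matching the array printed above.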

awni · 2024-06-24T15:33:49Z

Seems related to ml-explore/mlx-c#30. I think we should wait until updating MLX Swift and MLX C to the latest MLX then try this again. MLX Core had a few updates to get it working with OS 15 ml-explore/mlx#1208

davidkoski · 2024-07-01T22:25:43Z

#101 puts mlx-swift on the latest mlx -- can you give this a try again?

DePasqualeOrg · 2024-07-02T11:10:04Z

I'm still getting the same crash in Xcode 16 with error vector[] index out of bounds when calling generate using the latest version of mlx-swift. Before the crash, this warning is shown multiple times:

Warning: Compilation succeeded with: 

program_source:261:31: warning: unused variable 'MAX_REDUCE_SPECIALIZED_DIMS' [-Wunused-const-variable]
static constant constexpr int MAX_REDUCE_SPECIALIZED_DIMS = 4;
                              ^
program_source:262:31: warning: unused variable 'REDUCE_N_READS' [-Wunused-const-variable]
static constant constexpr int REDUCE_N_READS = 16;
                              ^
program_source:263:31: warning: unused variable 'SOFTMAX_N_READS' [-Wunused-const-variable]
static constant constexpr int SOFTMAX_N_READS = 4;
                              ^
program_source:264:31: warning: unused variable 'RMS_N_READS' [-Wunused-const-variable]
static constant constexpr int RMS_N_READS = 4;
                              ^
program_source:265:31: warning: unused variable 'RMS_LOOPED_LIMIT' [-Wunused-const-variable]
static constant constexpr int RMS_LOOPED_LIMIT = 4096;
                              ^

Xcode 15 doesn't crash, but now shows the warnings, which wasn't the case before.

davidkoski · 2024-07-02T13:22:56Z

Those are from the JIT compile and are not related. OK, so it is still failing this particular test on Sequoia (macOS 15)

LiYanan2004 · 2024-07-02T13:40:54Z

Yeah. It still throws the error after updating to the latest version, 0.15.2.

rounak · 2024-07-03T02:41:43Z

I tried this on main and on the latest tag of mlx-swift, and I'm still getting the same crash.

awni · 2024-07-03T14:34:31Z

This might be related to ml-explore/mlx-examples#642

davidkoski · 2024-07-03T20:03:00Z

OK, I can reproduce this on macOS 15 with Xcode 16. I find that it reproduces with Debug builds but not Release.

The problem is in:

collapse_contiguous_dims(

      out_strides[j].push_back(st[to_collapse[i - 1]]);

(lldb) p to_collapse[i - 1]
(std::vector<int>::value_type) 3

(lldb) p st
(const std::vector<unsigned long> &) size=3: {
  [0] = 12
  [1] = 3
  [2] = 1
}

The code executes the same way in Release but appears to silently pass when evaluating st[3]

It doesn't crash in Release because this macro in vector is empty:

  _LIBCPP_ASSERT_VALID_ELEMENT_ACCESS(__n < size(), "vector[] index out of bounds");
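The failing lookup can be illustrated in isolation. The backtrace shows `data_shape=size=4` alongside `strides_out_pre=size=3`, so a dimension index of 3 is valid for the shape but one past the end of a size-3 stride vector such as the `st` printed above. A minimal sketch, assuming a hypothetical `stride_at` helper (not part of MLX) that makes the check explicit instead of relying on whether the standard library's assertion is compiled in:

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: bounds-checked stride lookup that fails the same way
// in Debug and Release builds, unlike an unchecked operator[].
std::size_t stride_at(const std::vector<std::size_t>& strides, int dim) {
  if (dim < 0 || static_cast<std::size_t>(dim) >= strides.size()) {
    throw std::out_of_range(
        "stride index " + std::to_string(dim) +
        " out of range for rank " + std::to_string(strides.size()));
  }
  return strides[static_cast<std::size_t>(dim)];
}
```

With `st = {12, 3, 1}`, `stride_at(st, 2)` returns 1, while `stride_at(st, 3)` throws `std::out_of_range` regardless of build configuration. This only surfaces the mismatch; the actual fix is to avoid indexing strides with a rank they do not have.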

davidkoski · 2024-07-03T20:23:12Z

And this is turned on in Debug builds and appears to be new in macOS 15:

// Debug hardening mode checks.

#  elif _LIBCPP_HARDENING_MODE == _LIBCPP_HARDENING_MODE_DEBUG

Per https://libcxx.llvm.org/Hardening.html#notes-for-users
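To see which mode a particular build uses, the two macros quoted above can be tested directly. A minimal sketch, assuming libc++ defines both `_LIBCPP_HARDENING_MODE` and `_LIBCPP_HARDENING_MODE_DEBUG` once any standard header is included; `hardening_mode_name` is a hypothetical helper, and with other standard libraries it falls through to "unknown":

```cpp
#include <string>  // including a standard header makes libc++'s config macros visible

// Hypothetical helper: reports whether this translation unit was compiled
// with libc++'s debug hardening mode enabled.
std::string hardening_mode_name() {
#if defined(_LIBCPP_HARDENING_MODE) && defined(_LIBCPP_HARDENING_MODE_DEBUG)
#  if _LIBCPP_HARDENING_MODE == _LIBCPP_HARDENING_MODE_DEBUG
  return "debug";      // element-access assertions such as vector[] are active
#  else
  return "non-debug";  // some other hardening mode (or none) is selected
#  endif
#else
  return "unknown";    // not libc++, or a version without these macros
#endif
}
```

Printing the result from a Debug and a Release build of the same target is one way to confirm the difference in behavior described above.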

derekelewis · 2024-07-04T17:14:01Z

More info here in the Apple Xcode 16 C++ language support docs:

https://developer.apple.com/xcode/cpp/#library-hardening

Paramstr · 2024-07-05T02:49:59Z

Getting same error when calling:

...
                let result = await MLXLLM.generate(
                    promptTokens: promptTokens, parameters: GenerateParameters(), model: model,
                    tokenizer: tokenizer, extraEOSTokens: modelConfiguration.extraEOSTokens
                ) { tokens in
                    let text = tokenizer.decode(tokens: tokens)
                
                    modelOutputTokens = tokens.count
                    
                    // update the output -- this will make the view show the text as it generates
                    if tokens.count % displayEveryNTokens == 0 {
                
                           
                        await MainActor.run {
                            self.output = text
                        }
                        
                        ..

_LIBCPP_ASSERT_VALID_ELEMENT_ACCESS(__n < size(), "vector[] index out of bounds");

Mac Version 15.0 Beta (24A5279h)
Version 16.0 beta 2 (16A5171r)

davidkoski · 2024-07-05T06:14:57Z

Right, there is no fix yet, but we have a better handle on what is going on and why it shows up in macOS 15

davidkoski · 2024-07-08T14:49:59Z

Note: this is merged on the mlx core side but not picked up in mlx-swift yet.

Paramstr · 2024-07-12T00:59:37Z

Any idea when this will be in mlx-swift?

davidkoski · 2024-07-12T01:14:19Z

They just cut a release of the mlx core so I would need to integrate that. Hopefully next week.

You can avoid the assertion by building Release, though that doesn't actually avoid the underlying bug that the new assertions picked up.

davidkoski · 2024-07-15T16:51:25Z

This should be fixed once #115 merges

davidkoski · 2024-07-15T18:00:22Z

Merged #115, please try this out

DePasqualeOrg · 2024-07-15T19:26:09Z

This fixes the crash for me.

rounak · 2024-07-16T03:30:16Z

It fixes it for me too, thanks!

rounak changed the title ~~C++ exception when reshaping after transposing~~ C++ exception when reshaping after transposing on GPU Jun 24, 2024

awni mentioned this issue Jun 27, 2024

Crash on Xcode 16.0 beta 2 when calling MLXLLM.generate ml-explore/mlx-swift-examples#86

Closed

angeloskath mentioned this issue Jul 6, 2024

Fix reshape copy bug ml-explore/mlx#1253

Merged

angeloskath closed this as completed in ml-explore/mlx#1253 Jul 8, 2024

davidkoski reopened this Jul 8, 2024

awni mentioned this issue Jul 15, 2024

Noting a macOS 15 Beta 3 crash #114

Closed

davidkoski closed this as completed Jul 15, 2024
