Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CI failing: "terminate called without an active exception" #8018

Closed
teh-cmc opened this issue Nov 6, 2024 · 9 comments · Fixed by #8038
Closed

CI failing: "terminate called without an active exception" #8018

teh-cmc opened this issue Nov 6, 2024 · 9 comments · Fixed by #8038
Assignees
Labels
💣 crash crash, deadlock/freeze, do-no-start 🚢 CI

Comments

@teh-cmc
Copy link
Member

teh-cmc commented Nov 6, 2024

😒

Seems like there are two issues at play here:

Both seem related to threading issues. Current hypothesis is that they are two manifestations of the same underlying problem.

@teh-cmc teh-cmc added 💣 crash crash, deadlock/freeze, do-no-start 🚢 CI labels Nov 6, 2024
@teh-cmc teh-cmc added this to the 0.20 - Maps, H.264, Undo milestone Nov 6, 2024
@teh-cmc teh-cmc changed the title Nightly failing: "terminate called without an active exception" CI failing: "terminate called without an active exception" Nov 6, 2024
@abey79
Copy link
Member

abey79 commented Nov 6, 2024

Full CI run on the offending PR: https://github.com/rerun-io/rerun/actions/runs/11704123409

@teh-cmc
Copy link
Member Author

teh-cmc commented Nov 6, 2024

@teh-cmc
Copy link
Member Author

teh-cmc commented Nov 6, 2024

Repro'd locally:

#0  0x00007a019fe533f4 in ?? () from /usr/lib/libc.so.6
#1  0x00007a019fdfa120 in raise () from /usr/lib/libc.so.6
#2  0x00007a019fde14c3 in abort () from /usr/lib/libc.so.6
#3  0x00007a016fcd5fae in __gnu_cxx::__verbose_terminate_handler () at ../../../../libstdc++-v3/libsupc++/vterminate.cc:95
#4  0x00007a016fcd44f2 in __cxxabiv1::__terminate (handler=<optimized out>) at ../../../../libstdc++-v3/libsupc++/eh_terminate.cc:48
#5  0x00007a016fcce2f3 in std::terminate () at ../../../../libstdc++-v3/libsupc++/eh_terminate.cc:58
#6  0x00007a016fcd41c1 in __cxxabiv1::__gxx_personality_v0 (version=<optimized out>, actions=10, exception_class=0, ue_header=0x79ffb8600d30, context=0x79ffb85feff0)
    at ../../../../libstdc++-v3/libsupc++/eh_personality.cc:685
#7  0x00007a019ea03f35 in _Unwind_ForcedUnwind_Phase2 (exc=exc@entry=0x79ffb8600d30, context=context@entry=0x79ffb85feff0, frames_p=frames_p@entry=0x79ffb85feef8) at ../../../libgcc/unwind.inc:183
#8  0x00007a019ea0429c in _Unwind_ForcedUnwind (exc=0x79ffb8600d30, stop=0x7a019fe59e80, stop_argument=<optimized out>) at ../../../libgcc/unwind.inc:218
#9  0x00007a019fe5a045 in ?? () from /usr/lib/libc.so.6
#10 0x00007a019fe52732 in pthread_exit () from /usr/lib/libc.so.6
#11 0x00005687e67260b7 in PyThread_exit_thread () at /usr/local/src/conda/python-3.11.10/Python/thread_pthread.h:366
#12 0x00005687e668e990 in take_gil (tstate=tstate@entry=0x79ff3c0012f0) at /usr/local/src/conda/python-3.11.10/Python/ceval_gil.h:207
#13 0x00005687e677bac2 in PyEval_RestoreThread (tstate=tstate@entry=0x79ff3c0012f0) at /usr/local/src/conda/python-3.11.10/Python/ceval.c:521
#14 0x00005687e685f91d in PyGILState_Ensure () at /usr/local/src/conda/python-3.11.10/Python/pystate.c:1712
#15 0x00007a017870cc9e in arrow::py::NumPyBuffer::~NumPyBuffer() () from /home/cmc/dev/rerun-io/rerun/.pixi/envs/py/lib/python3.11/site-packages/pyarrow/libarrow_python.so
#16 0x00007a01786e3073 in std::_Sp_counted_ptr_inplace<arrow::ArrayData, std::allocator<void>, (__gnu_cxx::_Lock_policy)2>::_M_dispose() ()
   from /home/cmc/dev/rerun-io/rerun/.pixi/envs/py/lib/python3.11/site-packages/pyarrow/libarrow_python.so
#17 0x00007a01786e304b in std::_Sp_counted_ptr_inplace<arrow::ArrayData, std::allocator<void>, (__gnu_cxx::_Lock_policy)2>::_M_dispose() ()
   from /home/cmc/dev/rerun-io/rerun/.pixi/envs/py/lib/python3.11/site-packages/pyarrow/libarrow_python.so
#18 0x00007a008e768bf7 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() () from /home/cmc/dev/rerun-io/rerun/.pixi/envs/py/lib/python3.11/site-packages/pyarrow/../../../libarrow.so.1800
#19 0x00007a008e960b3b in arrow::(anonymous namespace)::ReleaseExportedArray(ArrowArray*) () from /home/cmc/dev/rerun-io/rerun/.pixi/envs/py/lib/python3.11/site-packages/pyarrow/../../../libarrow.so.1800
#20 0x00007a0089c5db70 in core::ptr::drop_in_place<dyn arrow_buffer::alloc::Allocation> () at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/ptr/mod.rs:514
#21 alloc::sync::Arc<dyn arrow_buffer::alloc::Allocation, alloc::alloc::Global>::drop_slow<dyn arrow_buffer::alloc::Allocation, alloc::alloc::Global> (self=<optimized out>)
    at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/alloc/src/sync.rs:1812
#22 0x00007a0089c5dfac in alloc::sync::{impl#34}::drop<dyn arrow_buffer::alloc::Allocation, alloc::alloc::Global> (self=0x34b9f0) at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/alloc/src/sync.rs:2470
#23 core::ptr::drop_in_place<alloc::sync::Arc<dyn arrow_buffer::alloc::Allocation, alloc::alloc::Global>> () at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/ptr/mod.rs:514
#24 core::ptr::drop_in_place<arrow_buffer::alloc::Deallocation> () at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/ptr/mod.rs:514
#25 core::ptr::drop_in_place<arrow_buffer::bytes::Bytes> () at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/ptr/mod.rs:514
#26 alloc::sync::Arc<arrow_buffer::bytes::Bytes, alloc::alloc::Global>::drop_slow<arrow_buffer::bytes::Bytes, alloc::alloc::Global> (self=<optimized out>)
    at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/alloc/src/sync.rs:1812
#27 0x00007a0089c5dd54 in core::ptr::drop_in_place<foreign_vec::ForeignVec<re_arrow2::buffer::BytesAllocator, u32>> () at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/ptr/mod.rs:514
#28 core::ptr::drop_in_place<re_arrow2::buffer::Bytes<u32>> () at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/ptr/mod.rs:514
#29 alloc::sync::Arc<re_arrow2::buffer::Bytes<u32>, alloc::alloc::Global>::drop_slow<re_arrow2::buffer::Bytes<u32>, alloc::alloc::Global> (self=<optimized out>)
    at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/alloc/src/sync.rs:1812
#30 0x00007a0089c6e55f in alloc::sync::{impl#34}::drop<re_arrow2::buffer::Bytes<f32>, alloc::alloc::Global> (self=0x34b9f0) at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/alloc/src/sync.rs:2470
#31 core::ptr::drop_in_place<alloc::sync::Arc<re_arrow2::buffer::Bytes<f32>, alloc::alloc::Global>> () at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/ptr/mod.rs:514
#32 core::ptr::drop_in_place<re_arrow2::buffer::immutable::Buffer<f32>> () at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/ptr/mod.rs:514
#33 core::ptr::drop_in_place<re_arrow2::array::primitive::PrimitiveArray<f32>> () at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/ptr/mod.rs:514
#34 0x00007a0089c6e36f in core::ptr::drop_in_place<alloc::boxed::Box<dyn re_arrow2::array::Array, alloc::alloc::Global>> () at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/ptr/mod.rs:514
#35 core::ptr::drop_in_place<re_arrow2::array::fixed_size_list::FixedSizeListArray> () at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/ptr/mod.rs:514
#36 0x00007a0089c0bf63 in core::ptr::drop_in_place<alloc::boxed::Box<dyn re_arrow2::array::Array, alloc::alloc::Global>> () at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/ptr/mod.rs:514
#37 core::ptr::drop_in_place<re_arrow2::array::list::ListArray<i32>> () at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/ptr/mod.rs:514
#38 0x00007a0089c0d768 in core::ptr::drop_in_place<alloc::boxed::Box<dyn re_arrow2::array::Array, alloc::alloc::Global>> () at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/ptr/mod.rs:514
#39 core::ptr::drop_in_place<[alloc::boxed::Box<dyn re_arrow2::array::Array, alloc::alloc::Global>]> () at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/ptr/mod.rs:514
#40 alloc::vec::{impl#25}::drop<alloc::boxed::Box<dyn re_arrow2::array::Array, alloc::alloc::Global>, alloc::alloc::Global> (self=<optimized out>)
    at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/alloc/src/vec/mod.rs:3283
#41 0x00007a0089c0ad6d in core::ptr::drop_in_place<alloc::vec::Vec<alloc::boxed::Box<dyn re_arrow2::array::Array, alloc::alloc::Global>, alloc::alloc::Global>> ()
    at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/ptr/mod.rs:514
#42 0x00007a0089aa718a in core::ptr::drop_in_place<re_log_types::arrow_msg::ArrowMsg> () at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/ptr/mod.rs:514
#43 0x00007a0089aa8291 in core::ptr::drop_in_place<re_log_types::LogMsg> () at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/ptr/mod.rs:514
#44 core::ptr::drop_in_place<[re_log_types::LogMsg]> () at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/ptr/mod.rs:514
#45 alloc::vec::{impl#25}::drop<re_log_types::LogMsg, alloc::alloc::Global> (self=<optimized out>) at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/alloc/src/vec/mod.rs:3283
#46 0x00007a0089aa74fd in core::ptr::drop_in_place<alloc::vec::Vec<re_log_types::LogMsg, alloc::alloc::Global>> () at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/ptr/mod.rs:514
#47 0x00007a00898dc4c7 in core::ptr::drop_in_place<alloc::boxed::Box<dyn re_sdk::log_sink::LogSink, alloc::alloc::Global>> () at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/ptr/mod.rs:514
#48 re_sdk::recording_stream::forwarding_thread (info=..., sink=..., cmds_rx=..., chunks=..., on_release=...) at crates/top/re_sdk/src/recording_stream.rs:1449
#49 0x00007a00898ca8ad in re_sdk::recording_stream::{impl#5}::new::{closure#0} () at crates/top/re_sdk/src/recording_stream.rs:850
#50 std::sys_common::backtrace::__rust_begin_short_backtrace<re_sdk::recording_stream::{impl#5}::new::{closure_env#0}, ()> (f=...)
    at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/sys_common/backtrace.rs:155
#51 0x00007a00898e93b1 in std::thread::{impl#0}::spawn_unchecked_::{closure#2}::{closure#0}<re_sdk::recording_stream::{impl#5}::new::{closure_env#0}, ()> ()
    at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/thread/mod.rs:542
#52 core::panic::unwind_safe::{impl#23}::call_once<(), std::thread::{impl#0}::spawn_unchecked_::{closure#2}::{closure_env#0}<re_sdk::recording_stream::{impl#5}::new::{closure_env#0}, ()>> (
    self=<error reading variable: Cannot access memory at address 0x78>) at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/panic/unwind_safe.rs:272
#53 std::panicking::try::do_call<core::panic::unwind_safe::AssertUnwindSafe<std::thread::{impl#0}::spawn_unchecked_::{closure#2}::{closure_env#0}<re_sdk::recording_stream::{impl#5}::new::{closure_env#0}, ()>>, ()>
    (data=<optimized out>) at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/panicking.rs:559
#54 std::panicking::try<(), core::panic::unwind_safe::AssertUnwindSafe<std::thread::{impl#0}::spawn_unchecked_::{closure#2}::{closure_env#0}<re_sdk::recording_stream::{impl#5}::new::{closure_env#0}, ()>>> (f=...)
    at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/panicking.rs:523
#55 std::panic::catch_unwind<core::panic::unwind_safe::AssertUnwindSafe<std::thread::{impl#0}::spawn_unchecked_::{closure#2}::{closure_env#0}<re_sdk::recording_stream::{impl#5}::new::{closure_env#0}, ()>>, ()> (
    f=<error reading variable: Cannot access memory at address 0x78>) at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/panic.rs:149
#56 std::thread::{impl#0}::spawn_unchecked_::{closure#2}<re_sdk::recording_stream::{impl#5}::new::{closure_env#0}, ()> () at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/thread/mod.rs:541
#57 core::ops::function::FnOnce::call_once<std::thread::{impl#0}::spawn_unchecked_::{closure_env#2}<re_sdk::recording_stream::{impl#5}::new::{closure_env#0}, ()>, ()> ()
    at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/ops/function.rs:250
#58 0x00007a0089e849cb in alloc::boxed::{impl#48}::call_once<(), dyn core::ops::function::FnOnce<(), Output=()>, alloc::alloc::Global> () at library/alloc/src/boxed.rs:2022
#59 alloc::boxed::{impl#48}::call_once<(), alloc::boxed::Box<dyn core::ops::function::FnOnce<(), Output=()>, alloc::alloc::Global>, alloc::alloc::Global> () at library/alloc/src/boxed.rs:2022
#60 std::sys::pal::unix::thread::{impl#2}::new::thread_start () at library/std/src/sys/pal/unix/thread.rs:108
#61 0x00007a019fe5139d in ?? () from /usr/lib/libc.so.6
#62 0x00007a019fed649c in ?? () from /usr/lib/libc.so.6

@teh-cmc teh-cmc self-assigned this Nov 6, 2024
@emilk
Copy link
Member

emilk commented Nov 6, 2024

Oh, this seems related to GIL and and arrow-buffers! We've been in similar territory before. The problem then was that the arrow buffers needed to acquire a GIL lock to free their contents, and that lead to a dead-lock in some situations. The solution was to free the arrow buffers explicitly at the "right time". I believe @jleibs was involved in this.

@teh-cmc
Copy link
Member Author

teh-cmc commented Nov 6, 2024

Ye that was me in 61c23af 😄

Just need to figure out how something has managed to escape the garbage queue, it should be impossible...

@teh-cmc
Copy link
Member Author

teh-cmc commented Nov 6, 2024

Ok I'm 99% sure it's because the dataframe APIs are now able to load recordings/chunks without going through the log functions and therefore bypass all these protections.\

Nuh-uh that cannot possibly make sense, there are traces of RecordingStream in that coredump 🤔

@teh-cmc
Copy link
Member Author

teh-cmc commented Nov 7, 2024

The issue is definitely timing based and, as far as I can tell, has always been there. This is not the result of some recently added code.

The reason it now fails consistently on CI for @abey79 's branch is because that branch introduced a bunch of new unit tests, which changed the timings of the test runner in a way that always triggers the issue.
Removing these tests makes the CI pass again on that branch.

I used to be able to reproduce the failing timing locally on my machine (that's how I grabbed the coredump posted above), but ever since I rebooted my machine this morning, I haven't been able to hit it again 🤪

@teh-cmc
Copy link
Member Author

teh-cmc commented Nov 7, 2024

There we go.

Take the following code:

import rerun as rr

iterations = range(0, 1)

for i in iterations:
    print(f"{i}")
    rr.init("rerun_example_points3d")
    rr.log("points", rr.Points3D([[0, 0, 0], [1, 1, 1]]))
  • iterations = range(0, 1) works.
  • iterations = range(0, 2) up to range(0, 7) fails with terminate called without an active exception.
  • iterations = range(0, 8) and anything above will deadlock.
  • Don't log the points and the problem goes away.

  • 0.20: broken
  • 0.19: broken
  • 0.18: broken
  • 0.17: broken
  • 0.16: works!!!

Add a rr.connect_tcp() in there and you can go for some really unhinged results 🫠

0
1
2
3
terminate called recursively
terminate called without an active exception
terminate called recursively

Unrelated, but I just remembered that the code above will use the same RecordingID for all iterations of the loop because of ye ol' authkey() shenanigans in the Python SDK.

This is not the source of the problem in this case, just a reminder that this thing exists and could very well had been.

@abey79
Copy link
Member

abey79 commented Nov 7, 2024

Funnily this would be prevented if test were isolated as per #8023

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
💣 crash crash, deadlock/freeze, do-no-start 🚢 CI
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants