crashes when shutting down threads #517

Open
xrme opened this issue Sep 23, 2024 · 4 comments

@xrme
Member

xrme commented Sep 23, 2024

There are two reported crashes, one with the 1.13 release and another with 1.12.2.

Here is a stack trace at the time of the signal with ccl-1.13:

```
# gdb /opt/src/ccl-1.13/lx86cl64 CoreDump
GNU gdb (Ubuntu 15.0.50.20240403-0ubuntu1) 15.0.50.20240403-git
Copyright (C) 2024 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
      <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /opt/src/ccl-1.13/lx86cl64...
[New LWP 3685091]
[New LWP 3685084]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `/opt/src/ccl-1.13/lx86cl64 -I /var/lib/jenkins/workspace/acl2-testing-user-01/s'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  find_foreign_rsp (tcr=0x7cfdc1464530, foreign_area=0x0, rsp=137429311175416) at ../x86-exceptions.c:1655
1655      if (((BytePtr)rsp < foreign_area->low) ||
[Current thread is 1 (Thread 0x7cfdc14646c0 (LWP 3685091))]
(gdb) where
#0  find_foreign_rsp (tcr=0x7cfdc1464530, foreign_area=0x0, rsp=137429311175416) at ../x86-exceptions.c:1655
#1  handle_signal_on_foreign_stack (tcr=0x7cfdc1464530, handler=0x422390 <interrupt_handler>, signum=30, info=0x7cfde1a92830, context=0x7cfde1a92700, return_address=137429852115744) at ../x86-exceptions.c:1727
#2  <signal handler called>
#3  0x00007cfde1925c1b in __GI_mprotect () at ../sysdeps/unix/syscall-template.S:117
#4  0x0000000000428066 in UnProtectMemory (nbytes=<optimized out>, addr=<optimized out>) at ../memory.c:284
#5  delete_protected_area (p=0x7cfdb8000c60) at ../memory.c:790
#6  condemn_area_holding_area_lock (a=0x7cfdb8000c90) at ../memory.c:826
#7  0x0000000000424772 in shutdown_thread_tcr (arg=arg@entry=0x7cfdc1464530) at ../thread_manager.c:1397
#8  0x0000000000424a26 in tcr_cleanup (arg=arg@entry=0x7cfdc1464530) at ../thread_manager.c:1488
#9  0x0000000000424b6a in lisp_thread_entry (param=0x7ffd9aabdf90) at ../thread_manager.c:1688
#10 0x00007cfde189ca94 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:447
#11 0x00007cfde1929c3c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78
(gdb)
```

There was also a crash dump from September 10 with ccl-1.12.2:

```
# gdb /opt/src/ccl-1.12.2/lx86cl64 CoreDump
GNU gdb (Ubuntu 15.0.50.20240403-0ubuntu1) 15.0.50.20240403-git
Copyright (C) 2024 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
      <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /opt/src/ccl-1.12.2/lx86cl64...
[New LWP 2087772]
[New LWP 2087765]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `/opt/src/ccl-1.12.2/lx86cl64 -I /var/lib/jenkins/workspace/acl2-master/saved_ac'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  find_foreign_rsp (tcr=0x75a3ec464530, foreign_area=0x0, rsp=129346904144632) at ../x86-exceptions.c:1651

warning: 1651 ../x86-exceptions.c: No such file or directory
[Current thread is 1 (Thread 0x75a3ec4646c0 (LWP 2087772))]
(gdb) where
#0  find_foreign_rsp (tcr=0x75a3ec464530, foreign_area=0x0, rsp=129346904144632) at ../x86-exceptions.c:1651
#1  handle_signal_on_foreign_stack (tcr=0x75a3ec464530, handler=0x422230 <interrupt_handler>, signum=30, info=0x75a40cb24830, context=0x75a40cb24700, return_address=129347445084960) at ../x86-exceptions.c:1723
#2  <signal handler called>
#3  0x000075a40c925c1b in __GI_mprotect () at ../sysdeps/unix/syscall-template.S:117
#4  0x0000000000428017 in UnProtectMemory (nbytes=<optimized out>, addr=<optimized out>) at ../memory.c:284
#5  delete_protected_area (p=0x75a3e4000c60) at ../memory.c:790
#6  condemn_area_holding_area_lock (a=0x75a3e4000c90) at ../memory.c:826
#7  0x000000000042461b in shutdown_thread_tcr (arg=arg@entry=0x75a3ec464530) at ../thread_manager.c:1390
#8  0x00000000004248d6 in tcr_cleanup (arg=arg@entry=0x75a3ec464530) at ../thread_manager.c:1481
#9  0x0000000000424a1a in lisp_thread_entry (param=0x7ffceb394de0) at ../thread_manager.c:1681
#10 0x000075a40c89ca94 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:447
#11 0x000075a40c929c3c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78
(gdb)
```

@xrme
Member Author

xrme commented Sep 23, 2024

This may happen when the lisp is exiting, but that's not completely clear.

@xrme
Member Author

xrme commented Oct 17, 2024

The signum=30 in frame #1 is SIGPWR, which is the signal CCL uses on Linux to implement process-interrupt.

I'm puzzled about how we end up receiving this signal, because there are only two threads running. The one that crashes is in the process of exiting, and the other seems to be minding its own business.

@xrme
Member Author

xrme commented Oct 18, 2024

In the stack trace from the 1.13 crash, frame #0 shows that foreign_area is NULL, which is the immediate cause of the crash. The function shutdown_thread_tcr has just cleared all of the area pointers in the TCR (tcr->vs_area, tcr->ts_area, tcr->cs_area), and we're in the middle of a call to condemn_area_holding_area_lock, apparently trying to condemn the vstack area.

@xrme
Copy link
Member Author

xrme commented Oct 19, 2024

Commit 524429b masks SIGNAL_FOR_PROCESS_INTERRUPT (aka SIGPWR on Linux) while a thread is exiting.

I hope this will prevent crashes like the ones above, but I still don't understand how or why the exiting thread is receiving the signal.
