Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Deadlock cycle from thread pool implementation #31

Open
adlai opened this issue Nov 3, 2024 · 3 comments
Open

[BUG] Deadlock cycle from thread pool implementation #31

adlai opened this issue Nov 3, 2024 · 3 comments
Assignees
Labels
bug Failures exists between the ears of programmers.

Comments

@adlai
Copy link
Collaborator

adlai commented Nov 3, 2024

Describe the bug
As initially reported in adlai#5, SBCL occasionally reports deadlock caused by the thread pool.

To Reproduce
Steps to reproduce the behavior:

  1. Run multiple parallel threads that constantly produce new tasks, similarly to scalpl's actors.
  2. Wait for RNGesus to schedule the deadlock cycle.
  3. Receive Condition of type SB-THREAD:THREAD-DEADLOCK
  4. Report this fact, and possibly additional findings, in this Issue; or invoke the ABORT restart and enjoy the hardworking actors.lisp rescheduling of the tasks.

Expected behavior
Ideally, either there should not be any deadlock; alternatively, tasks could be restartable automatically.

Screenshots or Error Logs

Deadlock cycle detected:
    #1=#<SB-THREAD:THREAD #2="ChanL Thread Pool Worker" RUNNING {1001369D53}>
  waited for:
    #<SB-THREAD:MUTEX "thread pool lock" free owner=0>
  owned by:
    #<SB-THREAD:THREAD "ChanL Thread Pool Worker" waiting on: #3=#<MUTEX "thread leader lock" contested owner=#2#> {100163B403}>
  waited for:
    (#3#)
  owned by:
    #1#
   [Condition of type SB-THREAD:THREAD-DEADLOCK]

Restarts:
 0: [ABORT] abort thread (#<THREAD "ChanL Thread Pool Worker" RUNNING {1001369D53}>)

Backtrace:
  0: (SB-THREAD::CHECK-DEADLOCK)
  1: (SB-THREAD::%WAIT-FOR-MUTEX #<SB-THREAD:MUTEX "thread pool lock" free owner=0> NIL NIL NIL NIL NIL NIL)
  2: ((FLET "CLEANUP-FUN-8" :IN SB-THREAD::%CONDITION-WAIT)) [cleanup]
  3: ((FLET "WITHOUT-INTERRUPTS-BODY-3" :IN SB-THREAD::%CONDITION-WAIT))
  4: (SB-THREAD:CONDITION-WAIT #<SB-THREAD:WAITQUEUE Anonymous condition variable {1003EEFA43}> #<SB-THREAD:MUTEX "thread pool lock" free owner=0> :TIMEOUT NIL)
  5: (BORDEAUX-THREADS:CONDITION-WAIT #<SB-THREAD:WAITQUEUE Anonymous condition variable {1003EEFA43}> #<SB-THREAD:MUTEX "thread pool lock" free owner=0> :TIMEOUT NIL)
      Locals:
        CONDITION-VARIABLE = #<SB-THREAD:WAITQUEUE Anonymous condition variable {1003EEFA43}>
        LOCK = #<SB-THREAD:MUTEX "thread pool lock" free owner=0>
        TIMEOUT = NIL
  6: ((FLET SB-THREAD::WITH-MUTEX-THUNK :IN CHANL::NEW-WORKER-THREAD))
      [No Locals]
  7: ((FLET "WITHOUT-INTERRUPTS-BODY-1" :IN SB-THREAD::CALL-WITH-MUTEX))
      Locals:
        GOT-IT = T
        MUTEX = #<SB-THREAD:MUTEX "thread pool lock" free owner=0>
  8: (SB-THREAD::CALL-WITH-MUTEX #<FUNCTION (FLET SB-THREAD::WITH-MUTEX-THUNK :IN CHANL::NEW-WORKER-THREAD) {7F4363D3E51B}> #<SB-THREAD:MUTEX "thread pool lock" free own$
      Locals:
        GOT-IT = T
        MUTEX = #<SB-THREAD:MUTEX "thread pool lock" free owner=0>
        SB-C::THING = #<FUNCTION (FLET SB-THREAD::WITH-MUTEX-THUNK :IN CHANL::NEW-WORKER-THREAD) {7F4363D3E51B}>
        TIMEOUT = NIL
        WAITP = T
  9: ((FLET SB-THREAD::WITH-MUTEX-THUNK :IN CHANL::NEW-WORKER-THREAD))
      [No Locals]
 10: ((FLET "WITHOUT-INTERRUPTS-BODY-1" :IN SB-THREAD::CALL-WITH-MUTEX))
      Locals:
        GOT-IT = T
        MUTEX = #<SB-THREAD:MUTEX "thread leader lock" contested owner=ChanL Thread Pool Worker>
 11: (SB-THREAD::CALL-WITH-MUTEX #<FUNCTION (FLET SB-THREAD::WITH-MUTEX-THUNK :IN CHANL::NEW-WORKER-THREAD) {7F4363D3E79B}> #<SB-THREAD:MUTEX "thread leader lock" contes$
      Locals:
        GOT-IT = T
        MUTEX = #<SB-THREAD:MUTEX "thread leader lock" contested owner=ChanL Thread Pool Worker>
        SB-C::THING = #<FUNCTION (FLET SB-THREAD::WITH-MUTEX-THUNK :IN CHANL::NEW-WORKER-THREAD) {7F4363D3E79B}>
        TIMEOUT = NIL
        WAITP = T
 12: ((LAMBDA NIL :IN CHANL::NEW-WORKER-THREAD))
      Locals:
        TASK = #<TASK 6:55:56 3P4 USDCAUD fee tracker [TERMINATED] {1006535A93}>
        CHANL::THREAD-POOL = #<CHANL::THREAD-POOL {1003857FC3}>
 13: ((LABELS BORDEAUX-THREADS::%BINDING-DEFAULT-SPECIALS-WRAPPER :IN BORDEAUX-THREADS::BINDING-DEFAULT-SPECIALS))

Source locations for frames 12, 9, and 6:

(defun new-worker-thread (thread-pool &optional task)
  (push (bt:make-thread
         (lambda ()
           (unwind-protect
                (loop
                   (when task
                     (unwind-protect
                          (progn (setf (task-thread task) (current-thread))
                                 (setf (task-status task) :alive)
                                 (funcall (task-function task)))
                       (setf (task-thread task) nil)
                       (setf (task-status task) :terminated)
                       (bt:with-lock-held ((pool-lock thread-pool))
                         (setf (pool-tasks thread-pool)
                               (remove task (pool-tasks thread-pool))))))
                   (bt:with-lock-held ((pool-lock thread-pool))
                     (if (and (pool-soft-limit thread-pool)
                              (> (length (pool-threads thread-pool))
                                 (pool-soft-limit thread-pool)))
                         (return)
                         (incf (free-thread-counter thread-pool))))
                   (bt:with-lock-held ((pool-leader-lock thread-pool))                  ; [12]
                     (bt:with-lock-held ((pool-lock thread-pool))                        ; [9]
                       (setf task
                             (loop until (pool-pending-tasks thread-pool)
                                do (bt:condition-wait (pool-leader-notifier thread-pool) ; [6]
                                                      (pool-lock thread-pool))
                                finally (return (pop (pool-pending-tasks thread-pool)))))
                       (decf (free-thread-counter thread-pool)))))
             (bt:with-lock-held ((pool-lock thread-pool))
               (setf (pool-threads thread-pool)
                     (remove (bt:current-thread) (pool-threads thread-pool))))))
         :name "ChanL Thread Pool Worker")
        (pool-threads thread-pool)))

System (all systems where this has been observed):

  • Hardware: X86-64 AMD EPYC 7543 32-Core Processor
  • OS: Arch Linux, Debian
  • Compiler: SBCL
  • Version: multiple; most recently, 2.2.9.debian

Test Suite
Regrettably, the test suite does not detect this deadlock. I'm considering whether it is worth including a sufficiently vigorous test case that the deadlock occurs with high probability.

@adlai adlai self-assigned this Nov 3, 2024
adlai added a commit to adlai/chanl that referenced this issue Nov 4, 2024
It should help with zkat#31 and #5 although this should be confirmed.
@adlai
Copy link
Collaborator Author

adlai commented Nov 5, 2024

#32 and 4cbbcda begin fixing this

@adlai
Copy link
Collaborator Author

adlai commented Nov 17, 2024

I strongly suspect this is an SBCL bug, rather than a ChanL bug; specifically, although the thread pool uses two locks, they are always taken in the same order -- by threads running the same code! -- and thus deadlock should not be possible. Some complication from the inner lock's release and reacquisition being triggered by a condition variable is probably at fault for the appearant and unintended deadlocks.

I wrote above "apparant and unintended", to slightly shift blame away from SBCL's deadlock detector. Deadlocks definitely do occur.

EDIT: see the printed representation of the deadlock cycle, in the bug report; owner=0 suggests to me that, by the time the cycle is printed, there is not a deadlock anymore, and thus the actual deadlock is caused by some race condition between the condition variables and mutexes provided by SBCL.

@adlai adlai added WONTFIX PEBKAC bug Failures exists between the ears of programmers. and removed WONTFIX PEBKAC labels Nov 20, 2024
@adlai
Copy link
Collaborator Author

adlai commented Nov 24, 2024

There is progress, it is not quite as hopeless as "Stop using SBCL"

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Failures exists between the ears of programmers.
Projects
None yet
Development

No branches or pull requests

1 participant