
Thread spuriously marked dead after interrupting a join call #90882

Open
KevinShweh mannequin opened this issue Feb 11, 2022 · 24 comments
Labels
3.9 only security fixes 3.10 only security fixes 3.11 only security fixes stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error

Comments


KevinShweh mannequin commented Feb 11, 2022

BPO 46726
Nosy @tim-one, @pitrou, @vstinner, @serhiy-storchaka, @eryksun, @bensimner, @snoopjedi

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


GitHub fields:

assignee = None
closed_at = None
created_at = <Date 2022-02-11.23:47:13.610>
labels = ['type-bug', 'library', '3.9', '3.10', '3.11']
title = 'Thread spuriously marked dead after interrupting a join call'
updated_at = <Date 2022-02-14.02:08:58.437>
user = 'https://bugs.python.org/KevinShweh'

bugs.python.org fields:

activity = <Date 2022-02-14.02:08:58.437>
actor = 'eryksun'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = ['Library (Lib)']
creation = <Date 2022-02-11.23:47:13.610>
creator = 'Kevin Shweh'
dependencies = []
files = []
hgrepos = []
issue_num = 46726
keywords = []
message_count = 17.0
messages = ['413106', '413107', '413108', '413109', '413110', '413118', '413140', '413146', '413155', '413162', '413164', '413166', '413174', '413191', '413202', '413203', '413208']
nosy_count = 8.0
nosy_names = ['tim.peters', 'pitrou', 'vstinner', 'serhiy.storchaka', 'eryksun', 'Kevin Shweh', 'bjs', 'SnoopJeDi']
pr_nums = []
priority = 'normal'
resolution = None
stage = None
status = 'open'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue46726'
versions = ['Python 3.9', 'Python 3.10', 'Python 3.11']


KevinShweh mannequin commented Feb 11, 2022

This code in Thread._wait_for_tstate_lock:

try:
    if lock.acquire(block, timeout):
        lock.release()
        self._stop()
except:
    if lock.locked():
        # bpo-45274: lock.acquire() acquired the lock, but the function
        # was interrupted with an exception before reaching the
        # lock.release(). It can happen if a signal handler raises an
        # exception, like CTRL+C which raises KeyboardInterrupt.
        lock.release()
        self._stop()
    raise

has a bug. The "if lock.locked()" check doesn't check whether this code managed to acquire the lock. It checks if *anyone at all* is holding the lock. The lock is almost always locked, so this code will perform a spurious call to self._stop() if it gets interrupted while trying to acquire the lock.
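The distinction is easy to demonstrate outside the threading internals: `threading.Lock.locked()` reports whether the lock is held by *anyone*, not whether the current code path acquired it. A minimal illustration:

```python
import threading

lock = threading.Lock()
lock.acquire()  # stand-in for the still-running thread holding its tstate lock

# locked() answers "is this lock held by anyone?", never "did *we* acquire it?"
# So an except block that tests lock.locked() after an interrupted acquire()
# cannot distinguish "our acquire succeeded but got interrupted before release"
# from "the lock is simply still held by the running thread".
print(lock.locked())  # True, even though this code path's acquire() never ran
```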

Thread.join uses this method to wait for a thread to finish, so a thread will spuriously be marked dead if you interrupt a join call with Ctrl-C while it's trying to acquire the lock. Here's a reproducer:

import time
import threading
 
event = threading.Event()
 
def target():
    event.wait()
    print('thread done')
 
t = threading.Thread(target=target)
t.start()
print('joining now')
try:
    t.join()
except KeyboardInterrupt:
    pass
print(t.is_alive())
event.set()

Interrupt this code with Ctrl-C during the join(), and print(t.is_alive()) will print False.

@KevinShweh KevinShweh mannequin added 3.9 only security fixes 3.10 only security fixes 3.11 only security fixes stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error labels Feb 11, 2022

bensimner mannequin commented Feb 12, 2022

This is a duplicate of https://bugs.python.org/issue45274, but the patch there did not fix it.

I've just added a PR there (or should it go here?) that I think fixes this.

The issue is that the lock.locked() call just checks that *someone* has the lock, not that the previous acquire() is what got the lock.
If it's just that the tstate lock is held because the thread is still running, then it's premature to release() the lock.


KevinShweh mannequin commented Feb 12, 2022

bpo-45274 was a subtly different issue. That was a problem that happened if the thread got interrupted *between* the acquire and the release, causing it to *not* release the lock and *not* perform end-of-thread cleanup.

The fix for that issue caused this issue, which happens if the thread gets interrupted *during* the acquire, in which case it *does* release the lock (that someone else is holding) and *does* perform end-of-thread cleanup even though it's not supposed to do either of those things.


KevinShweh mannequin commented Feb 12, 2022

The PR you submitted doesn't work, unfortunately. It essentially reintroduces bpo-45274. If this line:

    if locked := lock.acquire(block, timeout):

gets interrupted between the acquire and the assignment, locked is still False. That's rare, but so is an interruption between the acquire and the release, which is the original form of bpo-45274.
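The window is visible in the bytecode: the acquire happens inside the call instruction, and the assignment to `locked` is a separate `STORE_FAST` that a pending signal can interrupt first. A quick look with the `dis` module (exact opcode names vary by CPython version, but the store is always a distinct instruction after the call returns):

```python
import dis

def f(lock, block, timeout):
    # The walrus form from the PR under discussion.
    if locked := lock.acquire(block, timeout):
        lock.release()

ops = [ins.opname for ins in dis.get_instructions(f)]
# The STORE_FAST that assigns `locked` executes after the acquire() call has
# already returned, so a signal arriving between the two leaves `locked`
# falsy even though the lock was acquired.
print("STORE_FAST" in ops)  # True
```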


bensimner mannequin commented Feb 12, 2022

You are right.

What one really needs here is a way to know *who* owns the lock, but threading.Lock does not provide that.

The race on := is much smaller than the original race and I suspect in practice will be very hard to hit.

As the original bpo notes, it may not be possible to write a complete fix for this in pure Python, after all there may always be another interrupt between the except and the second attempted release.

The issue with the solution was that it turned a relatively hard-to-hit race condition into a really easy-to-hit one, but perhaps the outcome is slightly less bad?


eryksun commented Feb 12, 2022

The race on := is much smaller than the original race
and I suspect in practice will be very hard to hit.

In Windows, the acquire() method of a lock can't be interrupted. Thus, in the main thread, an exception from Ctrl+C gets raised as soon as acquire() returns. This exception definitely will interrupt the assignment. Here's a workaround:

global scope:

    _WINDOWS = _sys.platform == 'win32'

in _wait_for_tstate_lock():

    acquired = None
    try:
        if acquired := lock.acquire(block, timeout):
            lock.release()
            self._stop()
    except:
        if _WINDOWS and acquired is None:
            acquired = True
        if acquired:
            lock.release()
            self._stop()
        raise

This doesn't help in POSIX if the STORE_FAST instruction that assigns acquired gets interrupted. This can't be distinguished from acquire() itself getting interrupted. But at least the window for this is as small as possible.


tim-one commented Feb 12, 2022

Eryk, I don't think that workaround is solid on Windows in all cases. For example, if .join() is called with a timeout, the same timeout is passed to lock.acquire(block, timeout). If the acquire() in fact times out, but the store to the acquired variable is interrupted, if _WINDOWS and acquired is None will succeed, despite that the lock is still locked. Then we go on to - again - incorrectly release the lock and call _stop().

But please don't "repair" that: platform-specific tricks aren't on a path to an actual solution ;-) If the latter requires some new C code, fine.


eryksun commented Feb 12, 2022

If the acquire() in fact times out, but the store to the acquired
variable is interrupted, if _WINDOWS and acquired is None will
succeed, despite that the lock is still locked

Yeah, my proposed workaround is no good, so we can't resolve this without a more fundamental solution. Are you looking into a way to prevent the STORE_FAST instruction from getting interrupted by an asynchronous exception?


tim-one commented Feb 13, 2022

Na, we've been doing excruciatingly clever stuff to deal with thread shutdown for decades, and it always proves to be wrong in some way. Even if code like

except:
    if lock.locked():
        lock.release()
        self._stop()
    raise

did work as hoped for, it would still be broken, but in a different way: suppose we really did acquire the tstate lock because the thread actually ended while we were in acquire(). But then a signal dumped us into the except block before doing the release. Oops! There's nothing I can see to stop another (say) KeyboardInterrupt preventing us from doing the "failsafe" release too. So the lock remains locked forever after (we hold the lock now, and missed our chances to release it). And I think that's pretty likely: if I don't see an instant response to Ctrl-C, I'm likely to do another very soon after.

So I don't think catching exceptions can be made to work for this. Or finally blocks either. Indeed, it appears that any way whatsoever of spelling lock.release() in Python can be defeated by an unfortunately timed signal.

Which isn't unique to this code, of course. The failure modes of this code just happen to be unusually visible ;-)

Two other approaches come to mind:

  • Wrap everything needed in a custom C function. CPython can't interfere with its control flow.

  • Add new sys gimmicks to suppress and re-enable raising Python level exceptions for signals. Then, e.g., something here like:

with sys.delay_signal_exceptions():
    # Dead simple code, like _before_ we "fixed" it ;-)
    # In particular, while Ctrl-C may terminate the `acquire()` call,
    # KeyboardInterrupt will not be raised until the `with` block
    # exits.
    # Possibly intractable: arranging then for the traceback to
    # point at the code where the exception would have been raised
    # had temporary suspension not been enabled. Then again, since
    # it's not _actually_ raised there at the Python level, maybe
    # it's a Good Thing to ignore.
    if lock.acquire(block, timeout):
        lock.release()
        self._stop()

The second way is more general, but would probably require a PEP.


eryksun commented Feb 13, 2022

Wrap everything needed in a custom C function.

Maybe add an acquire_and_release() method:

    static PyObject *
    lock_PyThread_acquire_and_release_lock(
            lockobject *self, PyObject *args, PyObject *kwds)
    {
        _PyTime_t timeout;
        if (lock_acquire_parse_args(args, kwds, &timeout) < 0)
            return NULL;

        PyLockStatus r = acquire_timed(self->lock_lock, timeout);

        if (r == PY_LOCK_INTR) {
            return NULL;
        }

        if (r == PY_LOCK_ACQUIRED) {
            PyThread_release_lock(self->lock_lock);
            self->locked = 0;
        }

        return PyBool_FromLong(r == PY_LOCK_ACQUIRED);
    }


tim-one commented Feb 13, 2022

Maybe add an acquire_and_release() method

Bingo - that should do the trick, in an "obviously correct" way. Of course it's of limited applicability, but fine by me.

Will you open a PR with this code?


tim-one commented Feb 13, 2022

While bundling the lock.release() into C makes that bulletproof, is there a bulletproof way to guarantee that self._stop() gets called if the acquire_and_release() succeeds? Offhand, I don't see a reason for why that isn't just as vulnerable to getting skipped due to an unfortunate signal.

Indeed, huge mounds of threading.py can leave things in insane states in the presence of by-magic exceptions. Even if code is very careful to put crucial cleanup code in finally blocks so it gets executed "no matter what", there's nothing to stop code in finally blocks from getting skipped over due to by-magic exceptions too.

It's an eternal marvel that anything ever works at all ;-)


eryksun commented Feb 13, 2022

is there a bulletproof way to guarantee that self._stop() gets
called if the acquire_and_release() succeeds?

I don't think it's critical. But we still should deal with the common case in Windows in which a KeyboardInterrupt is raised immediately after the method returns. For example:

    try:
        if lock.acquire_and_release(block, timeout):
            self._stop()
    except:
        if not lock.locked():
            self._stop()

The proposed acquire_and_release() method can also be used in threading._shutdown(), when it iterates _shutdown_locks to join non-daemon threads.


tim-one commented Feb 13, 2022

is there a bulletproof way to guarantee that self._stop() gets
called if the acquire_and_release() succeeds?

I don't think it's critical.

Agreed! Anything at the Python level that cares whether the thread is still alive will call _wait_for_tstate_lock() again, and get another chance each time to notice that the tstate lock has been freed.

Instead of:

    try:
        if lock.acquire_and_release(block, timeout):
            self._stop()
    except:
        if not lock.locked():
            self._stop()

I'd prefer:

    try:
        lock.acquire_and_release(block, timeout)
    finally:
        if not lock.locked():
            self._stop()

Because, at the Python level, whether acquire_and_release() succeeded at its task depends entirely on whether the lock is free afterward. That's what we're _really_ looking for, and is so on all possible platforms.
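The control flow of that finally-based pattern can be sketched with a stand-in lock object. Note the caveat: `acquire_and_release()` below is plain Python, not the atomic C primitive proposed above, so this only illustrates the shape of the logic, not its signal-safety.

```python
import threading

class FakeTstateLock:
    """Stand-in for the tstate lock, with the *proposed* (not yet existing)
    acquire_and_release() primitive implemented non-atomically in Python,
    purely to illustrate the finally-based pattern."""
    def __init__(self):
        self._lock = threading.Lock()
        self._lock.acquire()          # held while the "thread" is running

    def thread_finished(self):
        self._lock.release()          # released by the exiting thread

    def locked(self):
        return self._lock.locked()

    def acquire_and_release(self, block=True, timeout=-1):
        got = self._lock.acquire(block, timeout)
        if got:
            self._lock.release()
        return got

def wait_for_tstate_lock(lock, on_stop, block=True, timeout=-1):
    # Success is judged solely by whether the lock is free afterward,
    # regardless of what happened inside the try block.
    try:
        lock.acquire_and_release(block, timeout)
    finally:
        if not lock.locked():
            on_stop()             # analogue of self._stop()

stopped = []
lock = FakeTstateLock()
wait_for_tstate_lock(lock, lambda: stopped.append(True), timeout=0.01)
print(bool(stopped))   # False: lock still held, thread not marked dead
lock.thread_finished()
wait_for_tstate_lock(lock, lambda: stopped.append(True))
print(bool(stopped))   # True: lock free, stop logic ran
```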


eryksun commented Feb 14, 2022

Anything at the Python level that cares whether the thread is
still alive will call _wait_for_tstate_lock() again

It's nice that _maintain_shutdown_locks() gets called in _stop(), but the more important call site is in _set_tstate_lock().

I typed up the example initially with try/finally. Then I changed it to avoid the extra lock.locked() call when there's no exception, but I forgot to add a raise statement:

    try:
        if lock.acquire_and_release(block, timeout):
            self._stop()
    except:
        if not lock.locked():
            self._stop()
        raise


tim-one commented Feb 14, 2022

It's nice that _maintain_shutdown_locks() gets
called in _stop(), but the more important call site is in
_set_tstate_lock().

I didn't have that in mind at all. What at the Python level cares whether the thread is alive? Well, is_alive() does, and it calls _wait_for_tstate_lock(). __repr__() does, and it calls is_alive() (which I added years ago, else repr could keep claiming a thread was alive weeks after it actually ended ;-)). join() does, and it calls _wait_for_tstate_lock().

Anything else? Not off the top of my head. The point is that if _wait_for_tstate_lock() fails to call _stop() for some reason when it "should have" called it, it's not fatal. Anything that _cares_ will call it again. Indeed, it was all designed so that it could be called any number of times, redundantly or not, and without any explicit synchronization.

For the rest, I value clarity above all else here. This code has a long history of being the subject of bug reports. The cost of an "extra" call to .locked() is just trivial, especially given that calls to _wait_for_tstate_lock() are typically rare.


eryksun commented Feb 14, 2022

I didn't have that in mind at all.

I understood what you had in mind, and I don't disagree. I was just highlighting that the only somewhat important work done in _stop() is to clean up the _shutdown_locks set, to keep it from growing too large. But that task is already handled when creating a new thread, and it's arguably more important to handle it there.

@ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
@vstinner

OMG, handling locks and threads in pure Python code, where any tiny operation can be interrupted by a signal, makes the code vulnerable to a lot of race conditions.

Maybe the most critical functions like _wait_for_tstate_lock() should have an implementation written in C where signals are handled explicitly by calling PyErr_CheckSignals(): that way the function does not exit in the middle of its work when an exception is raised.

The code is quite simple, so it should not be hard to rewrite it in C:

    def _wait_for_tstate_lock(self, block=True, timeout=-1):
        lock = self._tstate_lock
        if lock is None:
            return
        if lock.acquire(block, timeout):
            lock.release()
            self._stop()

Moreover, there are C functions to lock/unlock a lock object (by accessing the lockobject.lock_lock member directly).

I dislike adding a public acquire_and_release() method to all lock objects just to fix a single method of the threading module. The threading module is special: it's the code that handles threads, so race conditions are more likely there. Using the threading module in pure Python code should be safe, especially when using context managers.

bbatliner commented Jul 20, 2022

Add new sys gimmicks to suppress and re-enable raising Python level exceptions for signals.

I took a stab at writing something like this that I call a "signal fence": https://stackoverflow.com/a/71330357/5136076 It only works on the main thread, since that's where signals are handled.

Two caveats that maybe you all have some feedback on:

  1. It only handles one exception. As @tim-one aptly points out: "Even if code is very careful to put crucial cleanup code in finally blocks so it gets executed "no matter what", there's nothing to stop code in finally blocks from getting skipped over due to by-magic exceptions too." I think you could theoretically nest as many try-except-finally blocks as you feel is necessary to handle one, two, N "by-magic" exceptions, but this will never achieve 100% correctness. Is there a better way? (edit: seems like the better way is to write this in C)
  2. It only works on the main thread. For exceptions caused by signals this is not a problem. For exceptions caused by the thread's async exception state (PyThreadState_SetAsyncExc), there is no way that I know of to disable these exceptions from interrupting anywhere "by-magic." I wish there was a context manager like sys.delay_async_exceptions() for this case.
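The "signal fence" idea can be sketched as a context manager that records SIGINT instead of letting it raise, then re-delivers it on exit. Everything here is illustrative: the name `defer_sigint` is hypothetical, not a stdlib API, it works on the main thread only, and (as discussed above) a second Ctrl-C arriving after the old handler is restored can still interrupt the re-delivery, which is exactly why a C implementation is attractive.

```python
import signal

class defer_sigint:
    """Main-thread-only sketch of a 'signal fence': while the block runs,
    SIGINT is recorded rather than raising KeyboardInterrupt; any deferred
    signal is handed to the previous handler on exit (typically raising
    KeyboardInterrupt there). Hypothetical API, not part of the stdlib."""

    def __enter__(self):
        self._pending = None
        self._old = signal.signal(signal.SIGINT, self._remember)
        return self

    def _remember(self, signum, frame):
        # Runs instead of the default handler: just note the signal.
        self._pending = (signum, frame)

    def __exit__(self, *exc_info):
        signal.signal(signal.SIGINT, self._old)
        if self._pending is not None:
            # Re-deliver to the old handler; for the default handler this
            # raises KeyboardInterrupt here, *after* the protected block.
            self._old(*self._pending)
        return False
```

A use site would wrap the delicate acquire/release sequence in `with defer_sigint():` so the KeyboardInterrupt surfaces only after the lock bookkeeping is consistent.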


pitrou commented Jul 20, 2022

@bbatliner I think such a generic facility should be written in C for much greater and easier control. I'm not sure about the API, but taking callable arguments sounds like a good idea.

@bbatliner
I'll admit I don't have experience writing Python C extensions. If this were implemented in C, it couldn't be interrupted by an exception raised by a signal, right?


pitrou commented Jul 20, 2022

Exactly, because regular C code doesn't get interrupted by Python signal handlers (except when calling certain CPython C API functions).

If you don't have much experience then perhaps someone else can do it, though that requires a volunteer to step in :-)

espressif-bot pushed a commit to espressif/esp-idf that referenced this issue Jul 21, 2023
idf.py spawns gdb process within a thread and uses Thread.join() to wait
for the gdb process to finish. As CTRL+C(SIGINT) is used by gdb to interrupt the
running program, we catch the SIGINT while waiting on the gdb to finish,
and try Thread.join() again.

With cpython's commit

	commit a22be4943c119fecf5433d999227ff78fc2e5741
	Author: Victor Stinner <vstinner@python.org>
	Date:   Mon Sep 27 14:20:31 2021 +0200

	    bpo-45274: Fix Thread._wait_for_tstate_lock() race condition (GH-28532)

this logic doesn't work anymore, because cpython internally marks the
thread as stopped when join() is interrupted with an exception. IMHO
this is broken in cpython and there is a bug report about this
python/cpython#90882. Problem is that
waiting on a thread to finish is based on acquiring a lock. Meaning
join() is waiting on _tstate_lock. If this wait is interrupted, the
above referenced commit adds logic that checks whether the lock is held,
meaning the thread is done and marks the thread as stopped. But there is
no way to tell if the lock was acquired by us running join() or if it's
held by someone else e.g. still by the thread bootstrap code. Meaning
the thread is still running.

I may be missing something, but I don't see any reason why to spawn gdb
process within a thread. This change removes the thread and spawns gdb
directly. Instead waiting on a thread, we wait on the process to finish,
replacing join() with wait() and avoiding this problem.

Closes #11871

Signed-off-by: Frantisek Hrbata <frantisek.hrbata@espressif.com>
ilutchenko pushed a commit to ilutchenko/esp-idf that referenced this issue Jul 24, 2023
ilutchenko pushed a commit to ilutchenko/esp-idf that referenced this issue Jul 24, 2023
espressif-bot pushed a commit to espressif/esp-idf that referenced this issue Jul 27, 2023
espressif-bot pushed a commit to espressif/esp-idf that referenced this issue Sep 1, 2023
jonathan-j-lee added a commit to jonathan-j-lee/wpt that referenced this issue Apr 6, 2024
A `KeyboardInterrupt` raised during `Thread.join()` will incorrectly
mark that thread as stopped [0, 1], causing subsequent `join()`s to not
block. This can cause wptrunner to exit with a `TestRunnerManager`
thread never having stopped its browser [2], which is leaked [3].

Use a barrier to explicitly synchronize the exit of all threads. A
`KeyboardInterrupt` raised during `Barrier.wait()` should re-decrement
the number of waiters [4], meaning the `ManagerGroup` can `wait()`
again.

Marking the `TestRunnerManager` threads as non-`daemon`ic would require
them to exit for the process to exit (thus fixing browser leakage), but
those threads may still log after `mozlog` shutdown.

[0]: python/cpython#90882
[1]: https://github.com/python/cpython/blob/558b517b/Lib/threading.py#L1146-L1178
[2]: https://github.com/web-platform-tests/wpt/blob/f8fb78a2/tools/wptrunner/wptrunner/testrunner.py#L441
[3]: https://crbug.com/330236796
[4]: https://github.com/python/cpython/blob/558b517b/Lib/threading.py#L702-L728
jonathan-j-lee added a commit to jonathan-j-lee/wpt that referenced this issue Apr 6, 2024
jonathan-j-lee added a commit to web-platform-tests/wpt that referenced this issue Apr 8, 2024
moz-v2v-gh pushed a commit to mozilla/gecko-dev that referenced this issue Apr 15, 2024
… waits for all threads to stop, a=testonly

ErichDonGubler pushed a commit to erichdongubler-mozilla/firefox that referenced this issue Apr 16, 2024
… waits for all threads to stop, a=testonly

@abrahammurciano
In the 3.13 branch the implementation of Thread.join is just a wrapper around _thread._ThreadHandle.join, which is implemented in C.

https://github.com/python/cpython/blob/3.13/Lib/threading.py#L1090

self._handle.join(timeout)

Does this mean this issue will be fixed by this change in Python 3.13?

fofoni commented May 22, 2024

Looks like it's really fixed! Just tested the reproducer in the OP:

$ PYENV_VERSION=3.6 python threading-bug.py  # also same result in 3.7, 3.8, and 3.13.0b1
joining now
^CTrue
thread done
$

and:

$ PYENV_VERSION=3.9 python threading-bug.py  # also same result in 3.10, 3.11, and 3.12
joining now
^CFalse
$

(Note that this bug was really introduced only in 3.9.)
