Add lightweight locking C API #108724

colesbury · 2023-08-31T15:50:37Z

Feature or enhancement

Implementing PEP 703 will require adding additional fine grained locks and other synchronization mechanisms. For good performance, it's important that these locks be "lightweight" in the sense that they don't take up much space and don't require memory allocations to create. Additionally, it's important that these locks are fast in the common uncontended case, perform reasonably under contention, and avoid thread starvation.

Platform provided mutexes like pthread_mutex_t are large (40 bytes on x86-64 Linux) and our current cross-platform wrappers ([1], [2], [3]) require additional memory allocations.

I'm proposing a lightweight mutex (PyMutex) along with internal-only APIs used for building an efficient PyMutex as well as other synchronization primitives. The design is based on WebKit's WTF::Lock and WTF::ParkingLot, which is described in detail in the Locking in WebKit blog post. (The design has also been ported to Rust in the parking_lot crate.)

Public API

The public API (in Include/cpython) would provide a PyMutex that occupies one byte and can be zero-initialized:

typedef struct PyMutex { uint8_t state; } PyMutex;
void PyMutex_Lock(PyMutex *m);
void PyMutex_Unlock(PyMutex *m);

I'm proposing making PyMutex public because it's useful in C extensions, such as NumPy, (as opposed to C++) where it can be a pain to wrap cross-platform synchronization primitives.

Internal APIs

The internal only API (in Include/internal) would provide APIs for building PyMutex and other synchronization primitives. The main addition is a compare-and-wait primitive, like Linux's futex or Window's WaitOnAdress.

int _PyParkingLot_Park(const void *address, const void *expected, size_t address_size,
                       _PyTime_t timeout_ns, void *arg, int detach)

The API closely matches WaitOnAddress but with two additions: arg is an optional, arbitrary pointer passed to the wake-up thread and detach indicates whether to release the GIL (or detach in --disable-gil builds) while waiting. The additional arg pointer allows the locks to be only one byte (instead of at least pointer sized), since it allows passing additional (stack allocated) data between the waiting and the waking thread.

The wakeup API looks like:

// wake up all threads waiting on `address`
void _PyParkingLot_UnparkAll(const void *address);

// or wake up a single thread
_PyParkingLot_Unpark(address, unpark, {
    // code here is executed after the thread to be woken up is identified but before we wake it up
    void *arg = unpark->arg;
    int more_waiters = unpark->more_waiters;
    ...
});

_PyParkingLot_Unpark is currently a macro that takes a code block. For PyMutex we need to update the mutex bits after we identify the thread but before we actually wake it up.

cc @ericsnowcurrently

Linked PRs

The text was updated successfully, but these errors were encountered:

vstinner · 2023-09-01T16:09:45Z

Would it be possible to start with no public API for now, and consider adding the public API later?

colesbury · 2023-09-01T16:57:13Z

Yes, it would be possible to put everything in the internal API for now.

vstinner · 2023-09-01T19:24:02Z

Yes, it would be possible to put everything in the internal API for now.

It's way easier for me to review a change if the API is restricted to the internal C API, than adding a public C API (need doc, tests, review the design, error cases, etc.).

`PyMutex` is a one byte lock with fast, inlineable lock and unlock functions for the common uncontended case. The design is based on WebKit's `WTF::Lock`. PyMutex is built using the `_PyParkingLot` APIs, which provides a cross-platform futex-like API (based on WebKit's `WTF::ParkingLot`). This internal API will be used for building other synchronization primitives used to implement PEP 703, such as one-time initialization and events.

ericsnowcurrently · 2023-09-13T20:24:36Z

Having read through https://webkit.org/blog/6161/locking-in-webkit/ now, I'm wondering if we could make wide use of the new lock API in place of PyThread_lock_type.

Also, the benchmark results in that paper imply that the picture isn't quite so rosy with 2 or 3 threads.

colesbury · 2023-09-13T21:27:42Z

I'm wondering if we could make wide use of the new lock API in place of PyThread_lock_type

Yes. I did this to some extent in nogil-3.12 (e.g., colesbury/nogil-3.12@45bdd27). I did not do this as part of the linked PR to avoid increasing the scope.

Also, the benchmark results in that paper imply that the picture isn't quite so rosy with 2 or 3 threads

Assuming you're referring to this benchmark result, the results for 2-3 threads are actually quite good. The most important comparison is WTF::Mutex vs. OS mutex and WTF::Mutex performs much better in that benchmark across the entire range.

One thing to keep in mind is that the benchmark is inherently serial because it's testing lock contention with a very short critical section. It's not a general purpose benchmark -- real applications may have some mix of contention and and parallelizable code -- this benchmark is all contention. This means it will always perform best with 1 thread, and perhaps unintuitively, better with one core than on a multi core machine. I think that's why there's a dip around 3-4 threads for many of the locks implementations (the blog post mentions it's run on a machine with 4 cores).

Here's the equivalent benchmark for PyMutex vs PyThread_type_lock on Linux using 16 cores:

(interactive link)

ericsnowcurrently · 2023-09-13T22:18:59Z

I think that's why there's a dip around 3-4 threads for many of the locks implementations

My understanding is that the dip is at least partly due to the approach WTF::Lock takes to "fairness" (and "barging"), where the next parked thread will lose out to an unparked thread that tries to acquire the lock at the right moment. IIUC, that would lead to skewed results the fewer additional threads you have. (See https://webkit.org/wp-content/uploads/fairness_wtflock_small.png in the "Lock Fairness" section.)

Regardless, you're completely correct that this appears to be a drastic improvement over the OS mutex. 😄

PyMutex is a one byte lock with fast, inlineable lock and unlock functions for the common uncontended case. The design is based on WebKit's WTF::Lock. PyMutex is built using the _PyParkingLot APIs, which provides a cross-platform futex-like API (based on WebKit's WTF::ParkingLot). This internal API will be used for building other synchronization primitives used to implement PEP 703, such as one-time initialization and events. This also includes tests and a mini benchmark in Tools/lockbench/lockbench.py to compare with the existing PyThread_type_lock. Uncontended acquisition + release: * Linux (x86-64): PyMutex: 11 ns, PyThread_type_lock: 44 ns * macOS (arm64): PyMutex: 13 ns, PyThread_type_lock: 18 ns * Windows (x86-64): PyMutex: 13 ns, PyThread_type_lock: 38 ns PR Overview: The primary purpose of this PR is to implement PyMutex, but there are a number of support pieces (described below). * PyMutex: A 1-byte lock that doesn't require memory allocation to initialize and is generally faster than the existing PyThread_type_lock. The API is internal only for now. * _PyParking_Lot: A futex-like API based on the API of the same name in WebKit. Used to implement PyMutex. * _PyRawMutex: A word sized lock used to implement _PyParking_Lot. * PyEvent: A one time event. This was used a bunch in the "nogil" fork and is useful for testing the PyMutex implementation, so I've included it as part of the PR. * pycore_llist.h: Defines common operations on doubly-linked list. Not strictly necessary (could do the list operations manually), but they come up frequently in the "nogil" fork. ( Similar to https://man.freebsd.org/cgi/man.cgi?queue) --------- Co-authored-by: Eric Snow <ericsnowcurrently@gmail.com>

Some WASM platforms have POSIX semaphores, but not sem_timedwait.

PyMutex is a one byte lock with fast, inlineable lock and unlock functions for the common uncontended case. The design is based on WebKit's WTF::Lock. PyMutex is built using the _PyParkingLot APIs, which provides a cross-platform futex-like API (based on WebKit's WTF::ParkingLot). This internal API will be used for building other synchronization primitives used to implement PEP 703, such as one-time initialization and events. This also includes tests and a mini benchmark in Tools/lockbench/lockbench.py to compare with the existing PyThread_type_lock. Uncontended acquisition + release: * Linux (x86-64): PyMutex: 11 ns, PyThread_type_lock: 44 ns * macOS (arm64): PyMutex: 13 ns, PyThread_type_lock: 18 ns * Windows (x86-64): PyMutex: 13 ns, PyThread_type_lock: 38 ns PR Overview: The primary purpose of this PR is to implement PyMutex, but there are a number of support pieces (described below). * PyMutex: A 1-byte lock that doesn't require memory allocation to initialize and is generally faster than the existing PyThread_type_lock. The API is internal only for now. * _PyParking_Lot: A futex-like API based on the API of the same name in WebKit. Used to implement PyMutex. * _PyRawMutex: A word sized lock used to implement _PyParking_Lot. * PyEvent: A one time event. This was used a bunch in the "nogil" fork and is useful for testing the PyMutex implementation, so I've included it as part of the PR. * pycore_llist.h: Defines common operations on doubly-linked list. Not strictly necessary (could do the list operations manually), but they come up frequently in the "nogil" fork. ( Similar to https://man.freebsd.org/cgi/man.cgi?queue) --------- Co-authored-by: Eric Snow <ericsnowcurrently@gmail.com>

) Some WASM platforms have POSIX semaphores, but not sem_timedwait.

ngoldbaum · 2023-11-02T16:27:23Z

Hi @colesbury, I hate to resurrect a closed issue, but wanted to raise this point as something worth spending some time on:

I'm proposing making PyMutex public because it's useful in C extensions, such as NumPy, (as opposed to C++) where it can be a pain to wrap cross-platform synchronization primitives.

I'm currently looking at what it would take to add proper locking to NumPy (thinking about NEP 55, but NumPy has a number of threading bugs right now that are hidden by the GIL) and it looks like indeed there are very few cross-platform synchronization libraries I could use - it would be very nice for NumPy if CPython added a public version of PyMutex.

colesbury · 2023-11-02T16:46:28Z

@ngoldbaum Yeah, I'm planning on coming back to this in a bit to propose the public APIs. I'll probably wait until #111569 and maybe a few other changes land first.

Using a system clock causes issues when it's being updated by the system administrator, NTP, Daylight Saving Time, or anything else. Sadly, on Windows, the monotonic clock resolution is around 15.6 ms. Example: python#85876

In general, when `_PyThreadState_GET()` is non-NULL then the current thread is "attached", but there is a small window during `PyThreadState_DeleteCurrent()` where that's not true: tstate_delete_common is called when the thread is detached, but before current_fast_clear(). This updates _PySemaphore_Wait() to handle that case.

In general, when `_PyThreadState_GET()` is non-NULL then the current thread is "attached", but there is a small window during `PyThreadState_DeleteCurrent()` where that's not true: tstate_delete_common() is called when the thread is detached, but before current_fast_clear(). Co-authored-by: Eric Snow <ericsnowcurrently@gmail.com>

…ython#116483) In general, when `_PyThreadState_GET()` is non-NULL then the current thread is "attached", but there is a small window during `PyThreadState_DeleteCurrent()` where that's not true: tstate_delete_common() is called when the thread is detached, but before current_fast_clear(). Co-authored-by: Eric Snow <ericsnowcurrently@gmail.com>

colesbury added type-feature A feature request or enhancement 3.13 bugs and security fixes labels Aug 31, 2023

colesbury mentioned this issue Aug 31, 2023

PEP 703 -- Making the Global Interpreter Lock Optional in CPython #108219

Open

colesbury added the topic-C-API label Aug 31, 2023

colesbury mentioned this issue Sep 1, 2023

Split up _testinternalcapi.c #108777

Closed

colesbury self-assigned this Sep 11, 2023

bedevere-app bot mentioned this issue Sep 12, 2023

gh-108724: Add PyMutex and _PyParkingLot APIs #109344

Merged

colesbury added a commit to colesbury/cpython that referenced this issue Sep 19, 2023

pythongh-108724: Fix _PySemaphore compile error on WASM

a9c0a5b

Some WASM platforms have POSIX semaphores, but not sem_timedwait.

bedevere-app bot mentioned this issue Sep 19, 2023

gh-108724: Fix _PySemaphore compile error on WASM #109583

Merged

ericsnowcurrently pushed a commit that referenced this issue Sep 19, 2023

gh-108724: Fix _PySemaphore compile error on WASM (gh-109583)

9df6712

Some WASM platforms have POSIX semaphores, but not sem_timedwait.

colesbury closed this as completed Sep 21, 2023

csm10495 pushed a commit to csm10495/cpython that referenced this issue Sep 28, 2023

pythongh-108724: Fix _PySemaphore compile error on WASM (pythongh-109583

cd452fb

) Some WASM platforms have POSIX semaphores, but not sem_timedwait.

bedevere-app bot mentioned this issue Nov 17, 2023

gh-108724: Use _PyTime_GetMonotonicClock() in parking_lot.c #112222

Closed

bedevere-app bot mentioned this issue Mar 7, 2024

gh-108724: Fix _PySemaphore_Wait call during thread deletion #116483

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add lightweight locking C API #108724

Add lightweight locking C API #108724

colesbury commented Aug 31, 2023 •

edited by bedevere-app bot

Loading

vstinner commented Sep 1, 2023

colesbury commented Sep 1, 2023

vstinner commented Sep 1, 2023

ericsnowcurrently commented Sep 13, 2023 •

edited

Loading

colesbury commented Sep 13, 2023

ericsnowcurrently commented Sep 13, 2023

ngoldbaum commented Nov 2, 2023 •

edited

Loading

colesbury commented Nov 2, 2023

Add lightweight locking C API #108724

Add lightweight locking C API #108724

Comments

colesbury commented Aug 31, 2023 • edited by bedevere-app bot Loading

Feature or enhancement

Public API

Internal APIs

Linked PRs

vstinner commented Sep 1, 2023

colesbury commented Sep 1, 2023

vstinner commented Sep 1, 2023

ericsnowcurrently commented Sep 13, 2023 • edited Loading

colesbury commented Sep 13, 2023

ericsnowcurrently commented Sep 13, 2023

ngoldbaum commented Nov 2, 2023 • edited Loading

colesbury commented Nov 2, 2023

colesbury commented Aug 31, 2023 •

edited by bedevere-app bot

Loading

ericsnowcurrently commented Sep 13, 2023 •

edited

Loading

ngoldbaum commented Nov 2, 2023 •

edited

Loading