Fix issue #418. #419

pingzhaozz · 2023-11-06T06:43:19Z

Fixing issue #418,
To avoid entering Parking too frequently in case of cache contention, adding sleep 1ms, 4 times before parking and after old 'spin()'.

The lock-bench result shows:

$cargo run --release 32 2 10000 100
Options {
    n_threads: 32,
    n_locks: 2,
    n_ops: 10000,
    n_rounds: 100,
}

std::sync::Mutex     avg 113.276158ms min 103.870893ms max 131.823024ms
parking_lot::Mutex   avg 81.669426ms  min 72.584055ms  max 88.3535ms
spin::Mutex          avg 161.586476ms min 152.302867ms max 184.132674ms
AmdSpinlock          avg 157.446488ms min 147.091038ms max 180.832205ms

vs. Before:

$ cargo run --release 32 2 10000 100
Options {
    n_threads: 32,
    n_locks: 2,
    n_ops: 10000,
    n_rounds: 100,
}

std::sync::Mutex     avg 113.030539ms min 105.760154ms max 131.87258ms
parking_lot::Mutex   avg 403.756547ms min 326.026509ms max 533.260014ms
spin::Mutex          avg 161.125953ms min 151.708034ms max 177.132377ms
AmdSpinlock          avg 158.042233ms min 148.723058ms max 171.265994ms

also fixed issue #338

Before:

- Running with 9 threads
- 5 iterations inside lock, 5 iterations outside lock
- 2 seconds per test
        name         |    average     |     median     |    std.dev.
parking_lot::Mutex   |    363.574 kHz |    330.716 kHz |    137.608 kHz
std::sync::Mutex     |    699.105 kHz |    613.367 kHz |    223.500 kHz
pthread_mutex_t      |   1125.346 kHz |   1115.156 kHz |     46.531 kHz
- Running with 18 threads
        name         |    average     |     median     |    std.dev.
parking_lot::Mutex   |     26.346 kHz |     26.358 kHz |      0.407 kHz
std::sync::Mutex     |    353.564 kHz |    336.981 kHz |     61.858 kHz
pthread_mutex_t      |    529.260 kHz |    529.450 kHz |     19.034 kHz
- Running with 27 threads
        name         |    average     |     median     |    std.dev.
parking_lot::Mutex   |     17.232 kHz |     17.306 kHz |      0.428 kHz
std::sync::Mutex     |    226.792 kHz |    213.642 kHz |     41.591 kHz
pthread_mutex_t      |    315.459 kHz |    316.935 kHz |     19.549 kHz
- Running with 36 threads
        name         |    average     |     median     |    std.dev.
parking_lot::Mutex   |     12.920 kHz |     12.921 kHz |      0.287 kHz
std::sync::Mutex     |    167.901 kHz |    158.800 kHz |     34.085 kHz
pthread_mutex_t      |    221.542 kHz |    212.404 kHz |     17.621 kHz

After

- Running with 9 threads
- 5 iterations inside lock, 5 iterations outside lock
- 2 seconds per test
        name         |    average     |     median     |    std.dev.
parking_lot::Mutex   |   1179.909 kHz |   1075.771 kHz |    345.058 kHz
std::sync::Mutex     |    808.353 kHz |    853.010 kHz |    128.165 kHz
pthread_mutex_t      |   1382.591 kHz |   1388.152 kHz |     22.390 kHz
- Running with 18 threads
        name         |    average     |     median     |    std.dev.
parking_lot::Mutex   |    571.435 kHz |    539.379 kHz |    157.537 kHz
std::sync::Mutex     |    384.533 kHz |    397.613 kHz |     86.923 kHz
pthread_mutex_t      |    572.527 kHz |    572.878 kHz |     26.068 kHz
- Running with 27 threads
        name         |    average     |     median     |    std.dev.
parking_lot::Mutex   |    368.414 kHz |    345.560 kHz |     85.260 kHz
std::sync::Mutex     |    238.103 kHz |    223.207 kHz |     59.118 kHz
pthread_mutex_t      |    337.518 kHz |    344.361 kHz |     23.691 kHz
- Running with 36 threads
        name         |    average     |     median     |    std.dev.
parking_lot::Mutex   |    268.471 kHz |    259.280 kHz |     50.276 kHz
std::sync::Mutex     |    175.217 kHz |    166.470 kHz |     41.695 kHz
pthread_mutex_t      |    237.988 kHz |    225.297 kHz |     25.596 kHz

To avoid entering Parking too frequently in case of cache contention, adding sleep 1ms, 4 times before parking and after old 'spin()'. Signed-off-by: Ping Zhao <ping.zhao@intel.com>

pingzhaozz · 2023-11-21T03:35:54Z

@Amanieu Sorry to push. I would like to hear some feedbacks about issue #418 and this PR. Will you take a look on it?

Amanieu · 2023-11-21T03:47:25Z

The benchmark numbers here are actually very misleading. Essentially, this benchmark measures how fast a thread can repeatedly acquire and release a lock.

The best performance is obtained when only 1 thread is accessing the lock memory, so there are no cache conflicts, no need to fall back to the slow path, etc. This is the number you get when running the benchmark with only 1 thread.

When running the benchmark with multiple threads, you get the best performance in the same way: by ensuring a single thread can repeatedly lock/unlock without interference from other threads. The easiest way to achieve this is to put all other threads to sleep. The longer the sleep, the better the benchmark numbers.

The problem is that this only makes the benchmark look good on paper, and actually hurts real applications. Making a thread sleep for 1ms (~3000000 CPU cycles) if it misses a lock can have very noticeable impacts on application performance, especially for timing-sensitive code such as rendering or audio. Instead, it is better to park the thread so that it can be woken up by the OS as soon as the lock is free.

pingzhaozz · 2023-11-21T06:29:51Z

The Mutex is designed for multi-threaded data access, ideally program without the need for data sharing. However, in practice, some critical structures require sharing among multiple threads. Based on our experience, this lock benchmark can effectively represent scenarios of lock contention, as illustrated by the given case reflecting Mutex behavior in high contention situations. While reducing the number of threads will have better results, the issue persists, resulting in lower performance compared to std::Mutex. I have chosen this case to better highlight the problem.

The performance drop in parking-lot::Mutex, compared to std::Mutex, occurs in the slow path. Before entering the parking state, there should be a smooth transition, and using sleep appears to be a suitable way to transition between spin_wait, yield, and parking. The logic remains consistent with before, meaning that the slow path still attempts spin_wait and yield first. If these also fail and before entering the parking state, it will try sleeping to make the transition smoother. This is because parking takes the OS significantly more time to wake than a simple sleep. Based on that it won’t hurt the timing-sensitive program but help them smoothly transition in the case of high lock contention.

pingzhaozz · 2023-12-07T01:02:40Z

This change also fixe issue #338

Before:

- Running with 9 threads
- 5 iterations inside lock, 5 iterations outside lock
- 2 seconds per test
        name         |    average     |     median     |    std.dev.
parking_lot::Mutex   |    363.574 kHz |    330.716 kHz |    137.608 kHz
std::sync::Mutex     |    699.105 kHz |    613.367 kHz |    223.500 kHz
pthread_mutex_t      |   1125.346 kHz |   1115.156 kHz |     46.531 kHz
- Running with 18 threads
        name         |    average     |     median     |    std.dev.
parking_lot::Mutex   |     26.346 kHz |     26.358 kHz |      0.407 kHz
std::sync::Mutex     |    353.564 kHz |    336.981 kHz |     61.858 kHz
pthread_mutex_t      |    529.260 kHz |    529.450 kHz |     19.034 kHz
- Running with 27 threads
        name         |    average     |     median     |    std.dev.
parking_lot::Mutex   |     17.232 kHz |     17.306 kHz |      0.428 kHz
std::sync::Mutex     |    226.792 kHz |    213.642 kHz |     41.591 kHz
pthread_mutex_t      |    315.459 kHz |    316.935 kHz |     19.549 kHz
- Running with 36 threads
        name         |    average     |     median     |    std.dev.
parking_lot::Mutex   |     12.920 kHz |     12.921 kHz |      0.287 kHz
std::sync::Mutex     |    167.901 kHz |    158.800 kHz |     34.085 kHz
pthread_mutex_t      |    221.542 kHz |    212.404 kHz |     17.621 kHz

After

- Running with 9 threads
- 5 iterations inside lock, 5 iterations outside lock
- 2 seconds per test
        name         |    average     |     median     |    std.dev.
parking_lot::Mutex   |   1179.909 kHz |   1075.771 kHz |    345.058 kHz
std::sync::Mutex     |    808.353 kHz |    853.010 kHz |    128.165 kHz
pthread_mutex_t      |   1382.591 kHz |   1388.152 kHz |     22.390 kHz
- Running with 18 threads
        name         |    average     |     median     |    std.dev.
parking_lot::Mutex   |    571.435 kHz |    539.379 kHz |    157.537 kHz
std::sync::Mutex     |    384.533 kHz |    397.613 kHz |     86.923 kHz
pthread_mutex_t      |    572.527 kHz |    572.878 kHz |     26.068 kHz
- Running with 27 threads
        name         |    average     |     median     |    std.dev.
parking_lot::Mutex   |    368.414 kHz |    345.560 kHz |     85.260 kHz
std::sync::Mutex     |    238.103 kHz |    223.207 kHz |     59.118 kHz
pthread_mutex_t      |    337.518 kHz |    344.361 kHz |     23.691 kHz
- Running with 36 threads
        name         |    average     |     median     |    std.dev.
parking_lot::Mutex   |    268.471 kHz |    259.280 kHz |     50.276 kHz
std::sync::Mutex     |    175.217 kHz |    166.470 kHz |     41.695 kHz
pthread_mutex_t      |    237.988 kHz |    225.297 kHz |     25.596 kHz

dista · 2024-04-16T08:25:43Z

We have a video/audio streaming application build on tokio(which enable parking_lot by default), when parking_lot is enabled, when we use wrk to bench http output of the streaming application, the application is bottlenecked in cpu, no matter
how many threads(32 threads for example, it should reach 3200% at most) we assign to tokio, the cpu of our application can not exceed 600%.

after disable parking_lot, we can reach the number we anticipated.

the benchmark for parking_lot is very pool in our server(AMD EPYC 7502, Rocky linux 9.3, Kernel 5.14.0-362.8.1.el9_3.x86_64)
cargo run --release 32 2 10000 100

std::sync::Mutex     avg 24.545373ms  min 19.094168ms  max 27.440505ms 
parking_lot::Mutex   avg 437.688227ms min 403.22461ms  max 493.899443ms
spin::Mutex          avg 32.515297ms  min 22.190278ms  max 56.560499ms 
AmdSpinlock          avg 32.379324ms  min 23.217082ms  max 48.166241ms

Amanieu · 2024-04-16T09:16:14Z

@dista Can you open a separate issue for this? Also, would you be able to test some modifications of parking_lot on your application to see how it affects CPU usage?

dista · 2024-04-16T10:17:22Z

@dista Can you open a separate issue for this? Also, would you be able to test some modifications of parking_lot on your application to see how it affects CPU usage?

issue filed #437

Fix issue Amanieu#418.

a2d8a98

To avoid entering Parking too frequently in case of cache contention, adding sleep 1ms, 4 times before parking and after old 'spin()'. Signed-off-by: Ping Zhao <ping.zhao@intel.com>

pingzhaozz mentioned this pull request Dec 7, 2023

Mutex is slower than standard library's on Intel i9 and ARM CPUs #338

Open

pingzhaozz mentioned this pull request Apr 18, 2024

low benchmark and cpu bottleneck issue #437

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix issue #418. #419

Fix issue #418. #419

pingzhaozz commented Nov 6, 2023 •

edited

Loading

pingzhaozz commented Nov 21, 2023

Amanieu commented Nov 21, 2023

pingzhaozz commented Nov 21, 2023

pingzhaozz commented Dec 7, 2023

dista commented Apr 16, 2024 •

edited

Loading

Amanieu commented Apr 16, 2024

dista commented Apr 16, 2024

Fix issue #418. #419

Are you sure you want to change the base?

Fix issue #418. #419

Conversation

pingzhaozz commented Nov 6, 2023 • edited Loading

pingzhaozz commented Nov 21, 2023

Amanieu commented Nov 21, 2023

pingzhaozz commented Nov 21, 2023

pingzhaozz commented Dec 7, 2023

dista commented Apr 16, 2024 • edited Loading

Amanieu commented Apr 16, 2024

dista commented Apr 16, 2024

pingzhaozz commented Nov 6, 2023 •

edited

Loading

dista commented Apr 16, 2024 •

edited

Loading