sync: mpsc performance optimization avoid false sharing #5829

Merged · 10 commits · Aug 3, 2023
Conversation

wathenjiang
Contributor

Motivation

Cache line optimization on mpsc.

Solution

Cache line optimization on mpsc by using CachePadded.
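
For context, here is a minimal sketch of the kind of wrapper involved, assuming a fixed 128-byte alignment (the actual cacheline.rs added by this PR may differ in details such as per-architecture alignment):

use std::ops::{Deref, DerefMut};

// Minimal sketch, assuming 128-byte cache lines. The over-aligned
// wrapper guarantees each wrapped field starts on its own cache line,
// so writes to one field cannot invalidate a line shared with another
// field (false sharing).
#[repr(align(128))]
pub(crate) struct CachePadded<T> {
    value: T,
}

impl<T> CachePadded<T> {
    /// Pads and aligns a value to the length of a cache line.
    pub(crate) fn new(value: T) -> CachePadded<T> {
        CachePadded { value }
    }
}

impl<T> Deref for CachePadded<T> {
    type Target = T;
    fn deref(&self) -> &T {
        &self.value
    }
}

impl<T> DerefMut for CachePadded<T> {
    fn deref_mut(&mut self) -> &mut T {
        &mut self.value
    }
}

fn main() {
    let padded = CachePadded::new(0u64);
    assert_eq!(*padded, 0); // Deref makes the wrapper transparent to use
    assert_eq!(std::mem::align_of::<CachePadded<u64>>(), 128);
}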

Benchmark

The original benchmark result:

    $ cargo bench --bench sync_mpsc
    running 10 tests
    test contention_bounded      ... bench:   1,008,359 ns/iter (+/- 412,814)
    test contention_bounded_full ... bench:   1,427,243 ns/iter (+/- 500,287)
    test contention_unbounded    ... bench:     845,013 ns/iter (+/- 394,673)
    test create_100_000_medium   ... bench:         182 ns/iter (+/- 1)
    test create_100_medium       ... bench:         182 ns/iter (+/- 1)
    test create_1_medium         ... bench:         181 ns/iter (+/- 2)
    test send_large              ... bench:      16,525 ns/iter (+/- 329)
    test send_medium             ... bench:         628 ns/iter (+/- 5)
    test uncontented_bounded     ... bench:     478,514 ns/iter (+/- 1,923)
    test uncontented_unbounded   ... bench:     303,990 ns/iter (+/- 1,607)
    
    test result: ok. 0 passed; 0 failed; 0 ignored; 10 measured

The current benchmark result:

    $ cargo bench --bench sync_mpsc
    running 10 tests
    test contention_bounded      ... bench:     606,516 ns/iter (+/- 402,326)
    test contention_bounded_full ... bench:     727,239 ns/iter (+/- 340,756)
    test contention_unbounded    ... bench:     760,523 ns/iter (+/- 482,628)
    test create_100_000_medium   ... bench:         315 ns/iter (+/- 5)
    test create_100_medium       ... bench:         317 ns/iter (+/- 6)
    test create_1_medium         ... bench:         315 ns/iter (+/- 5)
    test send_large              ... bench:      16,166 ns/iter (+/- 516)
    test send_medium             ... bench:         695 ns/iter (+/- 6)
    test uncontented_bounded     ... bench:     456,975 ns/iter (+/- 18,969)
    test uncontented_unbounded   ... bench:     306,282 ns/iter (+/- 3,058)
    
    test result: ok. 0 passed; 0 failed; 0 ignored; 10 measured

wathenjiang and others added 3 commits June 25, 2023 15:13
@github-actions github-actions bot added the R-loom Run loom tests on this PR label Jun 28, 2023
@wathenjiang wathenjiang changed the title Cache line optimization on mpsc. mpsc performance performance optimization about cache line Jun 28, 2023
@wathenjiang wathenjiang changed the title mpsc performance performance optimization about cache line mpsc performance optimization about cache line Jun 28, 2023
@Darksonn Darksonn added A-tokio Area: The main tokio crate M-sync Module: tokio/sync labels Jun 28, 2023
@carllerche
Member

What system are you running the benchmark on?

@carllerche
Member

The unfortunate issue with this change is that it makes creating a channel much more expensive for "one shot" operations.

@carllerche
Member

Could you experiment w/ only padding 64 instead of 128? Does it make a difference?

@wathenjiang
Contributor Author

What system are you running the benchmark on?

On both Linux and Darwin.

@wathenjiang
Contributor Author

What system are you running the benchmark on?

The benchmark difference is almost the same on both systems.

@wathenjiang
Contributor Author

The unfortunate issue with this change is it makes creating a channel much more expensive for "one shot" operations.

I know, but this may not be the most pressing issue. Thanks to Rust's zero-cost abstractions, users in one-shot scenarios can use oneshot instead. The extra cost may be acceptable, though I would like to find out where it comes from.
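
For reference, a small sketch of the oneshot alternative mentioned above (standard tokio::sync::oneshot usage, not code from this PR):

use tokio::sync::oneshot;

#[tokio::main]
async fn main() {
    // A oneshot channel carries exactly one value, so there is no
    // per-channel padding trade-off as in mpsc.
    let (tx, rx) = oneshot::channel();
    tokio::spawn(async move {
        let _ = tx.send(42u32); // send consumes the sender
    });
    assert_eq!(rx.await.unwrap(), 42);
}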

@wathenjiang
Contributor Author

wathenjiang commented Jun 29, 2023

Could you experiment w/ only padding 64 instead of 128. Does it make a difference?

Which size to pad to depends on the CPU you run on; the cache line size is determined by the hardware.

On my Mac, the cache line size is 128 bytes.

If CachePadded<T> uses #[repr(align(128))], the benchmark results are as follows:

$ cargo bench --bench sync_mpsc
running 10 tests
test contention_bounded      ... bench:     749,449 ns/iter (+/- 146,014)
test contention_bounded_full ... bench:     673,661 ns/iter (+/- 115,419)
test contention_unbounded    ... bench:     469,466 ns/iter (+/- 49,005)
test create_100_000_medium   ... bench:         177 ns/iter (+/- 6)
test create_100_medium       ... bench:         181 ns/iter (+/- 29)
test create_1_medium         ... bench:         152 ns/iter (+/- 37)
test send_large              ... bench:      15,549 ns/iter (+/- 348)
test send_medium             ... bench:         480 ns/iter (+/- 15)
test uncontented_bounded     ... bench:     221,580 ns/iter (+/- 978)
test uncontented_unbounded   ... bench:     159,800 ns/iter (+/- 3,674)

test result: ok. 0 passed; 0 failed; 0 ignored; 10 measured

But when it uses #[repr(align(64))], the benchmark results are as follows:

running 10 tests
test contention_bounded      ... bench:     745,966 ns/iter (+/- 259,831)
test contention_bounded_full ... bench:     672,112 ns/iter (+/- 48,057)
test contention_unbounded    ... bench:     579,621 ns/iter (+/- 73,529)
test create_100_000_medium   ... bench:         167 ns/iter (+/- 2)
test create_100_medium       ... bench:         143 ns/iter (+/- 2)
test create_1_medium         ... bench:         167 ns/iter (+/- 26)
test send_large              ... bench:      15,639 ns/iter (+/- 1,047)
test send_medium             ... bench:         481 ns/iter (+/- 99)
test uncontented_bounded     ... bench:     222,053 ns/iter (+/- 1,895)
test uncontented_unbounded   ... bench:     161,342 ns/iter (+/- 12,946)

test result: ok. 0 passed; 0 failed; 0 ignored; 10 measured

And on my Linux machine (whose cache line size is 64), if we set it to #[repr(align(32))], the benchmark results are as follows:

running 10 tests
test contention_bounded      ... bench:     801,299 ns/iter (+/- 509,036)
test contention_bounded_full ... bench:     751,686 ns/iter (+/- 481,196)
test contention_unbounded    ... bench:     692,478 ns/iter (+/- 367,244)
test create_100_000_medium   ... bench:         275 ns/iter (+/- 5)
test create_100_medium       ... bench:         275 ns/iter (+/- 4)
test create_1_medium         ... bench:         274 ns/iter (+/- 6)
test send_large              ... bench:      16,200 ns/iter (+/- 212)
test send_medium             ... bench:         750 ns/iter (+/- 22)
test uncontented_bounded     ... bench:     488,436 ns/iter (+/- 4,813)
test uncontented_unbounded   ... bench:     305,340 ns/iter (+/- 8,128)

test result: ok. 0 passed; 0 failed; 0 ignored; 10 measured

Benchmark results might not be perfectly stable, but aligning certain struct fields to the cache line size gives better performance in most of the benchmark cases.
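
For reference, the alignment can be chosen per target architecture, in the style of crossbeam-utils and std's mpmc utils; a sketch (the cfg list here is illustrative, not necessarily this PR's final choice):

// Sketch: per-architecture padding, as crossbeam-utils and
// std::sync::mpmc do. x86_64 prefetches cache lines in pairs, and
// modern aarch64 chips (e.g. Apple M1) use 128-byte lines, so both
// get 128 bytes; other targets fall back to 64.
#[cfg_attr(
    any(target_arch = "x86_64", target_arch = "aarch64"),
    repr(align(128))
)]
#[cfg_attr(
    not(any(target_arch = "x86_64", target_arch = "aarch64")),
    repr(align(64))
)]
pub(crate) struct CachePadded<T> {
    value: T,
}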

@wathenjiang
Contributor Author

wathenjiang commented Jun 29, 2023

The reason creating CachePadded<T> is more expensive than creating T might be that creating CachePadded<T> involves an extra memcpy call.

The source code:

fn get_s() -> Box<S> {
    Box::new(S { i: 0, j: 0 })
}

struct S {
    i: usize,
    j: usize,
}

#[repr(align(128))]
struct Align {
    i: usize,
    j: usize,
}

fn get_align() -> Box<Align> {
    Box::new(Align { i: 0, j: 0 })
}

fn main() {
    let _s = get_s();
    let _a = get_align();
}

The assembly code of the two functions get_s() and get_align() is:

playground::get_s: # @playground::get_s
# %bb.0:
	subq	$72, %rsp
	movq	$0, 24(%rsp)
	movq	$0, 32(%rsp)
	movq	24(%rsp), %rcx
	movq	%rcx, (%rsp)                    # 8-byte Spill
	movq	32(%rsp), %rax
	movq	%rax, 8(%rsp)                   # 8-byte Spill
	movq	%rcx, 40(%rsp)
	movq	%rax, 48(%rsp)
	movl	$16, %edi
	movl	$8, %esi
	callq	alloc::alloc::exchange_malloc
	movq	%rax, 16(%rsp)                  # 8-byte Spill
	jmp	.LBB15_2
	movq	%rax, %rcx
	movl	%edx, %eax
	movq	%rcx, 56(%rsp)
	movl	%eax, 64(%rsp)
	movq	56(%rsp), %rdi
	callq	_Unwind_Resume@PLT
	ud2

.LBB15_2:
	movq	16(%rsp), %rax                  # 8-byte Reload
	movq	8(%rsp), %rcx                   # 8-byte Reload
	movq	(%rsp), %rdx                    # 8-byte Reload
	movq	%rdx, (%rax)
	movq	%rcx, 8(%rax)
	addq	$72, %rsp
	retq
                                        # -- End function

playground::get_align: # @playground::get_align
# %bb.0:
	pushq	%rbp
	movq	%rsp, %rbp
	andq	$-128, %rsp
	subq	$384, %rsp                      # imm = 0x180
	movq	$0, 128(%rsp)
	movq	$0, 136(%rsp)
	movl	$128, %esi
	movq	%rsi, %rdi
	callq	alloc::alloc::exchange_malloc
	movq	%rax, 120(%rsp)                 # 8-byte Spill
	jmp	.LBB16_2
	movq	%rax, %rcx
	movl	%edx, %eax
	movq	%rcx, 352(%rsp)
	movl	%eax, 360(%rsp)
	movq	352(%rsp), %rdi
	callq	_Unwind_Resume@PLT
	ud2

.LBB16_2:
	movq	120(%rsp), %rdi                 # 8-byte Reload
	leaq	128(%rsp), %rsi
	movl	$128, %edx
	callq	memcpy@PLT
	movq	120(%rsp), %rax                 # 8-byte Reload
	movq	%rbp, %rsp
	popq	%rbp
	retq
                                        # -- End function

As we can see, the function get_align() uses an external memcpy call to copy 128 bytes from the stack to the heap, while get_s() only uses basic instructions like movq. This difference may explain the extra CPU cost of using CachePadded.

You can see more detail about the above assembly code in https://play.rust-lang.org/?version=nightly&mode=debug&edition=2021&gist=7ddf997fbb5e23ceaf4a9ab485f01771

@wathenjiang wathenjiang changed the title mpsc performance optimization about cache line sync: mpsc performance optimization about cache line Jun 29, 2023
@Darksonn
Contributor

Darksonn commented Jul 3, 2023

I'm not getting a memcpy in the playground when I try it.

playground::get_s: # @playground::get_s
# %bb.0:
	pushq	%rax
	movq	__rust_no_alloc_shim_is_unstable@GOTPCREL(%rip), %rax
	movzbl	(%rax), %eax
	movl	$16, %edi
	movl	$8, %esi
	callq	*__rust_alloc@GOTPCREL(%rip)
	testq	%rax, %rax
	je	.LBB0_1
# %bb.2:
	xorps	%xmm0, %xmm0
	movups	%xmm0, (%rax)
	popq	%rcx
	retq

.LBB0_1:
	movl	$8, %edi
	movl	$16, %esi
	callq	*alloc::alloc::handle_alloc_error@GOTPCREL(%rip)
	ud2
                                        # -- End function

playground::get_align2: # @playground::get_align2
# %bb.0:
	pushq	%rax
	movq	__rust_no_alloc_shim_is_unstable@GOTPCREL(%rip), %rax
	movzbl	(%rax), %eax
	movl	$128, %edi
	movl	$128, %esi
	callq	*__rust_alloc@GOTPCREL(%rip)
	testq	%rax, %rax
	je	.LBB1_1
# %bb.2:
	xorps	%xmm0, %xmm0
	movaps	%xmm0, (%rax)
	popq	%rcx
	retq

.LBB1_1:
	movl	$128, %edi
	movl	$128, %esi
	callq	*alloc::alloc::handle_alloc_error@GOTPCREL(%rip)
	ud2
                                        # -- End function

playground

@wathenjiang
Contributor Author

wathenjiang commented Jul 3, 2023

Sorry, that was my mistake: the above code should be compiled in release mode.
So the reason creating an aligned struct is less efficient might be that the parameters passed to alloc::alloc::__rust_alloc differ: heap allocation performance is affected by both size and alignment.
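
A quick way to see the layout difference (a small illustrative snippet, not code from the PR):

use std::alloc::Layout;

// Illustrative: #[repr(align(128))] grows both the size and the
// alignment, and both are what the allocator is asked for.
#[allow(dead_code)]
struct S {
    i: usize,
    j: usize,
}

#[allow(dead_code)]
#[repr(align(128))]
struct Align {
    i: usize,
    j: usize,
}

fn main() {
    println!("{:?}", Layout::new::<S>());     // size 16, align 8
    println!("{:?}", Layout::new::<Align>()); // size 128, align 128
}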

@wathenjiang
Contributor Author

This PR has been at a standstill for a long time; what is the current status?

@Darksonn
Contributor

There are some CI failures. Please ensure that CI passes. (And you probably want to merge in master so that you are up-to-date with CI changes.)

@github-actions github-actions bot added the R-loom-sync Run loom sync tests on this PR label Jul 29, 2023
@wathenjiang
Contributor Author

There are some CI failures. Please ensure that CI passes. (And you probably want to merge in master so that you are up-to-date with CI changes.)

The current CI error is

Run cargo hack test --each-feature
info: running `cargo test --no-default-features` on tests-integration (1/10)
   Compiling tokio v1.29.1 (/home/runner/work/tokio/tokio/tokio)
error: associated function `new` is never used
  --> tokio/src/util/cacheline.rs:77:19
   |
75 | impl<T> CachePadded<T> {
   | ---------------------- associated function in this implementation
76 |     /// Pads and aligns a value to the length of a cache line.
77 |     pub(crate) fn new(value: T) -> CachePadded<T> {
   |                   ^^^
   |
   = note: `-D dead-code` implied by `-D warnings`

error: could not compile `tokio` (lib) due to previous error
error: process didn't exit successfully: `/home/runner/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/bin/cargo test --manifest-path Cargo.toml --no-default-features` (exit status: 101)
Error: Process completed with exit code 1.

The other CI error is the same.
But we do call CachePadded::new in the mpsc::channel() method.
Do you have any suggestions on how to make it pass?

@Darksonn
Contributor

If the sync feature is disabled, then you're not using CachePadded. You'll need to add conditional compilation to remove it in those cases.
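
A minimal sketch of one way to add that gating (an assumed approach, not necessarily the exact fix that landed):

// Sketch (assumed approach): only compile the constructor when a
// feature that actually uses it is enabled, so builds with
// --no-default-features don't trip `-D dead-code`.
impl<T> CachePadded<T> {
    /// Pads and aligns a value to the length of a cache line.
    #[cfg(feature = "sync")]
    pub(crate) fn new(value: T) -> CachePadded<T> {
        CachePadded { value }
    }
}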

@wathenjiang wathenjiang changed the title sync: mpsc performance optimization about cache line sync: mpsc performance optimization avoid false sharing Jul 30, 2023
@wathenjiang
Contributor Author

The CI passed.

@Darksonn
Contributor

I'm ok with moving forward with this, but it is worth considering whether this is the best way to pad the fields. Did you try any other configurations? Why did you pick this one in particular?

@wathenjiang
Contributor Author

wathenjiang commented Jul 30, 2023

If I have not misunderstood, there are three questions:

  • (1) Why use CachePadded to avoid false sharing?
  • (2) Why just choose tokio::sync::mpsc to apply CachePadded?
  • (3) Why wrap both field tx and field rx_waker in CachePadded?

(1) Why choose CachePadded to avoid false sharing?

I believe CachePadded was first introduced by crossbeam; its docs are at https://docs.rs/crossbeam-utils/latest/crossbeam_utils/struct.CachePadded.html.

And here are some historical discussions on it in crossbeam:

The crossbeam mpsc was merged into std::sync::mpsc in rust-lang/rust#93563, so the current std::sync::mpsc does use CachePadded; see https://github.com/rust-lang/rust/blob/2e0136a131f6ed5f6071adf36db08dd8d2205d19/library/std/src/sync/mpmc/list.rs#L146-L149 and https://github.com/rust-lang/rust/blob/master/library/std/src/sync/mpmc/utils.rs.

I believe the best size for CachePadded is difficult to choose, so directly copying the CachePadded from std may be a good way to avoid false sharing.

Using the perf c2c tool to detect false sharing would be a good test, but I do not own a machine whose CPU supports the perf c2c command.

Although I cannot provide detailed false-sharing measurements, the performance is indeed improved, and the std library is of high quality and can be trusted.

(2) Why just choose tokio::sync::mpsc to apply CachePadded?

Using CachePadded to avoid false sharing may be needed in other places, but I would like to start with mpsc; this PR should stay small.
I would be more than happy to help see whether other components can be optimized in this way.

(3) Why wrap both field tx and field rx_waker in CachePadded?

Choosing which fields to wrap in CachePadded is about obtaining the best concurrency performance with the fewest wrapped fields: excessive CachePadded wrapping occupies too many CPU cache lines and hurts creation performance.

For the use case of tokio::sync::mpsc, the hot path is the following fields:

  • Chan.tx: concurrently accessed by multiple senders, via Sender.send()
  • Chan.rx_waker: concurrently accessed by the receiver and multiple senders, via Sender.send() and Receiver.recv()
  • Chan.semaphore: concurrently accessed by the receiver and multiple senders, via Sender.send() and Receiver.recv()
  • Chan.rx_fields: accessed only by the receiver, on a single thread, via Receiver.recv()

Theoretically, wrapping the fields tx, rx_waker and semaphore in CachePadded may improve mpsc performance.
Once those fields are wrapped in CachePadded, rx_fields automatically ends up on its own cache line, avoiding false sharing; see the sketch below.
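
A self-contained sketch of the resulting layout (field names mirror chan.rs; u64 is a stand-in for the real tokio-internal types):

// Sketch only: u64 stands in for list::Tx<T>, AtomicWaker, etc.
#[allow(dead_code)]
#[repr(align(128))]
struct CachePadded<T>(T);

#[allow(dead_code)]
struct Chan {
    tx: CachePadded<u64>,       // multi-producer hot path
    rx_waker: CachePadded<u64>, // touched by senders and the receiver
    semaphore: u64,             // left unpadded, per the discussion below
    rx_fields: u64,             // receiver-only state
}

fn main() {
    // Two padded fields push the struct onto three 128-byte cache
    // lines; semaphore and rx_fields land away from the contended ones.
    assert_eq!(std::mem::size_of::<Chan>(), 384);
}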

On my test machine (5.4.119-1-tlinux4-0010.3 #1 SMP Thu Jan 5 17:31:23 CST 2023 x86_64 x86_64 x86_64 GNU/Linux), the results are as follows.

Original version benchmark:

# cargo bench --bench sync_mpsc
    Finished bench [optimized] target(s) in 0.08s
     Running sync_mpsc.rs (target/release/deps/sync_mpsc-6a83a3f390639fde)

running 10 tests
test contention_bounded      ... bench:   1,604,618 ns/iter (+/- 222,711)
test contention_bounded_full ... bench:   2,067,815 ns/iter (+/- 292,991)
test contention_unbounded    ... bench:   1,517,959 ns/iter (+/- 370,697)
test create_100_000_medium   ... bench:         248 ns/iter (+/- 2)
test create_100_medium       ... bench:         248 ns/iter (+/- 1)
test create_1_medium         ... bench:         248 ns/iter (+/- 0)
test send_large              ... bench:      23,554 ns/iter (+/- 513)
test send_medium             ... bench:         846 ns/iter (+/- 11)
test uncontented_bounded     ... bench:     660,797 ns/iter (+/- 8,461)
test uncontented_unbounded   ... bench:     407,797 ns/iter (+/- 10,469)

test result: ok. 0 passed; 0 failed; 0 ignored; 10 measured

Only tx is wrapped in CachePadded:

# cargo bench --bench sync_mpsc
    Finished bench [optimized] target(s) in 0.07s
     Running sync_mpsc.rs (target/release/deps/sync_mpsc-638df6c3e06cd387)

running 10 tests
test contention_bounded      ... bench:   1,402,014 ns/iter (+/- 153,941)
test contention_bounded_full ... bench:   1,786,304 ns/iter (+/- 146,097)
test contention_unbounded    ... bench:   1,507,211 ns/iter (+/- 356,297)
test create_100_000_medium   ... bench:         450 ns/iter (+/- 4)
test create_100_medium       ... bench:         451 ns/iter (+/- 5)
test create_1_medium         ... bench:         451 ns/iter (+/- 4)
test send_large              ... bench:      23,927 ns/iter (+/- 281)
test send_medium             ... bench:         949 ns/iter (+/- 12)
test uncontented_bounded     ... bench:     659,927 ns/iter (+/- 3,506)
test uncontented_unbounded   ... bench:     403,732 ns/iter (+/- 28,563)

test result: ok. 0 passed; 0 failed; 0 ignored; 10 measured

tx and rx_waker are wrapped in CachePadded:

# cargo bench --bench sync_mpsc
    Finished bench [optimized] target(s) in 0.07s
     Running sync_mpsc.rs (target/release/deps/sync_mpsc-638df6c3e06cd387)

running 10 tests
test contention_bounded      ... bench:   1,204,969 ns/iter (+/- 99,970)
test contention_bounded_full ... bench:   1,379,226 ns/iter (+/- 153,794)
test contention_unbounded    ... bench:   1,347,599 ns/iter (+/- 262,240)
test create_100_000_medium   ... bench:         435 ns/iter (+/- 3)
test create_100_medium       ... bench:         435 ns/iter (+/- 1)
test create_1_medium         ... bench:         436 ns/iter (+/- 9)
test send_large              ... bench:      23,593 ns/iter (+/- 311)
test send_medium             ... bench:         937 ns/iter (+/- 15)
test uncontented_bounded     ... bench:     651,098 ns/iter (+/- 4,037)
test uncontented_unbounded   ... bench:     404,862 ns/iter (+/- 3,986)

test result: ok. 0 passed; 0 failed; 0 ignored; 10 measured

tx, rx_waker and semaphore are wrapped in CachePadded:

# cargo bench --bench sync_mpsc
    Finished bench [optimized] target(s) in 0.07s
     Running sync_mpsc.rs (target/release/deps/sync_mpsc-638df6c3e06cd387)

running 10 tests
test contention_bounded      ... bench:   1,199,603 ns/iter (+/- 107,472)
test contention_bounded_full ... bench:   1,291,294 ns/iter (+/- 220,522)
test contention_unbounded    ... bench:   1,318,049 ns/iter (+/- 207,409)
test create_100_000_medium   ... bench:         456 ns/iter (+/- 1)
test create_100_medium       ... bench:         456 ns/iter (+/- 2)
test create_1_medium         ... bench:         456 ns/iter (+/- 3)
test send_large              ... bench:      23,390 ns/iter (+/- 347)
test send_medium             ... bench:         935 ns/iter (+/- 17)
test uncontented_bounded     ... bench:     660,564 ns/iter (+/- 5,663)
test uncontented_unbounded   ... bench:     403,446 ns/iter (+/- 5,539)

test result: ok. 0 passed; 0 failed; 0 ignored; 10 measured

We only focus on the results of the contention_bounded, contention_bounded_full and contention_unbounded tests.

Each of the tx-only, tx + rx_waker, and tx + rx_waker + semaphore versions improves over the previous one (starting from the original version) by:

  • contention_bounded: 12.6%, 14.1% and 0.4%
  • contention_bounded_full: 13.6%, 22.8% and 6.4%
  • contention_unbounded: 0.7%, 10.6% and 2.2%

Wrapping semaphore in CachePadded improves performance only slightly, and slightly reduces creation performance. That made me hesitant to wrap semaphore in CachePadded. This can be discussed further.

@wathenjiang
Contributor Author

Other sync tools in tokio::sync may also benefit from applying CachePadded, for example tokio::sync::broadcast.

I initially wrote a benchmark test: https://github.com/wathenjiang/tokio-2.8.2-coderead/blob/bfe26af3b56fe8eac81b6087829eb4c17c88fb49/benches/sync_broadcast.rs#L1-L46

And applied CachePadded in https://github.com/wathenjiang/tokio-2.8.2-coderead/blob/bfe26af3b56fe8eac81b6087829eb4c17c88fb49/tokio/src/sync/broadcast.rs#L306-L318

The following is a performance comparison before and after applying CachePadded:

Before

# cargo bench --bench sync_broadcast
    Finished bench [optimized] target(s) in 0.05s
     Running sync_broadcast.rs (/root/rustproject/tokio-2.8.2-coderead/target/release/deps/sync_broadcast-a42a48a6aede9bb7)

running 1 test
test contention_unbounded ... bench:  46,916,459 ns/iter (+/- 5,925,933)

test result: ok. 0 passed; 0 failed; 0 ignored; 1 measured

After

# cargo bench --bench sync_broadcast
    Finished bench [optimized] target(s) in 0.05s
     Running sync_broadcast.rs (/root/rustproject/tokio-2.8.2-coderead/target/release/deps/sync_broadcast-a42a48a6aede9bb7)

running 1 test
test contention_unbounded ... bench:  41,510,877 ns/iter (+/- 7,768,113)

test result: ok. 0 passed; 0 failed; 0 ignored; 1 measured

An increase of approximately 11%.

@Noah-Kennedy
Contributor

It may be worth testing on a few different systems here. I'd like to see:

  • AMD Zen
  • Intel
  • Something aarch64

@Darksonn Darksonn (Contributor) left a comment

Wrapping semaphore in CachePadded improves performance only slightly, and slightly reduces creation performance. That made me hesitant to wrap semaphore in CachePadded. This can be discussed further.

Okay, it sounds like the fields you used in the PR are close to optimal. That's good enough for me. If wrapping semaphore doesn't give much additional advantage, then let's keep it unwrapped.

Other sync tools in tokio::sync may also benefit from applying CachePadded, for example tokio::sync::broadcast.

I don't think it makes sense to wrap buffer in CachePadded. The field is an immutable pointer. The things it points at are modified, but those are in a different cache line. Let's not do broadcast for now.

It may be worth testing on a few different systems here.

Sure, that would be interesting. But I don't want to block this PR over that.

Let me know if you want to do any more testing, or if you just want me to merge this.

@wathenjiang
Contributor Author

wathenjiang commented Aug 3, 2023

@Noah-Kennedy @Darksonn @carllerche

The above tests were run on Intel Linux (Intel(R) Xeon(R) Gold 6133 CPU @ 2.50GHz) and an Intel Mac (Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz).

The following test is on an M1 Mac, whose CPU architecture is aarch64. It shows the same trend.

Original version:

# cargo bench --bench sync_mpsc

running 10 tests
test contention_bounded      ... bench:     896,906 ns/iter (+/- 50,993)
test contention_bounded_full ... bench:     720,526 ns/iter (+/- 37,487)
test contention_unbounded    ... bench:     687,416 ns/iter (+/- 229,797)
test create_100_000_medium   ... bench:          87 ns/iter (+/- 1)
test create_100_medium       ... bench:          87 ns/iter (+/- 1)
test create_1_medium         ... bench:          87 ns/iter (+/- 1)
test send_large              ... bench:      13,969 ns/iter (+/- 462)
test send_medium             ... bench:         389 ns/iter (+/- 6)
test uncontented_bounded     ... bench:     219,662 ns/iter (+/- 2,003)
test uncontented_unbounded   ... bench:     162,508 ns/iter (+/- 3,385)

test result: ok. 0 passed; 0 failed; 0 ignored; 10 measured

tx and rx_waker are wrapped in CachePadded:

# cargo bench --bench sync_mpsc

running 10 tests
test contention_bounded      ... bench:     710,716 ns/iter (+/- 63,422)
test contention_bounded_full ... bench:     666,249 ns/iter (+/- 84,471)
test contention_unbounded    ... bench:     490,628 ns/iter (+/- 77,227)
test create_100_000_medium   ... bench:         144 ns/iter (+/- 36)
test create_100_medium       ... bench:         144 ns/iter (+/- 4)
test create_1_medium         ... bench:         144 ns/iter (+/- 4)
test send_large              ... bench:      14,899 ns/iter (+/- 735)
test send_medium             ... bench:         481 ns/iter (+/- 35)
test uncontented_bounded     ... bench:     228,849 ns/iter (+/- 2,135)
test uncontented_unbounded   ... bench:     165,967 ns/iter (+/- 937)

test result: ok. 0 passed; 0 failed; 0 ignored; 10 measured

The performance of the tx + rx_waker version is improved compared with the original version:

  • contention_bounded: 20.8%
  • contention_bounded_full: 7.5%
  • contention_unbounded: 28.6%

The following test is on Windows (AMD Ryzen 7 5800X3D 8-Core Processor, 3.40 GHz), whose CPU architecture is AMD Zen. It shows the same trend.

Original version:

# cargo bench --bench sync_mpsc

running 10 tests
test contention_bounded      ... bench:     507,807 ns/iter (+/- 21,739)
test contention_bounded_full ... bench:     569,656 ns/iter (+/- 28,889)
test contention_unbounded    ... bench:     679,136 ns/iter (+/- 69,558)
test create_100_000_medium   ... bench:         179 ns/iter (+/- 15)
test create_100_medium       ... bench:         178 ns/iter (+/- 46)
test create_1_medium         ... bench:         181 ns/iter (+/- 15)
test send_large              ... bench:      35,337 ns/iter (+/- 4,821)
test send_medium             ... bench:         460 ns/iter (+/- 27)
test uncontented_bounded     ... bench:     265,355 ns/iter (+/- 28,723)
test uncontented_unbounded   ... bench:     150,795 ns/iter (+/- 18,788)

tx and rx_waker are wrapped in CachePadded:

test contention_bounded      ... bench:     443,181 ns/iter (+/- 26,152)
test contention_bounded_full ... bench:     446,956 ns/iter (+/- 25,801)
test contention_unbounded    ... bench:     623,033 ns/iter (+/- 8,451)
test create_100_000_medium   ... bench:         183 ns/iter (+/- 4)
test create_100_medium       ... bench:         183 ns/iter (+/- 16)
test create_1_medium         ... bench:         182 ns/iter (+/- 3)
test send_large              ... bench:      36,913 ns/iter (+/- 4,159)
test send_medium             ... bench:         486 ns/iter (+/- 66)
test uncontented_bounded     ... bench:     259,559 ns/iter (+/- 28,149)
test uncontented_unbounded   ... bench:     151,807 ns/iter (+/- 19,473)

test result: ok. 0 passed; 0 failed; 0 ignored; 10 measured

The performance of the tx + rx_waker version is improved compared with the original version:

  • contention_bounded: 12.7%
  • contention_bounded_full: 21.5%
  • contention_unbounded: 8.3%

I believe the benchmark tests should be enough for now, so please merge this.

@Darksonn Darksonn merged commit 38d1bcd into tokio-rs:master Aug 3, 2023
@Darksonn
Contributor

Darksonn commented Aug 3, 2023

Thank you for the PR!
