perf(timer): check subset of self.items #1780

Closed · mxinden wants to merge 11 commits

Conversation

@mxinden (Collaborator) commented Mar 28, 2024

`Timer::take_next` iterates each `VecDeque` in the `Vec` `self.items`. It then checks the first item in each of those `VecDeque`s.

Under the assumption that all items in `self.items[t]` are smaller than all items in `self.items[t+1]` (ignoring wrap-around), only a subset of `self.items` needs to be iterated, namely only the buckets from `self.cursor` to `self.delta(until)`.

Is this assumption correct?

This commit changes `take_next` to only check this subset.
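
For illustration, here is a minimal, self-contained sketch of the idea on a toy wheel. This is not the actual neqo `Timer` type; the field names mirror the snippets quoted in this thread, and the bucket entries are simplified to `(Instant, T)` pairs.

``` rust
use std::collections::VecDeque;
use std::time::{Duration, Instant};

/// Toy timer wheel, used only to illustrate the subset check.
struct Wheel<T> {
    items: Vec<VecDeque<(Instant, T)>>,
    cursor: usize,
    now: Instant,
    granularity: Duration,
}

impl<T> Wheel<T> {
    /// Whole buckets between `self.now` and `time` (mirrors `Timer::delta`).
    fn delta(&self, time: Instant) -> usize {
        (time.saturating_duration_since(self.now).as_nanos()
            / self.granularity.as_nanos()) as usize
    }

    /// Only walk the buckets from `cursor` to `cursor + delta(until)`
    /// (wrapping around), instead of scanning every bucket of the wheel.
    /// The `+ 1` includes the bucket that `until` itself falls into; the
    /// exact bound depends on how `delta` rounds. The `%` is used here for
    /// brevity; see the review discussion below about avoiding it.
    fn take_next(&mut self, until: Instant) -> Option<T> {
        let len = self.items.len();
        let span = (self.delta(until) + 1).min(len);
        for offset in 0..span {
            let i = (self.cursor + offset) % len;
            if self.items[i].front().is_some_and(|e| e.0 <= until) {
                return self.items[i].pop_front().map(|e| e.1);
            }
        }
        None
    }
}
```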


Why is `Timer::take_next`'s performance relevant?

Whenever `Server::process_next_output` has no other work to do, it checks for expired timers:

``` rust
while let Some(c) = self.timers.take_next(now) {
```

(https://github.com/mozilla/neqo/blob/3151adc53e71273eed1319114380119c70e169a2/neqo-transport/src/server.rs#L650)

A `Server` has at most one item per connection in `Timer`. Thus, a `Server` with a single connection has a single item in `Timer` in total.

The `Timer` timer wheel has 16_384 slots:

``` rust
const TIMER_CAPACITY: usize = 16384;
```

(https://github.com/mozilla/neqo/blob/3151adc53e71273eed1319114380119c70e169a2/neqo-transport/src/server.rs#L55)

Thus, whenever `Server::process_next_output` has no other work to do, it iterates a `Vec` of length `16_384` only to find at most one timer, which might or might not be expired.

This shows up in CPU profiles, accounting for up to 33%. See e.g. https://github.com/mozilla/neqo/actions/runs/8452074231/artifacts/1363138571. On my local machine, a call to `Timer::take_next` takes between 5 and 10 microseconds.

Note that the profiles do not always show `take_next`, as it is oftentimes inlined. Add `#[inline(never)]` to make sure it isn't:

``` diff
modified   neqo-common/src/timer.rs
@@ -193,6 +193,7 @@ impl<T> Timer<T> {

     /// Take the next item, unless there are no items with
     /// a timeout in the past relative to `until`.
+    #[inline(never)]
     pub fn take_next(&mut self, until: Instant) -> Option<T> {
```
With this patch, `Timer::take_next` takes significantly less CPU time on my machine (~2.8%).

Arguably, a 16_384-slot timer wheel is overkill for a single timer. Maybe, to cover the use case of a `Server` with a small number of connections, a hierarchical timer wheel would be helpful?

@mxinden changed the title from perf(common/timer): check subset of self.items to perf(timer): check subset of self.items on Mar 28, 2024

In `neqo-common/src/timer.rs`:

``` diff
-            let res = maybe_take(&mut self.items[i], until);
+        for i in self.cursor..(self.cursor + self.delta(until)) {
+            let i = i % self.items.len();
+            let res = maybe_take(&mut self.items[i], until)?;
```
@martinthomson (Member) commented Mar 28, 2024
If you do this, then the function can be inlined.

However, this will regress one aspect of performance to save another. The reason for two loops is to avoid the modulus operation inside the loop. I think that what you want is more complex yet.

You want the first loop to be `self.bucket(0)..min(self.items.len(), self.bucket(0) + self.delta(until))` and the second to be `0..(self.bucket(0) + self.delta(until) - self.items.len()).clamp(0, self.bucket(0))` ... I think. With appropriate refactoring so that calls to `bucket()` and `delta()` only run once.
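
A rough sketch of that two-loop shape, computing the bounds once and keeping `%` out of the loops. The free-function signature, the saturating arithmetic, and the `+ 1` bound are assumptions for illustration, not the final patch:

``` rust
use std::collections::VecDeque;
use std::time::Instant;

/// `start` stands in for `self.bucket(0)` and `span` for `self.delta(until)`,
/// both computed once by the caller. First loop: from the current bucket to
/// the end of the range (or the end of the wheel, whichever comes first).
/// Second loop: only the part of the range that wrapped around, clamped so
/// we never scan past `start` again.
fn take_next_in_range<T>(
    items: &mut [VecDeque<(Instant, T)>],
    start: usize,
    span: usize,
    until: Instant,
) -> Option<T> {
    let len = items.len();
    let end = (start + span + 1).min(len);
    for bucket in &mut items[start..end] {
        if bucket.front().is_some_and(|e| e.0 <= until) {
            return bucket.pop_front().map(|e| e.1);
        }
    }
    let wrapped = (start + span + 1).saturating_sub(len).min(start);
    for bucket in &mut items[..wrapped] {
        if bucket.front().is_some_and(|e| e.0 <= until) {
            return bucket.pop_front().map(|e| e.1);
        }
    }
    None
}
```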

@martinthomson (Member)

Oh, and I should have said: this is an obvious win, so it's worth doing. I just want to make sure that we get the whole win.

@mxinden (Collaborator, Author)

Thank you @martinthomson for pointing out the cost of `%`! The updated version does not use `%` at all.

@larseggert (Collaborator)

I used https://github.com/wahern/timeout in quant, but I'm not sure there is a Rust crate. Someone must have surely implemented a timing wheel we can use?


codecov bot commented Mar 29, 2024

Codecov Report

Attention: Patch coverage is 90.00000%, with 2 lines in your changes missing coverage. Please review.

Project coverage is 93.06%. Comparing base (3151adc) to head (20f6880).

Files Patch % Lines
neqo-common/src/timer.rs 90.00% 2 Missing ⚠️
Additional details and impacted files
```
@@           Coverage Diff           @@
##             main    #1780   +/-   ##
=======================================
  Coverage   93.05%   93.06%           
=======================================
  Files         117      117           
  Lines       36368    36370    +2     
=======================================
+ Hits        33843    33847    +4     
+ Misses       2525     2523    -2     
```


github-actions bot commented Mar 29, 2024

Benchmark results

Performance differences relative to 3151adc.

  • drain a timer quickly time: [696.22 ns 702.27 ns 707.77 ns]
    change: [+79.201% +81.607% +83.963%] (p = 0.00 < 0.05)
    💔 Performance has regressed.

  • drain an empty timer time: [263.51 ns 267.90 ns 271.44 ns]

  • coalesce_acked_from_zero 1+1 entries
    time: [195.71 ns 196.12 ns 196.57 ns]
    change: [-1.7953% -1.3489% -0.8314%] (p = 0.00 < 0.05)
    Change within noise threshold.

  • coalesce_acked_from_zero 3+1 entries
    time: [235.68 ns 236.21 ns 236.79 ns]
    change: [-0.7428% -0.2657% +0.2582%] (p = 0.31 > 0.05)
    No change in performance detected.

  • coalesce_acked_from_zero 10+1 entries
    time: [235.11 ns 235.93 ns 236.89 ns]
    change: [-2.1751% -0.7135% +0.2815%] (p = 0.33 > 0.05)
    No change in performance detected.

  • coalesce_acked_from_zero 1000+1 entries
    time: [216.54 ns 216.68 ns 216.84 ns]
    change: [-0.8359% -0.2196% +0.4295%] (p = 0.51 > 0.05)
    No change in performance detected.

  • RxStreamOrderer::inbound_frame()
    time: [118.76 ms 118.84 ms 118.94 ms]
    change: [-0.6228% -0.5300% -0.4384%] (p = 0.00 < 0.05)
    Change within noise threshold.

  • transfer/Run multiple transfers with varying seeds
    time: [116.35 ms 116.62 ms 116.89 ms]
    thrpt: [34.222 MiB/s 34.300 MiB/s 34.379 MiB/s]
    change:
    time: [-2.3005% -1.9871% -1.6738%] (p = 0.00 < 0.05)
    thrpt: [+1.7023% +2.0273% +2.3547%]
    Change within noise threshold.

  • transfer/Run multiple transfers with the same seed
    time: [117.41 ms 117.57 ms 117.73 ms]
    thrpt: [33.977 MiB/s 34.022 MiB/s 34.068 MiB/s]
    change:
    time: [-2.0995% -1.9254% -1.7361%] (p = 0.00 < 0.05)
    thrpt: [+1.7667% +1.9632% +2.1446%]
    Change within noise threshold.

  • 1-conn/1-100mb-resp (aka. Download)/client
    time: [1.0895 s 1.1214 s 1.1571 s]
    thrpt: [86.425 MiB/s 89.171 MiB/s 91.784 MiB/s]
    change:
    time: [+0.5171% +3.6938% +7.1410%] (p = 0.06 > 0.05)
    thrpt: [-6.6651% -3.5622% -0.5144%]
    No change in performance detected.

  • 1-conn/10_000-1b-seq-resp (aka. RPS)/client
    time: [422.32 ms 424.27 ms 426.22 ms]
    thrpt: [23.462 Kelem/s 23.570 Kelem/s 23.679 Kelem/s]
    change:
    time: [-1.3911% -0.7764% -0.1021%] (p = 0.02 < 0.05)
    thrpt: [+0.1022% +0.7825% +1.4108%]
    Change within noise threshold.

  • 100-seq-conn/1-1b-resp (aka. HPS)/client
    time: [3.3736 s 3.3770 s 3.3805 s]
    thrpt: [29.582 elem/s 29.612 elem/s 29.642 elem/s]
    change:
    time: [-0.9045% -0.7738% -0.6497%] (p = 0.00 < 0.05)
    thrpt: [+0.6540% +0.7798% +0.9128%]
    Change within noise threshold.

Client/server transfer results

Transfer of 134217728 bytes over loopback.

| Client | Server | CC    | Pacing | Mean [ms]      | Min [ms] | Max [ms] | Relative |
|--------|--------|-------|--------|----------------|----------|----------|----------|
| msquic | msquic |       |        | 437.6 ± 51.0   | 366.1    | 545.3    | 1.00     |
| neqo   | msquic | reno  | on     | 1932.9 ± 53.0  | 1879.0   | 2069.8   | 1.00     |
| neqo   | msquic | reno  |        | 1900.3 ± 21.7  | 1871.9   | 1937.8   | 1.00     |
| neqo   | msquic | cubic | on     | 1899.8 ± 46.7  | 1778.6   | 1936.8   | 1.00     |
| neqo   | msquic | cubic |        | 1883.9 ± 27.9  | 1819.6   | 1908.4   | 1.00     |
| msquic | neqo   | reno  | on     | 4091.0 ± 401.3 | 3525.4   | 4738.7   | 1.00     |
| msquic | neqo   | reno  |        | 4058.4 ± 416.3 | 3440.2   | 4754.3   | 1.00     |
| msquic | neqo   | cubic | on     | 4106.9 ± 409.8 | 3589.1   | 4730.5   | 1.00     |
| msquic | neqo   | cubic |        | 4095.2 ± 421.3 | 3464.3   | 4787.9   | 1.00     |
| neqo   | neqo   | reno  | on     | 3281.7 ± 288.2 | 2860.8   | 3862.9   | 1.00     |
| neqo   | neqo   | reno  |        | 3358.6 ± 365.2 | 2849.5   | 3854.1   | 1.00     |
| neqo   | neqo   | cubic | on     | 3892.3 ± 425.6 | 3169.0   | 4494.6   | 1.00     |
| neqo   | neqo   | cubic |        | 3869.4 ± 442.1 | 3319.5   | 4486.8   | 1.00     |


mxinden commented Apr 1, 2024

I updated the pull request to not use `%` at all. Still, it regresses on the `drain a timer quickly` benchmark.

  • drain a timer quickly time: [696.22 ns 702.27 ns 707.77 ns]
    change: [+79.201% +81.607% +83.963%] (p = 0.00 < 0.05)
    💔 Performance has regressed.

I assume this is due to the additional division in `Timer::delta`, now needed to find the subset of `self.items` to iterate.

``` rust
    /// For the given `time`, get the number of whole buckets in the future that is.
    #[inline]
    #[allow(clippy::cast_possible_truncation)] // guarded by assertion
    fn delta(&self, time: Instant) -> usize {
        // This really should use Duration::div_duration_f??(), but it can't yet.
        ((time - self.now).as_nanos() / self.granularity.as_nanos()) as usize
    }
```

Note, though, that the `drain a timer quickly` benchmark only checks existing timers, i.e. it never benefits from the optimization of this pull request, and that its wheel is tiny (100 slots) compared to the `Timer` instance used in `server.rs` (16_384 slots). Thus the additional division has an artificially large impact in this benchmark.

I added an additional benchmark, `drain an empty timer`, which shows a significant improvement compared to `main`:

```
$ neqo git:(main) ✗ critcmp main timer
group                                          main                                     timer
-----                                          ----                                     -----
drain an empty timer                           1.57    40.6±11.14µs        ? ?/sec      1.00     25.8±9.07µs        ? ?/sec
```


mxinden commented Apr 1, 2024

All that said, once we have optimized `Timer::take_next`, we still spend a significant portion of CPU time in `Timer::next_time`.

``` rust
    /// Return a reference to the time of the next entry.
    #[must_use]
    pub fn next_time(&self) -> Option<Instant> {
```

See e.g. this flamegraph off of `main`, with 4.77% spent in `next_time` and 18.86% in `take_next`:

(flamegraph attachment)


mxinden commented Apr 1, 2024

`Timer` is only used in `neqo_transport::server`. And if I understand correctly, we only ever insert a single timeout per connection.

Do I understand correctly that the main goal of `neqo-server` is to be able to test `neqo-client`, and thus neqo as it is used in Firefox? If so, isn't the 16_384-slot timer wheel overkill, and would a simple `std::collections::BTreeMap` (or the like) suffice? Thoughts, @martinthomson and @larseggert?


> I used https://github.com/wahern/timeout in quant, but I'm not sure there is a Rust crate. Someone must have surely implemented a timing wheel we can use?

There are timer wheel crates, though none stands out to me as having wide adoption. In addition, if feasible, I would much rather do without the additional dependency and instead just wrap `std::collections::BTreeMap`.
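
A minimal sketch of what such a `BTreeMap` wrapper could look like (the type and method names are hypothetical, not an existing neqo or crate API):

``` rust
use std::collections::BTreeMap;
use std::time::Instant;

/// Hypothetical timer queue backed by a BTreeMap keyed by deadline.
/// Multiple items may share a deadline, hence the Vec value.
struct BTreeTimer<T> {
    items: BTreeMap<Instant, Vec<T>>,
}

impl<T> BTreeTimer<T> {
    fn new() -> Self {
        Self { items: BTreeMap::new() }
    }

    fn add(&mut self, time: Instant, item: T) {
        self.items.entry(time).or_default().push(item);
    }

    /// Time of the next entry, if any.
    fn next_time(&self) -> Option<Instant> {
        self.items.keys().next().copied()
    }

    /// Take an item whose deadline is at or before `until`, if any.
    fn take_next(&mut self, until: Instant) -> Option<T> {
        // Peek at the earliest deadline without removing it.
        let (&time, _) = self.items.first_key_value()?;
        if time > until {
            return None;
        }
        // Remove the whole bucket, take one item, and re-insert the rest.
        let mut bucket = self.items.remove(&time)?;
        let item = bucket.pop();
        if !bucket.is_empty() {
            self.items.insert(time, bucket);
        }
        item
    }
}
```

With this shape, both `next_time` and `take_next` are O(log n) in the number of pending timeouts rather than a scan over 16_384 buckets, which seems sufficient for a server with a handful of connections.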

mxinden added a commit to mxinden/neqo that referenced this pull request Apr 1, 2024
The current `neqo_transport::server::Server::timers` has a large performance
overhead, especially when serving a small number of connections. See
mozilla#1780 for details.

This commit optimizes for the small-number-of-connections case, keeping only a
single callback timestamp and iterating each connection when there is no other
work to be done.
mxinden added further commits to mxinden/neqo that referenced this pull request on Apr 1 and Apr 2, 2024, all with the same message.
github-merge-queue bot pushed a commit that referenced this pull request Apr 4, 2024
* perf(transport): remove Server::timers

The current `neqo_transport::server::Server::timers` has a large performance
overhead, especially when serving a small number of connections. See
#1780 for details.

This commit optimizes for the small-number-of-connections case, keeping only a
single callback timestamp and iterating each connection when there is no other
work to be done.

* Cleanups

* Rename to wake_at

* Introduce ServerConnectionState::{set_wake_at,needs_waking,woken}

mxinden commented Apr 4, 2024

Closing here since `neqo_common::timer` was removed with #1784.

Thank you for the help!

@mxinden closed this Apr 4, 2024