perf(timer): check subset of self.items #1780
Conversation
`Timer::take_next` iterates each `VecDeque` in the `Vec` `self.items`. It then checks the first item in those `VecDeque`s. Under the assumption that all items in `self.items[t]` are smaller than all items in `self.items[t+1]`, only a subset of `self.items` needs to be iterated, namely only from `self.cursor` to `self.delta(until)`. This commit changes `take_next` to only check this subset.

---

Why is `Timer::take_next`'s performance relevant?

Whenever `Server::process_next_output` has no more other work to do, it checks for expired timers.

https://github.com/mozilla/neqo/blob/3151adc53e71273eed1319114380119c70e169a2/neqo-transport/src/server.rs#L650

A `Server` has at most one item per connection in `Timer`. Thus, a `Server` with a single connection has a single item in `Timer` total.

The `Timer` timer wheel has 16_384 slots.

https://github.com/mozilla/neqo/blob/3151adc53e71273eed1319114380119c70e169a2/neqo-transport/src/server.rs#L55

Thus whenever `Server::process_next_output` has no more other work to do, it iterates a `Vec` of length `16_384`, only to find at most one timer, which might or might not be expired.

This shows up in CPU profiles with up to 33%. See e.g. https://github.com/mozilla/neqo/actions/runs/8452074231/artifacts/1363138571.

Note that the profiles do not always show `take_next` as it is oftentimes inlined. Add `#[inline(never)]` to make sure it isn't.

```diff
modified   neqo-common/src/timer.rs
@@ -193,6 +193,7 @@ impl<T> Timer<T> {
 
     /// Take the next item, unless there are no items with
     /// a timeout in the past relative to `until`.
+    #[inline(never)]
     pub fn take_next(&mut self, until: Instant) -> Option<T> {
```

Arguably a 16_384 slot timer wheel is overkill for a single timer. Maybe, to cover the use-case of a `Server` with a small number of connections, a hierarchical timer wheel is helpful?
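For illustration, here is a minimal, self-contained sketch of that subset check. It is not neqo's actual implementation: `items`, `cursor`, `delta`, and `take_next` mirror the names used in this PR, while the `start`/`granularity` fields and the body of `delta` are simplified assumptions.

```rust
// Illustrative sketch only, not neqo's actual code.
use std::collections::VecDeque;
use std::time::{Duration, Instant};

struct Timer<T> {
    start: Instant,                     // instant the cursor's slot corresponds to (assumed field)
    granularity: Duration,              // time span covered by one slot (assumed field)
    items: Vec<VecDeque<(Instant, T)>>, // one VecDeque per slot, e.g. 16_384 of them
    cursor: usize,                      // index of the slot holding the earliest timeouts
}

impl<T> Timer<T> {
    /// Number of slots, starting at the cursor, that can hold timeouts at or before `until`.
    fn delta(&self, until: Instant) -> usize {
        let elapsed = until.saturating_duration_since(self.start);
        // +1 so the cursor's own slot is always inspected; capped at the wheel size.
        ((elapsed.as_micros() / self.granularity.as_micros()) as usize + 1).min(self.items.len())
    }

    /// Pop the first expired item, inspecting only `delta(until)` slots from the
    /// cursor instead of scanning the whole wheel.
    fn take_next(&mut self, until: Instant) -> Option<T> {
        for i in self.cursor..(self.cursor + self.delta(until)) {
            let i = i % self.items.len(); // wrap around the end of the wheel
            if self.items[i].front().map_or(false, |(t, _)| *t <= until) {
                return self.items[i].pop_front().map(|(_, item)| item);
            }
        }
        None
    }
}
```

With a single pending timer and an `until` close to the cursor's time, `delta(until)` is small, so only a handful of slots are inspected instead of all 16_384.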
neqo-common/src/timer.rs (Outdated)
```rust
let res = maybe_take(&mut self.items[i], until);
for i in self.cursor..(self.cursor + self.delta(until)) {
    let i = i % self.items.len();
    let res = maybe_take(&mut self.items[i], until)?;
```
If you do this, then the function can be inlined.
However, this will regress one aspect of performance to save another. The reason for two loops is to avoid the modulus operation inside the loop. I think that what you want is more complex yet.
You want the first loop to be `self.bucket(0)..min(self.items.len(), self.bucket(0) + self.delta(until))` and the second to be `0..(self.bucket(0) + self.delta(until) - self.items.len()).clamp(0, self.bucket(0))` ... I think. With appropriate refactoring so that calls to `bucket()` and `delta()` only run once.
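To make the suggestion concrete, here is a hedged sketch of that two-loop structure as a free function. The function name and parameters are illustrative rather than neqo's API: `start` stands for `bucket(0)` and `span` for `delta(until)`, each computed once by the caller.

```rust
// Hedged sketch of the two-loop structure; names and signature are illustrative.
use std::cmp::min;
use std::collections::VecDeque;
use std::time::Instant;

fn take_next_two_loops<T>(
    items: &mut [VecDeque<(Instant, T)>],
    start: usize, // bucket(0): the cursor's slot
    span: usize,  // delta(until): number of slots to inspect
    until: Instant,
) -> Option<T> {
    let maybe_take = |slot: &mut VecDeque<(Instant, T)>| -> Option<T> {
        if slot.front().map_or(false, |(t, _)| *t <= until) {
            slot.pop_front().map(|(_, item)| item)
        } else {
            None
        }
    };
    let end = start + span;

    // First loop: from the cursor's slot up to the end of the range or the end
    // of the slice, whichever comes first. No modulus needed.
    for i in start..min(items.len(), end) {
        if let Some(item) = maybe_take(&mut items[i]) {
            return Some(item);
        }
    }
    // Second loop: the wrapped-around prefix at the front of the slice; empty
    // when the range does not wrap, and never reaching the cursor again.
    for i in 0..end.saturating_sub(items.len()).clamp(0, start) {
        if let Some(item) = maybe_take(&mut items[i]) {
            return Some(item);
        }
    }
    None
}
```

The first loop stops at the end of the `Vec`, the second covers only the wrapped-around prefix, so no iteration needs a `%`.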
Oh, and I should have said: this is an obvious win, so it's worth doing. I just want to make sure that we get the whole win.
Thank you @martinthomson for hinting at `%` complexity! The updated version does not use `%` at all.
I used https://github.com/wahern/timeout in quant, but I'm not sure there is a Rust crate. Someone must have surely implemented a timing wheel we can use?
Codecov Report

Attention: Patch coverage is …

Additional details and impacted files

```
@@           Coverage Diff           @@
##             main    #1780   +/-   ##
=======================================
  Coverage   93.05%   93.06%
=======================================
  Files         117      117
  Lines       36368    36370     +2
=======================================
+ Hits        33843    33847     +4
+ Misses       2525     2523     -2
```

☔ View full report in Codecov by Sentry.
Benchmark results

Performance differences relative to 3151adc.
Client/server transfer results

Transfer of 134217728 bytes over loopback.
I updated the pull request to do no `%` operation at all.

I assume this is due to the additional division in lines 83 to 89 in 3151adc.

Though note that the `drain a timer quickly` benchmark only checks existing timers, i.e. it never makes use of the optimization of this pull request, and that it is tiny (…). I added an additional benchmark, `drain an empty timer`, which shows a significant improvement compared to …
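For context, a hedged sketch of what a `drain an empty timer` benchmark could look like with criterion. The constructor signature `Timer::new(now, granularity, capacity)` and the `neqo_common::timer` module path are assumptions here; the actual API may differ.

```rust
// Hedged sketch of a "drain an empty timer" benchmark; the constructor
// signature and module path are assumptions, adjust to the real API.
use std::time::{Duration, Instant};

use criterion::{criterion_group, criterion_main, Criterion};
use neqo_common::timer::Timer;

fn drain_an_empty_timer(c: &mut Criterion) {
    c.bench_function("drain an empty timer", |b| {
        let now = Instant::now();
        // Many slots, none holding an item: `take_next` must conclude quickly
        // that there is nothing to take.
        let mut timer: Timer<()> = Timer::new(now, Duration::from_millis(4), 16_384);
        b.iter(|| assert!(timer.take_next(now).is_none()));
    });
}

criterion_group!(benches, drain_an_empty_timer);
criterion_main!(benches);
```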
All that said, once we optimized lines 56 to 58 in 3151adc …

See e.g. this flamegraph off of …
Do I understand correctly that the main goal of …
There are timer wheel crates, though none that stand out to me with large adoption. In addition, if feasible, I would much rather do without the additional dependency, and instead just wrap …
* perf(transport): remove Server::timers

  The current `neqo_transport::server::Server::timers` has a large performance overhead, especially when serving a small number of connections. See #1780 for details. This commit optimizes for the small-number-of-connections case, keeping only a single callback timestamp and iterating each connection when there is no other work to be done.

* Cleanups
* Rename to wake_at
* Introduce ServerConnectionState::{set_wake_at,needs_waking,woken}
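A rough sketch of that single-callback-timestamp idea follows. `set_wake_at`, `needs_waking`, and `woken` are the names mentioned in the commit message; everything else (the types, `next_woken_connection`, `next_wake_at`) is an illustrative stand-in, not neqo's actual server code.

```rust
// Rough sketch of the "single callback timestamp" idea; types and helper
// names are illustrative stand-ins, not neqo's actual server internals.
use std::time::Instant;

struct ServerConnectionState {
    /// When this connection next needs a callback, if at all.
    wake_at: Option<Instant>,
}

impl ServerConnectionState {
    fn set_wake_at(&mut self, at: Instant) {
        self.wake_at = Some(at);
    }

    fn needs_waking(&self, now: Instant) -> bool {
        self.wake_at.map_or(false, |at| at <= now)
    }

    fn woken(&mut self) {
        self.wake_at = None;
    }
}

struct Server {
    connections: Vec<ServerConnectionState>,
}

impl Server {
    /// With no other work to do, scan the (few) connections for one whose
    /// timer has expired, instead of walking a 16_384-slot timer wheel.
    fn next_woken_connection(&mut self, now: Instant) -> Option<&mut ServerConnectionState> {
        self.connections.iter_mut().find(|c| c.needs_waking(now))
    }

    /// The earliest wake-up over all connections, used as the callback time
    /// handed back to the event loop.
    fn next_wake_at(&self) -> Option<Instant> {
        self.connections.iter().filter_map(|c| c.wake_at).min()
    }
}
```

The design trades a large, mostly empty wheel for a linear scan over connections, which is cheap precisely in the small-number-of-connections case the commit targets.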
Closing here since … Thank you for the help!
`Timer::take_next` iterates each `VecDeque` in the `Vec` `self.items`. It then checks the first item in those `VecDeque`s.

Under the assumption that all items in `self.items[t]` are smaller than all items in `self.items[t+1]` (ignoring wrap around), only a subset of `self.items` needs to be iterated, namely only from `self.cursor` to `self.delta(until)`. Is this assumption correct?

This commit changes `take_next` to only check this subset.

---

Why is `Timer::take_next`'s performance relevant?

Whenever `Server::process_next_output` has no more other work to do, it checks for expired timers.

neqo/neqo-transport/src/server.rs, line 650 in 3151adc

A `Server` has at most one item per connection in `Timer`. Thus, a `Server` with a single connection has a single item in `Timer` total.

The `Timer` timer wheel has 16_384 slots.

neqo/neqo-transport/src/server.rs, line 55 in 3151adc

Thus whenever `Server::process_next_output` has no more other work to do, it iterates a `Vec` of length `16_384`, only to find at most one timer, which might or might not be expired.

This shows up in CPU profiles with up to 33%. See e.g. https://github.com/mozilla/neqo/actions/runs/8452074231/artifacts/1363138571. On my local machine, a call to `Timer::take_next` takes between 5 and 10 microseconds.

Note that the profiles do not always show `take_next` as it is oftentimes inlined. Add `#[inline(never)]` to make sure it isn't.

With this patch `Timer::take_next` takes significantly less CPU time on my machine (~2.8%).

Arguably a 16_384 slot timer wheel is overkill for a single timer. Maybe, to cover the use-case of a `Server` with a small number of connections, a hierarchical timer wheel is helpful?
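For completeness, a very rough sketch of what a two-level (hierarchical) wheel could look like. It only shows how an item would be filed into a fine or a coarse level; cascading and draining are omitted, and all names and parameters are illustrative rather than a concrete proposal.

```rust
// Very rough illustration of a two-level wheel; only insertion is shown,
// cascading and draining are omitted. Nothing here is neqo code.
use std::time::{Duration, Instant};

struct HierarchicalWheel<T> {
    start: Instant,
    fine_granularity: Duration,   // e.g. 4 ms per fine slot
    fine: Vec<Vec<(Instant, T)>>, // e.g. 256 slots => ~1 s horizon
    coarse: Vec<Vec<(Instant, T)>>, // e.g. 64 slots of ~1 s each
}

impl<T> HierarchicalWheel<T> {
    fn add(&mut self, time: Instant, item: T) {
        let offset = time.saturating_duration_since(self.start);
        let fine_slot = (offset.as_micros() / self.fine_granularity.as_micros()) as usize;
        if fine_slot < self.fine.len() {
            // Near deadline: goes straight into the fine wheel.
            self.fine[fine_slot].push((time, item));
        } else {
            // Far deadline: file it coarsely; it would be re-filed ("cascaded")
            // into the fine wheel as time advances.
            let coarse_span = self.fine_granularity * self.fine.len() as u32;
            let coarse_slot = ((offset.as_micros() / coarse_span.as_micros()) as usize)
                .min(self.coarse.len() - 1);
            self.coarse[coarse_slot].push((time, item));
        }
    }
}
```

With such a layout, near deadlines stay cheap to find while far deadlines do not force a proportionally larger wheel.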