Decoder improvements #2259
Conversation
Codecov Report
Attention: Patch coverage is

```
@@           Coverage Diff           @@
##             main    #2259   +/-   ##
=======================================
  Coverage   95.39%   95.39%
=======================================
  Files         113      113
  Lines       36683    36695      +12
=======================================
+ Hits        34994    35006      +12
  Misses       1689     1689
```

View full report in Codecov by Sentry.
Failed Interop Tests
QUIC Interop Runner, client vs. server, differences relative to a758177.
- neqo-latest as client
- neqo-latest as server

Succeeded Interop Tests
QUIC Interop Runner, client vs. server.
- neqo-latest as client
- neqo-latest as server

Unsupported Interop Tests
QUIC Interop Runner, client vs. server.
- neqo-latest as client
- neqo-latest as server
Benchmark results
Performance differences relative to a758177.

- decode 4096 bytes, mask ff: 💚 Performance has improved. time: [11.856 µs 11.893 µs 11.937 µs] change: [-25.741% -25.326% -24.930%] (p = 0.00 < 0.05)
- decode 1048576 bytes, mask ff: 💚 Performance has improved. time: [3.0993 ms 3.1093 ms 3.1208 ms] change: [-23.870% -23.518% -23.160%] (p = 0.00 < 0.05)
- decode 4096 bytes, mask 7f: 💚 Performance has improved. time: [19.751 µs 19.789 µs 19.835 µs] change: [-26.973% -26.675% -26.421%] (p = 0.00 < 0.05)
- decode 1048576 bytes, mask 7f: 💚 Performance has improved. time: [5.1705 ms 5.1839 ms 5.1990 ms] change: [-24.796% -24.581% -24.342%] (p = 0.00 < 0.05)
- decode 4096 bytes, mask 3f: 💚 Performance has improved. time: [6.8903 µs 6.9097 µs 6.9356 µs] change: [-44.975% -44.693% -44.422%] (p = 0.00 < 0.05)
- decode 1048576 bytes, mask 3f: 💚 Performance has improved. time: [1.7607 ms 1.7663 ms 1.7732 ms] change: [-44.903% -44.595% -44.293%] (p = 0.00 < 0.05)
- coalesce_acked_from_zero 1+1 entries: Change within noise threshold. time: [105.03 ns 105.62 ns 106.43 ns] change: [+0.7092% +1.3228% +1.9298%] (p = 0.00 < 0.05)
- coalesce_acked_from_zero 3+1 entries: 💚 Performance has improved. time: [120.55 ns 120.94 ns 121.36 ns] change: [-24.446% -21.663% -20.026%] (p = 0.00 < 0.05)
- coalesce_acked_from_zero 10+1 entries: 💚 Performance has improved. time: [120.28 ns 120.76 ns 121.34 ns] change: [-20.945% -20.599% -20.252%] (p = 0.00 < 0.05)
- coalesce_acked_from_zero 1000+1 entries: 💚 Performance has improved. time: [100.42 ns 100.57 ns 100.75 ns] change: [-12.705% -11.977% -11.286%] (p = 0.00 < 0.05)
- RxStreamOrderer::inbound_frame(): Change within noise threshold. time: [110.93 ms 111.00 ms 111.07 ms] change: [-0.4590% -0.3790% -0.2956%] (p = 0.00 < 0.05)
- SentPackets::take_ranges: No change in performance detected. time: [5.4496 µs 5.5427 µs 5.6391 µs] change: [-1.8777% +0.3295% +2.6045%] (p = 0.78 > 0.05)
- transfer/pacing-false/varying-seeds: Change within noise threshold. time: [25.020 ms 26.041 ms 27.042 ms] change: [-11.446% -6.6122% -1.5381%] (p = 0.01 < 0.05)
- transfer/pacing-true/varying-seeds: No change in performance detected. time: [33.858 ms 35.527 ms 37.214 ms] change: [-6.1967% +0.1607% +7.4906%] (p = 0.96 > 0.05)
- transfer/pacing-false/same-seed: 💚 Performance has improved. time: [24.957 ms 25.782 ms 26.604 ms] change: [-11.689% -7.9744% -3.8072%] (p = 0.00 < 0.05)
- transfer/pacing-true/same-seed: No change in performance detected. time: [41.465 ms 43.470 ms 45.482 ms] change: [-5.6490% +0.5411% +7.1607%] (p = 0.87 > 0.05)
- 1-conn/1-100mb-resp/mtu-1504 (aka. Download)/client: No change in performance detected. time: [889.38 ms 898.27 ms 907.33 ms] thrpt: [110.21 MiB/s 111.32 MiB/s 112.44 MiB/s] change: time: [-0.6331% +0.8132% +2.2670%] (p = 0.26 > 0.05) thrpt: [-2.2168% -0.8067% +0.6371%]
- 1-conn/10_000-parallel-1b-resp/mtu-1504 (aka. RPS)/client: No change in performance detected. time: [319.12 ms 322.31 ms 325.51 ms] thrpt: [30.721 Kelem/s 31.026 Kelem/s 31.336 Kelem/s] change: time: [-1.2254% +0.1856% +1.6470%] (p = 0.81 > 0.05) thrpt: [-1.6203% -0.1852% +1.2406%]
- 1-conn/1-1b-resp/mtu-1504 (aka. HPS)/client: No change in performance detected. time: [33.735 ms 33.941 ms 34.165 ms] thrpt: [29.270 elem/s 29.463 elem/s 29.642 elem/s] change: time: [-0.5776% +0.2964% +1.1869%] (p = 0.53 > 0.05) thrpt: [-1.1730% -0.2955% +0.5809%]
- 1-conn/1-100mb-resp/mtu-1504 (aka. Upload)/client: No change in performance detected. time: [1.6844 s 1.7024 s 1.7205 s] thrpt: [58.121 MiB/s 58.741 MiB/s 59.367 MiB/s] change: time: [-2.8664% -1.4719% -0.0296%] (p = 0.05 > 0.05) thrpt: [+0.0296% +1.4939% +2.9510%]

Client/server transfer results
Transfer of 33554432 bytes over loopback.
I had been meaning to take a look at what s2n-quic is doing; I somehow remember them having a pretty optimized implementation for this: https://github.com/aws/s2n-quic/blob/main/quic/s2n-quic-core/src/varint/mod.rs (also Apache-licensed).
Change looks good to me!
I took a look at the s2n implementation, which peeks a byte, then reads a full 1, 2, 4, or 8 bytes. For some reason, the following code is much slower (this PR is 19% faster for a 0xff mask, 27% faster for 0x7f, and 65% faster for 0x3f).

```diff
 pub fn decode_varint(&mut self) -> Option<u64> {
-    let b1 = self.decode_n(1)?;
+    let b1 = self.peek_byte()?;
     match b1 >> 6 {
-        0 => Some(b1),
-        1 => Some((b1 & 0x3f) << 8 | self.decode_n(1)?),
-        2 => Some((b1 & 0x3f) << 24 | self.decode_n(3)?),
-        3 => Some((b1 & 0x3f) << 56 | self.decode_n(7)?),
+        0 => {
+            self.offset += 1;
+            Some(u64::from(b1))
+        }
+        1 => Some(u64::from(self.decode_uint::<u16>()? & 0x3fff)),
+        2 => Some(u64::from(self.decode_uint::<u32>()? & 0x3fff_ffff)),
+        3 => Some(self.decode_uint::<u64>()? & 0x3fff_ffff_ffff_ffff),
         _ => unreachable!(),
     }
 }
```
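For context on what the `b1 >> 6` dispatch above is doing: QUIC variable-length integers (RFC 9000, Section 16) encode the total length (1, 2, 4, or 8 bytes) in the two most significant bits of the first byte, with the remaining bits holding the big-endian value. A minimal standalone sketch of the scheme, assuming a plain byte slice rather than neqo's `Decoder` (the free function below is hypothetical, not neqo's API):

```rust
// Decode one QUIC varint from the front of `buf`.
// Returns the value and the number of bytes consumed, or None if
// the buffer is too short.
fn decode_varint(buf: &[u8]) -> Option<(u64, usize)> {
    let b1 = *buf.first()?;
    // Top two bits select the total length: 0b00 => 1, 0b01 => 2,
    // 0b10 => 4, 0b11 => 8 bytes.
    let len = 1usize << (b1 >> 6);
    let rest = buf.get(1..len)?;
    // Start from the first byte with the two length bits masked off,
    // then accumulate the remaining bytes big-endian.
    let mut v = u64::from(b1 & 0x3f);
    for &b in rest {
        v = (v << 8) | u64::from(b);
    }
    Some((v, len))
}

fn main() {
    // Examples from RFC 9000, Appendix A.1.
    assert_eq!(decode_varint(&[0x25]), Some((37, 1)));
    assert_eq!(decode_varint(&[0x7b, 0xbd]), Some((15293, 2)));
    assert_eq!(
        decode_varint(&[0x9d, 0x7f, 0x3e, 0x7d]),
        Some((494_878_333, 4))
    );
    println!("ok");
}
```

The s2n-style approach discussed above does the same thing, but reads the full 2, 4, or 8 bytes in one fixed-width load after peeking the first byte, rather than byte-by-byte.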
Also wonder if https://docs.rs/zerocopy/latest/zerocopy/ would further help here, but probably not, given that we only ever read a single value.
Signed-off-by: Lars Eggert <lars@eggert.org>
Co-authored-by: Lars Eggert <lars@eggert.org>
Signed-off-by: Martin Thomson <mt@lowentropy.net>
This started out as an attempt to make the decoder API slightly easier to use. The `decode_uint()` function now takes a type as an argument, so if you want 1 byte, you say `let x: u8 = dec.decode_uint()` rather than `let x = dec.decode_uint(1)`. This is nice because it eliminates code like this: `let version = WireVersion::try_from(dec.decode_uint(4)?)?` (or with `.unwrap()` calls). Now those are often just `let version = dec.decode_uint()`, or at least `let version: WireVersion = dec.decode_uint()`. To that end, I've eliminated `decode_byte()` as well.

The big news is that the benchmark shows a massive improvement in varint decoding performance. I was surprised at how big an improvement it was.
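The shape of that API change can be illustrated with a small self-contained sketch. Note this is a hypothetical simplification, not neqo's actual `Decoder`: the struct, field names, and the `TryFrom<u64>` bound here are assumptions made for the example.

```rust
// A toy decoder where the *requested type* determines how many
// bytes are read, instead of an explicit length argument.
struct Decoder<'a> {
    buf: &'a [u8],
    offset: usize,
}

impl<'a> Decoder<'a> {
    fn new(buf: &'a [u8]) -> Self {
        Self { buf, offset: 0 }
    }

    // `size_of::<T>()` fixes the read length, so callers write
    // `let x: u16 = dec.decode_uint()?` and the 2-byte read is implied.
    fn decode_uint<T: TryFrom<u64>>(&mut self) -> Option<T> {
        let n = std::mem::size_of::<T>();
        let bytes = self.buf.get(self.offset..self.offset + n)?;
        self.offset += n;
        // Accumulate big-endian, then narrow to the requested type.
        let mut v = 0u64;
        for &b in bytes {
            v = (v << 8) | u64::from(b);
        }
        T::try_from(v).ok()
    }
}

fn main() {
    let mut dec = Decoder::new(&[0x00, 0x01, 0xab, 0xcd]);
    let version: u16 = dec.decode_uint().unwrap(); // reads 2 bytes
    assert_eq!(version, 1);
    let tail: u16 = dec.decode_uint().unwrap(); // reads the next 2 bytes
    assert_eq!(tail, 0xabcd);
    println!("ok");
}
```

The appeal of this style is that the length and the destination type can no longer disagree, which is exactly the class of mismatch the old `decode_uint(4)` plus `try_from` pattern allowed.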
Performance improvements