Decoder improvements #2259

Merged on Dec 8, 2024 (14 commits)


martinthomson
Member

This started out as an attempt to make the decoder API slightly easier to use.

The `decode_uint()` function now takes a type as an argument, so if you want 1 byte, you say `let x: u8 = dec.decode_uint()` rather than `let x = dec.decode_uint(1)`. This is nice because it eliminates code like this: `let version = WireVersion::try_from(dec.decode_uint(4)?)?` (or with `.unwrap()` calls). Now those are often just `let version = dec.decode_uint()`, or at least `let version: WireVersion = dec.decode_uint()`.

To that end, I've eliminated decode_byte() as well.
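To illustrate the shape of the API described above, here is a minimal sketch of a type-driven `decode_uint()`. It is not the neqo implementation: the `Decoder` struct, its fields, and the `TryFrom<u64>` bound are assumptions made for this example, and the byte width is simply taken from `size_of::<T>()`.

```rust
use std::mem::size_of;

/// Hypothetical stand-in for the decoder discussed above (not the neqo type).
pub struct Decoder<'a> {
    buf: &'a [u8],
    offset: usize,
}

impl<'a> Decoder<'a> {
    pub fn new(buf: &'a [u8]) -> Self {
        Self { buf, offset: 0 }
    }

    /// Decode a big-endian unsigned integer whose width comes from `T`.
    /// Returns `None` if too few bytes remain or the value does not fit in `T`.
    pub fn decode_uint<T: TryFrom<u64>>(&mut self) -> Option<T> {
        let n = size_of::<T>();
        if n > size_of::<u64>() {
            // This sketch accumulates into a u64, so wider types are rejected.
            return None;
        }
        let end = self.offset.checked_add(n)?;
        let bytes = self.buf.get(self.offset..end)?;
        let value = bytes
            .iter()
            .fold(0u64, |acc, &b| (acc << 8) | u64::from(b));
        // Only advance once the read is known to succeed.
        self.offset = end;
        T::try_from(value).ok()
    }
}
```

With this shape, the caller picks the width through the type annotation, for example `let x: u8 = dec.decode_uint()?;` or `dec.decode_uint::<u32>()?`, instead of passing a byte count.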

The big news is that the benchmark shows a massive improvement for varint decoding performance. I was surprised at how big of an improvement it was.

Performance improvements
decode 1024 bytes, mask ff
                        time:   [1.3609 µs 1.3636 µs 1.3666 µs]
                        change: [-23.308% -23.055% -22.813%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) high mild
  1 (1.00%) high severe

decode 32768 bytes, mask ff
                        time:   [43.383 µs 43.487 µs 43.609 µs]
                        change: [-23.465% -23.173% -22.899%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) low severe
  1 (1.00%) low mild
  3 (3.00%) high mild
  2 (2.00%) high severe

decode 1048576 bytes, mask ff
                        time:   [1.3416 ms 1.3448 ms 1.3483 ms]
                        change: [-24.853% -24.593% -24.334%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) high mild
  2 (2.00%) high severe

decode 1024 bytes, mask 7f
                        time:   [2.4318 µs 2.4380 µs 2.4451 µs]
                        change: [-26.561% -26.237% -25.855%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  3 (3.00%) high mild
  7 (7.00%) high severe

decode 32768 bytes, mask 7f
                        time:   [77.761 µs 77.907 µs 78.068 µs]
                        change: [-26.971% -26.604% -26.251%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  3 (3.00%) low severe
  4 (4.00%) high mild

decode 1048576 bytes, mask 7f
                        time:   [2.3282 ms 2.3358 ms 2.3438 ms]
                        change: [-28.019% -27.685% -27.356%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild

decode 1024 bytes, mask 3f
                        time:   [574.52 ns 578.26 ns 582.69 ns]
                        change: [-66.618% -66.344% -66.053%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  3 (3.00%) high mild
  5 (5.00%) high severe

decode 32768 bytes, mask 3f
                        time:   [18.192 µs 18.257 µs 18.329 µs]
                        change: [-66.335% -66.097% -65.844%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  4 (4.00%) high mild
  3 (3.00%) high severe

decode 1048576 bytes, mask 3f
                        time:   [579.71 µs 581.40 µs 583.33 µs]
                        change: [-60.449% -60.182% -59.911%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
  7 (7.00%) high mild
  4 (4.00%) high severe


codecov bot commented Nov 29, 2024

Codecov Report

Attention: Patch coverage is 98.61111% with 1 line in your changes missing coverage. Please review.

Project coverage is 95.39%. Comparing base (a758177) to head (9c419b9).
Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
neqo-transport/src/connection/mod.rs 75.00% 1 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main    #2259   +/-   ##
=======================================
  Coverage   95.39%   95.39%           
=======================================
  Files         113      113           
  Lines       36683    36695   +12     
=======================================
+ Hits        34994    35006   +12     
  Misses       1689     1689           



github-actions bot commented Nov 29, 2024

Failed Interop Tests

QUIC Interop Runner, client vs. server, differences relative to a758177.

neqo-latest as client

neqo-latest as server

All results

Succeeded Interop Tests

QUIC Interop Runner, client vs. server

neqo-latest as client

neqo-latest as server

Unsupported Interop Tests

QUIC Interop Runner, client vs. server

neqo-latest as client

neqo-latest as server


github-actions bot commented Nov 29, 2024

Benchmark results

Performance differences relative to a758177.

decode 4096 bytes, mask ff: 💚 Performance has improved.
       time:   [11.856 µs 11.893 µs 11.937 µs]
       change: [-25.741% -25.326% -24.930%] (p = 0.00 < 0.05)

Found 19 outliers among 100 measurements (19.00%)
2 (2.00%) low severe
4 (4.00%) low mild
2 (2.00%) high mild
11 (11.00%) high severe

decode 1048576 bytes, mask ff: 💚 Performance has improved.
       time:   [3.0993 ms 3.1093 ms 3.1208 ms]
       change: [-23.870% -23.518% -23.160%] (p = 0.00 < 0.05)

Found 10 outliers among 100 measurements (10.00%)
10 (10.00%) high severe

decode 4096 bytes, mask 7f: 💚 Performance has improved.
       time:   [19.751 µs 19.789 µs 19.835 µs]
       change: [-26.973% -26.675% -26.421%] (p = 0.00 < 0.05)

Found 13 outliers among 100 measurements (13.00%)
2 (2.00%) low severe
1 (1.00%) low mild
1 (1.00%) high mild
9 (9.00%) high severe

decode 1048576 bytes, mask 7f: 💚 Performance has improved.
       time:   [5.1705 ms 5.1839 ms 5.1990 ms]
       change: [-24.796% -24.581% -24.342%] (p = 0.00 < 0.05)

Found 16 outliers among 100 measurements (16.00%)
16 (16.00%) high severe

decode 4096 bytes, mask 3f: 💚 Performance has improved.
       time:   [6.8903 µs 6.9097 µs 6.9356 µs]
       change: [-44.975% -44.693% -44.422%] (p = 0.00 < 0.05)

Found 15 outliers among 100 measurements (15.00%)
8 (8.00%) low mild
2 (2.00%) high mild
5 (5.00%) high severe

decode 1048576 bytes, mask 3f: 💚 Performance has improved.
       time:   [1.7607 ms 1.7663 ms 1.7732 ms]
       change: [-44.903% -44.595% -44.293%] (p = 0.00 < 0.05)

Found 9 outliers among 100 measurements (9.00%)
2 (2.00%) high mild
7 (7.00%) high severe

coalesce_acked_from_zero 1+1 entries: Change within noise threshold.
       time:   [105.03 ns 105.62 ns 106.43 ns]
       change: [+0.7092% +1.3228% +1.9298%] (p = 0.00 < 0.05)

Found 10 outliers among 100 measurements (10.00%)
5 (5.00%) high mild
5 (5.00%) high severe

coalesce_acked_from_zero 3+1 entries: 💚 Performance has improved.
       time:   [120.55 ns 120.94 ns 121.36 ns]
       change: [-24.446% -21.663% -20.026%] (p = 0.00 < 0.05)

Found 14 outliers among 100 measurements (14.00%)
3 (3.00%) low mild
11 (11.00%) high severe

coalesce_acked_from_zero 10+1 entries: 💚 Performance has improved.
       time:   [120.28 ns 120.76 ns 121.34 ns]
       change: [-20.945% -20.599% -20.252%] (p = 0.00 < 0.05)

Found 14 outliers among 100 measurements (14.00%)
5 (5.00%) low mild
3 (3.00%) high mild
6 (6.00%) high severe

coalesce_acked_from_zero 1000+1 entries: 💚 Performance has improved.
       time:   [100.42 ns 100.57 ns 100.75 ns]
       change: [-12.705% -11.977% -11.286%] (p = 0.00 < 0.05)

Found 9 outliers among 100 measurements (9.00%)
4 (4.00%) high mild
5 (5.00%) high severe

RxStreamOrderer::inbound_frame(): Change within noise threshold.
       time:   [110.93 ms 111.00 ms 111.07 ms]
       change: [-0.4590% -0.3790% -0.2956%] (p = 0.00 < 0.05)

Found 21 outliers among 100 measurements (21.00%)
10 (10.00%) low mild
7 (7.00%) high mild
4 (4.00%) high severe

SentPackets::take_ranges: No change in performance detected.
       time:   [5.4496 µs 5.5427 µs 5.6391 µs]
       change: [-1.8777% +0.3295% +2.6045%] (p = 0.78 > 0.05)

Found 4 outliers among 100 measurements (4.00%)
4 (4.00%) high mild

transfer/pacing-false/varying-seeds: Change within noise threshold.
       time:   [25.020 ms 26.041 ms 27.042 ms]
       change: [-11.446% -6.6122% -1.5381%] (p = 0.01 < 0.05)

Found 1 outliers among 100 measurements (1.00%)
1 (1.00%) low mild

transfer/pacing-true/varying-seeds: No change in performance detected.
       time:   [33.858 ms 35.527 ms 37.214 ms]
       change: [-6.1967% +0.1607% +7.4906%] (p = 0.96 > 0.05)

Found 2 outliers among 100 measurements (2.00%)
1 (1.00%) low mild
1 (1.00%) high mild

transfer/pacing-false/same-seed: 💚 Performance has improved.
       time:   [24.957 ms 25.782 ms 26.604 ms]
       change: [-11.689% -7.9744% -3.8072%] (p = 0.00 < 0.05)
transfer/pacing-true/same-seed: No change in performance detected.
       time:   [41.465 ms 43.470 ms 45.482 ms]
       change: [-5.6490% +0.5411% +7.1607%] (p = 0.87 > 0.05)
1-conn/1-100mb-resp/mtu-1504 (aka. Download)/client: No change in performance detected.
       time:   [889.38 ms 898.27 ms 907.33 ms]
       thrpt:  [110.21 MiB/s 111.32 MiB/s 112.44 MiB/s]
change:
       time:   [-0.6331% +0.8132% +2.2670%] (p = 0.26 > 0.05)
       thrpt:  [-2.2168% -0.8067% +0.6371%]
1-conn/10_000-parallel-1b-resp/mtu-1504 (aka. RPS)/client: No change in performance detected.
       time:   [319.12 ms 322.31 ms 325.51 ms]
       thrpt:  [30.721 Kelem/s 31.026 Kelem/s 31.336 Kelem/s]
change:
       time:   [-1.2254% +0.1856% +1.6470%] (p = 0.81 > 0.05)
       thrpt:  [-1.6203% -0.1852% +1.2406%]
1-conn/1-1b-resp/mtu-1504 (aka. HPS)/client: No change in performance detected.
       time:   [33.735 ms 33.941 ms 34.165 ms]
       thrpt:  [29.270  elem/s 29.463  elem/s 29.642  elem/s]
change:
       time:   [-0.5776% +0.2964% +1.1869%] (p = 0.53 > 0.05)
       thrpt:  [-1.1730% -0.2955% +0.5809%]

Found 13 outliers among 100 measurements (13.00%)
5 (5.00%) low mild
3 (3.00%) high mild
5 (5.00%) high severe

1-conn/1-100mb-resp/mtu-1504 (aka. Upload)/client: No change in performance detected.
       time:   [1.6844 s 1.7024 s 1.7205 s]
       thrpt:  [58.121 MiB/s 58.741 MiB/s 59.367 MiB/s]
change:
       time:   [-2.8664% -1.4719% -0.0296%] (p = 0.05 > 0.05)
       thrpt:  [+0.0296% +1.4939% +2.9510%]

Client/server transfer results

Transfer of 33554432 bytes over loopback.

| Client | Server | CC | Pacing | MTU | Mean [ms] | Min [ms] | Max [ms] |
|---|---|---|---|---|---|---|---|
| gquiche | gquiche | | | 1504 | 583.6 ± 92.4 | 502.7 | 761.5 |
| neqo | gquiche | reno | on | 1504 | 800.1 ± 87.9 | 744.0 | 979.7 |
| neqo | gquiche | reno | | 1504 | 769.5 ± 17.9 | 748.7 | 809.8 |
| neqo | gquiche | cubic | on | 1504 | 826.2 ± 81.1 | 767.1 | 973.3 |
| neqo | gquiche | cubic | | 1504 | 844.0 ± 104.2 | 760.7 | 1016.0 |
| msquic | msquic | | | 1504 | 164.8 ± 83.1 | 94.4 | 365.7 |
| neqo | msquic | reno | on | 1504 | 222.2 ± 9.7 | 202.1 | 235.9 |
| neqo | msquic | reno | | 1504 | 310.4 ± 98.3 | 211.0 | 459.1 |
| neqo | msquic | cubic | on | 1504 | 268.3 ± 89.4 | 204.0 | 442.8 |
| neqo | msquic | cubic | | 1504 | 263.1 ± 111.7 | 215.1 | 630.9 |
| gquiche | neqo | reno | on | 1504 | 737.4 ± 182.8 | 560.2 | 1181.1 |
| gquiche | neqo | reno | | 1504 | 705.4 ± 113.0 | 553.4 | 883.5 |
| gquiche | neqo | cubic | on | 1504 | 692.2 ± 86.2 | 560.4 | 809.2 |
| gquiche | neqo | cubic | | 1504 | 748.4 ± 137.0 | 557.4 | 1008.6 |
| msquic | neqo | reno | on | 1504 | 502.9 ± 44.6 | 471.4 | 614.3 |
| msquic | neqo | reno | | 1504 | 511.5 ± 57.9 | 484.7 | 675.0 |
| msquic | neqo | cubic | on | 1504 | 495.2 ± 57.4 | 464.9 | 656.9 |
| msquic | neqo | cubic | | 1504 | 488.8 ± 9.6 | 473.1 | 499.9 |
| neqo | neqo | reno | on | 1504 | 564.4 ± 62.9 | 508.8 | 730.1 |
| neqo | neqo | reno | | 1504 | 558.5 ± 50.0 | 493.8 | 657.0 |
| neqo | neqo | cubic | on | 1504 | 516.9 ± 30.3 | 457.3 | 552.4 |
| neqo | neqo | cubic | | 1504 | 566.2 ± 93.5 | 466.9 | 807.9 |


@larseggert
Collaborator

larseggert commented Nov 29, 2024

I had been meaning to take a look at what s2n-quic is doing; I somehow remember them having a pretty optimized implementation for this: https://github.com/aws/s2n-quic/blob/main/quic/s2n-quic-core/src/varint/mod.rs (also Apache-licensed).

Collaborator

@mxinden mxinden left a comment


Change looks good to me!

@martinthomson
Member Author

I took a look at the s2n implementation, which peeks at a byte, then reads a full 1, 2, 4, or 8 bytes. For some reason, that approach (the `+` lines in the following diff) is much slower than what this PR has: this PR is 19% faster for an 0xff mask, 27% faster for 0x7f, and 65% faster for 0x3f.

     pub fn decode_varint(&mut self) -> Option<u64> {
-        let b1 = self.decode_n(1)?;
+        let b1 = self.peek_byte()?;
         match b1 >> 6 {
-            0 => Some(b1),
-            1 => Some((b1 & 0x3f) << 8 | self.decode_n(1)?),
-            2 => Some((b1 & 0x3f) << 24 | self.decode_n(3)?),
-            3 => Some((b1 & 0x3f) << 56 | self.decode_n(7)?),
+            0 => {
+                self.offset += 1;
+                Some(u64::from(b1))
+            }
+            1 => Some(u64::from(self.decode_uint::<u16>()? & 0x3fff)),
+            2 => Some(u64::from(self.decode_uint::<u32>()? & 0x3fff_ffff)),
+            3 => Some(self.decode_uint::<u64>()? & 0x3fff_ffff_ffff_ffff),
             _ => unreachable!(),
         }
     }
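For readers who want to try the variant on the `-` lines in isolation, the following is a self-contained sketch. The `Decoder` struct and the `decode_n()` helper are assumptions written for this example (a big-endian read of `n` bytes into a `u64`); only the body of `decode_varint()` follows the diff, and the real neqo code may differ in its details.

```rust
/// Hypothetical stand-in for the decoder discussed above (not the neqo type).
pub struct Decoder<'a> {
    buf: &'a [u8],
    offset: usize,
}

impl<'a> Decoder<'a> {
    pub fn new(buf: &'a [u8]) -> Self {
        Self { buf, offset: 0 }
    }

    /// Assumed helper: read `n` bytes (n <= 8) as a big-endian integer.
    fn decode_n(&mut self, n: usize) -> Option<u64> {
        let end = self.offset.checked_add(n)?;
        let bytes = self.buf.get(self.offset..end)?;
        let value = bytes
            .iter()
            .fold(0u64, |acc, &b| (acc << 8) | u64::from(b));
        self.offset = end;
        Some(value)
    }

    /// Decode a QUIC variable-length integer: the top two bits of the first
    /// byte select a total length of 1, 2, 4, or 8 bytes.
    pub fn decode_varint(&mut self) -> Option<u64> {
        let b1 = self.decode_n(1)?;
        match b1 >> 6 {
            0 => Some(b1),
            1 => Some(((b1 & 0x3f) << 8) | self.decode_n(1)?),
            2 => Some(((b1 & 0x3f) << 24) | self.decode_n(3)?),
            3 => Some(((b1 & 0x3f) << 56) | self.decode_n(7)?),
            _ => unreachable!(),
        }
    }
}
```

Note that in this sketch the first byte is consumed even when the remaining bytes turn out to be missing, so a failed decode leaves the offset advanced by one.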

@larseggert
Collaborator

Also wonder if https://docs.rs/zerocopy/latest/zerocopy/ would further help here, but probably not, given that we only ever read a single value.

Signed-off-by: Lars Eggert <lars@eggert.org>
martinthomson and others added 3 commits December 9, 2024 08:30
Co-authored-by: Lars Eggert <lars@eggert.org>
Signed-off-by: Martin Thomson <mt@lowentropy.net>
@martinthomson martinthomson added this pull request to the merge queue Dec 8, 2024
Merged via the queue into mozilla:main with commit dd8e801 Dec 8, 2024
63 checks passed
@martinthomson martinthomson deleted the decoder-improvements branch December 8, 2024 22:25