
Track sent items using SmallVec #1657

Draft
wants to merge 15 commits into main

Conversation

martinthomson
Member

DO NOT MERGE THIS - IT MAKES THINGS SLOWER

I thought that this was worth sharing though.

I was looking at the profiles that @KershawChang shared and noticed that write_stream_frame was eating a decent amount of CPU because there was an allocation in there. It took me a while to track it down, but it turns out that this is the first place that we add a RecoveryToken to the Vec of those that we hold in each SentPacket.
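For illustration only (this is not code from the PR; `u64` stands in for the token type): a plain `Vec` starts with zero capacity, so the first push per packet is the point at which the heap allocation happens, which is what showed up in the profile.

```rust
// Minimal sketch, not neqo code: the first push into an empty Vec allocates.
fn main() {
    let mut tokens: Vec<u64> = Vec::new();
    assert_eq!(tokens.capacity(), 0); // no heap allocation yet
    tokens.push(1);                   // first push allocates on the heap
    assert!(tokens.capacity() >= 1);
}
```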

The obvious thing to test in this case is SmallVec. SmallVec is designed for places where you mostly add only a few items to a collection. It holds a small inline array into which values are stored; if more values are added than the array can hold, everything is moved to a heap allocation, just like a normal Vec. This means that if you stay within the array, you avoid a heap allocation, which can save a bunch of work. The cost is a little complexity, the extra stack space, and the fact that the values need to move if you ever overflow the backing array. So if your prediction is wrong, it is more expensive.
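As a rough sketch of that behavior (using the `smallvec` crate's standard API; the types and capacity here are illustrative, not the ones in this PR):

```rust
use smallvec::SmallVec;

fn main() {
    // Inline capacity of 4: the first four pushes go into the stack-backed
    // array and perform no heap allocation.
    let mut tokens: SmallVec<[u64; 4]> = SmallVec::new();
    for i in 0..4 {
        tokens.push(i);
    }
    assert!(!tokens.spilled());

    // The fifth push overflows the inline array, so the existing values are
    // moved into a heap allocation, just like a normal Vec growth step.
    tokens.push(4);
    assert!(tokens.spilled());
}
```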

I tried this out and it makes performance worse. With a backing array of size 1, the performance change isn't observable, but larger values progressively degrade performance. Increasing the array size to 8 makes it worse than the value of 4 I started with. My theory is that the increased size of the base object is the problem: this collection is typically kept in a SentPacket, and our acknowledgement code needs extra time to move the larger struct around.
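A rough way to see the size effect (illustrative only; `u64` again stands in for the token type):

```rust
use smallvec::SmallVec;
use std::mem::size_of;

fn main() {
    // A plain Vec is three words no matter how large its heap buffer gets.
    println!("Vec<u64>:           {} bytes", size_of::<Vec<u64>>());
    // A SmallVec stores its inline buffer in the struct itself, so a bigger
    // inline capacity inflates whatever contains it (here, SentPacket), and
    // every move of that struct gets more expensive.
    println!("SmallVec<[u64; 1]>: {} bytes", size_of::<SmallVec<[u64; 1]>>());
    println!("SmallVec<[u64; 4]>: {} bytes", size_of::<SmallVec<[u64; 4]>>());
    println!("SmallVec<[u64; 8]>: {} bytes", size_of::<SmallVec<[u64; 8]>>());
}
```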

The size of the performance regression suggests that there are some ways we could further improve how SentPacket is handled.

Benchmark results
Run multiple transfers with varying seeds
                        time:   [42.395 ms 42.547 ms 42.712 ms]
                        change: [+3.3754% +4.3870% +5.2776%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe

Run multiple transfers with the same seed
                        time:   [42.513 ms 42.584 ms 42.657 ms]
                        change: [+3.4443% +3.7777% +4.0976%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
  5 (5.00%) high mild

@codecov-commenter

codecov-commenter commented Feb 16, 2024

Codecov Report

Attention: Patch coverage is 98.96907% with 1 line in your changes missing coverage. Please review.

Project coverage is 95.39%. Comparing base (a1b9364) to head (f8e2c98).

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| neqo-transport/src/send_stream.rs | 92.30% | 1 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1657      +/-   ##
==========================================
- Coverage   95.40%   95.39%   -0.01%     
==========================================
  Files         113      113              
  Lines       36721    36717       -4     
==========================================
- Hits        35032    35028       -4     
  Misses       1689     1689              


martinthomson and others added 2 commits February 19, 2024 22:15
Co-authored-by: Lars Eggert <lars@eggert.org>
Signed-off-by: Martin Thomson <mt@lowentropy.net>
@martinthomson
Member Author

@larseggert, I'm looking at the profiles that were generated here and it looks like these are off the main branch. The flamegraph for transfer shows calls into BTreeMap, but I don't see any way that could happen.

@larseggert
Collaborator

I can't get to it this week. Wonder if the checkout action needs another parameter?

@larseggert
Collaborator

Let's see what the new flamegraphs show (also because LTO is now enabled).


Benchmark results

Performance differences relative to 76630a5.

  • drain a timer quickly time: [395.29 ns 402.55 ns 409.33 ns]
    change: [-0.6787% +0.9859% +2.7112%] (p = 0.26 > 0.05)
    No change in performance detected.

  • coalesce_acked_from_zero 1+1 entries
    time: [193.97 ns 194.36 ns 194.79 ns]
    change: [-0.5237% -0.1538% +0.2215%] (p = 0.41 > 0.05)
    No change in performance detected.

  • coalesce_acked_from_zero 3+1 entries
    time: [235.93 ns 236.55 ns 237.18 ns]
    change: [-0.8063% -0.4103% -0.0496%] (p = 0.04 < 0.05)
    Change within noise threshold.

  • coalesce_acked_from_zero 10+1 entries
    time: [235.78 ns 236.59 ns 237.55 ns]
    change: [-0.4030% +3.6386% +13.911%] (p = 0.42 > 0.05)
    No change in performance detected.

  • coalesce_acked_from_zero 1000+1 entries
    time: [217.27 ns 222.01 ns 232.88 ns]
    change: [+1.4742% +5.6444% +13.185%] (p = 0.06 > 0.05)
    No change in performance detected.

  • RxStreamOrderer::inbound_frame()
    time: [117.57 ms 117.65 ms 117.73 ms]
    change: [+0.0927% +0.1982% +0.2949%] (p = 0.00 < 0.05)
    Change within noise threshold.

  • transfer/Run multiple transfers with varying seeds
    time: [124.35 ms 124.61 ms 124.88 ms]
    thrpt: [32.031 MiB/s 32.099 MiB/s 32.169 MiB/s]
    change:
    time: [+5.4576% +5.7824% +6.1037%] (p = 0.00 < 0.05)
    thrpt: [-5.7526% -5.4664% -5.1751%]
    💔 Performance has regressed.

  • transfer/Run multiple transfers with the same seed
    time: [124.82 ms 125.00 ms 125.18 ms]
    thrpt: [31.953 MiB/s 32.000 MiB/s 32.047 MiB/s]
    change:
    time: [+5.2953% +5.5022% +5.6981%] (p = 0.00 < 0.05)
    thrpt: [-5.3909% -5.2152% -5.0290%]
    💔 Performance has regressed.

Client/server transfer results

Transfer of 134217728 bytes over loopback.

| Client | Server | CC | Pacing | Mean [ms] | Min [ms] | Max [ms] | Relative |
| --- | --- | --- | --- | --- | --- | --- | --- |
| msquic | msquic | | | 428.3 ± 58.8 | 359.7 | 538.1 | 1.00 |
| neqo | msquic | reno | on | 1921.6 ± 35.0 | 1850.3 | 1960.9 | 1.00 |
| neqo | msquic | reno | | 1917.7 ± 30.3 | 1857.9 | 1959.7 | 1.00 |
| neqo | msquic | cubic | on | 1898.8 ± 56.6 | 1775.9 | 1971.2 | 1.00 |
| neqo | msquic | cubic | | 1901.9 ± 46.7 | 1811.5 | 1960.3 | 1.00 |
| msquic | neqo | reno | on | 4418.4 ± 170.0 | 4222.4 | 4692.8 | 1.00 |
| msquic | neqo | reno | | 4510.8 ± 232.7 | 4225.0 | 4969.3 | 1.00 |
| msquic | neqo | cubic | on | 4475.1 ± 154.6 | 4259.7 | 4672.4 | 1.00 |
| msquic | neqo | cubic | | 4434.9 ± 168.7 | 4204.7 | 4705.9 | 1.00 |
| neqo | neqo | reno | on | 3664.9 ± 172.2 | 3387.1 | 3957.8 | 1.00 |
| neqo | neqo | reno | | 3805.0 ± 275.7 | 3529.5 | 4374.1 | 1.00 |
| neqo | neqo | cubic | on | 4329.7 ± 141.6 | 4114.5 | 4523.6 | 1.00 |
| neqo | neqo | cubic | | 4372.4 ± 141.9 | 4184.3 | 4553.5 | 1.00 |


@larseggert
Collaborator

It might be worthwhile revisiting this now that @mxinden made some improvements related to SentPacket.

@martinthomson
Member Author

It still looks pretty bad:
RxStreamOrderer::inbound_frame()
                        time:   [51.698 ms 51.889 ms 52.085 ms]
                        change: [+4.3290% +5.1052% +5.9211%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild

     Running benches/sent_packets.rs (target/release/deps/sent_packets-a9a340a4dfd0be33)
SentPackets::take_ranges
                        time:   [12.966 µs 13.264 µs 13.516 µs]
                        change: [+226.83% +258.23% +293.90%] (p = 0.00 < 0.05)
                        Performance has regressed.

     Running benches/transfer.rs (target/release/deps/transfer-8555e2732eb99220)
transfer/pacing-false/varying-seeds
                        time:   [14.021 ms 14.244 ms 14.462 ms]
                        change: [+5.1022% +7.4030% +9.5750%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) low mild

transfer/pacing-true/varying-seeds
                        time:   [14.828 ms 15.058 ms 15.281 ms]
                        change: [+4.0012% +6.3270% +8.5372%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) low mild

transfer/pacing-false/same-seed
                        time:   [12.132 ms 12.391 ms 12.643 ms]
                        change: [+4.6848% +7.8215% +10.869%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) low mild

transfer/pacing-true/same-seed
                        time:   [13.924 ms 14.178 ms 14.432 ms]
                        change: [+4.0431% +6.9063% +9.9521%] (p = 0.00 < 0.05)
                        Performance has regressed.


Failed Interop Tests

QUIC Interop Runner, client vs. server, differences relative to 3e65261.

neqo-latest as client

neqo-latest as server

All results

Succeeded Interop Tests

QUIC Interop Runner, client vs. server

neqo-latest as client

neqo-latest as server

Unsupported Interop Tests

QUIC Interop Runner, client vs. server

neqo-latest as client

neqo-latest as server
