
#729: proto: write outgoing packets to caller-supplied memory #1697

Merged · 9 commits merged into quinn-rs:main · Nov 1, 2023

Conversation

@lmpn (Contributor) commented Oct 25, 2023

I've used BytesMut for convenience, but I can change it to &mut [u8].

- buffer as argument
- proto Transmit struct carries the size of the payload instead of the buffer
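
For illustration, a minimal Rust sketch of the shape this change moves toward: the caller owns the buffer, poll_transmit writes into it, and Transmit reports only the payload size. The types and signature below are simplified stand-ins (assuming the bytes crate), not the actual quinn-proto API.

```rust
use bytes::{BufMut, BytesMut};

/// Simplified stand-in for quinn-proto's `Transmit`: after this change it
/// carries the size of the payload instead of owning the buffer.
struct Transmit {
    size: usize,
    // destination, ECN, segment size, etc. omitted
}

/// Hypothetical connection with a queue of bytes to send.
struct Connection {
    pending: Vec<u8>,
}

impl Connection {
    /// Caller-supplied memory: the connection appends the next datagram
    /// into `buf` and reports how many bytes it wrote.
    fn poll_transmit(&mut self, buf: &mut BytesMut) -> Option<Transmit> {
        if self.pending.is_empty() {
            return None;
        }
        let size = self.pending.len();
        buf.put_slice(&self.pending);
        self.pending.clear();
        Some(Transmit { size })
    }
}

fn main() {
    let mut conn = Connection { pending: b"hello".to_vec() };
    let mut buf = BytesMut::with_capacity(1500);
    if let Some(t) = conn.poll_transmit(&mut buf) {
        let datagram = buf.split_to(t.size); // hand exactly `size` bytes to the socket
        assert_eq!(&datagram[..], b"hello");
    }
}
```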
@djc (Member) left a comment

This seems pretty nice! Have you measured the performance impact?

Review threads (outdated, resolved): quinn-proto/src/connection/mod.rs, quinn-proto/src/tests/util.rs (×2), quinn/src/endpoint.rs (×2)
Luis Neto added 2 commits October 25, 2023 22:59
- use quinn-proto to get buffer size
- remove unnecessary includes
- revert Transmit variables order
Review threads (outdated, resolved): quinn-proto/src/connection/mod.rs (×2), quinn/src/connection.rs
@lmpn (Contributor, Author) commented Oct 26, 2023

I've not forgotten about the perf. I will share the comparison later 👍

@Ralith (Collaborator) commented Oct 26, 2023

I'd be surprised if there was a significant performance improvement from this alone, but it brings us much closer to being able to reuse transmit buffers (perhaps by doing UDP transmits directly from connection tasks), which is more likely to have a direct impact.

Still, always good to check for the impact of changes on the I/O path.

@lmpn (Contributor, Author) commented Oct 26, 2023

I've run cargo bench for main and this branch. Results below:

poll_transmit_pre_alloc
test large_data_10_streams ... bench: 30,968,912 ns/iter (+/- 11,592,202) = 338 MB/s
test large_data_1_stream ... bench: 26,580,878 ns/iter (+/- 20,358,295) = 39 MB/s
test small_data_100_streams ... bench: 14,363,537 ns/iter (+/- 17,569,828)
test small_data_1_stream ... bench: 19,154,729 ns/iter (+/- 5,427,772)

latest main
test large_data_10_streams ... bench: 30,603,466 ns/iter (+/- 11,396,640) = 342 MB/s
test large_data_1_stream ... bench: 21,525,054 ns/iter (+/- 19,954,110) = 48 MB/s
test small_data_100_streams ... bench: 17,087,725 ns/iter (+/- 22,559,371)
test small_data_1_stream ... bench: 19,227,774 ns/iter (+/- 5,343,771)

If this is not enough, please give me some pointers and I will try to do an analysis that fits the project.

@lmpn (Contributor, Author) commented Oct 26, 2023

@djc I've decided to expose the MTU from the connection.
If there is already an approach to retrieve or calculate such a value, tell me 🙏
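
For context, a hedged sketch of why a caller wants the MTU: it bounds how much memory each poll_transmit call needs. The Connection below is a hypothetical stand-in, and the current_mtu name and return type are assumptions based on this thread, not the exact quinn-proto signature.

```rust
use bytes::BytesMut;

/// Hypothetical stand-in for a connection that exposes its MTU.
struct Connection {
    mtu: u16,
}

impl Connection {
    /// Mirrors the `current_mtu` accessor discussed in this thread.
    fn current_mtu(&self) -> u16 {
        self.mtu
    }
}

fn main() {
    let conn = Connection { mtu: 1452 };
    // Size the caller-supplied transmit buffer from the connection's MTU,
    // with room for several datagrams per poll_transmit call.
    let max_datagrams: usize = 10;
    let buf = BytesMut::with_capacity(usize::from(conn.current_mtu()) * max_datagrams);
    assert!(buf.capacity() >= 14_520);
}
```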

@djc (Member) commented Oct 27, 2023

So this shows as a ~18% regression for small_data_100_streams? That seems bad -- and surprising...

@lmpn (Contributor, Author) commented Oct 27, 2023

So I've executed cargo bench 10 times each for main and this branch and compared the times (image below).
[Screenshot 2023-10-27 at 15 16 06]

There is indeed a regression (on average) for large_data_1_stream and small_data_100_streams, 3% and 6% respectively.

@Ralith (Collaborator) left a comment

Exposing the MTU makes sense to me; other quinn-proto users will want it for the same reason.

Review threads (outdated, resolved): quinn-proto/src/connection/mod.rs (×2)
@Ralith (Collaborator) commented Oct 27, 2023

@lmpn I'm having some difficulty reading your chart screenshot. Are the 4 aggregated figures for each separate test, in order? If so, there's a 20% speedup in small_data_1_stream? That's confusing, since the original single run you posted indicated no change there, and a similar speedup in small_data_100_streams. That original run also indicated very high variance, so maybe this data is too noisy to be meaningful anyway. I'll try to get some of my own numbers to compare.

We should be able to get identical performance to main by having the quinn layer create a fresh BytesMut for each poll_transmit call, right?

@lmpn (Contributor, Author) commented Oct 28, 2023

Sorry for the bad image. Check the one below and tell me if it is clearer.
The speedups are calculated as (1 - main_time/poll_transmit_pre_alloc_time) * 100.

[Screenshot 2023-10-28 at 14 28 24]

The single run showed an 18% regression for small_data_100_streams. I assumed this was because of too much noise (a bad run) and because I'm running on my own computer, so I took 10 runs of each branch.

As you can see above, I calculated the speedup for the:

  • average of all times
  • max of all times
  • min of all times
  • top 3 (average of the 3 lowest times)
  • top 5 (average of the 5 lowest times)

Even with noise, these values show:

  • small_data_1_stream has improved in every scenario
  • there is a regression for large_data_1_stream and small_data_100_streams
  • for large_data_10_streams, the difference seems negligible (I think)

> We should be able to get identical performance to main by having the quinn layer create a fresh BytesMut for each poll_transmit call, right?

Yes, I would assume so.
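
As a sketch of that fallback (hypothetical names; only the allocation pattern matters here), the quinn layer would simply allocate per call:

```rust
use bytes::BytesMut;

/// Hypothetical shape of the quinn-layer transmit loop. Replicating `main`'s
/// allocation pattern just means creating a fresh BytesMut for every
/// poll_transmit call instead of reusing one.
fn transmit_loop<F>(mut poll_transmit: F, mtu: usize)
where
    F: FnMut(&mut BytesMut) -> Option<usize>, // returns bytes written, if any
{
    loop {
        let mut buf = BytesMut::with_capacity(mtu); // fresh buffer per call
        match poll_transmit(&mut buf) {
            Some(size) => {
                let datagram = buf.split_to(size);
                // ... hand `datagram` off to the UDP socket ...
                let _ = datagram;
            }
            None => break,
        }
    }
}

fn main() {
    // Toy poll_transmit that produces three small packets and then stops.
    let mut remaining = 3;
    transmit_loop(
        |buf| {
            if remaining == 0 {
                return None;
            }
            remaining -= 1;
            buf.extend_from_slice(b"packet");
            Some(b"packet".len())
        },
        1500,
    );
}
```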

- rename get_current_mtu to current_mtu
- add proper docstring to current_mtu
@Ralith (Collaborator) commented Oct 28, 2023

Thanks for the detailed analysis! To summarize: an apparent massive speedup for small_data_1_stream, and maybe a slight slowdown (or just noise) elsewhere. Perhaps reusing a single BytesMut is letting a single allocation serve multiple small packets.

I think this nets out as a win as written, personally. I've experimented on a couple of my machines (both Windows and Linux) and wasn't able to show a significant difference either way.
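
To illustrate that hypothesis with plain bytes-crate behavior (nothing quinn-specific): a reused BytesMut can be carved into many small packets with split_to before any new allocation is needed.

```rust
use bytes::BytesMut;

fn main() {
    // One backing allocation...
    let mut buf = BytesMut::with_capacity(4096);
    let initial = buf.capacity();

    // ...serves several small packets: each split_to hands out a view into
    // the same storage and leaves the remaining capacity for the next packet,
    // so no further allocation is needed until the capacity runs out.
    for _ in 0..8 {
        buf.extend_from_slice(&[0u8; 100]); // "write" one small packet
        let packet = buf.split_to(100);
        drop(packet); // pretend it was handed to the socket and sent
    }

    // 8 x 100 bytes fit comfortably in the original 4096-byte allocation.
    assert!(initial >= 8 * 100);
}
```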

Review threads (outdated, resolved): quinn-proto/src/connection/mod.rs, quinn-proto/src/lib.rs, quinn/src/connection.rs
@djc (Member) commented Oct 31, 2023

@lmpn was this motivated by Hacktoberfest, and if so, is there a merge deadline to make this count for your stats?

@lmpn (Contributor, Author) commented Nov 1, 2023

> @lmpn was this motivated by Hacktoberfest, and if so, is there a merge deadline to make this count for your stats?

No...
I'm just interested in networking and performance programming.

@Ralith (Collaborator) left a comment

Thanks!

Review thread (outdated, resolved): quinn/src/endpoint.rs
@lmpn (Contributor, Author) commented Nov 1, 2023

[Screenshot 2023-11-01 at 12 00 11]

Above are the results of the benchmark with the latest commit on this branch.
If someone could double-check on their side, that would be great.

@djc enabled auto-merge (squash) on November 1, 2023 12:06
@djc merged commit 49aa4b6 into quinn-rs:main on Nov 1, 2023
7 of 8 checks passed
@Ralith (Collaborator) commented Nov 1, 2023

My environment still registers too much noise to get meaningful data, but it's exciting that it had such an impact in yours! Maybe the impact is larger on Windows, where GSO isn't doing as much heavy lifting.
