Reducing crypto overhead #334

Closed
kazuho opened this issue Apr 30, 2020 · 1 comment · Fixed by #359


kazuho commented Apr 30, 2020

When hardware-assisted UDP GSO is present, crypto overhead becomes the biggest bottleneck of QUIC sender performance. In the case shown below, doing GSO of 40 packets * 1460 bytes, 45.18% of the time is spent in AEAD encryption (ptls_aead_encrypt) and 6.29% in header protection (default_finalize_send_packet). These numbers (almost) represent the real cost, as 96% of the CPU time was spent in user+sys (constituting 100% of this perf tree).

While the actual cost of doing crypto cannot be reduced, there is a certain amount of overhead within this 45.18% + 6.29%. The actual cost of AEAD (initialization, AAD processing, encryption, finalization) is 2.73% + 0.78% + 33.31% + 3.15% = 39.97%. The actual cost of generating unpredictable bits for header protection is 1.18%. We can assume that a large fraction of the remaining 10.32% is API overhead.
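For reference, the breakdown above can be reproduced from the quoted profile numbers (the variable names below are just labels for this calculation, not part of any API):

```python
# Total time attributed to crypto: AEAD encryption + header protection.
total_crypto = 45.18 + 6.29

# Actual AEAD work: initialization, AAD processing, encryption, finalization.
aead_actual = 2.73 + 0.78 + 33.31 + 3.15

# Actual header-protection work: generating the unpredictable mask bits.
hp_actual = 1.18

# Whatever remains is presumed to be API overhead.
api_overhead = total_crypto - aead_actual - hp_actual

print(round(aead_actual, 2))   # 39.97
print(round(api_overhead, 2))  # 10.32
```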

A sensible approach to fix this issue would be to add a function to OpenSSL that does everything at once (i.e., all of AEAD plus header protection). We can hope to see a ~10% performance improvement by adding such a function.
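To illustrate what such a fused call would cover, here is a toy sketch of the QUIC packet-protection flow. All names are hypothetical, and XOR stand-ins replace the real AES-GCM AEAD and AES-based header-protection mask; only the control flow matters (AEAD-encrypt the payload, sample the ciphertext, mask the first byte and packet number), not the cryptography:

```python
def toy_aead_encrypt(key: bytes, data: bytes) -> bytes:
    # Stand-in for the real AEAD/cipher (e.g. AES-GCM); XOR is NOT secure.
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

def protect_packet(aead_key: bytes, hp_key: bytes,
                   first_byte: int, pn_bytes: bytes, payload: bytes) -> bytes:
    # Step 1: AEAD-encrypt the payload (the bulk of the cost in the profile).
    ciphertext = toy_aead_encrypt(aead_key, payload)

    # Step 2: sample the ciphertext and derive the header-protection mask
    # from it (real QUIC samples 16 bytes and encrypts them with AES/ChaCha).
    sample = ciphertext[:4]
    mask = toy_aead_encrypt(hp_key, sample)

    # Step 3: mask the protected bits of the first byte and the packet number.
    protected_first = first_byte ^ (mask[0] & 0x1F)
    protected_pn = bytes(b ^ mask[1 + i] for i, b in enumerate(pn_bytes))

    return bytes([protected_first]) + protected_pn + ciphertext

pkt = protect_packet(b"k1", b"k2", 0x41, b"\x00\x01", b"hello quic")
```

A fused OpenSSL call would perform all three steps in one API invocation, avoiding the per-call setup/teardown that the remaining ~10% is attributed to.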

[Screenshot: perf profile (Screen Shot 2020-04-30 at 12.56.24 PM)]


kazuho commented Apr 30, 2020

A sensible approach to fix this issue would be to add a function to OpenSSL that does everything at once (i.e., all of AEAD plus header protection). We can hope to see a ~10% performance improvement by adding such a function.

Furthermore, we could take advantage of pipelining by running the initialization, AAD, and finalization phases of multiple packets in parallel. The reason these three phases account for as much time as shown above, relative to the permutation of the payload, is that they are not parallelized, while the payload permutation is applied to multiple 16-byte blocks at once (notice the 6x suffix of the functions). Unlike in the case of TLS, we can parallelize the three phases because we are generating multiple packets at once. We could also consider applying the encryption logic in the kernel to reduce the overhead of context switches.
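The pipelining idea amounts to restructuring the per-packet loop into per-phase loops over the whole GSO batch, so the short serial phases of independent packets can overlap. A minimal sketch of the reordering (hypothetical function names; real code would interleave multiple cipher contexts, much like the 6x-wide payload loops):

```python
PHASES = ("init", "aad", "encrypt", "finalize")

def protect_serial(packets):
    # Packet-major order: init, aad, encrypt, finalize run back-to-back for
    # each packet, so the short serial phases cannot overlap across packets.
    return [(phase, i) for i, _ in enumerate(packets) for phase in PHASES]

def protect_pipelined(packets):
    # Phase-major order: each phase is applied to every packet in the batch
    # before moving on, exposing the independent per-packet computations
    # for parallel (interleaved/SIMD) execution.
    return [(phase, i) for phase in PHASES for i, _ in enumerate(packets)]

batch = ["pkt0", "pkt1", "pkt2"]
```

Both orderings do the same work; only the schedule changes, which is what lets the fixed per-packet costs be amortized across the batch.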
