Memory copy in buffer linearize process #17219
Who created the old slice varies, but the most common sources are data frames arriving on the other end of the I/O pipeline (i.e. a response coming from the upstream which is written to the downstream via TLS). One possible source of fragmentation in the connection output buffer is the framing that the codec has to add around the data, including HTTP/1 chunk encoding and HTTP/2 data frame headers. I think that right now HTTP/1 messages with content-length will likely manage to skip the memcpy when proxying over TLS most of the time. Other cases will likely end up with an output buffer that contains slices of varying sizes, which require linearize before send. Regarding the specific case you describe involving an HTTP/1 + TLS proxy, I would think that case shouldn't trigger linearize often when both the downstream and upstream are HTTP/1, unless the response uses HTTP/1 chunk encoding instead of content-length. I know @ggreenway had considered some heuristics to avoid copies in more cases. I think that in the HTTP/2 case we need a copy to compact the generated H2 data frame stream, either at the time the frames are generated by the codec or via linearize when generating the TLS frames. The number of copies involved in these two approaches should be almost the same, so there's not a lot of room for optimization.
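To make the fragmentation point concrete, here is an illustrative, self-contained sketch of the slice-size pattern chunk encoding can produce in the output buffer versus a content-length response (the sizes and loop are assumptions for illustration, not Envoy codec output):

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

int main() {
  const std::size_t kDataSlice = 16384;  // typical full data slice / max TLS record payload

  // HTTP/1 chunked framing: each 16 KiB chunk of body data is preceded by a small
  // chunk-size line ("4000\r\n", 6 bytes) and followed by a 2-byte CRLF slice.
  std::vector<std::size_t> chunked_slices;
  for (int i = 0; i < 3; ++i) {
    chunked_slices.push_back(6);
    chunked_slices.push_back(kDataSlice);
    chunked_slices.push_back(2);
  }

  // Content-length response: after the headers, the body can remain a run of
  // uniform 16 KiB slices that need no compaction before SSL_write.
  std::vector<std::size_t> content_length_slices = {kDataSlice, kDataSlice, kDataSlice};

  std::printf("chunked slice sizes:        ");
  for (std::size_t s : chunked_slices) std::printf("%zu ", s);
  std::printf("\ncontent-length slice sizes: ");
  for (std::size_t s : content_length_slices) std::printf("%zu ", s);
  std::printf("\n");
  // The mixed-size chunked pattern is what can force linearize() to memcpy before
  // each 16 KiB SSL_write; the uniform pattern can usually be written as-is.
}
```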
Testing with various request sizes and data patterns showed that this was as fast or faster than the alternative, which was to call SSL_write separately for each slice. The ideal solution would be for BoringSSL to have a version of SSL_write that accepts multiple input buffers, so the copy could be avoided entirely.
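As context for the trade-off being described, here is a minimal sketch of the two strategies, assuming a connected SSL* and a simplified slice representation (this is not Envoy's SslSocket::doWrite, and the function names are illustrative):

```cpp
#include <openssl/ssl.h>

#include <algorithm>
#include <string>
#include <vector>

constexpr std::size_t kMaxTlsRecord = 16384;  // maximum TLS record payload

// Strategy A: copy up to 16 KiB into one contiguous buffer (the memcpy under
// discussion), then issue a single SSL_write producing one full-size record.
int write_linearized(SSL* ssl, const std::vector<std::string>& slices) {
  std::string contiguous;
  for (const auto& s : slices) {
    if (contiguous.size() >= kMaxTlsRecord) break;
    contiguous.append(s, 0, std::min(kMaxTlsRecord - contiguous.size(), s.size()));
  }
  return SSL_write(ssl, contiguous.data(), static_cast<int>(contiguous.size()));
}

// Strategy B: one SSL_write per slice; no copy, but each small slice becomes a
// small TLS record with its own framing and crypto overhead.
int write_per_slice(SSL* ssl, const std::vector<std::string>& slices) {
  int total = 0;
  for (const auto& s : slices) {
    int rc = SSL_write(ssl, s.data(), static_cast<int>(s.size()));
    if (rc <= 0) {
      return rc;  // error or retryable condition; real code would check SSL_get_error
    }
    total += rc;
  }
  return total;
}
```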
@antoniovicente @ggreenway thanks for the explanation.
I don't know; I think the only way to find out is to try it both ways in a benchmark, and see what the measurements tell us. But when I tested, across a variety of traffic patterns, it was faster on average to have the memcpy than to pay for the extra TLS records.
It would be good to know more about the specifics of the protocols involved. Are you proxying HTTP/1 or HTTP/2?
Our scenario is: big file downloads through Envoy (HTTP/1 + TLS proxy).
I have a few possible guesses about what may be happening:
If the issue is (1), it may be possible to reduce copying by using a smaller per_connection_buffer_limit_bytes for the upstream clusters. per_connection_buffer_limit_bytes defaults to 1MB. Your mention of 60 memcpy operations seems consistent with use of this default, since 64 * 16kb per SSL_write == 1MB. Some improvements to the heuristics used to decide whether or not to linearize may help. For example, skip linearize for the current slice if the next slice is already a full 16kb in size (or is close to 16kb).
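A hedged sketch of that heuristic, using an assumed Slice type and an illustrative "close to 16kb" threshold (this is not the actual change from #14053):

```cpp
#include <cstddef>
#include <deque>
#include <vector>

constexpr std::size_t kMaxTlsRecord = 16384;
constexpr std::size_t kNearlyFull = 15 * 1024;  // "close to 16kb"; threshold value is illustrative

struct Slice {
  std::vector<char> data;  // stand-in for one slice in the buffer's slice queue
};

// Returns true if paying the memcpy of linearize() before SSL_write looks worthwhile.
bool should_linearize(const std::deque<Slice>& slices) {
  if (slices.size() < 2) {
    return false;  // zero or one slice is already as contiguous as it gets
  }
  const std::size_t front = slices[0].data.size();
  if (front >= kMaxTlsRecord) {
    return false;  // the front slice alone already fills a TLS record
  }
  const std::size_t next = slices[1].data.size();
  if (next >= kNearlyFull) {
    // The next slice is (nearly) a full record on its own: emit the small front
    // slice as a short record rather than copying ~16 KiB to absorb a few bytes.
    return false;
  }
  return true;  // many small/medium slices: compacting them is likely cheaper
}
```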
I tried exactly this heuristic; the improvement seen didn't justify the change at the time. However, I did not test with chunked responses, and that may change the result significantly. It's probably worth testing again. Here's my change where I attempted this: #14053
@antoniovicente I think issue (1) you described exactly matches the phenomenon I encountered. Most of the slices are 16KB and have not been modified by Envoy, but to reduce the number of slices they have to be copied into a new slice to fill the empty space caused by the first, smaller slice.
Regarding the heuristic @ggreenway tried in #14053: I agree with the point that "at some threshold the memcpy of linearization exceeds the overhead of the extra TLS record". Did you find that threshold in the end?
cc @mum4k. More wakeups would be required. Smaller buffers may have a bit of a CPU penalty, but they lower the proxy's memory usage and provide slightly better sharing of CPU across connections handled by the same worker. I don't think we have load tests that show throughput changes due to changes to the per_connection_buffer_limit_bytes config.
No, I abandoned that PR in favor of other optimizations that achieved the performance I needed for the use case I was optimizing for at the time.
We currently don't have such load tests, but it should be relatively easy to set up a one-off (or even continuous) experiment internally if this is desired.
This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.
This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted" or "no stalebot". Thank you for your contributions.
Description:
In my test of a big file download through Envoy (HTTP/1 + TLS proxy), I found memcpy operations in the SslSocket::doWrite method.
After digging deeper into the code, I found that in OwnedImpl::linearize(), Envoy initializes a new slice when the old slice's size is smaller than the buffer to write, and then copyOut invokes memcpy.
Do you know who created this queue of old slices? And do you think it may grow to a larger size and slow down performance (too many memcpy operations)?
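To illustrate the mechanism described above, here is a self-contained, simplified model of the linearize behavior (not Envoy's actual Buffer::OwnedImpl code; the Slice type is a stand-in): it copies only when the front slice is smaller than the requested length.

```cpp
#include <algorithm>
#include <cstddef>
#include <deque>
#include <iostream>
#include <vector>

struct Slice {
  std::vector<char> data;  // stand-in for one slice in the buffer's slice queue
};

// Simplified model: if the first slice already holds `size` contiguous bytes,
// return it directly; otherwise build a new slice and copy (memcpy) bytes out
// of the existing slices into it.
const char* linearize(std::deque<Slice>& slices, std::size_t size, bool& copied) {
  if (!slices.empty() && slices.front().data.size() >= size) {
    copied = false;
    return slices.front().data.data();  // fast path: no copy needed
  }
  Slice merged;
  merged.data.reserve(size);
  while (merged.data.size() < size && !slices.empty()) {
    Slice& s = slices.front();
    const std::size_t n = std::min(size - merged.data.size(), s.data.size());
    merged.data.insert(merged.data.end(), s.data.begin(), s.data.begin() + n);  // the memcpy
    s.data.erase(s.data.begin(), s.data.begin() + n);
    if (s.data.empty()) {
      slices.pop_front();
    }
  }
  copied = true;
  slices.push_front(std::move(merged));
  return slices.front().data.data();
}

int main() {
  std::deque<Slice> slices;
  slices.push_back({std::vector<char>(6, 'h')});      // small leading slice (e.g. framing bytes)
  slices.push_back({std::vector<char>(16384, 'd')});  // full 16 KiB data slice
  bool copied = false;
  linearize(slices, 16384, copied);
  std::cout << "copy needed: " << std::boolalpha << copied << "\n";  // prints: copy needed: true
}
```

Under this model, a small leading slice followed by full 16 KiB slices takes the copy path on every 16 KiB write, which matches the repeated memcpy described in this issue.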