
Memory copy in buffer linearize process #17219

Closed
xu1zhou opened this issue Jul 2, 2021 · 17 comments
Labels
question: Questions that are neither investigations, bugs, nor enhancements
stale: stalebot believes this issue/PR has not been touched recently

Comments

@xu1zhou
Contributor

xu1zhou commented Jul 2, 2021

Description:
In my test of a large-file download through Envoy (HTTP/1 + TLS proxy), I found a memcpy operation in the SslSocket::doWrite method.

#ssl_socket.cc:253

int rc = SSL_write(rawSsl(), write_buffer.linearize(bytes_to_write), bytes_to_write);
#buffer_impl.cc:236

void* OwnedImpl::linearize(uint32_t size) {
  ...
  if (slices_[0].dataSize() < size) {
    Slice new_slice{size, account_};
    Slice::Reservation reservation = new_slice.reserve(size);
    ASSERT(reservation.mem_ != nullptr);
    ASSERT(reservation.len_ == size);
    copyOut(0, size, reservation.mem_);
    new_slice.commit(reservation);
...

After digging deeper into the code, I found that in OwnedImpl::linearize(), Envoy allocates a new slice whenever the first slice is smaller than the number of bytes to write, and copyOut() then invokes memcpy to fill it.

Do you know who creates these old slices in the queue? And do you think they may grow to a larger size and slow down performance (too much memcpy)?
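
For illustration, here is a minimal standalone sketch of the effect (ToyBuffer and the 200-byte/16KB sizes are made up for the example; this is not Envoy's Slice/OwnedImpl code):

#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

// Toy model of a slice-based buffer: a queue of independently sized chunks.
struct ToyBuffer {
  std::vector<std::vector<uint8_t>> slices;

  // Simplified stand-in for OwnedImpl::linearize(): if the first slice
  // already holds `size` bytes, no copy is needed; otherwise every byte of
  // the requested range is copied into one freshly allocated slice.
  size_t bytesCopiedByLinearize(size_t size) const {
    if (!slices.empty() && slices.front().size() >= size) {
      return 0;  // fast path: hand back the first slice as-is
    }
    return size;  // slow path: the whole range is memcpy'd into a new slice
  }
};

int main() {
  ToyBuffer buf;
  buf.slices.emplace_back(200);  // small front slice, e.g. serialized headers
  for (int i = 0; i < 64; ++i) {
    buf.slices.emplace_back(16384);  // 16KB body slices from the upstream
  }
  // Asking for one TLS record's worth (16KB) hits the slow path, because the
  // 200-byte slice sits at the front of the queue.
  std::printf("bytes copied: %zu\n", buf.bytesCopiedByLinearize(16384));
  return 0;
}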

@xu1zhou xu1zhou added the triage Issue requires triage label Jul 2, 2021
@xu1zhou
Contributor Author

xu1zhou commented Jul 2, 2021

@mattklein123

@mattklein123
Member

cc @ggreenway @antoniovicente

@lizan lizan added question Questions that are neither investigations, bugs, nor enhancements and removed triage Issue requires triage labels Jul 2, 2021
@antoniovicente
Contributor

Who created the old slice varies, but the most common source is data frames arriving on the other end of the I/O pipeline (i.e. a response coming from the upstream which is written to the downstream via TLS). One possible source of fragmentation in the connection output buffer is the framing that the codec has to add, including HTTP/1 chunk encoding and HTTP/2 data frame headers. I think that right now HTTP/1 messages with content-length will manage to skip the memcpy when proxying over TLS most of the time. Other cases will likely end up with an output buffer that contains slices of varying sizes, which requires linearize before send.

Regarding the specific case you describe (HTTP/1 + TLS proxy), I would think it shouldn't trigger the linearize path often when both the downstream and upstream are HTTP/1, unless the response uses HTTP/1 chunk encoding instead of content-length. I know @ggreenway had considered some heuristics to avoid copies in more cases.

I think that in the HTTP/2 case we need to do a copy to compact the generated H2 data frame stream, either at the time the frames are generated by the codec or via linearize when generating the TLS records. The number of copies involved in these two approaches should be almost the same, so there's not a lot of room for optimization.

@antoniovicente antoniovicente changed the title Memmory copy in buffer linearize process Memory copy in buffer linearize process Jul 2, 2021
@ggreenway
Contributor

Testing with various request sizes and data patterns showed that this was as fast as or faster than the alternative, which was to call SSL_write on each slice in the buffer. The reason is that each SSL_write call causes a TLS record to be sent, which includes the crypto overhead of sealing the record. Sending more records than necessary is more expensive than this memcpy.

The ideal solution would be for BoringSSL to have a version of SSL_write that takes an iovec or similar, so we could pass in an array of pointers and lengths, and the function could copy out however much data will fit into a record, without needing to linearize. But that function does not exist.
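
For reference, here is a sketch of the per-slice alternative described above (SSL_write is the real BoringSSL/OpenSSL call; SliceView and writeEachSlice are just illustrative names, not Envoy code):

#include <openssl/ssl.h>

#include <cstddef>
#include <cstdint>
#include <vector>

struct SliceView {
  const uint8_t* mem;
  size_t len;
};

// The alternative to linearize: one SSL_write per slice. Every call with a
// non-zero length seals (encrypts and MACs) at least one TLS record, so many
// small slices produce many small records and more crypto overhead.
int writeEachSlice(SSL* ssl, const std::vector<SliceView>& slices) {
  int total = 0;
  for (const auto& s : slices) {
    const int rc = SSL_write(ssl, s.mem, static_cast<int>(s.len));
    if (rc <= 0) {
      return rc;  // caller inspects SSL_get_error() for WANT_WRITE etc.
    }
    total += rc;
  }
  return total;
}

A vectored SSL_write taking an array of pointer/length pairs would let the library pack these slices into full 16KB records itself, but as noted above no such API exists today.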

@xu1zhou
Contributor Author

xu1zhou commented Jul 26, 2021

@antoniovicente @ggreenway thanks for the explanation.
From my observation, the slice queue grows by about 1-2 slices at the front after the codec (codec.cc) processing. Then, before SSL_write, linearize causes roughly 60 memcpy operations on these slices in order to save about 1-2 instances of the "crypto overhead of sealing the record". Do you think it is still worth linearizing the slices in that case? @ggreenway

@ggreenway
Contributor

I don't know; I think the only way to find out is to try it both ways in a benchmark, and see what the measurements tell us.

When I tested, across a variety of traffic patterns, it was faster on average to have the linearize here. But that testing was not comprehensive. If you collect additional data in this area, I'd like to see it.

@antoniovicente
Contributor

It would be good to know more about the specifics of the protocols involved. Are you proxying HTTP1 or HTTP2?

@xu1zhou
Contributor Author

xu1zhou commented Jul 27, 2021

Our scenario is: Client ----HTTPS (HTTP/1)----> Envoy ----HTTP/1----> Server. The client sends a request to download a large file from the Server, and the memcpy happens when Envoy writes the HTTPS response back to the client.

@antoniovicente
Contributor

I have a few possible guesses about what may be happening:

  1. The serialized response headers end up adding a slice before the body, and that misalignment triggers the linearize copies while proxying the first per_connection_buffer_limit_bytes bytes of the response.
  2. The response is chunk encoded. Envoy strips the backend's chunking and adds its own as it proxies data, triggering the linearize behavior.

If the issue is (1), it may be possible to get less copying by using a smaller per_connection_buffer_limit_bytes for the upstream clusters; per_connection_buffer_limit_bytes defaults to 1MB. Your mention of about 60 memcpy operations seems consistent with that default, since 64 * 16KB per SSL_write == 1MB.

Some improvements to the heuristics used to decide whether or not to linearize may help. For example, skip linearize for the current slice if the next slice is already a full 16KB in size (or close to 16KB).
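
As a rough sketch of what such a heuristic could look like (illustrative only; the function name, kNearlyFull, and its threshold value are made up, and this is not the code from #14053):

#include <cstddef>

constexpr size_t kTlsRecordSize = 16384;               // max plaintext per TLS record
constexpr size_t kNearlyFull = kTlsRecordSize - 1024;  // "close enough" to a full record

// Decide whether merging slices via linearize is likely to pay off.
bool shouldLinearize(size_t first_slice_size, size_t next_slice_size,
                     size_t bytes_to_write) {
  if (first_slice_size >= bytes_to_write) {
    return false;  // the existing fast path applies, nothing to merge
  }
  if (next_slice_size >= kNearlyFull) {
    // The next slice nearly fills a record on its own; copying it just to
    // absorb a small front slice moves bytes without saving many records.
    return false;
  }
  return true;  // fragmented slices: pay the memcpy to avoid tiny records
}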

@ggreenway
Contributor

Some improvements to the heuristics used to decide whether or not to linearize may help. For example, skip linearize for the current slice if the next slice is already a full 16KB in size (or close to 16KB).

I tried exactly this heuristic; the improvement seen didn't justify the change at the time. But when I tested, I did not test with chunked responses; that may change the result significantly. It's probably worth testing again. Here's my change where I attempted this: #14053

@xu1zhou
Contributor Author

xu1zhou commented Jul 28, 2021

@antoniovicente I think issue (1) you described exactly matches the phenomenon I'm encountering now. Most of the slices are 16KB and were not modified by Envoy, but to reduce the number of slices they have to be copied into a new slice to fill the gap caused by the first small slice.
"Using a smaller per_connection_buffer_limit_bytes" may help in this scenario, but does a smaller per_connection_buffer_limit_bytes mean more iterations are needed to finish reading the response data into Envoy?

@xu1zhou
Contributor Author

xu1zhou commented Jul 28, 2021

Regarding the heuristic in #14053 tried by @ggreenway: I agree with your point that "At some threshold the memcpy of linearization exceeds the overhead of the extra TLS record". Did you find that threshold in the end?

@antoniovicente
Contributor

@antoniovicente I think issue (1) you described exactly matches the phenomenon I'm encountering now. Most of the slices are 16KB and were not modified by Envoy, but to reduce the number of slices they have to be copied into a new slice to fill the gap caused by the first small slice.
"Using a smaller per_connection_buffer_limit_bytes" may help in this scenario, but does a smaller per_connection_buffer_limit_bytes mean more iterations are needed to finish reading the response data into Envoy?

cc @mum4k

More wakeups would be required. Smaller buffers may have a bit of a CPU penalty, but they lower the proxy's memory usage and provide slightly better sharing of CPU across connections handled by the same worker. I don't think we have load tests that show throughput changes due to changes to the per_connection_buffer_limit_bytes config.

@ggreenway
Contributor

Regarding the heuristic in #14053 tried by @ggreenway: I agree with your point that "At some threshold the memcpy of linearization exceeds the overhead of the extra TLS record". Did you find that threshold in the end?

No, I abandoned that PR in favor of other optimizations that achieved the performance I needed for the use case I was optimizing for at the time.

@mum4k
Contributor

mum4k commented Jul 28, 2021

@antoniovicente I think issue (1) you described exactly matches the phenomenon I'm encountering now. Most of the slices are 16KB and were not modified by Envoy, but to reduce the number of slices they have to be copied into a new slice to fill the gap caused by the first small slice.
"Using a smaller per_connection_buffer_limit_bytes" may help in this scenario, but does a smaller per_connection_buffer_limit_bytes mean more iterations are needed to finish reading the response data into Envoy?

cc @mum4k

More wakeups would be required. Smaller buffers may have a bit of a CPU penalty, but they lower the proxy's memory usage and provide slightly better sharing of CPU across connections handled by the same worker. I don't think we have load tests that show throughput changes due to changes to the per_connection_buffer_limit_bytes config.

We currently don't have such load tests, but it should be relatively easy to set up a one-off (or even a continuous) experiment internally if this is desired.

@github-actions

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale stalebot believes this issue/PR has not been touched recently label Aug 27, 2021
@github-actions

github-actions bot commented Sep 3, 2021

This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted" or "no stalebot". Thank you for your contributions.

@github-actions github-actions bot closed this as completed Sep 3, 2021