
Memory copy in buffer linearize process #17219

Closed
xu1zhou opened this issue Jul 2, 2021 · 17 comments
Labels
question: Questions that are neither investigations, bugs, nor enhancements
stale: stalebot believes this issue/PR has not been touched recently

Comments

@xu1zhou
Contributor

xu1zhou commented Jul 2, 2021

Description:
In my test of a large-file download through Envoy (HTTP/1 + TLS proxy), I found a memcpy operation in the SslSocket::doWrite method.

#ssl_socket.cc:253

int rc = SSL_write(rawSsl(), write_buffer.linearize(bytes_to_write), bytes_to_write);
#buffer_impl.cc:236

void* OwnedImpl::linearize(uint32_t size) {
  ...
  if (slices_[0].dataSize() < size) {
    Slice new_slice{size, account_};
    Slice::Reservation reservation = new_slice.reserve(size);
    ASSERT(reservation.mem_ != nullptr);
    ASSERT(reservation.len_ == size);
    copyOut(0, size, reservation.mem_);
    new_slice.commit(reservation);
...

After digging deeper into the code, I found that in OwnedImpl::linearize(), Envoy allocates a new slice whenever the first slice is smaller than the number of bytes to write, and copyOut() then invokes memcpy to fill it.

Do you know who creates these old slices in the queue? And do you think they may grow to a larger size and slow down performance (too much memcpy)?
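
For illustration, here is a minimal standalone sketch of the effect (ToyBuffer and the 200-byte/16KB sizes are made up for the example; this is not Envoy's Slice/OwnedImpl code):

#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

// Toy model of a slice-based buffer: a queue of independently sized chunks.
struct ToyBuffer {
  std::vector<std::vector<uint8_t>> slices;

  // Simplified stand-in for OwnedImpl::linearize(): if the first slice
  // already holds `size` bytes, no copy is needed; otherwise every byte of
  // the requested range is copied into one freshly allocated slice.
  size_t bytesCopiedByLinearize(size_t size) const {
    if (!slices.empty() && slices.front().size() >= size) {
      return 0;  // fast path: hand back the first slice as-is
    }
    return size;  // slow path: the whole range is memcpy'd into a new slice
  }
};

int main() {
  ToyBuffer buf;
  buf.slices.emplace_back(200);  // small front slice, e.g. serialized headers
  for (int i = 0; i < 64; ++i) {
    buf.slices.emplace_back(16384);  // 16KB body slices from the upstream
  }
  // Asking for one TLS record's worth (16KB) hits the slow path, because the
  // 200-byte slice sits at the front of the queue.
  std::printf("bytes copied: %zu\n", buf.bytesCopiedByLinearize(16384));
  return 0;
}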

@xu1zhou xu1zhou added the triage Issue requires triage label Jul 2, 2021
@xu1zhou
Contributor Author

xu1zhou commented Jul 2, 2021

@mattklein123

@mattklein123
Member

cc @ggreenway @antoniovicente

@lizan lizan added question Questions that are neither investigations, bugs, nor enhancements and removed triage Issue requires triage labels Jul 2, 2021
@antoniovicente
Contributor

Who created the old slice varies, but the most common source is data frames arriving on the other end of the I/O pipeline (i.e. a response coming from the upstream which is written to the downstream via TLS). One possible source of fragmentation in the connection output buffer is the framing that the codec has to add, including HTTP/1 chunk encoding and HTTP/2 data frame headers. I think that right now HTTP/1 messages with content-length will manage to skip the memcpy when proxying over TLS most of the time. Other cases will likely end up with an output buffer that contains slices of varying sizes, which requires linearize before send.

Regarding the specific case you describe (HTTP/1 + TLS proxy), I would think it shouldn't trigger the linearize path often when both the downstream and upstream are HTTP/1, unless the response uses HTTP/1 chunk encoding instead of content-length. I know @ggreenway had considered some heuristics to avoid copies in more cases.

I think that in the HTTP/2 case we need to do a copy to compact the generated H2 data frame stream, either at the time the frames are generated by the codec or via linearize when generating the TLS records. The number of copies involved in these two approaches should be almost the same, so there's not a lot of room for optimization.

@antoniovicente antoniovicente changed the title Memmory copy in buffer linearize process Memory copy in buffer linearize process Jul 2, 2021
@ggreenway
Contributor

Testing with various request sizes and data patterns showed that this was as fast as or faster than the alternative, which was to call SSL_write on each slice in the buffer. The reason is that each SSL_write call causes a TLS record to be sent, which includes the crypto overhead of sealing the record. Sending more records than necessary is more expensive than this memcpy.

The ideal solution would be for BoringSSL to have a version of SSL_write that takes an iovec or similar, so we could pass in an array of pointers and lengths, and the function could copy out however much data will fit into a record, without needing to linearize. But that function does not exist.
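
For reference, here is a sketch of the per-slice alternative described above (SSL_write is the real BoringSSL/OpenSSL call; SliceView and writeEachSlice are just illustrative names, not Envoy code):

#include <openssl/ssl.h>

#include <cstddef>
#include <cstdint>
#include <vector>

struct SliceView {
  const uint8_t* mem;
  size_t len;
};

// The alternative to linearize: one SSL_write per slice. Every call with a
// non-zero length seals (encrypts and MACs) at least one TLS record, so many
// small slices produce many small records and more crypto overhead.
int writeEachSlice(SSL* ssl, const std::vector<SliceView>& slices) {
  int total = 0;
  for (const auto& s : slices) {
    const int rc = SSL_write(ssl, s.mem, static_cast<int>(s.len));
    if (rc <= 0) {
      return rc;  // caller inspects SSL_get_error() for WANT_WRITE etc.
    }
    total += rc;
  }
  return total;
}

A vectored SSL_write taking an array of pointer/length pairs would let the library pack these slices into full 16KB records itself, but as noted above no such API exists today.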

@xu1zhou
Contributor Author

xu1zhou commented Jul 26, 2021

@antoniovicente @ggreenway thanks for the explanation.
From my observation, the slice queue grows by about 1-2 slices at the front after the codec (codec.cc) processing. Then, before SSL_write, linearize causes roughly 60 memcpy operations on these slices in order to save about 1-2 instances of the "crypto overhead of sealing the record". Do you think it is still worth linearizing the slices in that case? @ggreenway

@ggreenway
Contributor

I don't know; I think the only way to find out is to try it both ways in a benchmark, and see what the measurements tell us.

When I tested, across a variety of traffic patterns, it was faster on average to have the linearize here. But that testing was not comprehensive. If you collect additional data in this area, I'd like to see it.

@antoniovicente
Contributor

It would be good to know more about the specifics of the protocols involved. Are you proxying HTTP1 or HTTP2?

@xu1zhou
Contributor Author

xu1zhou commented Jul 27, 2021

Our scenario is: Client ----HTTPS (HTTP/1)----> Envoy ----HTTP/1----> Server. The client sends a request to download a large file from the Server, and the memcpy happens when Envoy writes the HTTPS response back to the client.

@antoniovicente
Contributor

I have a few possible guesses about what may be happening:

  1. The serialized response headers end up adding a slice before the body, and that misalignment triggers the linearize copies while proxying the first per_connection_buffer_limit_bytes bytes of the response.
  2. The response is chunk encoded. Envoy strips the backend's chunking and adds its own as it proxies data, triggering the linearize behavior.

If the issue is (1), it may be possible to get less copying by using a smaller per_connection_buffer_limit_bytes for the upstream clusters; per_connection_buffer_limit_bytes defaults to 1MB. Your mention of about 60 memcpy operations seems consistent with that default, since 64 * 16KB per SSL_write == 1MB.

Some improvements to the heuristics used to decide whether or not to linearize may help. For example, skip linearize for the current slice if the next slice is already a full 16KB in size (or close to 16KB).
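
As a rough sketch of what such a heuristic could look like (illustrative only; the function name, kNearlyFull, and its threshold value are made up, and this is not the code from #14053):

#include <cstddef>

constexpr size_t kTlsRecordSize = 16384;               // max plaintext per TLS record
constexpr size_t kNearlyFull = kTlsRecordSize - 1024;  // "close enough" to a full record

// Decide whether merging slices via linearize is likely to pay off.
bool shouldLinearize(size_t first_slice_size, size_t next_slice_size,
                     size_t bytes_to_write) {
  if (first_slice_size >= bytes_to_write) {
    return false;  // the existing fast path applies, nothing to merge
  }
  if (next_slice_size >= kNearlyFull) {
    // The next slice nearly fills a record on its own; copying it just to
    // absorb a small front slice moves bytes without saving many records.
    return false;
  }
  return true;  // fragmented slices: pay the memcpy to avoid tiny records
}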

@ggreenway
Contributor

Some improvements to the heuristics used to decide whether or not to linearize may help. For example, skip linearize for the current slice if the next slice is already a full 16KB in size (or close to 16KB).

I tried exactly this heuristic; the improvement seen didn't justify the change at the time. But when I tested, I did not test with chunked responses; that may change the result significantly. It's probably worth testing again. Here's my change where I attempted this: #14053

@xu1zhou
Contributor Author

xu1zhou commented Jul 28, 2021

@antoniovicente I think issue (1) you described exactly matches the phenomenon I'm encountering now. Most of the slices are 16KB and were not modified by Envoy, but to reduce the number of slices they have to be copied into a new slice to fill the gap caused by the first small slice.
"Using a smaller per_connection_buffer_limit_bytes" may help in this scenario, but does a smaller per_connection_buffer_limit_bytes mean more iterations are needed to finish reading the response data into Envoy?

@xu1zhou
Contributor Author

xu1zhou commented Jul 28, 2021

Regarding the heuristic in #14053 tried by @ggreenway: I agree with your point that "At some threshold the memcpy of linearization exceeds the overhead of the extra TLS record". Did you find that threshold in the end?

@antoniovicente
Contributor

@antoniovicente I think issue (1) you described exactly matches the phenomenon I'm encountering now. Most of the slices are 16KB and were not modified by Envoy, but to reduce the number of slices they have to be copied into a new slice to fill the gap caused by the first small slice.
"Using a smaller per_connection_buffer_limit_bytes" may help in this scenario, but does a smaller per_connection_buffer_limit_bytes mean more iterations are needed to finish reading the response data into Envoy?

cc @mum4k

More wakeups would be required. Smaller buffers may have a bit of a CPU penalty, but they lower the proxy's memory usage and provide slightly better sharing of CPU across connections handled by the same worker. I don't think we have load tests that show throughput changes due to changes to the per_connection_buffer_limit_bytes config.

@ggreenway
Contributor

Regarding the heuristic in #14053 tried by @ggreenway: I agree with your point that "At some threshold the memcpy of linearization exceeds the overhead of the extra TLS record". Did you find that threshold in the end?

No, I abandoned that PR in favor of other optimizations that achieved the performance I needed for the use case I was optimizing for at the time.

@mum4k
Contributor

mum4k commented Jul 28, 2021

@antoniovicente I think issue (1) you described exactly matches the phenomenon I'm encountering now. Most of the slices are 16KB and were not modified by Envoy, but to reduce the number of slices they have to be copied into a new slice to fill the gap caused by the first small slice.
"Using a smaller per_connection_buffer_limit_bytes" may help in this scenario, but does a smaller per_connection_buffer_limit_bytes mean more iterations are needed to finish reading the response data into Envoy?

cc @mum4k

More wakeups would be required. Smaller buffers may have a bit of a CPU penalty, but they lower the proxy's memory usage and provide slightly better sharing of CPU across connections handled by the same worker. I don't think we have load tests that show throughput changes due to changes to the per_connection_buffer_limit_bytes config.

We currently don't have such load tests, but it should be relatively easy to set up a one-off (or even a continuous) experiment internally if this is desired.

@github-actions

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale stalebot believes this issue/PR has not been touched recently label Aug 27, 2021
@github-actions

github-actions bot commented Sep 3, 2021

This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted" or "no stalebot". Thank you for your contributions.

@github-actions github-actions bot closed this as completed Sep 3, 2021