tls: improve write performance by reducing copying #14053
Conversation
This change tries to heuristically decide what block size to write. 16kb blocks are most efficient in the TLS layer, but there were cases where Envoy would memcpy nearly all the data in order to create blocks of this size. This change improves performance by sometimes writing smaller blocks in order to reduce memcpy.
Signed-off-by: Greg Greenway <ggreenway@apple.com>
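For readers skimming the thread, a minimal standalone sketch of the kind of heuristic described above. The deque-of-strings buffer model and the nextWriteBlock name are simplifications for illustration only, not the OwnedImpl::maybeLinearize implementation added in this PR:

#include <algorithm>
#include <cstdint>
#include <deque>
#include <string>
#include <string_view>

// Simplified stand-in for a buffer made of discontiguous slices.
using Slices = std::deque<std::string>;

// Return the block of data that should be handed to SSL_write next.
std::string_view nextWriteBlock(Slices& slices, uint32_t max_size, uint32_t desired_min_size) {
  if (slices.empty()) {
    return {};
  }
  // Big enough already: skip the memcpy and accept a slightly smaller record.
  if (slices.front().size() >= desired_min_size) {
    return slices.front();
  }
  // The next slice will already yield a full-size record, so don't copy; just
  // emit the small front slice and let the following call use the big one.
  if (slices.size() >= 2 && slices[1].size() >= max_size) {
    return slices.front();
  }
  // Otherwise, coalesce leading slices into the front slice, up to max_size.
  while (slices.size() >= 2 && slices.front().size() < max_size) {
    std::string& next = slices[1];
    const size_t take = std::min<size_t>(max_size - slices.front().size(), next.size());
    slices.front().append(next, 0, take);  // This append is the memcpy being traded off.
    next.erase(0, take);
    if (next.empty()) {
      slices.erase(slices.begin() + 1);
    }
  }
  return slices.front();
}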
@@ -266,6 +266,30 @@ void* OwnedImpl::linearize(uint32_t size) {
  return slices_.front()->data();
}

RawSlice OwnedImpl::maybeLinearize(uint32_t max_size, uint32_t desired_min_size) {
This implementation only looks at the first and second slices in the buffer. Would it make more sense to return a linearized group of whole slices, stopping just before it exceeds max_size?
Possibly. This is very much a heuristic, and there are a lot of ways it could potentially be improved. This version improved performance in the case I was benchmarking, didn't show any degradation for low-throughput (smaller request/response) traffic patterns, and is simple enough to reason about and predict.
Yeah, but I suggested the above because then we don't need the second parameter while keeping pretty much the same behavior otherwise.
Can you elaborate on what you mean by "a linearized group of whole slices"? I don't understand what you're suggesting.
like:

uint64_t size = 0;
for (const auto& slice : slices_) {
  if (size + slice->dataSize() > max_size) {
    break;
  }
  size += slice->dataSize();
}
return {linearize(size), size};
At some threshold the memcpy of linearization exceeds the overhead of the extra TLS record. For instance, if all slices were 16383 bytes (1 less than a full record), I don't believe it's faster to linearize everything to 16384; emitting slightly smaller records will be faster.
I made a wild guess at picking 25% of a record as the threshold at which we should definitely linearize, and perf tests indicated that helped. But maybe I need to come up with a small benchmark to try to quantify this relationship.
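To make that degenerate case concrete, here is a tiny standalone simulation (an illustration under the stated assumptions, not the PR's benchmark) that counts how many bytes would be memcpy'd if 16383-byte slices were always coalesced into full 16384-byte records; essentially all of the payload ends up being copied:

#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <deque>

// Count bytes memcpy'd when a queue of equally sized slices is always
// coalesced into full 16384-byte records before each write.
int main() {
  constexpr uint64_t record_size = 16384;
  constexpr uint64_t slice_size = 16383;  // One byte short of a full record.
  constexpr int num_slices = 1000;

  std::deque<uint64_t> slices(num_slices, slice_size);
  uint64_t copied = 0;
  while (!slices.empty()) {
    uint64_t block = 0;
    bool coalesced = false;
    while (!slices.empty() && block < record_size) {
      const uint64_t take = std::min(record_size - block, slices.front());
      if (take < slices.front()) {
        slices.front() -= take;
      } else {
        slices.pop_front();
      }
      if (block > 0) {
        coalesced = true;  // Data pulled from a second slice => memcpy.
      }
      block += take;
    }
    if (coalesced) {
      copied += block;  // In this model, linearize copies the whole coalesced block.
    }
  }
  const uint64_t total = uint64_t{num_slices} * slice_size;
  std::printf("copied %llu of %llu bytes (%.1f%%)\n",
              static_cast<unsigned long long>(copied),
              static_cast<unsigned long long>(total), 100.0 * copied / total);
  return 0;
}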
Right, but here we're comparing emitting many small TLS records vs combining many small buffers into a single TLS record, not copying data between slices to completely fill records.
In the example I was considering, I was asking about slices in the 4-8kb range. In that range it's not clear to me whether the memcpy cost will be more than the overhead of generating a TLS record.
The difference between 2x 8KiB vs 1x 16KiB is probably negligible, but I'm pretty sure that memcpy overhead is smaller than writing additional TLS record(s) to the wire.
...but that's an educated guess, feel free to benchmark this (e.g. by comparing proxy throughput, not userland microbenchmarks).
@@ -84,6 +84,10 @@ ContextImpl::ContextImpl(Stats::Scope& scope, const Envoy::Ssl::ContextConfig& c
  int rc = SSL_CTX_set_app_data(ctx.ssl_ctx_.get(), this);
  RELEASE_ASSERT(rc == 1, Utility::getLastCryptoError().value_or(""));

  constexpr uint32_t mode = SSL_MODE_ACCEPT_MOVING_WRITE_BUFFER;
  rc = SSL_CTX_set_mode(ctx.ssl_ctx_.get(), mode);
When is this needed? maybeLinearize returns either an unmodified slice or a linearized slice, but I don't think there is a case where an unmodified slice would later be linearized (assuming desired_min_size <= max_size), so the buffer shouldn't move... or am I missing something?
It probably isn't strictly necessary, and with the current implementation I don't think the buffer will move. But the maybeLinearize interface doesn't guarantee this property, and I didn't want to keep the existing book-keeping to ensure the write buffer doesn't change. It doesn't look to me like anything is more expensive in BoringSSL when this mode is set; it just removes a check that isn't gaining anything for how we use the API.
Sure, but if the maybeLinearize implementation changes enough to require moving buffers, then we can enable this mode. Right now, it removes the default sanity check for no reason.
Huh, looking into this I realize I lost part of the change (I did a bunch of moving code between branches to chop an originally big change into manageable pieces). To simplify this code, I removed bytes_to_retry_. A subsequent call can then end up with a larger buffer from maybeLinearize if data was added to the buffer since the last attempt at SSL_write. That's why I was setting this option here.
I'm trying to remember why I made that change originally; I think it may have been to make it easier to read and reason about. I don't think it had a measurable performance impact. Any preference on whether I make that change or not?
A subsequent call can then end up with a larger buffer from maybeLinearize if data was added to the buffer since the last attempt at SSL_write. That's why I was setting this option here.

I don't believe that SSL_MODE_ACCEPT_MOVING_WRITE_BUFFER allows the buffer to grow between retries. AFAIK, the buffer data has to stay the same, but it can be available at a different address than before.
The docs say it is allowed:
// In TLS, a non-blocking |SSL_write| differs from non-blocking |write| in that
// a failed |SSL_write| still commits to the data passed in. When retrying, the
// caller must supply the original write buffer (or a larger one containing the
// original as a prefix). By default, retries will fail if they also do not
// reuse the same |buf| pointer. This may be relaxed with
// |SSL_MODE_ACCEPT_MOVING_WRITE_BUFFER|, but the buffer contents still must be
// unchanged.
The last sentence literally says the buffer contents still must be unchanged.
...but there is also "(or a larger one containing the original as a prefix)", hmm. Maybe it's allowed after all.
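For concreteness, a hedged sketch of the retry contract being quoted, assuming an already-connected non-blocking SSL* whose SSL_CTX had SSL_MODE_ACCEPT_MOVING_WRITE_BUFFER set (an illustration of the documented rules, not Envoy's doWrite):

#include <openssl/ssl.h>

#include <string>

// Attempt to write `data`; on WANT_WRITE/WANT_READ the caller may retry later.
// With SSL_MODE_ACCEPT_MOVING_WRITE_BUFFER set, the retry buffer may live at a
// different address and may be larger, but it must start with the same bytes
// that were passed to the failed call (the original data as a prefix).
int tryWrite(SSL* ssl, const std::string& data) {
  const int rc = SSL_write(ssl, data.data(), static_cast<int>(data.size()));
  if (rc > 0) {
    return rc;  // rc bytes were written.
  }
  const int err = SSL_get_error(ssl, rc);
  if (err == SSL_ERROR_WANT_WRITE || err == SSL_ERROR_WANT_READ) {
    return 0;  // Retry later; keep (at least) the same bytes at the front.
  }
  return -1;  // Fatal error.
}

// Usage note: after a 0 return, it is fine to append more bytes to the pending
// buffer before retrying, because the original write remains a prefix.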
  // The next slice will already be of the desired size, so don't copy and
  // return the front slice.
  if (slices_.size() >= 2 && slices_[1]->dataSize() >= max_size) {
Worth considering a generalization of this logic so that we refuse to copy whenever the next slice is larger than some copy threshold? That way a buffer containing {1, 1, 1, 1, 1, 16kb} would only end up copying 5 bytes when called with parameters like linearize(16kb, 4000).
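A standalone sketch of that generalization (simplified deque-of-strings slices and a hypothetical copy_threshold parameter, not the OwnedImpl code):

#include <algorithm>
#include <cstdint>
#include <deque>
#include <string>

// Coalesce leading slices into the front slice, but stop as soon as the next
// slice is itself at least `copy_threshold` bytes; never exceed `max_size`.
// For {1, 1, 1, 1, 1, 16kb} with max_size=16384 and copy_threshold=4000, only
// a few bytes from the tiny leading slices are appended into the front slice,
// and the 16kb slice is never touched.
void coalesceSmallPrefix(std::deque<std::string>& slices, uint32_t max_size,
                         uint32_t copy_threshold) {
  while (slices.size() >= 2 && slices[1].size() < copy_threshold &&
         slices.front().size() < max_size) {
    std::string& next = slices[1];
    const size_t take =
        std::min<size_t>(max_size - slices.front().size(), next.size());
    slices.front().append(next, 0, take);
    next.erase(0, take);
    if (next.empty()) {
      slices.erase(slices.begin() + 1);
    }
  }
}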
@@ -248,15 +248,17 @@ Network::IoResult SslSocket::doWrite(Buffer::Instance& write_buffer, bool end_st
  while (bytes_to_write > 0) {
    // TODO(mattklein123): As it relates to our fairness efforts, we might want to limit the number
    // of iterations of this loop, either by pure iterations, bytes written, etc.
    const auto slice = write_buffer.maybeLinearize(16384, 4096);
Some of the concerns about resuming from a different region could be addressed by skipping the call to linearize when doing a retry. When retrying, we know that the first slice contains roughly bytes_to_retry_ bytes.
I recommend setting the copy threshold to 4000 bytes instead of 4096. This is related to the buffer's default slice size being 4096 - sizeof(OwnedSlice), which is about 4032 bytes. Setting it to 4096 will result in a lot of spurious copies. See also #14054 (comment)
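A small compile-time check of that arithmetic; the 64-byte per-slice overhead is an assumed value chosen only to match the quoted "about 4032 bytes":

#include <cstdint>

constexpr uint32_t kDefaultReservation = 4096;
constexpr uint32_t kAssumedOwnedSliceOverhead = 64;  // Assumption for illustration.
constexpr uint32_t kUsableSliceSize = kDefaultReservation - kAssumedOwnedSliceOverhead;

// With a 4096-byte threshold, even a completely full default slice (~4032
// bytes) fails the check and gets copied; a 4000-byte threshold does not.
static_assert(kUsableSliceSize < 4096, "a full default slice would still be copied");
static_assert(kUsableSliceSize >= 4000, "a 4000-byte threshold avoids the spurious copy");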
Signed-off-by: Greg Greenway <ggreenway@apple.com>
Here are the results of the benchmark code I just pushed. The test params are: full_linearize/short_slice_size/num_short_slices. full_linearize==0 means use the new maybeLinearize behavior. The short slices were to try to force degenerate behavior. Please review the test code and make sure it's measuring what we want to measure.

One thing I went back and forth on was whether to use a real socket, or use a mem BIO to communicate between client and server. The real sockets capture a real cost (kernel mode transitions), but I had some inconsistency between test runs, and I wonder if the syscalls contributed.
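For context on the mem BIO option, this is roughly the shape of wiring two in-memory SSL endpoints together with a BIO pair (a hedged sketch assuming the contexts, handshake pumping, and error handling live elsewhere; not the benchmark's actual code):

#include <openssl/bio.h>
#include <openssl/ssl.h>

// Wire a client SSL and a server SSL together through an in-memory BIO pair so
// the benchmark measures TLS record processing without kernel socket I/O.
bool wireWithBioPair(SSL* client, SSL* server) {
  BIO* client_io = nullptr;
  BIO* server_io = nullptr;
  // 0 selects the default internal write-buffer size for each end of the pair.
  if (BIO_new_bio_pair(&client_io, 0, &server_io, 0) != 1) {
    return false;
  }
  // Each SSL uses its end of the pair for both reading and writing.
  SSL_set_bio(client, client_io, client_io);
  SSL_set_bio(server, server_io, server_io);
  return true;
}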
@antoniovicente The benchmark in this PR might help measure effects in #14111
    num_writes++;
  }

  state.counters["writes_per_iteration"] = num_writes;
Could we capture a count of the number of times linearize did something?
Good idea!
These results were run under bazel, so ignore timing. But the num_times_linearize_did_something is deterministic.
----------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
----------------------------------------------------------------------------------
testThroughput/0/0/0 330 us 328 us 1 num_times_linearize_did_something=10 throughput=548.771M/s writes_per_iteration=11
testThroughput/1/0/0 276 us 275 us 1 num_times_linearize_did_something=10 throughput=655.961M/s writes_per_iteration=11
testThroughput/0/1/1 285 us 285 us 1 num_times_linearize_did_something=10 throughput=633.111M/s writes_per_iteration=12
testThroughput/1/1/1 342 us 341 us 1 num_times_linearize_did_something=11 throughput=528.143M/s writes_per_iteration=11
testThroughput/0/128/1 290 us 289 us 1 num_times_linearize_did_something=9 throughput=623.632M/s writes_per_iteration=12
testThroughput/1/128/1 294 us 293 us 1 num_times_linearize_did_something=10 throughput=614.242M/s writes_per_iteration=11
testThroughput/0/4096/1 257 us 256 us 1 num_times_linearize_did_something=1 throughput=704.526M/s writes_per_iteration=12
testThroughput/1/4096/1 285 us 285 us 1 num_times_linearize_did_something=11 throughput=633.298M/s writes_per_iteration=11
testThroughput/0/1/2 283 us 283 us 1 num_times_linearize_did_something=11 throughput=637.848M/s writes_per_iteration=11
testThroughput/1/1/2 288 us 287 us 1 num_times_linearize_did_something=11 throughput=628.116M/s writes_per_iteration=11
testThroughput/0/128/2 316 us 316 us 1 num_times_linearize_did_something=7 throughput=571.135M/s writes_per_iteration=12
testThroughput/1/128/2 271 us 270 us 1 num_times_linearize_did_something=10 throughput=667.464M/s writes_per_iteration=11
testThroughput/0/4096/2 272 us 271 us 1 num_times_linearize_did_something=1 throughput=664.256M/s writes_per_iteration=13
testThroughput/1/4096/2 282 us 282 us 1 num_times_linearize_did_something=11 throughput=640.216M/s writes_per_iteration=11
testThroughput/0/1/3 283 us 282 us 1 num_times_linearize_did_something=11 throughput=639.23M/s writes_per_iteration=11
testThroughput/1/1/3 275 us 274 us 1 num_times_linearize_did_something=11 throughput=656.697M/s writes_per_iteration=11
testThroughput/0/128/3 281 us 280 us 1 num_times_linearize_did_something=5 throughput=642.652M/s writes_per_iteration=12
testThroughput/1/128/3 272 us 271 us 1 num_times_linearize_did_something=10 throughput=664.379M/s writes_per_iteration=11
testThroughput/0/4096/3 261 us 260 us 1 num_times_linearize_did_something=1 throughput=693.156M/s writes_per_iteration=14
testThroughput/1/4096/3 273 us 273 us 1 num_times_linearize_did_something=11 throughput=660.858M/s writes_per_iteration=11
num_times_linearize_did_something looks high in the cases where maybeLinearize is used. The number of copies done by both implementations is the same most of the time.
"if (slices_.size() >= 2 && slices_[1]->dataSize() >= max_size) {" should be using desired_min_size
Results after this change:
Benchmark Time CPU Iterations UserCounters...
testThroughput/0/0/0 60.5 us 60.5 us 11572 num_times_linearize_did_something=0 throughput=2.97744G/s writes_per_iteration=12
testThroughput/1/0/0 62.9 us 62.9 us 11116 num_times_linearize_did_something=10 throughput=2.86485G/s writes_per_iteration=11
testThroughput/0/1/1 62.1 us 62.1 us 11279 num_times_linearize_did_something=0 throughput=2.90119G/s writes_per_iteration=13
testThroughput/1/1/1 64.0 us 64.0 us 10919 num_times_linearize_did_something=11 throughput=2.81688G/s writes_per_iteration=11
testThroughput/0/128/1 64.4 us 64.4 us 10880 num_times_linearize_did_something=0 throughput=2.79695G/s writes_per_iteration=14
testThroughput/1/128/1 63.3 us 63.3 us 11091 num_times_linearize_did_something=10 throughput=2.84636G/s writes_per_iteration=11
testThroughput/0/4096/1 62.6 us 62.6 us 11181 num_times_linearize_did_something=0 throughput=2.87997G/s writes_per_iteration=13
testThroughput/1/4096/1 63.5 us 63.5 us 11035 num_times_linearize_did_something=11 throughput=2.83919G/s writes_per_iteration=11
testThroughput/0/1/2 60.9 us 60.9 us 11496 num_times_linearize_did_something=1 throughput=2.96085G/s writes_per_iteration=12
testThroughput/1/1/2 63.9 us 63.8 us 10956 num_times_linearize_did_something=11 throughput=2.82267G/s writes_per_iteration=11
testThroughput/0/128/2 63.3 us 63.2 us 11093 num_times_linearize_did_something=1 throughput=2.85011G/s writes_per_iteration=13
testThroughput/1/128/2 63.1 us 63.1 us 11066 num_times_linearize_did_something=10 throughput=2.85719G/s writes_per_iteration=11
testThroughput/0/4096/2 64.4 us 64.4 us 10890 num_times_linearize_did_something=0 throughput=2.80025G/s writes_per_iteration=14
testThroughput/1/4096/2 63.5 us 63.5 us 11029 num_times_linearize_did_something=11 throughput=2.83725G/s writes_per_iteration=11
testThroughput/0/1/3 61.0 us 60.9 us 11461 num_times_linearize_did_something=1 throughput=2.957G/s writes_per_iteration=12
testThroughput/1/1/3 64.0 us 64.0 us 10950 num_times_linearize_did_something=11 throughput=2.81811G/s writes_per_iteration=11
testThroughput/0/128/3 63.8 us 63.8 us 11085 num_times_linearize_did_something=1 throughput=2.82643G/s writes_per_iteration=13
testThroughput/1/128/3 63.1 us 63.0 us 11112 num_times_linearize_did_something=10 throughput=2.85861G/s writes_per_iteration=11
testThroughput/0/4096/3 64.7 us 64.7 us 10804 num_times_linearize_did_something=1 throughput=2.78513G/s writes_per_iteration=14
testThroughput/1/4096/3 63.3 us 63.3 us 11051 num_times_linearize_did_something=11 throughput=2.84711G/s writes_per_iteration=11
When I try to run tls_throughput_benchmark, I get:
Signed-off-by: Antonio Vicente <avd@google.com>
Does it also fail when you run it under bazel?
That passes, but runs with the test flags to reduce the number of iterations. I think the binary ran successfully once I cd'd to the directory where the binary lives.
Signed-off-by: Greg Greenway <ggreenway@apple.com>
  }

  // The next slice will already be of the desired size, so don't copy and
  // return the front slice.
There's an error in the next line, should be:
if (slices_.size() >= 2 && slices_[1]->dataSize() >= desired_min_size) {
Given that it's a heuristic, I wouldn't say it's an error, just a choice. If the next slice is slightly larger than 4k, and the current slice is 1 byte, what's the best behavior?
It turns out the slice sizes are terrible, as you've noted, due to the inline storage of the OwnedSlice. The second slice contains just slightly less than 16k (I think it's 64 bytes less), which results in a bunch of copies on subsequent slices.
I think the next step is to remove the inline-storage from the slice (#14111), then re-evaluate this PR.
Sorry, my read of the comment made me think that you intended to compare against desired_min_size.
I think an interesting case is write behavior for HTTP/2, which involves a 9 byte data frame header followed by up to 16kb of data. The change in #14111 will have some consequences for how such writes interact with linearize, but I think it would result in little to no performance difference, since both versions of the buffer class would end up copying about the same amount of data during linearize.
Sorry, my read of the comment made me think that you intended to compare against desired_min_size.
I see the confusion now. That comment was written when the parameter had a different (less clear) name. I'll clarify the comment.
This pull request has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!
This pull request has been automatically closed because it has not had activity in the last 37 days. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!
Signed-off-by: Greg Greenway <ggreenway@apple.com>
Commit Message: This change tries to heuristically decide what block size to
write. 16kb blocks are most efficient in the TLS layer, but there were
cases where Envoy would memcpy nearly all the data in order to create
blocks of this size. This change improves performance by sometimes
writing smaller blocks in order to reduce memcpy.
Additional Description:
Risk Level: Medium
Testing: Added UT, all existing tests pass
Docs Changes: none
Release Notes: not needed; no functional change
Platform Specific Features: none