tls: improve write performance by reducing copying #14053
@@ -277,6 +277,30 @@ void* OwnedImpl::linearize(uint32_t size) {
   return slices_.front()->data();
 }
 
+RawSlice OwnedImpl::maybeLinearize(uint32_t max_size, uint32_t desired_min_size) {
+  while (!slices_.empty() && slices_[0]->dataSize() == 0) {
+    slices_.pop_front();
+  }
+
+  if (slices_.empty()) {
+    return {nullptr, 0};
+  }
+
+  const uint64_t slice_size = std::min<uint64_t>(slices_[0]->dataSize(), max_size);
+  if (slice_size >= desired_min_size) {
+    return {slices_[0]->data(), slice_size};
+  }
+
+  // The next slice will already be of the desired size, so don't copy and
+  // return the front slice.
Comment: There's an error in the next line, should be:

Reply: Given that it's a heuristic, I wouldn't say it's an error, just a choice. If the next slice is slightly larger than 4k, and the current slice is 1 byte, what's the best behavior? It turns out the slice sizes are terrible, as you've noted, due to the inline storage of the OwnedSlice. The second slice contains just slightly less than 16k (I think it's 64 bytes less), which results in a bunch of copies on subsequent slices. I think the next step is to remove the inline storage from the slice (#14111), then re-evaluate this PR.

Reply: Sorry, my read of the comment made me think that you intended to compare against desired_min_size. I think an interesting case is the write behavior for HTTP/2, which involves a 9-byte data frame header followed by up to 16kb of data. The change in #14111 will have some consequences for how such writes interact with linearize, but I think it would have little to no performance impact, since both versions of the buffer class would end up copying about the same amount of data during linearize.

Reply: I see the confusion now. That comment was written when the parameter had a different (less clear) name. I'll clarify the comment.
+  if (slices_.size() >= 2 && slices_[1]->dataSize() >= max_size) {
Comment: Worth considering a generalization of this logic, so that we refuse to copy whenever we encounter a next slice larger than some copy threshold? That way a buffer containing {1, 1, 1, 1, 1, 16kb} would only end up copying 5 bytes when called with parameters like linearize(16kb, 4000).
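A hedged sketch of the generalization this comment proposes; the loop, the name maybeLinearizeGrouped, and the single copy_threshold parameter are hypothetical, not part of the PR:

```cpp
// Hypothetical variant: coalesce only the run of leading slices that are each
// below copy_threshold, stopping before any slice that is already "big enough".
// For {1, 1, 1, 1, 1, 16kb} with copy_threshold = 4000, only 5 bytes are
// copied; the 16kb slice is returned untouched on the next call.
RawSlice OwnedImpl::maybeLinearizeGrouped(uint32_t max_size, uint32_t copy_threshold) {
  while (!slices_.empty() && slices_[0]->dataSize() == 0) {
    slices_.pop_front();
  }
  if (slices_.empty()) {
    return {nullptr, 0};
  }

  uint64_t group_size = 0;
  size_t group_count = 0;
  for (const auto& slice : slices_) {
    if (slice->dataSize() >= copy_threshold || group_size + slice->dataSize() > max_size) {
      break; // Never split or copy a slice that already clears the threshold.
    }
    group_size += slice->dataSize();
    group_count++;
  }

  if (group_count <= 1) {
    // Nothing worth coalescing; hand back the front slice as-is.
    const uint64_t len = std::min<uint64_t>(slices_[0]->dataSize(), max_size);
    return {slices_[0]->data(), len};
  }
  return {linearize(group_size), group_size}; // Copies only the small leading slices.
}
```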
+    return {slices_[0]->data(), slice_size};
+  }
+
+  auto size = std::min<size_t>(max_size, length_);
+  return {linearize(size), size};
+}
+
 void OwnedImpl::coalesceOrAddSlice(SlicePtr&& other_slice) {
   const uint64_t slice_size = other_slice->dataSize();
   // The `other_slice` content can be coalesced into the existing slice IFF:
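One way to see the no-copy branch in isolation, sketched as a unit test (illustrative, not from the PR; it uses Envoy's Buffer::BufferFragmentImpl to pin the slice layout, since plain add() calls may coalesce or split data across slices):

```cpp
TEST(OwnedImplTest, MaybeLinearizeSkipsCopyBeforeFullRecord) {
  // Unowned fragments guarantee each chunk lands in its own slice.
  std::string small(100, 'a');
  std::string record(16384, 'b');
  auto noop = [](const void*, size_t, const Buffer::BufferFragmentImpl*) {};
  Buffer::BufferFragmentImpl frag1(small.data(), small.size(), noop);
  Buffer::BufferFragmentImpl frag2(record.data(), record.size(), noop);

  Buffer::OwnedImpl buffer;
  buffer.addBufferFragment(frag1);
  buffer.addBufferFragment(frag2);

  // Front slice (100 bytes) is below desired_min_size (4096), but the next
  // slice is already a full 16384-byte record, so the heuristic returns the
  // small front slice as-is rather than splitting the record behind it.
  const Buffer::RawSlice slice = buffer.maybeLinearize(16384, 4096);
  EXPECT_EQ(100, slice.len_);
  EXPECT_EQ(static_cast<const void*>(small.data()), slice.mem_);
}
```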
@@ -84,6 +84,10 @@ ContextImpl::ContextImpl(Stats::Scope& scope, const Envoy::Ssl::ContextConfig& c
   int rc = SSL_CTX_set_app_data(ctx.ssl_ctx_.get(), this);
   RELEASE_ASSERT(rc == 1, Utility::getLastCryptoError().value_or(""));
 
+  constexpr uint32_t mode = SSL_MODE_ACCEPT_MOVING_WRITE_BUFFER;
+  rc = SSL_CTX_set_mode(ctx.ssl_ctx_.get(), mode);
Comment: When is this needed?

Reply: It probably isn't strictly necessary, and with the current implementation I don't think the buffer will move. But the

Reply: Sure, but if

Reply: Huh, looking into this I realize I lost part of the change (I did a bunch of moving code between branches to chop an originally big change into manageable pieces). To simplify this code, I removed

I'm trying to remember why I made that change originally; I think it may have been to make it easier to read and reason about. I don't think it had a measurable performance impact. Any preference on whether I make that change or not?

Reply: I don't believe that

Reply: The docs say it is allowed:

Reply: The last sentence literally says

Reply: ...but there is also
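For context on what the mode buys: without it, OpenSSL/BoringSSL requires that an SSL_write() retried after SSL_ERROR_WANT_WRITE be passed the exact same buffer address, and fails with a bad-write-retry error otherwise. A minimal sketch of the retry pattern the mode permits (illustrative, not PR code; `ssl`, `first`, and `moved` are placeholders):

```cpp
// First attempt: the transport is not writable yet.
int rc = SSL_write(ssl, first /* address A */, len);
// -> rc <= 0 and SSL_get_error(ssl, rc) == SSL_ERROR_WANT_WRITE

// Meanwhile the buffer is drained/compacted, so the same unsent bytes now
// start at a different address. Retrying from that new address is only legal
// when SSL_MODE_ACCEPT_MOVING_WRITE_BUFFER was set on the SSL_CTX.
rc = SSL_write(ssl, moved /* address B, same bytes */, len);
```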
+  RELEASE_ASSERT((rc & mode) == mode, Utility::getLastCryptoError().value_or(""));
+
   rc = SSL_CTX_set_min_proto_version(ctx.ssl_ctx_.get(), config.minProtocolVersion());
   RELEASE_ASSERT(rc == 1, Utility::getLastCryptoError().value_or(""));
 
@@ -236,35 +236,27 @@ Network::IoResult SslSocket::doWrite(Buffer::Instance& write_buffer, bool end_st
     }
   }
 
-  uint64_t bytes_to_write;
-  if (bytes_to_retry_) {
-    bytes_to_write = bytes_to_retry_;
-    bytes_to_retry_ = 0;
-  } else {
-    bytes_to_write = std::min(write_buffer.length(), static_cast<uint64_t>(16384));
-  }
-
   uint64_t total_bytes_written = 0;
-  while (bytes_to_write > 0) {
+  while (true) {
     // TODO(mattklein123): As it relates to our fairness efforts, we might want to limit the number
     // of iterations of this loop, either by pure iterations, bytes written, etc.
+    const auto slice = write_buffer.maybeLinearize(16384, 4096);
Comment: Some of the concerns about resuming from a different region could be addressed by skipping the call to linearize when doing a retry. When retrying, we know that the first slice contains roughly bytes_to_retry_ bytes.

Comment: I recommend setting the copy threshold to 4000 bytes instead of 4096. This is related to the default buffer slice size being 4096 - sizeof(OwnedSlice), which is about 4032 bytes. Setting it to 4096 will result in a lot of spurious copies. See also #14054 (comment)
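The arithmetic behind the 4000-byte suggestion, spelled out (the 64-byte overhead figure is approximate; sizeof(OwnedSlice) depends on the build):

```cpp
// With inline slice storage, a "4096-byte" allocation holds less than 4096
// bytes of payload because the OwnedSlice bookkeeping shares the allocation.
constexpr uint64_t kDefaultAlloc = 4096;
constexpr uint64_t kSliceOverhead = 64;                      // ~sizeof(OwnedSlice), assumed
constexpr uint64_t kUsable = kDefaultAlloc - kSliceOverhead; // ~4032 usable bytes

// A threshold of 4096 classifies every full default slice (~4032 bytes) as
// "too small" and copies it; 4000 lets full default slices pass untouched.
static_assert(4000 <= kUsable, "threshold should sit below the usable slice size");
```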
+    if (slice.len_ == 0) {
+      break;
+    }
+
-    // SSL_write() requires that if a previous call returns SSL_ERROR_WANT_WRITE, we need to call
-    // it again with the same parameters. This is done by tracking last write size, but not write
-    // data, since linearize() will return the same undrained data anyway.
-    ASSERT(bytes_to_write <= write_buffer.length());
-    int rc = SSL_write(rawSsl(), write_buffer.linearize(bytes_to_write), bytes_to_write);
+    ASSERT(slice.mem_ != nullptr);
+    int rc = SSL_write(rawSsl(), slice.mem_, slice.len_);
     ENVOY_CONN_LOG(trace, "ssl write returns: {}", callbacks_->connection(), rc);
     if (rc > 0) {
-      ASSERT(rc == static_cast<int>(bytes_to_write));
+      ASSERT(rc == static_cast<int>(slice.len_));
       ctx_->stats().write_size_.recordValue(rc);
       total_bytes_written += rc;
       write_buffer.drain(rc);
-      bytes_to_write = std::min(write_buffer.length(), static_cast<uint64_t>(16384));
     } else {
       int err = SSL_get_error(rawSsl(), rc);
       switch (err) {
       case SSL_ERROR_WANT_WRITE:
-        bytes_to_retry_ = bytes_to_write;
         break;
       case SSL_ERROR_WANT_READ:
       // Renegotiation has started. We don't handle renegotiation so just fall through.
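Since the interleaved hunk is hard to read, here is the write loop as it reads with the change applied, reconstructed from the diff above (a reconstruction, not a verbatim copy of the final file; the error handling below the switch is elided):

```cpp
uint64_t total_bytes_written = 0;
while (true) {
  // Up to one TLS record (16384 bytes) of contiguous data; maybeLinearize()
  // copies only when the front slice is smaller than the 4096-byte threshold.
  const auto slice = write_buffer.maybeLinearize(16384, 4096);
  if (slice.len_ == 0) {
    break; // Buffer fully drained.
  }

  ASSERT(slice.mem_ != nullptr);
  int rc = SSL_write(rawSsl(), slice.mem_, slice.len_);
  ENVOY_CONN_LOG(trace, "ssl write returns: {}", callbacks_->connection(), rc);
  if (rc > 0) {
    ASSERT(rc == static_cast<int>(slice.len_));
    ctx_->stats().write_size_.recordValue(rc);
    total_bytes_written += rc;
    write_buffer.drain(rc);
  } else {
    // No retry size is tracked anymore: on SSL_ERROR_WANT_WRITE the next
    // doWrite() call simply re-derives the slice, which is why the context
    // sets SSL_MODE_ACCEPT_MOVING_WRITE_BUFFER.
    int err = SSL_get_error(rawSsl(), rc);
    // ... SSL_ERROR_WANT_WRITE / SSL_ERROR_WANT_READ handling as before ...
    break;
  }
}
```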
Comment: This implementation only looks at the 1st and 2nd slices in the buffer; would it make more sense to return a linearized group of whole slices, just before it exceeds the max_size?

Reply: Possibly. This is very much a heuristic, and there are a lot of ways it could potentially be improved. This version improved performance in the case I was benchmarking, didn't show any degradation for low-throughput (smaller request/response) traffic patterns, and was pretty simple and easy to reason about and/or predict.

Reply: Yeah, but I suggested the above because then we don't need the second parameter, while keeping pretty much the same behavior for the rest.

Reply: Can you elaborate on what you mean by "a linearized group of whole slices"? I don't understand what you're suggesting.

Reply: like:
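The code block that followed appears to have been lost in extraction. A guess at the shape of the suggestion, based on the prose above (hypothetical; not the reviewer's actual snippet):

```cpp
// Pack whole leading slices into one linearized region, stopping just before
// the running total would exceed max_size; no second tuning parameter needed.
uint64_t group_size = 0;
for (const auto& slice : slices_) {
  const uint64_t next = group_size + slice->dataSize();
  if (next > max_size) {
    break;
  }
  group_size = next;
}
return {linearize(group_size), group_size};
```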
Reply: At some threshold the memcpy of linearization exceeds the overhead of the extra TLS record. For instance, if all slices were 16383 bytes (1 less than a full record), I don't believe it's faster to linearize everything to 16384; emitting slightly smaller records will be faster.

I made a wild guess at picking 25% of a record as the threshold at which we should definitely linearize, and perf tests indicated that helped. But maybe I need to come up with a small benchmark to try to quantify this relationship.
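A sketch of the kind of microbenchmark that could put a number on the memcpy side of this trade-off (illustrative; uses google/benchmark, and as the last reply below notes, proxy-level throughput is the more meaningful end-to-end measure):

```cpp
#include <benchmark/benchmark.h>

#include <cstring>
#include <vector>

// Cost of coalescing one record's worth of data out of fixed-size slices;
// compare the per-iteration time at 4 KiB vs 8 KiB slices against the
// separately measured cost of emitting one extra TLS record.
static void BM_LinearizeCopy(benchmark::State& state) {
  const size_t slice_size = static_cast<size_t>(state.range(0));
  std::vector<char> src(16384, 'x');
  std::vector<char> dst(16384);
  for (auto _ : state) {
    for (size_t off = 0; off + slice_size <= dst.size(); off += slice_size) {
      memcpy(dst.data() + off, src.data() + off, slice_size);
    }
    benchmark::DoNotOptimize(dst.data());
  }
  state.SetBytesProcessed(static_cast<int64_t>(state.iterations()) * 16384);
}
BENCHMARK(BM_LinearizeCopy)->Arg(4096)->Arg(8192);
BENCHMARK_MAIN();
```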
Reply: Right, but here we're comparing emitting many small TLS records vs. combining many small buffers into a single TLS record, not copying data between slices to completely fill records.

Reply: In the example I was considering, I was asking about slices in the 4-8kb range. In that range it's not clear to me whether the memcpy cost will be more than the overhead of generating a TLS record.

Reply: The difference between 2x 8KiB vs 1x 16KiB is probably negligible, but I'm pretty sure that memcpy overhead is smaller than writing additional TLS record(s) to the wire.

...but that's an educated guess; feel free to benchmark this (e.g. by comparing proxy throughput, not userland microbenchmarks).