
tls: improve write performance by reducing copying #14053

Closed
wants to merge 11 commits into from
11 changes: 11 additions & 0 deletions include/envoy/buffer/buffer.h
@@ -29,6 +29,7 @@ struct RawSlice {
  size_t len_ = 0;

  bool operator==(const RawSlice& rhs) const { return mem_ == rhs.mem_ && len_ == rhs.len_; }
  bool operator!=(const RawSlice& rhs) const { return !(*this == rhs); }
};

using RawSliceVector = absl::InlinedVector<RawSlice, 16>;
@@ -187,6 +188,16 @@ class Instance {
   */
  virtual void* linearize(uint32_t size) PURE;

  /**
   * Get a pointer to a linear chunk of this buffer. The chunk may be smaller than `max_size`,
   * even if the length of the buffer is larger. The function heuristically determines how much
   * data to copy based on `desired_min_size`, in order to avoid patterns in which all the data
   * is copied when it doesn't need to be. For example, if the buffer contains a slice of 1 byte
   * followed by 100 slices of `max_size`, repeatedly calling this function avoids repeatedly
   * copying `max_size - 1` bytes just to produce chunks of exactly `max_size`.
   * @param max_size supplies the maximum size of the returned chunk.
   * @param desired_min_size supplies the chunk size below which the function will consider
   *        copying to form a larger chunk.
   * @return a RawSlice pointing at the front of the buffer; length zero if the buffer is empty.
   */
  virtual RawSlice maybeLinearize(uint32_t max_size, uint32_t desired_min_size) PURE;

  /**
   * Move a buffer into this buffer. As little copying is done as possible.
   * @param rhs supplies the buffer to move.
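As a concrete illustration of the intended calling pattern, here is a hypothetical caller-side sketch (not part of this diff; `fd` and the 16384/4096 values are assumptions mirroring the TLS write path later in this PR):

// Drain a buffer in near-record-sized chunks. maybeLinearize() only copies
// when the front slice is below the desired minimum and coalescing looks
// worthwhile; otherwise it returns the front slice as-is.
while (true) {
  const Buffer::RawSlice slice = buffer.maybeLinearize(16384, 4096);
  if (slice.len_ == 0) {
    break; // buffer is empty
  }
  const ssize_t rc = ::write(fd, slice.mem_, slice.len_); // or SSL_write, etc.
  if (rc <= 0) {
    break; // would-block or error; retry later with the same undrained data
  }
  buffer.drain(rc);
}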
24 changes: 24 additions & 0 deletions source/common/buffer/buffer_impl.cc
@@ -277,6 +277,30 @@ void* OwnedImpl::linearize(uint32_t size) {
  return slices_.front()->data();
}

RawSlice OwnedImpl::maybeLinearize(uint32_t max_size, uint32_t desired_min_size) {
Member

This implementation only looks at the 1st and 2nd slices in the buffer. Would it make more sense to return a linearized group of whole slices, stopping just before the total exceeds max_size?

Contributor Author

Possibly. This is very much a heuristic, and there are a lot of ways it could potentially be improved. This version improved performance in the case I was benchmarking, didn't show any degradation for low-throughput (smaller request/response) traffic patterns, and was pretty simple and easy to reason about and predict.

Member

Yeah, but I suggested the above because then we don't need the second parameter while keeping pretty much the same behavior otherwise.

Contributor Author

Can you elaborate on what you mean by "a linearized group of whole slices"? I don't understand what you're suggesting.

Member
@lizan lizan Nov 17, 2020

like:

uint64_t size = 0;
for (const auto& slice : slices_) {
  if (size + slice->dataSize() > max_size) {
    break;
  }
  size += slice->dataSize();
}

return {linearize(size), size};

Contributor Author

At some threshold the memcpy of linearization exceeds the overhead of the extra TLS record. For instance, if all slices were 16383 bytes (1 less than a full record), I don't believe it's faster to linearize everything to 16384; emitting slightly smaller records will be faster.

I made a wild guess at picking 25% of a record as the threshold at which we should definitely linearize, and perf tests indicated that helped. But maybe I need to come up with a small benchmark to try to quantify this relationship.

Contributor

Right, but here we're comparing emitting many small TLS records vs combining many small buffers into a single TLS record, not copying data between slices to completely fill records.

Contributor Author

In the example I was considering, I was asking about slices in the 4-8kb range. In that range it's not clear to me whether the memcpy cost will be more than the overhead of generating a TLS record.

Contributor

The difference between 2x 8KiB vs 1x 16KiB is probably negligible, but I'm pretty sure that memcpy overhead is smaller than writing additional TLS record(s) to the wire.

Contributor

...but that's an educated guess, feel free to benchmark this (e.g. by comparing proxy throughput, not userland microbenchmarks).
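For the copy side of that trade-off, a microbenchmark can at least bound the memcpy cost (a hypothetical sketch using google/benchmark; as noted above, the numbers that matter come from proxy-level throughput):

#include <benchmark/benchmark.h>

#include <cstring>
#include <vector>

// Measures the raw cost of coalescing a slice of the given size into a
// 16 KiB record buffer; compare against the per-record cost of SSL_write,
// measured separately, to pick a copy threshold.
static void BM_CoalesceCopy(benchmark::State& state) {
  const size_t slice_size = state.range(0);
  std::vector<char> src(slice_size, 'a');
  std::vector<char> dst(16384);
  for (auto _ : state) {
    std::memcpy(dst.data(), src.data(), slice_size);
    benchmark::DoNotOptimize(dst.data());
  }
  state.SetBytesProcessed(state.iterations() * slice_size);
}
BENCHMARK(BM_CoalesceCopy)->Arg(4096)->Arg(8192)->Arg(16383);
BENCHMARK_MAIN();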

  while (!slices_.empty() && slices_[0]->dataSize() == 0) {
    slices_.pop_front();
  }

  if (slices_.empty()) {
    return {nullptr, 0};
  }

  const uint64_t slice_size = std::min<uint64_t>(slices_[0]->dataSize(), max_size);
  if (slice_size >= desired_min_size) {
    return {slices_[0]->data(), slice_size};
  }

  // If the next slice is already a full-sized (`max_size`) slice, don't copy; return the
  // small front slice as-is so the full slice can be written without copying on a
  // subsequent call.
Contributor

There's an error in the next line, should be:
if (slices_.size() >= 2 && slices_[1]->dataSize() >= desired_min_size) {

Contributor Author

Given that it's a heuristic, I wouldn't say it's an error, just a choice. If the next slice is slightly larger than 4k, and the current slice is 1 byte, what's the best behavior?

It turns out the slice sizes are terrible, as you've noted, due to the inline storage of the OwnedSlice. The second slice contains just slightly less than 16k (I think it's 64 bytes less), which results in a bunch of copies on subsequent slices.

I think the next step is to remove the inline-storage from the slice (#14111), then re-evaluate this PR.

Contributor

Sorry, my read of the comment made me think that you intended to compare against desired_min_size.

I think an interesting case is the write behavior for HTTP/2, which involves a 9-byte DATA frame header followed by up to 16kb of data. The change in #14111 will have some consequences for how such writes interact with linearize, but I think it would result in little to no performance impact, since both versions of the buffer class would end up copying about the same amount of data during linearize.

Contributor Author

Sorry, my read of the comment made me think that you intended to compare against desired_min_size.

I see the confusion now. That comment was written when the parameter had a different (less clear) name. I'll clarify the comment.
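A hypothetical illustration of that HTTP/2 pattern (a sketch, not a test from this PR; whether the header slice is returned alone or copied depends on how the payload ends up sliced, which is exactly what #14111 changes):

Buffer::OwnedImpl buffer;
buffer.add(std::string(9, 'h')); // stand-in for a 9-byte HTTP/2 DATA frame header
Buffer::OwnedImpl payload;
payload.add(std::string(16384, 'd')); // up to one full TLS record of data
buffer.move(payload);

// If the payload's first slice is a full 16384 bytes, the 9-byte slice is
// returned uncopied and the payload goes out on the next call. If inline
// slice storage shaves the payload slice below 16384, the heuristic falls
// through to linearize() and copies instead.
const Buffer::RawSlice slice = buffer.maybeLinearize(16384, 4096);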

  if (slices_.size() >= 2 && slices_[1]->dataSize() >= max_size) {
Contributor

Worth considering a generalization of this logic so we refuse to copy whenever we encounter a next slice larger than some copy threshold?

That way a buffer containing {1, 1, 1, 1, 1, 16kb} would only end up copying 5 bytes when called with parameters like maybeLinearize(16kb, 4000).
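A sketch of that generalization (hypothetical; `copy_threshold` is an assumed new parameter, not part of this PR):

// Coalesce leading slices only while each stays below the copy threshold,
// so a large slice is never copied just to absorb a few tiny predecessors.
uint64_t size = 0;
for (const auto& slice : slices_) {
  const uint64_t data_size = slice->dataSize();
  if (data_size >= copy_threshold || size + data_size > max_size) {
    break;
  }
  size += data_size;
}
if (size > 0) {
  return {linearize(size), size};
}
// With {1, 1, 1, 1, 1, 16kb} and copy_threshold = 4000, this copies exactly
// 5 bytes and leaves the 16kb slice untouched for the next call.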

    return {slices_[0]->data(), slice_size};
  }

  auto size = std::min<size_t>(max_size, length_);
  return {linearize(size), size};
}

void OwnedImpl::coalesceOrAddSlice(SlicePtr&& other_slice) {
  const uint64_t slice_size = other_slice->dataSize();
  // The `other_slice` content can be coalesced into the existing slice IFF:
1 change: 1 addition & 0 deletions source/common/buffer/buffer_impl.h
@@ -568,6 +568,7 @@ class OwnedImpl : public LibEventInstance {
  SliceDataPtr extractMutableFrontSlice() override;
  uint64_t length() const override;
  void* linearize(uint32_t size) override;
  RawSlice maybeLinearize(uint32_t max_size, uint32_t desired_min_size) override;
  void move(Instance& rhs) override;
  void move(Instance& rhs, uint64_t length) override;
  uint64_t reserve(uint64_t length, RawSlice* iovecs, uint64_t num_iovecs) override;
4 changes: 4 additions & 0 deletions source/extensions/transport_sockets/tls/context_impl.cc
@@ -84,6 +84,10 @@ ContextImpl::ContextImpl(Stats::Scope& scope, const Envoy::Ssl::ContextConfig& c
  int rc = SSL_CTX_set_app_data(ctx.ssl_ctx_.get(), this);
  RELEASE_ASSERT(rc == 1, Utility::getLastCryptoError().value_or(""));

  constexpr uint32_t mode = SSL_MODE_ACCEPT_MOVING_WRITE_BUFFER;
  rc = SSL_CTX_set_mode(ctx.ssl_ctx_.get(), mode);
Contributor

When is this needed? maybeLinearize returns either an unmodified slice or a linearized slice, but I don't think there is a case where an unmodified slice would later be linearized (assuming desired_min_size <= max_size), so the buffer shouldn't move... or am I missing something?

Contributor Author
@ggreenway ggreenway Nov 18, 2020

It probably isn't strictly necessary, and with the current implementation I don't think the buffer will move. But the maybeLinearize interface doesn't guarantee this property, and I didn't want to keep the existing book-keeping needed to ensure the write buffer doesn't change. It doesn't look like anything is more expensive in BoringSSL when this mode is set; it just removes a check that isn't gaining us anything for how we use the API.

Contributor

Sure, but if the maybeLinearize implementation changes enough to require moving buffers, then we can enable this mode. Right now, it removes the default sanity check for no reason.

Contributor Author

Huh, looking into this I realize I lost part of the change (I moved a bunch of code between branches to chop an originally big change into manageable pieces). To simplify this code, I removed bytes_to_retry_. A subsequent call can then end up with a larger buffer from maybeLinearize if data was added to the buffer since the last attempt at SSL_write. That's why I was setting this option here.

I'm trying to remember why I made that change originally; I think it may have been to make it easier to read and reason about. I don't think it had a measurable performance impact. Any preference on whether I make that change or not?

Contributor

A subsequent call can then end up with a larger buffer from maybeLinearize if data was added to the buffer since the last attempt at SSL_write. That's why I was setting this option here.

I don't believe that SSL_MODE_ACCEPT_MOVING_WRITE_BUFFER allows for the buffer to grow between retries. AFAIK, the buffer data has to stay the same, but it can be available at a different address than before.

Contributor Author

The docs say it is allowed:

// In TLS, a non-blocking |SSL_write| differs from non-blocking |write| in that
// a failed |SSL_write| still commits to the data passed in. When retrying, the
// caller must supply the original write buffer (or a larger one containing the
// original as a prefix). By default, retries will fail if they also do not
// reuse the same |buf| pointer. This may be relaxed with
// |SSL_MODE_ACCEPT_MOVING_WRITE_BUFFER|, but the buffer contents still must be
// unchanged.

Contributor

The last sentence literally says the buffer contents still must be unchanged.

Contributor

...but there is also (or a larger one containing the original as a prefix), hmm. Maybe it's allowed after all.
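A hypothetical sketch of the retry contract being debated (writeThenRetry and its arguments are assumptions, not code from this PR):

#include <openssl/ssl.h>

// After SSL_ERROR_WANT_WRITE, a retry with SSL_MODE_ACCEPT_MOVING_WRITE_BUFFER
// set may pass the data at a different address and, per the quoted docs, may
// pass a longer buffer, as long as the original data is an unchanged prefix.
int writeThenRetry(SSL* ssl, const char* data, int len, const char* retry_data, int retry_len) {
  int rc = SSL_write(ssl, data, len);
  if (rc <= 0 && SSL_get_error(ssl, rc) == SSL_ERROR_WANT_WRITE) {
    // In real non-blocking code the retry happens once the socket is writable
    // again. Precondition: retry_len >= len and retry_data[0..len) matches
    // data[0..len), but retry_data may be a different pointer than data.
    rc = SSL_write(ssl, retry_data, retry_len);
  }
  return rc;
}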

  // Note: SSL_CTX_set_mode returns the updated mode bitmask rather than 1 on success,
  // hence the bitwise check.
  RELEASE_ASSERT((rc & mode) == mode, Utility::getLastCryptoError().value_or(""));

  rc = SSL_CTX_set_min_proto_version(ctx.ssl_ctx_.get(), config.minProtocolVersion());
  RELEASE_ASSERT(rc == 1, Utility::getLastCryptoError().value_or(""));

3 changes: 2 additions & 1 deletion source/extensions/transport_sockets/tls/context_impl.h
@@ -45,7 +45,8 @@ namespace Tls {
COUNTER(ocsp_staple_failed) \
COUNTER(ocsp_staple_omitted) \
COUNTER(ocsp_staple_responses) \
COUNTER(ocsp_staple_requests)
COUNTER(ocsp_staple_requests) \
HISTOGRAM(write_size, Bytes)

/**
* Wrapper struct for SSL stats. @see stats_macros.h
26 changes: 9 additions & 17 deletions source/extensions/transport_sockets/tls/ssl_socket.cc
@@ -236,35 +236,27 @@ Network::IoResult SslSocket::doWrite(Buffer::Instance& write_buffer, bool end_st
}
}

-  uint64_t bytes_to_write;
-  if (bytes_to_retry_) {
-    bytes_to_write = bytes_to_retry_;
-    bytes_to_retry_ = 0;
-  } else {
-    bytes_to_write = std::min(write_buffer.length(), static_cast<uint64_t>(16384));
-  }
-
   uint64_t total_bytes_written = 0;
-  while (bytes_to_write > 0) {
+  while (true) {
     // TODO(mattklein123): As it relates to our fairness efforts, we might want to limit the number
     // of iterations of this loop, either by pure iterations, bytes written, etc.
+    const auto slice = write_buffer.maybeLinearize(16384, 4096);
Contributor

Some of the concerns about resuming from a different region could be addressed by skipping the call to linearize when doing a retry. When retrying, we know that the first slice contains roughly bytes_to_retry_ bytes.

Contributor

I recommend setting the copy threshold to 4000 bytes instead of 4096. This is related to the buffer's default slice size being 4096 - sizeof(OwnedSlice), which is about 4032 bytes. Setting it to 4096 will result in a lot of spurious copies. See also #14054 (comment)
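To illustrate the arithmetic behind that suggestion (a sketch; the 64-byte figure for the inline OwnedSlice bookkeeping is approximate, per the discussion above):

// With the default 4 KiB allocation and the slice metadata stored inline, the
// usable payload per slice lands slightly under 4096 bytes, so a 4096 copy
// threshold would classify every default-sized slice as "too small" and
// trigger a spurious copy.
constexpr size_t kAllocationSize = 4096;
constexpr size_t kInlineOverhead = 64; // approximate sizeof(OwnedSlice)
constexpr size_t kUsablePayload = kAllocationSize - kInlineOverhead; // 4032
static_assert(kUsablePayload < 4096, "a 4096 threshold always copies");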

+    if (slice.len_ == 0) {
+      break;
+    }

-    // SSL_write() requires that if a previous call returns SSL_ERROR_WANT_WRITE, we need to call
-    // it again with the same parameters. This is done by tracking last write size, but not write
-    // data, since linearize() will return the same undrained data anyway.
-    ASSERT(bytes_to_write <= write_buffer.length());
-    int rc = SSL_write(rawSsl(), write_buffer.linearize(bytes_to_write), bytes_to_write);
+    ASSERT(slice.mem_ != nullptr);
+    int rc = SSL_write(rawSsl(), slice.mem_, slice.len_);
     ENVOY_CONN_LOG(trace, "ssl write returns: {}", callbacks_->connection(), rc);
     if (rc > 0) {
-      ASSERT(rc == static_cast<int>(bytes_to_write));
+      ASSERT(rc == static_cast<int>(slice.len_));
+      ctx_->stats().write_size_.recordValue(rc);
       total_bytes_written += rc;
       write_buffer.drain(rc);
-      bytes_to_write = std::min(write_buffer.length(), static_cast<uint64_t>(16384));
     } else {
       int err = SSL_get_error(rawSsl(), rc);
       switch (err) {
       case SSL_ERROR_WANT_WRITE:
-        bytes_to_retry_ = bytes_to_write;
         break;
       case SSL_ERROR_WANT_READ:
         // Renegotiation has started. We don't handle renegotiation so just fall through.
1 change: 0 additions & 1 deletion source/extensions/transport_sockets/tls/ssl_socket.h
@@ -91,7 +91,6 @@ class SslSocket : public Network::TransportSocket,
   const Network::TransportSocketOptionsSharedPtr transport_socket_options_;
   Network::TransportSocketCallbacks* callbacks_{};
   ContextImplSharedPtr ctx_;
-  uint64_t bytes_to_retry_{};
   std::string failure_reason_;

   SslHandshakerImplSharedPtr info_;
4 changes: 4 additions & 0 deletions test/common/buffer/buffer_fuzz.cc
@@ -135,6 +135,10 @@ class StringBuffer : public Buffer::Instance {
    return mutableStart();
  }

  Buffer::RawSlice maybeLinearize(uint32_t max_size, uint32_t /*desired_min_size*/) override {
    return {mutableStart(), std::min(size_, max_size)};
  }

  Buffer::SliceDataPtr extractMutableFrontSlice() override { NOT_IMPLEMENTED_GCOVR_EXCL_LINE; }

  void move(Buffer::Instance& rhs) override { move(rhs, rhs.length()); }
54 changes: 54 additions & 0 deletions test/common/buffer/owned_impl_test.cc
@@ -786,6 +786,60 @@ TEST_F(OwnedImplTest, LinearizeDrainTracking) {
  expectSlices({}, buffer);
}

TEST_F(OwnedImplTest, MaybeLinearizeEmpty) {
  Buffer::OwnedImpl empty;
  EXPECT_EQ(0, empty.maybeLinearize(1024, 1024).len_);
}

// Test that the correct value is returned both when the slice is larger and when it is
// smaller than `desired_min_size`.
TEST_F(OwnedImplTest, MaybeLinearizeSingleSlice) {
  Buffer::OwnedImpl buffer;
  buffer.add(std::string(100, 'a'));
  EXPECT_EQ(100, buffer.maybeLinearize(1024, 512).len_);
  EXPECT_EQ(100, buffer.maybeLinearize(1024, 1).len_);
}

TEST_F(OwnedImplTest, MaybeLinearizeDesiredMinSize) {
  Buffer::OwnedImpl buffer;
  buffer.add(std::string(10000, 'a'));
  Buffer::OwnedImpl other;
  other.add(std::string(10000, 'b'));
  buffer.move(other);

  // Verify test slices are as expected.
  const auto slices = buffer.getRawSlices();
  ASSERT_EQ(2, slices.size());
  ASSERT_EQ(10000, slices[0].len_);
  ASSERT_EQ(10000, slices[1].len_);

  // Ask for the entire buffer size. This should return only the first slice because
  // `desired_min_size` is less than the size of that slice.
  EXPECT_EQ(slices[0], buffer.maybeLinearize(20000, 9999));

  // Ask for the entire buffer size, but with a desired_min_size greater than the first
  // slice. This should get fully linearized into a single slice.
  EXPECT_EQ(20000, buffer.maybeLinearize(20000, 10001).len_);
}

// Test that a slice smaller than `desired_min_size` is returned uncopied if the next slice
// after it is full-sized.
TEST_F(OwnedImplTest, MaybeLinearizePreferNextSlice) {
  Buffer::OwnedImpl buffer;
  buffer.add("a");
  Buffer::OwnedImpl other;
  other.add(std::string(10000, 'b'));
  buffer.move(other);

  // Verify test slices are as expected.
  const auto slices = buffer.getRawSlices();
  ASSERT_EQ(2, slices.size());
  ASSERT_EQ(1, slices[0].len_);
  ASSERT_EQ(10000, slices[1].len_);

  EXPECT_EQ(1, buffer.maybeLinearize(10000, 1024).len_);
}

TEST_F(OwnedImplTest, ReserveCommit) {
  // This fragment will later be added to the buffer. It is declared in an enclosing scope to
  // ensure it is not destructed until after the buffer is.
@@ -527,6 +527,8 @@ class FakeBuffer : public Buffer::Instance {
  MOCK_METHOD(Buffer::SliceDataPtr, extractMutableFrontSlice, (), (override));
  MOCK_METHOD(uint64_t, length, (), (const, override));
  MOCK_METHOD(void*, linearize, (uint32_t), (override));
  MOCK_METHOD(Buffer::RawSlice, maybeLinearize, (uint32_t max_size, uint32_t desired_min_size),
              (override));
  MOCK_METHOD(void, move, (Instance&), (override));
  MOCK_METHOD(void, move, (Instance&, uint64_t), (override));
  MOCK_METHOD(uint64_t, reserve, (uint64_t, Buffer::RawSlice*, uint64_t), (override));
22 changes: 22 additions & 0 deletions test/extensions/transport_sockets/tls/BUILD
@@ -1,5 +1,7 @@
load(
    "//bazel:envoy_build_system.bzl",
    "envoy_benchmark_test",
    "envoy_cc_benchmark_binary",
    "envoy_cc_test",
    "envoy_cc_test_library",
    "envoy_package",
@@ -179,3 +181,23 @@ envoy_cc_test(
        "//test/mocks/stats:stats_mocks",
    ],
)

envoy_cc_benchmark_binary(
    name = "tls_throughput_benchmark",
    srcs = ["tls_throughput_test.cc"],
    data = [
        "//test/extensions/transport_sockets/tls/test_data:certs",
    ],
    external_deps = [
        "benchmark",
        "ssl",
    ],
    deps = [
        "//source/common/buffer:buffer_lib",
    ],
)

envoy_benchmark_test(
    name = "tls_throughput_benchmark_test",
    benchmark_binary = "tls_throughput_benchmark",
)