buffer: improve read reservations to efficiently handle multiple slices #14054

ggreenway · 2020-11-17T01:11:54Z

Signed-off-by: Greg Greenway ggreenway@apple.com

Commit Message: Enable reading larger chunks from sockets in a single call without drastically increasing memory waste. This is done by reserving multiple slices per read and placing any slices left unused after the read into a small cache for reuse by the next read operation.

The maximum size of a single read operation increased from 16k to 128k.

Watermark buffer limits are still enforced; large reads only happen if buffer limits allow and space is available.

This improves performance in some high-throughput use cases.
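The reserve/commit cycle described above can be sketched as follows. This is a minimal illustration, not the PR's code: `SketchBuffer`, `kCacheCapacity`, and the per-buffer cache are assumptions (the commit log mentions a thread_local lookup, so the real cache is presumably per-thread). It reserves up to eight 16k slices, keeps the slices that received data on commit, and puts unused slices into a small cache so the next read can reuse them instead of calling malloc.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical sizes mirroring the PR description: 16k slices, 8 per read (128k).
constexpr size_t kSliceSize = 16384;
constexpr size_t kMaxSlices = 8;
constexpr size_t kCacheCapacity = 8; // assumed size of the "small cache"

using Block = std::unique_ptr<uint8_t[]>;

class SketchBuffer {
public:
  struct RawSlice {
    uint8_t* mem;
    size_t len;
  };

  // Reserve up to max_len bytes as whole slices, preferring cached blocks.
  std::vector<RawSlice> reserve(size_t max_len) {
    assert(pending_.empty()); // one outstanding reservation at a time
    const size_t n = std::min(kMaxSlices, (max_len + kSliceSize - 1) / kSliceSize);
    std::vector<RawSlice> out;
    for (size_t i = 0; i < n; ++i) {
      Block b;
      if (!cache_.empty()) {
        b = std::move(cache_.back());
        cache_.pop_back();
      } else {
        b = std::make_unique<uint8_t[]>(kSliceSize);
        ++allocations_;
      }
      out.push_back({b.get(), kSliceSize});
      pending_.push_back(std::move(b));
    }
    return out;
  }

  // Keep the slices holding the first bytes_read bytes; cache or free the rest.
  void commit(size_t bytes_read) {
    assert(bytes_read <= pending_.size() * kSliceSize);
    for (auto& b : pending_) {
      if (bytes_read > 0) {
        const size_t used = std::min(bytes_read, kSliceSize);
        committed_.push_back({std::move(b), used});
        length_ += used;
        bytes_read -= used;
      } else if (cache_.size() < kCacheCapacity) {
        cache_.push_back(std::move(b)); // reused by the next reserve()
      } // otherwise the block is freed when pending_ is cleared below
    }
    pending_.clear();
  }

  size_t length() const { return length_; }
  size_t allocations() const { return allocations_; }
  size_t cached() const { return cache_.size(); }

private:
  struct Committed {
    Block mem;
    size_t len;
  };
  std::vector<Block> pending_;
  std::vector<Committed> committed_;
  std::vector<Block> cache_;
  size_t length_ = 0;
  size_t allocations_ = 0;
};
```

After a short read (say 20000 bytes into a 128k reservation), six slices land in the cache, so the next 128k reservation needs only two new allocations.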

Additional Description:
Risk Level: Medium (bugs could result in more memory used than configuration should allow)
Testing: Added tests; all existing tests pass
Docs Changes: None; internal only change
Release Notes: added
Platform Specific Features: None
[Optional Runtime guard:]
[Optional Fixes #Issue]
[Optional Deprecated:]

Signed-off-by: Greg Greenway <ggreenway@apple.com>

ggreenway · 2020-11-17T01:14:00Z

This still needs tests written, but I wanted to get some feedback on the design first.

antoniovicente

Just some top level comments. Let me know how I can help further improve the buffer API and its use in transports/etc.

source/common/network/raw_buffer_socket.cc

source/extensions/transport_sockets/tls/ssl_socket.cc

include/envoy/buffer/buffer.h

ggreenway · 2020-12-02T18:20:36Z

@antoniovicente PTAL. This isn't tested yet, but I think this limits how far we can go over the high watermark to at most 16k (same as before). Does this seem like the right approach?
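A minimal sketch of one way to bound the overshoot, assuming (hypothetically) that the next read's size is the remaining headroom below the high watermark rounded up to whole 16k slices, capped at 128k, and that the caller stops reading once the buffer is above the watermark. `reservationSize` is an illustrative name, not an Envoy function, and the PR's actual logic may differ.

```cpp
#include <algorithm>
#include <cassert> // used by the usage checks
#include <cstdint>

// Hypothetical constants matching the PR description.
constexpr uint64_t kSliceSize = 16384;    // 16k per slice
constexpr uint64_t kMaxReadSize = 131072; // 128k per read
static_assert(kMaxReadSize % kSliceSize == 0, "max read must be whole slices");

// Bytes to reserve for the next read. high_watermark == 0 means "no limit"
// (an assumption for this sketch).
uint64_t reservationSize(uint64_t current_length, uint64_t high_watermark) {
  if (high_watermark == 0) {
    return kMaxReadSize;
  }
  const uint64_t headroom =
      high_watermark > current_length ? high_watermark - current_length : 0;
  // Round the headroom up to whole slices, but always allow at least one slice.
  uint64_t slices = (headroom + kSliceSize - 1) / kSliceSize;
  slices = std::max<uint64_t>(slices, 1);
  return std::min(slices * kSliceSize, kMaxReadSize);
}
```

Rounding up adds less than one slice beyond the headroom (exactly one slice when the buffer sits at the watermark), so a full read leaves the buffer at most 16k over the high watermark.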

antoniovicente

Sorry for not looking at this earlier. I think you're hitting some weird edge cases that are not fully covered by existing e2e or performance tests.

source/common/buffer/watermark_buffer.h

source/common/buffer/watermark_buffer.cc

source/extensions/transport_sockets/tls/ssl_socket.cc

ggreenway · 2020-12-10T20:05:39Z

@antoniovicente I'm working on merging the changes around Slice layout into this PR. One oddity is what to do with Reservation::owned_slices_: there's no longer a type in buffer.h to put there. Given that the type isn't known in this context, should it just be a void* plus a function to properly free it? Or should buffer.h declare a base type (with no members) for Slice to inherit from, just to have something better than void* here? In either case, actually freeing the held pointer is delegated to something set up in reserve().
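A sketch of the "empty base type" option, with hypothetical names (`ReservationSlicesOwner`, `OwnedSlices`, `makeReservation`): the interface declares only a base with a virtual destructor, and the implementation derives from it to hold the slice storage. Compared with a void* plus a free function, the virtual destructor gives type-checked cleanup, and a single owner object per reservation avoids one extra allocation per slice. `std::unique_ptr<uint8_t[]>` stands in for the implementation's real Slice type.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

// ---- Interface side (what buffer.h could see) ----
// Empty base: the interface only needs to destroy the owner, not know its layout.
class ReservationSlicesOwner {
public:
  virtual ~ReservationSlicesOwner() = default;
};
using ReservationSlicesOwnerPtr = std::unique_ptr<ReservationSlicesOwner>;

struct RawSlice {
  void* mem_ = nullptr;
  size_t len_ = 0;
};

struct Reservation {
  std::vector<RawSlice> slices_;
  ReservationSlicesOwnerPtr owner_; // opaque; freed via the virtual destructor
};

// ---- Implementation side (what buffer_impl.h could define) ----
// One owner object per reservation holds the storage for every slice.
struct OwnedSlices : ReservationSlicesOwner {
  std::vector<std::unique_ptr<uint8_t[]>> owned_slices_;
  static inline int live = 0; // instrumentation for this sketch only
  OwnedSlices() { ++live; }
  ~OwnedSlices() override { --live; }
};

Reservation makeReservation(size_t num_slices, size_t slice_size) {
  auto owner = std::make_unique<OwnedSlices>();
  Reservation r;
  for (size_t i = 0; i < num_slices; ++i) {
    owner->owned_slices_.push_back(std::make_unique<uint8_t[]>(slice_size));
    r.slices_.push_back({owner->owned_slices_.back().get(), slice_size});
  }
  r.owner_ = std::move(owner);
  return r;
}
```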

antoniovicente · 2020-12-10T20:27:42Z

I think that the replacement for SliceDataPtr in the original PR is std::unique_ptr<Slice>

ggreenway · 2020-12-10T20:39:51Z

The problem is that Slice is defined in buffer_impl.h, but Reservation is defined in buffer.h. Reservation is part of the interface, and Slice is now strictly part of the implementation.

antoniovicente · 2020-12-10T20:43:04Z

Hmmm. Ok, will think about it and get back to you.

ggreenway · 2020-12-10T20:45:00Z

It could hold a std::unique_ptr<SliceData>, but then I'd have to allocate a SliceDataImpl for each slice. I'd prefer to avoid the extra temporary allocation.

antoniovicente · 2020-12-10T21:53:06Z

There are a few options that come to mind:

1. Have the buffer own the slices in the reservation instead of having the reservation own them. This is similar to what the old reserve implementation does.
2. Have reserve return ReservationPtr and allow OwnedImpl to implement its own override of the Reservation mechanism.
3. Have the Reservation store a pointer to a cleanup function that should be run if the Reservation is not committed to the buffer.
4. Have reservations hold a parallel array of std::unique_ptr<uint8_t[]> for the storage associated with the raw slices. Commit would need to know how to create OwnedImpl::Slice from a uint8_t[], data size and capacity.
5. Similar to the previous option, but have the reservation be single region only.

I can see option 4 being attractive. Option 1 would also be efficient and not compromise much on ease of use.

It may be useful to have a buffer API method, void Buffer::add(std::unique_ptr<uint8_t[]> data, uint64_t data_start, uint64_t data_end, uint64_t capacity), that allows the buffer to take ownership of externally allocated memory regions.
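A minimal sketch of the suggested `Buffer::add` overload, under stated assumptions: `AdoptingBuffer` is a hypothetical class with simplified slice bookkeeping, not Envoy's OwnedImpl. It takes `std::unique_ptr<uint8_t[]>` so the adopted array is released with `delete[]`, and `data_start`/`data_end` let a partially filled region (for example, the tail of a short read) be adopted without copying.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring> // std::memcpy in the usage checks
#include <memory>
#include <string>
#include <vector>

class AdoptingBuffer {
public:
  // Take ownership of externally allocated memory; bytes [data_start, data_end)
  // are readable, and capacity is the total size of the allocation.
  void add(std::unique_ptr<uint8_t[]> data, uint64_t data_start, uint64_t data_end,
           uint64_t capacity) {
    assert(data_start <= data_end && data_end <= capacity);
    if (data_start == data_end) {
      return; // nothing readable; the storage is simply freed
    }
    slices_.push_back({std::move(data), data_start, data_end, capacity});
    length_ += data_end - data_start;
  }

  uint64_t length() const { return length_; }

  // Copy out the readable bytes (for testing only).
  std::string toString() const {
    std::string out;
    for (const auto& s : slices_) {
      out.append(reinterpret_cast<const char*>(s.data.get()) + s.start, s.end - s.start);
    }
    return out;
  }

private:
  struct Slice {
    std::unique_ptr<uint8_t[]> data;
    uint64_t start;
    uint64_t end;
    uint64_t capacity; // room left after `end` could be filled by later appends
  };
  std::vector<Slice> slices_;
  uint64_t length_ = 0;
};
```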

ggreenway · 2021-01-26T19:02:24Z

Moving the freelist conversation here (from #14054 (comment)) because GitHub buries the current one every time I load the page.

Current benchmarks:
with freelist

----------------------------------------------------------------------------
Benchmark                                  Time             CPU   Iterations
----------------------------------------------------------------------------
bufferReserveCommit/131072               799 ns          799 ns     52535503
bufferReserveCommitPartial/131072        289 ns          289 ns    145588343

without freelist

----------------------------------------------------------------------------
Benchmark                                  Time             CPU   Iterations
----------------------------------------------------------------------------
bufferReserveCommit/131072               750 ns          750 ns     55794513
bufferReserveCommitPartial/131072        407 ns          407 ns    103863598

ggreenway · 2021-01-26T19:04:29Z

I have mixed feelings about the freelist. Ideally we would let malloc take care of this, and the freelist is slightly slower when the read uses all of the slices (799 ns vs. 750 ns), but it is considerably faster for small reads (289 ns vs. 407 ns).

antoniovicente

I think these results show a clear benefit from freelisting. Thanks for sticking with this until we got there. The slight extra cost when the freelist is fully drained each time is fine, given that it provides a benefit in the more common partial-commit case.

Give me a bit to look over the changes in 9f3a717

antoniovicente · 2021-01-26T22:44:07Z

source/common/buffer/buffer_impl.cc

+    reservation_slices.push_back(slice.reserve(size));
+    slices_owner->owned_slices_.emplace_back(std::move(slice));
+    bytes_remaining -= std::min<uint64_t>(reservation_slices.back().len_, bytes_remaining);
+    reserved += reservation_slices.back().len_;


Isn't reservation_slices.back().len_ == size in the previous 2 statements? Accessing the slice size via .back().len_ probably isn't free.

I think right now they are equal, but the API leaves room for Slice::reserve() to return a different size. But I'll stop accessing it via back().

Works for me. I think that in this case len_ == size since size is passed in to the constructor in the previous line.
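A sketch of the reservation loop with each returned length read once into a local, as agreed above. `SketchSlice` and `reserveSlices` are hypothetical stand-ins, not the PR's types; `SketchSlice::reserve()` clamps to its capacity to model the case where the API returns a different size than requested.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <memory>
#include <vector>

struct RawSlice {
  void* mem_ = nullptr;
  uint64_t len_ = 0;
};

// reserve() may hand back less than requested, so callers use its return value.
class SketchSlice {
public:
  explicit SketchSlice(uint64_t capacity)
      : storage_(std::make_unique<uint8_t[]>(capacity)), capacity_(capacity) {}
  RawSlice reserve(uint64_t size) { return {storage_.get(), std::min(size, capacity_)}; }

private:
  std::unique_ptr<uint8_t[]> storage_;
  uint64_t capacity_;
};

// Reserve `total` bytes in slices of `slice_size`; returns the bytes reserved.
uint64_t reserveSlices(uint64_t total, uint64_t slice_size,
                       std::vector<RawSlice>& reservation_slices,
                       std::vector<SketchSlice>& owned_slices) {
  assert(slice_size > 0); // otherwise the loop would never make progress
  uint64_t bytes_remaining = total;
  uint64_t reserved = 0;
  while (bytes_remaining > 0) {
    SketchSlice slice(slice_size);
    const RawSlice raw = slice.reserve(slice_size); // read the length once
    owned_slices.emplace_back(std::move(slice));    // heap storage does not move
    reservation_slices.push_back(raw);
    bytes_remaining -= std::min(raw.len_, bytes_remaining);
    reserved += raw.len_;
  }
  return reserved;
}
```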

antoniovicente · 2021-01-27T07:31:42Z

/retest

repokitteh-read-only · 2021-01-27T07:31:47Z

Retrying Azure Pipelines:
Retried failed jobs in: envoy-presubmit

🐱

Caused by: a #14054 (comment) was created by @antoniovicente.

antoniovicente

Thanks again for this optimization.

test/common/buffer/owned_impl_test.cc

antoniovicente · 2021-01-27T08:14:59Z

test/common/buffer/owned_impl_test.cc

-    slices[i].len_ = 0;
-  }
-  buf.commit(slices, allocated_slices);
+  { auto reservation = buf.reserveSingleSlice(1280); }


Calling reservation.commit(0) may be interesting, although I know it's a no-op.

Given that this is a regression test, I don't want to change it more than I have to. But I added a commit(0) test earlier (line 916 of this file).
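A sketch of the reservation semantics this test exercises, using hypothetical `TinyBuffer` and `SingleSliceReservation` types (only the `reserveSingleSlice` and `commit` names come from the test): a reservation that goes out of scope uncommitted adds nothing to the buffer, and `commit(0)` is a no-op. For brevity the sketch copies committed bytes into a std::string, whereas a real buffer would keep the slice itself.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <string>

class TinyBuffer;

class SingleSliceReservation {
public:
  SingleSliceReservation(TinyBuffer& buffer, uint64_t size)
      : buffer_(buffer), mem_(std::make_unique<uint8_t[]>(size)), size_(size) {}
  uint8_t* data() { return mem_.get(); }
  uint64_t size() const { return size_; }
  void commit(uint64_t len); // defined after TinyBuffer

private:
  TinyBuffer& buffer_;
  std::unique_ptr<uint8_t[]> mem_; // freed automatically if never committed
  uint64_t size_;
};

class TinyBuffer {
public:
  SingleSliceReservation reserveSingleSlice(uint64_t size) { return {*this, size}; }
  uint64_t length() const { return data_.size(); }
  const std::string& contents() const { return data_; }

private:
  friend class SingleSliceReservation;
  std::string data_;
};

void SingleSliceReservation::commit(uint64_t len) {
  assert(len <= size_);
  if (len == 0 || !mem_) {
    return; // committing zero bytes leaves the buffer untouched
  }
  buffer_.data_.append(reinterpret_cast<const char*>(mem_.get()), len);
  mem_.reset(); // a reservation can be committed at most once
}
```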

antoniovicente · 2021-01-27T08:22:05Z

source/common/buffer/buffer_impl.cc

Works for me. I think that in this case len_ == size since size is passed in to the constructor in the previous line.

This test passed with default TCMalloc, but only because of implementation details of how TCMalloc works, and it failed on ASAN and Windows. The reserve-single API no longer uses the freelist (unlike previous revisions of this PR), and unused slices are no longer stored in OwnedImpl as empty slices (unlike the code before this PR). Therefore, there are no guarantees about which specific slice memory is reused between uncommitted reservations; the result is determined by the malloc implementation, and there is no good reason to write a test for what that behavior may be. Signed-off-by: Greg Greenway <ggreenway@apple.com>

ggreenway · 2021-01-27T22:43:21Z

@antoniovicente there was a real test failure on both ASAN and Windows. Slightly different failure on the two, but both for the same reason. I deleted the offending test; explanation in commit message.

ggreenway · 2021-01-28T00:22:54Z

/retest

repokitteh-read-only · 2021-01-28T00:22:58Z

Retrying Azure Pipelines:
Retried failed jobs in: envoy-presubmit

🐱

Caused by: a #14054 (comment) was created by @ggreenway.

antoniovicente

Thanks for removing the test that attempted to exercise undefined behavior.

ggreenway · 2021-02-01T21:26:00Z

Mac CI passed all tests; some cleanup task of the CI job appears to have timed out.

buffer: improve read reservations to efficiently handle multiple slices

9c5199f

Signed-off-by: Greg Greenway <ggreenway@apple.com>

ggreenway requested review from asraa, dio, lizan and PiotrSikora as code owners November 17, 2020 01:11

mattklein123 assigned antoniovicente Nov 18, 2020

antoniovicente reviewed Nov 18, 2020

antoniovicente mentioned this pull request Nov 18, 2020

tls: improve write performance by reducing copying #14053

Closed

ggreenway added 2 commits November 30, 2020 12:40

Merge remote-tracking branch 'upstream/master' into multi-slice-read

246f167

honor high watermark

ae815f1

Signed-off-by: Greg Greenway <ggreenway@apple.com>

ggreenway added 4 commits December 4, 2020 10:10

Merge remote-tracking branch 'upstream/master' into multi-slice-read

6adeeab

Signed-off-by: Greg Greenway <ggreenway@apple.com>

fix test expectation

6af8023

Signed-off-by: Greg Greenway <ggreenway@apple.com>

fix incorrect merge conflict resolution

c8c9cce

Signed-off-by: Greg Greenway <ggreenway@apple.com>

remove commented out code

a0b68b7

Signed-off-by: Greg Greenway <ggreenway@apple.com>

antoniovicente reviewed Dec 7, 2020

Remove old reserve/commit API; migrate all uses to new API

ff13426

Signed-off-by: Greg Greenway <ggreenway@apple.com>

ggreenway requested review from alyssawilk and mattklein123 as code owners December 7, 2020 23:05

don't increase size of preferred length in watermark code trying to limit it

c0661eb

Signed-off-by: Greg Greenway <ggreenway@apple.com>

Merge remote-tracking branch 'upstream/master' into multi-slice-read

0416701

Signed-off-by: Greg Greenway <ggreenway@apple.com>

ggreenway added 2 commits December 10, 2020 16:01

api comments; delete old code

6e9f1aa

Signed-off-by: Greg Greenway <ggreenway@apple.com>

readability cleanup

eb3fe0e

Signed-off-by: Greg Greenway <ggreenway@apple.com>

ggreenway added 4 commits January 25, 2021 16:32

slightly improve speed_test

0c26ceb

Signed-off-by: Greg Greenway <ggreenway@apple.com>

only allocate one block for impl to track ownership in reservation, instead of 1 block per slice

07a9314

Signed-off-by: Greg Greenway <ggreenway@apple.com>

fix build

6bc1fc3

Signed-off-by: Greg Greenway <ggreenway@apple.com>

optimize: only lookup thread_local once per reservation

9f3a717

Signed-off-by: Greg Greenway <ggreenway@apple.com>

ggreenway added 2 commits January 26, 2021 11:28

fix test

ef66dc4

Signed-off-by: Greg Greenway <ggreenway@apple.com>

clang-tidy

47d7cfd

Signed-off-by: Greg Greenway <ggreenway@apple.com>

antoniovicente reviewed Jan 26, 2021

minor cleanup of len_ handling

ea5e975

Signed-off-by: Greg Greenway <ggreenway@apple.com>

antoniovicente previously approved these changes Jan 27, 2021

ggreenway added 2 commits January 27, 2021 09:12

more expectations in ReserveZeroCommit

4a92b7d

Signed-off-by: Greg Greenway <ggreenway@apple.com>

Merge remote-tracking branch 'upstream/main' into multi-slice-read

97bd65b

ggreenway dismissed antoniovicente’s stale review via 97bd65b January 27, 2021 17:17

antoniovicente previously approved these changes Jan 27, 2021

ggreenway dismissed antoniovicente’s stale review via 7b94c6f January 27, 2021 21:45

antoniovicente approved these changes Jan 28, 2021

ggreenway added 2 commits February 1, 2021 08:26

Merge remote-tracking branch 'upstream/main' into multi-slice-read

180ab4b

Merge remote-tracking branch 'upstream/main' into multi-slice-read

875e8d3

ggreenway merged commit 241a955 into envoyproxy:main Feb 1, 2021

antoniovicente mentioned this pull request Feb 4, 2021

extension: User space io socket #14917

Merged
