-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
buffer: improve read reservations to efficiently handle multiple slices #14054
Conversation
Signed-off-by: Greg Greenway <ggreenway@apple.com>
This still needs tests written, but I wanted to get some feedback on the design first. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just some top level comments. Let me know how I can help further improve the buffer API and its use in transports/etc.
Signed-off-by: Greg Greenway <ggreenway@apple.com>
@antoniovicente PTAL. This isn't tested yet, but I think this will keep us only going over the high watermark by up to 16k (same as before). Does this seem like the right approach? |
Signed-off-by: Greg Greenway <ggreenway@apple.com>
Signed-off-by: Greg Greenway <ggreenway@apple.com>
Signed-off-by: Greg Greenway <ggreenway@apple.com>
Signed-off-by: Greg Greenway <ggreenway@apple.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for not looking at this earlier. I think you're hitting some weird edge cases that are not fully covered by existing e2e or performance tests.
Signed-off-by: Greg Greenway <ggreenway@apple.com>
limit it Signed-off-by: Greg Greenway <ggreenway@apple.com>
@antoniovicente I'm working on merging in the changes around Slice layout into this PR. One oddity is what to do with |
I think that the replacement for SliceDataPtr in the original PR is |
The problem is that |
Hmmm. Ok, will think about it and get back to you. |
It could hold a |
Signed-off-by: Greg Greenway <ggreenway@apple.com>
There are few options that come to mind:
I can see option 4 being attractive. Option 1 would also be efficient and not compromise much on ease of use. It may be useful to have a buffer AP method |
Signed-off-by: Greg Greenway <ggreenway@apple.com>
Signed-off-by: Greg Greenway <ggreenway@apple.com>
Signed-off-by: Greg Greenway <ggreenway@apple.com>
instead of 1 block per slice Signed-off-by: Greg Greenway <ggreenway@apple.com>
Signed-off-by: Greg Greenway <ggreenway@apple.com>
Moving freelist conversation here (from #14054 (comment)) because github buries the current one everytime I load the page. Current benchmarks:
without freelist
|
I have mixed feelings about the freelist. In an ideal world, it would be better to let malloc take care of this, and it shows a slight perf degradation when all the slices get used by the read, but it is quite a bit faster for small reads. |
Signed-off-by: Greg Greenway <ggreenway@apple.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moving freelist conversation here (from #14054 (comment)) because github buries the current one everytime I load the page.
Current benchmarks:
with freelist---------------------------------------------------------------------------- Benchmark Time CPU Iterations ---------------------------------------------------------------------------- bufferReserveCommit/131072 799 ns 799 ns 52535503 bufferReserveCommitPartial/131072 289 ns 289 ns 145588343
without freelist
---------------------------------------------------------------------------- Benchmark Time CPU Iterations ---------------------------------------------------------------------------- bufferReserveCommit/131072 750 ns 750 ns 55794513 bufferReserveCommitPartial/131072 407 ns 407 ns 103863598
I think these results show a clear benefit from freelisting. Thanks for bearing along until we got there. I think that the slight extra cost in the case where the freelist is fully drained each time is fine given that it provides a benefit in the more common partial commit case.
Give me a bit to look over the changes in 9f3a717
source/common/buffer/buffer_impl.cc
Outdated
reservation_slices.push_back(slice.reserve(size)); | ||
slices_owner->owned_slices_.emplace_back(std::move(slice)); | ||
bytes_remaining -= std::min<uint64_t>(reservation_slices.back().len_, bytes_remaining); | ||
reserved += reservation_slices.back().len_; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't reservation_slices.back().len_ == size in the previous 2 statements? accessing the slice size via .back().len_ probably isn't free.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think right now they are equal, but the API leaves room for Slice::reserve() to return a different size. But I'll stop getting via back()
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Works for me. I think that in this case len_ == size since size is passed in to the constructor in the previous line.
Signed-off-by: Greg Greenway <ggreenway@apple.com>
/retest |
Retrying Azure Pipelines: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks again for this optimization.
slices[i].len_ = 0; | ||
} | ||
buf.commit(slices, allocated_slices); | ||
{ auto reservation = buf.reserveSingleSlice(1280); } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
reservation.commit(0); may be interesting, although I know is a no-op.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given that this is a regression test, I don't want to change it more than I have to. But I added a commit(0)
test earlier (line 916 of this file).
source/common/buffer/buffer_impl.cc
Outdated
reservation_slices.push_back(slice.reserve(size)); | ||
slices_owner->owned_slices_.emplace_back(std::move(slice)); | ||
bytes_remaining -= std::min<uint64_t>(reservation_slices.back().len_, bytes_remaining); | ||
reserved += reservation_slices.back().len_; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Works for me. I think that in this case len_ == size since size is passed in to the constructor in the previous line.
Signed-off-by: Greg Greenway <ggreenway@apple.com>
This test passed with default TCMalloc, but only because of implementation details of how TCMalloc worked, and it failed on ASAN and Windows. The reserve-single API no longer uses the free-list (unlike previous revisions of this PR), and unused slices are no longer stored in OwnedImpl as empty slices (unlike the code before this PR). Therefore, there are no guarantees about which specific slice memory is used between un-committed reservations; the result is determined by the malloc implementation, and there is no good reason to write a test for what that behavior may be. Signed-off-by: Greg Greenway <ggreenway@apple.com>
@antoniovicente there was a real test failure on both ASAN and Windows. Slightly different failure on the two, but both for the same reason. I deleted the offending test; explanation in commit message. |
/retest |
Retrying Azure Pipelines: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for removing test that attempted to exercise undefined behavior
Mac CI passed all tests; some cleanup task of the CI job appears to have timed out. |
Signed-off-by: Greg Greenway ggreenway@apple.com
Commit Message: Enable reading larger chunks from sockets in a single call without drastically increasing memory waste by implementing a system where reservations of multiple slices are made, and unused slices (after the read operation) are put into a small cache for re-use by the next read operation.
The largest read operation changed from 16k to 128k.
Watermark buffer limits are still enforced; large reads only happen if buffer limits allow and space is available.
This improves performance in some high-throughput use cases.
Additional Description:
Risk Level: Medium (bugs could result in more memory used than configuration should allow)
Testing: Added tests; all existing tests pass
Docs Changes: None; internal only change
Release Notes: added
Platform Specific Features: None
[Optional Runtime guard:]
[Optional Fixes #Issue]
[Optional Deprecated:]