-
Notifications
You must be signed in to change notification settings - Fork 843
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix MutableBuffer::into_buffer
leaking its extra capacity into the final buffer
#6300
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @teh-cmc -- looks like an epic debugging exercise
I am concerned that this change will decrease performance by forcing an extra copy in all situations, though I may not understand the implications
This PR currently seems to have some CI failures
Some other thoughts:
- Have you considered making the initial capacity calculations more accurate (I am not sure this is possible) so the delta between capacity and actual is lower
- Adding another API (if it doesn't already exist) to "shrink_to_fit" for Arrays in general -- that would permit users to decide if they preferred larger allocations / fewer copies or smaller allocations / more copies
pub(super) fn into_buffer(self) -> Buffer { | ||
pub(super) fn into_buffer(mut self) -> Buffer { | ||
// Don't leak our extra capacity into the final buffer. | ||
self.shrink_to_fit(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this force a copy in the case where the capacity is larger that the actual need?
I wouldn't expect this to have a noticable impact on performance for two reasons:
As far as I understand by looking at the code, at the end of the day all of this machinery bottoms out in I'd expect any decent allocator to do whatever is most efficient depending on the exact situation at hand.
That doesn't seem possible given the nature of
I've updated the tests to reflect the new expected capacity values, so they should now pass. Beware that I am changing these values with absolutely zero knowledge of the context in which these tests were written though: it would be fair to say that I have effectively no idea what I'm doing. That's it for the theory... now for the practice. I have found two relevant benchmark suites that relate to this change:
|
Uh-oh, I missed one!
That could be an alternative solution if this one doesn't cut it, yes. Beyond the fact that it is likely much more work, it does sound like an antipattern though: given that Arrow arrays are immutable once created, is there any situation in which you would ever want all that extra capacity to hang around? |
When the buffer is merely an intermediate that won't live for very long. This is extremely common in query processing, and in fact one of the major motivations for the new binary view types is to avoid copying at the expense of less efficient memory usage.
I believe this is only the case where the bump allocator is being used, which is almost always terrible from a performance standpoint. Provided capacity estimation is done correctly, as most of the kernels go to great pains to do, this shouldn't be a major issue. Perhaps there is a particular codepath you are running into that is not doing this?
I think this would be my preference, if only because it is much easier to understand the ramifications of such a change. Such a method should be relatively easy to add to the relevant ArrayBuilders, and therefore not hugely disruptive. |
I'm fine going down that route, but we would need for We keep a lot of Arrow data in long-lived memory storage, and therefore must be guaranteed that the data has been optimized for space before it gets committed to storage, regardless of how it got built or how it got there. Is that still okay? |
I think that would be generally useful, and would also handle the case of sliced arrays. I'd recommend something that returns a new Array, e.g. Edit: I guess one question would be if such a method should also optimise dictionaries... 🤔 Might be worth filing a ticket for discussion |
👍 Alright, closing this in favor of this issue then: I won't be able to attend to it for a couple weeks though. |
This has now been fixed in another way: |
The extra capacity that is initially used to efficiently append data to the mutable buffer ends up leaking into the final immutable buffer, where it will linger indefinitely, therefore hogging memory.
Consider this code:
HEAD
:This patch: