-
Notifications
You must be signed in to change notification settings - Fork 829
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimization of filter operation on ByteViewArray #6154
base: main
Are you sure you want to change the base?
Conversation
Thanks @chloro-pn. 🙏 We have gone back and forth on this idea while integrating StringView into datafusion The StringViewArray has a https://docs.rs/arrow/latest/arrow/array/struct.GenericByteViewArray.html#method.gc In fact @XiangpengHao used this API in apache/datafusion#11587 to solve exactly the problem you are describing (too much unused data in the buffers) However, what I worry about is that the heuristic to determine when to compact the string data / buffers will not be ideal for any particular usecase and that one princple of this crate is to give the user maximal control over performance So I would like to propose we support two different modes for filter kernels:
@XiangpengHao I wonder if you have any thoughts to add here? |
From the documentation of
The modification of this PR does not perform expensive operations of compressing |
I think this is a clever PR and is complementary to the There's a hierarchy of This PR makes the first approach, which is transparent to users and has relatively low overhead. Moving forward, I think we need some tests to cover the case that unused buffers are actually replaced by empty buffers. As a side note, you might want to check #6094 as well, as we found that excessive amount of buffers can significantly slows down common operations like |
My suggestion is that we can treat the second type of GC like the current GC (third type GC) as an independent method that users can choose to call. |
My concern with this approach (and with any of the "gc while filtering" approaches) is that it will work well for some use cases but be unavoidable overhead with other use cases. I think we need an API that allows callers to specify how they want the GC to be done What I have talked about with @XiangpengHao and a few others is making some sort of stateful filter API; Among other things such an API would give us a place to put options for filtering So maybe it would look like // create a new filterer
let filterer = ArrayFilterer::new()
// specify that GC should look for empty buffers
.with_gc_strategy(GcStrategy::FindEmptyBuffers);
// feed in arrays and their filters (BooleanArray)
filterer.push(array1, filter1)?;
filterer.push(array2, filter2)?;
// Produce the final output array
let filtered_array = filterer.build(); |
Provide a new API instead of modifying the existing one, agree with this and leave the choice to the user. |
I will try and write this up over the next day or two into a ticket that is more coherent |
I still have writing up the ideas into a more coherent form on my list, but I probably won't get to it for the next few days (I have too too many other projects in flight at the moment) |
I have not had a chance to write up these ideas yet, I apologize |
It's okay, this is not an urgent matter. |
Which issue does this PR close?
Closes #.
Rationale for this change
Performing a
filter
operation onByteViewArray
only filters itsview
data members, but clones the entirebuffers
, which may result in theByteViewArray
holding some unnecessarybuffers
, which may prevent some memory resources from being released in a timely manner.This PR records whether each
buffer
needs to be retained while filteringview
data members, and updates thebuffers
ofByteViewArray
using the recorded information.The reason why this PR replaces unwanted
buffers
with emptybuffers
instead of deleting them directly is that this processing preserves the mapping relationship betweenviews
andbuffers
.What changes are included in this PR?
Are there any user-facing changes?