Consider implementing some sort of deduplicate / intern functionality for StringView #5910
Comments
BTW the https://docs.rs/arrow/latest/arrow/array/type.StringDictionaryBuilder.html structure has code to do the deduplication quickly.
So one way to implement a combination of gc and deduplication would be to create a DictionaryArray with a
With the code for fast DictionaryArray --> StringViewArray added in #5861, this would only copy the strings once (though it would build up intermediate indexes that maybe could be avoided with a direct approach).
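To make the dictionary idea in the comment above concrete, here is a minimal sketch in plain Rust using only the standard library. It is not the arrow `StringDictionaryBuilder` API; the function name `dictionary_encode` and the `(Vec<String>, Vec<usize>)` return shape are assumptions chosen for illustration. It shows the core of the approach: each distinct string is copied once, and every row stores a key into the list of distinct values.

```rust
use std::collections::HashMap;

/// Hypothetical helper (not part of arrow): dictionary-encode `values`.
/// Returns the distinct strings in first-seen order, plus one key per
/// input row that indexes into that list.
fn dictionary_encode(values: &[&str]) -> (Vec<String>, Vec<usize>) {
    // Maps each string already seen to its position in `distinct`.
    let mut lookup: HashMap<&str, usize> = HashMap::new();
    let mut distinct: Vec<String> = Vec::new();
    let mut keys: Vec<usize> = Vec::with_capacity(values.len());

    for &v in values {
        // Copy the string only the first time it is encountered.
        let key = *lookup.entry(v).or_insert_with(|| {
            distinct.push(v.to_string());
            distinct.len() - 1
        });
        keys.push(key);
    }
    (distinct, keys)
}
```

The `keys` vector plays the role of the intermediate indexes mentioned in the comment; a direct approach would write deduplicated views without materializing it.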
The idea is quite similar to what we have in ArrowBytesMap in datafusion: check the hash before insert (https://github.com/apache/datafusion/blob/48a1754b33e983a8201ca3fefa36136fa44f0c55/datafusion/physical-expr-common/src/binary_map.rs#L332).
Ideally, if we have a well-supported StringViewArray, we don't need a specialized SSO ArrowBytesMap, but can convert any kind of array (including StringViewArray) to the optimized arrow Row format for group by 🤔
One difference is that
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Part of implementing StringView #5374
@XiangpengHao implemented gc, which compacts all the strings in a StringView/BinaryView into contiguous storage, in #5513.
However, that functionality does not deduplicate/intern the strings -- it just copies them over.
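As an illustration of the gap described above, the following sketch models compaction without deduplication. It is a simplified stand-in, not the arrow implementation: the `View` struct (an offset and length into a single buffer) and the `compact` function are hypothetical, and real views also inline short strings and can reference multiple buffers.

```rust
/// Simplified, hypothetical stand-in for a view: a byte range in one buffer.
#[derive(Debug, Clone, Copy, PartialEq)]
struct View {
    offset: usize,
    len: usize,
}

/// Copy the bytes referenced by each view into a fresh contiguous buffer.
/// Unreferenced bytes are dropped, but equal strings are copied once per
/// view, so duplicates survive compaction.
fn compact(buffer: &[u8], views: &[View]) -> (Vec<u8>, Vec<View>) {
    let mut out: Vec<u8> = Vec::new();
    let mut new_views: Vec<View> = Vec::with_capacity(views.len());

    for v in views {
        let start = out.len();
        out.extend_from_slice(&buffer[v.offset..v.offset + v.len]);
        new_views.push(View { offset: start, len: v.len });
    }
    (out, new_views)
}
```

With a buffer holding `helloUNUSEDhello` and two views over the two `hello` ranges, this model drops the unused bytes but still stores `hello` twice in the output.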
Describe the solution you'd like
We should make it easy to deduplicate the strings in a StringView.
I do not think we should change gc to do deduplication without an explicit ask (as deduplication is expensive).
Describe alternatives you've considered
- A new function (e.g. GenericBinaryView::dedupe) that deduplicated such arrays (likely not moving any strings, but just updating views)
- An option to GenericBinaryView::gc that controlled the behavior (as in could also specify doing gc)
Additional context
@alexwilcoxson-rel asked in #5904 (comment)
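The first alternative above (a `dedupe` that updates views without moving any strings) could look roughly like the sketch below. This is a plain-Rust model under stated assumptions, not a proposed arrow API: `View` is a hypothetical offset/length pair into a single buffer, and `dedupe_views` is an illustrative name.

```rust
use std::collections::HashMap;

/// Simplified, hypothetical stand-in for a view: a byte range in one buffer.
#[derive(Debug, Clone, Copy, PartialEq)]
struct View {
    offset: usize,
    len: usize,
}

/// Rewrite each view so that equal strings share the first occurrence.
/// The buffer itself is left untouched; only the views change.
fn dedupe_views(buffer: &[u8], views: &mut [View]) {
    // Maps string contents to the first view that referenced them.
    let mut first_seen: HashMap<&[u8], View> = HashMap::new();

    for v in views.iter_mut() {
        let bytes = &buffer[v.offset..v.offset + v.len];
        let canonical = *first_seen.entry(bytes).or_insert(*v);
        *v = canonical;
    }
}
```

After this pass, bytes that are no longer referenced by any view remain in the buffer, which is why the second alternative pairs deduplication with gc.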