-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve documentation on StringArrayType
trait
#12027
Conversation
I believe I've actually come up with a better way of handling the different string types for ArrayRef's that I think will be generally more useful than the fn substr_index(args: &[ArrayRef]) -> Result<ArrayRef> {
let string_array: StringArrays = StringArrays::try_from(&args[0])?;
let delimiter_array: StringArrays = StringArrays::try_from(&args[1])?;
let count_array: &PrimitiveArray<Int64Type> = args[2].as_primitive();
substr_index_general::<Int32Type, _, _>(string_array, delimiter_array, count_array)
}
pub fn substr_index_general<
'a,
T: ArrowPrimitiveType,
V: ArrayAccessor<Item = &'a str>,
P: ArrayAccessor<Item = i64>,
>(
string_array: V,
delimiter_array: V,
count_array: P,
) -> Result<ArrayRef>
where
T::Native: OffsetSizeTrait,
{
let mut builder = StringBuilder::new();
let string_iter = ArrayIter::new(string_array);
...
} StringArrays is an Enum that implements a variety of types including TryFrom, From, Array, ArrayAccessor, and, well, StringArrayType. I came up with this approach because I was unable and/or unhappy with the current approaches as I was refactoring the to_timestamp udf to directly support ingesting Utf8View arrays. |
I think the key for making these functions fast is that they need to end up with code that is (different) for the inner loops of each array type.
I was sort of imagining that the I may not totally understand what you are saying |
Do you have concrete examples? I think the new |
I've pushed a poc @ https://github.com/Omega359/arrow-datafusion/blob/842755cd4a54ed378deff997ede19322413428a5/datafusion/functions/src/utils.rs#L192 You can see example usages at
Here's an example of how it improves the code readability and conciseness: |
Thank you for the very detailed documentation! I think the enum-based approach indeed simplifies the interface. But I have two concerns: (2) The reason we have multiple types of string is to allow specialized implementation, for example, the recent With that said, I do appreciate the thoughts you put into unifying the string representations, I 100% agree that we have too many strings to deal with. I hope that we can get |
Thanks for taking the time to look at my idea and give feedback. I do have a few counterpoints to your concerns that may or may not affect things #1. For sure there is going to be a branch per call for some operations however I think that for the majority of usages of the StringArrays enum it would be just to get an iterator .. which would result in just two branch calls - the initial one for .try_from and the other for the .iter() call. Overall I suspect that won't be measurable in any meaningful way. Additionally, I would like to try and see if we can indeed measure the impact when doing an operation such as #2. Absolutely agree with this. In fact I think it may be a great idea to add documentation to StringArrays to note that it is best used for the general case and any specialized implementations should use the StringArrayType trait or switch between the actual string array implementations are desired. That being said I don't see any reason why code that needs to specialize for any reason could not just use any of the existing methods to switch between implementations based on the string array type. |
I went ahead and wrote up a benchmark to verify my assumptions wrt performance @ https://github.com/Omega359/arrow-datafusion/blob/feature/string_arrays/datafusion/functions/benches/string_arrays.rs Here are the results on my machine. 16k array, 20% null, values are 32 char length. You can see that times when using an iterator is almost equivalent among the approaches tested. Using loops with .value(idx) does have varied times between the approaches with the fastest being the direct approach, followed by StringArrayType then the StringArrays approach.
|
StringArrayType is just a trait and the code should compile to the same binary as direct approach thus same performance. Not sure why it is 2x slower than direct approach for With that said, I think most of the functions in Last thoughts: I'm not sure if we want to introduce an easy-to-use but expensive api. |
The implementation is different. StringArrayType goes through a fn as per typical usage of it (since StringArrayType isn't object safe). I think that is where the additional time has gone.
Fair, though it's only expensive when not using .iter(). It was an interesting poc for me so I learnt a few things if nothing else :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm thanks @alamb
Co-authored-by: Oleks V <comphead@users.noreply.github.com>
@Omega359 and @XiangpengHao -- what do you think we should do with the conversation above? #12027 (comment) I can't tell if we are suggesting we go with the My "gut" feel is the same as @XiangpengHao that implementing using generics (not The downside of the generics approach is that now we'll end up with 3 copies of most functions and the extra performance, if any, may not justify the binary overhead 🤔 |
Thank you for the review @comphead |
Sorry, missed the notification for this comment till now. The StringArrays may be the nicest API wise but it does incur unavoidable overhead for anything but .iter() (or at least I couldn't find a way to make that approach faster). I would say that for StringArrayType should be used in DataFusion for most use cases except where one would want specialized handling or it to be object safe (which that trait can't be). I'll have another go at the to_timestamp UDF using the StringArrayType approach but I suspect I may need to refactor it quite a bit to make that work. Hopefully I'll have a pull request next week for that. |
👍
Thank you very much. I look forward to it |
Which issue does this PR close?
Follow on to https://github.com/apache/datafusion/compare/main...alamb:datafusion:alamb/stringviewtype_docs?expand=1
Rationale for this change
@tlm365 and @Omega359 moved this trait into the common string utilities, and I think we can use it to make the string functions easier to work with (and eventually special case them like @xinlifoobar is doing in apache/arrow-rs#6231)
What changes are included in this PR?
Add documentation on
StringArrayType
traitAre these changes tested?
Yes, CI
Are there any user-facing changes?
Only documentation, no functional cahnges