Improve documentation on `StringArrayType` trait #12027

alamb · 2024-08-16T10:51:38Z

Which issue does this PR close?

Follow on to https://github.com/apache/datafusion/compare/main...alamb:datafusion:alamb/stringviewtype_docs?expand=1

Rationale for this change

@tlm365 and @Omega359 moved this trait into the common string utilities, and I think we can use it to make the string functions easier to work with (and eventually special case them like @xinlifoobar is doing in apache/arrow-rs#6231)

What changes are included in this PR?

Add documentation on StringArrayType trait

Are these changes tested?

Yes, CI

Are there any user-facing changes?

Only documentation, no functional changes

Omega359 · 2024-08-16T21:10:09Z

I believe I've actually come up with a better way of handling the different string types for ArrayRef's that I think will be generally more useful than the StringArrayType (though I don't know if we would want to remove it completely). Here's an example for the substr_index udf:

fn substr_index(args: &[ArrayRef]) -> Result<ArrayRef> {
    let string_array: StringArrays = StringArrays::try_from(&args[0])?;
    let delimiter_array: StringArrays = StringArrays::try_from(&args[1])?;
    let count_array: &PrimitiveArray<Int64Type> = args[2].as_primitive();
    substr_index_general::<Int32Type, _, _>(string_array, delimiter_array, count_array)
}

pub fn substr_index_general<
    'a,
    T: ArrowPrimitiveType,
    V: ArrayAccessor<Item = &'a str>,
    P: ArrayAccessor<Item = i64>,
>(
    string_array: V,
    delimiter_array: V,
    count_array: P,
) -> Result<ArrayRef>
where
    T::Native: OffsetSizeTrait,
{
    let mut builder = StringBuilder::new();
    let string_iter = ArrayIter::new(string_array);
...
}

StringArrays is an enum that implements a variety of traits including TryFrom, From, Array, ArrayAccessor, and, well, StringArrayType.

I came up with this approach because I was unable and/or unhappy with the current approaches as I was refactoring the to_timestamp udf to directly support ingesting Utf8View arrays.

alamb · 2024-08-17T11:02:14Z

> StringArrays is an enum that implements a variety of traits including TryFrom, From, Array, ArrayAccessor, and, well, StringArrayType.

I think the key for making these functions fast is that they need to end up with code that is (different) for the inner loops of each array type.

> I came up with this approach because I was unable and/or unhappy with the current approaches as I was refactoring the to_timestamp udf to directly support ingesting Utf8View arrays.

I was sort of imagining that the to_timestamp udf would be implemented in terms of StringArrayType and then there would be a match/switch that would invoke that implementation with the different concrete types. The compiler would then generate the special (different) inner loops for the different types.
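A minimal sketch of that pattern, using toy stand-in types rather than the real arrow arrays. The names `StrValues`, `OffsetStrings`, `OwnedStrings`, `AnyStrings`, and `total_chars` are all hypothetical and chosen only for illustration; the sketch shows the shape of "write the kernel once against a trait, match once on the concrete type", not how to_timestamp is actually written.

```rust
/// Hypothetical stand-in for a trait like `StringArrayType`:
/// anything that can hand out `&str` values by index.
trait StrValues {
    fn len(&self) -> usize;
    fn value(&self, i: usize) -> &str;
}

/// Toy "offsets into one buffer" layout (assumption: loosely like a string array).
struct OffsetStrings {
    data: String,
    offsets: Vec<usize>,
}

impl OffsetStrings {
    fn from_strs(values: &[&str]) -> Self {
        let mut data = String::new();
        let mut offsets = vec![0];
        for v in values {
            data.push_str(v);
            offsets.push(data.len());
        }
        Self { data, offsets }
    }
}

/// Toy "one allocation per value" layout, standing in for a second array type.
struct OwnedStrings(Vec<String>);

impl StrValues for OffsetStrings {
    fn len(&self) -> usize {
        self.offsets.len() - 1
    }
    fn value(&self, i: usize) -> &str {
        &self.data[self.offsets[i]..self.offsets[i + 1]]
    }
}

impl StrValues for OwnedStrings {
    fn len(&self) -> usize {
        self.0.len()
    }
    fn value(&self, i: usize) -> &str {
        &self.0[i]
    }
}

/// The kernel is written once against the trait. The compiler
/// monomorphizes one copy of this loop per concrete type it is called with.
fn total_chars<A: StrValues>(array: &A) -> usize {
    (0..array.len()).map(|i| array.value(i).chars().count()).sum()
}

/// Stand-in for a dynamically typed array reference.
enum AnyStrings {
    Offset(OffsetStrings),
    Owned(OwnedStrings),
}

/// The single match happens here, outside the inner loop.
fn total_chars_dyn(array: &AnyStrings) -> usize {
    match array {
        AnyStrings::Offset(a) => total_chars(a),
        AnyStrings::Owned(a) => total_chars(a),
    }
}
```

Each match arm calls a different instantiation of `total_chars`, so the loop body never has to re-check which layout it is reading.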

I may not totally understand what you are saying

XiangpengHao · 2024-08-17T11:42:41Z

> I was unable and/or unhappy with the current approaches

Do you have concrete examples?

I think the new StringArrayType is just a wrapper around several existing traits. We have used them quite often, and making them into a new trait would improve visibility/documentation.

Omega359 · 2024-08-17T12:55:21Z

I've pushed a poc @ https://github.com/Omega359/arrow-datafusion/blob/842755cd4a54ed378deff997ede19322413428a5/datafusion/functions/src/utils.rs#L192

You can see example usages at

Here's an example of how it improves the code readability and conciseness:

Omega359@50359b1

XiangpengHao · 2024-08-18T00:52:46Z

Thank you for the very detailed documentation! I think the enum-based approach indeed simplifies the interface.

But I have two concerns:
(1) every time we call a StringArrays function, e.g., StringArrays::value, we need to match the enum type which is an extra branch. This branch can be a performance bottleneck for high-performance/large volumes of data. This branch is required for every operation this enum implements and there's no easy way to work around it. Instead, the generic approach (the current one) makes this match early in the function call and won't need to recheck the data type.

(2) The reason we have multiple types of string is to allow specialized implementation, for example, the recent substr operation. In that case, we need to write different code for different string types. This means that we won't simply use StringArrays::value to get string value, but instead, we will first get the view of a string, then get the underlying buffer, sometimes we even want to know the buffer id. This means that we need to convert ArrayRef -> StringArrays -> [StringViewArray/StringArray/LargeStringArray]. With the generic approach, we can easily specialize the implementation.
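To make concern (1) concrete, here is a small sketch contrasting per-call dispatch with up-front dispatch. The `Strings` enum and the `count_prefix_*` functions are hypothetical stand-ins built on plain `Vec`s, not the proposed `StringArrays` type, and the sketch says nothing about how large the cost is in practice.

```rust
/// Hypothetical enum wrapper over two string containers.
enum Strings {
    Owned(Vec<String>),
    Static(Vec<&'static str>),
}

impl Strings {
    fn len(&self) -> usize {
        match self {
            Strings::Owned(v) => v.len(),
            Strings::Static(v) => v.len(),
        }
    }

    /// Every call has to re-check which variant is present.
    fn value(&self, i: usize) -> &str {
        match self {
            Strings::Owned(v) => v[i].as_str(),
            Strings::Static(v) => v[i],
        }
    }
}

/// Per-call dispatch: the match inside `value` runs once per row.
fn count_prefix_per_call(arr: &Strings, prefix: &str) -> usize {
    (0..arr.len())
        .filter(|&i| arr.value(i).starts_with(prefix))
        .count()
}

/// Generic inner loop, compiled separately for each element type.
fn count_prefix_generic<S: AsRef<str>>(values: &[S], prefix: &str) -> usize {
    let mut n = 0;
    for s in values {
        if s.as_ref().starts_with(prefix) {
            n += 1;
        }
    }
    n
}

/// Up-front dispatch: match once, then run a type-specific loop.
fn count_prefix_hoisted(arr: &Strings, prefix: &str) -> usize {
    match arr {
        Strings::Owned(v) => count_prefix_generic(v, prefix),
        Strings::Static(v) => count_prefix_generic(v, prefix),
    }
}
```

Both functions return the same answer. They differ only in where the variant check sits relative to the loop, which is the trade-off being discussed.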

With that said, I do appreciate the thoughts you put into unifying the string representations, I 100% agree that we have too many strings to deal with. I hope that we can get StringView in DataFusion by default soon, and then probably simplify our implementation by using a subset of the string types.

Omega359 · 2024-08-18T15:11:57Z

Thanks for taking the time to look at my idea and give feedback. I do have a few counterpoints to your concerns that may or may not affect things

#1. There will certainly be a branch per call for some operations. However, I think the majority of uses of the StringArrays enum would just get an iterator, which results in only two branches: the initial one for .try_from and the other for the .iter() call. Overall I suspect that won't be measurable in any meaningful way.

Additionally, I would like to try and see if we can indeed measure the impact when doing an operation such as StringArrays::value over and over in a loop with a benchmark. I am wondering if CPU branch prediction will kick in and result in a negligible overhead.

#2. Absolutely agree with this. In fact I think it may be a great idea to add documentation to StringArrays to note that it is best used for the general case and that any specialized implementations should use the StringArrayType trait or switch between the actual string array implementations as desired. That being said, I don't see any reason why code that needs to specialize could not just use any of the existing methods to switch between implementations based on the string array type.

Omega359 · 2024-08-18T17:25:49Z

I went ahead and wrote up a benchmark to verify my assumptions wrt performance @ https://github.com/Omega359/arrow-datafusion/blob/feature/string_arrays/datafusion/functions/benches/string_arrays.rs

Here are the results on my machine: 16k array, 20% null, values are 32 chars long. You can see that times when using an iterator are almost equivalent among the approaches tested. Using loops with .value(idx) does show varied times between the approaches, with the fastest being the direct approach, followed by StringArrayType, then the StringArrays approach.

string_arrays benchmark/StringArrays-iter-16384/32
                        time:   [18.011 µs 18.093 µs 18.172 µs]
Found 8 outliers among 100 measurements (8.00%)
  2 (2.00%) low mild
  4 (4.00%) high mild
  2 (2.00%) high severe
string_arrays benchmark/StringArrays-using_loop-16384/32
                        time:   [71.839 µs 72.050 µs 72.291 µs]
Found 10 outliers among 100 measurements (10.00%)
  4 (4.00%) low mild
  5 (5.00%) high mild
  1 (1.00%) high severe
string_arrays benchmark/direct-iter-16384/32
                        time:   [16.597 µs 16.654 µs 16.718 µs]
Found 8 outliers among 100 measurements (8.00%)
  4 (4.00%) high mild
  4 (4.00%) high severe
string_arrays benchmark/direct-using_loop-16384/32
                        time:   [27.554 µs 27.763 µs 27.989 µs]
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) low mild
  1 (1.00%) high mild
  4 (4.00%) high severe
string_arrays benchmark/StringArrayType-iter-16384/32
                        time:   [18.375 µs 18.752 µs 19.175 µs]
Found 5 outliers among 100 measurements (5.00%)
  1 (1.00%) low mild
  4 (4.00%) high mild
string_arrays benchmark/StringArrayType-using_loop-16384/32
                        time:   [44.859 µs 44.946 µs 45.042 µs]
Found 7 outliers among 100 measurements (7.00%)
  4 (4.00%) high mild
  3 (3.00%) high severe
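For a quick read of these numbers, the sketch below divides the middle value of each reported interval by the matching direct baseline. The `slowdown` and `report` helpers are hypothetical names; the inputs are copied from the output above and nothing else is measured.

```rust
/// Ratio of one approach's time to the direct baseline (both in microseconds).
fn slowdown(time_us: f64, baseline_us: f64) -> f64 {
    time_us / baseline_us
}

/// Middle values of the intervals reported in the benchmark output above.
fn report() -> Vec<(&'static str, f64)> {
    let direct_iter = 16.654;
    let direct_loop = 27.763;
    vec![
        ("StringArrays iter", slowdown(18.093, direct_iter)),
        ("StringArrayType iter", slowdown(18.752, direct_iter)),
        ("StringArrays using_loop", slowdown(72.050, direct_loop)),
        ("StringArrayType using_loop", slowdown(44.946, direct_loop)),
    ]
}
```

On these figures the iterator variants land within roughly 1.09x to 1.13x of the direct approach, while the indexed loops come out at roughly 2.6x (StringArrays) and 1.6x (StringArrayType).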

XiangpengHao · 2024-08-18T23:46:48Z

> fastest being the direct approach, followed by StringArrayType

StringArrayType is just a trait and the code should compile to the same binary as direct approach thus same performance. Not sure why it is 2x slower than direct approach for using_loop.

With that said, I think most of the functions in StringArrays can be implemented as trait functions in StringArrayType, thus avoiding the overhead but still having a uniform interface.

Last thoughts: I'm not sure if we want to introduce an easy-to-use but expensive api.

Omega359 · 2024-08-19T13:16:13Z

> StringArrayType is just a trait and the code should compile to the same binary as direct approach thus same performance. Not sure why it is 2x slower than direct approach for using_loop.

The implementation is different. StringArrayType goes through a fn as per typical usage of it (since StringArrayType isn't object safe). I think that is where the additional time has gone.
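A rough illustration of the object-safety point, under the assumption that the trait carries a `Sized` supertrait (the `StrArray` trait and `longest` function here are hypothetical, and the real trait's definition may differ). A trait with that bound cannot be used as `dyn Trait`, so callers reach it through a generic function instead.

```rust
/// Hypothetical trait; the `Sized` supertrait is an assumption made for this sketch.
trait StrArray: Sized {
    fn len(&self) -> usize;
    fn value(&self, i: usize) -> &str;
}

impl StrArray for Vec<String> {
    fn len(&self) -> usize {
        Vec::len(self)
    }
    fn value(&self, i: usize) -> &str {
        &self[i]
    }
}

impl StrArray for Vec<&'static str> {
    fn len(&self) -> usize {
        Vec::len(self)
    }
    fn value(&self, i: usize) -> &str {
        self[i]
    }
}

// Not allowed: because of the `Sized` supertrait, `StrArray` cannot be
// turned into a trait object.
// fn longest_dyn(array: &dyn StrArray) -> Option<&str> { ... }

/// The usual route instead: a function that is generic over the trait.
fn longest<A: StrArray>(array: &A) -> Option<&str> {
    (0..array.len())
        .map(|i| array.value(i))
        .max_by_key(|s| s.len())
}
```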

> Last thoughts: I'm not sure if we want to introduce an easy-to-use but expensive api.

Fair, though it's only expensive when not using .iter(). It was an interesting poc for me so I learnt a few things if nothing else :)

datafusion/functions/src/string/common.rs

comphead

lgtm thanks @alamb


alamb · 2024-08-21T19:02:18Z

@Omega359 and @XiangpengHao -- what do you think we should do with the conversation above? #12027 (comment) I can't tell if we are suggesting we go with the StringArrays approach or if the StringArrayType trait is ok?

My "gut" feel is the same as @XiangpengHao: implementing using generics (not dyn StringArrayType, but a function that is generic over StringArrayType) should be at least as fast as the "dynamic dispatch" mechanism of StringArrays, because the compiler gets a chance to build special code for it.

The downside of the generics approach is that now we'll end up with 3 copies of most functions and the extra performance, if any, may not justify the binary overhead 🤔

alamb · 2024-08-21T19:02:45Z

Thank you for the review @comphead

Omega359 · 2024-08-24T20:27:39Z

> @Omega359 and @XiangpengHao -- what do you think we should do with the conversation above? #12027 (comment) I can't tell if we are suggesting we go with the StringArrays approach or if the StringArrayType trait is ok?
>
> My "gut" feel is the same as @XiangpengHao: implementing using generics (not dyn StringArrayType, but a function that is generic over StringArrayType) should be at least as fast as the "dynamic dispatch" mechanism of StringArrays, because the compiler gets a chance to build special code for it.
>
> The downside of the generics approach is that now we'll end up with 3 copies of most functions and the extra performance, if any, may not justify the binary overhead 🤔

Sorry, missed the notification for this comment till now.

StringArrays may be the nicest API-wise, but it does incur unavoidable overhead for anything but .iter() (or at least I couldn't find a way to make that approach faster). I would say that StringArrayType should be used in DataFusion for most use cases, except where one would want specialized handling or need it to be object safe (which that trait can't be).

I'll have another go at the to_timestamp UDF using the StringArrayType approach but I suspect I may need to refactor it quite a bit to make that work. Hopefully I'll have a pull request next week for that.

alamb · 2024-08-25T11:29:59Z

> StringArrays may be the nicest API-wise, but it does incur unavoidable overhead for anything but .iter() (or at least I couldn't find a way to make that approach faster). I would say that StringArrayType should be used in DataFusion for most use cases, except where one would want specialized handling or need it to be object safe (which that trait can't be).

👍

> I'll have another go at the to_timestamp UDF using the StringArrayType approach but I suspect I may need to refactor it quite a bit to make that work. Hopefully I'll have a pull request next week for that.

Thank you very much. I look forward to it

alamb added 2 commits August 16, 2024 06:46

Improve documentation on StringArrayType trait

c15262e

tweaks

05d4f41

github-actions bot added the functions label Aug 16, 2024

alamb mentioned this pull request Aug 16, 2024

Improve performance of REPEAT functions #12015

Merged

alamb added the documentation Improvements or additions to documentation label Aug 16, 2024

alamb mentioned this pull request Aug 17, 2024

Specialize Prefix/Suffix Match for Like/ILike between Array and Scalar for StringViewArray apache/arrow-rs#6231

Merged

alamb mentioned this pull request Aug 20, 2024

Improve rpad udf by using a GenericStringBuilder #12070

Merged

comphead reviewed Aug 21, 2024

datafusion/functions/src/string/common.rs (outdated, resolved)

comphead approved these changes Aug 21, 2024

Update datafusion/functions/src/string/common.rs

da1899b

Co-authored-by: Oleks V <comphead@users.noreply.github.com>

github-actions bot removed the documentation Improvements or additions to documentation label Aug 21, 2024

Merge remote-tracking branch 'apache/main' into alamb/stringviewtype_docs

0eb3e15

alamb merged commit 6eea180 into apache:main Aug 22, 2024
24 checks passed

alamb deleted the alamb/stringviewtype_docs branch August 25, 2024 11:29
