Improve documentation on StringArrayType trait #12027

Merged
merged 4 commits into from
Aug 22, 2024

Conversation

alamb
Contributor

@alamb alamb commented Aug 16, 2024

Which issue does this PR close?

Follow on to https://github.com/apache/datafusion/compare/main...alamb:datafusion:alamb/stringviewtype_docs?expand=1

Rationale for this change

@tlm365 and @Omega359 moved this trait into the common string utilities, and I think we can use it to make the string functions easier to work with (and eventually special case them like @xinlifoobar is doing in apache/arrow-rs#6231)

What changes are included in this PR?

Add documentation on StringArrayType trait

Are these changes tested?

Yes, CI

Are there any user-facing changes?

Only documentation, no functional changes

@alamb alamb added the documentation Improvements or additions to documentation label Aug 16, 2024
@Omega359
Contributor

Omega359 commented Aug 16, 2024

I believe I've actually come up with a better way of handling the different string types for ArrayRef's that I think will be generally more useful than the StringArrayType (though I don't know if we would want to remove it completely). Here's an example for the substr_index udf:

fn substr_index(args: &[ArrayRef]) -> Result<ArrayRef> {
    let string_array: StringArrays = StringArrays::try_from(&args[0])?;
    let delimiter_array: StringArrays = StringArrays::try_from(&args[1])?;
    let count_array: &PrimitiveArray<Int64Type> = args[2].as_primitive();
    substr_index_general::<Int32Type, _, _>(string_array, delimiter_array, count_array)
}

pub fn substr_index_general<
    'a,
    T: ArrowPrimitiveType,
    V: ArrayAccessor<Item = &'a str>,
    P: ArrayAccessor<Item = i64>,
>(
    string_array: V,
    delimiter_array: V,
    count_array: P,
) -> Result<ArrayRef>
where
    T::Native: OffsetSizeTrait,
{
    let mut builder = StringBuilder::new();
    let string_iter = ArrayIter::new(string_array);
...
}

StringArrays is an enum that implements a variety of traits, including TryFrom, From, Array, ArrayAccessor, and, well, StringArrayType.

I came up with this approach because I was unable and/or unhappy with the current approaches as I was refactoring the to_timestamp udf to directly support ingesting Utf8View arrays.

@alamb
Contributor Author

alamb commented Aug 17, 2024

StringArrays is an enum that implements a variety of traits, including TryFrom, From, Array, ArrayAccessor, and, well, StringArrayType.

I think the key for making these functions fast is that they need to end up with code that is (different) for the inner loops of each array type.

I came up with this approach because I was unable and/or unhappy with the current approaches as I was refactoring the to_timestamp udf to directly support ingesting Utf8View arrays.

I was sort of imagining that the to_timestamp udf would be implemented in terms of StringArrayType, and then there would be a match/switch that would invoke that implementation with the different concrete types. The compiler would then generate the special (different) inner loops for the different types.

I may not totally understand what you are saying
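
The match/switch idea described above can be sketched without Arrow at all. This is only an illustration under stated assumptions: `OffsetStrings`, `ViewStrings`, `StringLike`, `AnyStrings`, and both functions are hypothetical stand-ins invented for this example, not the arrow-rs or DataFusion types. The point is that the kernel is written once against a trait, and the one `match` selects which monomorphized copy runs.

```rust
// Hypothetical stand-ins for two string array layouts; these are NOT the
// arrow-rs types, just enough structure to show the dispatch pattern.
struct OffsetStrings {
    data: String,        // all values concatenated
    offsets: Vec<usize>, // value i is data[offsets[i]..offsets[i + 1]]
}

struct ViewStrings {
    values: Vec<String>, // one owned value per row
}

/// Hypothetical trait playing the role that `StringArrayType` plays in the PR.
trait StringLike {
    fn len(&self) -> usize;
    fn value(&self, i: usize) -> &str;
}

impl StringLike for OffsetStrings {
    fn len(&self) -> usize {
        self.offsets.len().saturating_sub(1)
    }
    fn value(&self, i: usize) -> &str {
        &self.data[self.offsets[i]..self.offsets[i + 1]]
    }
}

impl StringLike for ViewStrings {
    fn len(&self) -> usize {
        self.values.len()
    }
    fn value(&self, i: usize) -> &str {
        &self.values[i]
    }
}

/// Generic kernel: the compiler emits one copy per concrete `A`,
/// so the inner loop itself contains no check of the array type.
fn total_bytes<A: StringLike>(array: &A) -> usize {
    (0..array.len()).map(|i| array.value(i).len()).sum()
}

/// Runtime-typed input, analogous to an array whose type is only known at run time.
enum AnyStrings {
    Offset(OffsetStrings),
    View(ViewStrings),
}

/// The single match: pick the concrete type once, then run the specialized kernel.
fn total_bytes_dispatch(array: &AnyStrings) -> usize {
    match array {
        AnyStrings::Offset(a) => total_bytes(a),
        AnyStrings::View(a) => total_bytes(a),
    }
}
```

Each match arm calls a different instantiation of `total_bytes`, which is also why this style produces one copy of the kernel per string type.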

@XiangpengHao
Contributor

I was unable and/or unhappy with the current approaches

Do you have concrete examples?

I think the new StringArrayType is just a wrapper around several existing traits. We have used them quite often, and making them into a new trait would improve visibility and documentation.

@XiangpengHao
Contributor

XiangpengHao commented Aug 18, 2024

Thank you for the very detailed documentation! I think the enum-based approach indeed simplifies the interface.

But I have two concerns:
(1) every time we call a StringArrays function, e.g., StringArrays::value, we need to match the enum type which is an extra branch. This branch can be a performance bottleneck for high-performance/large volumes of data. This branch is required for every operation this enum implements and there's no easy way to work around it. Instead, the generic approach (the current one) makes this match early in the function call and won't need to recheck the data type.

(2) The reason we have multiple types of string is to allow specialized implementation, for example, the recent substr operation. In that case, we need to write different code for different string types. This means that we won't simply use StringArrays::value to get string value, but instead, we will first get the view of a string, then get the underlying buffer, sometimes we even want to know the buffer id. This means that we need to convert ArrayRef -> StringArrays -> [StringViewArray/StringArray/LargeStringArray]. With the generic approach, we can easily specialize the implementation.

With that said, I do appreciate the thoughts you put into unifying the string representations, I 100% agree that we have too many strings to deal with. I hope that we can get StringView in DataFusion by default soon, and then probably simplify our implementation by using a subset of the string types.
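
Concern (1) above can be made concrete with a small sketch. Everything here is an assumption for illustration: this `StringArrays` enum holds plain vectors rather than Arrow arrays, and the two counting functions are hypothetical. The first loop goes through an enum accessor that matches on every call; the second matches once and then runs a loop specialized for the element type.

```rust
// Hypothetical enum wrapper, loosely modelled on the `StringArrays` idea above.
// The variants hold plain vectors, not real Arrow arrays.
enum StringArrays {
    Utf8(Vec<String>),
    Utf8View(Vec<&'static str>),
}

impl StringArrays {
    fn len(&self) -> usize {
        match self {
            StringArrays::Utf8(v) => v.len(),
            StringArrays::Utf8View(v) => v.len(),
        }
    }

    /// Convenient, but the `match` is evaluated again on every call.
    fn value(&self, i: usize) -> &str {
        match self {
            StringArrays::Utf8(v) => v[i].as_str(),
            StringArrays::Utf8View(v) => v[i],
        }
    }
}

/// Enum-style loop: one variant check per element.
fn count_non_empty_per_call(array: &StringArrays) -> usize {
    (0..array.len()).filter(|&i| !array.value(i).is_empty()).count()
}

/// Generic-style loop: match once, then run a loop specialized for the element type.
fn count_non_empty_hoisted(array: &StringArrays) -> usize {
    fn kernel<S: AsRef<str>>(values: &[S]) -> usize {
        let mut count = 0;
        for v in values {
            let s: &str = v.as_ref();
            if !s.is_empty() {
                count += 1;
            }
        }
        count
    }
    match array {
        StringArrays::Utf8(v) => kernel(v.as_slice()),
        StringArrays::Utf8View(v) => kernel(v.as_slice()),
    }
}
```

Both functions return the same answer; they differ only in where the variant check sits. An optimizer may or may not hoist the per-call match out of a loop this simple, so any real cost has to be measured rather than assumed.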

@Omega359
Contributor

Thanks for taking the time to look at my idea and give feedback. I do have a few counterpoints to your concerns that may or may not affect things

#1. For sure there is going to be a branch per call for some operations however I think that for the majority of usages of the StringArrays enum it would be just to get an iterator .. which would result in just two branch calls - the initial one for .try_from and the other for the .iter() call. Overall I suspect that won't be measurable in any meaningful way.

Additionally, I would like to try and see if we can indeed measure the impact when doing an operation such as StringArrays::value over and over in a loop with a benchmark. I am wondering if CPU branch prediction will kick in and result in a negligible overhead.

#2. Absolutely agree with this. In fact I think it may be a great idea to add documentation to StringArrays to note that it is best used for the general case, and that any specialized implementations should use the StringArrayType trait or switch between the actual string array implementations as desired. That being said, I don't see any reason why code that needs to specialize could not just use any of the existing methods to switch between implementations based on the string array type.

@Omega359
Contributor

I went ahead and wrote up a benchmark to verify my assumptions wrt performance @ https://github.com/Omega359/arrow-datafusion/blob/feature/string_arrays/datafusion/functions/benches/string_arrays.rs

Here are the results on my machine: 16k array, 20% null, values 32 chars long. You can see that the times when using an iterator are almost equivalent among the approaches tested. Using loops with .value(idx) does give varied times between the approaches, with the fastest being the direct approach, followed by StringArrayType, then the StringArrays approach.

string_arrays benchmark/StringArrays-iter-16384/32
                        time:   [18.011 µs 18.093 µs 18.172 µs]
Found 8 outliers among 100 measurements (8.00%)
  2 (2.00%) low mild
  4 (4.00%) high mild
  2 (2.00%) high severe
string_arrays benchmark/StringArrays-using_loop-16384/32
                        time:   [71.839 µs 72.050 µs 72.291 µs]
Found 10 outliers among 100 measurements (10.00%)
  4 (4.00%) low mild
  5 (5.00%) high mild
  1 (1.00%) high severe
string_arrays benchmark/direct-iter-16384/32
                        time:   [16.597 µs 16.654 µs 16.718 µs]
Found 8 outliers among 100 measurements (8.00%)
  4 (4.00%) high mild
  4 (4.00%) high severe
string_arrays benchmark/direct-using_loop-16384/32
                        time:   [27.554 µs 27.763 µs 27.989 µs]
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) low mild
  1 (1.00%) high mild
  4 (4.00%) high severe
string_arrays benchmark/StringArrayType-iter-16384/32
                        time:   [18.375 µs 18.752 µs 19.175 µs]
Found 5 outliers among 100 measurements (5.00%)
  1 (1.00%) low mild
  4 (4.00%) high mild
string_arrays benchmark/StringArrayType-using_loop-16384/32
                        time:   [44.859 µs 44.946 µs 45.042 µs]
Found 7 outliers among 100 measurements (7.00%)
  4 (4.00%) high mild
  3 (3.00%) high severe
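
To put the numbers above side by side, the following sketch divides each middle estimate by the corresponding "direct" estimate. The figures are copied from the benchmark output in this comment; the `RESULTS` table and the `slowdown_vs_direct` helper are names made up for this illustration.

```rust
/// Middle estimates in microseconds, copied from the benchmark output above:
/// (approach, iter time, using_loop time).
const RESULTS: [(&str, f64, f64); 3] = [
    ("direct", 16.654, 27.763),
    ("StringArrayType", 18.752, 44.946),
    ("StringArrays", 18.093, 72.050),
];

/// Returns (approach, iter ratio, loop ratio), each relative to the direct approach.
fn slowdown_vs_direct() -> Vec<(&'static str, f64, f64)> {
    let (_, base_iter, base_loop) = RESULTS[0];
    RESULTS
        .iter()
        .map(|&(name, iter_us, loop_us)| (name, iter_us / base_iter, loop_us / base_loop))
        .collect()
}
```

On these figures the iterator variants land within roughly 9 to 13 percent of the direct approach, while the `.value(idx)` loops come out at about 1.6x for StringArrayType and about 2.6x for StringArrays.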

@XiangpengHao
Contributor

fastest being the direct approach, followed by StringArrayType

StringArrayType is just a trait and the code should compile to the same binary as direct approach thus same performance. Not sure why it is 2x slower than direct approach for using_loop.

With that said, I think most of the functions in StringArrays can be implemented as trait functions in StringArrayType, thus avoiding the overhead but still having a uniform interface.

Last thoughts: I'm not sure if we want to introduce an easy-to-use but expensive api.

@Omega359
Contributor

StringArrayType is just a trait and the code should compile to the same binary as direct approach thus same performance. Not sure why it is 2x slower than direct approach for using_loop.

The implementation is different. StringArrayType goes through a fn as per typical usage of it (since StringArrayType isn't object safe). I think that is where the additional time has gone.

Last thoughts: I'm not sure if we want to introduce an easy-to-use but expensive api.

Fair, though it's only expensive when not using .iter(). It was an interesting poc for me so I learnt a few things if nothing else :)

Contributor

@comphead comphead left a comment

lgtm thanks @alamb

Co-authored-by: Oleks V <comphead@users.noreply.github.com>
@github-actions github-actions bot removed the documentation Improvements or additions to documentation label Aug 21, 2024
@alamb
Contributor Author

alamb commented Aug 21, 2024

@Omega359 and @XiangpengHao -- what do you think we should do with the conversation above? #12027 (comment) I can't tell if we are suggesting we go with the StringArrays approach or if the StringArrayType trait is ok?

My "gut" feel is the same as @XiangpengHao's: implementing this using generics (not dyn StringArrayType, but a function that is generic over StringArrayType) should be at least as fast as the "dynamic dispatch" mechanism of StringArrays (because the compiler gets a chance to build special code for it)

The downside of the generics approach is that now we'll end up with 3 copies of most functions and the extra performance, if any, may not justify the binary overhead 🤔

@alamb
Contributor Author

alamb commented Aug 21, 2024

Thank you for the review @comphead

@alamb alamb merged commit 6eea180 into apache:main Aug 22, 2024
24 checks passed
@Omega359
Contributor

@Omega359 and @XiangpengHao -- what do you think we should do with the conversation above? #12027 (comment) I can't tell if we are suggesting we go with the StringArrays approach or if the StringArrayType trait is ok?

My "gut" feel is the same as @XiangpengHao's: implementing this using generics (not dyn StringArrayType, but a function that is generic over StringArrayType) should be at least as fast as the "dynamic dispatch" mechanism of StringArrays (because the compiler gets a chance to build special code for it)

The downside of the generics approach is that now we'll end up with 3 copies of most functions and the extra performance, if any, may not justify the binary overhead 🤔

Sorry, missed the notification for this comment till now.

StringArrays may be the nicest API-wise, but it does incur unavoidable overhead for anything but .iter() (or at least I couldn't find a way to make that approach faster). I would say that StringArrayType should be used in DataFusion for most use cases, except where one would want specialized handling or needs it to be object safe (which that trait can't be).

I'll have another go at the to_timestamp UDF using the StringArrayType approach but I suspect I may need to refactor it quite a bit to make that work. Hopefully I'll have a pull request next week for that.

@alamb alamb deleted the alamb/stringviewtype_docs branch August 25, 2024 11:29
@alamb
Contributor Author

alamb commented Aug 25, 2024

StringArrays may be the nicest API-wise, but it does incur unavoidable overhead for anything but .iter() (or at least I couldn't find a way to make that approach faster). I would say that StringArrayType should be used in DataFusion for most use cases, except where one would want specialized handling or needs it to be object safe (which that trait can't be).

👍

I'll have another go at the to_timestamp UDF using the StringArrayType approach but I suspect I may need to refactor it quite a bit to make that work. Hopefully I'll have a pull request next week for that.

Thank you very much. I look forward to it
