Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ability to append non contiguous strings to StringBuilder #6347

Closed
alamb opened this issue Sep 2, 2024 · 5 comments · Fixed by #6372
Closed

Ability to append non contiguous strings to StringBuilder #6347

alamb opened this issue Sep 2, 2024 · 5 comments · Fixed by #6372
Labels
arrow Changes to the arrow crate enhancement Any new improvement worthy of a entry in the changelog

Comments

@alamb
Copy link
Contributor

alamb commented Sep 2, 2024

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

DataFusion has an optimized version of concat(col1, ...) for strings (I believe contributed by @JasonLi-cn ) that avoids:

  1. Copying strings multiple times
  2. Manipulating nulls uncessarly

To do this today, we added StringArrayBuilder, which is similar but not hte same as StringBuilder in arrow
https://github.com/apache/datafusion/blob/4838cfbf453f3c21d9c5a84f9577329dd78aa763/datafusion/functions/src/string/common.rs#L354-L417

The major differences are:

  1. You can call write to incrementally build up each string and then call append_offset to create each string. StringBuilder requires each input to be a single contiguous string to call https://docs.rs/arrow/latest/arrow/array/type.StringBuilder.html#method.append_value
  2. You can call finish() with the specified null buffer (rather than building it up incrementally)

Describe the solution you'd like
I think it is worth figuring out how to create a similar API for StringBuilder

Incrementally wirte values

Here is one ideal suggestion of how to write values that I think would be relatively easy to use:

let mut builder = StringBuilder::with_capacity(...);
// scope for lifetime
{ 
  // get something that implements std::io::write
  let writable = builder.writeable();
  write!(writeable, "foo"); // append "foo" to the inprogress string
  write!(writeable, "bar"); // append "bar" to the inprogress string
} // scope close, finishes the string "foobar"

Similarly, adding a finish_with_nulls(..) type function that took a NullBuffer would be beneficial if the caller already knew about nulls

Describe alternatives you've considered

We could not do this at all (or just keep the code downstream in DataFusion)

Additional context

@alamb alamb added enhancement Any new improvement worthy of a entry in the changelog arrow Changes to the arrow crate labels Sep 2, 2024
@devanbenz
Copy link

devanbenz commented Sep 2, 2024

@alamb i would love to take this on since it relates to some of the work I was doing on the DF concat UDF 🫡 I might not be able to start it til next week but it doesn't seem super high priority :)

@tustvold
Copy link
Contributor

tustvold commented Sep 2, 2024

Here is one ideal suggestion of how to write values that I think would be relatively easy to use:

This is actually already supported

https://docs.rs/arrow-array/latest/arrow_array/builder/type.GenericStringBuilder.html#example

Similarly, adding a finish_with_nulls(..) type function that took a NullBuffer would be beneficial if the caller already knew about nulls

This could be achieved by appending empty strings instead of nulls, and then modifying the output, either by using into_parts and reconstructing the array, or using the nullif kernel.

@alamb
Copy link
Contributor Author

alamb commented Sep 9, 2024

What I plan to do here is:

  1. Update the documentation to try and make it clearer how to do this
  2. File a separate ticket to add the same support for StringViewArray

@alamb
Copy link
Contributor Author

alamb commented Sep 9, 2024

Update the documentation to try and make it clearer how to do this

#6372

@alamb
Copy link
Contributor Author

alamb commented Sep 9, 2024

Filed #6373 to track adding same support to StringView.

The other thing this ticket currently discusses is the ability to provide a NullBuffer rather than computing it as part of the builder. But we can track that with a follow on ticket if / when we care

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate enhancement Any new improvement worthy of a entry in the changelog
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants