-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-37710: [C++][Integration] Add C++ Utf8View implementation #37792
Conversation
d468860
to
660008f
Compare
660008f
to
614a411
Compare
614a411
to
b62d539
Compare
cpp/src/arrow/type_traits.h
Outdated
using is_binary_like_type = std::integral_constant< | ||
bool, (is_base_binary_type<T>::value && !is_string_like_type<T>::value) || | ||
is_binary_view_type<T>::value || is_fixed_size_binary_type<T>::value>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This changes the meaning of is_binary_like_type
. For list-views I added new predicates and kept the existing predicates' semantics.
Illustrate in this graph https://gist.github.com/felipecrv/3c02f3784221d946dec1b031c6d400db
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with @felipecrv btw. This may silently produce incorrect binary-view handling in operations not touched by this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These traits are used in very few places. I think the current arrangement is fine; "binary like" = has a byte array, "string like" = has a byte array which is utf8.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've made a follow up to address this #38204
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The question is whether these are used by third-party code. These are not internal APIs, and there is a risk that user code will then try to use binary view data in non-compatible routines, and get UB or data corruption.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've reverted the change to is_binary_like_type
and is_string_like_type
@bkietz Please make sure you rebase before merging, since these are rather large changes that could unwillingly affect other parts of the code. |
2798197
to
028bca2
Compare
@assignUser It looks like at least one CI job is using the latest nightly a little too oportunistically and causing the CI to fail here! https://github.com/apache/arrow/actions/runs/6658040302/job/18094056322?pr=37792#step:6:21 |
We'll add that to the PR cleaning up the edges on the (somewhat large) change to the R build system defaults. You can safely ignore that CI failure (or if you can't safely ignore it than I volunteer to fix it!) |
Thanks @paleolimbot ! I'll merge now |
After merging your PR, Conbench analyzed the 5 benchmarking runs that have been run so far on merge-commit f622139. There were 2 benchmark results indicating a performance regression:
The full Conbench report has more details. It also includes information about 3 possible false positives for unstable benchmarks that are known to sometimes produce them. |
Thank you for everyone who commented, reviewed and helped get this PR merged |
This was introduced by apacheGH-37792.
This was introduced by apacheGH-37792.
### Rationale for this change We need explicit cast for `int64_t` to `size_t` conversion for i386 environment. This was introduced by GH-37792. ### What changes are included in this PR? Add explicit cast. ### Are these changes tested? Yes. ### Are there any user-facing changes? No. * Closes: #38556 Authored-by: Sutou Kouhei <kou@clear-code.com> Signed-off-by: Sutou Kouhei <kou@clear-code.com>
…he#38557) ### Rationale for this change We need explicit cast for `int64_t` to `size_t` conversion for i386 environment. This was introduced by apacheGH-37792. ### What changes are included in this PR? Add explicit cast. ### Are these changes tested? Yes. ### Are there any user-facing changes? No. * Closes: apache#38556 Authored-by: Sutou Kouhei <kou@clear-code.com> Signed-off-by: Sutou Kouhei <kou@clear-code.com>
…pache#37792) ### Rationale for this change After the PR changing the spec and schema ( apache#37526 ) is accepted, this PR will be undrafted. It adds the minimal addition of a C++ implementation and was extracted from the original C++ Utf8View pr ( apache#35628 ) for ease of review. ### What changes are included in this PR? - The new types are available with new subclasses of DataType, Array, ArrayBuilder, ... - The values of string view arrays can be visited as `std::string_view` as with StringArray - String view arrays can be round tripped through IPC and integration JSON * Closes: apache#37710 Relevant mailing list discussions: * https://lists.apache.org/thread/l8t1vj5x1wdf75mdw3wfjvnxrfy5xomy * https://lists.apache.org/thread/3qhkomvvc69v3gkotbwldyko7yk9cs9k Authored-by: Benjamin Kietzman <bengilgit@gmail.com> Signed-off-by: Benjamin Kietzman <bengilgit@gmail.com>
…he#38557) ### Rationale for this change We need explicit cast for `int64_t` to `size_t` conversion for i386 environment. This was introduced by apacheGH-37792. ### What changes are included in this PR? Add explicit cast. ### Are these changes tested? Yes. ### Are there any user-facing changes? No. * Closes: apache#38556 Authored-by: Sutou Kouhei <kou@clear-code.com> Signed-off-by: Sutou Kouhei <kou@clear-code.com>
…pache#37792) ### Rationale for this change After the PR changing the spec and schema ( apache#37526 ) is accepted, this PR will be undrafted. It adds the minimal addition of a C++ implementation and was extracted from the original C++ Utf8View pr ( apache#35628 ) for ease of review. ### What changes are included in this PR? - The new types are available with new subclasses of DataType, Array, ArrayBuilder, ... - The values of string view arrays can be visited as `std::string_view` as with StringArray - String view arrays can be round tripped through IPC and integration JSON * Closes: apache#37710 Relevant mailing list discussions: * https://lists.apache.org/thread/l8t1vj5x1wdf75mdw3wfjvnxrfy5xomy * https://lists.apache.org/thread/3qhkomvvc69v3gkotbwldyko7yk9cs9k Authored-by: Benjamin Kietzman <bengilgit@gmail.com> Signed-off-by: Benjamin Kietzman <bengilgit@gmail.com>
…he#38557) ### Rationale for this change We need explicit cast for `int64_t` to `size_t` conversion for i386 environment. This was introduced by apacheGH-37792. ### What changes are included in this PR? Add explicit cast. ### Are these changes tested? Yes. ### Are there any user-facing changes? No. * Closes: apache#38556 Authored-by: Sutou Kouhei <kou@clear-code.com> Signed-off-by: Sutou Kouhei <kou@clear-code.com>
Rationale for this change
After the PR changing the spec and schema ( #37526 ) is accepted, this PR will be undrafted. It adds the minimal addition of a C++ implementation and was extracted from the original C++ Utf8View pr ( #35628 ) for ease of review.
What changes are included in this PR?
std::string_view
as with StringArrayRelevant mailing list discussions: