Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-37710: [C++][Integration] Add C++ Utf8View implementation #37792

Merged
merged 5 commits into from
Oct 26, 2023

Conversation

bkietz
Copy link
Member

@bkietz bkietz commented Sep 19, 2023

Rationale for this change

After the PR changing the spec and schema ( #37526 ) is accepted, this PR will be undrafted. It adds the minimal addition of a C++ implementation and was extracted from the original C++ Utf8View pr ( #35628 ) for ease of review.

What changes are included in this PR?

  • The new types are available with new subclasses of DataType, Array, ArrayBuilder, ...
  • The values of string view arrays can be visited as std::string_view as with StringArray
  • String view arrays can be round tripped through IPC and integration JSON

Relevant mailing list discussions:

cpp/src/arrow/array/builder_binary.cc Outdated Show resolved Hide resolved
cpp/src/arrow/array/builder_binary.cc Outdated Show resolved Hide resolved
cpp/src/arrow/array/data.cc Show resolved Hide resolved
cpp/src/arrow/array/validate.cc Outdated Show resolved Hide resolved
cpp/src/arrow/array/validate.cc Outdated Show resolved Hide resolved
cpp/src/arrow/builder.cc Outdated Show resolved Hide resolved
cpp/src/arrow/type_traits.h Show resolved Hide resolved
cpp/src/arrow/type_traits.h Show resolved Hide resolved
Comment on lines 684 to 686
using is_binary_like_type = std::integral_constant<
bool, (is_base_binary_type<T>::value && !is_string_like_type<T>::value) ||
is_binary_view_type<T>::value || is_fixed_size_binary_type<T>::value>;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This changes the meaning of is_binary_like_type. For list-views I added new predicates and kept the existing predicates' semantics.

Illustrate in this graph https://gist.github.com/felipecrv/3c02f3784221d946dec1b031c6d400db

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @felipecrv btw. This may silently produce incorrect binary-view handling in operations not touched by this PR.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These traits are used in very few places. I think the current arrangement is fine; "binary like" = has a byte array, "string like" = has a byte array which is utf8.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've made a follow up to address this #38204

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The question is whether these are used by third-party code. These are not internal APIs, and there is a risk that user code will then try to use binary view data in non-compatible routines, and get UB or data corruption.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've reverted the change to is_binary_like_type and is_string_like_type

@pitrou
Copy link
Member

pitrou commented Oct 26, 2023

@bkietz Please make sure you rebase before merging, since these are rather large changes that could unwillingly affect other parts of the code.

@paleolimbot
Copy link
Member

@assignUser It looks like at least one CI job is using the latest nightly a little too oportunistically and causing the CI to fail here! https://github.com/apache/arrow/actions/runs/6658040302/job/18094056322?pr=37792#step:6:21

@paleolimbot
Copy link
Member

We'll add that to the PR cleaning up the edges on the (somewhat large) change to the R build system defaults. You can safely ignore that CI failure (or if you can't safely ignore it than I volunteer to fix it!)

@bkietz
Copy link
Member Author

bkietz commented Oct 26, 2023

Thanks @paleolimbot ! I'll merge now

@bkietz bkietz merged commit f622139 into apache:main Oct 26, 2023
32 of 35 checks passed
@bkietz bkietz removed the awaiting change review Awaiting change review label Oct 26, 2023
@bkietz bkietz deleted the 37710-cxx-impl-string-view branch October 26, 2023 21:29
@conbench-apache-arrow
Copy link

After merging your PR, Conbench analyzed the 5 benchmarking runs that have been run so far on merge-commit f622139.

There were 2 benchmark results indicating a performance regression:

The full Conbench report has more details. It also includes information about 3 possible false positives for unstable benchmarks that are known to sometimes produce them.

@alamb
Copy link
Contributor

alamb commented Oct 27, 2023

Thank you for everyone who commented, reviewed and helped get this PR merged

kou added a commit to kou/arrow that referenced this pull request Nov 2, 2023
kou added a commit to kou/arrow that referenced this pull request Nov 2, 2023
kou added a commit that referenced this pull request Nov 6, 2023
### Rationale for this change

We need explicit cast for `int64_t` to `size_t` conversion for i386 environment.

This was introduced by GH-37792.

### What changes are included in this PR?

Add explicit cast.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

No.
* Closes: #38556

Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
JerAguilon pushed a commit to JerAguilon/arrow that referenced this pull request Nov 9, 2023
…he#38557)

### Rationale for this change

We need explicit cast for `int64_t` to `size_t` conversion for i386 environment.

This was introduced by apacheGH-37792.

### What changes are included in this PR?

Add explicit cast.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

No.
* Closes: apache#38556

Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
loicalleyne pushed a commit to loicalleyne/arrow that referenced this pull request Nov 13, 2023
…pache#37792)

### Rationale for this change

After the PR changing the spec and schema ( apache#37526 ) is accepted, this PR will be undrafted. It adds the minimal addition of a C++ implementation and was extracted from the original C++ Utf8View pr ( apache#35628 ) for ease of review.

### What changes are included in this PR?

- The new types are available with new subclasses of DataType, Array, ArrayBuilder, ...
- The values of string view arrays can be visited as `std::string_view` as with StringArray
- String view arrays can be round tripped through IPC and integration JSON

* Closes: apache#37710

Relevant mailing list discussions: 
* https://lists.apache.org/thread/l8t1vj5x1wdf75mdw3wfjvnxrfy5xomy
* https://lists.apache.org/thread/3qhkomvvc69v3gkotbwldyko7yk9cs9k

Authored-by: Benjamin Kietzman <bengilgit@gmail.com>
Signed-off-by: Benjamin Kietzman <bengilgit@gmail.com>
loicalleyne pushed a commit to loicalleyne/arrow that referenced this pull request Nov 13, 2023
…he#38557)

### Rationale for this change

We need explicit cast for `int64_t` to `size_t` conversion for i386 environment.

This was introduced by apacheGH-37792.

### What changes are included in this PR?

Add explicit cast.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

No.
* Closes: apache#38556

Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
dgreiss pushed a commit to dgreiss/arrow that referenced this pull request Feb 19, 2024
…pache#37792)

### Rationale for this change

After the PR changing the spec and schema ( apache#37526 ) is accepted, this PR will be undrafted. It adds the minimal addition of a C++ implementation and was extracted from the original C++ Utf8View pr ( apache#35628 ) for ease of review.

### What changes are included in this PR?

- The new types are available with new subclasses of DataType, Array, ArrayBuilder, ...
- The values of string view arrays can be visited as `std::string_view` as with StringArray
- String view arrays can be round tripped through IPC and integration JSON

* Closes: apache#37710

Relevant mailing list discussions: 
* https://lists.apache.org/thread/l8t1vj5x1wdf75mdw3wfjvnxrfy5xomy
* https://lists.apache.org/thread/3qhkomvvc69v3gkotbwldyko7yk9cs9k

Authored-by: Benjamin Kietzman <bengilgit@gmail.com>
Signed-off-by: Benjamin Kietzman <bengilgit@gmail.com>
dgreiss pushed a commit to dgreiss/arrow that referenced this pull request Feb 19, 2024
…he#38557)

### Rationale for this change

We need explicit cast for `int64_t` to `size_t` conversion for i386 environment.

This was introduced by apacheGH-37792.

### What changes are included in this PR?

Add explicit cast.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

No.
* Closes: apache#38556

Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[C++][Integration] Add C++ implementation of Utf8View
6 participants