GH-32104: [C++] Add support for Run-End encoded data to Arrow #33641

Merged: 31 commits, Feb 17, 2023

@github-actions

⚠️ GitHub issue #32104 has been automatically assigned in GitHub to PR creator.

@felipecrv felipecrv force-pushed the ree branch 3 times, most recently from dd7fd09 to f5d88b9 Compare January 17, 2023 23:51
Member

@pitrou pitrou left a comment

This is a first pass on basic facilities. I didn't look at kernels and REE utils yet.

cpp/src/arrow/CMakeLists.txt (outdated, resolved)
cpp/src/arrow/array.h (outdated, resolved)
cpp/src/arrow/type_test.cc (outdated, resolved)
@@ -1177,6 +1197,7 @@ constexpr bool is_nested(Type::type type_id) {
case Type::STRUCT:
case Type::SPARSE_UNION:
case Type::DENSE_UNION:
case Type::RUN_END_ENCODED:
Member

I'm not sure this is right? These types are semantically (logically) nested. Run-end encoded arrays are only physically nested. I don't know which choice makes the most sense and/or requires the least special-casing.

@lidavidm @bkietz What do you think?

Member

From a quick glance, I think this is mostly used as 'physically nested' in the codebase, so this is OK. I don't think we have the right abstractions to differentiate between physical and logical nesting or to deal with encodings. For example, we often handle dictionaries by just decoding them, and kernel implementations very often get lost in the weeds of encoding details.
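
To make the physical/logical distinction in this thread concrete, here is a minimal standalone C++ sketch. `RunEndEncoded` and `Decode` are hypothetical names made up for illustration and are not Arrow's classes or API. The sketch assumes run ends are stored as cumulative logical end positions: the structure physically holds two children, while logically it stands for one flat sequence.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a run-end encoded column (not an Arrow class).
// Physically it holds two child buffers; logically it is one flat sequence.
struct RunEndEncoded {
  std::vector<int32_t> run_ends;  // cumulative logical end of each run
  std::vector<int64_t> values;    // one value per run
};

// Expands the two physical children into the flat logical sequence.
std::vector<int64_t> Decode(const RunEndEncoded& ree) {
  assert(ree.run_ends.size() == ree.values.size());
  std::vector<int64_t> out;
  int32_t start = 0;
  for (std::size_t i = 0; i < ree.run_ends.size(); ++i) {
    assert(ree.run_ends[i] > start);  // run ends must be strictly increasing
    out.insert(out.end(), static_cast<std::size_t>(ree.run_ends[i] - start),
               ree.values[i]);
    start = ree.run_ends[i];
  }
  return out;
}
```

With run ends `{3, 5, 6}` and values `{7, 8, 9}`, `Decode` yields the six-element logical sequence `7, 7, 7, 8, 8, 9`, even though only three values are physically stored.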

cpp/src/arrow/type.h (resolved)
cpp/src/arrow/array/builder_encoded.cc (outdated, resolved)
cpp/src/arrow/array/concatenate.cc (resolved)
cpp/src/arrow/array/concatenate.cc (outdated, resolved)
cpp/src/parquet/arrow/path_internal.cc (outdated, resolved)
cpp/src/arrow/visitor_generate.h (outdated, resolved)
Member

@pitrou pitrou left a comment

Second part of code review: it doesn't seem clear what the offset and length of a run-end encoded array mean. Once this is settled and applied consistently, I'll be able to do a third review pass.
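
One possible reading of the question raised here, sketched in standalone C++ under stated assumptions: the array's offset and length are logical (they count decoded elements), while the run-ends child is left unsliced, so a lookup has to translate a logical position into the index of the run that contains it. `FindPhysicalIndex`, `FindPhysicalRange`, and `PhysicalRange` are hypothetical names for this illustration, not functions confirmed by this thread.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Index of the run containing logical position `offset + i`.
// Assumes `run_ends` holds strictly increasing cumulative end positions.
int64_t FindPhysicalIndex(const std::vector<int32_t>& run_ends, int64_t offset,
                          int64_t i) {
  // The containing run is the first one whose end is greater than the position.
  auto it = std::upper_bound(run_ends.begin(), run_ends.end(), offset + i);
  assert(it != run_ends.end());  // position must lie inside the encoded data
  return it - run_ends.begin();
}

// Runs touched by a logical slice: index of the first run and number of runs.
struct PhysicalRange {
  int64_t offset;
  int64_t length;
};

PhysicalRange FindPhysicalRange(const std::vector<int32_t>& run_ends,
                                int64_t offset, int64_t length) {
  if (length == 0) return {0, 0};  // an empty logical slice covers no runs
  const int64_t first = FindPhysicalIndex(run_ends, offset, 0);
  const int64_t last = FindPhysicalIndex(run_ends, offset, length - 1);
  return {first, last - first + 1};
}
```

For run ends `{3, 5, 6}`, a logical slice with offset 2 and length 3 covers logical positions 2, 3, and 4, which fall in runs 0, 1, and 1, so the physical range starts at run 0 and spans 2 runs. Binary search keeps each lookup logarithmic in the number of runs.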

cpp/src/arrow/array/validate.cc (outdated, resolved)
cpp/src/arrow/array/validate.cc (outdated, resolved)
cpp/src/arrow/array/validate.cc (outdated, resolved)
cpp/src/arrow/array/validate.cc (outdated, resolved)
cpp/src/arrow/array/array_encoded.cc (outdated, resolved)
cpp/src/arrow/array/array_encoded.h (outdated, resolved)
cpp/src/arrow/array/array_encoded.h (outdated, resolved)
cpp/src/arrow/array/array_encoded.h (outdated, resolved)
cpp/src/arrow/array/validate.cc (outdated, resolved)
cpp/src/arrow/array.h (outdated, resolved)
@felipecrv felipecrv force-pushed the ree branch 2 times, most recently from 7ff1ce7 to fab46a5 Compare February 3, 2023 14:41
@felipecrv
Contributor Author

@lidavidm @wjones127 @westonpace @zeroshade @bkietz

Before this is fully ready for review, I want to:

  1. add the Scalar type and remove the VisitTypeInline
  2. rewrite the comparator code
  3. add more tests

but feel free to skim the code and give higher-level architectural feedback before a more detailed review.

@felipecrv felipecrv force-pushed the ree branch 8 times, most recently from 6396a80 to 2811015 Compare February 7, 2023 19:45
Since jemalloc.h does

    #  define malloc je_malloc

when JEMALLOC_MANGLE is defined, we can get this error in CI during a
unity build:

    /arrow/cpp/src/arrow/vendored/ProducerConsumerQueue.h: In constructor 'arrow_vendored::folly::ProducerConsumerQueue<T>::ProducerConsumerQueue(uint32_t)':
    /arrow/cpp/src/arrow/vendored/ProducerConsumerQueue.h:82:39: error: 'je_arrow_malloc' is not a member of 'std'; did you mean 'je_arrow_malloc'?
       82 |         records_(static_cast<T*>(std::malloc(sizeof(T) * size))),
          |                                       ^~~~~~
    jemalloc_ep-prefix/src/jemalloc_ep/dist/include/jemalloc/jemalloc.h:254:32: note: 'je_arrow_malloc' declared here
      254 |     void JEMALLOC_SYS_NOTHROW *je_malloc(size_t size)
          |                                ^~~~~~~~~
    /arrow/cpp/src/arrow/vendored/ProducerConsumerQueue.h: In destructor 'arrow_vendored::folly::ProducerConsumerQueue<T>::~ProducerConsumerQueue()':
    /arrow/cpp/src/arrow/vendored/ProducerConsumerQueue.h:106:10: error: 'je_arrow_free' is not a member of 'std'; did you mean 'je_arrow_free'?
      106 |     std::free(records_);
          |          ^~~~
    jemalloc_ep-prefix/src/jemalloc_ep/dist/include/jemalloc/jemalloc.h:269:43: note: 'je_arrow_free' declared here
      269 | JEMALLOC_EXPORT void JEMALLOC_SYS_NOTHROW je_free(void *ptr)
          |                                           ^~~~~~~
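
The failure mode in this commit message can be reproduced without jemalloc. The sketch below uses hypothetical stand-ins (`je_malloc`, `je_free`, and the `AllocVia*` helpers are defined here for illustration only) to show why an object-like macro named `malloc` breaks `std::malloc` once both end up in the same translation unit, as can happen in a unity build. The `#undef` at the end is one possible workaround and an assumption of this sketch, not necessarily the fix applied in this PR.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Hypothetical stand-ins for an allocator's renamed entry points.
inline void* je_malloc(std::size_t n) { return std::malloc(n); }
inline void je_free(void* p) { std::free(p); }

// Object-like macros, in the style of the jemalloc.h line quoted above.
#define malloc je_malloc
#define free je_free

// From here on, `std::malloc(n)` is preprocessed into `std::je_malloc(n)`,
// which does not compile because je_malloc is not declared in namespace std:
//   void* broken = std::malloc(16);

// Unqualified calls still compile: they resolve to the global stand-ins.
inline void* AllocViaMacro(std::size_t n) { return malloc(n); }
inline void FreeViaMacro(void* p) { free(p); }

// Possible workaround (assumption): drop the macros before any code that
// spells the calls as std::malloc / std::free.
#undef malloc
#undef free

inline void* AllocViaStd(std::size_t n) {
  void* p = std::malloc(n);
  assert(p != nullptr || n == 0);  // sketch only: treat failure as a bug
  return p;
}
```

The problem only appears when the macro definition precedes the qualified call in the same translation unit, which is why a build that compiles the files separately may not show it.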
@felipecrv
Contributor Author

@zeroshade this is now passing all the builds.

Member

@zeroshade zeroshade left a comment

LGTM

cpp/src/arrow/type.h (outdated, resolved)
cpp/src/arrow/testing/json_internal.cc (resolved)
cpp/src/arrow/array/array_test.cc (outdated, resolved)
cpp/src/arrow/array/array_base.cc (resolved)
cpp/src/arrow/util/ree_util.h (outdated, resolved)
cpp/src/arrow/util/ree_util_test.cc (resolved)
cpp/src/arrow/scalar.cc (outdated, resolved)
cpp/src/arrow/array/concatenate.cc (outdated, resolved)
cpp/src/arrow/array/builder_run_end.cc (resolved)
cpp/src/arrow/array/builder_run_end.cc (outdated, resolved)
Member

@lidavidm lidavidm left a comment

Thanks!

@felipecrv
Contributor Author

felipecrv commented Feb 17, 2023

@lidavidm should I rebase and force-push so this CI issue goes away, or can this be merged as is?

@lidavidm
Member

I kicked the builds, let's see.

@zeroshade zeroshade merged commit 1264e40 into apache:main Feb 17, 2023
gringasalpastor pushed a commit to gringasalpastor/arrow that referenced this pull request Feb 17, 2023
…pache#33641)

This PR gathers work from multiple PRs that can be closed after this one is merged:

 - Closes apache#13752
 - Closes apache#13754
 - Closes apache#13842
 - Closes apache#13882
 - Closes apache#13916
 - Closes apache#14063
 - Closes apache#13970

And the issues associated with those PRs can also be closed:

 - Fixes apache#20350
 - Add RunEndEncodedScalarType
 - Fixes apache#32543
 - Fixes apache#32544
 - Fixes apache#32688
 - Fixes apache#32731
 - Fixes apache#32772
 - Fixes apache#32774

* Closes: apache#32104

Lead-authored-by: Felipe Oliveira Carvalho <felipekde@gmail.com>
Co-authored-by: Tobias Zagorni <tobias@zagorni.eu>
Signed-off-by: Matt Topol <zotthewizard@gmail.com>
@ursabot

ursabot commented Feb 17, 2023

Benchmark runs are scheduled for baseline = 157b8f5 and contender = 1264e40. 1264e40 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Finished ⬇️0.58% ⬆️0.03%] test-mac-arm
[Finished ⬇️0.0% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.41% ⬆️0.03%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 1264e409 ec2-t3-xlarge-us-east-2
[Finished] 1264e409 test-mac-arm
[Finished] 1264e409 ursa-i9-9960x
[Finished] 1264e409 ursa-thinkcentre-m75q
[Finished] 157b8f55 ec2-t3-xlarge-us-east-2
[Finished] 157b8f55 test-mac-arm
[Finished] 157b8f55 ursa-i9-9960x
[Finished] 157b8f55 ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@felipecrv felipecrv deleted the ree branch February 17, 2023 18:50
fatemehp pushed a commit to fatemehp/arrow that referenced this pull request Feb 24, 2023
sighingnow added a commit to v6d-io/v6d that referenced this pull request May 8, 2023
Related issue number
--------------------

See also: 
- Homebrew/homebrew-core#129859
- apache/arrow#33608
- apache/arrow#33641

Signed-off-by: Tao He <linzhu.ht@alibaba-inc.com>