GH-35765: [C++] Split vector_selection.cc into more compilation units #35751

Merged
merged 11 commits into apache:main from vector_selection_split
May 30, 2023


felipecrv
Contributor

@felipecrv felipecrv commented May 24, 2023

Rationale for this change

When working on #35001 I had a hard time figuring out where to place the code for all possible combinations of filters and REE (run-end encoded) data. vector_selection.cc is hard to follow with so many kernels implemented in a single file. This PR splits out the two biggest ones: filter and take. Code shared by both lives in vector_selection_internal.cc, and vector_selection.cc is left with registering the functions and a few smaller kernels.

What changes are included in this PR?

  • vector_selection_(internal|take|filter).(cc|h) source files were extracted from vector_selection.cc

Are these changes tested?

Yes, by existing tests.

@felipecrv
Contributor Author

@pitrou @westonpace I tried to separate the code-moving commits from the ones that make changes to explain what's being done step-by-step.

Member

@pitrou pitrou left a comment

This PR is not minor in code changes, even though most of it is moving code around. Can you open a GH issue?

cpp/src/arrow/compute/kernels/vector_selection_filter.h (outdated)
const FunctionOptions* default_options,
FunctionRegistry* registry);

Status PreallocateData(KernelContext* ctx, int64_t length, int bit_width,
Member

The name PreallocateData is probably too general, since it's selection-specific?

Contributor Author

Renamed to PreallocatePrimitiveArrayData. It's here because it's used by both take and filter on primitive arrays.

@@ -58,2119 +60,22 @@ using internal::OptionalBitIndexer;
namespace compute {
namespace internal {

using FilterState = OptionsWrapper<FilterOptions>;
using TakeState = OptionsWrapper<TakeOptions>;

int64_t GetFilterOutputSize(const ArraySpan& filter,
FilterOptions::NullSelectionBehavior null_selection) {
Member

Is this function still useful, since it's a trivial wrapper around GetBitmapFilterOutputSize?

Contributor Author

It becomes this in #35750.

int64_t GetFilterOutputSize(const ArraySpan& filter,
                            FilterOptions::NullSelectionBehavior null_selection) {
  if (filter.type->id() == Type::BOOL) {
    return GetBitmapFilterOutputSize(filter, null_selection);
  }
  return GetREEFilterOutputSize(filter, null_selection);
}

I will move it completely to vector_selection_filter.cc now and remove it here, but I haven't decided yet if the REEx* filter kernels are going to be implemented in vector_selection_filter.cc or in a separate .cc, so I might undo this later.
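
For readers unfamiliar with what the wrapper computes, here is a minimal sketch of the output-size rule for a boolean filter under the two null-selection behaviors. It is not Arrow's implementation, which counts bits in the validity and data bitmaps. BoolFilter, NullSelection, and SketchBitmapFilterOutputSize are hypothetical stand-ins, and the REE case is left out entirely.

```cpp
#include <cstdint>
#include <optional>
#include <vector>

// Stand-in for FilterOptions::NullSelectionBehavior (DROP / EMIT_NULL).
enum class NullSelection { kDrop, kEmitNull };

// Hypothetical stand-in for a boolean filter array: std::nullopt models a
// null slot. Arrow itself works on validity and data bitmaps instead.
using BoolFilter = std::vector<std::optional<bool>>;

// Number of output elements selected by a boolean filter.
// - A valid `true` slot always selects one element.
// - A null slot selects nothing under kDrop, and emits one (null) output
//   element under kEmitNull.
int64_t SketchBitmapFilterOutputSize(const BoolFilter& filter,
                                     NullSelection null_selection) {
  int64_t output_size = 0;
  for (const auto& slot : filter) {
    if (slot.has_value()) {
      output_size += *slot ? 1 : 0;
    } else if (null_selection == NullSelection::kEmitNull) {
      ++output_size;
    }
  }
  return output_size;
}
```

A dispatching wrapper like the one above would call a function of this shape for BOOL filters and a separate one for run-end encoded filters.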

cpp/src/arrow/compute/kernels/vector_selection_internal.cc (outdated)
@github-actions github-actions bot added the "awaiting committer review" label and removed the "awaiting review" label May 25, 2023
@felipecrv felipecrv changed the title MINOR: [C++] Split vector_selection.cc into more compilation units GH-35765: [C++] Split vector_selection.cc into more compilation units May 25, 2023
@assignUser
Member

For reference: A minor PR touches <=2 files with no changes to behaviour. (e.g. fixing typos and similar things)

@felipecrv felipecrv requested a review from pitrou May 25, 2023 17:04
@felipecrv felipecrv force-pushed the vector_selection_split branch from 00a43f7 to 5e3e42e Compare May 25, 2023 17:16

#include "arrow/array/data.h"
#include "arrow/compute/api_vector.h"
#include "arrow/compute/kernels/vector_selection_internal.h"
Member

We should not include *internal.h from a non-*internal.h header, because *internal.h headers aren't installed.

Contributor Author

vector_selection_filter.h is also internal: it is included only by vector_selection.cc. Should I rename it with an internal suffix in the name?

Contributor Author

But I just realized it's not needed. 😄 This is why I made PopulateFilterKernels take an output param at some point. That allows me to forward-declare SelectionKernelData and not include the full header.

Contributor Author

As it's been done with vector_selection_take.h
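
The forward-declaration approach described above can be sketched in a single file as follows. This is an illustration under assumptions, not the code from the PR: the exact signature of PopulateFilterKernels and the contents of SelectionKernelData are hypothetical here, and the real struct holds kernel signatures and exec functions rather than a name.

```cpp
#include <string>
#include <vector>

// --- What the kernel-specific header would contain ---
// A forward declaration is enough to name the type behind a pointer
// parameter, so the header that defines SelectionKernelData does not
// have to be included here.
struct SelectionKernelData;
void PopulateFilterKernels(std::vector<SelectionKernelData>* out);

// --- What the .cc files would see ---
// Hypothetical definition, used only to make the sketch compile.
struct SelectionKernelData {
  std::string name;
};

// The definition needs the complete type, which is available in the .cc
// file that includes the full internal header.
void PopulateFilterKernels(std::vector<SelectionKernelData>* out) {
  out->clear();
  out->push_back(SelectionKernelData{"primitive"});
  out->push_back(SelectionKernelData{"binary"});
}
```

Only the translation units that define or call the function need the complete type, which keeps the kernel-specific header free of the heavier include.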

Member

IIRC the internal.h suffix is important as the cmake uses it to decide which headers get installed. So if it is actually a private/internal header it should be renamed imo. But @kou would know best!

Member

IIRC the internal.h suffix is important as the cmake uses it to decide which headers get installed. So if it is actually a private/internal header it should be renamed imo.

Correct.

Contributor Author

Makes sense. I'm also renaming vector_selection_(filter|take).(h|cc) and pushing now.

@github-actions github-actions bot added the "awaiting changes" label and removed the "awaiting committer review" label May 25, 2023
@github-actions github-actions bot added the "awaiting change review" and "awaiting changes" labels and removed the "awaiting changes" and "awaiting change review" labels May 26, 2023
@github-actions github-actions bot added the "awaiting change review" label and removed the "awaiting changes" label May 26, 2023
@pitrou
Member

pitrou commented May 30, 2023

@felipecrv Is this ready for review again?

@felipecrv
Contributor Author

@felipecrv Is this ready for review again?

Yes.

Member

@pitrou pitrou left a comment

+1, thanks for doing this @felipecrv

@pitrou pitrou merged commit 44d1b61 into apache:main May 30, 2023
@felipecrv felipecrv deleted the vector_selection_split branch May 30, 2023 15:51
@ursabot

ursabot commented May 31, 2023

Benchmark runs are scheduled for baseline = 431785f and contender = 44d1b61. 44d1b61 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️3.29% ⬆️0.56%] test-mac-arm
[Failed ⬇️0.32% ⬆️9.55%] ursa-i9-9960x
[Finished ⬇️0.91% ⬆️0.88%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 44d1b615 ec2-t3-xlarge-us-east-2
[Finished] 44d1b615 test-mac-arm
[Finished] 44d1b615 ursa-i9-9960x
[Finished] 44d1b615 ursa-thinkcentre-m75q
[Finished] 431785f3 ec2-t3-xlarge-us-east-2
[Failed] 431785f3 test-mac-arm
[Failed] 431785f3 ursa-i9-9960x
[Finished] 431785f3 ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Successfully merging this pull request may close these issues.

[C++] Split vector_selection.cc into more compilation units
5 participants