Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-32107: [C++] Create Filter Kernels for REE values #35001

Draft
wants to merge 46 commits into
base: main
Choose a base branch
from

Conversation

felipecrv
Copy link
Contributor

@felipecrv felipecrv commented Apr 10, 2023

Rationale for this change

Run-end encoded arrays can very compactly represent filters with repeated values and run-end encoded arrays can be filtered efficiently without inflating all the runs for filtering. This PR implements algorithms that make it possible to leverage these possibilities.

What changes are included in this PR?

Are these changes tested?

By unit tests that handle all the REE/Plain combinations.

Are there any user-facing changes?

Users don't control which filter kernels get picked directly, so there isn't a user-facing API area yet.

@github-actions
Copy link

@github-actions
Copy link

⚠️ GitHub issue #32107 has been automatically assigned in GitHub to PR creator.

@github-actions
Copy link

⚠️ GitHub issue #32107 has been automatically assigned in GitHub to PR creator.

zeroshade pushed a commit that referenced this pull request Apr 13, 2023
…lable to other compute kernels (#35002)

### Rationale for this change

Make it easier to implement correct kernels dealing with run-end encoded data.

### What changes are included in this PR?

Extraction of code to a header so kernels like REE filter kernels (see #35001) can use them.

### Are these changes tested?

Indirectly via the `run_end_encode` and `run_end_decode` tests.

### Are there any user-facing changes?

No, these are internal compute headers.

Authored-by: Felipe Oliveira Carvalho <felipekde@gmail.com>
Signed-off-by: Matt Topol <zotthewizard@gmail.com>
liujiacheng777 pushed a commit to LoongArch-Python/arrow that referenced this pull request May 11, 2023
…lable to other compute kernels (apache#35002)

### Rationale for this change

Make it easier to implement correct kernels dealing with run-end encoded data.

### What changes are included in this PR?

Extraction of code to a header so kernels like REE filter kernels (see apache#35001) can use them.

### Are these changes tested?

Indirectly via the `run_end_encode` and `run_end_decode` tests.

### Are there any user-facing changes?

No, these are internal compute headers.

Authored-by: Felipe Oliveira Carvalho <felipekde@gmail.com>
Signed-off-by: Matt Topol <zotthewizard@gmail.com>
@github-actions
Copy link

⚠️ GitHub issue #32107 has been automatically assigned in GitHub to PR creator.

@alippai
Copy link
Contributor

alippai commented May 13, 2023

@felipecrv do you have any anecdotal benchmark numbers?

@felipecrv
Copy link
Contributor Author

felipecrv commented May 15, 2023

@felipecrv do you have any anecdotal benchmark numbers?

@alippai For PlainxREE, only the intuition that for a sufficiently run-end-compressed filter it will be faster than scanning a full bitmap.

But I will be writing benchmarks for the 3 combinations when I complete the unit tests.

ArgusLi pushed a commit to Bit-Quill/arrow that referenced this pull request May 15, 2023
…lable to other compute kernels (apache#35002)

### Rationale for this change

Make it easier to implement correct kernels dealing with run-end encoded data.

### What changes are included in this PR?

Extraction of code to a header so kernels like REE filter kernels (see apache#35001) can use them.

### Are these changes tested?

Indirectly via the `run_end_encode` and `run_end_decode` tests.

### Are there any user-facing changes?

No, these are internal compute headers.

Authored-by: Felipe Oliveira Carvalho <felipekde@gmail.com>
Signed-off-by: Matt Topol <zotthewizard@gmail.com>
rtpsw pushed a commit to rtpsw/arrow that referenced this pull request May 16, 2023
…lable to other compute kernels (apache#35002)

### Rationale for this change

Make it easier to implement correct kernels dealing with run-end encoded data.

### What changes are included in this PR?

Extraction of code to a header so kernels like REE filter kernels (see apache#35001) can use them.

### Are these changes tested?

Indirectly via the `run_end_encode` and `run_end_decode` tests.

### Are there any user-facing changes?

No, these are internal compute headers.

Authored-by: Felipe Oliveira Carvalho <felipekde@gmail.com>
Signed-off-by: Matt Topol <zotthewizard@gmail.com>
@felipecrv felipecrv force-pushed the filter_kernel branch 3 times, most recently from bebe755 to f53be80 Compare May 20, 2023 00:32
pitrou pushed a commit that referenced this pull request May 30, 2023
…#35751)

### Rationale for this change

When working on #35001 I had a hard time figuring where to place the code for all possible combinations of filters and REE data. `vector_selection.cc` is hard to follow with so many kernels implemented in a single file. This PR splits the two biggest ones: filter and take. Stuff that can be shared by both stays is in `vector_selection_internal.cc` and `vector_selection.cc` is concerned with the registering of the functions and a few smaller kernels.

### What changes are included in this PR?

- [x]  `vector_selection_(internal|take|filter).(cc|h)` source files were extracted from `vector_selection.cc`

### Are these changes tested?

Yes, by existing tests.

* Closes: #35765

Authored-by: Felipe Oliveira Carvalho <felipekde@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
@felipecrv felipecrv changed the title GH-32107: [C++] Create Filter Kernels for REE arrays and or filters GH-32107: [C++] Create Filter Kernels for REE values Jun 16, 2023
@felipecrv felipecrv force-pushed the filter_kernel branch 2 times, most recently from 897b922 to 7b8a753 Compare June 17, 2023 02:59
Lead-authored-by: Tobias Zagorni <tobias@zagorni.eu>
Co-authored-by: Felipe Oliveira Carvalho <felipekde@gmail.com>
Size goes down by 1_147_272 bytes.

    $ ninja libarrow.so && strip release/libarrow.so.1300.0.0 && du -sb release/libarrow.so.1300.0.0
    [3/3] Creating library symlink release/libarrow.so.1300 release/libarrow.so
    104555424       release/libarrow.so.1300.0.0
Another reduction of 1_073_528 bytes.
This doesn't affect binary size at all, but documents better what is
expected of this visitor function.
As this has been incorporated in a different way -- there is a runtime
check on Plain x * filters that check if the filter array is run-end
encoded and dispatches the boolean run visits accordingly.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[C++] Create Filter Kernel on RLE data
2 participants