GH-34588: [C++][Python] Add a MetaFunction for "dictionary_decode" #35356
Conversation
Hi @westonpace @AlenkaF, if you have spare time, please review this PR.
Thank you for the contribution @R-JunmingChen !
I had a look at the Python part. Just have some minor nits.
- Can we add a combo of encode & decode to the test?
- There are some corrections needed from doctest and linter checks:
--- original//arrow/python/pyarrow/compute.py
+++ fixed//arrow/python/pyarrow/compute.py
@@ -387,7 +387,7 @@
array : decoded Array
The dictionary_decode result as a new Array
"""
- if(not isinstance(arr.type, pa.DictionaryType)):
+ if (not isinstance(arr.type, pa.DictionaryType)):
raise TypeError("Must pass a dictionary array")
return call_function("dictionary_decode", [arr], memory_pool)
--- original//arrow/python/pyarrow/tests/test_compute.py
+++ fixed//arrow/python/pyarrow/tests/test_compute.py
@@ -1750,7 +1750,7 @@
def test_dictionary_decode():
array = pa.array(["a", "a", "b", "c", "b"])
- dictionary_array = pa.array(["a", "a", "b", "c", "b"],
+ dictionary_array = pa.array(["a", "a", "b", "c", "b"],
pa.dictionary(pa.int8(), pa.string()))
assert array != dictionary_array
_________________ [doctest] pyarrow.compute.dictionary_decode __________________
340 arr : Array-like
341 memory_pool : MemoryPool, optional
342 memory pool to use for allocations during function execution.
343
344 Examples
345 --------
346 >>> import pyarrow as pa
347 >>> import pyarrow.compute as pc
348 >>> x = pa.array(["a", "a", "b"], pa.dictionary(pa.int8(), pa.string()))
349 >>> x
Expected:
<pyarrow.lib.DictionaryArray object at ...>
Got:
<pyarrow.lib.DictionaryArray object at 0x7f6adc3da050>
<BLANKLINE>
-- dictionary:
[
"a",
"b"
]
-- indices:
[
0,
0,
1
]
/opt/conda/envs/arrow/lib/python3.9/site-packages/pyarrow/compute.py:349: DocTestFailure
__________________ [doctest] pyarrow.lib.default_memory_pool ___________________
125 default_memory_pool()
126
127 Return the process-global memory pool.
128
129 Examples
130 --------
131 >>> default_memory_pool()
Expected:
<pyarrow.MemoryPool backend_name=... bytes_allocated=0 max_memory=...>
Got:
<pyarrow.MemoryPool backend_name=jemalloc bytes_allocated=192 max_memory=1205120>
In the latter you can use `...` instead of the specific bytes allocated, etc.
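To illustrate the `...` suggestion: with doctest's `ELLIPSIS` option enabled (as it is in Arrow's doctest configuration), `...` in the expected output matches any text, so reprs with nondeterministic parts such as addresses or byte counts still pass. A small self-contained sketch with a made-up repr:

```python
import doctest

def demo():
    """
    >>> print("<MemoryPool backend_name=jemalloc bytes_allocated=192 max_memory=1205120>")
    <MemoryPool backend_name=... bytes_allocated=... max_memory=...>
    """

# Run the docstring example with ELLIPSIS enabled; "..." absorbs the
# run-specific numbers, so the example passes.
finder = doctest.DocTestFinder()
runner = doctest.DocTestRunner(optionflags=doctest.ELLIPSIS)
for test in finder.find(demo):
    runner.run(test)
assert runner.failures == 0
```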
Sure, I have moved it. Do we need an additional comment for that?
I think it's ok, but if you want we can add a simple comment like
Thanks. I have a few more questions but I think this is getting close.
python/pyarrow/array.pxi (Outdated)
     """
-    return self.dictionary.take(self.indices)
+    return _pc().dictionary_decode(self)
Did the old approach work? I wonder if this will introduce any subtle differences? Is there a reason we need to change this method?
The old approach works well. I am not sure about the full risk of subtle differences. One difference is that we return the original input when the input is not a dictionary array, which has no further impact because only DictionaryArray has a dictionary_decode method. I think it's good to unify the dictionary_decode logic in Python, but your concern makes sense: the change risks affecting users who relied on the old dictionary_decode.
I have rolled back to the old code for now, but I think this needs further consideration.
I think it's good to unify the dictionary_decode logic in Python
This is a good point too. I'll let @jorisvandenbossche or @AlenkaF weigh in as well. We can move forward as-is for now and add this back if they want.
One other reason to keep the original is that it is possible to build pyarrow without compute, and then the old version would still work while using the compute version would raise (although I am not sure how important this is, as many things won't work in that case)
Co-authored-by: Weston Pace <weston.pace@gmail.com>
Thanks for this addition!
…e" (apache#35356) **Rationale for this change** This PR is for [Issue-34588](apache#34588). Discussing with @ westonpace, a MetaFunction for "dictionary_decode" is implemented instead of adding a compute kernel. **What changes are included in this PR?** C++: Meta Function of dictionary_decode. Python: Test **Are these changes tested?** One test in tests/test_compute.py * Closes: apache#34588 Lead-authored-by: Junming Chen <junming.chen.r@outlook.com> Co-authored-by: Alenka Frim <AlenkaF@users.noreply.github.com> Co-authored-by: Weston Pace <weston.pace@gmail.com> Signed-off-by: Weston Pace <weston.pace@gmail.com>
After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit c7741fb. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about possible false positives for unstable benchmarks that are known to sometimes produce them.
…e" (apache#35356) **Rationale for this change** This PR is for [Issue-34588](apache#34588). Discussing with @ westonpace, a MetaFunction for "dictionary_decode" is implemented instead of adding a compute kernel. **What changes are included in this PR?** C++: Meta Function of dictionary_decode. Python: Test **Are these changes tested?** One test in tests/test_compute.py * Closes: apache#34588 Lead-authored-by: Junming Chen <junming.chen.r@outlook.com> Co-authored-by: Alenka Frim <AlenkaF@users.noreply.github.com> Co-authored-by: Weston Pace <weston.pace@gmail.com> Signed-off-by: Weston Pace <weston.pace@gmail.com>
Rationale for this change
This PR is for Issue-34588. After discussing with @westonpace, a MetaFunction for "dictionary_decode" is implemented instead of adding a compute kernel.
What changes are included in this PR?
C++: Meta Function of dictionary_decode.
Python: Test
Are these changes tested?
One test in tests/test_compute.py