GH-34588: [C++][Python] Add a MetaFunction for "dictionary_decode" #35356
Conversation
Hi @westonpace @AlenkaF, if you have spare time, please review this PR.
Thank you for the contribution @R-JunmingChen !
I had a look at the Python part. Just have some minor nits.
- Can we add a combo of encode & decode to the test?
- There are some corrections needed from doctest and linter checks:
--- original//arrow/python/pyarrow/compute.py
+++ fixed//arrow/python/pyarrow/compute.py
@@ -387,7 +387,7 @@
array : decoded Array
The dictionary_decode result as a new Array
"""
- if(not isinstance(arr.type, pa.DictionaryType)):
+ if (not isinstance(arr.type, pa.DictionaryType)):
raise TypeError("Must pass a dictionary array")
return call_function("dictionary_decode", [arr], memory_pool)
--- original//arrow/python/pyarrow/tests/test_compute.py
+++ fixed//arrow/python/pyarrow/tests/test_compute.py
@@ -1750,7 +1750,7 @@
def test_dictionary_decode():
array = pa.array(["a", "a", "b", "c", "b"])
- dictionary_array = pa.array(["a", "a", "b", "c", "b"],
+ dictionary_array = pa.array(["a", "a", "b", "c", "b"],
pa.dictionary(pa.int8(), pa.string()))
assert array != dictionary_array
_________________ [doctest] pyarrow.compute.dictionary_decode __________________
340 arr : Array-like
341 memory_pool : MemoryPool, optional
342 memory pool to use for allocations during function execution.
343
344 Examples
345 --------
346 >>> import pyarrow as pa
347 >>> import pyarrow.compute as pc
348 >>> x = pa.array(["a", "a", "b"], pa.dictionary(pa.int8(), pa.string()))
349 >>> x
Expected:
<pyarrow.lib.DictionaryArray object at ...>
Got:
<pyarrow.lib.DictionaryArray object at 0x7f6adc3da050>
<BLANKLINE>
-- dictionary:
[
"a",
"b"
]
-- indices:
[
0,
0,
1
]
/opt/conda/envs/arrow/lib/python3.9/site-packages/pyarrow/compute.py:349: DocTestFailure
__________________ [doctest] pyarrow.lib.default_memory_pool ___________________
125 default_memory_pool()
126
127 Return the process-global memory pool.
128
129 Examples
130 --------
131 >>> default_memory_pool()
Expected:
<pyarrow.MemoryPool backend_name=... bytes_allocated=0 max_memory=...>
Got:
<pyarrow.MemoryPool backend_name=jemalloc bytes_allocated=192 max_memory=1205120>
In the latter you can use `...` instead of the specific bytes allocated, etc.
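To illustrate the `...` suggestion: with doctest's `ELLIPSIS` option enabled (as it is in Arrow's doctest configuration), `...` in the expected output matches any text, so reprs with nondeterministic parts such as addresses or byte counts still pass. A small self-contained sketch with a made-up repr:

```python
import doctest

def demo():
    """
    >>> print("<MemoryPool backend_name=jemalloc bytes_allocated=192 max_memory=1205120>")
    <MemoryPool backend_name=... bytes_allocated=... max_memory=...>
    """

# Run the docstring example with ELLIPSIS enabled; "..." absorbs the
# run-specific numbers, so the example passes.
finder = doctest.DocTestFinder()
runner = doctest.DocTestRunner(optionflags=doctest.ELLIPSIS)
for test in finder.find(demo):
    runner.run(test)
assert runner.failures == 0
```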
Sure, I have moved it. Do we need an additional comment for that?
I think it's ok, but if you want we can add a simple comment like
Thanks. I have a few more questions but I think this is getting close.
python/pyarrow/array.pxi (Outdated)
     """
-    return self.dictionary.take(self.indices)
+    return _pc().dictionary_decode(self)
Did the old approach work? I wonder if this will introduce any subtle differences? Is there a reason we need to change this method?
The old approach works well. I am not sure about the full risk of subtle differences. One difference is that we return the original input when the input is not a dictionary array, which has no further impact because only DictionaryArray has a dictionary_decode method. I think it's good to unify the dictionary_decode logic in Python, but your concern makes sense: the change risks affecting users who relied on the old dictionary_decode.
I have rolled back to the old code for now, but I think this needs further consideration.
I think it's good to unify the dictionary_decode logic in Python
This is a good point too. I'll let @jorisvandenbossche or @AlenkaF weigh in as well. We can move forward as-is for now and add this back if they want.
One other reason to keep the original is that it is possible to build pyarrow without compute, and then the old version would still work while using the compute version would raise (although I am not sure how important this is, as many things won't work in that case)
Co-authored-by: Weston Pace <weston.pace@gmail.com>
Thanks for this addition!
…e" (apache#35356) **Rationale for this change** This PR is for [Issue-34588](apache#34588). Discussing with @ westonpace, a MetaFunction for "dictionary_decode" is implemented instead of adding a compute kernel. **What changes are included in this PR?** C++: Meta Function of dictionary_decode. Python: Test **Are these changes tested?** One test in tests/test_compute.py * Closes: apache#34588 Lead-authored-by: Junming Chen <junming.chen.r@outlook.com> Co-authored-by: Alenka Frim <AlenkaF@users.noreply.github.com> Co-authored-by: Weston Pace <weston.pace@gmail.com> Signed-off-by: Weston Pace <weston.pace@gmail.com>
After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit c7741fb. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about possible false positives for unstable benchmarks that are known to sometimes produce them.
…e" (apache#35356) **Rationale for this change** This PR is for [Issue-34588](apache#34588). Discussing with @ westonpace, a MetaFunction for "dictionary_decode" is implemented instead of adding a compute kernel. **What changes are included in this PR?** C++: Meta Function of dictionary_decode. Python: Test **Are these changes tested?** One test in tests/test_compute.py * Closes: apache#34588 Lead-authored-by: Junming Chen <junming.chen.r@outlook.com> Co-authored-by: Alenka Frim <AlenkaF@users.noreply.github.com> Co-authored-by: Weston Pace <weston.pace@gmail.com> Signed-off-by: Weston Pace <weston.pace@gmail.com>
Rationale for this change
This PR is for Issue-34588. After discussing with @westonpace, a MetaFunction for "dictionary_decode" is implemented instead of adding a compute kernel.
What changes are included in this PR?
C++: Meta Function of dictionary_decode.
Python: Test
Are these changes tested?
One test in tests/test_compute.py