GH-44066: [Python] Add Python wrapper for JsonExtensionType #44070
Thanks for working on this!
Added some inline comments. One other comment:
- Can you add the new classes to test_extension_type_constructor_errors in test_misc.py?
Are you planning to work on enabling the parquet integration in a later PR?
Thanks for the review!
Added.
Do we need to do something Python-side? C++ side was covered by #13901. I've added a basic parquet test here and it works for me locally. Question: any idea why pytest is saying |
I would be surprised if it works locally out of the box because IIRC the option was disabled by default in C++ for now? (So what we would need on the Python side is to expose that new
There might somehow be some conflict with the stdlib module?
Actually not that, but we have a |
@pytest.mark.parametrize("storage_type", (
    pa.utf8(), pa.large_utf8(), pa.string(), pa.large_string()))
@pytest.mark.parquet
def test_parquet_json(tmpdir, storage_type):
Maybe move those tests to the parquet subdir? (e.g. python/pyarrow/tests/parquet/test_data_types.py)
I know we already have parquet related tests in this file, but those are for custom extension type support, while this will be a built-in extension type
Moved.
I see. I assumed
It seems this was the case, changed extension type name to |
Another rebase on master.
@pitrou could you do a quick pass here in case anything stands out, please?
Create an extension array

>>> arr = [None, '{ "id":30, "values":["a", "b"] }']
>>> storage = pa.array(arr, pa.large_utf8())
Side note: it would be nice if one could write json_array = pa.array(arr, json_type).
Perhaps open a feature request?
That would be a nice feature. #44406
for storage_type in (pa.int32(), pa.large_binary(), pa.float32()):
    with pytest.raises(
            pa.ArrowInvalid,
Pity this doesn't raise TypeError.
We could catch and raise it but it's probably not a good idea.
Indeed. TypeError would have to be raised at the C++ level instead. Anyway, this is out of scope for this PR.
@pitrou do you think this merits a merge or should we wait for another review?
What's the plan for the Parquet |
Perhaps we should open another issue for it? Current implementation seems to roundtrip to parquet ok.
diff --git a/python/pyarrow/_parquet.pxd b/python/pyarrow/_parquet.pxd
index d6aebd8284..32e2618ecf 100644
--- a/python/pyarrow/_parquet.pxd
+++ b/python/pyarrow/_parquet.pxd
@@ -405,6 +405,7 @@ cdef extern from "parquet/api/reader.h" namespace "parquet" nogil:
CCacheOptions cache_options() const
void set_coerce_int96_timestamp_unit(TimeUnit unit)
TimeUnit coerce_int96_timestamp_unit() const
+ void set_arrow_extensions_enabled(c_bool enabled)
ArrowReaderProperties default_arrow_reader_properties()
diff --git a/python/pyarrow/_parquet.pyx b/python/pyarrow/_parquet.pyx
index 254bfe3b09..6ae1726c71 100644
--- a/python/pyarrow/_parquet.pyx
+++ b/python/pyarrow/_parquet.pyx
@@ -1441,7 +1441,8 @@ cdef class ParquetReader(_Weakrefable):
FileDecryptionProperties decryption_properties=None,
thrift_string_size_limit=None,
thrift_container_size_limit=None,
- page_checksum_verification=False):
+ page_checksum_verification=False,
+ arrow_extensions_enabled=False):
"""
Open a parquet file for reading.
@@ -1458,6 +1459,7 @@ cdef class ParquetReader(_Weakrefable):
thrift_string_size_limit : int, optional
thrift_container_size_limit : int, optional
page_checksum_verification : bool, default False
+ arrow_extensions_enabled: bool, default False
"""
cdef:
shared_ptr[CFileMetaData] c_metadata
@@ -1522,6 +1524,9 @@ cdef class ParquetReader(_Weakrefable):
if read_dictionary is not None:
self._set_read_dictionary(read_dictionary, &arrow_props)
+ if arrow_extensions_enabled:
+ arrow_props.set_arrow_extensions_enabled(<c_bool>True)
+
with nogil:
check_status(builder.memory_pool(self.pool)
.properties(arrow_props)
@rok, yes, we should open a new issue for it.
Opened an issue for the |
After merging your PR, Conbench analyzed the 3 benchmarking runs that have been run so far on merge-commit bcb4653. There were no benchmark performance regressions. 🎉 The full Conbench report has more details.
Rationale for this change
We added canonical JsonExtensionType and we should make it usable from Python.
What changes are included in this PR?
Python wrappers for JsonExtensionType and JsonArray are added on the Python side, as well as JsonArray on the C++ side.
Are these changes tested?
Python tests for the extension type and array are included.
Are there any user-facing changes?
This adds a json canonical extension type to pyarrow.