ARROW-16719: [Python] Add path/URI + filesystem handling to parquet.read_metadata #13629
@@ -18,6 +18,7 @@
 import datetime
 import decimal
 from collections import OrderedDict
+import os

 import numpy as np
 import pytest

@@ -531,3 +532,19 @@ def test_metadata_exceeds_message_size():
     buf = out.getvalue()

     metadata = pq.read_metadata(pa.BufferReader(buf))
+
+
+def test_metadata_schema_filesystem(tmpdir):
+    table = pa.table({"a": [1, 2, 3]})
+
+    # URI writing to local file.
+    file_path = 'file:///' + os.path.join(str(tmpdir), "data.parquet")
+
+    pq.write_table(table, file_path)
+
+    # Get expected `metadata` from path.
+    metadata = pq.read_metadata(tmpdir / '/data.parquet')
Is there a way to get

Try

I tried that but it leads to segfault. Probably worth an issue? (I don't think the program should crash)
Crash Log

Running this locally I can confirm a segfault. I think it happens because the table metadata is (correctly) empty:
>>> import pyarrow as pa
>>> import pyarrow.parquet as pq
>>> table = pa.table({"a": [1, 2, 3]})
>>> file_path = "/tmp/data.parquet"
>>> metadata = table.schema.metadata
>>> pq.read_metadata(file_path).equals(metadata)
zsh: segmentation fault python
Which would deserve an issue (a warning should be returned without a crash). Maybe a better option to test ParquetFile metadata would be to inspect individual attributes:
>>> pq.read_metadata(file_path).num_columns == 1
True
>>> pq.read_metadata(file_path).num_rows == 3
True

From the following metadata attributes, I think

Ah, they sneaked in 😅. Removed :)

Is there already a JIRA opened for this segfault?

I don't think so. I will open one.

Issue Link : https://issues.apache.org/jira/browse/ARROW-17142

Thanks for opening the JIRA
+    schema = table.schema
+
+    assert pq.read_metadata(file_path).equals(metadata)
+    assert pq.read_schema(file_path).equals(schema)
When opening the file with open_input_file, we should probably use it in a context manager to ensure we also close the file handle again (and the same for read_schema).
I am curious as to what is the minimum supported Python version.
I was planning to do something like which requires Python 3.7 or more.
Python 3.7 is our minimum supported version
Thanks for the confirmation. Have addressed the comment.
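The snippet behind the Python 3.7 question did not survive in this copy of the thread. One standard-library tool that first appeared in Python 3.7 and fits the "close the handle only if we opened it" situation is `contextlib.nullcontext`; whether that is what the commenter had in mind is an assumption. The sketch below uses plain binary files rather than pyarrow, and `read_header` is a hypothetical name.

```python
import contextlib
import io
import os
import tempfile


def read_header(where, size=4):
    """Hypothetical helper: return the first `size` bytes of `where`.

    `where` may be a path (opened and closed here) or an already open
    binary file object (left open for the caller to manage).
    """
    if isinstance(where, (str, os.PathLike)):
        ctx = open(where, "rb")
    else:
        # nullcontext yields the object unchanged and does not close it.
        ctx = contextlib.nullcontext(where)
    with ctx as f:
        return f.read(size)


# Case 1: the caller owns the file object, so it stays open afterwards.
buffer = io.BytesIO(b"PAR1 example bytes")
header_from_buffer = read_header(buffer)

# Case 2: a path is given, so the helper opens and closes the file itself.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.bin")
    with open(path, "wb") as out:
        out.write(b"PAR1 example bytes")
    header_from_path = read_header(path)
```

Both branches go through the same `with` statement, so the reading logic is written once and only handles opened by the helper are closed by it.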