ARROW-16719: [Python] Add path/URI + filesystem handling to parquet.read_metadata #13629

Merged
Changes from 2 commits
20 changes: 18 additions & 2 deletions python/pyarrow/parquet/__init__.py
@@ -3378,7 +3378,8 @@ def write_metadata(schema, where, metadata_collector=None, **kwargs):
     metadata.write_metadata_file(where)


-def read_metadata(where, memory_map=False, decryption_properties=None):
+def read_metadata(where, memory_map=False, decryption_properties=None,
+                  filesystem=None):
     """
     Read FileMetaData from footer of a single Parquet file.

@@ -3389,6 +3390,10 @@ def read_metadata(where, memory_map=False, decryption_properties=None):
         Create memory map when the source is a file path.
     decryption_properties : FileDecryptionProperties, default None
         Decryption properties for reading encrypted Parquet files.
+    filesystem : FileSystem, default None
+        If nothing passed, will be inferred based on path.
+        Path will try to be found in the local on-disk filesystem otherwise
+        it will be parsed as an URI to determine the filesystem.

     Returns
     -------
@@ -3411,11 +3416,15 @@ def read_metadata(where, memory_map=False, decryption_properties=None):
     format_version: 2.6
     serialized_size: 561
     """
+    filesystem, where = _resolve_filesystem_and_path(where, filesystem)
+    if filesystem is not None:
+        where = filesystem.open_input_file(where)
     return ParquetFile(where, memory_map=memory_map,
Member:

When opening the file with open_input_file, we should probably use it in a context manager to ensure we also close the file handle again:

with filesystem.open_input_file(where) as source:
    return ParquetFile(source, ...).metadata

(and the same for read_schema)
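
(For reference: the file object returned by open_input_file supports the context-manager protocol, so a with block like the one above closes the handle even if ParquetFile raises.)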

Contributor Author:

I am curious what the minimum supported Python version is.

I was planning to do something like the following, which requires Python 3.7 or later.

# requires: from contextlib import nullcontext  (Python 3.7+)
source = filesystem.open_input_file(where) if filesystem is not None else nullcontext(where)
with source as f:
    return ParquetFile(
        f, memory_map=memory_map,
        decryption_properties=decryption_properties).schema.to_arrow_schema()
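
(contextlib.nullcontext was added in Python 3.7, hence the version question: nullcontext(enter_result) is a no-op context manager whose __enter__ simply returns enter_result, so the with block works uniformly whether or not a real file handle was opened.)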

Member:

Python 3.7 is our minimum supported version

Contributor Author:

Thanks for the confirmation. Have addressed the comment.

                        decryption_properties=decryption_properties).metadata


-def read_schema(where, memory_map=False, decryption_properties=None):
+def read_schema(where, memory_map=False, decryption_properties=None,
+                filesystem=None):
"""
Read effective Arrow schema from Parquet file metadata.

@@ -3426,6 +3435,10 @@ def read_schema(where, memory_map=False, decryption_properties=None):
         Create memory map when the source is a file path.
     decryption_properties : FileDecryptionProperties, default None
         Decryption properties for reading encrypted Parquet files.
+    filesystem : FileSystem, default None
+        If nothing passed, will be inferred based on path.
+        Path will try to be found in the local on-disk filesystem otherwise
+        it will be parsed as an URI to determine the filesystem.

     Returns
     -------
@@ -3443,6 +3456,9 @@ def read_schema(where, memory_map=False, decryption_properties=None):
     n_legs: int64
     animal: string
     """
+    filesystem, where = _resolve_filesystem_and_path(where, filesystem)
+    if filesystem is not None:
+        where = filesystem.open_input_file(where)
     return ParquetFile(
         where, memory_map=memory_map,
         decryption_properties=decryption_properties).schema.to_arrow_schema()
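
For readers skimming the diff, a minimal sketch of what the new parameter enables, assuming a Parquet file exists at the illustrative path /tmp/data.parquet (not a path from this PR):

import pyarrow.parquet as pq
from pyarrow import fs

# Explicit filesystem: `where` is interpreted as a path on that filesystem.
local = fs.LocalFileSystem()
md = pq.read_metadata("/tmp/data.parquet", filesystem=local)

# No filesystem passed: a plain path resolves against the local
# filesystem, while a URI is parsed to pick one (file://, s3://, ...).
md = pq.read_metadata("file:///tmp/data.parquet")
schema = pq.read_schema("file:///tmp/data.parquet")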
17 changes: 17 additions & 0 deletions python/pyarrow/tests/parquet/test_metadata.py
@@ -18,6 +18,7 @@
 import datetime
 import decimal
 from collections import OrderedDict
+import os

 import numpy as np
 import pytest
@@ -531,3 +532,19 @@ def test_metadata_exceeds_message_size():
     buf = out.getvalue()

     metadata = pq.read_metadata(pa.BufferReader(buf))
+
+
+def test_metadata_schema_filesystem(tmpdir):
+    table = pa.table({"a": [1, 2, 3]})
+
+    # URI writing to local file.
+    file_path = 'file:///' + os.path.join(str(tmpdir), "data.parquet")
+
+    pq.write_table(table, file_path)
+
+    # Get expected `metadata` from path.
+    metadata = pq.read_metadata(tmpdir / 'data.parquet')
Contributor Author:

Is there a way to get metadata directly from the table?

Member:

Try table.schema.metadata.

Contributor Author (@kshitij12345, Jul 17, 2022):

I tried that but it leads to a segfault. Probably worth an issue? (I don't think the program should crash.)

Crash Log
python/pyarrow/tests/parquet/test_metadata.py .........................sFatal Python error: Segmentation fault

Current thread 0x00007f066e2be740 (most recent call first):
  File "/home/kshiteej/Apache/arrow/python/pyarrow/tests/parquet/test_metadata.py", line 549 in test_metadata_schema_filesystem
  File "/home/kshiteej/.conda/envs/pyarrow-dev/lib/python3.9/site-packages/_pytest/python.py", line 192 in pytest_pyfunc_call
  File "/home/kshiteej/.conda/envs/pyarrow-dev/lib/python3.9/site-packages/pluggy/_callers.py", line 39 in _multicall
  File "/home/kshiteej/.conda/envs/pyarrow-dev/lib/python3.9/site-packages/pluggy/_manager.py", line 80 in _hookexec
  File "/home/kshiteej/.conda/envs/pyarrow-dev/lib/python3.9/site-packages/pluggy/_hooks.py", line 265 in __call__
  File "/home/kshiteej/.conda/envs/pyarrow-dev/lib/python3.9/site-packages/_pytest/python.py", line 1761 in runtest
  File "/home/kshiteej/.conda/envs/pyarrow-dev/lib/python3.9/site-packages/_pytest/runner.py", line 166 in pytest_runtest_call
  File "/home/kshiteej/.conda/envs/pyarrow-dev/lib/python3.9/site-packages/pluggy/_callers.py", line 39 in _multicall
  File "/home/kshiteej/.conda/envs/pyarrow-dev/lib/python3.9/site-packages/pluggy/_manager.py", line 80 in _hookexec
  File "/home/kshiteej/.conda/envs/pyarrow-dev/lib/python3.9/site-packages/pluggy/_hooks.py", line 265 in __call__
  File "/home/kshiteej/.conda/envs/pyarrow-dev/lib/python3.9/site-packages/_pytest/runner.py", line 259 in <lambda>
  File "/home/kshiteej/.conda/envs/pyarrow-dev/lib/python3.9/site-packages/_pytest/runner.py", line 338 in from_call
  File "/home/kshiteej/.conda/envs/pyarrow-dev/lib/python3.9/site-packages/_pytest/runner.py", line 258 in call_runtest_hook
  File "/home/kshiteej/.conda/envs/pyarrow-dev/lib/python3.9/site-packages/_pytest/runner.py", line 219 in call_and_report
  File "/home/kshiteej/.conda/envs/pyarrow-dev/lib/python3.9/site-packages/_pytest/runner.py", line 130 in runtestprotocol
  File "/home/kshiteej/.conda/envs/pyarrow-dev/lib/python3.9/site-packages/_pytest/runner.py", line 111 in pytest_runtest_protocol
  File "/home/kshiteej/.conda/envs/pyarrow-dev/lib/python3.9/site-packages/pluggy/_callers.py", line 39 in _multicall
  File "/home/kshiteej/.conda/envs/pyarrow-dev/lib/python3.9/site-packages/pluggy/_manager.py", line 80 in _hookexec
  File "/home/kshiteej/.conda/envs/pyarrow-dev/lib/python3.9/site-packages/pluggy/_hooks.py", line 265 in __call__
  File "/home/kshiteej/.conda/envs/pyarrow-dev/lib/python3.9/site-packages/_pytest/main.py", line 347 in pytest_runtestloop
  File "/home/kshiteej/.conda/envs/pyarrow-dev/lib/python3.9/site-packages/pluggy/_callers.py", line 39 in _multicall
  File "/home/kshiteej/.conda/envs/pyarrow-dev/lib/python3.9/site-packages/pluggy/_manager.py", line 80 in _hookexec
  File "/home/kshiteej/.conda/envs/pyarrow-dev/lib/python3.9/site-packages/pluggy/_hooks.py", line 265 in __call__
  File "/home/kshiteej/.conda/envs/pyarrow-dev/lib/python3.9/site-packages/_pytest/main.py", line 322 in _main
  File "/home/kshiteej/.conda/envs/pyarrow-dev/lib/python3.9/site-packages/_pytest/main.py", line 268 in wrap_session
  File "/home/kshiteej/.conda/envs/pyarrow-dev/lib/python3.9/site-packages/_pytest/main.py", line 315 in pytest_cmdline_main
  File "/home/kshiteej/.conda/envs/pyarrow-dev/lib/python3.9/site-packages/pluggy/_callers.py", line 39 in _multicall
  File "/home/kshiteej/.conda/envs/pyarrow-dev/lib/python3.9/site-packages/pluggy/_manager.py", line 80 in _hookexec
  File "/home/kshiteej/.conda/envs/pyarrow-dev/lib/python3.9/site-packages/pluggy/_hooks.py", line 265 in __call__
  File "/home/kshiteej/.conda/envs/pyarrow-dev/lib/python3.9/site-packages/_pytest/config/__init__.py", line 164 in main
  File "/home/kshiteej/.conda/envs/pyarrow-dev/lib/python3.9/site-packages/_pytest/config/__init__.py", line 187 in console_main
  File "/home/kshiteej/.conda/envs/pyarrow-dev/bin/pytest", line 11 in <module>
Segmentation fault (core dumped)

Member:

Running this locally I can confirm a segfault. I think it happens because the table metadata is (correctly) empty:

>>> import pyarrow as pa
>>> import pyarrow.parquet as pq

>>> table = pa.table({"a": [1, 2, 3]})
>>> file_path = "/tmp/data.parquet"
>>> metadata = table.schema.metadata

>>> pq.read_metadata(file_path).equals(metadata)
zsh: segmentation fault  python

Which would deserve an issue (a warning should be raised instead of a crash).

Maybe a better option to test ParquetFile metadata would be to inspect individual attributes:

>>> pq.read_metadata(file_path).num_columns == 1
True
>>> pq.read_metadata(file_path).num_rows == 3
True

Contributor Author:

num_columns: 1
num_rows: 3
num_row_groups: 1
format_version: 2.6
serialized_size: 375

Of the metadata attributes above, I think num_columns, num_rows, and num_row_groups make sense to check. Should we also check the other two? (I am not sure they will stay the same across versions and platforms.)

Contributor Author:

Ah, they sneaked in 😅. Removed :)

Member:

Is there already a JIRA opened for this segfault?

Contributor Author:

I don't think so. I will open one.

Member:

Thanks for opening the JIRA

+    schema = table.schema
+
+    assert pq.read_metadata(file_path).equals(metadata)
+    assert pq.read_schema(file_path).equals(schema)