
add_files raises KeyError if parquet file does not have column stats #1353

Closed
binayakd opened this issue Nov 21, 2024 · 0 comments · Fixed by #1354

Comments

@binayakd
Contributor

Apache Iceberg version

0.8.0 (latest release)

Please describe the bug 🐞

Using the NYC taxi data set found here, if I follow the standard way of creating the catalog and table, but do add_files instead of append:

from pyiceberg.catalog.sql import SqlCatalog
import pyarrow.parquet as pq


warehouse_path = "/tmp/warehouse"
data_file_path = "/tmp/test-data"

# SQLite-backed catalog with a local filesystem warehouse
catalog = SqlCatalog(
    "default",
    **{
        "uri": f"sqlite:///{warehouse_path}/pyiceberg_catalog.db",
        "warehouse": f"file://{warehouse_path}",
    }
)

# Read the parquet file to derive the table schema
df = pq.read_table(f"{data_file_path}/yellow_tripdata_2024-01.parquet")

catalog.create_namespace("default")

table = catalog.create_table(
    "default.taxi_dataset",
    schema=df.schema,
)

# Register the existing parquet file instead of appending its contents
table.add_files([f"{data_file_path}/yellow_tripdata_2024-01.parquet"])

I get a KeyError:

Traceback (most recent call last):
  File "/home/binayak/Dropbox/dev/tests/iceberg-test/main.py", line 42, in <module>
    main()
  File "/home/binayak/Dropbox/dev/tests/iceberg-test/main.py", line 29, in main
    table.add_files([f"{data_file_path}/yellow_tripdata_2024-01.parquet"])
  File "/home/binayak/Dropbox/dev/my-github/iceberg-python/pyiceberg/table/__init__.py", line 1036, in add_files
    tx.add_files(
  File "/home/binayak/Dropbox/dev/my-github/iceberg-python/pyiceberg/table/__init__.py", line 594, in add_files
    for data_file in data_files:
  File "/home/binayak/Dropbox/dev/my-github/iceberg-python/pyiceberg/table/__init__.py", line 1537, in _parquet_files_to_data_files
    yield from parquet_files_to_data_files(io=io, table_metadata=table_metadata, file_paths=iter(file_paths))
  File "/home/binayak/Dropbox/dev/my-github/iceberg-python/pyiceberg/io/pyarrow.py", line 2535, in parquet_files_to_data_files
    statistics = data_file_statistics_from_parquet_metadata(
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/binayak/Dropbox/dev/my-github/iceberg-python/pyiceberg/io/pyarrow.py", line 2400, in data_file_statistics_from_parquet_metadata
    del col_aggs[field_id]
        ~~~~~~~~^^^^^^^^^^
KeyError: 1

This happens because this parquet file does not have column-level statistics set, so the source code goes into the else block here. As a result, col_aggs and null_value_counts are not updated, but invalidate_col is. So when the del statement here is run, the KeyError is thrown.

As discussed on slack, @kevinjqliu proposed to switch del col_aggs[field_id] with col_aggs.pop(field_id, None).
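The difference can be sketched with plain Python dicts (a minimal illustration of the failure mode, not the actual pyiceberg internals; the variable names mirror those in data_file_statistics_from_parquet_metadata):

```python
# col_aggs is never populated when the parquet file lacks column stats,
# but the field id is still recorded for invalidation.
col_aggs = {}
invalidate_col = {1}

# Current behavior: `del` on a missing key raises KeyError.
for field_id in invalidate_col:
    try:
        del col_aggs[field_id]
    except KeyError as e:
        print(f"KeyError: {e}")  # prints "KeyError: 1"

# Proposed fix: `pop` with a default is a no-op when the key is absent.
for field_id in invalidate_col:
    col_aggs.pop(field_id, None)

print(col_aggs)  # {} — still empty, no exception raised
```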

I will be raising a PR soon.
