[GCS] Scanning hive-partitioned dataset results in orders of magnitude more network traffic than it should #33624

jhostetler · 2023-01-12T05:53:00Z

Describe the bug, including details regarding any error messages, version, and platform.

I have a hive-partitioned dataset in a Google Cloud Storage bucket. Its size is around 54 MB according to gsutil du, which I verified by downloading it and checking locally. However, if I open it with ds = pyarrow.dataset.dataset("gs://...", partitioning="hive", format="parquet") and then traverse it with ds.to_batches(), the scan generates multiple GB of inbound network traffic and takes much longer than simply downloading the data.

This is with PyArrow 10.0.1 on macOS 12.6 with an Apple M1 CPU.
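For anyone trying to reproduce this, here is a minimal sketch of the scan described above, with a small helper to express the traffic blow-up as a ratio. The function names `scan_dataset` and `amplification` are hypothetical, the bucket URI has to be supplied by the caller, and the inbound byte count is assumed to come from an external tool such as the operating system's network monitor, since PyArrow does not report it here.

```python
import time


def scan_dataset(uri):
    """Traverse a hive-partitioned Parquet dataset and collect basic scan statistics.

    `uri` is whatever is passed to pyarrow.dataset.dataset(), e.g. a gs:// URI.
    """
    # Imported lazily so that amplification() below works without PyArrow installed.
    import pyarrow.dataset as pads

    dataset = pads.dataset(uri, partitioning="hive", format="parquet")

    rows = 0
    decoded_bytes = 0
    start = time.perf_counter()
    for batch in dataset.to_batches():
        rows += batch.num_rows
        # In-memory size of the decoded batch, not the bytes sent over the network.
        decoded_bytes += batch.nbytes
    elapsed = time.perf_counter() - start

    return {
        "files": len(dataset.files),
        "rows": rows,
        "decoded_bytes": decoded_bytes,
        "seconds": elapsed,
    }


def amplification(observed_bytes, stored_bytes):
    """Ratio of bytes received during the scan to bytes stored in the bucket."""
    if stored_bytes <= 0:
        raise ValueError("stored_bytes must be positive")
    return observed_bytes / stored_bytes


# Hypothetical numbers for illustration: 2 GiB received for a 54 MiB dataset.
example_ratio = amplification(2 * 1024**3, 54 * 1024**2)
```

Calling `scan_dataset` with the bucket URI while watching inbound traffic, then passing the observed byte count and the `gsutil du` size to `amplification`, gives a single number to compare across PyArrow versions or scan options.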

Component(s)

Python


djouallah · 2023-01-12T14:08:49Z

I am having the same issue when using it with a Delta table:
delta-io/delta-rs#931

jhostetler added the Type: bug label Jan 12, 2023

github-actions bot added the Component: Python label Jan 12, 2023

rando-brando mentioned this issue Feb 2, 2023

[Python][C++] How to limit the memory consumption of to_batches() #33759

Open
