
Deltalake read generates a massive number of read requests #931

Closed
djouallah opened this issue Nov 12, 2022 · 4 comments
Labels
bug Something isn't working

Comments

@djouallah

Environment

Delta-rs version: 0.6.3
Binding: Python
Environment: GCP

Bug

Trying to read a simple Delta table in GCP, which worked fine before; with 0.6.3 it has become extremely slow. I had a look at the log, and delta is generating a massive number of read requests!!! The Delta table is only 7 Parquet files!!!

[screenshot: storage access log showing the flood of read requests]

@djouallah djouallah added the bug Something isn't working label Nov 12, 2022
@wjones127
Collaborator

wjones127 commented Nov 12, 2022

Thanks for reporting this @djouallah.

IIRC the older implementation read the entire file with one request, whereas I think the newer versions of the object-store and parquet crates read parts of files with range requests (potentially in parallel). So it isn't reading more data in total, just splitting it into multiple requests. This is good for allowing you to process data batch-wise, but it looks like we are likely using ranges that are too small, so the per-request overhead is slowing things down. We'll need to tune this to be more optimal for typical Delta Lake file sizes, which are usually around 100 MB.
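
To make the tradeoff concrete, here is a rough sketch (plain Python, not delta-rs internals) of how the request count grows as the range size shrinks for a roughly 100 MB file:

def count_range_requests(file_size_bytes: int, range_size_bytes: int) -> int:
    # Number of ranged GET requests needed to cover the whole file (ceiling division).
    return -(-file_size_bytes // range_size_bytes)

file_size = 100 * 1024 * 1024  # ~100 MB, a typical Delta Lake Parquet file

for range_size in (64 * 1024, 1024 * 1024, 8 * 1024 * 1024):
    requests = count_range_requests(file_size, range_size)
    # 64 KiB ranges -> 1600 requests; 1 MiB -> 100; 8 MiB -> 13
    print(f"{range_size // 1024} KiB ranges -> {requests} requests")

Each request pays a fixed latency to GCS, so the small-range case is dominated by overhead even though the total bytes read are the same.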

IMO we should probably add some benchmarks to the reader and possibly add continuous benchmarking so we can track the impact of our changes. This action might help.
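
As a rough illustration of what such a benchmark could look like, here is a minimal sketch (assuming a table at a hypothetical gs://my-bucket/my-table URI; not an official benchmark suite):

import time

from deltalake import DeltaTable

start = time.perf_counter()
table = DeltaTable("gs://my-bucket/my-table").to_pyarrow_table()  # full table read
elapsed = time.perf_counter() - start

print(f"Read {table.num_rows} rows in {elapsed:.2f} s")

Tracking a number like this across releases (for example in CI) would catch regressions like this one early.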

@wjones127
Collaborator

I looked into this today. We got a lot slower when we started passing our Rust storage handlers into the PyArrow read and write dataset functions. I'm not 100% certain, but I suspect that is because of the limitations of the GIL: the built-in PyArrow filesystems can use multiple threads, but the Rust ones passed through Python will always be single-threaded.

As for the "massive number of requests", have you measured how many requests were made with earlier versions? It's not clear anything meaningful changed there. If you do see a difference, it would be helpful to know the exact versions of deltalake and pyarrow you are using, as well as the exact code.
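
For reference, one quick way to capture those exact versions for the report (using only the standard library, so it works regardless of how the packages were installed):

from importlib.metadata import version

print("deltalake:", version("deltalake"))
print("pyarrow:", version("pyarrow"))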

For reading, you should still be able to get good performance by explicitly passing a PyArrow filesystem into the reader, like so:

from deltalake import DeltaTable
import pyarrow.fs as pa_fs

table_root = "gs://..."
dt = DeltaTable(table_root)

# The filesystem needs to be manually rooted at the table directory for now;
# strip the "gs://" prefix so the sub-tree path is relative to GCS.
fs = pa_fs.SubTreeFileSystem(table_root[5:], pa_fs.GcsFileSystem())

result = dt.to_pyarrow_table(filesystem=fs)
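
Here, SubTreeFileSystem re-roots all paths at the table directory, so the file paths recorded in the Delta log (which are relative to the table root) resolve correctly against the native GcsFileSystem.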

@wjones127
Collaborator

We made some performance improvements in #933, which addressed the problem I discussed above (they will be in the next Python release). But based on the issue linked above, there might also be a problem within PyArrow's scanner itself.

@rando-brando

Pretty sure it is a scanner issue in the Arrow package. I see the same problem when using dataset.to_batches(). See apache/arrow#33759.
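
For reference, a minimal sketch of the batch-reading path this refers to (assuming a hypothetical table at gs://my-bucket/my-table):

from deltalake import DeltaTable

dataset = DeltaTable("gs://my-bucket/my-table").to_pyarrow_dataset()

# Streaming batch by batch goes through the PyArrow scanner,
# which is where the excessive read requests show up.
for batch in dataset.to_batches():
    print(batch.num_rows)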
