
Deltalake read generates a massive number of read requests #931

Closed
djouallah opened this issue Nov 12, 2022 · 4 comments
Labels
bug Something isn't working

Comments

@djouallah

Environment

Delta-rs version: 0.6.3
Binding: Python
Environment: GCP

Bug

Trying to read a simple Delta table in GCP, which worked fine before; with 0.6.3 it has become extremely slow. I had a look at the log, and delta is generating a massive number of read requests!!! The Delta table is only 7 Parquet files!!!

[screenshot: storage access log showing the flood of read requests]

@djouallah djouallah added the bug Something isn't working label Nov 12, 2022
@wjones127
Collaborator

wjones127 commented Nov 12, 2022

Thanks for reporting this @djouallah.

IIRC the older implementation read the entire file with one request, whereas I think the newer versions of the object-store and parquet crates read parts of files with range requests (potentially in parallel). So it isn't reading more data in total, just splitting it into multiple requests. This is good for allowing you to process data batch-wise, but it looks like we are likely using ranges that are too small, so the per-request overhead is slowing things down. We'll need to tune this to be more optimal for typical Delta Lake file sizes, which are usually around 100 MB.
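
To make the tradeoff concrete, here is a rough sketch (plain Python, not delta-rs internals) of how the request count grows as the range size shrinks for a roughly 100 MB file:

def count_range_requests(file_size_bytes: int, range_size_bytes: int) -> int:
    # Number of ranged GET requests needed to cover the whole file (ceiling division).
    return -(-file_size_bytes // range_size_bytes)

file_size = 100 * 1024 * 1024  # ~100 MB, a typical Delta Lake Parquet file

for range_size in (64 * 1024, 1024 * 1024, 8 * 1024 * 1024):
    requests = count_range_requests(file_size, range_size)
    # 64 KiB ranges -> 1600 requests; 1 MiB -> 100; 8 MiB -> 13
    print(f"{range_size // 1024} KiB ranges -> {requests} requests")

Each request pays a fixed latency to GCS, so the small-range case is dominated by overhead even though the total bytes read are the same.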

IMO we should probably add some benchmarks to the reader and possibly add continuous benchmarking so we can track the impact of our changes. This action might help.
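
As a rough illustration of what such a benchmark could look like, here is a minimal sketch (assuming a table at a hypothetical gs://my-bucket/my-table URI; not an official benchmark suite):

import time

from deltalake import DeltaTable

start = time.perf_counter()
table = DeltaTable("gs://my-bucket/my-table").to_pyarrow_table()  # full table read
elapsed = time.perf_counter() - start

print(f"Read {table.num_rows} rows in {elapsed:.2f} s")

Tracking a number like this across releases (for example in CI) would catch regressions like this one early.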

@wjones127
Collaborator

I looked into this today. We got a lot slower when we started passing our Rust storage handlers into the PyArrow read and write dataset functions. I'm not 100% certain, but I suspect that is because of the limitations of the GIL: the built-in PyArrow filesystems can use multiple threads, but the Rust ones passed through Python will always be single-threaded.

As for the "massive number of requests", have you measured how many requests were made with earlier versions? It's not clear anything meaningful changed there. If you do see a difference, it would be helpful to know the exact versions of deltalake and pyarrow you are using, as well as the exact code.
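
For reference, one quick way to capture those exact versions for the report (using only the standard library, so it works regardless of how the packages were installed):

from importlib.metadata import version

print("deltalake:", version("deltalake"))
print("pyarrow:", version("pyarrow"))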

For reading, you should still be able to get good performance by explicitly passing a PyArrow filesystem into the reader, like so:

from deltalake import DeltaTable
import pyarrow.fs as pa_fs

table_root = "gs://..."
dt = DeltaTable(table_root)

# The filesystem needs to be manually rooted at the table directory for now;
# strip the "gs://" prefix so the sub-tree path is relative to GCS.
fs = pa_fs.SubTreeFileSystem(table_root[5:], pa_fs.GcsFileSystem())

result = dt.to_pyarrow_table(filesystem=fs)
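
Here, SubTreeFileSystem re-roots all paths at the table directory, so the file paths recorded in the Delta log (which are relative to the table root) resolve correctly against the native GcsFileSystem.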

@wjones127
Collaborator

We made some performance improvements in #933, which addressed the problem I discussed above (they will be in the next Python release). But based on the issue linked above, there might also be a problem within PyArrow's scanner itself.

@rando-brando

Pretty sure it is a scanner issue in the Arrow package. I see the same problem when using dataset.to_batches(). See apache/arrow#33759.
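
For reference, a minimal sketch of the batch-reading path this refers to (assuming a hypothetical table at gs://my-bucket/my-table):

from deltalake import DeltaTable

dataset = DeltaTable("gs://my-bucket/my-table").to_pyarrow_dataset()

# Streaming batch by batch goes through the PyArrow scanner,
# which is where the excessive read requests show up.
for batch in dataset.to_batches():
    print(batch.num_rows)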
