Deltalake read generate a massive number of read request #931
Comments
Thanks for reporting this @djouallah.
IMO we should probably add some benchmarks to the reader and possibly add continuous benchmarking so we can track the impact of our changes. This action might help.
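As a starting point for such a benchmark, here is a minimal timing sketch. The helper names `time_reads` and `summarize` are hypothetical (not part of deltalake or PyArrow), and wall-clock timing of a remote read is noisy, so treat the numbers as a rough comparison between versions rather than a precise measurement.

```python
import statistics
import time
from typing import Callable, Dict, List


def time_reads(read_fn: Callable[[], object], repeats: int = 5) -> List[float]:
    """Call read_fn `repeats` times; return each wall-clock duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        read_fn()
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations: List[float]) -> Dict[str, float]:
    """Reduce the raw timings to min / median / max."""
    return {
        "min": min(durations),
        "median": statistics.median(durations),
        "max": max(durations),
    }


# Hypothetical usage against a real table (needs deltalake and credentials):
#   from deltalake import DeltaTable
#   timings = time_reads(lambda: DeltaTable("gs://...").to_pyarrow_table())
#   print(summarize(timings))
```

Running the same script under two deltalake releases would give a first indication of whether a change made reads slower.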
I looked into this today. We got a lot slower when we started passing our Rust storage handlers into the PyArrow read and write dataset functions. I'm not 100% certain, but I suspect the cause is the GIL: the built-in PyArrow filesystems can use multiple threads, but the Rust ones passed through Python will always be single-threaded.
As for the "massive number of requests", have you measured how many requests were made in earlier versions? It's not clear anything meaningful changed there. If you do see a difference, it would be helpful to know the exact versions of deltalake and pyarrow you are using, as well as the exact code.
For reading, you should still be able to get good performance by explicitly passing a PyArrow filesystem down into the reader, like so:
from deltalake import DeltaTable
import pyarrow.fs as pa_fs
table_root = "gs://..."
dt = DeltaTable(table_root)
# Needs to manually be set at table root for now;
# table_root[5:] strips the leading "gs://" scheme
fs = pa_fs.SubTreeFileSystem(table_root[5:], pa_fs.GcsFileSystem())
result = dt.to_pyarrow_table(filesystem=fs)
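The hard-coded slice in the snippet above only works for a five-character `gs://` prefix. A small sketch of a scheme-agnostic alternative, using only the standard library, is below; `strip_scheme` is a hypothetical helper name, not something deltalake or PyArrow provides.

```python
from urllib.parse import urlsplit


def strip_scheme(table_uri: str) -> str:
    """Return 'bucket/path' from a URI such as 'gs://bucket/path'."""
    parts = urlsplit(table_uri)
    if not parts.scheme:
        raise ValueError(f"expected a URI with a scheme, got {table_uri!r}")
    # netloc is the bucket, path is the key prefix (with a leading slash)
    return f"{parts.netloc}{parts.path}".rstrip("/")


print(strip_scheme("gs://my-bucket/path/to/table"))  # my-bucket/path/to/table

# Hypothetical usage in place of table_root[5:]:
#   fs = pa_fs.SubTreeFileSystem(strip_scheme(table_root), pa_fs.GcsFileSystem())
```

The same helper would work unchanged for other schemes such as `s3://`, assuming the matching PyArrow filesystem class is used.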
We made some performance improvements in #933, which addressed the problem I discussed (will be in next Python release). But based on the issue linked above, there might be an issue in PyArrow itself within its scanner.
Pretty sure it is a scanner issue with the arrow package. Same problem when using
Environment
Version: 0.6.3
Binding: Python
Environment: GCP
Bug
I'm trying to read a simple Delta table in GCP. This worked fine before, but with 0.6.3 it became extremely slow. I had a look at the log, and Delta is generating a massive number of read requests, even though the table is only 7 Parquet files.