fix(rust): Added proper handling of file.write for large remote csv files #18424
Conversation
file.write returns Ok(n), indicating how many bytes could be written to the file. This can be smaller than the length of the passed content. Added a loop so that writing is retried until all of the content has been written to the file.
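For illustration, a minimal sketch of the retry-until-written idea, assuming a tokio::fs::File handle; the helper name `write_fully` and its surroundings are illustrative, not the actual patch:

```rust
use tokio::fs::File;
use tokio::io::AsyncWriteExt;

// Keep calling `write` until the whole buffer has been handed to the file:
// a single call may accept fewer bytes than requested.
async fn write_fully(file: &mut File, mut buf: &[u8]) -> std::io::Result<()> {
    while !buf.is_empty() {
        let n = file.write(buf).await?;
        if n == 0 {
            // The writer accepted nothing; bail out instead of spinning forever.
            return Err(std::io::ErrorKind::WriteZero.into());
        }
        buf = &buf[n..];
    }
    Ok(())
}
```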
Codecov Report

Attention: Patch coverage is

@@           Coverage Diff            @@
##             main   #18424    +/-   ##
=========================================
  Coverage   79.86%   79.87%
=========================================
  Files        1501     1501
  Lines      201813   201815       +2
  Branches     2868     2868
=========================================
+ Hits       161177   161191      +14
+ Misses      40089    40077      -12
  Partials      547      547

☔ View full report in Codecov by Sentry.
@ritchie46, @orlp, @c-peters, any comments?
Is there a confirmed Python repro of this bug?
@ritchie46 yes, the same problem occurs when accessing via the Python wrapper.

import polars as pl

OBJECT_STORE_URL = <censored>

# Lazily scan the CSV file
lazy_csv = pl.scan_csv(
    OBJECT_STORE_URL + "more_than_2mb_csv_file.csv",
)

lazy_csv.count().collect()

This results in the following error:
@nameexhaustion can you take a look at this one?
Left a comment - there is indeed a bug - we should be using something like write_all. I tried to, but couldn't get a repro with a 60M CSV file from S3; I suspect there's some environmental factor (maybe low disk space?) that is needed to trigger it. Fortunately, the file cache also does a check on the file size at the end, so currently we at least raise an error instead of silently continuing with truncated data.
The written content length is exactly the length of the buffer on the file handle. I do not know which external factors determine that buffer size; for me it is set to the 2097152 bytes seen in the error. If this is not reproducible for you, the error will occur once the file is big enough to exceed this buffer size. Anyway, I agree that using write_all is more elegant; I did not know of this method. Changed the PR accordingly.
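For reference, a minimal sketch of the write_all-based form the PR moved to, assuming the bytes are already in memory; the function name and path handling are illustrative, not the actual diff:

```rust
use std::path::Path;

use tokio::fs::File;
use tokio::io::AsyncWriteExt;

// `write_all` loops internally until either the entire buffer has been
// written or an error occurs, so no manual retry loop is needed.
async fn write_cache_entry(path: &Path, bytes: &[u8]) -> std::io::Result<()> {
    let mut file = File::create(path).await?;
    file.write_all(bytes).await?;
    Ok(())
}
```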
LGTM 👍 |
Observation

Using the LazyCsvReader in combination with an object store fails when the read file's size exceeds 2MB. The error is thrown in crates/polars-io/src/file_cache/entry.rs when reading the files from cache and comparing their on-disk cache size with the remote size.

Replication

The problem can easily be replicated; all that is required is an object store with a large enough CSV file:

Reason

The file is written to disk using the Tokio file.write API. If successful, this API returns Ok(n), with n being the number of bytes written to the file. n might be smaller than the passed-in number of bytes, indicating that not all bytes have been written to the file. This was not properly handled, so the file stored in the disk cache might contain only the first part of the actual file.

Fix

Loop over the write method until all bytes are written.