Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(rust): Added proper handling of file.write for large remote csv files #18424

Merged
merged 3 commits into from
Sep 2, 2024

Conversation

WbaN314
Copy link
Contributor

@WbaN314 WbaN314 commented Aug 28, 2024

Observation

Using the LazyCsvReader in combination with an object store fails in case the read files size exceeds 2MB.

The error is thrown in crates/polars-io/src/file_cache/entry.rs when reading the files from cache and comparing their on disk cache size with the remote size.

Err(Context { error: Context { error: ComputeError(ErrString("downloaded file size (2097152) does not match expected size (12004337)")), msg: ErrString("'csv scan' failed") }, msg: ErrString("'select' input failed to resolve") })

Replication

The problem can easily be replicated, all that is required is an object store with a large enough csv file:

    #[tokio::test(flavor = "multi_thread")]
    async fn polars_read_single_csv() {
        let path = OBJECT_STORE_BASE_URL.to_string()
            + "more_than_2mb_csv_file.csv";
        let path = Path::new(&path);
        let lazy_frame = LazyCsvReader::new(path)
            .finish()
            .unwrap();
        println!("{:?}", lazy_frame.count().collect())
    }

Reason

The file is written to disk using the Tokio file.write API. If successful, this API returns Ok(n) with n being the number of bytes written to the file. n might be smaller than the passed in number of bytes, indicating that not all bytes have been written to the file. This was not properly handled and the file stored in the disk cache might be only the first part of the actual file.

Fix

Loop over the write method until all bytes are written.

file.write returns Ok(n) indicating how many bytes could be written to the file. This can be sammer than the actual passed content. Added loop in case not all content could be written to the file at first go.
@WbaN314 WbaN314 changed the title Added proper handling of file.write fix(rust): Added proper handling of file.write Aug 28, 2024
@github-actions github-actions bot added fix Bug fix rust Related to Rust Polars and removed title needs formatting labels Aug 28, 2024
@WbaN314 WbaN314 changed the title fix(rust): Added proper handling of file.write fix(rust): Added proper handling of file.write for large remote csv files Aug 28, 2024
Copy link

codecov bot commented Aug 28, 2024

Codecov Report

Attention: Patch coverage is 66.66667% with 1 line in your changes missing coverage. Please review.

Project coverage is 79.87%. Comparing base (85feb33) to head (793e50a).
Report is 6 commits behind head on main.

Files with missing lines Patch % Lines
crates/polars-io/src/cloud/polars_object_store.rs 66.66% 1 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main   #18424   +/-   ##
=======================================
  Coverage   79.86%   79.87%           
=======================================
  Files        1501     1501           
  Lines      201813   201815    +2     
  Branches     2868     2868           
=======================================
+ Hits       161177   161191   +14     
+ Misses      40089    40077   -12     
  Partials      547      547           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@WbaN314
Copy link
Contributor Author

WbaN314 commented Sep 2, 2024

@ritchie46, @orlp, @c-peters, any comments?

@ritchie46
Copy link
Member

Is there a confirmed Python repro of this bug?

@WbaN314
Copy link
Contributor Author

WbaN314 commented Sep 2, 2024

@ritchie46 yes, the same problem occurs when accessing via the Python wrapper.

import polars as pl

OBJECT_STORE_URL = <censored>

# Lazy scan csv file
lazy_csv = pl.scan_csv(
    OBJECT_STORE_URL + "more_than_2mb_csv_file.csv",
)
lazy_csv.count().collect()

Results in the following error:

File "<path>/.venv/lib/python3.12/site-packages/polars/lazyframe/frame.py", line 2027, in collect
    return wrap_df(ldf.collect(callback))
                   ^^^^^^^^^^^^^^^^^^^^^
polars.exceptions.ComputeError: downloaded file size (2097152) does not match expected size (12004337)

This error occurred with the following context stack:
        [1] 'csv scan' failed
        [2] 'select' input failed to resolve

@ritchie46
Copy link
Member

@nameexhaustion can you take a look at this one?

@nameexhaustion
Copy link
Collaborator

Left a comment - there is indeed a bug - we should be using something like write_all() instead.

I tried to but couldn't get a repro with a with a 60M CSV file from S3, I suspect there's some environmental factor (maybe low disk space?) that is needed to trigger it. Fortunately the file cache also does a check at the end on the file size, so currently we at least raise an error instead of silently continuing with truncated data.

@WbaN314
Copy link
Contributor Author

WbaN314 commented Sep 2, 2024

The written content length is exactly the length of the buffer on the file handle. I do not know which external factors determine that buffer size. For me it is set to the 2097152 bytes seen in the error. If this is not reproducible for you, the error will occur once the file is big enough to exceed this buffer size.

Anyways i agree that using write_all is more elegant, did not know of this method. Changed the PR accordingly.

@nameexhaustion
Copy link
Collaborator

LGTM 👍

@ritchie46 ritchie46 merged commit 8a143c0 into pola-rs:main Sep 2, 2024
21 checks passed
@WbaN314 WbaN314 deleted the LazyCsvReader_bug branch September 2, 2024 14:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
fix Bug fix rust Related to Rust Polars
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants