fix(rust): Added proper handling of file.write for large remote csv files #18424
Conversation
file.write returns Ok(n), indicating how many bytes could be written to the file. This can be smaller than the length of the passed content. Added a loop so that writing is retried until all of the content has been written to the file.
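For illustration, a minimal sketch of the retry-until-written idea, assuming a tokio::fs::File handle; the helper name `write_fully` and its surroundings are illustrative, not the actual patch:

```rust
use tokio::fs::File;
use tokio::io::AsyncWriteExt;

// Keep calling `write` until the whole buffer has been handed to the file:
// a single call may accept fewer bytes than requested.
async fn write_fully(file: &mut File, mut buf: &[u8]) -> std::io::Result<()> {
    while !buf.is_empty() {
        let n = file.write(buf).await?;
        if n == 0 {
            // The writer accepted nothing; bail out instead of spinning forever.
            return Err(std::io::ErrorKind::WriteZero.into());
        }
        buf = &buf[n..];
    }
    Ok(())
}
```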
Codecov Report

Attention: Patch coverage is

@@           Coverage Diff            @@
##             main   #18424    +/-   ##
=========================================
  Coverage   79.86%   79.87%
=========================================
  Files        1501     1501
  Lines      201813   201815       +2
  Branches     2868     2868
=========================================
+ Hits       161177   161191      +14
+ Misses      40089    40077      -12
  Partials      547      547

☔ View full report in Codecov by Sentry.
@ritchie46, @orlp, @c-peters, any comments?
Is there a confirmed Python repro of this bug?
@ritchie46 yes, the same problem occurs when accessing via the Python wrapper.

import polars as pl

OBJECT_STORE_URL = <censored>

# Lazily scan the CSV file
lazy_csv = pl.scan_csv(
    OBJECT_STORE_URL + "more_than_2mb_csv_file.csv",
)

lazy_csv.count().collect()

This results in the following error:
@nameexhaustion can you take a look at this one?
Left a comment - there is indeed a bug - we should be using something like write_all. I tried to, but couldn't get a repro with a 60M CSV file from S3; I suspect there's some environmental factor (maybe low disk space?) that is needed to trigger it. Fortunately, the file cache also does a check on the file size at the end, so currently we at least raise an error instead of silently continuing with truncated data.
The written content length is exactly the length of the buffer on the file handle. I do not know which external factors determine that buffer size; for me it is set to the 2097152 bytes seen in the error. If this is not reproducible for you, the error will occur once the file is big enough to exceed this buffer size. Anyway, I agree that using write_all is more elegant; I did not know of this method. Changed the PR accordingly.
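For reference, a minimal sketch of the write_all-based form the PR moved to, assuming the bytes are already in memory; the function name and path handling are illustrative, not the actual diff:

```rust
use std::path::Path;

use tokio::fs::File;
use tokio::io::AsyncWriteExt;

// `write_all` loops internally until either the entire buffer has been
// written or an error occurs, so no manual retry loop is needed.
async fn write_cache_entry(path: &Path, bytes: &[u8]) -> std::io::Result<()> {
    let mut file = File::create(path).await?;
    file.write_all(bytes).await?;
    Ok(())
}
```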
LGTM 👍 |
Observation

Using the LazyCsvReader in combination with an object store fails when the read file's size exceeds 2MB. The error is thrown in crates/polars-io/src/file_cache/entry.rs when reading the files from cache and comparing their on-disk cache size with the remote size.

Replication

The problem can easily be replicated; all that is required is an object store with a large enough CSV file:

Reason

The file is written to disk using the Tokio file.write API. If successful, this API returns Ok(n), with n being the number of bytes written to the file. n might be smaller than the passed-in number of bytes, indicating that not all bytes have been written to the file. This was not properly handled, so the file stored in the disk cache might contain only the first part of the actual file.

Fix

Loop over the write method until all bytes are written.