
len/count regression since 0.20.6 - 11x times slower in sample #14619

Closed
2 tasks done
yuuuxt opened this issue Feb 21, 2024 · 5 comments · Fixed by #14644
Labels: accepted (Ready for implementation), bug (Something isn't working), needs triage (Awaiting prioritization by a maintainer), python (Related to Python Polars)

Comments

yuuuxt commented Feb 21, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

# imports
from pathlib import Path
import pandas as pd
import numpy as np

data_path = Path(r"some-temp-path-here")

assert data_path.exists()

import polars as pl

print(pl.__version__)

# data prep
rng = np.random.default_rng()

sample_data_gen_4m = pd.DataFrame(rng.integers(low=100_000_000_000_000, high=900_000_000_000_000, size=4_000_000), columns=["col_1"]).astype(str)
sample_data_4m = pd.DataFrame({"col_1": ["100000000000000"]*4_000_000})

sample_data_gen_4m.to_parquet(data_path / "sample_data_gen_4m.parquet")
sample_data_4m.to_parquet(data_path / "sample_data_4m.parquet")


# timeit - showing len only, count is similar

# issue exists on random generated data
# 0.20.5  - 253 ms ± 3.68 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# 0.20.6  - 2.28 s ± 7.81 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# 0.20.10 - 3.03 s ± 19.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit pl.scan_parquet(data_path / "sample_data_gen_4m.parquet").select(pl.len()).collect()

# no issue in this case
# 0.20.5   - 79.9 ms ± 116 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# 0.20.6   - 114 ms ± 604 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# 0.20.10 - 113 ms ± 545 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit pl.scan_parquet(data_path / "sample_data_4m.parquet").select(pl.len()).collect()

Log output

No response

Issue description

Counting Parquet rows has become significantly slower since 0.20.6 (in a real-world case it makes the code about 20x slower).

Expected behavior

The time cost should be similar to that of 0.20.5.

Installed versions

--------Version info---------
Polars:               0.20.10
Index type:           UInt32
Platform:             Windows-10-10.0.17763-SP0
Python:               3.9.7 (default, Sep 16 2021, 16:59:28) [MSC v.1916 64 bit (AMD64)]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          2.0.0
connectorx:           <not installed>
deltalake:            <not installed>
fsspec:               2021.10.1
gevent:               21.8.0
hvplot:               <not installed>
matplotlib:           3.4.3
numpy:                1.24.2
openpyxl:             3.0.9
pandas:               1.5.3
pyarrow:              12.0.1
pydantic:             2.5.2
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           1.4.22
xlsx2csv:             <not installed>
xlsxwriter:           3.0.1
yuuuxt added the bug, needs triage, and python labels on Feb 21, 2024
ritchie46 (Member) commented Feb 22, 2024

Can we get a bisect to see which commit introduced the slow down?

I expect this is due to the new string type. Isn't reading the file just slower?
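
A coarse, release-level pass can narrow the search before a commit-level git bisect. The sketch below is only an assumption about how one might automate that pass (the file name and version list are taken from the timings above); it is not part of the original report.

# Hypothetical helper: install each released version in turn and time the
# query in a fresh interpreter, to see which release introduced the slowdown.
import subprocess
import sys

VERSIONS = ["0.20.5", "0.20.6", "0.20.7", "0.20.8", "0.20.9", "0.20.10"]
QUERY = """
import time
import polars as pl
t0 = time.perf_counter()
pl.scan_parquet("sample_data_gen_4m.parquet").select(pl.len()).collect()
print(f"{time.perf_counter() - t0:.3f} s")
"""

for version in VERSIONS:
    subprocess.run(
        [sys.executable, "-m", "pip", "install", "--quiet", f"polars=={version}"],
        check=True,
    )
    print(version, end="  ", flush=True)
    # A fresh interpreter picks up the freshly installed version.
    subprocess.run([sys.executable, "-c", QUERY], check=True)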

ritchie46 (Member) commented:

I do see a difference in performance on Linux, but not as drastic as on Windows. I believe the Windows slowdown is due to the allocator we compile with on Windows, which is very bad. We have to try to switch to the default allocator.

ritchie46 (Member) commented:

I think the difference is due to UTF-8 validation. This is much more expensive now because we can no longer do it on the whole buffer at once:

[screenshot attached in the original comment]

I will see if we can do something smart here.
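
A rough, pure-Python analogue of that cost difference (illustration only, not Polars internals; the data shape mirrors the repro above) might look like this:

# Validating one contiguous buffer vs. validating each value separately.
# The per-value loop pays per-call overhead 4 million times.
import time

values = [b"100000000000000"] * 4_000_000
buffer = b"".join(values)          # one contiguous buffer

t0 = time.perf_counter()
buffer.decode("utf-8")             # single validation pass over all bytes
t1 = time.perf_counter()
for v in values:
    v.decode("utf-8")              # one validation call per value
t2 = time.perf_counter()

print(f"whole buffer: {t1 - t0:.3f} s, per value: {t2 - t1:.3f} s")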

ritchie46 (Member) commented:

I found that we accidentally introduced quadratic behavior for the new string type. This is resolved by #14705.
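
For readers following along, the general shape of such a bug (a hypothetical sketch, not the actual Polars code) is redoing work over everything accumulated so far on every append:

# Hypothetical sketch of accidental quadratic behavior: if each appended
# chunk triggers validation of the whole accumulated buffer, total work is
# O(n^2) in the number of chunks instead of O(n).
def validate(buf: bytes) -> None:
    buf.decode("utf-8")  # stand-in for UTF-8 validation

def build_quadratic(chunks: list[bytes]) -> bytes:
    out = b""
    for chunk in chunks:
        out += chunk
        validate(out)      # re-validates everything seen so far: quadratic
    return out

def build_linear(chunks: list[bytes]) -> bytes:
    for chunk in chunks:
        validate(chunk)    # validates only the new bytes: linear
    return b"".join(chunks)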

yuuuxt (Author) commented Feb 28, 2024

Version 0.20.11 indeed fixes this issue:

# issue exists on random generated data
# 0.20.5  - 253 ms ± 3.68 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# 0.20.6  - 2.28 s ± 7.81 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# 0.20.10 - 3.03 s ± 19.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# 0.20.11 - 747 µs ± 5.34 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit pl.scan_parquet(data_path / "sample_data_gen_4m.parquet").select(pl.len()).collect()

# no issue in this case
# 0.20.5   - 79.9 ms ± 116 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# 0.20.6   - 114 ms ± 604 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# 0.20.10 - 113 ms ± 545 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# 0.20.11 - 687 µs ± 5.12 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit pl.scan_parquet(data_path / "sample_data_4m.parquet").select(pl.len()).collect()

Thanks for the quick fix!
