
len/count regression since 0.20.6 - 11x times slower in sample #14619

Closed
2 tasks done
yuuuxt opened this issue Feb 21, 2024 · 5 comments · Fixed by #14644
Labels: accepted (Ready for implementation), bug (Something isn't working), needs triage (Awaiting prioritization by a maintainer), python (Related to Python Polars)

Comments

yuuuxt commented Feb 21, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

# imports
from pathlib import Path
import pandas as pd
import numpy as np

data_path = Path(r"some-temp-path-here")

assert data_path.exists()

import polars as pl

print(pl.__version__)

# data prep
rng = np.random.default_rng()

sample_data_gen_4m = pd.DataFrame(rng.integers(low=100_000_000_000_000, high=900_000_000_000_000, size=4_000_000), columns=["col_1"]).astype(str)
sample_data_4m = pd.DataFrame({"col_1": ["100000000000000"]*4_000_000})

sample_data_gen_4m.to_parquet(data_path / "sample_data_gen_4m.parquet")
sample_data_4m.to_parquet(data_path / "sample_data_4m.parquet")


# timeit - showing len only, count is similar

# issue exists on random generated data
# 0.20.5  - 253 ms ± 3.68 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# 0.20.6  - 2.28 s ± 7.81 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# 0.20.10 - 3.03 s ± 19.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit pl.scan_parquet(data_path / "sample_data_gen_4m.parquet").select(pl.len()).collect()

# no issue in this case
# 0.20.5   - 79.9 ms ± 116 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# 0.20.6   - 114 ms ± 604 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# 0.20.10 - 113 ms ± 545 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit pl.scan_parquet(data_path / "sample_data_4m.parquet").select(pl.len()).collect()

Log output

No response

Issue description

Counting Parquet rows has become significantly slower since 0.20.6 (in a real-world case it makes the code about 20x slower).

Expected behavior

The time cost should be similar to that of 0.20.5.

Installed versions

--------Version info---------
Polars:               0.20.10
Index type:           UInt32
Platform:             Windows-10-10.0.17763-SP0
Python:               3.9.7 (default, Sep 16 2021, 16:59:28) [MSC v.1916 64 bit (AMD64)]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          2.0.0
connectorx:           <not installed>
deltalake:            <not installed>
fsspec:               2021.10.1
gevent:               21.8.0
hvplot:               <not installed>
matplotlib:           3.4.3
numpy:                1.24.2
openpyxl:             3.0.9
pandas:               1.5.3
pyarrow:              12.0.1
pydantic:             2.5.2
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           1.4.22
xlsx2csv:             <not installed>
xlsxwriter:           3.0.1
yuuuxt added the bug, needs triage, and python labels on Feb 21, 2024
ritchie46 (Member) commented Feb 22, 2024

Can we get a bisect to see which commit introduced the slow down?

I expect this is due to the new string type. Isn't reading the file just slower?
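
A coarse, release-level pass can narrow the search before a commit-level git bisect. The sketch below is only an assumption about how one might automate that pass (the file name and version list are taken from the timings above); it is not part of the original report.

# Hypothetical helper: install each released version in turn and time the
# query in a fresh interpreter, to see which release introduced the slowdown.
import subprocess
import sys

VERSIONS = ["0.20.5", "0.20.6", "0.20.7", "0.20.8", "0.20.9", "0.20.10"]
QUERY = """
import time
import polars as pl
t0 = time.perf_counter()
pl.scan_parquet("sample_data_gen_4m.parquet").select(pl.len()).collect()
print(f"{time.perf_counter() - t0:.3f} s")
"""

for version in VERSIONS:
    subprocess.run(
        [sys.executable, "-m", "pip", "install", "--quiet", f"polars=={version}"],
        check=True,
    )
    print(version, end="  ", flush=True)
    # A fresh interpreter picks up the freshly installed version.
    subprocess.run([sys.executable, "-c", QUERY], check=True)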

ritchie46 (Member) commented:

I do see a difference in performance on Linux, but not as drastic as on Windows. I believe the Windows slowdown is due to the allocator we compile with on Windows, which is very bad. We have to try to switch to the default allocator.

ritchie46 (Member) commented:

I think the difference is due to UTF-8 validation. This is much more expensive now because we can no longer do it on the whole buffer at once:

[screenshot attached in the original comment]

I will see if we can do something smart here.
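
A rough, pure-Python analogue of that cost difference (illustration only, not Polars internals; the data shape mirrors the repro above) might look like this:

# Validating one contiguous buffer vs. validating each value separately.
# The per-value loop pays per-call overhead 4 million times.
import time

values = [b"100000000000000"] * 4_000_000
buffer = b"".join(values)          # one contiguous buffer

t0 = time.perf_counter()
buffer.decode("utf-8")             # single validation pass over all bytes
t1 = time.perf_counter()
for v in values:
    v.decode("utf-8")              # one validation call per value
t2 = time.perf_counter()

print(f"whole buffer: {t1 - t0:.3f} s, per value: {t2 - t1:.3f} s")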

ritchie46 (Member) commented:

I found that we accidentally introduced quadratic behavior for the new string type. This is resolved by #14705.
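
For readers following along, the general shape of such a bug (a hypothetical sketch, not the actual Polars code) is redoing work over everything accumulated so far on every append:

# Hypothetical sketch of accidental quadratic behavior: if each appended
# chunk triggers validation of the whole accumulated buffer, total work is
# O(n^2) in the number of chunks instead of O(n).
def validate(buf: bytes) -> None:
    buf.decode("utf-8")  # stand-in for UTF-8 validation

def build_quadratic(chunks: list[bytes]) -> bytes:
    out = b""
    for chunk in chunks:
        out += chunk
        validate(out)      # re-validates everything seen so far: quadratic
    return out

def build_linear(chunks: list[bytes]) -> bytes:
    for chunk in chunks:
        validate(chunk)    # validates only the new bytes: linear
    return b"".join(chunks)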

yuuuxt (Author) commented Feb 28, 2024

Version 0.20.11 indeed fixes this issue:

# issue exists on random generated data
# 0.20.5  - 253 ms ± 3.68 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# 0.20.6  - 2.28 s ± 7.81 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# 0.20.10 - 3.03 s ± 19.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# 0.20.11 - 747 µs ± 5.34 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit pl.scan_parquet(data_path / "sample_data_gen_4m.parquet").select(pl.len()).collect()

# no issue in this case
# 0.20.5   - 79.9 ms ± 116 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# 0.20.6   - 114 ms ± 604 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# 0.20.10 - 113 ms ± 545 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# 0.20.11 - 687 µs ± 5.12 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit pl.scan_parquet(data_path / "sample_data_4m.parquet").select(pl.len()).collect()

Thanks for the quick fix!
