Streaming version of polars.LazyFrame.count produces a wrong result when used in aggregation #15049

Closed
2 tasks done
tikhdm opened this issue Mar 14, 2024 · 0 comments · Fixed by #15051
Labels
A-streaming Related to the streaming engine accepted Ready for implementation bug Something isn't working P-medium Priority: medium python Related to Python Polars

tikhdm commented Mar 14, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl
df = pl.DataFrame({
    'g': [1] * 2000,
    'a': ['yes', None] * 1000
}).lazy()

print('Streaming', df.group_by('g').agg(pl.col('a').count()).collect(streaming=True))
print('Non streaming', df.group_by('g').agg(pl.col('a').count()).collect(streaming=False))

The result is:

Streaming shape: (1, 2)
┌─────┬──────┐
│ g   ┆ a    │
│ --- ┆ ---  │
│ i64 ┆ u32  │
╞═════╪══════╡
│ 1   ┆ 2000 │
└─────┴──────┘
Non streaming shape: (1, 2)
┌─────┬──────┐
│ g   ┆ a    │
│ --- ┆ ---  │
│ i64 ┆ u32  │
╞═════╪══════╡
│ 1   ┆ 1000 │
└─────┴──────┘

Log output

RUN STREAMING PIPELINE
df -> primitive_group_by -> ordered_sink
RefCell { value: [] }
keys/aggregates are not partitionable: running default HASH AGGREGATION

Issue description

The count aggregation (here pl.col('a').count()) behaves differently in streaming and non-streaming queries. In non-streaming mode it ignores null values, which matches the documentation.
In streaming mode it returns the number of rows in each group, nulls included.
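
To make the two semantics concrete without depending on any Polars version, here is a minimal plain-Python sketch of what each result amounts to. The helper names count_non_null and count_rows are hypothetical and are not part of Polars; the sketch only models the observed outputs and says nothing about how either engine is implemented.

```python
from collections import defaultdict


def count_non_null(rows):
    """Per-group count that skips None (models the non-streaming result)."""
    counts = defaultdict(int)
    for group, value in rows:
        # A bool adds as 0 or 1, so all-null groups still appear with count 0.
        counts[group] += value is not None
    return dict(counts)


def count_rows(rows):
    """Per-group row count, nulls included (models the streaming result)."""
    counts = defaultdict(int)
    for group, _ in rows:
        counts[group] += 1
    return dict(counts)


# Same data as the reproducible example: one group, every other value null.
rows = list(zip([1] * 2000, ["yes", None] * 1000))

print("non-null count:", count_non_null(rows))  # {1: 1000}
print("row count:     ", count_rows(rows))      # {1: 2000}
```

The two functions differ only in whether a null contributes to the tally, which is exactly the gap between the 1000 and 2000 shown in the results above.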

Expected behavior

I would expect the same result in both streaming and non-streaming modes.

Installed versions

--------Version info---------
Polars:               0.20.15
Index type:           UInt32
Platform:             macOS-14.3.1-arm64-arm-64bit
Python:               3.12.2 (main, Feb  6 2024, 20:19:44) [Clang 15.0.0 (clang-1500.1.0.2.5)]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          <not installed>
connectorx:           <not installed>
deltalake:            <not installed>
fastexcel:            <not installed>
fsspec:               2024.2.0
gevent:               <not installed>
hvplot:               <not installed>
matplotlib:           <not installed>
numpy:                1.26.4
openpyxl:             <not installed>
pandas:               2.2.1
pyarrow:              15.0.1
pydantic:             <not installed>
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           <not installed>
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>
@tikhdm tikhdm added bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars labels Mar 14, 2024
@stinodego stinodego added P-medium Priority: medium A-streaming Related to the streaming engine labels Mar 14, 2024
@stinodego stinodego removed the needs triage Awaiting prioritization by a maintainer label Mar 14, 2024
@github-project-automation github-project-automation bot moved this to Ready in Backlog Mar 14, 2024
@github-project-automation github-project-automation bot moved this from Ready to Done in Backlog Mar 14, 2024
@c-peters c-peters added the accepted Ready for implementation label Mar 18, 2024