Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Raise informative error message when group_by(by=[...]) is used #17540

Closed
2 tasks done
barak1412 opened this issue Jul 10, 2024 · 6 comments · Fixed by #17654
Closed
2 tasks done

Raise informative error message when group_by(by=[...]) is used #17540

barak1412 opened this issue Jul 10, 2024 · 6 comments · Fixed by #17654
Labels
A-exceptions Area: exception handling python Related to Python Polars

Comments

@barak1412
Copy link
Contributor

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl


df = pl.DataFrame({
    'a': [0, 1, 1, 2, 2],
    'b': [0, 0, 1, 2, 2],
    'v': [9, 5, 4, 8, 3]
})

df_grouped = df.group_by(by=['a', 'b']).agg(pl.sum('v'))
print(df_grouped)

Results in:

shape: (1, 2)
┌────────────┬─────┐
│ by         ┆ v   │
│ ---        ┆ --- │
│ list[str]  ┆ i64 │
╞════════════╪═════╡
│ ["a", "b"] ┆ 29  │
└────────────┴─────┘

However, the following:

import polars as pl


df = pl.DataFrame({
    'a': [0, 1, 1, 2, 2],
    'b': [0, 0, 1, 2, 2],
    'v': [9, 5, 4, 8, 3]
})

df_grouped = df.group_by(by=['a', 'b']).agg(pl.sum('v'))
print(df_grouped)

Results in:

shape: (4, 3)
┌─────┬─────┬─────┐
│ a   ┆ b   ┆ v   │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ 2   ┆ 2   ┆ 11  │
│ 1   ┆ 0   ┆ 5   │
│ 0   ┆ 0   ┆ 9   │
│ 1   ┆ 1   ┆ 4   │
└─────┴─────┴─────┘

Log output

No response

Issue description

When using the by kwarg, we get List type as column.

Expected behavior

I would expect using the by kwarg to yeild the same result.

Installed versions

--------Version info---------
Polars:               1.1.0
Index type:           UInt32
Platform:             Windows-10-10.0.19041-SP0
Python:               3.8.1 (tags/v3.8.1:1b293b6, Dec 18 2019, 23:11:46) [MSC v.1916 64 bit (AMD64)]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          2.2.0
connectorx:           0.3.3a1
deltalake:            <not installed>
fastexcel:            <not installed>
fsspec:               2023.10.0
gevent:               <not installed>
great_tables:         <not installed>
hvplot:               <not installed>
matplotlib:           3.5.1
nest_asyncio:         1.5.1
numpy:                1.24.4
openpyxl:             <not installed>
pandas:               2.0.3
pyarrow:              13.0.0
pydantic:             1.8.2
pyiceberg:            <not installed>
sqlalchemy:           1.4.32
torch:                <not installed>
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>
@barak1412 barak1412 added bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars labels Jul 10, 2024
@deanm0000
Copy link
Collaborator

repeat of this.

@deanm0000 deanm0000 added invalid A bug report that is not actually a bug and removed bug Something isn't working needs triage Awaiting prioritization by a maintainer labels Jul 10, 2024
@barak1412
Copy link
Contributor Author

@deanm0000 I saw @ritchie46 comment there about using kwarg as syntactic suger, but it still feels very odd to get a list type when 2 columns are supplied, as result.

Are you sure it is intended?

@MarcoGorelli
Copy link
Collaborator

yeah, the list just gets interpreted as a literal, just like how it would here:

In [15]: df = pl.DataFrame({'a': [1,2,3], 'b': [4,5,6]})

In [16]: df.with_columns(by=['a', 'b'])
Out[16]:
shape: (3, 3)
┌─────┬─────┬────────────┐
│ a   ┆ b   ┆ by         │
│ --- ┆ --- ┆ ---        │
│ i64 ┆ i64 ┆ list[str]  │
╞═════╪═════╪════════════╡
│ 1   ┆ 4   ┆ ["a", "b"] │
│ 2   ┆ 5   ┆ ["a", "b"] │
│ 3   ┆ 6   ┆ ["a", "b"] │
└─────┴─────┴────────────┘

@MarcoGorelli
Copy link
Collaborator

having said that, there's no valid use case for grouping by a list literal...I think it might be worthwhile to raise an informative error here

@MarcoGorelli MarcoGorelli added A-exceptions Area: exception handling and removed invalid A bug report that is not actually a bug labels Jul 10, 2024
@MarcoGorelli MarcoGorelli changed the title Inconsistent result of group_by with by kwarg Raise informative error message when group_by(by=[...]) is used Jul 10, 2024
@barak1412
Copy link
Contributor Author

@MarcoGorelli Agree

@deanm0000
Copy link
Collaborator

It is intended that you don't use by as a named argument rather than positional. Like I said, I don't know what is causing the difference in behavior so I don't think it is explicitly intended that the behavior changed just that the old behavior was more of an accident for lack of a better term.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-exceptions Area: exception handling python Related to Python Polars
Projects
None yet
3 participants