min/max operations on list with empty and/or None elements #13978

Closed
2 tasks done
FBruzzesi opened this issue Jan 25, 2024 · 4 comments · Fixed by #14018
Labels: A-dtype-list/array (Area: list/array data type), bug (Something isn't working), P-medium (Priority: medium), python (Related to Python Polars)

Comments

@FBruzzesi
Contributor

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl

print(
    pl.DataFrame({"lst": [[], [1, 2, 3]]}).select(pl.col("lst").list.max()),
    pl.DataFrame({"lst": [[], [None, 1], [1, 2, 3]]}).select(pl.col("lst").list.max()),
    sep="\n"
)

Log output

shape: (2, 1)
┌──────────────────────┐
│ lst                  │
│ ---                  │
│ i64                  │
╞══════════════════════╡
│ -9223372036854775808 │
│ 3                    │
└──────────────────────┘
shape: (3, 1)
┌──────┐
│ lst  │
│ ---  │
│ i64  │
╞══════╡
│ null │
│ 1    │
│ 3    │
└──────┘

Issue description

Breaking this down into two cases:

  • If every list is either empty or contains only integers, min/max return an extreme value of the dtype for the empty list instead of null (for max on i64, -9223372036854775808).
  • If at least one of the lists also contains a None, the behavior is different: the empty list yields null.

Expected behavior

I would expect the first case to behave like the second, namely to output null for empty lists.

Installed versions

--------Version info---------
Polars:               0.20.5
Index type:           UInt32
Platform:             Linux-5.15.133.1-microsoft-standard-WSL2-x86_64-with-glibc2.35
Python:               3.10.11 (main, May 16 2023, 00:28:57) [GCC 11.2.0]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          2.2.1
connectorx:           <not installed>
deltalake:            <not installed>
fsspec:               2023.1.0
gevent:               23.9.0.post1
hvplot:               <not installed>
matplotlib:           3.7.1
numpy:                1.24.4
openpyxl:             <not installed>
pandas:               2.0.3
pyarrow:              12.0.1
pydantic:             2.4.0
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           2.0.17
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>
FBruzzesi added the bug (Something isn't working), needs triage (Awaiting prioritization by a maintainer), and python (Related to Python Polars) labels on Jan 25, 2024
FBruzzesi changed the title from "min/max operations on list empty and none elements" to "min/max operations on list with empty and/or None elements" on Jan 25, 2024
@Julian-J-S
Contributor

Can reproduce. Interesting and weird bug 😮

pl.DataFrame(
    {"lst": [[1, 2, 3], []]},
).with_columns(max=pl.col("lst").list.max())
# shape: (2, 2)
# ┌───────────┬──────────────────────┐
# │ lst       ┆ max                  │
# │ ---       ┆ ---                  │
# │ list[i64] ┆ i64                  │
# ╞═══════════╪══════════════════════╡
# │ [1, 2, 3] ┆ 3                    │
# │ []        ┆ -9223372036854775808 │ <<<<<<<<<<<< whoooooopsi 😱 
# └───────────┴──────────────────────┘

pl.DataFrame(
    {"lst": [[], [None, 1], [1, 2, 3]]},
).with_columns(max=pl.col("lst").list.max())
# shape: (3, 2)
# ┌───────────┬──────┐
# │ lst       ┆ max  │
# │ ---       ┆ ---  │
# │ list[i64] ┆ i64  │
# ╞═══════════╪══════╡
# │ []        ┆ null │ <<<<<<<<<<<<<<<<<<<<<< same value but correct? 🤔 
# │ [null, 1] ┆ 1    │
# │ [1, 2, 3] ┆ 3    │
# └───────────┴──────┘

@taki-mekhalfa
Contributor

I can reproduce this in Python, but not in Rust:

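use std::fs;

use polars::prelude::*;

fn main() {
    // Read a parquet file holding one list[i64] column "c" with an empty list.
    let file = fs::File::open("list.parquet").unwrap();

    let df = ParquetReader::new(file).finish().unwrap();
    println!("{df}");
    // ==>
    // shape: (2, 1)
    // ┌───────────┐
    // │ c         │
    // │ ---       │
    // │ list[i64] │
    // ╞═══════════╡
    // │ [1, 2, 3] │
    // │ []        │
    // └───────────┘
    // <==

    // Take the per-list minimum; the empty list yields null here.
    let df = df.lazy().select([col("c").list().min()]).collect().unwrap();
    println!("{df}");
    // ==>
    // shape: (2, 1)
    // ┌──────┐
    // │ c    │
    // │ ---  │
    // │ i64  │
    // ╞══════╡
    // │ 1    │
    // │ null │
    // └──────┘
    // <==
}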

@cmdlineluser
Contributor

There seem to be different code paths depending on whether the lists contain nulls:

if has_inner_nulls(ca) {
    return inner(ca);
};
match ca.inner_dtype() {
    dt if dt.is_numeric() => Ok(max_list_numerical(ca, &dt)),
    _ => inner(ca),

>>> pl.Series([[]], dtype=pl.List(pl.Int64)).list.max()
shape: (1,)
Series: '' [i64]
[
	-9223372036854775808
]
>>> pl.Series([[]], dtype=pl.List(pl.Float64)).list.max()
shape: (1,)
Series: '' [f64]
[
	-inf
]
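
To illustrate why two code paths could produce these results, here is a plain-Python sketch. It is not the Polars source: `max_fast_path` and `max_null_aware` are hypothetical names, and the idea that the numeric path folds from the smallest value of the dtype is an assumption that merely matches the `-9223372036854775808` and `-inf` outputs above.

```python
from typing import Optional, Sequence

I64_MIN = -(2**63)  # smallest value an i64 can hold


def max_fast_path(values: Sequence[int]) -> int:
    """Hypothetical null-free numeric path: fold starting from the dtype minimum."""
    acc = I64_MIN
    for v in values:
        if v > acc:
            acc = v
    # An empty list never updates acc, so the seed value leaks out.
    return acc


def max_null_aware(values: Sequence[Optional[int]]) -> Optional[int]:
    """Hypothetical general path: skip nulls, return None when nothing is left."""
    present = [v for v in values if v is not None]
    return max(present) if present else None


print(max_fast_path([]))          # -9223372036854775808
print(max_null_aware([]))         # None
print(max_null_aware([None, 1]))  # 1
```

Under this assumption, a single null anywhere in the column routes every list through the null-aware path, which would explain why the empty list only yields null in the second example.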

reswqa added the P-medium (Priority: medium) and A-dtype-list/array (Area: list/array data type) labels and removed the needs triage (Awaiting prioritization by a maintainer) label on Jan 26, 2024
reswqa self-assigned this on Jan 26, 2024
@reswqa
Collaborator

reswqa commented Jan 26, 2024

Thanks for reporting this, will take a look.

5 participants