Non-deterministic and incomplete group_by in 0.20.11 #14749

Closed
2 tasks done
seanlane opened this issue Feb 28, 2024 · 11 comments · Fixed by #14754
Labels
accepted Ready for implementation bug Something isn't working P-critical Priority: critical python Related to Python Polars

Comments

@seanlane

seanlane commented Feb 28, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl

bug_source = pl.read_csv('https://gist.githubusercontent.com/seanlane/658541735a08ad474962088b0ceaf5c2/raw/4a676ca37068283c9ef2e89a92522658ff0d60bb/truncated_bug_source_exploded.csv')

polars_df = (
    bug_source
    .group_by(pl.col('disk_hex'))
    .agg(
        pl.mean('cell_tract_median_value').alias('disk_tract_median_value'),
        pl.mean('cell_tract_mean_value').alias('disk_tract_mean_value'),
        pl.sum('cell_item_count').alias('disk_item_count'),
        pl.len().alias('disks_populated')
    )
)
print(polars_df.describe())
print(polars_df.group_by(pl.col('disk_hex')).agg(pl.len()).describe())

Log output

avg line length: 59.370117
std. dev. line length: 1.020644
initial row estimate: 24956
no. of chunks: 16 processed by: 16 threads.
estimated unique values: 24369
estimated unique count: 24369 exceeded the boundary: 1000, running default HASH AGGREGATION
shape: (9, 6)
┌────────────┬───────────┬─────────────────────────┬───────────────────────┬─────────────────┬─────────────────┐
│ statistic  ┆ disk_hex  ┆ disk_tract_median_value ┆ disk_tract_mean_value ┆ disk_item_count ┆ disks_populated │
│ ---        ┆ ---       ┆ ---                     ┆ ---                   ┆ ---             ┆ ---             │
│ str        ┆ f64       ┆ f64                     ┆ f64                   ┆ f64             ┆ f64             │
╞════════════╪═══════════╪═════════════════════════╪═══════════════════════╪═════════════════╪═════════════════╡
│ count      ┆ 17209.0   ┆ 17209.0                 ┆ 17209.0               ┆ 17209.0         ┆ 17209.0         │
│ null_count ┆ 0.0       ┆ 0.0                     ┆ 0.0                   ┆ 0.0             ┆ 0.0             │
│ mean       ┆ 6.1312e17 ┆ 247295.082002           ┆ 247295.082002         ┆ 2.862921        ┆ 1.452728        │
│ std        ┆ 1.2257e14 ┆ 115951.831569           ┆ 115951.831569         ┆ 3.613228        ┆ 1.054252        │
│ min        ┆ 6.1271e17 ┆ 31866.12                ┆ 31866.12              ┆ 1.0             ┆ 1.0             │
│ 25%        ┆ 6.1316e17 ┆ 157374.64               ┆ 157374.64             ┆ 1.0             ┆ 1.0             │
│ 50%        ┆ 6.1316e17 ┆ 233493.99               ┆ 233493.99             ┆ 1.0             ┆ 1.0             │
│ 75%        ┆ 6.1316e17 ┆ 315333.295              ┆ 315333.295            ┆ 3.0             ┆ 2.0             │
│ max        ┆ 6.1370e17 ┆ 741125.49               ┆ 741125.49             ┆ 31.0            ┆ 13.0            │
└────────────┴───────────┴─────────────────────────┴───────────────────────┴─────────────────┴─────────────────┘
estimated unique values: 17209
estimated unique count: 17209 exceeded the boundary: 1000, running default HASH AGGREGATION
shape: (9, 3)
┌────────────┬───────────┬──────────┐
│ statistic  ┆ disk_hex  ┆ len      │
│ ---        ┆ ---       ┆ ---      │
│ str        ┆ f64       ┆ f64      │
╞════════════╪═══════════╪══════════╡
│ count      ┆ 17165.0   ┆ 17165.0  │
│ null_count ┆ 0.0       ┆ 0.0      │
│ mean       ┆ 6.1312e17 ┆ 1.002563 │
│ std        ┆ 1.2271e14 ┆ 0.050566 │
│ min        ┆ 6.1271e17 ┆ 1.0      │
│ 25%        ┆ 6.1316e17 ┆ 1.0      │
│ 50%        ┆ 6.1316e17 ┆ 1.0      │
│ 75%        ┆ 6.1316e17 ┆ 1.0      │
│ max        ┆ 6.1370e17 ┆ 2.0      │
└────────────┴───────────┴──────────┘

Issue description

Forgive me for using a link to read data from this GitHub Gist, but I was unable to reproduce the issue with fewer than roughly 20,000 rows.

We're using the H3 library to index metrics about areas and then perform aggregations using the neighboring cells within a certain radius. In Polars, we take each H3 cell, use the library to find all neighbors within a disk of a certain radius, explode the array of disk cells, and then group by the ID of the disk cell, aggregating on metrics of interest across the disk of a given cell.
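
For context, a minimal sketch of that workflow (not the production code; the column names and cell IDs here are illustrative, and it assumes the h3 Python package with the v4 API, where h3.grid_disk(cell, k) returns the cells within grid distance k):

import h3      # h3-py v4 API assumed; in v3 the equivalent call is h3.k_ring
import polars as pl

cells = pl.DataFrame({
    'cell_hex': ['8928308280fffff', '8928308280bffff'],  # illustrative H3 cells
    'cell_item_count': [3, 5],
})

disk_metrics = (
    cells
    # find all neighbors within a disk of radius 1 around each cell
    .with_columns(
        pl.col('cell_hex')
        .map_elements(lambda c: list(h3.grid_disk(c, 1)), return_dtype=pl.List(pl.Utf8))
        .alias('disk_hex')
    )
    # explode the array of disk cells, then group by the ID of the disk cell
    .explode('disk_hex')
    .group_by(pl.col('disk_hex'))
    .agg(
        pl.sum('cell_item_count').alias('disk_item_count'),
        pl.len().alias('disks_populated'),
    )
)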

What I noticed is that in dataframes longer than around 20,000 rows, grouping on Disk ID appears to be both nondeterministic (at least, I'm getting differing results each time) and incomplete: the values in the Disk ID column do not become unique after grouping.

I attempted to reproduce this in 0.20.7, 0.20.8, 0.20.9, and 0.20.10, but I could only do so in 0.20.11.

When adding maintain_order=True to the .group_by(pl.col('disk_hex')) call, the nondeterminism of the grouping appears to cease, but the resulting dataframe still includes duplicate values within the Disk ID column.

Expected behavior

After grouping by Disk ID, I expect the values in the Disk ID column to be unique, and grouping again on the same index column should yield a dataframe of the same length.

Grouping with Pandas and DuckDB shows that the expected length of the grouped dataframe should be 17148, but Polars yields a dataframe with a length of 17209. If I perform this grouping on larger dataframes (greater than 200,000 rows), the length of the dataframe produced by Polars can differ between runs.
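
For reference, a rough sketch of that cross-check (not the author's original code; it assumes pandas and duckdb are installed, and relies on duckdb being able to query the in-scope pandas frame by name):

import duckdb
import pandas as pd

url = 'https://gist.githubusercontent.com/seanlane/658541735a08ad474962088b0ceaf5c2/raw/4a676ca37068283c9ef2e89a92522658ff0d60bb/truncated_bug_source_exploded.csv'
bug_source_pd = pd.read_csv(url)

# number of distinct disk_hex groups according to pandas
print(bug_source_pd.groupby('disk_hex').size().shape[0])

# number of distinct disk_hex groups according to duckdb
print(duckdb.sql('SELECT COUNT(DISTINCT disk_hex) FROM bug_source_pd').fetchone()[0])

# both report 17148, whereas the Polars group_by above yields 17209 rows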

Installed versions

--------Version info---------
Polars:               0.20.11
Index type:           UInt32
Platform:             macOS-12.5-x86_64-i386-64bit
Python:               3.11.3 (main, Apr  5 2023, 15:47:48) [Clang 13.1.6 (clang-1316.0.21.2.5)]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          <not installed>
connectorx:           <not installed>
deltalake:            <not installed>
fsspec:               <not installed>
gevent:               <not installed>
hvplot:               <not installed>
matplotlib:           <not installed>
numpy:                1.26.4
openpyxl:             <not installed>
pandas:               2.2.1
pyarrow:              15.0.0
pydantic:             <not installed>
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           <not installed>
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>
None
@seanlane seanlane added bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars labels Feb 28, 2024
@MarcoGorelli MarcoGorelli added P-high Priority: high and removed needs triage Awaiting prioritization by a maintainer labels Feb 28, 2024
@mirkomiorelli

mirkomiorelli commented Feb 28, 2024

When adding maintain_order=True to the .group_by(pl.col('disk_hex')) call, the nondeterminism of the grouping appears to cease, but the resulting dataframe still includes duplicate values within the Disk ID column.

I am having the same issue working with dataframes larger than 20K records. I found that adding a sort on the grouping column before the group_by solves the non-deterministic and non-unique behavior for me on 0.20.11. E.g., in the gist, adding .sort(by='disk_hex') before the .group_by(pl.col('disk_hex')) returns the expected result, if I am interpreting the log correctly.
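
Concretely, a sketch of that workaround applied to the reproducible example at the top of this issue (bug_source and the imports as defined there):

sorted_grouped = (
    bug_source
    .sort(by='disk_hex')  # sorting the grouping column first sidesteps the bug on 0.20.11
    .group_by(pl.col('disk_hex'))
    .agg(pl.len().alias('n'))
)
# with the sort in place, the grouping key comes out unique
assert sorted_grouped['disk_hex'].n_unique() == sorted_grouped.height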

@seanlane
Author

@mirkomiorelli I'm seeing that as well: adding .sort(by='disk_hex') before the .group_by(pl.col('disk_hex')) returns the expected number of rows (17,148). It also corrects the issue when grouping on a larger dataframe with more than 250,000 rows.

@jmoralez

I'm seeing something similar for max().over(); it seems to occur when the input isn't sorted and n_rows > 1,000. Here's a minimal example:

import numpy as np
import polars as pl

n_groups = 250
rows_per_group = 4
(
    pl.DataFrame(
        {
            'grp': np.repeat(np.arange(n_groups), rows_per_group),
            'x': np.tile(np.arange(rows_per_group), n_groups)
        }
    )
    .sample(fraction=1.0, shuffle=True)
    .select(pl.col('x').max().over('grp'))
    ['x']
    .value_counts()
)

At 1,000 samples (n_groups=250) this returns the expected result:

# x	count
# i64	u32
# 3	1000

but at 1,004 (n_groups=251) it returns the following:

# x	count
# i64	u32
# 2	269
# 1	150
# 3	493
# 0	92

@ritchie46
Member

Ai, this is bad. Will take a look.

@ritchie46
Member

@nameexhaustion maybe this is due to the new Total hash in grouping?

@ritchie46 ritchie46 added the P-critical Priority: critical label Feb 29, 2024
@nameexhaustion
Collaborator

High chance, yes

@ritchie46
Member

Introduced by #14648. I will see if I can revert it and maybe localize the culprit.

@nameexhaustion
Collaborator

nameexhaustion commented Feb 29, 2024

I think I mixed up the hashing: sometimes it is hashing T and sometimes it is hashing &T (screenshot of the relevant code omitted), and a quick test shows that the hash of &T is not equal to the hash of T (screenshot of the test omitted).

The .to_total_ord() function always returns a copy if it gets called on a reference, so some care was needed there.
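
To illustrate the kind of mismatch being described (a purely hypothetical Rust sketch, not the actual polars-core code): if a custom hashing trait's impl for &T does not delegate exactly to the impl for T, then the code path that hashes keys by value and the one that hashes them by reference disagree, and the same key can land in more than one hash-table bucket, i.e. duplicate groups.

use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Hypothetical stand-in for a custom hashing trait; NOT the real polars code.
trait MyTotalHash {
    fn tot_hash<H: Hasher>(&self, state: &mut H);
}

// Value impl: hash the value itself.
impl MyTotalHash for u64 {
    fn tot_hash<H: Hasher>(&self, state: &mut H) {
        self.hash(state);
    }
}

// Buggy reference impl: it does not simply delegate to the value impl,
// so hashing &T disagrees with hashing T.
impl<T: Hash> MyTotalHash for &T {
    fn tot_hash<H: Hasher>(&self, state: &mut H) {
        0u8.hash(state); // accidental extra state mixed in
        (**self).hash(state);
    }
}

fn hash_one<T: MyTotalHash>(v: T) -> u64 {
    let mut h = DefaultHasher::new();
    v.tot_hash(&mut h);
    h.finish()
}

fn main() {
    let x: u64 = 42;
    // One group-by code path hashes values, another hashes references;
    // when the two disagree, the same key is split across buckets and
    // shows up as duplicate groups in the output.
    assert_ne!(hash_one(x), hash_one(&x));
}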

@nameexhaustion
Collaborator

The function is group_by_threaded_slice in hashing.rs in polars-core; I think the other places that do hashing (e.g. group_by_threaded_iter) probably also need to be checked.

@ritchie46
Member

Yeap, recompiling now. :)

@ritchie46
Member

Fixed. I will release a patch immediately. Thank you for taking a look so quickly, @nameexhaustion.

@ritchie46 ritchie46 removed the P-high Priority: high label Feb 29, 2024
@c-peters c-peters added the accepted Ready for implementation label Mar 4, 2024