Non-deterministic and incomplete group_by in 0.20.11 #14749

Closed
2 tasks done
seanlane opened this issue Feb 28, 2024 · 11 comments · Fixed by #14754
Labels
accepted Ready for implementation bug Something isn't working P-critical Priority: critical python Related to Python Polars

Comments

@seanlane

seanlane commented Feb 28, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl

bug_source = pl.read_csv('https://gist.githubusercontent.com/seanlane/658541735a08ad474962088b0ceaf5c2/raw/4a676ca37068283c9ef2e89a92522658ff0d60bb/truncated_bug_source_exploded.csv')

polars_df = (
    bug_source
    .group_by(pl.col('disk_hex'))
    .agg(
        pl.mean('cell_tract_median_value').alias('disk_tract_median_value'),
        pl.mean('cell_tract_mean_value').alias('disk_tract_mean_value'),
        pl.sum('cell_item_count').alias('disk_item_count'),
        pl.len().alias('disks_populated')
    )
)
print(polars_df.describe())
print(polars_df.group_by(pl.col('disk_hex')).agg(pl.len()).describe())

Log output

avg line length: 59.370117
std. dev. line length: 1.020644
initial row estimate: 24956
no. of chunks: 16 processed by: 16 threads.
estimated unique values: 24369
estimated unique count: 24369 exceeded the boundary: 1000, running default HASH AGGREGATION
shape: (9, 6)
┌────────────┬───────────┬─────────────────────────┬───────────────────────┬─────────────────┬─────────────────┐
│ statistic  ┆ disk_hex  ┆ disk_tract_median_value ┆ disk_tract_mean_value ┆ disk_item_count ┆ disks_populated │
│ ---        ┆ ---       ┆ ---                     ┆ ---                   ┆ ---             ┆ ---             │
│ str        ┆ f64       ┆ f64                     ┆ f64                   ┆ f64             ┆ f64             │
╞════════════╪═══════════╪═════════════════════════╪═══════════════════════╪═════════════════╪═════════════════╡
│ count      ┆ 17209.0   ┆ 17209.0                 ┆ 17209.0               ┆ 17209.0         ┆ 17209.0         │
│ null_count ┆ 0.0       ┆ 0.0                     ┆ 0.0                   ┆ 0.0             ┆ 0.0             │
│ mean       ┆ 6.1312e17 ┆ 247295.082002           ┆ 247295.082002         ┆ 2.862921        ┆ 1.452728        │
│ std        ┆ 1.2257e14 ┆ 115951.831569           ┆ 115951.831569         ┆ 3.613228        ┆ 1.054252        │
│ min        ┆ 6.1271e17 ┆ 31866.12                ┆ 31866.12              ┆ 1.0             ┆ 1.0             │
│ 25%        ┆ 6.1316e17 ┆ 157374.64               ┆ 157374.64             ┆ 1.0             ┆ 1.0             │
│ 50%        ┆ 6.1316e17 ┆ 233493.99               ┆ 233493.99             ┆ 1.0             ┆ 1.0             │
│ 75%        ┆ 6.1316e17 ┆ 315333.295              ┆ 315333.295            ┆ 3.0             ┆ 2.0             │
│ max        ┆ 6.1370e17 ┆ 741125.49               ┆ 741125.49             ┆ 31.0            ┆ 13.0            │
└────────────┴───────────┴─────────────────────────┴───────────────────────┴─────────────────┴─────────────────┘
estimated unique values: 17209
estimated unique count: 17209 exceeded the boundary: 1000, running default HASH AGGREGATION
shape: (9, 3)
┌────────────┬───────────┬──────────┐
│ statistic  ┆ disk_hex  ┆ len      │
│ ---        ┆ ---       ┆ ---      │
│ str        ┆ f64       ┆ f64      │
╞════════════╪═══════════╪══════════╡
│ count      ┆ 17165.0   ┆ 17165.0  │
│ null_count ┆ 0.0       ┆ 0.0      │
│ mean       ┆ 6.1312e17 ┆ 1.002563 │
│ std        ┆ 1.2271e14 ┆ 0.050566 │
│ min        ┆ 6.1271e17 ┆ 1.0      │
│ 25%        ┆ 6.1316e17 ┆ 1.0      │
│ 50%        ┆ 6.1316e17 ┆ 1.0      │
│ 75%        ┆ 6.1316e17 ┆ 1.0      │
│ max        ┆ 6.1370e17 ┆ 2.0      │
└────────────┴───────────┴──────────┘

Issue description

Forgive me for using a link to read data from this GitHub Gist, but I was unable to reproduce the issue with fewer than roughly 20,000 rows.

We're using the H3 library to index metrics about areas and then perform aggregations using the neighboring cells within a certain radius. In Polars, we take each H3 cell, use the library to find all neighbors within a disk of a certain radius, explode the array of disk cells, and then group by the ID of the disk cell, aggregating on metrics of interest across the disk of a given cell.
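
For context, a minimal sketch of that workflow (not the production code; the column names and cell IDs here are illustrative, and it assumes the h3 Python package with the v4 API, where h3.grid_disk(cell, k) returns the cells within grid distance k):

import h3      # h3-py v4 API assumed; in v3 the equivalent call is h3.k_ring
import polars as pl

cells = pl.DataFrame({
    'cell_hex': ['8928308280fffff', '8928308280bffff'],  # illustrative H3 cells
    'cell_item_count': [3, 5],
})

disk_metrics = (
    cells
    # find all neighbors within a disk of radius 1 around each cell
    .with_columns(
        pl.col('cell_hex')
        .map_elements(lambda c: list(h3.grid_disk(c, 1)), return_dtype=pl.List(pl.Utf8))
        .alias('disk_hex')
    )
    # explode the array of disk cells, then group by the ID of the disk cell
    .explode('disk_hex')
    .group_by(pl.col('disk_hex'))
    .agg(
        pl.sum('cell_item_count').alias('disk_item_count'),
        pl.len().alias('disks_populated'),
    )
)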

What I noticed is that in dataframes longer than around 20,000 rows, grouping on Disk ID appears to be both nondeterministic (at least, I'm getting differing results each time) and incomplete: the values in the Disk ID column do not become unique after grouping.

I attempted to reproduce this in 0.20.7, 0.20.8, 0.20.9, and 0.20.10, but I could only do so in 0.20.11.

When adding maintain_order=True to the .group_by(pl.col('disk_hex')) call, the nondeterminism of the grouping appears to cease, but the resulting dataframe still includes duplicate values within the Disk ID column.

Expected behavior

After grouping by Disk ID, I expect the values in the Disk ID column to be unique, and grouping again on the same index column should yield a dataframe of the same length.

Grouping with Pandas and DuckDB shows that the expected length of the grouped dataframe should be 17148, but Polars yields a dataframe with a length of 17209. If I perform this grouping on larger dataframes (greater than 200,000 rows), the length of the dataframe produced by Polars can differ between runs.
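
For reference, a rough sketch of that cross-check (not the author's original code; it assumes pandas and duckdb are installed, and relies on duckdb being able to query the in-scope pandas frame by name):

import duckdb
import pandas as pd

url = 'https://gist.githubusercontent.com/seanlane/658541735a08ad474962088b0ceaf5c2/raw/4a676ca37068283c9ef2e89a92522658ff0d60bb/truncated_bug_source_exploded.csv'
bug_source_pd = pd.read_csv(url)

# number of distinct disk_hex groups according to pandas
print(bug_source_pd.groupby('disk_hex').size().shape[0])

# number of distinct disk_hex groups according to duckdb
print(duckdb.sql('SELECT COUNT(DISTINCT disk_hex) FROM bug_source_pd').fetchone()[0])

# both report 17148, whereas the Polars group_by above yields 17209 rows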

Installed versions

--------Version info---------
Polars:               0.20.11
Index type:           UInt32
Platform:             macOS-12.5-x86_64-i386-64bit
Python:               3.11.3 (main, Apr  5 2023, 15:47:48) [Clang 13.1.6 (clang-1316.0.21.2.5)]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          <not installed>
connectorx:           <not installed>
deltalake:            <not installed>
fsspec:               <not installed>
gevent:               <not installed>
hvplot:               <not installed>
matplotlib:           <not installed>
numpy:                1.26.4
openpyxl:             <not installed>
pandas:               2.2.1
pyarrow:              15.0.0
pydantic:             <not installed>
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           <not installed>
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>
None
@seanlane seanlane added bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars labels Feb 28, 2024
@MarcoGorelli MarcoGorelli added P-high Priority: high and removed needs triage Awaiting prioritization by a maintainer labels Feb 28, 2024
@mirkomiorelli

mirkomiorelli commented Feb 28, 2024

When adding maintain_order=True to the .group_by(pl.col('disk_hex')) call, the nondeterminism of the grouping appears to cease, but the resulting dataframe still includes duplicate values within the Disk ID column.

I am having the same issue working with dataframes larger than 20K records. I found that adding a sort on the grouping column before the group_by solves the non-deterministic and non-unique behavior for me on 0.20.11. E.g., in the gist, adding .sort(by='disk_hex') before the .group_by(pl.col('disk_hex')) returns the expected result, if I am interpreting the log correctly.
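
Concretely, a sketch of that workaround applied to the reproducible example at the top of this issue (bug_source and the imports as defined there):

sorted_grouped = (
    bug_source
    .sort(by='disk_hex')  # sorting the grouping column first sidesteps the bug on 0.20.11
    .group_by(pl.col('disk_hex'))
    .agg(pl.len().alias('n'))
)
# with the sort in place, the grouping key comes out unique
assert sorted_grouped['disk_hex'].n_unique() == sorted_grouped.height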

@seanlane
Author

@mirkomiorelli I'm seeing that as well: adding .sort(by='disk_hex') before the .group_by(pl.col('disk_hex')) returns the expected number of rows (17,148). It also corrects the issue when grouping on a larger dataframe with more than 250,000 rows.

@jmoralez

I'm seeing something similar for max().over(); it seems to occur when the input isn't sorted and n_rows > 1,000. Here's a minimal example:

import numpy as np
import polars as pl

n_groups = 250
rows_per_group = 4
(
    pl.DataFrame(
        {
            'grp': np.repeat(np.arange(n_groups), rows_per_group),
            'x': np.tile(np.arange(rows_per_group), n_groups)
        }
    )
    .sample(fraction=1.0, shuffle=True)
    .select(pl.col('x').max().over('grp'))
    ['x']
    .value_counts()
)

At 1,000 samples (n_groups=250) this returns the expected result:

# x	count
# i64	u32
# 3	1000

but at 1,004 (n_groups=251) it returns the following:

# x	count
# i64	u32
# 2	269
# 1	150
# 3	493
# 0	92

@ritchie46
Member

Ai, this is bad. Will take a look.

@ritchie46
Member

@nameexhaustion maybe this is due to the new Total hash in grouping?

@ritchie46 ritchie46 added the P-critical Priority: critical label Feb 29, 2024
@nameexhaustion
Collaborator

High chance, yes

@ritchie46
Member

Introduced by #14648. I will see if I can revert it and maybe localize the culprit.

@nameexhaustion
Collaborator

nameexhaustion commented Feb 29, 2024

I think I mixed up the hashing: sometimes it is hashing T and sometimes it is hashing &T (screenshot of the relevant code omitted), and a quick test shows that the hash of &T is not equal to the hash of T (screenshot of the test omitted).

The .to_total_ord() function always returns a copy if it gets called on a reference, so some care was needed there.
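
To illustrate the kind of mismatch being described (a purely hypothetical Rust sketch, not the actual polars-core code): if a custom hashing trait's impl for &T does not delegate exactly to the impl for T, then the code path that hashes keys by value and the one that hashes them by reference disagree, and the same key can land in more than one hash-table bucket, i.e. duplicate groups.

use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Hypothetical stand-in for a custom hashing trait; NOT the real polars code.
trait MyTotalHash {
    fn tot_hash<H: Hasher>(&self, state: &mut H);
}

// Value impl: hash the value itself.
impl MyTotalHash for u64 {
    fn tot_hash<H: Hasher>(&self, state: &mut H) {
        self.hash(state);
    }
}

// Buggy reference impl: it does not simply delegate to the value impl,
// so hashing &T disagrees with hashing T.
impl<T: Hash> MyTotalHash for &T {
    fn tot_hash<H: Hasher>(&self, state: &mut H) {
        0u8.hash(state); // accidental extra state mixed in
        (**self).hash(state);
    }
}

fn hash_one<T: MyTotalHash>(v: T) -> u64 {
    let mut h = DefaultHasher::new();
    v.tot_hash(&mut h);
    h.finish()
}

fn main() {
    let x: u64 = 42;
    // One group-by code path hashes values, another hashes references;
    // when the two disagree, the same key is split across buckets and
    // shows up as duplicate groups in the output.
    assert_ne!(hash_one(x), hash_one(&x));
}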

@nameexhaustion
Collaborator

The function is group_by_threaded_slice in hashing.rs in polars-core; I think the other places that do hashing (e.g. group_by_threaded_iter) probably also need to be checked.

@ritchie46
Member

Yeap, recompiling now. :)

@ritchie46
Member

Fixed. I will release a patch immediately. Thank you for taking a look so quickly, @nameexhaustion.

@ritchie46 ritchie46 removed the P-high Priority: high label Feb 29, 2024
@c-peters c-peters added the accepted Ready for implementation label Mar 4, 2024