Non-deterministic and incomplete group_by in 0.20.11 #14749
Comments
I am having the same issue working with dataframes larger than 20K records. I found that adding a sort operation on the grouping column before the group_by solves the non-deterministic and non-uniqueness behavior for me.
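For reference, a minimal sketch of that sort-first workaround; the frame, column names, and aggregation are placeholders, not the original code:

```python
import polars as pl

# Placeholder data standing in for the real >20K-row frame.
df = pl.DataFrame({"grp": [2, 0, 1, 0, 2, 1], "x": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]})

# Sorting on the grouping column before group_by was reported to restore
# deterministic, unique groups on the affected version.
out = (
    df.sort("grp")
    .group_by("grp")
    .agg(pl.col("x").sum())
)
```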
@mirkomiorelli I'm seeing that as well, adding …
I'm seeing something similar:

```python
import numpy as np
import polars as pl

n_groups = 250
rows_per_group = 4

(
    pl.DataFrame(
        {
            'grp': np.repeat(np.arange(n_groups), rows_per_group),
            'x': np.tile(np.arange(rows_per_group), n_groups)
        }
    )
    .sample(fraction=1.0, shuffle=True)
    .select(pl.col('x').max().over('grp'))
    ['x']
    .value_counts()
)
```

At 1,000 samples this returns the expected result, but at 1,004 (n_groups=251) it returns the following:
Ai, this is bad. Will take a look.
@nameexhaustion maybe this is due to the new Total hash in grouping?
High chance, yes.
Introduced by #14648. I will see if I can revert it and maybe localize the culprit.
Function is group_by_threaded_slice in hashing.rs in polars core; I think the other places that do hashing (e.g. group_by_threaded_iter) probably also need to be checked.
Yeap, recompiling now. :)
Fixed. I will release a patch immediately. Thank you for taking a look so quickly, @nameexhaustion.
Checks
Reproducible example
Log output
Issue description
Forgive me for using a link to read data from this GitHub Gist, but I was unable to reproduce the issue with fewer than 20,000 rows or so.
We're using the H3 library to index metrics about areas and then perform aggregations using the neighboring cells within a certain radius. In Polars, we take each H3 cell, use the library to find all neighbors within a disk of a certain radius, explode the array of disk cells, and then group by the ID of the disk cell, aggregating on metrics of interest across the disk of a given cell.
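For context, here is a rough sketch of that kind of disk aggregation; the column names, the radius, the summed metric, and the use of h3.grid_disk (the h3-py v4 API) are assumptions for illustration, not the reporter's actual pipeline:

```python
import h3  # assuming h3-py >= 4.0, where grid_disk replaces the older k_ring
import polars as pl

# Hypothetical input: one row per H3 cell with a metric to aggregate.
df = pl.DataFrame(
    {
        "hex": ["8928308280fffff", "8928308280bffff"],
        "metric": [1.0, 2.0],
    }
)

radius = 2  # disk radius in rings; an arbitrary choice for the sketch

aggregated = (
    df
    # For each cell, list every cell in its surrounding disk.
    .with_columns(
        pl.col("hex")
        .map_elements(lambda h: h3.grid_disk(h, radius), return_dtype=pl.List(pl.String))
        .alias("disk_hex")
    )
    # One row per (origin cell, disk cell) pair.
    .explode("disk_hex")
    # Aggregate the metric of every origin cell that contributes to a disk cell.
    .group_by(pl.col("disk_hex"))
    .agg(pl.col("metric").sum().alias("metric_sum"))
)
```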
What I noticed is that in dataframes longer than around 20,000 rows, grouping on Disk ID appears both to be nondeterministic (at least, I'm getting differing results each time) and to be incomplete: the values in the Disk ID column do not become unique after grouping.
I attempted to reproduce this in 0.20.7, 0.20.8, 0.20.9, and 0.20.10, but I could only do so in 0.20.11. When adding `maintain_order=True` to the `.group_by(pl.col('disk_hex'))` call, the nondeterminism of the grouping appears to cease, but the resulting dataframe still includes duplicate values within the Disk ID column.

Expected behavior
After grouping by Disk ID, I expect the values in the Disk ID column to be unique, and grouping again on the same index column should yield a dataframe of the same length.
Grouping with Pandas and DuckDB shows that the expected length of the grouped dataframe should be 17148, but Polars yields a dataframe with a length of 17209. If I perform this grouping on larger dataframes (greater than 200,000 rows), the length of the dataframe produced by Polars can differ from run to run.
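A small sketch of the invariant being described, using placeholder names for the frame and metric column:

```python
import polars as pl

# Placeholder frame; 'disk_hex' is the grouping column, 'metric' the aggregated value.
df = pl.DataFrame({"disk_hex": ["a", "b", "a", "c"], "metric": [1.0, 2.0, 3.0, 4.0]})

grouped = df.group_by("disk_hex").agg(pl.col("metric").sum())

# After grouping, the key column should be unique...
assert grouped["disk_hex"].n_unique() == grouped.height
# ...and grouping again on the same key should not change the length.
assert grouped.group_by("disk_hex").agg(pl.col("metric").sum()).height == grouped.height
```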
Installed versions