DataFrame.divisions are lost on repartition when npartitions==1 #975

luxcem · 2024-03-14T17:01:07Z

DataFrame.divisions are lost when using repartition or set_index with npartitions == 1

import numpy as np
import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame(np.random.randint(0, 20000, size=(10, 3)), columns=list('ABC'))
ddf = dd.from_pandas(df)
print(ddf.divisions) # (0, 9)

ddf = ddf.reset_index().set_index("A", sort=True, npartitions=2)
print(ddf.divisions) # (1483, 19649)

ddf = dd.from_pandas(df)
ddf = ddf.reset_index().set_index("A", sort=True, npartitions=1)
print(ddf.divisions) # (None, None)

Environment:

Dask version: 2024.3.0
Python version: 3.12.2
Operating System: macOS
Install method (conda, pip, source): pip


phofl · 2024-03-14T17:37:47Z

Yes, npartitions=1 is a shortcut that avoids computing the quantiles, which is a huge performance pain in most cases. Is there a scenario where you need the divisions there?

fjetter · 2024-03-15T08:44:30Z

@luxcem can you explain why you are interested in divisions in this example? Dask itself won't use them internally as soon as we're on a single partitioned dataframe since the algorithms for single partitions don't require divisions. Therefore, with query planning we are not calculating them. The legacy dataframe performs a possibly expensive computation to get them.

If you are interested in the min/max values, I recommend computing them directly with dask.compute(ddf.index.min(), ddf.index.max()) instead of relying on the divisions.

luxcem · 2024-03-15T08:55:23Z

Typically, I employ this approach with a variable npartitions, which varies based on factors like data size or cluster availability. This value can be set to 1. Certain functions require the divisions parameter to be defined. Failure to set divisions properly can lead to subsequent computational errors. Could I be approaching this incorrectly?

For instance, the compute function invokes optimize within _expr, and this may trigger an AssertionError in https://github.com/dask/dask-expr/blob/main/dask_expr/_expr.py#L497.

fjetter · 2024-03-15T09:00:24Z

> For instance, the compute function invokes optimize within _expr, and this may potentially trigger an AssertionError

This is a bug and we would appreciate it if you could share a reproducer. We certainly don't want to trigger any exceptions just because divisions are not set. The optimizer must deal with this automatically.

Regarding the availability of divisions themselves, I would rather consider this a best-effort attribute. We do not guarantee that it is always set to meaningful values, and the single-partition case is one where we choose not to set it. I recommend not relying on it being set.

luxcem · 2024-03-15T09:04:02Z

Ok I'll work on a reproducer.
Another example is repartition after divisions are lost: it triggers an exception in https://github.com/dask/dask-expr/blob/main/dask_expr/_repartition.py#L253

phofl · 2024-03-15T12:27:40Z

Thanks for working on a reproducer. I am curious to see where things go wrong. @fjetter is correct that we normally don't need divisions for single-partition DataFrames, since we can work with them independently of divisions.

luxcem mentioned this issue Mar 15, 2024

repartition with divisions when know_divisions is False raise an exception. #979

Closed

wilsonbb mentioned this issue Mar 28, 2024

Unexpected behavior when loading an ensemble with sort and sorted flags lincc-frameworks/tape#412

Open
