REGR: passing dask arrays to Series or DataFrame #38645

keewis · 2020-12-22T22:40:34Z

Code Sample, a copy-pastable example

import pandas as pd
import dask.array as da
a = da.ones((12,), chunks=4)
s = pd.Series(a, index=range(12))
print(s.dtype)

Problem description

This has been detected by xarray's upstream-dev CI (environment): with 1.1.3, the dtype is float64 while on master (installed from scipy-wheels-nightly) this became object (and the series / dataframe contains dask scalars). Was that change intentional? Poking around on the merged PR list, this might have been #38563 (not sure, though).

To be clear, for us this only affects test code and since it would compute anyways we can easily work around this by computing the dask array before passing it to pd.Series or pd.DataFrame.
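The workaround described above can be sketched as follows. This is a minimal illustration that assumes dask is installed; `a.compute()` materializes the lazy array into a NumPy array, so pandas never sees the dask object itself.

```python
import dask.array as da
import pandas as pd

a = da.ones((12,), chunks=4)

# Materialize the lazy array first; .compute() returns a plain numpy.ndarray.
values = a.compute()

s = pd.Series(values, index=range(12))
print(s.dtype)  # float64
```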

See also pydata/xarray#4717.

cc @TomAugspurger


jorisvandenbossche · 2020-12-23T09:02:46Z

@keewis thanks for the report! Can confirm the change in behaviour.

cc @jbrockmendel

simonjayhawkins · 2020-12-23T13:43:24Z

> Poking around on the merged PR list, this might have been #38563 (not sure, though).

can confirm, first bad commit: [cec2f5f] REF: handle non-list_like cases upfront in sanitize_array (#38563)

jbrockmendel · 2020-12-23T17:09:25Z

the low-level place to fix this would be in is_list_like, question is if we can do that without a big performance hit

jbrockmendel · 2021-06-25T21:19:43Z

One way to fix on dask's end would be to implement Array.__iter__. Is that a viable option?

keewis · 2021-06-27T14:51:43Z

I'm going to forward this to the dask devs: cc @TomAugspurger, @jsignell, @jrbourbeau

jsignell · 2021-07-12T15:44:20Z

> One way to fix on dask's end would be to implement Array.__iter__. Is that a viable option?

In that scenario would the output of Array.__iter__ be an iterable of real data values? Currently the output is an iterable of dask arrays, which have .dtype. It seems like returning real values would potentially trigger computation prematurely, and returning a generated value with the right dtype seems potentially confusing as well.

For comparison, dd.DataFrame has an __iter__ but it just looks at the _meta not at the real data (so it knows about columns but not about rows).

jbrockmendel · 2021-07-12T20:50:31Z

> In that scenario would the output of Array.__iter__ be an iterable of real data values?

I think it literally just needs to have an __iter__ attribute; it doesn't matter what it returns (it would even be OK if it raised NotImplementedError)

jsignell · 2021-07-13T13:30:09Z

This is what it looks like if I have Array.__iter__ raise NotImplementedError

In [1]: import pandas as pd
   ...: import dask.array as da
   ...: a = da.ones((12,), chunks=4)
   ...: s = pd.Series(a, index=range(12))
   ...: print(s.dtype)
---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
<ipython-input-1-2e11dcb4eba5> in <module>
      2 import dask.array as da
      3 a = da.ones((12,), chunks=4)
----> 4 s = pd.Series(a, index=range(12))
      5 print(s.dtype)

~/conda/envs/dask-upstream/lib/python3.8/site-packages/pandas/core/series.py in __init__(self, data, index, dtype, name, copy, fastpath)
    436                     data = data.copy()
    437             else:
--> 438                 data = sanitize_array(data, index, dtype, copy)
    439 
    440                 manager = get_option("mode.data_manager")

~/conda/envs/dask-upstream/lib/python3.8/site-packages/pandas/core/construction.py in sanitize_array(data, index, dtype, copy, raise_cast_failure, allow_2d)
    562         # materialize e.g. generators, convert e.g. tuples, abc.ValueView
    563         # TODO: non-standard array-likes we can convert to ndarray more efficiently?
--> 564         data = list(data)
    565 
    566         if dtype is not None or len(data) == 0:

~/dask/dask/array/core.py in __iter__(self)
   1343 
   1344     def __iter__(self):
-> 1345         raise NotImplementedError
   1346 
   1347     def __len__(self):

NotImplementedError:

jsignell · 2021-07-13T13:38:16Z

But I just noticed that dask.Series evaluated the data in the __iter__ method. So it might be reasonable for dask.Array to do the same. I'll open a PR on dask to carry on this discussion.

jbrockmendel · 2021-07-13T14:26:51Z

Yah, NotImplementedError was probably too cute. What happens with list(dask_array) now?

jsignell · 2021-07-13T15:09:35Z

I just opened the PR on dask so we can carry on the dask-side of the conversation over there. dask/dask#7888

mrocklin · 2021-07-13T15:25:58Z

> One way to fix on dask's end would be to implement Array.__iter__. Is that a viable option?

I would expect Pandas to try some of the __array__ protocols first. I think that np.asarray(my_dask_array) should efficiently produce something sensible.
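As a rough illustration of the `__array__` protocol mentioned above, the sketch below uses a hypothetical `LazyOnes` class as a stand-in for a lazy array (it is not dask's actual class). It only shows that `np.asarray` asks the object to convert itself and gets back an array with the expected dtype.

```python
import numpy as np


class LazyOnes:
    """Hypothetical stand-in for a lazy array; not dask's actual class."""

    def __init__(self, n):
        self.shape = (n,)
        self.dtype = np.dtype("float64")

    def __array__(self, dtype=None, copy=None):
        # Materialize only when NumPy asks for the data.
        out = np.ones(self.shape, dtype=self.dtype)
        return out if dtype is None else out.astype(dtype)


arr = np.asarray(LazyOnes(12))
print(type(arr).__name__, arr.dtype, arr.shape)  # ndarray float64 (12,)
```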

jbrockmendel · 2021-07-13T16:33:14Z

We can probably do that in sanitize_array, which would avoid the problem with the NotImplementedError

jsignell · 2021-07-13T17:38:58Z

Good point! If you can add that to sanitize_array then I don't think any changes are needed in dask!

jbrockmendel · 2021-07-13T17:40:14Z

We still need the __iter__ method to exist, because the is_list_like check is explicitly a hasattr(obj, "__iter__")
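To make the point about the `hasattr(obj, "__iter__")` check concrete, here is a simplified sketch. `looks_list_like` is a hypothetical stand-in for the check described above, not pandas's real `is_list_like`, and `GetitemOnly` is an invented class. It shows that an object can be iterable through `__getitem__` alone and still fail an `__iter__`-based test.

```python
class GetitemOnly:
    """Hypothetical array-like: supports indexing and len(), but defines no __iter__."""

    def __init__(self, values):
        self._values = list(values)

    def __len__(self):
        return len(self._values)

    def __getitem__(self, i):
        # Raises IndexError past the end, which ends legacy-style iteration.
        return self._values[i]


def looks_list_like(obj):
    """Simplified stand-in for the check described above; not pandas's real code."""
    return hasattr(obj, "__iter__") and not isinstance(obj, (str, bytes))


obj = GetitemOnly([1.0, 1.0, 1.0])
print(looks_list_like(obj))    # False: no __iter__ attribute
print(list(obj))               # [1.0, 1.0, 1.0]: Python falls back to __getitem__
print(looks_list_like([1.0]))  # True
```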

mrocklin · 2021-07-13T17:43:11Z

I guess my hope would be that Pandas would first check "is this thing array-like?"; if the answer is "no", then it would ask "ok, well, maybe it's list-like?" To me it makes sense to start with the more efficient things (numpy-ish) and then go down the list of less efficient options until we find something that works.

I don't know all of the history/nuance here though. Please ignore my comments above if they don't make sense.

jbrockmendel · 2021-07-13T18:03:21Z

That's absolutely reasonable. In fact there's a comment https://github.com/pandas-dev/pandas/blob/master/pandas/core/construction.py#L563 about doing exactly that.

That would make the conversion more efficient, but in order for the conversion to be done at all, we need to have is_list_like(obj), and that uses the hasattr(obj, "__iter__") check. In principle we could make that fall back to checking for __array__, but is_list_like is optimized to the bone so I'm reticent.

mrocklin · 2021-07-13T19:15:25Z

I'm proposing a check further up in that if-elif-else chain, somewhere after if isinstance(data, np.ndarray) but before the final else clause that runs the is_list_like call. In Dask we tend to use a check that is similar to if hasattr(data, "shape") and hasattr(data, "dtype").

mrocklin · 2021-07-13T19:16:13Z

Or I guess hasattr(data, "__array__") would probably be more direct here

if hasattr(data, "__array__"):
    return sanitize_array(np.asarray(data), ...)
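The ordering proposed here can be sketched outside pandas. In the example below, `coerce_values` and `ArrayOnly` are hypothetical names used for illustration; this is not pandas's `sanitize_array`, only a minimal model of "try array-like before list-like, and treat everything else as a scalar".

```python
import numpy as np


class ArrayOnly:
    """Hypothetical array-like exposing __array__ but no __iter__."""

    def __init__(self, values):
        self._values = np.asarray(values, dtype="float64")

    def __array__(self, dtype=None, copy=None):
        return self._values if dtype is None else self._values.astype(dtype)


def coerce_values(data, length):
    """Illustrative only: mimics the proposed ordering, not pandas's sanitize_array."""
    if isinstance(data, np.ndarray):
        return data
    if hasattr(data, "__array__"):
        # Array-like: let the object convert itself before any list-like check.
        return np.asarray(data)
    if hasattr(data, "__iter__") and not isinstance(data, (str, bytes)):
        return np.asarray(list(data))
    # Anything else is treated as a scalar and broadcast to the requested length.
    return np.full(length, data)


print(coerce_values(ArrayOnly([1, 1, 1]), 3).dtype)  # float64
print(coerce_values([1, 2, 3], 3).tolist())          # [1, 2, 3]
print(coerce_values(2.5, 3).tolist())                # [2.5, 2.5, 2.5]
```

Without the `__array__` branch, `ArrayOnly` would fall through to the scalar case, which is the kind of outcome reported in this issue.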

mrocklin · 2021-07-13T19:17:04Z

Oh! Unless you're saying that this function only gets called if there is an __iter__ method. That would better explain why this is an issue. My apologies for misunderstanding.

jsignell · 2021-07-16T20:15:19Z

Ok @jbrockmendel I opened a PR on the dask side to implement __iter__ on Array. It fails rather spectacularly with current pandas, but hopefully, that should give you the hook you need?

In [1]: import pandas as pd
   ...: import dask.array as da
   ...: a = da.ones((12,), chunks=4)
   ...: s = pd.Series(a, index=range(12))
   ...: s
Out[1]: 
0     dask.array<getitem, shape=(), dtype=float64, c...
1     dask.array<getitem, shape=(), dtype=float64, c...
2     dask.array<getitem, shape=(), dtype=float64, c...
3     dask.array<getitem, shape=(), dtype=float64, c...
4     dask.array<getitem, shape=(), dtype=float64, c...
5     dask.array<getitem, shape=(), dtype=float64, c...
6     dask.array<getitem, shape=(), dtype=float64, c...
7     dask.array<getitem, shape=(), dtype=float64, c...
8     dask.array<getitem, shape=(), dtype=float64, c...
9     dask.array<getitem, shape=(), dtype=float64, c...
10    dask.array<getitem, shape=(), dtype=float64, c...
11    dask.array<getitem, shape=(), dtype=float64, c...
dtype: object

jbrockmendel · 2021-07-16T21:21:11Z

I'll make a pandas PR to use __array__ as discussed above. Once that is merged we can confirm that implementing __iter__ fixes the problem.

FWIW I'd implement __iter__ so that list(obj) == list(np.array(obj))
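A greedy `__iter__` satisfying the equality suggested above could look roughly like this. `EagerOnes` is a hypothetical class used only for illustration, and the sketch assumes the object already implements `__array__`.

```python
import numpy as np


class EagerOnes:
    """Hypothetical array-like whose __iter__ materializes the data."""

    def __init__(self, n):
        self.n = n

    def __array__(self, dtype=None, copy=None):
        out = np.ones(self.n)
        return out if dtype is None else out.astype(dtype)

    def __iter__(self):
        # Greedy: convert the whole array, then iterate over real values.
        return iter(np.asarray(self))


obj = EagerOnes(3)
print(list(obj) == list(np.array(obj)))  # True
```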

jsignell · 2021-07-16T21:42:20Z

There was some discussion on the dask side and people feel that having a greedy __iter__ is too much of a gotcha. Too easy to call by mistake. See dask/dask#7889 for reference.

This doesn't fix the original issue pandas-dev/pandas#38645, but hopefully it'll make it easier for pandas to know that it should sanitize dask.arrays.
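For contrast with a greedy iterator, a lazy `__iter__` can be sketched as below. `LazyVector` and `LazyScalar` are hypothetical classes, not dask's actual implementation; the counter only demonstrates that iterating does not materialize any data until an element is explicitly computed.

```python
class LazyScalar:
    """Hypothetical deferred element; computing it is an explicit step."""

    def __init__(self, parent, i):
        self.parent = parent
        self.i = i

    def compute(self):
        self.parent.computed += 1
        return 1.0


class LazyVector:
    """Hypothetical lazy container; not dask's actual implementation."""

    def __init__(self, n):
        self.n = n
        self.computed = 0  # counts how many elements were materialized

    def __len__(self):
        return self.n

    def __iter__(self):
        # Lazy: yield deferred elements instead of real values.
        for i in range(self.n):
            yield LazyScalar(self, i)


v = LazyVector(3)
items = list(v)
print(hasattr(v, "__iter__"), v.computed)  # True 0
print(items[0].compute(), v.computed)      # 1.0 1
```

An iterator like this gives an `__iter__`-based check something to find, while leaving the decision to materialize values to the caller.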

keewis added the Bug and Needs Triage labels Dec 22, 2020

jorisvandenbossche added the Regression label and removed the Bug and Needs Triage labels Dec 23, 2020

jorisvandenbossche added this to the 1.3 milestone Dec 23, 2020

simonjayhawkins added a commit to simonjayhawkins/pandas that referenced this issue Dec 23, 2020

code sample for pandas-dev#38645

36a4183

keewis mentioned this issue Jan 3, 2021

silence the dask dataframe upstream-dev errors pydata/xarray#4757

Merged

2 tasks

jorisvandenbossche changed the title from "BUG: passing dask arrays to Series or DataFrame" to "REGR: passing dask arrays to Series or DataFrame" Jan 6, 2021

jbrockmendel added the Constructors label Jun 8, 2021

simonjayhawkins modified the milestones: 1.3, 1.3.1 Jun 30, 2021

jsignell mentioned this issue Jul 13, 2021

Implement Array.__iter__ [test-upstream] dask/dask#7888

Closed

2 tasks

jsignell mentioned this issue Jul 13, 2021

Implement Array.__iter__ dask/dask#7889

Closed

jsignell mentioned this issue Jul 16, 2021

Implement lazy Array.__iter__ dask/dask#7905

Merged

3 tasks

jbrockmendel mentioned this issue Jul 16, 2021

BUG: Series(dask.array) GH#38645 #42577

Merged

4 tasks

jreback closed this as completed in #42577 Jul 20, 2021

mrocklin pushed a commit to dask/dask that referenced this issue Jul 21, 2021

Implement lazy Array.__iter__ (#7905)

4fd68a7

This doesn't fix the original issue pandas-dev/pandas#38645, but hopefully it'll make it easier for pandas to know that it should sanitize dask.arrays.

ggold7046 mentioned this issue Aug 10, 2023

Modified doc/make.py to run sphinx-build -b linkcheck #54265

Merged

5 tasks
