
Extra dimension on first argument passed into apply_ufunc #2714

Open
birdsarah opened this issue Jan 26, 2019 · 13 comments

Comments

birdsarah commented Jan 26, 2019

Here's my code:

import numpy as np
import xarray as xr

da = xr.DataArray(np.random.rand(1000, 100))
da = da.rename({'dim_0': 'rows_a'})
db = xr.DataArray(np.random.rand(1000, 100))
db = db.rename({'dim_0': 'rows_b'})

def print_shape(a):
    print(a.shape)
    return np.zeros(shape=(a.shape[0]))

def print_two_shapes(a, b):
    print(a.shape)
    print(b.shape)
    return np.zeros(shape=(a.shape[0], b.shape[0]))

If I call print_shape and print_two_shapes with apply_ufunc, I am surprised by the results:

xr.apply_ufunc(
    print_shape,
    da,
    input_core_dims=[['dim_1']]
)

(1000, 100)
<xarray.DataArray (rows_a: 1000)>
array([0., 0., 0., ..., 0., 0., 0.])
Coordinates:
  * rows_a   (rows_a) int64 0 1 2 3 4 5 6 7 ... 992 993 994 995 996 997 998 999

vs

xr.apply_ufunc(
    print_two_shapes,
    da, db,
    input_core_dims=[['dim_1'], ['dim_1']]
)

(1000, 1, 100)
(1000, 100)

<xarray.DataArray (rows_a: 1000, rows_b: 1000)>
array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])
Coordinates:
  * rows_a   (rows_a) int64 0 1 2 3 4 5 6 7 ... 992 993 994 995 996 997 998 999
  * rows_b   (rows_b) int64 0 1 2 3 4 5 6 7 ... 992 993 994 995 996 997 998 999

My array da has changed shape from (1000, 100) to (1000, 1, 100) when a second argument was added to my call to apply_ufunc.

Maybe this is documented and I missed it. If so, I'd be glad to be pointed to it, and I'll see if I can suggest a way to highlight it better in the documentation, as it really threw me.

birdsarah (Author) commented:

I should add: if I pass in plain numpy arrays, I do not have this problem. But ultimately I want to pass in a chunked DataArray, as described here: http://xarray.pydata.org/en/stable/dask.html#automatic-parallelization (this is my whole reason for using xarray).

The workaround is easy (I just use da[:,0,:]), but it's odd!
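To make the workaround concrete, here is a minimal sketch (reusing the toy arrays from the first comment) that drops the inserted length-1 axis inside the applied function:

```python
import numpy as np
import xarray as xr

da = xr.DataArray(np.random.rand(1000, 100)).rename({'dim_0': 'rows_a'})
db = xr.DataArray(np.random.rand(1000, 100)).rename({'dim_0': 'rows_b'})

def print_two_shapes(a, b):
    # a arrives as (1000, 1, 100); drop the broadcast length-1 axis
    a = a[:, 0, :]
    print(a.shape)  # (1000, 100)
    print(b.shape)  # (1000, 100)
    return np.zeros(shape=(a.shape[0], b.shape[0]))

out = xr.apply_ufunc(
    print_two_shapes,
    da, db,
    input_core_dims=[['dim_1'], ['dim_1']],
)
print(out.dims)  # ('rows_a', 'rows_b')
```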

shoyer (Member) commented Jan 26, 2019

The notion of "core dimensions" in apply_ufunc() is definitely quite tricky to understand.

That said, I think this is (mostly) doing the right thing:

  • Your inputs have dimensions ['rows_a', 'dim_1'] and ['rows_b', 'dim_1']
  • Xarray broadcasts over dimensions that aren't included in "core dimensions", so inputs are broadcast to have dimensions like ['rows_a', 'rows_b', 'dim_1'] and ['rows_a', 'rows_b', 'dim_1'].

This is probably especially confusing because the unlabeled versions of da and db are given "broadcastable" shapes (1000, 1, 100) and (1000, 100) rather than the fully "broadcast" shapes of (1000, 1000, 100) and (1000, 1000, 100), which would make it more obvious what is going on.
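The difference between "broadcastable" and fully "broadcast" shapes can be seen with plain NumPy, independent of xarray (a minimal sketch using the toy sizes from the issue):

```python
import numpy as np

# Two arrays sharing only the core dimension dim_1, as in the issue:
# da has dims (rows_a, dim_1), db has dims (rows_b, dim_1).
a = np.random.rand(1000, 100)
b = np.random.rand(1000, 100)

# Inserting a length-1 axis gives a the "broadcastable" shape
# (1000, 1, 100); b stays (1000, 100), since NumPy implicitly
# prepends missing axes when broadcasting.
a_bc = a[:, np.newaxis, :]
print(a_bc.shape)  # (1000, 1, 100)

# np.broadcast_arrays produces (as zero-copy views) the fully
# "broadcast" shapes, which make the pairwise structure obvious.
a_full, b_full = np.broadcast_arrays(a_bc, b)
print(a_full.shape)  # (1000, 1000, 100)
print(b_full.shape)  # (1000, 1000, 100)
```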

For your specific use case: maybe you meant to specify input_core_dims=[['rows_a'], ['rows_b']] instead? That version would give inputs with dimensions like ['dim_1', 'rows_a'] and ['dim_1', 'rows_b'].

More generally: I think we really need a version of apply() that doesn't do this confusing broadcasting and dimension reordering. See #1130 for discussion about that.

birdsarah (Author) commented:

Hi, I will have to think about your response a lot more to see if I can wrap my head around it.

In the meantime I'm not sure I have my input_core_dims correct, but that's the only configuration I could get to work.

I chunk along rows_a and rows_b, and I output a new array with the dims [rows_a, rows_b].

By trial and error, the above configuration is the only one I could find where I got out the dims I was expecting and didn't get an error.

shoyer (Member) commented Jan 26, 2019

Maybe it would help to describe what you were trying to do here. What should "one unit" of your calculation look like?

birdsarah (Author) commented:

Can you clarify one thing in your note.

unlabeled versions of da and db are given "broadcastable" shapes (1, 1000, 100) and (1000, 100)

Is it (1000, 1, 100) as my code seems to return, or, as you said (1, 1000, 100)? Is it deterministic?

shoyer (Member) commented Jan 26, 2019

unlabeled versions of da and db are given "broadcastable" shapes (1, 1000, 100) and (1000, 100)

Is it (1000, 1, 100) as my code seems to return, or, as you said (1, 1000, 100)? Is it deterministic?

That was a typo (I'll fix it). It should be (1000, 1, 100).

The behavior is definitely deterministic, if hard to understand!

birdsarah (Author) commented:

Maybe it would help to describe what you were trying to do here.

Sure - thanks!

I have a dataset that's long: the sample code shown below has ~200k rows, but the full dataset will be much larger. I'm interested in pairwise distances, but not for all rows: just the distances for a few thousand rows with respect to the full 200k.

Here's how I hack this together:

My starting array

df_array = xr.DataArray(df)
df_array = df_array.rename({PIVOT: 'all_sites'})
df_array

<xarray.DataArray (all_sites: 185084, dim_1: 245)>
array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])
Coordinates:
  * all_sites  (all_sites) object '0.gravatar.com||gprofiles.js||Gravatar.init' ... 'курсы.1сентября.рф||store.js||store.set'
  * dim_1      (dim_1) object 'AnalyserNode.connect' ... 'HTMLCanvasElement.previousSibling'

My slice of the array

sites_of_interest = [sub list of all sites]
df_dye_array = xr.DataArray(df.loc[sites_of_interest])
df_dye_array = df_dye_array.rename({PIVOT: 'dye_sites'})

Chunk

df_array_c = df_array.chunk({'all_sites': 10_000})
df_dye_array_c = df_dye_array.chunk({'dye_sites': 100})

Get distances

def get_chebyshev_distances_xarray_ufunc(df_array, df_dye_array):
    # df_array arrives with an extra length-1 axis, shape
    # (all_sites, 1, dim_1); [:, 0, :] drops it
    chebyshev = lambda x: np.abs(df_array[:, 0, :] - x).max(axis=1)
    result = np.apply_along_axis(chebyshev, 1, df_dye_array).T
    return result

distance_array = xr.apply_ufunc(
        get_chebyshev_distances_xarray_ufunc,
        df_array_c, df_dye_array_c,
        dask='parallelized',
        output_dtypes=[float],
        input_core_dims=[['dim_1'], ['dim_1']],
)

What I get out is an array with the length of my original array and the width of my sites of interest, where each entry is the Chebyshev distance between the respective rows of the original dataset (which are 245 long).
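As a sanity check, the applied function can be exercised on plain NumPy arrays with toy chunk shapes mimicking what apply_ufunc passes in (made-up small sizes; the cross-check against a direct broadcasting computation is an addition, not part of the original report):

```python
import numpy as np

def get_chebyshev_distances_xarray_ufunc(df_array, df_dye_array):
    # df_array arrives as (all_sites, 1, dim_1); [:, 0, :] drops the
    # broadcast length-1 axis
    chebyshev = lambda x: np.abs(df_array[:, 0, :] - x).max(axis=1)
    return np.apply_along_axis(chebyshev, 1, df_dye_array).T

# Toy shapes standing in for the chunked (all_sites, 1, dim_1) and
# (dye_sites, dim_1) inputs.
df_array = np.random.rand(50, 1, 8)
df_dye_array = np.random.rand(7, 8)

result = get_chebyshev_distances_xarray_ufunc(df_array, df_dye_array)
print(result.shape)  # (50, 7)

# Cross-check against a direct broadcasting computation:
# (50, 1, 8) - (7, 8) -> (50, 7, 8), then max over dim_1.
expected = np.abs(df_array - df_dye_array).max(axis=-1)
assert np.allclose(result, expected)
```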

birdsarah (Author) commented:

The behavior is definitely deterministic, if hard to understand!

phew!

shoyer (Member) commented Jan 26, 2019

OK, I think you're doing this right. You want an output with dimensions ['all_sites', 'dye_sites'], right?

My suggestion would be to add an explicit call to np.broadcast_arrays() at the start of your applied function. This will make the dimensions a little easier to understand.

def get_chebyshev_distances_xarray_ufunc(array, dye_array):
    array, dye_array = np.broadcast_arrays(array, dye_array)
    # array and dye_array are now 3D numpy arrays with logical
    # dimensions ['all_sites', 'dye_sites', 'dim_1']

    # compute the distance matrix and return a numpy array with
    # logical dimensions ['all_sites', 'dye_sites']
    return np.abs(array - dye_array).max(axis=-1)

(edit note: fixed dimensions)

shoyer (Member) commented Jan 26, 2019

actually, scratch that

shoyer (Member) commented Jan 26, 2019

I think this would also work (due to the intrinsic broadcasting behavior of numpy functions):

def get_chebyshev_distances_xarray_ufunc(array, dye_array):
    return abs(array - dye_array).max(axis=-2)
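As a sketch on plain NumPy arrays with broadcastable chunk shapes like those in this thread (made-up sizes): reducing over the last axis, where apply_ufunc places the core dimension dim_1, yields the pairwise matrix. Note that axis=-2 in the one-liner above would instead reduce over the broadcast dimension, which is consistent with the shape mismatch reported in the next comment.

```python
import numpy as np

# Broadcastable chunk shapes like those apply_ufunc passes in:
array = np.random.rand(200, 1, 245)    # (all_sites, 1, dim_1)
dye_array = np.random.rand(30, 245)    # (dye_sites, dim_1)

# Reduce over the last axis (the core dim dim_1) to get the
# (all_sites, dye_sites) pairwise Chebyshev matrix.
result = np.abs(array - dye_array).max(axis=-1)
print(result.shape)  # (200, 30)
```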

birdsarah (Author) commented:

Unfortunately, neither of your suggestions works. With the second, I get the error:

  • ValueError: parameter 'value': expected array with shape (10000, 100), got (10000, 245)

With the first:

  • ValueError: operands could not be broadcast together with shapes (5000,100,245) (100,)

It's okay. I have something that works. And it's deterministic :D

stale bot commented Dec 26, 2020

In order to maintain a list of currently relevant issues, we mark issues as stale after a period of inactivity.

If this issue remains relevant, please comment here or remove the stale label; otherwise it will be marked as closed automatically.
