-
-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Extra dimension on first argument passed into apply_ufunc #2714
Comments
I should add, if I pass in plain numpy arrays then I do not have this problem. But ultimately I want to pass in a chunked DataArray, as described here: http://xarray.pydata.org/en/stable/dask.html#automatic-parallelization (this is my whole reason for using xarray). The work around is easy I just use |
The notion of "core dimensions" in That said, I think this is (mostly) doing the right thing:
This is probably especially confusing because the unlabeled versions of For your specific use case: maybe you meant to specify More generally: I think we really need a version of |
Hi, I will have to think about your response a lot more to see if I can wrap my head around it. In the meantime I'm not sure I have my input_core_dims correct, but that's the only configuration I could get to work. I chunk along row_a, and row_b and I output a new array with the dims [row_a, row_b]. By trial and error, the above configuration is the only one I could find where I got out the dims I was expecting and didn't get an error. |
Maybe it would help to describe what you were trying to do here. What should "one unit" of your calculation look like? |
Can you clarify one thing in your note.
Is it |
That was a typo (I'll fix it). It should be The behavior is definitely deterministic, if hard to understand! |
Sure - thanks! I have a dataset that's long, the sample code shown below is 200k rows, but the full dataset will be much larger. I'm interested in pairwise distances except not for all rows, just the distances for few thousand rows, wrt to the full 200k. Here's how I hack this together: My starting array df_array = xr.DataArray(df)
df_array = df_array.rename({PIVOT: 'all_sites'})
df_array
<xarray.DataArray (all_sites: 185084, dim_1: 245)>
array([[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
...,
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.]])
Coordinates:
* all_sites (all_sites) object '0.gravatar.com||gprofiles.js||Gravatar.init' ... 'кÑ\x83Ñ\x80Ñ\x81Ñ\x8b.1Ñ\x81енÑ\x82Ñ\x8fбÑ\x80Ñ\x8f.Ñ\x80Ñ\x84||store.js||store.set'
* dim_1 (dim_1) object 'AnalyserNode.connect' ... 'HTMLCanvasElement.previousSibling' My slice of the array sites_of_interest = [sub list of all sites]
df_dye_array = xr.DataArray(df.loc[sites_of_interest])
df_dye_array = df_dye_array.rename({PIVOT: 'dye_sites'}) Chunk df_array_c = df_array.chunk({'all_sites': 10_000})
df_dye_array_c = df_dye_array.chunk({'dye_sites': 100}) Get distances def get_chebyshev_distances_xarray_ufunc(df_array, df_dye_array):
chebyshev = lambda x: np.abs(df_array[:,0,:] - x).max(axis=1)
result = np.apply_along_axis(chebyshev, 1, df_dye_array).T
return result
distance_array = xr.apply_ufunc(
get_chebyshev_distances_xarray_ufunc,
df_array_c, df_dye_array_c,
dask='parallelized',
output_dtypes=[float],
input_core_dims=[['dim_1'], ['dim_1']],
) What I get out is an array with the length of my original array and the width of my sites of interest where each number is the chebyshev distance between their respective rows of the original dataset (which are 245 long). |
phew! |
OK, I think you're doing this right. You want an output with shape My suggestion would be to add an explicit call to def get_chebyshev_distances_xarray_ufunc(array, dye_array):
array, dye_array = np.broadcast_arrays(array, dye_array)
# array is a 3D numpy array with logical dimensions ['all_sites', 'dye_sites', 'dim_1']
# dye_array is a 3D numpy array with logical dimensions ['all_sites', 'dye_sites', 'dim_1']
# compute the distance matrix
# return a numpy array with logical dimensions ['all_sites', 'dye_sites'] (edit note: fixed dimensions) |
actually, scratch that |
I think this would also work (due to the intrinsic broadcasting behavior of numpy functions): def get_chebyshev_distances_xarray_ufunc(array, dye_array):
return abs(array - dye_array).max(axis=-2) |
Unfortunately neither of your suggestions work. With the second, I get the error:
With the first:
It's okay. I have something that works. And it's deterministic :D |
In order to maintain a list of currently relevant issues, we mark issues as stale after a period of inactivity If this issue remains relevant, please comment here or remove the |
Here's my code:
If I print_shape and print_shapes with apply_ufunc I am surprised by the results:
vs
My array
da
has changed shape from(1000, 100)
to(1000, 1, 100)
when a second argument was added to my call toapply_ufunc
.Maybe this is documented, but I missed it. If it is documented I'd be glad to be pointed to it and I'll see if I can come up with a suggestion of how to highlight this better in the documentation as it really threw me.
The text was updated successfully, but these errors were encountered: