Support selections across all dimensions in element.dataset #3924

Closed
wants to merge 8 commits into from

Collaborator

@jonmmease jonmmease commented Aug 22, 2019

Overview

This PR builds on top of the following PRs:

It updates the Dataset.select method to support down-selecting an element using all of the dimensions in the element's .dataset property. Without this, elements can only be down-selected using their key and value dimensions.

Example 1: Points

Create a sample 3-dimensional dataset. x and y are drawn independently from the standard normal distribution, and r is the distance of each point from the origin.

import numpy as np
import pandas as pd
import holoviews as hv
from holoviews import dim
hv.extension('plotly')

np.random.seed(1)
df = pd.DataFrame(np.random.randn(100, 2), columns=['x', 'y'])

# Add radius column
df['r'] = (df.x ** 2 + df.y ** 2) ** 0.5

ds = hv.Dataset(df)

Then create a Points element from this dataset with x and y as key dimensions.

points = ds.to.points(kdims=['x', 'y'], groupby=[])
points


Prior to #3919, the points object would not have dimension information about r, so it would not be possible to perform a selection on points using r. With the addition of the .dataset property (and the changes in this PR), it is now possible to select on r as well.

Perform selection using x (a key dimension) and r (neither a key nor value dimension):

points * points.select(x=(0, None), r=(0, 1.5))
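
To make the range semantics concrete, here is a minimal pure-Python sketch of the filtering that the select call above expresses. It is not HoloViews code: `select_rows` and `in_range` are hypothetical helpers, the rows are generated with the standard library (so the values differ from the DataFrame above), and treating the lower bound as inclusive and the upper bound as exclusive is an assumption.

```python
import math
import random

random.seed(1)

# Hypothetical stand-in for the DataFrame: x, y and the derived radius r.
rows = []
for _ in range(100):
    x, y = random.gauss(0, 1), random.gauss(0, 1)
    rows.append({"x": x, "y": y, "r": math.hypot(x, y)})


def in_range(value, bounds):
    """Assumed semantics: lower inclusive, upper exclusive, None means open."""
    lower, upper = bounds
    if lower is not None and value < lower:
        return False
    if upper is not None and value >= upper:
        return False
    return True


def select_rows(rows, **ranges):
    """Keep rows whose values fall inside every requested range."""
    return [
        row for row in rows
        if all(in_range(row[name], bounds) for name, bounds in ranges.items())
    ]


# r is usable here even though only x and y would be key dimensions.
selected = select_rows(rows, x=(0, None), r=(0, 1.5))
```

The point of the sketch is that r is just another column of the underlying data, so filtering on it needs no special treatment once the element can reach the full dataset.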


Example 2: Histogram

This PR uses #3921 to support rebinning the histogram samples in response to selections.

Here's an example of performing a selection directly on the histogram element that uses both x (the histogram's key dimension) and r (neither a key nor value dimension):

hist1 = hv.operation.histogram(points, num_bins=10, dynamic=False, normed=False)
hist2 = hist1.select((dim('x') > 0) & (dim('r') < 1.5))
hist1 * hist2


This example also demonstrates the dim expression support that was added to select in #3920.
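
The histogram selection above can be pictured as "filter the underlying samples, then bin again". The following is a rough pure-Python sketch of that idea, not the implementation in this PR: `bin_counts` is a hypothetical helper, the samples are regenerated with the standard library, and reusing the full data's range for the second histogram is an assumption made so that the two sets of bins line up.

```python
import math
import random

random.seed(1)

# Hypothetical samples: each entry is (x, r), with r the distance from the origin.
samples = []
for _ in range(100):
    x, y = random.gauss(0, 1), random.gauss(0, 1)
    samples.append((x, math.hypot(x, y)))


def bin_counts(values, lower, upper, num_bins):
    """Count values in num_bins equal-width bins spanning [lower, upper]."""
    width = (upper - lower) / num_bins
    counts = [0] * num_bins
    for value in values:
        if lower <= value <= upper:
            # Clamp so that value == upper lands in the last bin.
            index = min(int((value - lower) / width), num_bins - 1)
            counts[index] += 1
    return counts


all_x = [x for x, _ in samples]
lower, upper = min(all_x), max(all_x)

# Histogram of every sample (analogous to hist1).
full_counts = bin_counts(all_x, lower, upper, 10)

# Filter the samples using x and r, then bin again (analogous to hist2).
selected_x = [x for x, r in samples if x > 0 and r < 1.5]
selected_counts = bin_counts(selected_x, lower, upper, 10)
```

Because r is not a dimension of the histogram itself, this only works if the histogram can get back to the samples it was computed from, which is what the .dataset property provides.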

When select is performed on a Histogram element that has .dataset and ._operation_kwargs properties, regenerate the histogram using the selected data.
@jlstevens
Contributor

jlstevens commented Sep 5, 2019

@jonmmease I really like this functionality, but I have some concerns about the semantics of selecting on dimensions of an element that aren't declared as kdims or vdims, especially if the .dataset property can return None.

If a Dataset instance is always present on the .dataset property, I think we can present a clearer and more consistent story along the lines of 'you can find dimensions to select on which aren't kdims and vdims directly on the element by inspecting the .dataset property'. Linking would then be based on the shared identity of .dataset between elements.

Is there a reason a .dataset value of None is needed or could an element just create a new one if not derived from an existing dataset?

@jonmmease
Collaborator Author

Thanks for taking a look @jlstevens. The goal is for .dataset to never return None, and I believe the only exception to this rule is for a Histogram created using the constructor. In this case the .data property holds a dict of the bin edges/heights, and this dict cannot be used to create a Dataset that corresponds to the data used to create the histogram.

If we rewrite how Histogram works in version 2 to be more like Distribution, then I think we could say that .dataset is never None. In an offline conversation you suggested that we add a warning when accessing the Histogram.dataset property when it is None, explaining why it is None. This sounds like a fine idea to me, and I'll make that update in the next revision.

@jonmmease
Collaborator Author

@philippjfr before I make any other changes here, how do you feel about the .dataset property being None for Histograms created using the Histogram constructor? Do you agree that it would be appropriate to raise a warning in this case? I'm leaning against it slightly, as it is well-defined behavior, and not something that a user would really be able to work around.

@jbednar
Member

jbednar commented Sep 10, 2019

The warning is raised only when accessing .dataset, right? If so, what are the cases when that attribute would be accessed on a Histogram?

@jonmmease
Collaborator Author

If so, what are the cases when that attribute would be accessed on a Histogram?

Only when a user checks the value of the property manually. The only reason I can think of to do this is to see which variables are available for use with select.

@jonmmease
Collaborator Author

@jlstevens in 0a917a7 I made sure that Histogram elements never have a .dataset property of None.

But this wasn't the only case. QuadMesh is another case where the .data property is not compatible with the Dataset storage backends. Here's the kind of thing that gets stored as the QuadMesh.data property:

{'x': array([  10.        ,   46.41588834,  215.443469  , 1000.        ]),
 'y': array([ 1.,  4.,  7., 10.]),
 'z': array([[0, 1, 2],
        [3, 4, 5],
        [6, 7, 8]])}

I don't see a workaround for this. Do you?

@philippjfr
Member

@philippjfr before I make any other changes here, how do you feel about the .dataset property being None for Histograms created using the Histogram constructor?

I think that's fine and see no need for a warning.

@philippjfr
Member

But this wasn't the only case. QuadMesh is another case where the .data property is not compatible with the Dataset storage backends.

I'm not sure what you mean by this. It definitely is compatible with certain Dataset interfaces, specifically the GridInterface.

@jonmmease
Collaborator Author

Yeah, never mind. I was running into problems with trying to construct a Dataset from self from within the constructor because self wasn't fully constructed yet. In 4011043, I moved the logic to construct a default .dataset property to the property getter method.

With this change, .dataset is never None and there's no special handling needed for Histograms 🎉 cc @jlstevens
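
For anyone following along, the pattern described above is roughly the following. This is a simplified sketch with hypothetical stand-in classes (`SketchDataset`, `SketchElement`), not the code from the commit; in particular, caching the default dataset after the first access is an assumption.

```python
class SketchDataset:
    """Hypothetical minimal stand-in for a dataset built from an element."""

    def __init__(self, source):
        self.source = source


class SketchElement:
    """Hypothetical element that may or may not be given a dataset."""

    def __init__(self, data, dataset=None):
        self.data = data
        # Do not build a default here: self is not fully constructed yet.
        self._dataset = dataset

    @property
    def dataset(self):
        # Build the default lazily, once the element is fully constructed.
        if self._dataset is None:
            self._dataset = SketchDataset(self)
        return self._dataset


# An element created directly gets a default dataset on first access.
standalone = SketchElement({"x": [1, 2, 3]})

# An element derived from an existing dataset keeps a reference to it.
shared = SketchDataset(None)
derived = SketchElement({"x": [1, 2, 3]}, dataset=shared)
```

Deferring the construction to the getter avoids touching a half-initialized object in the constructor, and callers never see None.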

@jonmmease
Collaborator Author

Closing in favor of the new pipeline approach implemented in #3967

@jonmmease jonmmease closed this Sep 18, 2019
@jonmmease jonmmease deleted the select_all_dims branch October 3, 2019 11:56