Implement grid concatenation and standardize datatype casting #2762

philippjfr · 2018-06-02T13:08:21Z

This PR has two main aims:

Standardize and simplify how casting between different datatypes works
Implement concatenation for all datatypes

Before this PR, both casting and concatenation were limited to columnar data formats, which meant that certain operations, e.g. a HoloMap collapse, could not be applied to gridded data. Having a dedicated concat implementation for both columnar and gridded data also allows much more efficient concatenation than what is currently used by methods like .table and .dframe, and will generalize them so that we can eventually replace the column-specific .table implementation with a general one that returns a dataset of arbitrary type.

Implementing concatenation along HoloMap dimensions also means that Dataset.groupby operations are now reversible and fixes HoloMap.collapse.

Fixes Collapsing UniformNdMappings of Dataset types #1417
Adds unit tests

philippjfr · 2018-06-02T13:10:01Z

holoviews/core/util.py

@@ -1532,7 +1531,7 @@ def groupby_python(self_or_cls, ndmapping, dimensions, container_type,
        selects = get_unique_keys(ndmapping, dimensions)
        selects = group_select(list(selects))
        groups = [(k, group_type((v.reindex(idims) if hasattr(v, 'kdims')
-                                  else [((), (v,))]), **kwargs))
+                                  else [((), v)]), **kwargs))


These are NdMapping implementation details left over from when we had an NdElement, the precursor to datasets, which now lead to strange behavior.

philippjfr · 2018-06-02T13:11:38Z

holoviews/core/data/__init__.py

 datatypes = ['dictionary', 'grid']

 try:
    import pandas as pd # noqa (Availability import)
    from .pandas import PandasInterface
+    default_datatype = 'dataframe'


When converting from gridded to columnar data, the code usually has to cast the data to a specific datatype. Various places in the code hardcoded ['pandas', 'dictionary'] for this; defining a default_datatype avoids having to hardcode this all over the place.

Shouldn't this be 'default_columnar_datatype', then? Or are there no cases where columnar data needs to be cast into some gridded data type?

Columnar data cannot be cast to gridded data without some kind of aggregation occurring. So that's correct. Would still be okay with changing it to default_columnar_datatype.
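
To illustrate the idea of a single module-level default, here is a minimal sketch. It assumes the fallback is 'dictionary' when pandas is not installed, and `columnar_cast_target` is a hypothetical helper that is not part of the PR.

```python
import importlib.util

# Datatypes that are always available (as in the diff above).
datatypes = ['dictionary', 'grid']

# Assumed fallback when pandas is missing; the diff only shows the pandas branch.
default_datatype = 'dictionary'

if importlib.util.find_spec('pandas') is not None:
    # pandas is importable, so prefer the dataframe interface for columnar data.
    default_datatype = 'dataframe'


def columnar_cast_target(requested=None):
    """Hypothetical helper: datatype list to request when casting gridded
    data to a columnar format, instead of hardcoding it at each call site."""
    return [requested if requested is not None else default_datatype]
```

Call sites can then ask for `columnar_cast_target()` rather than repeating a literal list of datatypes.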

philippjfr · 2018-06-02T13:14:11Z

holoviews/core/data/grid.py

+            arrays = [grid[vdim.name] for grid in grids]
+            stack = da.stack if any(is_dask(arr) for arr in arrays) else np.stack
+            new_data[vdim.name] = stack(arrays, -1)
+        return new_data


Since arrays cannot be concatenated along multiple axes at once, the implementation of concat on gridded interfaces has two components: a general concat method coordinates hierarchical concatenation along each dimension, and it uses the interface-specific concat_dim method implementations to concatenate along one particular axis or dimension.
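
The two-level scheme can be sketched roughly as follows. This is not the code in the PR: it uses nested lists instead of arrays, stacks on a new leading axis rather than the trailing axis used in the diff, and the function names and signatures are simplified stand-ins.

```python
from itertools import groupby


def concat_dim(items):
    """Stack (coordinate, array) pairs along one new leading axis."""
    coords = [coord for coord, _ in items]
    stacked = [arr for _, arr in items]
    return coords, stacked


def concat(keyed, ndims):
    """Concatenate (key_tuple, array) pairs one key dimension at a time.

    `keyed` must be sorted by key and cover the full grid of keys.
    Returns the coordinates per key dimension and the combined nested list.
    """
    coords = [None] * ndims
    # Work from the innermost key dimension outwards.
    for axis in reversed(range(ndims)):
        merged = []
        # Group entries that share all outer key values.
        for prefix, group in groupby(keyed, key=lambda kv: kv[0][:axis]):
            items = [(key[axis], arr) for key, arr in group]
            coords[axis], stacked = concat_dim(items)
            merged.append((prefix, stacked))
        keyed = merged
    return coords, keyed[0][1]


keyed = [((0, 'a'), [1, 2]), ((0, 'b'), [3, 4]),
         ((1, 'a'), [5, 6]), ((1, 'b'), [7, 8])]
coords, data = concat(keyed, ndims=2)
```

Each pass removes one key dimension and adds one array axis, so after the last pass a single array remains under the empty key.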

philippjfr · 2018-06-02T13:15:06Z

Made some additional comments to clarify certain implementation details.

philippjfr · 2018-06-02T13:18:46Z

holoviews/core/data/interface.py

+        cast = []
+        for ds in datasets:
+            if cast_type is not None or ds.interface.datatype != datatype:
+                ds = ds.clone(ds, datatype=[datatype], new_type=cast_type)


Casting works quite simply: if Interface.initialize is passed another dataset and finds a mismatch between the supplied datatype and the requested datatype, it will deconstruct the original dataset into the columnar or gridded tuple format, which is supported by all interfaces. In this way a dataset can easily be cast to any other datatype, except for columnar -> gridded conversions.
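
As a rough illustration of casting through a shared tuple format, the sketch below uses two toy columnar interfaces. The class names, the `INTERFACES` registry and the `cast` function are hypothetical and only mirror the idea, not the actual Interface.initialize code.

```python
class DictInterface:
    """Toy columnar interface storing {column name: list of values}."""
    datatype = 'dictionary'

    @staticmethod
    def init(columns, names):
        return {name: list(col) for name, col in zip(names, columns)}

    @staticmethod
    def columns(data, names):
        return tuple(list(data[name]) for name in names)


class RowsInterface:
    """Toy columnar interface storing a list of row tuples."""
    datatype = 'rows'

    @staticmethod
    def init(columns, names):
        return [tuple(row) for row in zip(*columns)]

    @staticmethod
    def columns(data, names):
        if not data:
            return tuple([] for _ in names)
        return tuple(list(col) for col in zip(*data))


INTERFACES = {cls.datatype: cls for cls in (DictInterface, RowsInterface)}


def cast(data, names, source, target):
    """Cast by deconstructing into a tuple of columns and rebuilding."""
    if source == target:
        return data
    columns = INTERFACES[source].columns(data, names)
    return INTERFACES[target].init(columns, names)


rows = cast({'x': [1, 2], 'y': [3, 4]}, ['x', 'y'], 'dictionary', 'rows')
```

Because every interface can produce and consume the tuple of columns, adding a new datatype only requires those two methods rather than a converter for every pair of datatypes.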

philippjfr · 2018-06-02T13:23:13Z

As a follow-up to this PR we should provide special handling for dask arrays/dataframes during casting. This requires multiple things:

Interfaces need to declare if they support lazy data
Interfaces need to declare an API to check if the data for a dimension is lazy
The .values method on Interfaces needs to provide an option to return a lazy (i.e. dask) array

philippjfr · 2018-06-20T11:48:29Z

Ready for review.

jlstevens · 2018-06-20T16:02:56Z

holoviews/core/data/__init__.py

+    """
+    Concatenates multiple datasets wrapped in an NdMapping type
+    along all of its dimensions. Before concatenation all datasets
+    are cast to the same datatype. For columnar data concatenation


'same datatype' determined how?

Either explicitly defined or the type of the first dataset that was passed in.

Would be good to state that bit about it being chosen from the first one if not explicitly set.
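
The selection rule described above could be written as the following sketch. Plain dicts stand in for datasets here, and `resolve_datatype` is a hypothetical name, not a function from the PR.

```python
def resolve_datatype(datasets, datatype=None):
    """Return the datatype all datasets will be cast to.

    An explicitly supplied datatype wins; otherwise the datatype of the
    first dataset passed in is used.
    """
    if datatype is not None:
        return datatype
    if not datasets:
        raise ValueError("Cannot infer a datatype from an empty list of datasets.")
    return datasets[0]['datatype']


def needs_cast(datasets, datatype=None):
    """Flag which datasets differ from the resolved target datatype."""
    target = resolve_datatype(datasets, datatype)
    return [ds['datatype'] != target for ds in datasets]


# Stand-in datasets: only the 'datatype' key matters for this sketch.
example = [{'datatype': 'dataframe'}, {'datatype': 'dictionary'}]
```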

jlstevens · 2018-06-20T16:11:33Z

holoviews/core/data/interface.py

+            datasets = datasets.items()
+            keys, datasets = zip(*datasets)
+        elif isinstance(datasets, list) and not any(isinstance(v, tuple) for v in datasets):
+            keys = [()]*len(datasets)


What are all these empty tuple keys for? Just to get things in the right format?

Right, concatenate is usually meant for concatenating along some dimension, but you can also concatenate a simple list of datasets without concatenating along any dimension. For that case we generate empty tuple keys. Happy to add a comment. Separately, I also need to assert that this only happens for tabular data, since gridded data must be concatenated along some dimension.
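
A minimal sketch of that input normalization, assuming a plain dict in place of an NdMapping and using the hypothetical name `normalize_inputs`:

```python
def normalize_inputs(datasets):
    """Return (keys, values) for a mapping, a list of (key, value) pairs,
    or a bare list of values."""
    if isinstance(datasets, dict):
        items = list(datasets.items())
    elif isinstance(datasets, list) and not any(isinstance(v, tuple) for v in datasets):
        # Bare list: there is no dimension to concatenate along, so every
        # entry gets the empty tuple as its key.
        return [()] * len(datasets), list(datasets)
    else:
        items = list(datasets)
    if not items:
        return [], []
    keys, values = zip(*items)
    return list(keys), list(values)
```

With uniform (key, value) pairs, the downstream concatenation code no longer needs to distinguish between the keyed and unkeyed cases.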

jlstevens · 2018-06-20T16:49:51Z

holoviews/core/data/iris.py

@@ -4,6 +4,9 @@
 from itertools import product

 import iris
+from iris.coords import DimCoord
+from iris.cube import CubeList
+from iris.experimental.equalise_cubes import equalise_attributes


Will be good to have the iris interface moved to geoviews. Could this be done for 1.10.6?

Tests need to be moved into the holoviews package first.

You mean 'geoviews' package?

No I mean the /tests need to move to /holoviews/tests.

The interface tests are defined as mix-in classes, so if I want to run them in geoviews I have to be able to import them from holoviews. We also promised this to the bokeh folks so they can run our bokeh unit tests easily.

jlstevens · 2018-06-20T16:51:52Z

holoviews/core/spaces.py

-                col_data = group.last.clone(data)
-            collapsed[key] = col_data
-        return collapsed if self.ndims > 1 else collapsed.last
+                group_data = group.last.clone(data)


group_data can be a whole load of different things at different times. Not critical but I would prefer to have something that isn't clobbered so much.

jlstevens · 2018-06-20T16:53:29Z

Other than a few minor comments this looks good and I'm happy to merge.

jlstevens · 2018-06-22T11:59:45Z

Tests are green. Merging.

philippjfr added the type: bug and type: enhancement labels Jun 2, 2018

philippjfr added this to the v1.10.5 milestone Jun 2, 2018

philippjfr added status: WIP tag: component: data labels Jun 2, 2018

philippjfr force-pushed the cast_and_concat branch from 3a8b95a to b16c799 Compare June 2, 2018 14:01

philippjfr modified the milestones: v1.10.5, v1.10.6 Jun 10, 2018

philippjfr added 12 commits June 20, 2018 11:57

Implemented concatenation for most interfaces (7982048)
Defined default datatype (37bfdbb)
Stop wrapping NdMapping values in tuples in groupby (bf5ce00)
Simplified and improved casting (0ef9685)
Added concatenate utility (6e17281)
Fixed HoloMap.collapse for gridded data (901bb17)
Further improvements to grid concatenation (3ece3d9)
Updated old usages of Interface.concatenate (95b6844)
Fixed flakes (7dd80c5)
Fixes for iris concatenation (ffad755)
Added validation for GridInterface.concat (a96e32c)
Improved tests for grid concatenation (67bd62a)

philippjfr force-pushed the cast_and_concat branch from b16c799 to 67bd62a Compare June 20, 2018 11:47

philippjfr removed the status: WIP label Jun 20, 2018

Fixed bug in NdMapping.table (fc15f1a)

jlstevens reviewed Jun 20, 2018

Addressed review comments (6356310)

jlstevens merged commit e5d4adc into master Jun 22, 2018

philippjfr deleted the cast_and_concat branch July 4, 2018 11:13
