Refactor gtda/diagrams (#454)

* First attempt at fix of #438 * Fix y-axis in `PersistenceImage` plots, extending #453 * Represent multiplicity of persistence pairs in hovertext in `plot_diagram` * Change reflect mode of `gaussian_filter` to "constant" from "reflect" Affects `HeatKernel` and `PersistenceImage` * Fix `PersistenceLandscape` plot method - Only the figure corresponding to the first seen homology dimension was returned - Output is now a figure with subplots, a main title, and one subtitle per plot (homology dimension) * Improve tests for plot methods in gtda.diagrams.representations Cover use of `plotly_params` * Minor docstring linting * Miscellaneous docstring improvements in gtda/diagrams * Fix validation dictionary for `metric_params` in the case of `PersistenceImage` * Change default value of `order` in Amplitude, from 2. to None (vector features) * Change meaning of default None for `weight_function` in `PersistenceImage` - ``None`` corresponding the identity means that there really is a non-trivial weighting in that case. Semantically, this does not seem correct ("None" should mean no weighting at all) * Improve code style and clarity in plot methods in gtda.diagrams.representations * Refactor gtda/diagrams/_metrics.py to fix several bugs - Change computation of heat/persistence image distances and amplitudes to yield the continuum limit when `n_bins` tends to infinity. - Make `sigma` in persistence-image-- and heat-kernel--based representations/distances/amplitudes measured in the same units as the filtration parameter (not in pixels), thus decoupling its effect from the effect of `n_bins`. Also change the default value in PairwiseDistance, Amplitude, HeatKernel and PersistenceImage from 1. to 0.1. - Remove trivial points from diagrams before creating a sampled image in `heats` and `persistence_images`. This ensures in particular that the trivial points really give no contribution regardless of the weight function used for persistence images. - Finish fixing #438 and similar issues with PairwiseDistance when the metric is persistence-image--based. - Ensure `silhouettes` does not create NaNs when a subdiagram is trivial. - Change default value of `power` in `silhouette_amplitudes` and `silhouette_distances` to 1., to agree with Amplitude and PairwiseDistance docs. - Fix use of np.array_equal in `persistence_image_distances` and `heat_distances` -- so far the boolean check for equal inputs always failed due to the in-place modifications of `diagrams_1`. - No longer store weights in `effective_metric_params_` attribute of PairwiseDistance and Amplitude when the metric is persistence-image--based. - Remove _heat from gtda.diagrams._metrics. - Remove identity from gtda.diagrams._utils and make default `weight_function` kwargs equal to np.ones_like everywhere to agree with default in `PersistenceImage`. - Other style improvements (variable names, linting) * Fix trace name when homology dimension is np.inf in `BettiCurve` and `Silhouette` * Adapt `test_all_pts_the_same` to new behaviour of `heats_` in gtda.diagrams._metrics * Improve test coverage of `Amplitude` and `PairwiseDistance` - Make sure silhouettes and persistence images are covered throughout - Cover `order` parameter throughout * Add test of zero `weight_function` for `PersistenceImage` * Make behaviour of `Scaler.fit` when the metric is persistence image the same as `Amplitude` Accordingly add more combinations of metrics and metric_params in tests for Scaler * Delete never-used `_matrix_wrapper` and `_arrays_wrapper` functions * Remove `_pad` from gtda.diagrams._utils as it is never used * Make `copy=True` in calls to check_diagrams in Scaler.transform and Scaler.inverse_transform * Make `homology_dimensions_` attributes tuples instead of lists, with integers when possible * Improve code style * Hard-code zero array outputs by `heats` and `persistence_images` when step sizes are zero * Add `homology_dimensions` kwarg to `_bin` Achieves beautification of self._samplings and self.samplings_ (ints shown instead of floats) in several transformers (and saves some computation) * Adapt choices of `min_values`, `max_values` and `sigma` in hypothesis-based tests New meaning of sigma led to overflow issues in existing tests. * Make all homology dimensions equal in `test_hk_big_sigma` Also extend this test to `PersistenceImage`, and rename it accordingly * Cover use of `plotly_params` kwarg in diagram preprocessing classes plot methods * Extract some common logic from plot methods in gtda.diagrams.representations * Silence expected warnings from image transformers in test_common * Implement @wreise's suggestion to abstract away sorting and integer conversion of fit hom dims * Add tests for `Amplitude` and `PairwiseDistance` to check that a zero weight function yields identically zero amplitudes/distances. * Refactor `_subdiagrams` to throw informative errors on unfulfilled input properties
giotto-ai · Aug 14, 2020 · 0932d85 · 0932d85
1 parent a445530
commit 0932d85
Show file tree

Hide file tree

Showing 16 changed files with 934 additions and 639 deletions.
diff --git a/gtda/diagrams/_metrics.py b/gtda/diagrams/_metrics.py
diff --git a/gtda/diagrams/_utils.py b/gtda/diagrams/_utils.py
@@ -4,33 +4,58 @@
 import numpy as np
 
 
+def _homology_dimensions_to_sorted_ints(homology_dimensions):
+    return tuple(
+        sorted([int(dim) if dim != np.inf else dim
+                for dim in homology_dimensions])
+        )
+
+
 def _subdiagrams(X, homology_dimensions, remove_dim=False):
     """For each diagram in a collection, extract the subdiagrams in a given
     list of homology dimensions. It is assumed that all diagrams in X contain
     the same number of points in each homology dimension."""
-    n = len(X)
+    n_samples = len(X)
+    X_0 = X[0]
+
+    def _subdiagrams_single_homology_dimension(homology_dimension):
+        n_features_in_dim = np.sum(X_0[:, 2] == homology_dimension)
+        try:
+            # In this case, reshape ensures copy
+            Xs = X[X[:, :, 2] == homology_dimension].\
+                reshape(n_samples, n_features_in_dim, 3)
+            return Xs
+        except ValueError as e:
+            if e.args[0].lower().startswith("cannot reshape array"):
+                raise ValueError(
+                    f"All persistence diagrams in the collection must have "
+                    f"the same number of birth-death-dimension triples in any "
+                    f"given homology dimension. This is not true in homology "
+                    f"dimension {homology_dimension}. Trivial triples for "
+                    f"which birth = death may be added or removed to fulfill "
+                    f"this requirement."
+                )
+            else:
+                raise e
+
     if len(homology_dimensions) == 1:
-        Xs = X[X[:, :, 2] == homology_dimensions[0]].reshape(n, -1, 3)
+        Xs = _subdiagrams_single_homology_dimension(homology_dimensions[0])
     else:
-        Xs = np.concatenate([X[X[:, :, 2] == dim].reshape(n, -1, 3)
-                             for dim in homology_dimensions],
-                            axis=1)
+        # np.concatenate will also create a copy
+        Xs = np.concatenate(
+            [_subdiagrams_single_homology_dimension(dim)
+             for dim in homology_dimensions],
+            axis=1
+            )
     if remove_dim:
         Xs = Xs[:, :, :2]
     return Xs
 
 
-def _pad(X, max_diagram_sizes):
-    X_padded = {dim: np.pad(
-        X[dim],
-        ((0, 0), (0, max_diagram_sizes[dim] - X[dim].shape[1]),
-         (0, 0)), 'constant') for dim in X.keys()}
-    return X_padded
-
-
-def _sample_image(image, sampled_diag):
-    # NOTE: Modifies `image` in-place
-    unique, counts = np.unique(sampled_diag, axis=0, return_counts=True)
+def _sample_image(image, diagram_pixel_coords):
+    # WARNING: Modifies `image` in-place
+    unique, counts = \
+        np.unique(diagram_pixel_coords, axis=0, return_counts=True)
     unique = tuple(tuple(row) for row in unique.astype(np.int).T)
     image[unique] = counts
 
@@ -54,7 +79,7 @@ def _multirange(counts):
 
 def _filter(X, filtered_homology_dimensions, cutoff):
     n = len(X)
-    homology_dimensions = sorted(list(set(X[0, :, 2])))
+    homology_dimensions = sorted(np.unique(X[0, :, 2]))
     unfiltered_homology_dimensions = [dim for dim in homology_dimensions if
                                       dim not in filtered_homology_dimensions]
 
@@ -97,8 +122,9 @@ def _filter(X, filtered_homology_dimensions, cutoff):
     return Xf
 
 
-def _bin(X, metric, n_bins=100, **kw_args):
-    homology_dimensions = sorted(list(set(X[0, :, 2])))
+def _bin(X, metric, n_bins=100, homology_dimensions=None, **kw_args):
+    if homology_dimensions is None:
+        homology_dimensions = sorted(np.unique(X[0, :, 2]))
     # For some vectorizations, we force the values to be the same + widest
     sub_diags = {dim: _subdiagrams(X, [dim], remove_dim=True)
                  for dim in homology_dimensions}
@@ -131,18 +157,30 @@ def _bin(X, metric, n_bins=100, **kw_args):
     samplings = {}
     step_sizes = {}
     for dim in homology_dimensions:
-        samplings[dim], step_sizes[dim] = np.linspace(min_vals[dim],
-                                                      max_vals[dim],
-                                                      retstep=True,
-                                                      num=n_bins)
+        samplings[dim], step_sizes[dim] = np.linspace(
+            min_vals[dim], max_vals[dim], retstep=True, num=n_bins
+            )
     if metric in ['landscape', 'betti', 'heat', 'silhouette']:
         for dim in homology_dimensions:
             samplings[dim] = samplings[dim][:, [0], None]
             step_sizes[dim] = step_sizes[dim][0]
     return samplings, step_sizes
 
 
-def _calculate_weights(X, weight_function, samplings, **kw_args):
-    weights = {dim: weight_function(samplings[dim][:, 1])
-               for dim in samplings.keys()}
-    return weights
+def _make_homology_dimensions_mapping(homology_dimensions,
+                                      homology_dimensions_ref):
+    """`homology_dimensions_ref` is assumed to be a sorted tuple as is e.g.
+    :attr:`homology_dimensions_` for several transformers."""
+    if homology_dimensions is None:
+        homology_dimensions_mapping = list(enumerate(homology_dimensions_ref))
+    else:
+        homology_dimensions_mapping = []
+        for dim in homology_dimensions:
+            if dim not in homology_dimensions_ref:
+                raise ValueError(f"All homology dimensions must be in "
+                                 f"{homology_dimensions_ref}; {dim} is not.")
+            else:
+                homology_dimensions_arr = np.array(homology_dimensions_ref)
+                inv_idx = np.flatnonzero(homology_dimensions_arr == dim)[0]
+                homology_dimensions_mapping.append((inv_idx, dim))
+    return homology_dimensions_mapping
diff --git a/gtda/diagrams/distance.py b/gtda/diagrams/distance.py
@@ -8,7 +8,7 @@
 from sklearn.utils.validation import check_is_fitted
 
 from ._metrics import _AVAILABLE_METRICS, _parallel_pairwise
-from ._utils import _bin, _calculate_weights
+from ._utils import _bin, _homology_dimensions_to_sorted_ints
 from ..utils._docs import adapt_fit_transform_docs
 from ..utils.intervals import Interval
 from ..utils.validation import check_diagrams, validate_params
@@ -24,9 +24,6 @@ class PairwiseDistance(BaseEstimator, TransformerMixin):
     matrices or a single distance matrix between pairs of diagrams is
     calculated according to the following steps:
 
-    Input collections of persistence diagrams for this transformer must satisfy
-    certain requirements, see e.g. :meth:`fit`.
-
         1. All diagrams are partitioned into subdiagrams corresponding to
            distinct homology dimensions.
         2. Pairwise distances between subdiagrams of equal homology
@@ -37,22 +34,29 @@ class PairwiseDistance(BaseEstimator, TransformerMixin):
            three-dimensional array, or a single distance matrix constructed
            by taking norms of the vectors of distances between diagram pairs.
 
+    **Important notes**:
+
+        - Input collections of persistence diagrams for this transformer must
+          satisfy certain requirements, see e.g. :meth:`fit`.
+        - The shape of outputs of :meth:`transform` depends on the value of the
+          `order` parameter.
+
     Parameters
     ----------
-    metric : ``'bottleneck'`` | ``'wasserstein'`` | ``'landscape'`` | \
-        ``'betti'`` | ``'heat'`` | ``'persistence_image'``, | \
-        ``'silhouette'``, optional, default: ``'landscape'``
+    metric : ``'bottleneck'`` | ``'wasserstein'`` | ``'betti'`` | \
+        ``'landscape'`` | ``'silhouette'`` | ``'heat'`` | \
+        ``'persistence_image'``, optional, default: ``'landscape'``
         Distance or dissimilarity function between subdiagrams:
 
         - ``'bottleneck'`` and ``'wasserstein'`` refer to the identically named
           perfect-matching--based notions of distance.
+        - ``'betti'`` refers to the :math:`L^p` distance between Betti curves.
         - ``'landscape'`` refers to the :math:`L^p` distance between
           persistence landscapes.
-        - ``'betti'`` refers to the :math:`L^p` distance between Betti curves.
-        - ``'heat'`` refers to the :math:`L^p` distance between
-          Gaussian-smoothed diagrams.
         - ``'silhouette'`` refers to the :math:`L^p` distance between
           silhouettes.
+        - ``'heat'`` refers to the :math:`L^p` distance between
+          Gaussian-smoothed diagrams.
         - ``'persistence_image'`` refers to the :math:`L^p` distance between
           Gaussian-smoothed diagrams represented on birth-persistence axes.
 
@@ -61,27 +65,27 @@ class PairwiseDistance(BaseEstimator, TransformerMixin):
         ``None`` is equivalent to passing the defaults described below):
 
         - If ``metric == 'bottleneck'`` the only argument is `delta` (float,
-          default: ``0.01``). When equal to ``0.``, an exact algorithm is
-          used; otherwise, a faster approximate algorithm is used.
+          default: ``0.01``). When equal to ``0.``, an exact algorithm is used;
+          otherwise, a faster approximate algorithm is used.
         - If ``metric == 'wasserstein'`` the available arguments are `p`
           (float, default: ``2.``) and `delta` (float, default: ``0.01``).
-          Unlike the case of ``'bottleneck'``, `delta` cannot be set to
-          ``0.`` and an exact algorithm is not available.
+          Unlike the case of ``'bottleneck'``, `delta` cannot be set to ``0.``
+          and an exact algorithm is not available.
         - If ``metric == 'betti'`` the available arguments are `p` (float,
           default: ``2.``) and `n_bins` (int, default: ``100``).
-        - If ``metric == 'landscape'`` the available arguments are `p`
-          (float, default: ``2.``), `n_bins` (int, default: ``100``) and
-          `n_layers` (int, default: ``1``).
-        - If ``metric == 'heat'`` the available arguments are `p`
-          (float, default: ``2.``), `sigma` (float, default: ``1.``) and
-          `n_bins` (int, default: ``100``).
-        - If ``metric == 'silhouette'`` the available arguments are `p`
-          (float, default: ``2.``), `order` (float, default: ``1.``) and
-          `n_bins` (int, default: ``100``).
+        - If ``metric == 'landscape'`` the available arguments are `p` (float,
+          default: ``2.``), `n_bins` (int, default: ``100``) and `n_layers`
+          (int, default: ``1``).
+        - If ``metric == 'silhouette'`` the available arguments are `p` (float,
+          default: ``2.``), `power` (float, default: ``1.``) and `n_bins` (int,
+          default: ``100``).
+        - If ``metric == 'heat'`` the available arguments are `p` (float,
+          default: ``2.``), `sigma` (float, default: ``0.1``) and `n_bins`
+          (int, default: ``100``).
         - If ``metric == 'persistence_image'`` the available arguments are `p`
-          (float, default: ``2.``), `sigma` (float, default: ``1.``),
-          `n_bins` (int, default: ``100``) and `weight_function`
-          (callable or None, default: ``None``).
+          (float, default: ``2.``), `sigma` (float, default: ``0.1``), `n_bins`
+          (int, default: ``100``) and `weight_function` (callable or None,
+          default: ``None``).
 
     order : float or None, optional, default: ``2.``
         If ``None``, :meth:`transform` returns for each pair of diagrams a
@@ -98,9 +102,9 @@ class PairwiseDistance(BaseEstimator, TransformerMixin):
     ----------
     effective_metric_params_ : dict
         Dictionary containing all information present in `metric_params` as
-        well as on any relevant quantities computed in :meth:`fit`.
+        well as relevant quantities computed in :meth:`fit`.
 
-    homology_dimensions_ : list
+    homology_dimensions_ : tuple
         Homology dimensions seen in :meth:`fit`, sorted in ascending order.
 
     See also
@@ -174,15 +178,23 @@ def fit(self, X, y=None):
         validate_params(
             self.effective_metric_params_, _AVAILABLE_METRICS[self.metric])
 
-        self.homology_dimensions_ = sorted(set(X[0, :, 2]))
+        # Find the unique homology dimensions in the 3D array X passed to `fit`
+        # assuming that they can all be found in its zero-th entry
+        homology_dimensions_fit = np.unique(X[0, :, 2])
+        self.homology_dimensions_ = \
+            _homology_dimensions_to_sorted_ints(homology_dimensions_fit)
 
         self.effective_metric_params_['samplings'], \
             self.effective_metric_params_['step_sizes'] = \
-            _bin(X, metric=self.metric, **self.effective_metric_params_)
+            _bin(X, self.metric, **self.effective_metric_params_)
 
         if self.metric == 'persistence_image':
-            self.effective_metric_params_['weights'] = \
-                _calculate_weights(X, **self.effective_metric_params_)
+            weight_function = self.effective_metric_params_.get(
+                'weight_function', None
+                )
+            weight_function = \
+                np.ones_like if weight_function is None else weight_function
+            self.effective_metric_params_['weight_function'] = weight_function
 
         self._X = X
         return self