Skip to content

Commit

Permalink
Refactor gtda/diagrams (#454)
Browse files Browse the repository at this point in the history
* First attempt at fix of #438

* Fix y-axis in `PersistenceImage` plots, extending #453

* Represent multiplicity of persistence pairs in hovertext in `plot_diagram`

* Change reflect mode of `gaussian_filter` to "constant" from "reflect"

Affects `HeatKernel` and `PersistenceImage`

* Fix `PersistenceLandscape` plot method

- Only the figure corresponding to the first seen homology dimension was returned
- Output is now a figure with subplots, a main title, and one subtitle per plot (homology dimension)

* Improve tests for plot methods in gtda.diagrams.representations

Cover use of `plotly_params`

* Minor docstring linting

* Miscellaneous docstring improvements in gtda/diagrams

* Fix validation dictionary for `metric_params` in the case of `PersistenceImage`

* Change default value of `order` in Amplitude, from 2. to None (vector features)

* Change meaning of default None for `weight_function` in `PersistenceImage`

- ``None`` corresponding the identity means that there really is a non-trivial weighting in that case. Semantically, this does not seem correct ("None" should mean no weighting at all)

* Improve code style and clarity in plot methods in gtda.diagrams.representations

* Refactor gtda/diagrams/_metrics.py to fix several bugs

- Change computation of heat/persistence image distances and amplitudes to yield the continuum limit when `n_bins` tends to infinity.
- Make `sigma` in persistence-image-- and heat-kernel--based representations/distances/amplitudes measured in the same units as the filtration parameter (not in pixels), thus decoupling its effect from the effect of `n_bins`. Also change the default value in PairwiseDistance, Amplitude, HeatKernel and PersistenceImage from 1. to 0.1.
- Remove trivial points from diagrams before creating a sampled image in `heats` and `persistence_images`. This ensures in particular that the trivial points really give no contribution regardless of the weight function used for persistence images.
- Finish fixing #438 and similar issues with PairwiseDistance when the metric is persistence-image--based.
- Ensure `silhouettes` does not create NaNs when a subdiagram is trivial.
- Change default value of `power` in `silhouette_amplitudes`
 and `silhouette_distances` to 1., to agree with Amplitude and PairwiseDistance docs.
- Fix use of np.array_equal in `persistence_image_distances` and `heat_distances` -- so far the boolean check for equal inputs always failed due to the in-place modifications of `diagrams_1`.
- No longer store weights in `effective_metric_params_` attribute of PairwiseDistance and Amplitude when the metric is persistence-image--based.
- Remove _heat from gtda.diagrams._metrics.
- Remove identity from gtda.diagrams._utils and make default `weight_function` kwargs equal to np.ones_like everywhere to agree with default in `PersistenceImage`.
- Other style improvements (variable names, linting)

* Fix trace name when homology dimension is np.inf in `BettiCurve` and `Silhouette`

* Adapt `test_all_pts_the_same` to new behaviour of `heats_` in gtda.diagrams._metrics

* Improve test coverage of `Amplitude` and `PairwiseDistance`

- Make sure silhouettes and persistence images are covered throughout
- Cover `order` parameter throughout

* Add test of zero `weight_function` for `PersistenceImage`

* Make behaviour of `Scaler.fit` when the metric is persistence image the same as `Amplitude`

Accordingly add more combinations of metrics and metric_params in tests for Scaler

* Delete never-used `_matrix_wrapper` and `_arrays_wrapper` functions

* Remove `_pad` from gtda.diagrams._utils as it is never used

* Make `copy=True` in calls to check_diagrams in Scaler.transform and Scaler.inverse_transform

* Make `homology_dimensions_` attributes tuples instead of lists, with integers when possible

* Improve code style

* Hard-code zero array outputs by `heats` and `persistence_images` when step sizes are zero

* Add `homology_dimensions` kwarg to `_bin`

Achieves beautification of self._samplings and self.samplings_ (ints shown instead of floats) in several transformers (and saves some computation)

* Adapt choices of `min_values`, `max_values` and `sigma` in hypothesis-based tests

New meaning of sigma led to overflow issues in existing tests.

* Make all homology dimensions equal in `test_hk_big_sigma`

Also extend this test to `PersistenceImage`, and rename it accordingly

* Cover use of `plotly_params` kwarg in diagram preprocessing classes plot methods

* Extract some common logic from plot methods in gtda.diagrams.representations

* Silence expected warnings from image transformers in test_common

* Implement @wreise's suggestion to abstract away sorting and integer conversion of fit hom dims

* Add tests for `Amplitude` and `PairwiseDistance` to check that a zero weight function yields identically zero amplitudes/distances.

* Refactor `_subdiagrams` to throw informative errors on unfulfilled input properties
  • Loading branch information
ulupo authored Aug 14, 2020
1 parent a445530 commit 0932d85
Show file tree
Hide file tree
Showing 16 changed files with 934 additions and 639 deletions.
419 changes: 251 additions & 168 deletions gtda/diagrams/_metrics.py

Large diffs are not rendered by default.

92 changes: 65 additions & 27 deletions gtda/diagrams/_utils.py
Original file line number Diff line number Diff line change
Expand Up @@ -4,33 +4,58 @@
import numpy as np


def _homology_dimensions_to_sorted_ints(homology_dimensions):
return tuple(
sorted([int(dim) if dim != np.inf else dim
for dim in homology_dimensions])
)


def _subdiagrams(X, homology_dimensions, remove_dim=False):
"""For each diagram in a collection, extract the subdiagrams in a given
list of homology dimensions. It is assumed that all diagrams in X contain
the same number of points in each homology dimension."""
n = len(X)
n_samples = len(X)
X_0 = X[0]

def _subdiagrams_single_homology_dimension(homology_dimension):
n_features_in_dim = np.sum(X_0[:, 2] == homology_dimension)
try:
# In this case, reshape ensures copy
Xs = X[X[:, :, 2] == homology_dimension].\
reshape(n_samples, n_features_in_dim, 3)
return Xs
except ValueError as e:
if e.args[0].lower().startswith("cannot reshape array"):
raise ValueError(
f"All persistence diagrams in the collection must have "
f"the same number of birth-death-dimension triples in any "
f"given homology dimension. This is not true in homology "
f"dimension {homology_dimension}. Trivial triples for "
f"which birth = death may be added or removed to fulfill "
f"this requirement."
)
else:
raise e

if len(homology_dimensions) == 1:
Xs = X[X[:, :, 2] == homology_dimensions[0]].reshape(n, -1, 3)
Xs = _subdiagrams_single_homology_dimension(homology_dimensions[0])
else:
Xs = np.concatenate([X[X[:, :, 2] == dim].reshape(n, -1, 3)
for dim in homology_dimensions],
axis=1)
# np.concatenate will also create a copy
Xs = np.concatenate(
[_subdiagrams_single_homology_dimension(dim)
for dim in homology_dimensions],
axis=1
)
if remove_dim:
Xs = Xs[:, :, :2]
return Xs


def _pad(X, max_diagram_sizes):
X_padded = {dim: np.pad(
X[dim],
((0, 0), (0, max_diagram_sizes[dim] - X[dim].shape[1]),
(0, 0)), 'constant') for dim in X.keys()}
return X_padded


def _sample_image(image, sampled_diag):
# NOTE: Modifies `image` in-place
unique, counts = np.unique(sampled_diag, axis=0, return_counts=True)
def _sample_image(image, diagram_pixel_coords):
# WARNING: Modifies `image` in-place
unique, counts = \
np.unique(diagram_pixel_coords, axis=0, return_counts=True)
unique = tuple(tuple(row) for row in unique.astype(np.int).T)
image[unique] = counts

Expand All @@ -54,7 +79,7 @@ def _multirange(counts):

def _filter(X, filtered_homology_dimensions, cutoff):
n = len(X)
homology_dimensions = sorted(list(set(X[0, :, 2])))
homology_dimensions = sorted(np.unique(X[0, :, 2]))
unfiltered_homology_dimensions = [dim for dim in homology_dimensions if
dim not in filtered_homology_dimensions]

Expand Down Expand Up @@ -97,8 +122,9 @@ def _filter(X, filtered_homology_dimensions, cutoff):
return Xf


def _bin(X, metric, n_bins=100, **kw_args):
homology_dimensions = sorted(list(set(X[0, :, 2])))
def _bin(X, metric, n_bins=100, homology_dimensions=None, **kw_args):
if homology_dimensions is None:
homology_dimensions = sorted(np.unique(X[0, :, 2]))
# For some vectorizations, we force the values to be the same + widest
sub_diags = {dim: _subdiagrams(X, [dim], remove_dim=True)
for dim in homology_dimensions}
Expand Down Expand Up @@ -131,18 +157,30 @@ def _bin(X, metric, n_bins=100, **kw_args):
samplings = {}
step_sizes = {}
for dim in homology_dimensions:
samplings[dim], step_sizes[dim] = np.linspace(min_vals[dim],
max_vals[dim],
retstep=True,
num=n_bins)
samplings[dim], step_sizes[dim] = np.linspace(
min_vals[dim], max_vals[dim], retstep=True, num=n_bins
)
if metric in ['landscape', 'betti', 'heat', 'silhouette']:
for dim in homology_dimensions:
samplings[dim] = samplings[dim][:, [0], None]
step_sizes[dim] = step_sizes[dim][0]
return samplings, step_sizes


def _calculate_weights(X, weight_function, samplings, **kw_args):
weights = {dim: weight_function(samplings[dim][:, 1])
for dim in samplings.keys()}
return weights
def _make_homology_dimensions_mapping(homology_dimensions,
homology_dimensions_ref):
"""`homology_dimensions_ref` is assumed to be a sorted tuple as is e.g.
:attr:`homology_dimensions_` for several transformers."""
if homology_dimensions is None:
homology_dimensions_mapping = list(enumerate(homology_dimensions_ref))
else:
homology_dimensions_mapping = []
for dim in homology_dimensions:
if dim not in homology_dimensions_ref:
raise ValueError(f"All homology dimensions must be in "
f"{homology_dimensions_ref}; {dim} is not.")
else:
homology_dimensions_arr = np.array(homology_dimensions_ref)
inv_idx = np.flatnonzero(homology_dimensions_arr == dim)[0]
homology_dimensions_mapping.append((inv_idx, dim))
return homology_dimensions_mapping
76 changes: 44 additions & 32 deletions gtda/diagrams/distance.py
Original file line number Diff line number Diff line change
Expand Up @@ -8,7 +8,7 @@
from sklearn.utils.validation import check_is_fitted

from ._metrics import _AVAILABLE_METRICS, _parallel_pairwise
from ._utils import _bin, _calculate_weights
from ._utils import _bin, _homology_dimensions_to_sorted_ints
from ..utils._docs import adapt_fit_transform_docs
from ..utils.intervals import Interval
from ..utils.validation import check_diagrams, validate_params
Expand All @@ -24,9 +24,6 @@ class PairwiseDistance(BaseEstimator, TransformerMixin):
matrices or a single distance matrix between pairs of diagrams is
calculated according to the following steps:
Input collections of persistence diagrams for this transformer must satisfy
certain requirements, see e.g. :meth:`fit`.
1. All diagrams are partitioned into subdiagrams corresponding to
distinct homology dimensions.
2. Pairwise distances between subdiagrams of equal homology
Expand All @@ -37,22 +34,29 @@ class PairwiseDistance(BaseEstimator, TransformerMixin):
three-dimensional array, or a single distance matrix constructed
by taking norms of the vectors of distances between diagram pairs.
**Important notes**:
- Input collections of persistence diagrams for this transformer must
satisfy certain requirements, see e.g. :meth:`fit`.
- The shape of outputs of :meth:`transform` depends on the value of the
`order` parameter.
Parameters
----------
metric : ``'bottleneck'`` | ``'wasserstein'`` | ``'landscape'`` | \
``'betti'`` | ``'heat'`` | ``'persistence_image'``, | \
``'silhouette'``, optional, default: ``'landscape'``
metric : ``'bottleneck'`` | ``'wasserstein'`` | ``'betti'`` | \
``'landscape'`` | ``'silhouette'`` | ``'heat'`` | \
``'persistence_image'``, optional, default: ``'landscape'``
Distance or dissimilarity function between subdiagrams:
- ``'bottleneck'`` and ``'wasserstein'`` refer to the identically named
perfect-matching--based notions of distance.
- ``'betti'`` refers to the :math:`L^p` distance between Betti curves.
- ``'landscape'`` refers to the :math:`L^p` distance between
persistence landscapes.
- ``'betti'`` refers to the :math:`L^p` distance between Betti curves.
- ``'heat'`` refers to the :math:`L^p` distance between
Gaussian-smoothed diagrams.
- ``'silhouette'`` refers to the :math:`L^p` distance between
silhouettes.
- ``'heat'`` refers to the :math:`L^p` distance between
Gaussian-smoothed diagrams.
- ``'persistence_image'`` refers to the :math:`L^p` distance between
Gaussian-smoothed diagrams represented on birth-persistence axes.
Expand All @@ -61,27 +65,27 @@ class PairwiseDistance(BaseEstimator, TransformerMixin):
``None`` is equivalent to passing the defaults described below):
- If ``metric == 'bottleneck'`` the only argument is `delta` (float,
default: ``0.01``). When equal to ``0.``, an exact algorithm is
used; otherwise, a faster approximate algorithm is used.
default: ``0.01``). When equal to ``0.``, an exact algorithm is used;
otherwise, a faster approximate algorithm is used.
- If ``metric == 'wasserstein'`` the available arguments are `p`
(float, default: ``2.``) and `delta` (float, default: ``0.01``).
Unlike the case of ``'bottleneck'``, `delta` cannot be set to
``0.`` and an exact algorithm is not available.
Unlike the case of ``'bottleneck'``, `delta` cannot be set to ``0.``
and an exact algorithm is not available.
- If ``metric == 'betti'`` the available arguments are `p` (float,
default: ``2.``) and `n_bins` (int, default: ``100``).
- If ``metric == 'landscape'`` the available arguments are `p`
(float, default: ``2.``), `n_bins` (int, default: ``100``) and
`n_layers` (int, default: ``1``).
- If ``metric == 'heat'`` the available arguments are `p`
(float, default: ``2.``), `sigma` (float, default: ``1.``) and
`n_bins` (int, default: ``100``).
- If ``metric == 'silhouette'`` the available arguments are `p`
(float, default: ``2.``), `order` (float, default: ``1.``) and
`n_bins` (int, default: ``100``).
- If ``metric == 'landscape'`` the available arguments are `p` (float,
default: ``2.``), `n_bins` (int, default: ``100``) and `n_layers`
(int, default: ``1``).
- If ``metric == 'silhouette'`` the available arguments are `p` (float,
default: ``2.``), `power` (float, default: ``1.``) and `n_bins` (int,
default: ``100``).
- If ``metric == 'heat'`` the available arguments are `p` (float,
default: ``2.``), `sigma` (float, default: ``0.1``) and `n_bins`
(int, default: ``100``).
- If ``metric == 'persistence_image'`` the available arguments are `p`
(float, default: ``2.``), `sigma` (float, default: ``1.``),
`n_bins` (int, default: ``100``) and `weight_function`
(callable or None, default: ``None``).
(float, default: ``2.``), `sigma` (float, default: ``0.1``), `n_bins`
(int, default: ``100``) and `weight_function` (callable or None,
default: ``None``).
order : float or None, optional, default: ``2.``
If ``None``, :meth:`transform` returns for each pair of diagrams a
Expand All @@ -98,9 +102,9 @@ class PairwiseDistance(BaseEstimator, TransformerMixin):
----------
effective_metric_params_ : dict
Dictionary containing all information present in `metric_params` as
well as on any relevant quantities computed in :meth:`fit`.
well as relevant quantities computed in :meth:`fit`.
homology_dimensions_ : list
homology_dimensions_ : tuple
Homology dimensions seen in :meth:`fit`, sorted in ascending order.
See also
Expand Down Expand Up @@ -174,15 +178,23 @@ def fit(self, X, y=None):
validate_params(
self.effective_metric_params_, _AVAILABLE_METRICS[self.metric])

self.homology_dimensions_ = sorted(set(X[0, :, 2]))
# Find the unique homology dimensions in the 3D array X passed to `fit`
# assuming that they can all be found in its zero-th entry
homology_dimensions_fit = np.unique(X[0, :, 2])
self.homology_dimensions_ = \
_homology_dimensions_to_sorted_ints(homology_dimensions_fit)

self.effective_metric_params_['samplings'], \
self.effective_metric_params_['step_sizes'] = \
_bin(X, metric=self.metric, **self.effective_metric_params_)
_bin(X, self.metric, **self.effective_metric_params_)

if self.metric == 'persistence_image':
self.effective_metric_params_['weights'] = \
_calculate_weights(X, **self.effective_metric_params_)
weight_function = self.effective_metric_params_.get(
'weight_function', None
)
weight_function = \
np.ones_like if weight_function is None else weight_function
self.effective_metric_params_['weight_function'] = weight_function

self._X = X
return self
Expand Down
Loading

0 comments on commit 0932d85

Please sign in to comment.