Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove incorrect assumptions in Filtering #436

Merged
merged 9 commits into from
Aug 1, 2020

Conversation

ulupo
Copy link
Collaborator

@ulupo ulupo commented Jul 17, 2020

Reference issues/PRs
Fixes #91. No assumptions are made on the nature of birth/death values beyond the fact that birth >= death.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Description
Arrays are no longer sorted by lifetime using the _sort utility function before calling _filter in Filtering's transform. Since _sort was not needed anywhere else, it has been removed completely from the codebase. The fact that no sorting is made before filtering means that the outputs of Filtering are now closer to the inputs in the following sense:

Checklist

  • I have read the guidelines for contributing.
  • My code follows the code style of this project. I used flake8 to check my Python changes.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I have added tests to cover my changes.
  • All new and existing tests passed. I used pytest to check this on Python tests.

@gtauzin
Copy link
Collaborator

gtauzin commented Jul 22, 2020

If I am not wrong, the benefit of sorting the array was to be able to free as much memory as possible. The diagrams data structure is already extremely heavy as it is padded and includes the dimensions for each point.

You may want to try to argsort by persistence instead and only keep the points the necessary points so that you a minimally padded array.

Arrays are no longer sorted by lifetime before filtering. Persistence pairs to be filtered out are now replaced by padding points in their original locations.
@ulupo
Copy link
Collaborator Author

ulupo commented Jul 22, 2020

Thanks @gtauzin! Indeed I was hasty and did not think carefully enough about it before pushing. I have now pushed an alternative method which achieves the same reduction in size as before, but makes more minimal calculations than using argsort, resulting in better performance relative to the current implementation : for instance, generating fake input of shape (10000, 2000, 3) where, for each entry, 1000 persistence pairs are in dimension 0 and 1000 are in dimension 1, I observe a 5x speedup when averaging over various values of the cutoff.

UPDATE: The performance figures are incorrect, the savings are there but are much more modest (roughly 40% speedup in the previous example). Profiling the code shows that indeed argsort becomes more and more burdensome as the size of the data grows larger. The reason that the net gains are not more spectacular is that argsort allows to use simple array slices downstream, which I believe is faster than the indexing made necessary by the current approach.

Still, I advocate for the new approach at least for the following reason: a persistence pair appears later than another in the output filtered array if and only if it appeared later in the input array. This makes debugging/visual comparison easier and makes the output look just like the output of a modified (imaginary) version of the PH algorithm which simply did not record pairs without sufficiently large persistence.

@giotto-ai giotto-ai deleted a comment from azure-pipelines bot Jul 22, 2020
@ulupo ulupo changed the title Simplify Filtering, remove incorrect assumptions Remove incorrect assumptions in Filtering Jul 23, 2020
Copy link
Collaborator

@gtauzin gtauzin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Umbe.

I have made one documentation comment as I think the addition you have made to the diagram input documentation should be propagated through the whole code. However, I have several concerns about the PR.

First, about the bug described in #91, I could not reproduce it and I managed to properly filter a diagrams with negative values.

X = np.array([[[-0.5, -0.2, 0.],  [-3., -1.5, 0.]]])
f = dg.Filtering(epsilon = 0.8)
Xf = f.fit_transform(X)
Xf
array([[[-3. , -1.5,  0. ], [-3. , -3. ,  0. ]]])

Could you please provide an example of the issue?

About the proposed solution, my main concern is that it is particularly complex, and therefore hard to maintain. It seems that it was designed for performance, which is definitely good. I am just wondering if it is not an overkill. How much faster is it than the previous argsort-based solution?

gtda/diagrams/preprocessing.py Show resolved Hide resolved
@ulupo
Copy link
Collaborator Author

ulupo commented Jul 29, 2020

Thanks @gtauzin!

  • Concerning _filter utility function has bad behaviour when death values are negative #91, the incriminated line is saying: "all diagrams in the output array will have as many points as the maximum number of points in any original array whose death is not equal to zero". I believe this line has a strong bias towards homology dimension 0 and interplay with VietorisRipsPersistence. That's because we pretty much know in that case that padding will be made by entries of the form (0., 0., 0.). However, it is clearly inefficient in e.g. homology dimension 1 already. There, one does not expect to have padding by (0., 0., 1) because typically no persistence pair in dimension 1 is born at 0. So the current implementation is quite inefficient in these cases. To illustrate this, take the following example:

    import numpy as np
    from gtda.diagrams import Filtering
    
    X = np.array([[[0.5, 0.5, 1.], [0.5, 1., 1.]]])
    Filtering(epsilon=0.4).fit_transform(X)
    

    clearly one would like to have the output as

    array([[[0.5, 1., 1.]]])
    

    which is the case in the proposed implementation. But the current implementation returns X unchanged because the padding pair is not recognised as such by the code. You also asked about cases in which we introduce more problematic errors: take something like

    import numpy as np
    from gtda.diagrams import Filtering
    
    X = np.array([[[-0.5, 0., 0.], [0., 1., 0.]]])
    Filtering(epsilon=0.4).fit_transform(X)
    

    The current code, incorrectly, returns

    array([[[0., 1., 0.]]])
    

    because the first persistence triple has a death at zero, and hence is erased. The proposed solution returns X unchanged, as it should.

  • I am biased of course, but I am not sure I agree that the solution is substantially more complex (although it is undoubtedly longer). The helper function _multirange has a very simple interpretation and, as I write in its docstrings, it could have been implemented by an easy-to-read one-liner. Concerning the main function (which has grown by 5 lines), let me give a go at adding inline comments to guide the reader through the logic in a transparent way.

  • On performance, I wrote about this in a previous comment on this thread. I observe ~40% speedup compared to the current implementation in "large" datasets. I can make more extensive tests in the small dataset regime if you like.

  • Let me however stress what I now think is actually my main reason for suggesting these changes (beyond the need to fix _filter utility function has bad behaviour when death values are negative #91 and the non-earth-shattering performance improvements). I already articulated these points above but let me quote myself for convenience:

    Still, I advocate for the new approach at least for the following reason: a persistence pair appears later than another in the output filtered array if and only if it appeared later in the input array. This makes debugging/visual comparison easier and makes the output look just like the output of a modified (imaginary) version of the PH algorithm which simply did not record pairs without sufficiently large persistence.

@ulupo ulupo requested a review from gtauzin July 30, 2020 08:04
@ulupo
Copy link
Collaborator Author

ulupo commented Jul 30, 2020

@gtauzin I have now added extensive inline comments to guide through the new logic. Let me know what you think!

Copy link
Collaborator

@gtauzin gtauzin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's great that you added those comments, thanks. LGTM now :)

@ulupo ulupo merged commit 7162b87 into giotto-ai:master Aug 1, 2020
@ulupo ulupo deleted the fix/filtering branch August 1, 2020 21:16
ulupo added a commit that referenced this pull request Oct 9, 2020
* Make path to giotto-tda logo absolute for display on PyPI (#415)

* Fix formatting in CONTRIBUTING.rst (#416)

* Fix typo in Mapper docs for n_sig_figs (#417)

* Adds FlagserPersistence (#339)

* Add pyflagser >= 0.4.0 to requirements, GH pages installation instructions, and README.rst

* Add FlagserPersistence

* Add tests for FlagserPersistence

* Add FlagserPersistence to doc index

* Add See Also entries in simplicial

* Linting in ripser_interface

* Update See also for Cubical

* Fix max_edge_length docs in other simplicial transformers

* Add tests to check max_edge and infinity_values

* Fix low infinity value bug and put dimension padding in _utils

* Add test for low infinity values

* Simplify code in homology/_utils.py

Signed-off-by: Guillaume Tauzin <guillaumetauzin.ut@gmail.com>
Co-authored-by: Umberto <umberto.lupo@gmail.com>
Co-authored-by: Wojciech Reise <reisewojtus@gmail.com>

* Introduce `nightly_release` variable in azure-pipelines.yml (#421)

* Link Giotto-tda logo to raw.githubusercontent.com

* Add FlagserPersistence to the documentation (#425)

* Refactor GraphGeodesicDistance (#422)

* Refactor GraphGeodesicDistance

- Change return type to list for compatibility with homology transformers
- Use scipy's shortest_path instead of sklearn's graph_shortest_path
- Make clearer rules on the role of zero entries, infinity entries, and non-stored values
- Add parameters directed, unweighted, and method
- Support masked arrays
- Rewrite docstring

* Fix See Also in KNeighborsGraph and TransitionGraph

* Prevent using FW algorithms when some edges are zero

- Work around scipy/scipy#12424
- Introduce user warnings when algorithm cannot be chosen automatically or set to FW
- Fix test ground truth
- Improves array conversion in GraphGeodesicDistance.transform

* Add GraphGeodesicDistance unit tests on sparse/masked array input and more dtypes

* Fix Azure CI for nightly releases (#429)

* Add *.pyo and *.pyd to .gitignore (#430)

Extend list of ignored extensions to other Python bytecode-type files

* Modify check_point_clouds to allow for sparse input (#424)

* Modify check_point_cloud to allow for sparse input

- Allow for sparse input in VietorisRipsPersistence and FlagserPersistence when metric is precomputed.
- Fix docstrings to reflect the changes.
- Fix some typos and wording issues.

* Make docstrings more consistent across simplicial transformers

* Fix reference for FlagserPersistence

* Remove warning for different point cloud embedding dimensions

* Eliminate tests for different embedding dimension warnings

* Convert list input to 3D ndarray when possible

* Update my email address

* Accept sparse inputs even in non-precomputed case for VietorisRips and SparseRips

* Cover sparse cases and more metrics in simplicial unit tests

Co-authored-by: Umberto <u.lupo@l2f.ch>

* Update collaborator email addresses (#435)

* Update @lewtun's and @ulupo's emails in CODE_AUTHORS.rst

* Update @lewtun's and @ulupo's emails in GOVERNANCE.rst

* Refactor _subdiagrams to take all homology dimensions into account (#439)

* Fix mmap settings used by joblib.Parallel (#428)

* Fix mmap settings used by joblib.Parallel in HeatKernel and PersistenceImage

* Add tests

* Slice X first inside parallel calls to _subdiagrams

* Minor simplifications in plot_diagram

* Add @NickSale to CODE_AUTHORS.rst

Co-authored-by: Umberto Lupo <umberto.lupo@gmail.com>

* Fix axis ranges in `plot_diagram` (#437)

* Remove **input_layout kwargs, refactor layouts in plot_diagram

- Fix #409
- Fix calculation of the maximum filtration parameter which incorrectly included the homology dimensions
- Simplify code

* Transpose the output of PairwiseDistance to match scikit-learn convention (#420)

* Transpose output shape in PairwiseDistance.transform

* Update tests

* Simplify _parallel_pairwise

Co-authored-by: Umberto Lupo <46537483+ulupo@users.noreply.github.com>

* Replace pip install with python -m pip install in docs and Readme (#440)

* Remove incorrect assumptions in Filtering (#436)

* Remove `_sort` and refactor `filter` in gtda/diagrams/_utils.py

Arrays are no longer sorted by lifetime before filtering.

* Fix test for Filtering

* Improve documentation, begin addressing #233

* Small improvement to `_subdiagrams`

* Return figures instead of showing them (#441)

* Refactor of plotting API

- Make all functions in gtda/plotting return figures (or tuples of figures for betti_surfaces) instead of showing them
- `plot` class methods also return figures instead of showing them
- `transform_plot` and `fit_transform_plot` still show figures and only return transformed data
- Add `plot_params` kwarg throughout to allow user customisability of output figures (subtlety: one of the key can be either "trace" when the output figure only has one trace, or "traces" when it has several)

* Suppress user warnings on graph geodesic distance algorithms in tests

* Resolve overflow warnings in mapper filter tests

* Resolve numpy DeprecationWarning in test_validation

* [DOCS] Clarify expected diagram properties to fix #233 (#443)

* Allow plot_heatmap to take boolean arguments (#444)

* Add pullback cover set labels and partial cluster labels to Mapper hovertext, improve docs (#445)

* Add pullback set ID to mapper hovertext

* Add partial cluster label to Mapper hovertext

* Improve docs for plot_static_mapper_graph and plot_interactive_mapper_graph

* Fix tests after changes

* Slight wording improvement

* Further wording clarification

* Make wording on edges even clearer

* Refactor igraph.Graph output of Nerve, modify `fit` behaviour of Nerve, add `store_edge_elements` kwarg to Nerve and make_mapper_pipeline, add Nerve and ParallelClustering to docs (#447)

* Refactor igraph.Graph output, add Nerve and ParallelClustering to docs, add store_edge_elements kwarg to Nerve and make_mapper_pipeline

- Store node metadata not as a graph-level dictionary, but as vertex attributes accessible by graph.vs[attr_name][node_id] or graph.vs[node_id][attr_name] for attr_name in ["pullback_set_label", "partial_cluster_label", "node_elements"]
- Remove "node_id" from node attributes as it always coincided with the igraph.Graph node number anyway.
- Automatically store sizes of intersections as edge weights, accessible by graph.es["weight"].
- Add "store_intersections" kwarg to Nerve and make_mapper_pipeline to allow storing indices of node intersections as edge attributes, accessible via graph.es["edge_elements"].
- Simplify logic of Nerve.fit_transform code
- Change the attributes stored by Nerve.fit. Now the entire graph is stored as graph_ instead.
- Improve documentation of make_mapper_pipeline
- Expose ParallelClustering and Nerve in __init__ and docs.
- Adapt tests, mapper quickstart notebook, and mapper plotting functions.

* Add two tests for the behaviour of store_edge_elements and min_intersection

* Remove check for shape of `layout` in _calculate_graph_data

`layout` can only be a string or a callable, not an ndarray

* Improve test coverage of mapper visualization modules

* Create tests for plot_betti_surfaces and plot_betti_curves

* Add plotly_params to remaining plot methods in diagrams/representations, missed out in #441 

* Fix some linting and docstrings

* Minor improvements

* Avoid shadowing range function in plot_diagram

* Fix minimum birth bug in `plot_diagram` (#449)

* Fix bug introduced in 4bc90b2

np.max should have been np.min in plot_diagram for minimum birth and death calculation

* Clean/simplify plot_diagram further to be more ready for extended persistence and better behave under plotly HTML autoscaling

* Make `store_edge_elements` work with MapperPipeline.get_mapper_params and MapperPipeline.set_params, missed out in 4bc90b2

* Remove displayed SegmentLocal from gif hyperlink in classifying_shapes example notebook (#448)

* Remove displayed SegmentLocal from classifying_shapes example notebook

* Fix doc typos

* Pybind11 quick fix (#451)

* Fix pybind11 broken master

pybind11 checkout version v2.5.0, some issues are observed on master

Signed-off-by: julian <julian.burellaperez@heig-vd.ch>

* Fix bug in use of homology_dimension_ix with samplings_, rename homology_dimension_ix (#452)

* Rename homology_dimension_ix to homology_dimension_idx, fix bug

* Add normalization of persistence entropy (#450)

* Add normalization of persistence entropy via `normalize` kwarg in gtda.diagrams.PersistenceEntropy

* Use entropy from scipy.stats in gtda.diagrams.PersistenceEntropy, gtda.time_series.PermutationEntropy and gtda.mapper.Entropy

* Rename _entropy hidden method in gtda.diagrams.PersistenceEntropy

* Add tests for normalize=True

* Add `fill_nan_value` kwarg to gtda.diagrams.PersistenceEntropy

- See #450 (comment)
- Adapt pipeline tests to use this kwarg

Co-authored-by: Umberto Lupo <umberto.lupo@gmail.com>

* Fix y-axis in HeatKernel plots, add default titles to plot methods (#453)

* Fix y-axis in HeatKernel.plot

* Add titles to figures generated via plot_heatmap in plot class methods

* PEP8 E133 improvements

* Refactor gtda/diagrams (#454)

* First attempt at fix of #438

* Fix y-axis in `PersistenceImage` plots, extending #453

* Represent multiplicity of persistence pairs in hovertext in `plot_diagram`

* Change reflect mode of `gaussian_filter` to "constant" from "reflect"

Affects `HeatKernel` and `PersistenceImage`

* Fix `PersistenceLandscape` plot method

- Only the figure corresponding to the first seen homology dimension was returned
- Output is now a figure with subplots, a main title, and one subtitle per plot (homology dimension)

* Improve tests for plot methods in gtda.diagrams.representations

Cover use of `plotly_params`

* Minor docstring linting

* Miscellaneous docstring improvements in gtda/diagrams

* Fix validation dictionary for `metric_params` in the case of `PersistenceImage`

* Change default value of `order` in Amplitude, from 2. to None (vector features)

* Change meaning of default None for `weight_function` in `PersistenceImage`

- ``None`` corresponding the identity means that there really is a non-trivial weighting in that case. Semantically, this does not seem correct ("None" should mean no weighting at all)

* Improve code style and clarity in plot methods in gtda.diagrams.representations

* Refactor gtda/diagrams/_metrics.py to fix several bugs

- Change computation of heat/persistence image distances and amplitudes to yield the continuum limit when `n_bins` tends to infinity.
- Make `sigma` in persistence-image-- and heat-kernel--based representations/distances/amplitudes measured in the same units as the filtration parameter (not in pixels), thus decoupling its effect from the effect of `n_bins`. Also change the default value in PairwiseDistance, Amplitude, HeatKernel and PersistenceImage from 1. to 0.1.
- Remove trivial points from diagrams before creating a sampled image in `heats` and `persistence_images`. This ensures in particular that the trivial points really give no contribution regardless of the weight function used for persistence images.
- Finish fixing #438 and similar issues with PairwiseDistance when the metric is persistence-image--based.
- Ensure `silhouettes` does not create NaNs when a subdiagram is trivial.
- Change default value of `power` in `silhouette_amplitudes`
 and `silhouette_distances` to 1., to agree with Amplitude and PairwiseDistance docs.
- Fix use of np.array_equal in `persistence_image_distances` and `heat_distances` -- so far the boolean check for equal inputs always failed due to the in-place modifications of `diagrams_1`.
- No longer store weights in `effective_metric_params_` attribute of PairwiseDistance and Amplitude when the metric is persistence-image--based.
- Remove _heat from gtda.diagrams._metrics.
- Remove identity from gtda.diagrams._utils and make default `weight_function` kwargs equal to np.ones_like everywhere to agree with default in `PersistenceImage`.
- Other style improvements (variable names, linting)

* Fix trace name when homology dimension is np.inf in `BettiCurve` and `Silhouette`

* Adapt `test_all_pts_the_same` to new behaviour of `heats_` in gtda.diagrams._metrics

* Improve test coverage of `Amplitude` and `PairwiseDistance`

- Make sure silhouettes and persistence images are covered throughout
- Cover `order` parameter throughout

* Add test of zero `weight_function` for `PersistenceImage`

* Make behaviour of `Scaler.fit` when the metric is persistence image the same as `Amplitude`

Accordingly add more combinations of metrics and metric_params in tests for Scaler

* Delete never-used `_matrix_wrapper` and `_arrays_wrapper` functions

* Remove `_pad` from gtda.diagrams._utils as it is never used

* Make `copy=True` in calls to check_diagrams in Scaler.transform and Scaler.inverse_transform

* Make `homology_dimensions_` attributes tuples instead of lists, with integers when possible

* Improve code style

* Hard-code zero array outputs by `heats` and `persistence_images` when step sizes are zero

* Add `homology_dimensions` kwarg to `_bin`

Achieves beautification of self._samplings and self.samplings_ (ints shown instead of floats) in several transformers (and saves some computation)

* Adapt choices of `min_values`, `max_values` and `sigma` in hypothesis-based tests

New meaning of sigma led to overflow issues in existing tests.

* Make all homology dimensions equal in `test_hk_big_sigma`

Also extend this test to `PersistenceImage`, and rename it accordingly

* Cover use of `plotly_params` kwarg in diagram preprocessing classes plot methods

* Extract some common logic from plot methods in gtda.diagrams.representations

* Silence expected warnings from image transformers in test_common

* Implement @wreise's suggestion to abstract away sorting and integer conversion of fit hom dims

* Add tests for `Amplitude` and `PairwiseDistance` to check that a zero weight function yields identically zero amplitudes/distances.

* Refactor `_subdiagrams` to throw informative errors on unfulfilled input properties

* Bump package dependencies to latest available in conda (#457)

* Bump dependencies to latest available in conda (or PyPI if conda unavailable)

* [CI] Bump pip version in .azure-ci/docker_scripts.sh

* Plot infinite deaths in plot_diagrams (#461)

* Plot infinite deaths in plot_diagrams

- Dynamically establish a certain value greater than the maximum observed death as the y-coordinate of infinity
- Plot points with infinite death as having this death value
- Plot a horizontal dashed line to show the position of infinity

* Add first draft of FlagserPersistence notebook

* Fix errors with use of homology_dimensions kwarg in FlagserPersistence

- Implementation was incorrect in cases where minimum in homology_dimensions is greater than 1
- Implementation would crash whenever `flagser_weighted` does not return a list of size self._max_homology_dimension + 1 - self._min_homology_dimension, as can happen with small graphs

* Fix #92

Document removal of one infinite bar in simplicial transformers

* Polish up classifying_shapes notebook

* Remove null cell from vietoris_rips_quickstart

* Upload clique complex filtration images

* Crop image

* Crop and colour images

* New versions of clique complex

* New notebook version

* Test a small version of a png

* Update notebook and pngs

* Remove plotly code

* Add SVGs for directed flag complex illustration

* Correct one SVG

* Add color fill

* Fix color fill

* Reorient one triangle

* Finish up notebook

* Remove some PNGs

* Delete legacy image

* More "correct" name for a PNG

* Reflect name change on notebook

* LaTeX formatting

* Add reference to ripser.py tutorials

* Fix handling of COO matrices by ripser mirroring scikit-tda/ripser.py#104 (#465)

* Update gudhi bindings to latest changes on August 2020 (#468)

* Pull latest changes on Gudhi-devel:giotto

* Add Eigen as a submodule

Eigen is now required for latest Gudhi version

* Add Eigen for gudhi modules that require it

* Update get_filtration and get_skeleton with the new Gudhi API

* Update ccache in CI

* Fix type issue for get_persistence in simplex_tree

* Fix persistence in periodic cubical complex

* Fix simplex tree

* Fix cubical complex

* Beautify setup.py

* Move towards fixing #307 in azure_pipelines.yml (still not addressing docker_scripts.sh)

Signed-off-by: julian <julian.burellaperez@heig-vd.ch>
Co-authored-by: Umberto Lupo <umberto.lupo@gmail.com>

* Change DISTNAME back form "giotto-tda" to 'giotto-tda' in setup.py (#470)

Avoids uploading nightly distributions accidentally to PyPI from CI as happened on 28.08.2020

* Change title as suggested by @lewtun, fix GitHub capitalization in all notebooks

* Unify postprocessing of diagrams between simplicial and cubical, remove one H0 inf bar in cubical

* Fix imports, reinstate reshape

* Fix slicing and tests

* Document removal of infinite bar in 0D

* Improve readability

* Place _replace_infinity_values and _pad_diagram in scope of _postprocess_diagrams

* Forfeit use of n_jobs in _postprocess_diagrams

Favour use of purely in-place operations by a logic replacing _pad_diagram

* Linting

* Remove unused joblib imports from homology/_utils.py

* Ignore SparseRipsPersistence in code cells as suggested by @lewtun

* LaTeXify indices as suggested by @lewtun

* Fix typos

* Fix typo

* Convert pycairo note from inline comments to markdown text

* Implement some of @lewtun's suggestions for classifying shapes notebook

* Migrate general persistence diagram discussion to vr notebook, create "General API" section

In partial fulfillment of @lewtun's review suggestions

* Fix VR notebook and add explanation of homology_dimensions parameter

* Make PyPI nightly uploads more robust to changes in quote types (#471)

* Make PyPI nightly uploads more robust to changes in quote types

Look for both single and double quotes when modifying DISTNAME in setup.py

* Reinstate double quotes in setup.py

* Change sed -i to sed -i.bak in macOS Azure jobs (#472)

* Adds WeakAlphaPersistence (#464)

* Add WeakAlphaPersistence (gtda.homology.simplicial)

* Change relative order of simplicial transformers

* Add tests of correctness of VietorisRipsPersistence and WeakAlphaPersistence (circle PH)

* Adapt to sklearn 0.23 API in test_common

Silences warnings

* Improve docstrings in gtda.homogy (type of X in fit and transform, and type of Xt in plot)

* Avoid dtype search and always make a float array

* Adapt WeakAlphaPersistence to changes

* Some style consistency

* Fix small bug with plot_diagrams

Trivial points were not excluded from computation of plot ranges

* Bump pyflagser version in requirements

* Improve docs for VietorisRipsPersistence by explaining how precomputed input is handled

* Improve writing following @lewtun's comments, fix hyperlinks in preparation for merge

* Add a regression test for FlagserPersistence

* Introduce a `reduced_homology` kwarg to toggle or not removal of infinite H0 bar

* Change default value of nan_fill_value in PersistenceEntropy to -1 (#474)

* Change default value of nan_fill_value in PersistenceEntropy to -1.

* Remove false URL in README (#476)

* Fix read-only error analogous to #427 in Amplitude and PairwiseDistance (#481)

* Improve naming of parallelism tests for PersistenceImage and HeatKernel

* Extend #428 to Amplitude and PairwiseDistance, add regression test, improve edge case handling in _metrics.py

* Cover n_jobs > 1 in test_features_representations.py

* Fix formatting issues in notebooks (#482)

* Add persistent homology of graphs notebook to docs

* Add Vietoris-Rips gif to examples/images and use it instead of Medium links notebooks

* Change single backticks for monospace to double backticks, recommended for Markdown

* Fix a parenthesis in classifying_shapes.ipynb

* Add bindings for collapser, refactor ripser_interface (#469)

* Add bindings for collapser

* Add collapse_edges kwarg to ripser to enable edge collapser

* Add tests for the collapser bindings

* Add tests for ripser with edge collapse

* Add regression test for #465

* Refactor ripser_interface to simplify code, improve lexsort logic, and improve detection of sparse matrix shape

* Disable computation of cocycles by default

Signed-off-by: julian <julian.burellaperez@heig-vd.ch>
Co-authored-by: Umberto Lupo <umberto.lupo@gmail.com>

* Enable edge collapse in VietorisRipsPersistence (#483)

* Enable edge collapse in VietorisRipsPersistence

Also cover coefficient different from 2 in tests of collapser

* Cover collapse_edges kwarg in simple VietorisRipsPersistence tests

* Rename TakensEmbedding to SingleTakensEmbedding, introduce new TakensEmbeding transformer on collections and a function for time-delay embedding parameter optimization, replace "width" parameter with "size" in Labeller and SlidingWindow, fix bug with Labeller.resampler when n_steps_future >= size - 1 (#460)

* Extract time delay embedding logic into a _time_delay_embedding function in the new file gtda.time_series._utils

* Rename TakensEmbedding to SingleTakensEmbedding

* Extract and expose code for finding optimal Takens embedding parameters, as a new takens_embedding_optimal_parameters function

* Simplify logic of resample method of SingleTakensEmbedding and SlidingWindow

* Improve tests for Stationarizer

* Fix bug in Labeller.resample when n_steps_future >= width, add test to cover this scenario

* Allow zero width (meaning 1 point) in SlidingWindow and Labeller

* Add simple test for invalid width in Labeller

* Faster implementation of _slice_windows in SlidingWindow and of SlidingWindow.transform

* Make SlidingWindow.slice_windows public, add test

* Rename "width" parameter of Labeller to "size", addressing #94

* Rename "width" parameter of SlidingWindow to "size", addressing #94 

* Create new TakensEmbedding function to deal with collections of univariate or multivariate time series

* Support list of time series input in gtda.time_series.embedding by means of a new check_collection function in gtda.utils.validation

* Fix typos in glossary.tex

* Remove plot method from SlidingWindow and place it in new TakensEmbedding

* Refactor Lorenz attractor notebook following API changes involving TakensEmbedding and SlidingWindow

* Use check_collection in CubicalPersistence

* TransformerResamplerMixin notebook, make Labeller.transform return 1D arrays (#475)

* Add notebook on TransformerResamplerMixins and time series forecasting with sliding windows + topology

* Change return shape or Labeller.transform to 1D

* Fix Labeller docstring following change in #475

* Modify `PlotterMixin.transform_plot` to give a dictionary to the call to `plot` (#484)

* Fix sample numbers in automatically-generated plot titles when using `transform_plot`

* Add test for Takens to use fit_transform_plot

* Remove papermill version from azure-pipelines

Co-authored-by: wreise <wojciech.reise@epfl.ch>

* removed check_collection from docs

* Fix references in takens and simplicial

* Change todo in weak alpha filtration reference to the section name, remove empty line from release

* Add `mode` kwarg to KNeighborsGraph, refactor input-output formats in gtda.graphs, turn outputs of KNeighborsGraph and TransitionGraph into asymmetric adjacency matrices (#478)

* Add mode kwarg to KNeighborsGraph, allow for list input in gtda.graphs, return lists instead of ragged arrays

* Turn output of TransitionGraph into a directed 0-1 adjacency matrix

* Docstring improvements in gtda.graphs

* Add check_collection again

* Reference ripser [1]

* Move the references before the full stops

* Replace missing glossry entries with TODOs, following Umberto's comment

* Docstring improvements following @gtauzin's review

* Tiny wording fix

* Improve docstring following @gtauzin's comment

* Add missing references

* Flag weak_alpha_complex as missing entry

* Reference in the documentation

* Add a target for testing notebooks in the docs

* Make html does not fail on missing versions

* Capitalize "Euclidean" in glossary.tex

* Remove spaces after references, check that publication names are in italic

* Add check_collection to gtda/utils/__init__.py (#491)

* Add check_collection to gtda/utils/__init__.py

* Improve paths in some tests

* Add option to contract Mapper vertices when pairs are in a subset relation, change layout in interactive mode (#456)

* Add contract_nodes kwarg to Nerve

- When True, igraph.Graph's contract_vertices is employed together with the limit flow by a mapping with fixed points to execute the contraction
- Add a test for this option

* Filter out nodes from same pullback set when checking intersections, to increase performance

* Ensure that setting min_intersection < 1 does not lead to fully connected graphs

* Remove unused variable name in ParallelClustering

* Make nerve parameters interactive in plot_interactive_mapper_graph, refactor get_widgets_per_params, change layouts

* Improve method_to_transform code and docstring

* Small improvements to transformer_from_callable_on_rows

* Linting throughout gtda.mapper

* Remove effective_metric_params_ from Eccentricity

* Remove unused import and sort the remaining ones

* Provide an explicit name for the substituition

* Copy images from examples to notebooks

* Remove the use of imgonverter, change the logo back to svg and add supported image types list

* Name for substitution did not work

* Wording improvements in persistent_homology_graphs.ipynb

* Typo fix in notebook

* Fix variable name error in GraphGeodesicDistance.transform

* Extend Padder and Inverter to greyscale images (#489)

* Extend Padder to greyscale images

* Extend Inverter to greyscale images

Signed-off-by: Guillaume Tauzin <guillaumetauzin.ut@gmail.com>

Co-authored-by: wreise <wojciech.reise@epfl.ch>

* Add notebook for topological time series classification to examples (#458)

* Add notebook for topological time series classification example

* Add Takens embedding GIF to images directory

* Add gravitational wave dataset

Co-authored-by: Umberto Lupo <umberto.lupo@gmail.com>

* Fix links to time series classification notebook by @lewtun

* Move datasets.py and gravitational-wave-signals.npy to new data subfolder, and rename them

* Fix errors in check_point_clouds docstrings

* Try easier paths in See also

Test for @wreise

* Create captions across all notebooks

* Small wording improvement in persistent_homology_graphs.ipynb

* Change the path (sub)package in the index

* Implement double backticks consistently

* Simplify some captions

* Slightly reduce size of plots by plot_point_cloud

* Fix some errors in make_mapper_pipeline docs

* Fix typo in Nerve docs

* Change citation style for consistency

* Remove new lines in array printouts in docs

* Make citation style more consistent throughout

* Reorder notebooks, modularize Makefile

* Reformat the references

* More link and backtick fixes in mapper_quickstart

* More RST fixes in notebooks

* Fix linting

* Fix an error in WeakAlphaPersistence Notes

* Fix URLs for GUDHI

* Fix TakensEmbedding docstrings

* Linting

* Try changing some SVG attributes for better display

* Further SVG improvements

* Update CODE_AUTHORS.rst (#497)

* More Mapper tests, small change to `max_fraction` in FirstSimpleGap and FirstHistogramGap (#412)

* Add first tests for plot_interactive_mapper_graph

* Change deprecated 'overflow_y' to 'overflow' property in_logging, and remove unnecessary warning catching

* Avoid name clashes with Python built-ins throughout mapper/tests

* Slightly change meaning of `max_fraction` in FirstSimpleGap and FirstHistogramGap

Make the default 1. instead of None, and give it a simpler interpretation: (the floor of) max_fraction * n_samples is the maximum number of clusters the algorithm can return

* Implement a looser criterion for the hk_pi_big_sigma test

Co-authored-by: Umberto Lupo <umberto.lupo@gmail.com>

* Fix some warnings seen in tests (#498)

- `continuous_updates` is deprecated from ToggleButton
- `np.stack` should not take generators
- should explicitly pass dtype=object for ragged arrays

* Add the giotto-tda icon in the browser tab (#500) (favicon)

* Add NumberOfPoints (#496)

* Add NumberOfPoints

* Make See alsos more consistent throughout diagrams.features

Sined-off-by: Guillaume Tauzin <guillaumetauzin.ut@gmail.com>
Co-authored-by: wreise <wojciech.reise@epfl.ch>
Co-authored-by: Umberto Lupo <umberto.lupo@gmail.com>

* Fix and improve validate_params (#502)

* Fix validate_params bug and improve behaviour when both numeric and list-like types are allowed

* Add a test

* Fix mistake in validate_params docs

* Pybind11 as a git submodule (#459)

* Add MODULE keyword in CMakeLists.txt to explicitly show expectation

* Update collapser to latest standard in the CMake file

* Add pybind11 as a submodule

* Remove downloading pybind11 from setup.py

Signed-off-by: julian <julian.burellaperez@heig-vd.ch>
Co-authored-by: Umberto Lupo <umberto.lupo@gmail.com>

* Fix metric_params documentation in KNeighborsGraph

* Improve ripser bindings performance (#501)

* Update ripser & collapser bindings to allow pass by reference with numpy

* Update making the distance matrix triangular more memory friendly

* Remove unnecessary dict inherited from ripser.py cython

Signed-off-by: julian <julian.burellaperez@heig-vd.ch>

* Add MNIST image classification example notebook (#477)

* Add MNIST image classification/full blown ML example notebook

Signed-off-by: Guillaume Tauzin <guillaumetauzin.ut@gmail.com>
Co-authored-by: Lewis Tunstall <lewis.c.tunstall@gmail.com>
Co-authored-by: Umberto Lupo <umberto.lupo@gmail.com>

* Add ComplexPolynomial (#479)

* Add ComplexPolynomial

Signed-off-by: Guillaume Tauzin <guillaume.tauzin@inait.ai>
Co-authored-by: Umberto Lupo <umberto.lupo@gmail.com>
Co-authored-by: wreise <wojciech.reise@epfl.ch>

* Fix precomputed behaviour in KNeighborsGraph (#506)

* Fix precomputed behaviour in KNeighborsGraph

- Avoid warnings when metric is passed as 'precomputed'
- Fix errors in docstrings and improve wording
- Improve tests

* Fix #499 by covering case of list input in CubicalPersistence.fit (#503)

* Improve memory use in ripser_interface (#507)

Implement suggestions in #501 (comment)
- Use scipy's squareform function for fast extraction of the upper diagonal part of dm

* Add DensityFiltration (#473)

Signed-off-by: Guillaume Tauzin <guillaumetauzin.ut@gmail.com>
Co-authored-by: Umberto Lupo <umberto.lupo@gmail.com>

* Add metaestimators subpackage with CollectionTransformer meta-estimator (#495)

* Add gtda/metaestimators with CollectionTransformer meta-estimator 

* Fix docstring for make_mapper_pipeline

* Cchange parallel_backend_prefer default to None in ParallelClustering and make_mapper_pipeline

* Cross-reference two time series notebooks with See also

* Improve time series classification notebook by making use of CollectionTransformer

* Temp fix for macOS CI failures by preventing automatic brew cleanup (#513)

Also make mapper tests more lenient to reduce number of flaky tests

Co-authored-by: ulupo <umberto.lupo@googlemail.com>

* Add a curves subpackage and a StandardFeatures transformer to extract features from them (#480)

* Add curves subpackage with StandardFeatures

* Reshape output of PersistenceLandscape so that it's a multi channel curve

* Add curves to main init and to rst docs

* Add Feature extraction subtitle in curves.rst

* Split simplicial homology into undirected and directed subsections

* Fix typo in validate_params

* Raise deadlines for some mapper tests

* Remove notebook tests even in macOS jobs

Signed-off-by: Guillaume Tauzin <guillaume.tauzin@inait.ai>
Co-authored-by: Umberto Lupo <umberto.lupo@gmail.com>

* Add curves.Derivative (#492)

* Add curves.Derivative

* Fix curves.StandardFeatures docs

Signed-off-by: Guillaume Tauzin <guillaume.tauzin@inait.ai>

Co-authored-by: Umberto Lupo <umberto.lupo@gmail.com>
Co-authored-by: wreise <wojciech.reise@epfl.ch>

* Raise test coverage and make ParallelClustering clusterer parameter required (#508)

* Improve test coverage of mapper visualization tools

* Make pandas part of test requirements in setup.py

* Add pandas installation in Linux Azure jobs

* Avoid notebooks checks even in macOS unless notebooks_checks is true

* Make clusterer a required parameter in ParallelClustering, add ParallelClustering tests

* Change kind of error for bad transformers in CollectionTransformer

* Improve check_diagrams and its tests

* Use validate_params in plot_point_cloud

* Cover plotting of diagrams with infinite deaths in simplicial tests

* Prepare for v0.3.0 release (#516)

* Bump version to 0.3.0

* Add release notes

* Minor corrections in notebooks

Co-authored-by: wreise <wojciech.reise@epfl.ch>

Co-authored-by: Guillaume Tauzin <guillaumetauzin.ut@gmail.com>
Co-authored-by: Wojciech Reise <reisewojtus@gmail.com>
Co-authored-by: Wojciech Reise <32167173+wreise@users.noreply.github.com>
Co-authored-by: Umberto <u.lupo@l2f.ch>
Co-authored-by: Nicholas Sale <nicholas.j.sale@gmail.com>
Co-authored-by: MonkeyBreaker <julian.burellaperez@heig-vd.ch>
Co-authored-by: REDS institute <reds-heig@users.noreply.github.com>
Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
Co-authored-by: wreise <wojciech.reise@epfl.ch>
Co-authored-by: ulupo <umberto.lupo@googlemail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

_filter utility function has bad behaviour when death values are negative
2 participants