Change ParallelClustering to output an array of labels per sample, closer to scikit-learn convention #535

ulupo · 2020-11-16T10:53:01Z

Types of changes

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to change)

Description
The ParallelClustering transformer (technically, a metaestimator) is used in mapper pipelines as a second-to-last step, just before Nerve. As things are currently, the outputs of ParallelClustering.fit_transform are in a rather exotic form, i.e. lists of lists of triples, as in e.g.

[
    [(pullback_set_label_0, relative_cluster_label_00, node_elements_00),
     (pullback_set_label_0, relative_cluster_label_01, node_elements_01), ...],
    [(pullback_set_label_1, relative_cluster_label_10, node_elements_10),
     (pullback_set_label_1, relative_cluster_label_11, node_elements_11), ...],
    ...
]

These outputs perhaps make the graph construction a little simpler, because they neatly present the final Mapper nodes. However, they are rather arcane if one wants to use Mapper without the final Nerve step, as is made possible by passing graph_step=False in make_mapper_pipeline.

In this PR I propose that the output of ParallelClustering should be more like the output of any clustering algorithm in scikit-learn. In particular, it should return an array of the same length as the number of samples in the input, with each entry in the array denoting something closely corresponding to a "cluster label". Of course, for Mapper there is more than one cluster per sample in general, so I propose that the output should be a 1D object-type array where each entry is the tuple of cluster identifiers corresponding to that sample. Since this is Mapper, it makes sense to identify a single cluster via a pair (pullback_set_label, relative_cluster_label). Hence, the final 1D object array would look like

array([
    ((pullback_set_label_i, relative_cluster_label_ia), (pullback_set_label_j, relative_cluster_label_jb)),
    ((pullback_set_label_k, relative_cluster_label_kc), (pullback_set_label_l, relative_cluster_label_ld)),
    ...
])
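To make the proposed change of format concrete, here is a minimal sketch (not the code in this PR) that converts the old list-of-lists-of-triples layout into one tuple of (pullback_set_label, relative_cluster_label) pairs per sample. The function name `triples_to_per_sample_labels` and the toy data are hypothetical, `node_elements` is assumed to hold integer sample indices, and a plain Python list stands in for the 1D object array.

```python
def triples_to_per_sample_labels(old_output, n_samples):
    """Turn the old nested triples into one tuple of pairs per sample.

    Hypothetical helper for illustration only; it assumes each
    `node_elements` entry is an iterable of integer sample indices.
    """
    labels = [[] for _ in range(n_samples)]
    for pullback_set in old_output:
        for pullback_set_label, relative_cluster_label, node_elements in pullback_set:
            for sample_idx in node_elements:
                labels[sample_idx].append(
                    (pullback_set_label, relative_cluster_label)
                )
    # One tuple of (pullback_set_label, relative_cluster_label) pairs per sample
    return [tuple(pairs) for pairs in labels]


# Toy data: two pullback cover sets, four samples
old_output = [
    [(0, 0, [0, 1]), (0, 1, [2])],
    [(1, 0, [1, 2, 3])],
]
per_sample = triples_to_per_sample_labels(old_output, n_samples=4)
```

In this toy case samples 1 and 2 each belong to two clusters, so their entries contain two pairs, while samples 0 and 3 each carry a single pair.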

This is a major breaking change to ParallelClustering (output), Nerve (input), and make_mapper_pipeline when graph_step=False. Another consequence is that the global node IDs in the final graphs are no longer ordered "lexicographically" with a global node ID being less than or equal to another if and only if the first pullback set label is less than or equal to the second one.
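As a rough illustration of what a consumer of the new format has to do, the sketch below regroups per-sample labels into nodes and weights edges by intersection size. It is written under assumptions: the helper names `nodes_from_labels` and `edges_by_intersection` are hypothetical, the real Nerve step may assign node IDs differently, and the brute-force pairwise loop ignores any narrowing of the combinations to check. It also shows why node order need not follow the pullback set labels once nodes are discovered sample by sample.

```python
from itertools import combinations


def nodes_from_labels(per_sample_labels):
    """Map each unique (pullback_set_label, cluster_label) pair to its samples."""
    nodes = {}
    for sample_idx, pairs in enumerate(per_sample_labels):
        for pair in pairs:
            nodes.setdefault(pair, []).append(sample_idx)
    return nodes


def edges_by_intersection(nodes):
    """Connect two nodes when they share samples; weight = number shared."""
    edges = {}
    for (pair_a, elems_a), (pair_b, elems_b) in combinations(nodes.items(), 2):
        weight = len(set(elems_a) & set(elems_b))
        if weight:
            edges[(pair_a, pair_b)] = weight
    return edges


# Toy labels: sample 0 lies only in pullback set 1, so that node is seen first
per_sample_labels = [((1, 0),), ((0, 0), (1, 0)), ((0, 0),)]
nodes = nodes_from_labels(per_sample_labels)
edges = edges_by_intersection(nodes)
```

Here the node with pullback set label 1 is encountered before the one with label 0, so an ID scheme based on order of first appearance would not be lexicographic in the pullback set label.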

Another breaking change. The fitted single clusterers are no longer stored in the clusterers_ attribute of the fitted ParallelClustering object. clusterers_ was initially designed thinking ahead to a time when a transform method for new data would be available. However, as we are today as close to that target as we were then, I propose we remove this until the need for it and the design for its use becomes clearer.

Checklist

I have read the guidelines for contributing.
My code follows the code style of this project. I used flake8 to check my Python changes.
My change requires a change to the documentation.
I have updated the documentation accordingly.
I have added tests to cover my changes.
All new and existing tests passed. I used pytest to check this on Python tests.

ulupo · 2020-11-16T10:53:57Z

@lewtun @wreise just asking for thoughts on this idea. The documentation is not yet updated and neither are tests, so don't expect the CI to pass yet.

ulupo · 2020-11-16T10:58:53Z

gtda/mapper/nerve.py


        # Graph construction -- edges with weights given by intersection sizes.
        # In general, we need all information in `nodes` to narrow down the set
-        # of combinations to check when `contract_nodes` is True
+        # of combinations to check, especially when `contract_nodes` is True.
+        nodes = zip(*zip(*nodes_dict), nodes_dict.values())


This is a bit intense.
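For readers unpacking the nested zip, here is a small self-contained sketch of what the expression evaluates to. It assumes `nodes_dict` maps (pullback_set_label, partial_cluster_label) pairs to some per-node value, which is inferred from context rather than shown in the diff; the toy dictionary is made up.

```python
# Assumed shape: keys are 2-tuples, values are arbitrary per-node data
nodes_dict = {(0, 0): [0, 1], (0, 1): [2], (1, 0): [1, 2, 3]}

# Iterating a dict yields its keys, so zip(*nodes_dict) transposes the keys
# into two tuples: all first components and all second components.
first_labels, second_labels = zip(*nodes_dict)

# Zipping those two tuples with the values gives flat triples
nodes = list(zip(*zip(*nodes_dict), nodes_dict.values()))

# A more explicit spelling of the same flattening
explicit = [(a, b, value) for (a, b), value in nodes_dict.items()]
```

Both spellings produce the same triples. Note that without the `list(...)` call the result is a single-pass zip iterator with no `len`, so any length needed later has to be obtained separately.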

lewtun · 2020-11-17T22:22:42Z

this is a great idea @ulupo ! i have a few tight deadlines right now, but will be able to look at this on the weekend

wreise · 2020-11-18T13:37:37Z

It took me a while to remind myself of the logic and check what you propose. I think I agree though. As you mention, tests will need to be updated.

lewtun

This looks very elegant! Have you checked that the numerical outputs in say our Mapper quickstart match with the new proposal? (Or at least visually checked that things look as expected?)

ulupo · 2020-11-24T09:36:27Z

@wreise @lewtun I have now updated the docstrings and fixed the tests. @lewtun yes I have checked that outputs are the same, but independent checking should be done too.

One absolutely insane problem I met was joblib related and caused second runs of fit_transform in ParallelClustering with the default backend to be much much slower than the first. It seems to be caused by some obscure reference count problem (because one needs to overwrite the labels_ attribute). This is now patched by https://github.com/ulupo/giotto-tda/blob/439cbc64132a76b5f548977ffd522371a1234576/gtda/mapper/cluster.py#L146 and was also there before, but less noticeable for some reason. It seems to point to a bug in joblib and we should probably open an issue there...

ulupo · 2020-12-04T20:47:18Z

@lewtun @wreise I'm thinking that this is ready to merge (we currently have issues with the manylinux CI but they are unrelated to this PR and being solved in parallel). Do you have any objections? I think I addressed all comments in the reviews.

There is another breaking change that I forgot to mention, I have now added it to the PR description as "Another breaking change". Let me know if you agree/disagree.

wreise

Looks good to me, thanks! I agree with the breaking change - let's think about it/respond when a request comes in :)

wreise · 2020-12-11T06:26:52Z

gtda/mapper/nerve.py

+            entry in `X` is a tuple of pairs of the form
+            ``(pullback_set_label, partial_cluster_label)`` where
+            ``partial_cluster_label`` is a cluster label within the pullback
+            cover set identified by ``pullback_set_label``. Unique such pairs

"Unique such pairs..." does not read too well to me. What about "unique pairs...", "The unique pairs...", or "Nodes in the output graph are the unique pairs from X"?

I changed it to "The unique pairs".

wreise · 2020-12-11T10:33:36Z

gtda/mapper/cluster.py

-            cloned_clusterer.fit(X_sub, sample_weight=sample_weight)
-        else:
-            cloned_clusterer.fit(X_sub)
+        kwargs = self._sample_weight_computer(rel_indices, sample_weight)


This is very clever!
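The diff only shows the call site, so the following is a guess at the general pattern rather than the implementation in gtda/mapper/cluster.py: choose once a small function that returns the keyword arguments for `fit`, so the fitting code needs no branching. All names here (`_no_weights`, `_slice_weights`, `ToyClusterer`, `fit_on_subset`) are hypothetical, and the condition used to pick the function is an assumption.

```python
def _no_weights(rel_indices, sample_weight):
    """Used when no sample weights apply: pass nothing extra to fit."""
    return {}


def _slice_weights(rel_indices, sample_weight):
    """Restrict the weights to the samples in the current subset."""
    return {"sample_weight": [sample_weight[i] for i in rel_indices]}


class ToyClusterer:
    """Stand-in estimator that just records what it was fitted with."""

    def fit(self, X, sample_weight=None):
        self.n_seen_ = len(X)
        self.weights_ = sample_weight
        return self


def fit_on_subset(clusterer, X, rel_indices, sample_weight=None):
    # Assumed selection rule: slice weights only if some were provided
    computer = _no_weights if sample_weight is None else _slice_weights
    kwargs = computer(rel_indices, sample_weight)
    X_sub = [X[i] for i in rel_indices]
    # A single call covers both the weighted and the unweighted case
    return clusterer.fit(X_sub, **kwargs)


weighted = fit_on_subset(ToyClusterer(), [10, 20, 30, 40], [1, 3],
                         sample_weight=[0.1, 0.2, 0.3, 0.4])
unweighted = fit_on_subset(ToyClusterer(), [10, 20, 30, 40], [1, 3])
```

The appeal of this design is that the decision about weights is made once, and the per-subset code reduces to one line that computes `kwargs` and one call to `fit`.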

ulupo · 2020-12-11T11:05:47Z

@lewtun let me know if you still want to review or are happy with @wreise's pass.

lewtun

I agree with the new breaking change - LGTM!

ulupo added 4 commits November 5, 2020 09:44

Refactor ParallelClustering to return array of shape the length of th…

c26cdac

…e data

Avoid transpositions in ParallelClustering

715f9cd

Refactor Nerve following change in ParallelClustering output

407947d

Merge branch 'master' into mapper_no_graph_output_refactor

1f384bf

ulupo requested review from lewtun and wreise November 16, 2020 10:53

ulupo commented Nov 16, 2020

wreise reviewed Nov 17, 2020

ulupo added 2 commits November 21, 2020 14:40

Merge branch 'master' into mapper_no_graph_output_refactor

d122968

Simplify structure

a378999

lewtun approved these changes Nov 22, 2020

ulupo added 7 commits November 22, 2020 12:57

Simplify further

03a9f5f

Solve performance problem with joblib and refitting

8111ce0

Improve variable names following @lewtun's review

fb6918d

Fix linting

5f63025

Add explicit n_nodes arg to be able to pass node as zip object

a163bf3

Fix precomputed behaviour and fix/enhance ParallelClustering tests

6c5d828

Update ParallelClustering and Nerve docs to new API

439cbc6

ulupo marked this pull request as ready for review November 24, 2020 09:32

ulupo added 3 commits December 4, 2020 21:09

Linting in __init__

9749dab

Fix docstring typos in Nerve

72ed8df

Improve variable name following @lewtun's comment

d878722

ulupo added 2 commits December 5, 2020 22:47

Merge branch 'master' into mapper_no_graph_output_refactor

5215c4b

Merge branch 'linting' into mapper_no_graph_output_refactor

2048a16

ulupo requested a review from wreise December 7, 2020 09:21

Merge branch 'master' into mapper_no_graph_output_refactor

d35775d

giotto-ai deleted a comment from azure-pipelines bot Dec 8, 2020

ulupo requested a review from lewtun December 8, 2020 21:33

ulupo added 2 commits December 10, 2020 15:31

Merge branch 'master' into mapper_no_graph_output_refactor

d5ff99b

Fix linting

b84cb81

wreise approved these changes Dec 11, 2020

ulupo added 2 commits December 11, 2020 12:01

Fix wording following @wreise's review

24e8ead

Fix "then" -> "the"

43eb7d7

lewtun approved these changes Dec 12, 2020

ulupo merged commit 1ee697b into giotto-ai:master Dec 12, 2020

ulupo deleted the mapper_no_graph_output_refactor branch December 12, 2020 09:23
