Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change ParallelClustering to output an array of labels per sample, closer to scikit-learn convention #535

Merged
merged 23 commits into from
Dec 12, 2020

Conversation

ulupo
Copy link
Collaborator

@ulupo ulupo commented Nov 16, 2020

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Description
The ParallelClustering transformer (technically, a metaestimator) is used in mapper pipelines as a second-to-last step, just before Nerve. As things are currently, the outputs of ParallelClustering.fit_transform are in a rather exotic form, i.e. lists of lists of triples, as in e.g.

[
    [(pullback_set_label_0, relative_cluster_label_00, node_elements_00),
     (pullback_set_label_0, relative_cluster_label_01, node_elements_01), ...],
    [(pullback_set_label_1, relative_cluster_label_10, node_elements_10),
     (pullback_set_label_1, relative_cluster_label_11, node_elements_11), ...],
    ...
]

These outputs make the graph construction a little simpler perhaps because they neatly present the final Mapper nodes. However, they seem to be very arcane if one is to use Mapper without the final Nerve step, as is made possible by passing graph_step=False in make_mapper_pipeline.

In this PR I propose that the output of ParallelClustering should be more like the output of any clustering algorithm in scikit-learn. In particular, it should return an array of the same length as the number of samples in the input, with each entry in the array denoting something closely corresponding to a "cluster label". Of course, for Mapper there is more than one cluster per sample in general, so I propose that the output should be a 1D object-type array where each entry is the tuple of cluster identifiers corresponding to that sample. Since this is Mapper, it makes sense to identify a single cluster via a pair (pullback_set_label, relative_cluster label). Hence, the final 1D object array would look like

array([
    ((pullback_set_label_i, relative_cluster_label_ia), (pullback_set_label_j, relative_cluster_label_jb)),
    ((pullback_set_label_k, relative_cluster_label_kc), (pullback_set_label_l, relative_cluster_label_ld)),
    ...
])

This is a major breaking change to ParallelClustering (output), Nerve (input), and make_mapper_pipeline when graph_step=False. Another consequence is that the global node IDs in the final graphs are no longer ordered "lexicographically" with a global node ID being less than or equal to another if and only if the first pullback set label is less than or equal to the second one.

Another breaking change. The fitted single clusterers are no longer stored in the clusterers_ attribute of the fitted ParallelClustering object. clusterers_ was initially designed thinking ahead to a time when a transform method for new data would be available. However, as we are today as close to that target as we were then, I propose we remove this until the need for it and the design for its use becomes clearer.

Checklist

  • I have read the guidelines for contributing.
  • My code follows the code style of this project. I used flake8 to check my Python changes.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I have added tests to cover my changes.
  • All new and existing tests passed. I used pytest to check this on Python tests.

@ulupo ulupo requested review from lewtun and wreise November 16, 2020 10:53
@ulupo
Copy link
Collaborator Author

ulupo commented Nov 16, 2020

@lewtun @wreise just asking for thoughts on this idea. The documentation is not yet updated and neither are tests, so don't expect the CI to pass yet.


# Graph construction -- edges with weights given by intersection sizes.
# In general, we need all information in `nodes` to narrow down the set
# of combinations to check when `contract_nodes` is True
# of combinations to check, especially when `contract_nodes` is True.
nodes = zip(*zip(*nodes_dict), nodes_dict.values())
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bit intense.

gtda/mapper/nerve.py Outdated Show resolved Hide resolved
@lewtun
Copy link
Collaborator

lewtun commented Nov 17, 2020

this is a great idea @ulupo ! i have a few tight deadlines right now, but will be able to look at this on the weekend

@wreise
Copy link
Collaborator

wreise commented Nov 18, 2020

It took me a while to remind myself the logic and check what you propose. I think i agree though. As you mention, tests will need to be updated.

Copy link
Collaborator

@lewtun lewtun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks very elegant! Have you checked that the numerical outputs in say our Mapper quickstart match with the new proposal? (Or at least visually checked that things look as expected?)

gtda/mapper/nerve.py Outdated Show resolved Hide resolved
gtda/mapper/pipeline.py Outdated Show resolved Hide resolved
@ulupo ulupo marked this pull request as ready for review November 24, 2020 09:32
@ulupo
Copy link
Collaborator Author

ulupo commented Nov 24, 2020

@wreise @lewtun I have now update the docstrings and fixed the tests. @lewtun yes I have checked that outputs are the same, but independent checking should be done too.

One absolutely insane problem I met was joblib related and caused second runs of fit_transform in ParallelClustering with the default backend to be much much slower than the first. It seems to be caused by some obscure reference count problem (because one needs to overwrite the labels_ attribute). This is now patched by https://github.com/ulupo/giotto-tda/blob/439cbc64132a76b5f548977ffd522371a1234576/gtda/mapper/cluster.py#L146 and was also there before, but less noticeable for some reason. It seems to point to a bug in joblib and we should probably open an issue there...

@ulupo
Copy link
Collaborator Author

ulupo commented Dec 4, 2020

@lewtun @wreise I'm thinking that this is ready to merge (we currently have issues with the manylinux CI but they are unrelated to this PR and being solved in parallel). Do you have any objections? I think I addressed all comments in the reviews.

There is another breaking change that I forgot to mention, I have now added it to the PR description as "Another breaking change". Let me know if you agree/disagree.

@ulupo ulupo requested a review from wreise December 7, 2020 09:21
@giotto-ai giotto-ai deleted a comment from azure-pipelines bot Dec 8, 2020
@ulupo ulupo requested a review from lewtun December 8, 2020 21:33
Copy link
Collaborator

@wreise wreise left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good for me, thanks! I agree with the breaking change - let's think about it/respond when a request comes in :)

entry in `X` is a tuple of pairs of the form
``(pullback_set_label, partial_cluster_label)`` where
``partial_cluster_label`` is a cluster label within the pullback
cover set identified by ``pullback_set_label``. Unique such pairs
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Unique such pairs..." does not read too good to me. What about "unique pairs...", "The unique pairs...", or "Nodes in the output graph are the unique pairs from X"

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed it to "The unique pairs"

cloned_clusterer.fit(X_sub, sample_weight=sample_weight)
else:
cloned_clusterer.fit(X_sub)
kwargs = self._sample_weight_computer(rel_indices, sample_weight)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is very clever!

@ulupo
Copy link
Collaborator Author

ulupo commented Dec 11, 2020

@lewtun let me know if you still want to review or are happy with @wreise's pass.

Copy link
Collaborator

@lewtun lewtun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with the new breaking change - LGTM!

gtda/mapper/cluster.py Show resolved Hide resolved
@ulupo ulupo merged commit 1ee697b into giotto-ai:master Dec 12, 2020
@ulupo ulupo deleted the mapper_no_graph_output_refactor branch December 12, 2020 09:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants