Change ParallelClustering to output an array of labels per sample, closer to scikit-learn convention #535
gtda/mapper/nerve.py
Outdated
  # Graph construction -- edges with weights given by intersection sizes.
  # In general, we need all information in `nodes` to narrow down the set
- # of combinations to check when `contract_nodes` is True
+ # of combinations to check, especially when `contract_nodes` is True.
  nodes = zip(*zip(*nodes_dict), nodes_dict.values())
This is a bit intense.
This is a great idea @ulupo! I have a few tight deadlines right now, but will be able to look at this on the weekend.
It took me a while to remind myself of the logic and check what you propose. I think I agree though. As you mention, tests will need to be updated.
This looks very elegant! Have you checked that the numerical outputs in, say, our Mapper quickstart match with the new proposal? (Or at least visually checked that things look as expected?)
@wreise @lewtun I have now updated the docstrings and fixed the tests. @lewtun yes I have checked that outputs are the same, but independent checking should be done too. One absolutely insane problem I met was
@lewtun @wreise I'm thinking that this is ready to merge (we currently have issues with the There is another breaking change that I forgot to mention, I have now added it to the PR description as "Another breaking change". Let me know if you agree/disagree.
Looks good to me, thanks! I agree with the breaking change - let's think about it/respond when a request comes in :)
gtda/mapper/nerve.py
Outdated
entry in `X` is a tuple of pairs of the form
``(pullback_set_label, partial_cluster_label)`` where
``partial_cluster_label`` is a cluster label within the pullback
cover set identified by ``pullback_set_label``. Unique such pairs
"Unique such pairs..." does not read too well to me. What about "unique pairs...", "The unique pairs...", or "Nodes in the output graph are the unique pairs from X"?
I changed it to "The unique pairs"
cloned_clusterer.fit(X_sub, sample_weight=sample_weight)
else:
cloned_clusterer.fit(X_sub)
kwargs = self._sample_weight_computer(rel_indices, sample_weight)
This is very clever!
I agree with the new breaking change - LGTM!
Types of changes
Description
The `ParallelClustering` transformer (technically, a metaestimator) is used in Mapper pipelines as a second-to-last step, just before `Nerve`. As things currently stand, the outputs of `ParallelClustering.fit_transform` are in a rather exotic form, namely lists of lists of triples. These outputs perhaps make the graph construction a little simpler, because they neatly present the final Mapper nodes. However, they seem very arcane if one is to use Mapper without the final `Nerve` step, as is made possible by passing `graph_step=False` in `make_mapper_pipeline`.

In this PR I propose that the output of `ParallelClustering` should be more like the output of any clustering algorithm in `scikit-learn`. In particular, it should return an array of the same length as the number of samples in the input, with each entry in the array denoting something closely corresponding to a "cluster label". Of course, for Mapper there is more than one cluster per sample in general, so I propose that the output should be a 1D object-type array where each entry is the tuple of cluster identifiers corresponding to that sample. Since this is Mapper, it makes sense to identify a single cluster via a pair `(pullback_set_label, relative_cluster_label)`. Hence, the final output would be a 1D object array whose entries are tuples of such pairs.

This is a major breaking change to `ParallelClustering` (output), `Nerve` (input), and `make_mapper_pipeline` when `graph_step=False`. Another consequence is that the global node IDs in the final graphs are no longer ordered "lexicographically", i.e. with one global node ID being less than or equal to another if and only if the pullback set label of the first is less than or equal to that of the second.

Another breaking change. The fitted single clusterers are no longer stored in the `clusterers_` attribute of the fitted `ParallelClustering` object. `clusterers_` was initially designed thinking ahead to a time when a `transform` method for new data would be available. However, as we are today as close to that target as we were then, I propose we remove this until the need for it and the design for its use become clearer.

Checklist
`flake8` to check my Python changes.
`pytest` to check this on Python tests.
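
To make the proposed output format from the description concrete, here is a minimal sketch in plain NumPy. It does not call giotto-tda and is not the code in this PR: the toy data and the names `pullback_sets`, `partial_labels`, `labels_per_sample`, `nodes` and `edges` are all hypothetical, chosen only for illustration. It builds a 1D object array with one tuple of `(pullback_set_label, partial_cluster_label)` pairs per sample, then derives nodes as the unique pairs and edges weighted by intersection size, as the `Nerve` docstring describes.

```python
from itertools import combinations

import numpy as np

# Hypothetical toy input: 5 samples covered by 2 pullback cover sets.
# pullback_sets[i] holds the sample indices in pullback set i, and
# partial_labels[i] holds each of those samples' cluster label within set i.
pullback_sets = [np.array([0, 1, 2]), np.array([2, 3, 4])]
partial_labels = [np.array([0, 0, 1]), np.array([0, 0, 1])]
n_samples = 5

# Collect, for every sample, the (pullback_set_label, partial_cluster_label)
# pairs of all clusters that contain it.
pairs_per_sample = [[] for _ in range(n_samples)]
for set_label, (indices, labels) in enumerate(zip(pullback_sets, partial_labels)):
    for sample_idx, cluster_label in zip(indices, labels):
        pairs_per_sample[int(sample_idx)].append((set_label, int(cluster_label)))

# One entry per sample, as in scikit-learn's `labels_`, except that each
# entry is a tuple of pairs, hence the object dtype.
labels_per_sample = np.empty(n_samples, dtype=object)
for sample_idx, pairs in enumerate(pairs_per_sample):
    labels_per_sample[sample_idx] = tuple(pairs)

# Nodes are the unique pairs; each node maps to the set of its samples.
nodes = {}
for sample_idx, pairs in enumerate(labels_per_sample):
    for pair in pairs:
        nodes.setdefault(pair, set()).add(sample_idx)

# Edges join nodes sharing at least one sample, weighted by the size of
# the intersection. Sorting is only for determinism in this sketch.
edges = {}
for u, v in combinations(sorted(nodes), 2):
    weight = len(nodes[u] & nodes[v])
    if weight > 0:
        edges[(u, v)] = weight

print(labels_per_sample)
print(edges)
```

In this toy case sample 2 lies in both pullback sets, so its entry holds two pairs and it produces the only edge. Samples belonging to a single cluster get a one-element tuple, which keeps the array the same length as the input regardless of how much the cover sets overlap.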