Bugs/1592 Resolved two bugs in BatchParallel Clustering #1593
…ustering_initialization_limited_to_2_24_elements_per_MPI-process_int32/int64-datatype-mismatch_in_labels
Thank you for the PR!
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1593   +/-   ##
=======================================
  Coverage   92.04%   92.04%
=======================================
  Files          83       83
  Lines       12110    12113    +3
=======================================
+ Hits        11147    11150    +3
  Misses        963      963
I left some comments. I am not sure this is the proper way to handle the bugs.
@@ -289,7 +293,7 @@ def predict(self, x: DNDarray):

         local_labels = _parallel_batched_kmex_predict(
             x.larray, self._cluster_centers.larray, self._p
-        )
+        ).to(torch.int32)
Why not do it the other way around and set the Heat array to the proper output type? I get the argument that such a number of clusters is unlikely, but it could theoretically happen.
I also thought about this. My arguments for the chosen solution were:
- int32 uses 50% less memory than int64 during further processing of the clustering result
- in theory, more cluster centers than an int32 label can index are conceivable, but in practice this is completely out of scope: the runtime of our clustering algorithms depends heavily on the number of cluster centers, and the usual reason for clustering is to gain insight into the structure of the data by grouping them into a comparatively small number of clusters.
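The two arguments above can be checked with a little arithmetic. The following is only a sketch in plain Python rather than torch or Heat; the helper `label_memory_bytes` and the data set size are hypothetical and serve purely as illustration.

```python
import struct

# Bytes per label for fixed-width signed integers (standard sizes).
INT32_BYTES = struct.calcsize("<i")  # 4 bytes
INT64_BYTES = struct.calcsize("<q")  # 8 bytes

# Largest cluster index that fits into a signed 32-bit label.
INT32_MAX = 2**31 - 1


def label_memory_bytes(n_samples: int, bytes_per_label: int) -> int:
    """Memory needed to store one label per sample (hypothetical helper)."""
    return n_samples * bytes_per_label


n_samples = 10**9  # assumed data set size, for illustration only
mem32 = label_memory_bytes(n_samples, INT32_BYTES)
mem64 = label_memory_bytes(n_samples, INT64_BYTES)

print(f"int32 labels: {mem32 / 1e9:.1f} GB, int64 labels: {mem64 / 1e9:.1f} GB")
print(f"largest int32 cluster index: {INT32_MAX}")
```

Under these assumptions the int32 labels take half the memory, and the label type only becomes too small once the number of cluster centers exceeds roughly 2.1 billion.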
@@ -19,20 +19,24 @@
 """


-def _initialize_plus_plus(X, n_clusters, p, random_state=None):
+def _initialize_plus_plus(X, n_clusters, p, random_state=None, max_samples=2**24 - 1):
An unsuspecting user could try to raise this value and run into the torch limit. Should we hard-code it?
Actually, this is already hard-coded, since this is an auxiliary function that is not exposed to the user directly.
The reason for introducing max_samples as a variable was to keep some flexibility for adapting it in the future.
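The thread does not show how max_samples is used inside the function. One plausible reading, sketched here in plain Python, is that the initialization works on at most max_samples rows per process and draws a random subset when more are available. The helper name `cap_sample_indices` and its behaviour are assumptions, not the code from this PR.

```python
import random

# Default taken from the new function signature in this PR.
MAX_SAMPLES = 2**24 - 1


def cap_sample_indices(n_rows, max_samples=MAX_SAMPLES, seed=None):
    """Return the row indices to use for initialization (hypothetical helper).

    If n_rows does not exceed max_samples, all rows are used. Otherwise a
    uniform random subset of max_samples distinct rows is drawn.
    """
    if n_rows <= max_samples:
        return list(range(n_rows))
    rng = random.Random(seed)
    return sorted(rng.sample(range(n_rows), max_samples))


# Small numbers keep the example fast; the cap works the same way at 2**24 - 1.
small = cap_sample_indices(3, max_samples=4)
capped = cap_sample_indices(10, max_samples=4, seed=0)
print(small)
print(len(capped))
```

Keeping the cap as a keyword argument with a fixed default matches the intent stated above: callers of the public API never see it, while the value can still be adjusted in one place later.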
I have added a comment in the function's description.
for more information, see https://pre-commit.ci
Due Diligence
Description
Issue/s resolved: #1592
Changes proposed:
Type of change
Memory requirements
Performance
Does this change modify the behaviour of other functions? If so, which?
yes / no