
Bugs/1592 Resolved two bugs in BatchParallel Clustering #1593

Conversation

@mrfh92 (Collaborator) commented Aug 1, 2024

Due Diligence

  • General:
  • Implementation:
    • unit tests: all split configurations tested
    • unit tests: multiple dtypes tested
    • documentation updated where needed

Description

Issue/s resolved: #1592

Changes proposed:

  • The output labels were declared as int32 Heat arrays, but the underlying torch tensors were int64; this caused problems in subsequent computations on the clustering results. The local torch labels are now cast to int32, which is safe because the number of cluster centers will never exceed the int32 range. (A short standalone illustration follows this list.)
  • torch.multinomial, which is used for the K-Means++ initialization on each MPI process, is limited to 2^24 elements (at least on GPU). Therefore, if a process holds more than 2^24 elements to cluster, we now draw a uniform subsample before running K-Means++ (see the second sketch below).
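
A minimal standalone illustration of the first bullet, assuming Heat's usual ht.array / ht.int32 behavior; this is only a demonstration, not the code path touched by the PR:

```python
import torch
import heat as ht

# torch reductions such as argmin return int64 tensors by default ...
local_labels = torch.argmin(torch.rand(8, 3), dim=1)
assert local_labels.dtype == torch.int64

# ... so the process-local result is cast before it is wrapped, keeping the
# underlying torch tensor consistent with the intended int32 dtype of the labels.
labels = ht.array(local_labels.to(torch.int32), is_split=0)
assert labels.dtype == ht.int32
```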

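A minimal sketch of the guard described in the second bullet; the function name `_sample_index_guarded` and the choice of uniform sampling with replacement are illustrative assumptions, not necessarily what the PR implements:

```python
import torch

def _sample_index_guarded(weights: torch.Tensor, max_samples: int = 2**24 - 1) -> torch.Tensor:
    # Illustrative sketch: draw one index proportional to `weights` without
    # exceeding the 2**24-category limit of torch.multinomial on GPU.
    n = weights.shape[0]
    if n <= max_samples:
        return torch.multinomial(weights, 1)
    # Restrict the candidate set to a uniform subsample (with replacement,
    # for simplicity), sample within it, then map back to original indices.
    idx = torch.randint(0, n, (max_samples,), device=weights.device)
    choice = torch.multinomial(weights[idx], 1)
    return idx[choice]
```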
Type of change

Memory requirements

Performance

Does this change modify the behaviour of other functions? If so, which?

yes / no

@mrfh92 added the labels bug (Something isn't working), cluster, and high-level functions (High-level machine-learning algorithms) on Aug 1, 2024
github-actions bot commented Aug 8, 2024

Thank you for the PR!


codecov bot commented Aug 8, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 92.04%. Comparing base (00119f6) to head (3c636cd).
Report is 1 commit behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1593   +/-   ##
=======================================
  Coverage   92.04%   92.04%           
=======================================
  Files          83       83           
  Lines       12110    12113    +3     
=======================================
+ Hits        11147    11150    +3     
  Misses        963      963           
Flag Coverage Δ
unit 92.04% <100.00%> (+<0.01%) ⬆️



@JuanPedroGHM (Member) left a comment


Left some comments, not sure if this is the proper way to handle the bugs.

@@ -289,7 +293,7 @@ def predict(self, x: DNDarray):
         local_labels = _parallel_batched_kmex_predict(
             x.larray, self._cluster_centers.larray, self._p
-        )
+        ).to(torch.int32)
Member
Why not do it the other way? Set the heat array to the proper output type? I get the argument that it is an unlikely number of clusters, but it could theoretically happen.

Collaborator Author

I also thought about this and my arguments for the chosen solution were:

  • int32 saves 50% of the memory compared to int64 during further processing of the clustering output (see the size example after this list)
  • in theory, more cluster centers than int32 can represent are conceivable, but in practice this is completely out of scope: the runtime of our clustering algorithms depends heavily on the number of cluster centers, and the usual reason for clustering is to gain insight into the structure of the data by grouping it into a comparably small number of clusters.
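
For scale: a label vector for 10^9 samples occupies roughly 4 GB as int32 versus 8 GB as int64.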

@@ -19,20 +19,24 @@
 """


-def _initialize_plus_plus(X, n_clusters, p, random_state=None):
+def _initialize_plus_plus(X, n_clusters, p, random_state=None, max_samples=2**24 - 1):
Member
Some unsuspecting user could try to change this value to something higher and then run into the torch limit. Should we hard-code it?

@mrfh92 (Collaborator, Author) commented Aug 12, 2024

Actually, this is already hard-coded, as this is an auxiliary function that is not exposed to the user directly.
The reason for introducing max_samples as a parameter was to retain some flexibility for adapting it in the future.

Collaborator Author

I have added a comment in the function's description.

heat/cluster/batchparallelclustering.py: review thread marked outdated and resolved.
@mrfh92 mrfh92 merged commit 19a6bd9 into main Aug 13, 2024
6 checks passed
@mrfh92 mrfh92 deleted the bugs/1592-_Bug_Two_bugs_in_batch-parallel_clustering_initialization_limited_to_2_24_elements_per_MPI-process_int32/int64-datatype-mismatch_in_labels branch August 13, 2024 07:33