
ENH small cleaning and optimizations across the repo #88

Closed
wants to merge 13 commits into from

Conversation

fcharras
Collaborator

@fcharras fcharras commented Feb 6, 2023

  • Replace signatures exposing individual sizes with a signature exposing a single shape argument
  • Where applicable, use a 2D grid of work items rather than a 1D grid + divisions (see the sketch just after this list)
    • note: found this piece of information: https://stackoverflow.com/a/15044884 suggesting that:
      • 2D grids really are better for performance because doing // and % ops in the kernel is expensive
      • it indexes work items in "row-major order", meaning that items in the same row belong to the same sub-group (/warp)
    • even if the previous information applies to CUDA, I assume it also holds for numba-dpex / SYCL
  • some kernels are now written only for a given shape and reused for other dimensions with reshaping tricks
  • also use 2D grids for kmeans kernels
  • better document the use of centroids_private_copies_max_cache_occupancy and improve the heuristic using device.max_compute_units
  • a couple more nitpicks
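
A minimal, plain-Python sketch of the 2D-grid point above (this is not actual numba-dpex kernel code, just an illustration of the index arithmetic, with made-up names and sizes):

n_rows, n_cols = 1024, 64

# 1D grid: each work item has to recover its (row, col) coordinates with // and %,
# which are comparatively expensive operations inside a kernel.
def coords_from_1d_grid(flat_idx):
    return flat_idx // n_cols, flat_idx % n_cols

# 2D grid: the runtime hands both coordinates to the work item directly
# (e.g. through get_global_id(0) and get_global_id(1)), so no division is needed.
def coords_from_2d_grid(row_idx, col_idx):
    return row_idx, col_idx

# Row-major indexing: consecutive flat indices stay in the same row, so work
# items of the same row land in the same sub-group (warp).
assert coords_from_1d_grid(5 * n_cols + 3) == coords_from_2d_grid(5, 3)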

It's WIP because using 2D groups for the sum over axis 0 kernel seems to trigger weird bugs like in IntelPython/numba-dpex#892 in some of the tests.

Out of WIP: the remaining issues were unrelated, and the PR is green.

edit: confirmed affected by IntelPython/numba-dpex#906

@fcharras fcharras changed the title WIP: ENH leftover tasks ENH small cleaning and optimizations across the repo Feb 7, 2023
@fcharras fcharras marked this pull request as ready for review February 7, 2023 13:54
@fcharras fcharras requested review from jjerphan and ogrisel February 7, 2023 13:56
@fcharras
Collaborator Author

fcharras commented Feb 7, 2023

Out of WIP. See the top-level post for the list of changes.

Member

@jjerphan jjerphan left a comment


Thank you for this PR of general improvements, @fcharras. 🙂

Here is a first pass.

n_features, _divide, max_work_group_size, compute_dtype
divide_by_n_samples_kernel = make_apply_elementwise_func(
(n_features,),
_divide_by(compute_dtype(n_samples)),
Member


I do not find this chaining super intuitive. Can you extract it just above the call to this function and give it a name? 🙂
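
For illustration, a minimal sketch of the suggested extraction; the intermediate name is hypothetical and the remaining arguments are assumed to stay as in the diff:

# hypothetical name for the chained expression, extracted just above the call
divide_by_n_samples = _divide_by(compute_dtype(n_samples))

divide_by_n_samples_kernel = make_apply_elementwise_func(
    (n_features,),
    divide_by_n_samples,
    # ...remaining arguments unchanged from the diff above
)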

Comment on lines 238 to 239
math.ceil(math.ceil(n_samples / window_n_candidates) / candidates_window_height)
* candidates_window_height,
Member


Can you give this a name?

Collaborator Author

@fcharras fcharras Feb 8, 2023


I agree, but I don't really know how to make it clear: it's twisted that the shape of the grid is adapted to the sliding window on centroids, yet each work item applies to one single input sample (even if it's the right thing to do).
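
For what it's worth, a minimal sketch of one possible naming (all names and values below are made up; it only restates the expression quoted above):

import math

n_samples, window_n_candidates, candidates_window_height = 10_000, 4, 8

# enough sub-groups to cover all samples, window_n_candidates samples at a time
n_subgroups = math.ceil(n_samples / window_n_candidates)

# rounded up to a whole number of window heights: this is what the nested
# math.ceil expression computes
padded_grid_height = (
    math.ceil(n_subgroups / candidates_window_height) * candidates_window_height
)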

n_subgroups = global_size // sub_group_size
n_subgroups = math.ceil(n_samples / window_n_centroids)
global_size = (
math.ceil(n_subgroups / centroids_window_height) * centroids_window_height,
Member


Similarly, is it possible to give this a name?

Comment on lines 174 to 175
math.ceil(math.ceil(n_samples / window_n_centroids) / centroids_window_height)
* centroids_window_height,
Member


Similarly, is it possible to give this a name?

privatization_idx = (
    sample_idx // sub_group_size
) % n_centroids_private_copies
privatization_idx = sub_group_idx % n_centroids_private_copies
Member


Is it possible to avoid using a modulo, here?

Collaborator Author


I don't think so :-(
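
An illustrative plain-Python sketch of why the modulo seems hard to avoid: consecutive sub-groups are mapped round-robin onto a fixed number of private copies, and the wrap-around is exactly what % expresses (values below are made up):

n_centroids_private_copies = 4

for sub_group_idx in range(10):
    privatization_idx = sub_group_idx % n_centroids_private_copies
    # sub-groups 0..3 map to copies 0..3, sub-group 4 wraps back to copy 0, etc.
    print(sub_group_idx, "->", privatization_idx)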

fcharras and others added 2 commits February 8, 2023 13:58
Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>
# results in some tests.
# TODO: create a minimal reproducer and report to `numba_dpex`. Remove
# the hack once it is fixed.
(_is_cpu or (col_idx < n_cols)) and
Collaborator Author

@fcharras fcharras Feb 8, 2023


There's probably some bug in numba_dpex or with the CPU backend here that causes branching to not be correctly executed in very specific cases, for this very particular condition, on CPU. I had no luck trying to write a minimal reproducer, and I can't see failures on GPU. So I'd be in favor of letting this pass through in the meantime. (Note that this also fixes the case with work_group_size=1 that I added back to the unit tests. (<-- edit: no it doesn't, I ran on the wrong device))

Collaborator Author


Well it's still wrong on the CI. 🤷

@fcharras
Collaborator Author

fcharras commented Feb 9, 2023

Something is wrong with sum(axis=1) on CPU when work_group_size is >= 8. Maybe another bug in the JIT, or something we don't understand about group sizes on CPU (or a bit of both). In either case the investigation looks complicated, and CPU is not the priority target, so the latest commit proposes forcing work_group_size == 8 when work_group_size is set to max.

I'll also run this branch on the dev cloud with the flex170 to see if it shows any performance improvement, and play with the group size to see if it affects performance more than it seems to do on local iGPU.

Collaborator

@ogrisel ogrisel left a comment


Overall, the changes improve the readability a lot (I think). Looking forward to the performance impacts (once the cause of the bug has been found).

I have seen a bunch of fishy things (see some of the comments below). Maybe this can explain the broken tests?

Comment on lines +91 to +92
def make_broadcast_division_1d_2d_axis0_kernel(shape, work_group_size):
    n_rows, n_cols = shape
Collaborator


Suggested change
def make_broadcast_division_1d_2d_axis0_kernel(shape, work_group_size):
    n_rows, n_cols = shape
def make_broadcast_division_1d_2d_axis0_kernel(shape, work_group_size):
    n_rows, n_cols = shape

Here it seems that we assume shape is always 2d (which is fine): maybe the function name should be updated accordingly to drop the "1d".

Collaborator Author


The intent here was to convey that the broadcasted data is 1d. Would 1d_to_2d make it clearer?
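
For reference, a NumPy sketch of what this kernel computes, assuming the 1d data has shape (n_cols,) and is broadcast against the 2d array along axis 0 (shapes and values below are illustrative):

import numpy as np

X = np.arange(12, dtype=np.float64).reshape(3, 4)  # 2d array, shape (n_rows, n_cols)
divisor = np.array([1.0, 2.0, 4.0, 8.0])           # 1d data, shape (n_cols,)

# the 1d divisor is broadcast over axis 0 of the 2d array, hence the
# "1d" / "2d" / "axis0" parts of the kernel name
result = X / divisor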

"""
ops must be a function that will be interpreted as a dpex.func and is subject to
the same rules. It is expected to take two scalar arguments and return one scalar
value. lambda functions are advised against since the cache will not work with lamda
functions. sklearn_numba_dpex.common._utils expose some pre-defined `ops`.
"""
n_rows, n_cols = shape
Collaborator


Same comment for this function name.

work_group_size = n_sub_groups_per_work_group * sub_group_size

_is_cpu = device.has_aspect_cpu
Collaborator


Nit: drop the leading _ in the variable name if you don't expect a name conflict.

Collaborator Author


I'll revert this _is_cpu change; in fact it does not fix the JIT issue. We must wait for the root cause to be addressed in numba_dpex.

def _make_partial_sum_reduction_2d_axis0_kernel(
n_cols, work_group_size, sub_group_size, fused_elementwise_func, dtype
n_cols, work_group_size, sub_group_size, fused_elementwise_func, dtype, _is_cpu
Collaborator


Same comment here.


# The current work item uses the following second coordinate (given by the
# position of the window in the grid of windows, and by the local position of
# the work item in the 2D index):
col_idx = (
    (group_id // n_blocks_per_col) * sub_group_size + local_col_idx
    (dpex.get_group_id(one_idx) * sub_group_size) + local_col_idx
)
Collaborator


black:

Suggested change
)
)

which makes me think that I wanted to open a PR to fix the black problem but completely forgot about it. I will wait for this PR to be finalized first, though.

n_centroids_private_copies = int(min(n_subgroups, n_centroids_private_copies))

# Each set of `sub_group_size` consecutive work items is assigned one private
# copy, and several such set can be assigned to the same private copy. Thus, at
Collaborator


Suggested change
# copy, and several such set can be assigned to the same private copy. Thus, at
# copy, and several such sets can be assigned to the same private copy. Thus, at

@@ -217,7 +222,8 @@ def _initialize_results(results):
@dpex.func
# fmt: off
def _initialize_window_of_centroids(
    local_work_id, # PARAM
    local_row_idx, # PARAM
    local_col_idx, # PARAM
    first_centroid_idx, # PARAM
    centroids_half_l2_norm, # IN
    window_of_centroids_half_l2_norms, # OUT
Collaborator


Making the shapes explicit after IN / OUT would probably help.
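
Something along these lines, for instance (the shapes written below are hypothetical placeholders, not taken from the actual code):

def _initialize_window_of_centroids(
    local_row_idx,                        # PARAM
    local_col_idx,                        # PARAM
    first_centroid_idx,                   # PARAM
    centroids_half_l2_norm,               # IN   (n_clusters,)
    window_of_centroids_half_l2_norms,    # OUT  (window_n_centroids,)
):
    ...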

if item_idx < n_centroid_items:
for copy_idx in range(n_centroids_private_copies):
sum_ += centroids_t_private_copies[copy_idx, item_idx]
centroids_t[item_idx] = sum_
Collaborator


Isn't there an indentation problem here?

Collaborator Author


good catch 😌 TY

sum_ += centroids_t_private_copies[copy_idx, item_idx]
centroids_t[item_idx] = sum_

elif item_idx < n_sums:
Collaborator


This seems redundant with the if item_idx >= n_sums: early return at the beginning of the kernel.

cluster_sizes, # OUT (n_clusters,)
centroids_t, # OUT (n_features, n_clusters)
centroids_t, # OUT (n_features * n_clusters,) (flattened)
Collaborator


Maybe rename to centroids_t_flattened

@fcharras
Collaborator Author

fcharras commented Feb 11, 2023

I've reduced the last failure on the pipeline to what appears to be a JIT issue. The merge will be blocked until it's fixed. It's likely the same JIT issue we've seen before, and this time it gave a much nicer reproducer. See IntelPython/numba-dpex#906.

Member

@jjerphan jjerphan left a comment


LGTM once @ogrisel's comments have been addressed.

)

# XXX: The kernels seem to work fine with work_group_size==1 on GPU but fail on CPU.
if work_group_size == 1:
if math.prod(work_group_shape) == 1:
raise NotImplementedError("work_group_size==1 is not supported.")
Member


Should the message be changed here?

Collaborator Author


In my mind, work_group_size = math.prod(work_group_shape), so I think it's good.

There is this nuance because I have introduced the syntax work_group_size=max, where we try to automatically use the maximum possible value for work_group_size. This change enables checking that max does not result in work_group_size=1.
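
A minimal sketch of that relationship (the shape and values below are illustrative):

import math

work_group_shape = (8, 16)                     # 2D work group: 8 rows x 16 columns
work_group_size = math.prod(work_group_shape)  # 128 work items in total

# the check above therefore rejects a degenerate single-item work group
# regardless of how the shape was obtained, including when work_group_size=max
# happens to resolve to 1
assert work_group_size == 128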

Comment on lines 408 to 415
return (
    work_group_shape,
    (
        (partial_sum_reduction, partial_sum_reduction_nofunc), # kernels
        reduction_block_size,
    ),
    (get_result_shape, get_global_size),
) # shape_update_fn
Member


The last comment is likely off due to black formatting.

What do you think of this suggestion?

Suggested change
return (
    work_group_shape,
    (
        (partial_sum_reduction, partial_sum_reduction_nofunc), # kernels
        reduction_block_size,
    ),
    (get_result_shape, get_global_size),
) # shape_update_fn
kernels = (partial_sum_reduction, partial_sum_reduction_nofunc)
shape_update_fn = (get_result_shape, get_global_size)
return (
    work_group_shape,
    (
        kernels,
        reduction_block_size,
    ),
    shape_update_fn,
)

Comment on lines 622 to 629
return (
    work_group_shape,
    (
        (partial_sum_reduction, partial_sum_reduction_nofunc), # kernels
        reduction_block_size,
    ),
    (get_result_shape, get_global_size),
) # shape_update_fn
Member


The previous suggestion also applies here.

Comment on lines 747 to 751
if (
(local_row_idx < n_active_sub_groups) and
(col_idx < n_cols) and
(work_item_row_idx < sum_axis_size)
(_is_cpu or ((col_idx < n_cols) and
(work_item_row_idx < sum_axis_size)))
):
Member


Can you add an inline comment to give some details regarding this branch?

Collaborator Author


I'll revert this _is_cpu change; in fact it does not fix the JIT issue. We must wait for the root cause to be addressed in numba_dpex.

Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
@fcharras
Collaborator Author

TY for the careful review @ogrisel @jjerphan. The last commit should address your suggestions, and I've answered some of your comments.

@jjerphan
Member

Perfect. 👍

I'll let @ogrisel merge.

@ogrisel
Collaborator

ogrisel commented Feb 17, 2023

The tests are still red. Ok for merge once fixed.

@fcharras
Collaborator Author

fcharras commented Feb 20, 2023

The tests fail because of IntelPython/numba-dpex#906; we can merge as soon as it's fixed and we can bump.

If there's a conflict with other branches before that, we can merge everything except the diff on the 2D sum kernel.

@fcharras fcharras mentioned this pull request Feb 28, 2023
@fcharras
Collaborator Author

Closed via #98 and #96

@fcharras fcharras closed this Feb 28, 2023
@fcharras fcharras deleted the minor_changes_enh branch April 24, 2023 05:16