
ENH small cleaning and optimizations across the repo #88

Closed
wants to merge 13 commits into from

Conversation

fcharras
Collaborator

@fcharras fcharras commented Feb 6, 2023

  • Replace signatures exposing individual sizes with a signature exposing a single shape argument
  • Where applicable, use a 2D grid of work items rather than a 1D grid + divisions (see the sketch just after this list)
    • note: found this piece of information: https://stackoverflow.com/a/15044884 suggesting that:
      • 2D grids really are better for performance because doing // and % ops in the kernel is expensive
      • it indexes work items in "row-major order", meaning that items in the same row belong to the same sub-group (/warp)
    • even if the previous information applies to CUDA, I assume it also holds for numba-dpex / SYCL
  • some kernels are now written only for a given shape and reused for other dimensions with reshaping tricks
  • also use 2D grids for kmeans kernels
  • better document the use of centroids_private_copies_max_cache_occupancy and improve the heuristic using device.max_compute_units
  • a couple more nitpicks
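
A minimal, plain-Python sketch of the 2D-grid point above (this is not actual numba-dpex kernel code, just an illustration of the index arithmetic, with made-up names and sizes):

n_rows, n_cols = 1024, 64

# 1D grid: each work item has to recover its (row, col) coordinates with // and %,
# which are comparatively expensive operations inside a kernel.
def coords_from_1d_grid(flat_idx):
    return flat_idx // n_cols, flat_idx % n_cols

# 2D grid: the runtime hands both coordinates to the work item directly
# (e.g. through get_global_id(0) and get_global_id(1)), so no division is needed.
def coords_from_2d_grid(row_idx, col_idx):
    return row_idx, col_idx

# Row-major indexing: consecutive flat indices stay in the same row, so work
# items of the same row land in the same sub-group (warp).
assert coords_from_1d_grid(5 * n_cols + 3) == coords_from_2d_grid(5, 3)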

It's WIP because using 2D groups for the sum over axis 0 kernel seems to trigger weird bugs like in IntelPython/numba-dpex#892 in some of the tests.

Out of WIP: the remaining issues were unrelated, and the PR is green.

edit: confirmed affected by IntelPython/numba-dpex#906

@fcharras fcharras changed the title WIP: ENH leftover tasks ENH small cleaning and optimizations across the repo Feb 7, 2023
@fcharras fcharras marked this pull request as ready for review February 7, 2023 13:54
@fcharras fcharras requested review from jjerphan and ogrisel February 7, 2023 13:56
@fcharras
Collaborator Author

fcharras commented Feb 7, 2023

Out of WIP. See the top-level post for the list of changes.

Member

@jjerphan jjerphan left a comment


Thank you for this PR of general improvements, @fcharras. 🙂

Here is a first pass.

n_features, _divide, max_work_group_size, compute_dtype
divide_by_n_samples_kernel = make_apply_elementwise_func(
(n_features,),
_divide_by(compute_dtype(n_samples)),
Member


I do not find this chaining super intuitive. Can you extract it just above the call to this function and give it a name? 🙂
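
For illustration, a minimal sketch of the suggested extraction; the intermediate name is hypothetical and the remaining arguments are assumed to stay as in the diff:

# hypothetical name for the chained expression, extracted just above the call
divide_by_n_samples = _divide_by(compute_dtype(n_samples))

divide_by_n_samples_kernel = make_apply_elementwise_func(
    (n_features,),
    divide_by_n_samples,
    # ...remaining arguments unchanged from the diff above
)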

Comment on lines 238 to 239
math.ceil(math.ceil(n_samples / window_n_candidates) / candidates_window_height)
* candidates_window_height,
Member


Can you give this a name?

Collaborator Author

@fcharras fcharras Feb 8, 2023


I agree, but I don't really know how to make it clear: it's twisted that the shape of the grid is adapted to the sliding window on centroids, yet each work item applies to one single input sample (even if it's the right thing to do).
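
For what it's worth, a minimal sketch of one possible naming (all names and values below are made up; it only restates the expression quoted above):

import math

n_samples, window_n_candidates, candidates_window_height = 10_000, 4, 8

# enough sub-groups to cover all samples, window_n_candidates samples at a time
n_subgroups = math.ceil(n_samples / window_n_candidates)

# rounded up to a whole number of window heights: this is what the nested
# math.ceil expression computes
padded_grid_height = (
    math.ceil(n_subgroups / candidates_window_height) * candidates_window_height
)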

n_subgroups = global_size // sub_group_size
n_subgroups = math.ceil(n_samples / window_n_centroids)
global_size = (
math.ceil(n_subgroups / centroids_window_height) * centroids_window_height,
Member


Similarly, is it possible to give this a name?

Comment on lines 174 to 175
math.ceil(math.ceil(n_samples / window_n_centroids) / centroids_window_height)
* centroids_window_height,
Member


Similarly, is it possible to give this a name?

privatization_idx = (
    sample_idx // sub_group_size
) % n_centroids_private_copies
privatization_idx = sub_group_idx % n_centroids_private_copies
Member


Is it possible to avoid using a modulo, here?

Collaborator Author


I don't think so :-(
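
An illustrative plain-Python sketch of why the modulo seems hard to avoid: consecutive sub-groups are mapped round-robin onto a fixed number of private copies, and the wrap-around is exactly what % expresses (values below are made up):

n_centroids_private_copies = 4

for sub_group_idx in range(10):
    privatization_idx = sub_group_idx % n_centroids_private_copies
    # sub-groups 0..3 map to copies 0..3, sub-group 4 wraps back to copy 0, etc.
    print(sub_group_idx, "->", privatization_idx)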

fcharras and others added 2 commits February 8, 2023 13:58
Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>
# results in some tests.
# TODO: create a minimal reproducer and report to `numba_dpex`. Remove
# the hack once it is fixed.
(_is_cpu or (col_idx < n_cols)) and
Collaborator Author

@fcharras fcharras Feb 8, 2023


There's probably some bug in numba_dpex or with the CPU backend here that causes branching to not be correctly executed in very specific cases, for this very particular condition, on CPU. I had no luck trying to write a minimal reproducer, and I can't see failures on GPU. So I'd be in favor of letting this pass through in the meantime. (Note that this also fixes the case with work_group_size=1 that I added back to the unit tests. (<-- edit: no it doesn't, I ran on the wrong device))

Collaborator Author


Well it's still wrong on the CI. 🤷

@fcharras
Collaborator Author

fcharras commented Feb 9, 2023

Something is wrong with sum(axis=1) on CPU when work_group_size is >= 8. Maybe another bug in the JIT, or something we don't understand about group sizes on CPU (or a bit of both). In either case the investigation looks complicated, and CPU is not the priority target, so the latest commit proposes forcing work_group_size == 8 when work_group_size is set to max.

I'll also run this branch on the dev cloud with the flex170 to see if it shows any performance improvement, and play with the group size to see if it affects performance more than it seems to do on local iGPU.

Collaborator

@ogrisel ogrisel left a comment


Overall, the changes improve the readability a lot (I think). Looking forward to the performance impacts (once the cause of the bug has been found).

I have seen a bunch of fishy things (see some of the comments below). Maybe this can explain the broken tests?

Comment on lines +91 to +92
def make_broadcast_division_1d_2d_axis0_kernel(shape, work_group_size):
    n_rows, n_cols = shape
Collaborator


Suggested change
def make_broadcast_division_1d_2d_axis0_kernel(shape, work_group_size):
    n_rows, n_cols = shape
def make_broadcast_division_1d_2d_axis0_kernel(shape, work_group_size):
    n_rows, n_cols = shape

Here it seems that we assume shape is always 2d (which is fine): maybe the function name should be updated accordingly to drop the "1d".

Collaborator Author


The intent here was to convey that the broadcasted data is 1d. Would 1d_to_2d make it clearer?
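
For reference, a NumPy sketch of what this kernel computes, assuming the 1d data has shape (n_cols,) and is broadcast against the 2d array along axis 0 (shapes and values below are illustrative):

import numpy as np

X = np.arange(12, dtype=np.float64).reshape(3, 4)  # 2d array, shape (n_rows, n_cols)
divisor = np.array([1.0, 2.0, 4.0, 8.0])           # 1d data, shape (n_cols,)

# the 1d divisor is broadcast over axis 0 of the 2d array, hence the
# "1d" / "2d" / "axis0" parts of the kernel name
result = X / divisor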

"""
ops must be a function that will be interpreted as a dpex.func and is subject to
the same rules. It is expected to take two scalar arguments and return one scalar
value. lambda functions are advised against since the cache will not work with lamda
functions. sklearn_numba_dpex.common._utils expose some pre-defined `ops`.
"""
n_rows, n_cols = shape
Collaborator


Same comment for this function name.

work_group_size = n_sub_groups_per_work_group * sub_group_size

_is_cpu = device.has_aspect_cpu
Collaborator


Nit: drop the leading _ in the variable name if you don't expect a name conflict.

Collaborator Author


I'll revert this _is_cpu change; in fact it does not fix the JIT issue. We must wait for the root cause to be addressed in numba_dpex.

def _make_partial_sum_reduction_2d_axis0_kernel(
n_cols, work_group_size, sub_group_size, fused_elementwise_func, dtype
n_cols, work_group_size, sub_group_size, fused_elementwise_func, dtype, _is_cpu
Collaborator


Same comment here.


# The current work item uses the following second coordinate (given by the
# position of the window in the grid of windows, and by the local position of
# the work item in the 2D index):
col_idx = (
    (group_id // n_blocks_per_col) * sub_group_size + local_col_idx
    (dpex.get_group_id(one_idx) * sub_group_size) + local_col_idx
)
Collaborator


black:

Suggested change
)
)

which makes me think that I wanted to open a PR to fix the black problem but completely forgot about it. I will wait for this PR to be finalized first, though.

n_centroids_private_copies = int(min(n_subgroups, n_centroids_private_copies))

# Each set of `sub_group_size` consecutive work items is assigned one private
# copy, and several such set can be assigned to the same private copy. Thus, at
Collaborator


Suggested change
# copy, and several such set can be assigned to the same private copy. Thus, at
# copy, and several such sets can be assigned to the same private copy. Thus, at

@@ -217,7 +222,8 @@ def _initialize_results(results):
@dpex.func
# fmt: off
def _initialize_window_of_centroids(
    local_work_id, # PARAM
    local_row_idx, # PARAM
    local_col_idx, # PARAM
    first_centroid_idx, # PARAM
    centroids_half_l2_norm, # IN
    window_of_centroids_half_l2_norms, # OUT
Collaborator


Making the shapes explicit after IN / OUT would probably help.
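
Something along these lines, for instance (the shapes written below are hypothetical placeholders, not taken from the actual code):

def _initialize_window_of_centroids(
    local_row_idx,                        # PARAM
    local_col_idx,                        # PARAM
    first_centroid_idx,                   # PARAM
    centroids_half_l2_norm,               # IN   (n_clusters,)
    window_of_centroids_half_l2_norms,    # OUT  (window_n_centroids,)
):
    ...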

if item_idx < n_centroid_items:
for copy_idx in range(n_centroids_private_copies):
sum_ += centroids_t_private_copies[copy_idx, item_idx]
centroids_t[item_idx] = sum_
Collaborator


Isn't there an indentation problem here?

Collaborator Author


good catch 😌 TY

sum_ += centroids_t_private_copies[copy_idx, item_idx]
centroids_t[item_idx] = sum_

elif item_idx < n_sums:
Collaborator


This seems redundant with the if item_idx >= n_sums: early return at the beginning of the kernel.

cluster_sizes, # OUT (n_clusters,)
centroids_t, # OUT (n_features, n_clusters)
centroids_t, # OUT (n_features * n_clusters,) (flattened)
Collaborator


Maybe rename to centroids_t_flattened

@fcharras
Collaborator Author

fcharras commented Feb 11, 2023

I've reduced the last failure on the pipeline to what appears to be a JIT issue. The merge will be blocked until it's fixed. It's likely the same JIT issue we've seen before, and this time it gave a much nicer reproducer. See IntelPython/numba-dpex#906.

Member

@jjerphan jjerphan left a comment


LGTM once @ogrisel's comments have been addressed.

)

# XXX: The kernels seem to work fine with work_group_size==1 on GPU but fail on CPU.
if work_group_size == 1:
if math.prod(work_group_shape) == 1:
raise NotImplementedError("work_group_size==1 is not supported.")
Member


Should the message be changed here?

Collaborator Author


In my mind, work_group_size = math.prod(work_group_shape), so I think it's good.

There is this nuance because I have introduced the syntax work_group_size=max, where we try to automatically use the maximum possible value for work_group_size. This change enables checking that max does not result in work_group_size=1.
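
A minimal sketch of that relationship (the shape and values below are illustrative):

import math

work_group_shape = (8, 16)                     # 2D work group: 8 rows x 16 columns
work_group_size = math.prod(work_group_shape)  # 128 work items in total

# the check above therefore rejects a degenerate single-item work group
# regardless of how the shape was obtained, including when work_group_size=max
# happens to resolve to 1
assert work_group_size == 128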

Comment on lines 408 to 415
return (
    work_group_shape,
    (
        (partial_sum_reduction, partial_sum_reduction_nofunc), # kernels
        reduction_block_size,
    ),
    (get_result_shape, get_global_size),
) # shape_update_fn
Member


The last comment is likely off due to black formatting.

What do you think of this suggestion?

Suggested change
return (
    work_group_shape,
    (
        (partial_sum_reduction, partial_sum_reduction_nofunc), # kernels
        reduction_block_size,
    ),
    (get_result_shape, get_global_size),
) # shape_update_fn
kernels = (partial_sum_reduction, partial_sum_reduction_nofunc)
shape_update_fn = (get_result_shape, get_global_size)
return (
    work_group_shape,
    (
        kernels,
        reduction_block_size,
    ),
    shape_update_fn,
)

Comment on lines 622 to 629
return (
    work_group_shape,
    (
        (partial_sum_reduction, partial_sum_reduction_nofunc), # kernels
        reduction_block_size,
    ),
    (get_result_shape, get_global_size),
) # shape_update_fn
Member


The previous suggestion also applies here.

Comment on lines 747 to 751
if (
(local_row_idx < n_active_sub_groups) and
(col_idx < n_cols) and
(work_item_row_idx < sum_axis_size)
(_is_cpu or ((col_idx < n_cols) and
(work_item_row_idx < sum_axis_size)))
):
Member


Can you add an inline comment to give some details regarding this branch?

Collaborator Author


I'll revert this _is_cpu change; in fact it does not fix the JIT issue. We must wait for the root cause to be addressed in numba_dpex.

Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
@fcharras
Collaborator Author

TY for the careful review @ogrisel @jjerphan. The last commit should address your suggestions, and I've answered some of your comments.

@jjerphan
Member

Perfect. 👍

I'll let @ogrisel merge.

@ogrisel
Collaborator

ogrisel commented Feb 17, 2023

The tests are still red. Ok for merge once fixed.

@fcharras
Collaborator Author

fcharras commented Feb 20, 2023

The tests fail because of IntelPython/numba-dpex#906; we can merge as soon as it's fixed and we can bump.

If there's a conflict with other branches before that, we can merge everything except the diff on the 2D sum kernel.

@fcharras fcharras mentioned this pull request Feb 28, 2023
@fcharras
Collaborator Author

Closed via #98 and #96

@fcharras fcharras closed this Feb 28, 2023
@fcharras fcharras deleted the minor_changes_enh branch April 24, 2023 05:16