
Request for additional attribute preferred_work_group_size_multiple #886

Closed
fcharras opened this issue Aug 17, 2022 · 10 comments · Fixed by #894

Comments

@fcharras (Contributor)

dpctl.SyclDevice exposes a number of useful attributes for scheduling kernels, but preferred_work_group_size_multiple, an important piece of information used when choosing the kernel local size, seems to be missing.

It is available with clinfo:

> clinfo
...
Device Name                                     Intel(R) UHD Graphics [0x9a60]
...
Device Type                                     GPU
...
Max work group size                             512
Preferred work group size multiple              64
...

It is also exposed in pyopencl.
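A typical use of this value is to round a candidate local size down to a multiple of preferred_work_group_size_multiple, capped by the device's maximum work-group size. A minimal sketch, using the two numbers from the clinfo output above (512 and 64):

```python
def pick_local_size(candidate: int, preferred_multiple: int, max_wg_size: int) -> int:
    """Round `candidate` down to a multiple of `preferred_multiple`,
    clamped to the device's maximum work-group size."""
    capped = min(candidate, max_wg_size)
    rounded = (capped // preferred_multiple) * preferred_multiple
    # Fall back to the capped value when it is smaller than one multiple.
    return rounded if rounded > 0 else capped

# Using the clinfo values above: max work group size 512, preferred multiple 64.
print(pick_local_size(500, 64, 512))   # 448
print(pick_local_size(1000, 64, 512))  # 512
print(pick_local_size(37, 64, 512))    # 37
```

This is only illustrative scheduling arithmetic; the actual attribute names and where they live (device vs. kernel) are exactly what this issue is about.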

@fcharras (Contributor, author)

(global and local cache sizes would be a useful addition too)

@oleksandr-pavlyk (Collaborator)

The preferred_work_group_size_multiple is exposed in SYCL as device-specific information about a kernel; see Table 136 in the SYCL 2020 spec, rev. 5.

Querying it requires a kernel, which can only be extracted from a kernel bundle in the executable state.

@oleksandr-pavlyk (Collaborator)

The Device object could definitely expose sycl::info::device::global_mem_cache_type (which can be none, read_only, or read_write), sycl::info::device::global_mem_cache_line_size, and sycl::info::device::global_mem_cache_size.

There does not seem to be any descriptor for querying a local cache size; only local_mem_size, which is already exposed as the dpctl.SyclDevice.local_mem_size property.

@fcharras Could you please clarify what you mean by "local cache sizes" and perhaps point to it in the clinfo output?

@fcharras (Contributor, author)

> The preferred_work_group_size_multiple is exposed in SYCL as device-specific information about a kernel; see Table 136 in the SYCL 2020 spec, rev. 5.

Is there a way to access this information elsewhere from Python (maybe in SyclKernel or numba_dpex.compiler.Kernel attributes)?

@oleksandr-pavlyk (Collaborator)

> The preferred_work_group_size_multiple is exposed in SYCL as device-specific information about a kernel; see Table 136 in the SYCL 2020 spec, rev. 5.
>
> Is there a way to access this information elsewhere from Python (maybe in SyclKernel or numba_dpex.compiler.Kernel attributes)?

Not yet, but perhaps exposing it for dpctl.SyclKernel is the right thing to do. I will make it happen and tag you in the PR.

@oleksandr-pavlyk (Collaborator)

BTW, notice that in OpenCL the property is also specific to a kernel and a device; see CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE.

clinfo simply queries it (https://github.com/Oblomov/clinfo/blob/master/src/clinfo.c#L1584) for a simple kernel (https://github.com/Oblomov/clinfo/blob/master/src/clinfo.c#L395-L402) amounting to out[i] = in1[i] + in2[i], where out, in1, and in2 are float *.

@fcharras (Contributor, author) commented Aug 25, 2022

Thank you very much.

> Querying it requires a kernel, which can only be extracted from a kernel bundle in the executable state.

If I understand correctly, the information on the preferred work group size is only accessible once the kernel has been compiled (which makes sense, because the compiler could have extra information affecting this value). Then, if it is exposed on dpctl.SyclKernel, the correct order of instructions with numba_dpex would be:

  • explicitly pass a signature to numba_dpex.kernel to trigger compilation immediately rather than delaying it to the first call (I don't know yet if this works, but I think it's a common use case in numba, so it's probably a good practice I've missed)
  • query the preferred_work_group_size_multiple from the kernel it returns
  • use this information to choose the work_group_size accordingly
  • start the compute

edit: that's probably the right workflow considering https://github.com/IntelPython/numba-dpex/blob/12cbcf80f09da38bad23cfc7327266da6a4fc5e1/numba_dpex/decorators.py#L28
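The four steps above can be sketched in plain Python. Note that `CompiledKernel` here is a stand-in for whatever object numba_dpex would return after eager compilation, and the `preferred_work_group_size_multiple` / `max_work_group_size` attribute names are assumptions for illustration, not a confirmed API:

```python
# Hypothetical sketch of the workflow above; `CompiledKernel` stands in for
# the compiled-kernel object, and both attribute names are assumptions.
class CompiledKernel:
    preferred_work_group_size_multiple = 64
    max_work_group_size = 512

def choose_work_group_size(kernel, n_items: int) -> int:
    multiple = kernel.preferred_work_group_size_multiple
    cap = kernel.max_work_group_size
    # Smallest multiple of `multiple` covering n_items, clamped to the cap.
    size = ((n_items + multiple - 1) // multiple) * multiple
    return min(size, cap)

kernel = CompiledKernel()       # step 1: compile eagerly (mocked here)
print(choose_work_group_size(kernel, 200))   # steps 2-3 -> 256
print(choose_work_group_size(kernel, 4096))  # steps 2-3 -> 512
# step 4: launch the kernel with the chosen work_group_size
```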

@fcharras (Contributor, author) commented Aug 25, 2022

> Could you please clarify what you mean by "local cache sizes"

You're right, that doesn't seem to exist in clinfo; sorry for the misdirection. The point then only holds for global_mem_cache_line_size, global_mem_cache_size, and global_mem_cache_type.

FYI, in soda-inria/sklearn-numba-dpex#2 we've been using global_mem_cache_size to adjust the strategy ("privatization of updated arrays") used to avoid collisions on atomic updates in global memory. The idea is to create a temporary write structure that is larger but is expected to be cached well. (We've observed that the strategy is very effective: it almost completely alleviates the bottleneck on atomics.)
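The sizing arithmetic behind such a privatization strategy can be sketched as follows. The function name, the `max_copies` bound, and the example sizes are illustrative assumptions, not taken from sklearn-numba-dpex:

```python
# Sketch of the privatization sizing described above: spread atomic updates
# over n private copies of the target array, choosing n so that the copies
# together still fit in the device's global memory cache.
def n_private_copies(cache_size_bytes: int, array_nbytes: int, max_copies: int = 64) -> int:
    if array_nbytes <= 0:
        raise ValueError("array_nbytes must be positive")
    fit = cache_size_bytes // array_nbytes  # copies that fit in the cache
    return max(1, min(fit, max_copies))

# e.g. a 1 MiB global memory cache and a 16 KiB histogram-like array:
print(n_private_copies(1 << 20, 16 << 10))  # 64
print(n_private_copies(1 << 20, 64 << 10))  # 16
print(n_private_copies(8 << 10, 16 << 10))  # 1 (array larger than the cache)
```

More copies mean fewer threads contending on any one atomic slot, at the cost of a final reduction over the copies; bounding the total footprint by the cache size is what keeps the extra traffic cheap.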

@fcharras (Contributor, author)

> Not yet, but perhaps exposing it for dpctl.SyclKernel is the right thing to do. I will make it happen and tag you in the PR.

Thank you, that would be awesome.

oleksandr-pavlyk added a commit that referenced this issue Aug 26, 2022
These are DPCTLDevice_GetGlobalMemCacheSize, DPCTLDevice_GlobalMemCacheLineSize,
and DPCTLDevice_GetGlobalMemCacheType.

To support the latter, introduced DPCTLGlobalMemCacheType enum in dpctl_sycl_enum_types.h

Tests are added to test_capi target.
@fcharras (Contributor, author)

Thank you for the PR @oleksandr-pavlyk

I've also come across papers (e.g. this one for top-k) where the authors use other device information for the finest-grained optimizations, like bank size or the size of SIMD units. This is also exposed in pyopencl, albeit through a different API, but I don't think it's covered by the SYCL spec at the moment.

oleksandr-pavlyk added a commit that referenced this issue Sep 4, 2022
gh-886: Added 3 new device attributes and kernel's device-specific attributes
ndgrigorian pushed a commit to ndgrigorian/dpctl that referenced this issue Sep 8, 2022