[jit][cpu] Disable loop-interchange for CPU offloading #405

jjfumero · 2024-05-06T15:13:06Z

Description

This PR disables loop interchange when the code is specialised for multi-core CPUs.

When running on an Intel i9-10885H CPU with the OpenCL CPU Runtime 2024.17.3.0.08_160000 from oneAPI, the time it takes to compute the Blur Filter in the first iteration drops from ~57 seconds to ~7 seconds (a speedup of ~7.98x).

Execution trace in develop: fcdebe5

tornado --threadInfo -cp target/tornadovm-examples-1.0-SNAPSHOT.jar io.github.jjfumero.BlurFilter tornado device=0:2
WARNING: Using incubator modules: jdk.incubator.vector
Task info: blur.red
	Backend           : OPENCL
	Device            : Intel(R) Core(TM) i9-10885H CPU @ 2.40GHz CL_DEVICE_TYPE_CPU (available)
	Dims              : 2
	Global work offset: [0, 0]
	Global work size  : [16, 1]
	Local  work size  : null
	Number of workgroups  : [0, 0]

Task info: blur.green
	Backend           : OPENCL
	Device            : Intel(R) Core(TM) i9-10885H CPU @ 2.40GHz CL_DEVICE_TYPE_CPU (available)
	Dims              : 2
	Global work offset: [0, 0]
	Global work size  : [16, 1]
	Local  work size  : null
	Number of workgroups  : [0, 0]

Task info: blur.blue
	Backend           : OPENCL
	Device            : Intel(R) Core(TM) i9-10885H CPU @ 2.40GHz CL_DEVICE_TYPE_CPU (available)
	Dims              : 2
	Global work offset: [0, 0]
	Global work size  : [16, 1]
	Local  work size  : null
	Number of workgroups  : [0, 0]

TornadoVM Total Time (ns) = 57798473152 -- seconds = 57.79847315200001

Execution trace with this feature:

tornado --threadInfo -cp target/tornadovm-examples-1.0-SNAPSHOT.jar io.github.jjfumero.BlurFilter tornado device=0:2
WARNING: Using incubator modules: jdk.incubator.vector
Task info: blur.red
	Backend           : OPENCL
	Device            : Intel(R) Core(TM) i9-10885H CPU @ 2.40GHz CL_DEVICE_TYPE_CPU (available)
	Dims              : 2
	Global work offset: [0, 0]
	Global work size  : [16, 1]
	Local  work size  : null
	Number of workgroups  : [0, 0]

Task info: blur.green
	Backend           : OPENCL
	Device            : Intel(R) Core(TM) i9-10885H CPU @ 2.40GHz CL_DEVICE_TYPE_CPU (available)
	Dims              : 2
	Global work offset: [0, 0]
	Global work size  : [16, 1]
	Local  work size  : null
	Number of workgroups  : [0, 0]

Task info: blur.blue
	Backend           : OPENCL
	Device            : Intel(R) Core(TM) i9-10885H CPU @ 2.40GHz CL_DEVICE_TYPE_CPU (available)
	Dims              : 2
	Global work offset: [0, 0]
	Global work size  : [16, 1]
	Local  work size  : null
	Number of workgroups  : [0, 0]

TornadoVM Total Time (ns) = 7243037295 -- seconds = 7.243037295000001
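The speedup can be checked directly from the two totals reported in the traces above. The following is only a small arithmetic sketch; the class and method names are hypothetical and are not part of TornadoVM.

```java
final class SpeedupSketch {

    // Ratio of the baseline time to the optimised time; larger is better.
    static double speedup(long baselineNs, long optimizedNs) {
        return (double) baselineNs / (double) optimizedNs;
    }

    public static void main(String[] args) {
        long developNs = 57_798_473_152L; // total time reported on develop (fcdebe5)
        long patchedNs = 7_243_037_295L;  // total time reported with this PR
        System.out.printf("speedup = %.2fx%n", speedup(developNs, patchedNs));
    }
}
```

With these two values the ratio comes out at roughly 7.98.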

Problem description

The TornadoVM JIT compiler specialises the thread block when compiling and deploying the application for multi-core CPUs. By default, the TornadoVM JIT compiler transforms 2D kernels into 1D kernels. The problem is that, for 2D kernels, loop interchange might end up running slower than expected because the inner loop has more work to do per thread. This PR disables loop interchange for this case.
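For readers unfamiliar with the transformation, here is a minimal Java sketch of loop interchange on a 2D loop nest. It is not TornadoVM compiler code: the class name, method names, and array layout are hypothetical and chosen only for illustration. Both orders compute the same result; what changes is which loop ends up outermost and which loop becomes the inner one, and therefore how much work the inner loop carries when only one dimension is spread across CPU threads.

```java
import java.util.Arrays;

final class LoopInterchangeSketch {

    // Original order: outer loop over rows (i), inner loop over columns (j).
    static void scaleOriginal(float[] in, float[] out, int rows, int cols) {
        for (int i = 0; i < rows; i++) {
            for (int j = 0; j < cols; j++) {
                out[i * cols + j] = in[i * cols + j] * 2.0f;
            }
        }
    }

    // Interchanged order: the two loops swap places, the body is unchanged.
    static void scaleInterchanged(float[] in, float[] out, int rows, int cols) {
        for (int j = 0; j < cols; j++) {
            for (int i = 0; i < rows; i++) {
                out[i * cols + j] = in[i * cols + j] * 2.0f;
            }
        }
    }

    public static void main(String[] args) {
        int rows = 4;
        int cols = 6;
        float[] in = new float[rows * cols];
        for (int k = 0; k < in.length; k++) {
            in[k] = k;
        }
        float[] a = new float[in.length];
        float[] b = new float[in.length];
        scaleOriginal(in, a, rows, cols);
        scaleInterchanged(in, b, rows, cols);
        // Both loop orders must produce identical output.
        System.out.println("same result: " + Arrays.equals(a, b));
    }
}
```

In this sketch, if only the outer loop were split across threads, each thread's inner loop would run `cols` iterations in the original order and `rows` iterations in the interchanged order, which is where an unbalanced image shape can hurt.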

Backend/s tested

Mark the backends affected by this PR.

OpenCL
PTX
SPIRV

OS tested

Mark the OS where this PR is tested.

Linux
OSx
Windows

Did you check on FPGAs?

If it is applicable, check your changes on FPGAs.

Yes
No

How to test the new patch?

make
make tests

mairooni · 2024-05-09T08:56:20Z

As far as I understand, the loop interchange is always disabled when running on the CPU. But is this only beneficial when the inner loop has more work than the outer loop? If so, does the performance deteriorate when this is not the case?

jjfumero · 2024-05-09T09:39:42Z

The performance decreases, yes. I think we should only enable this optimization explicitly, on demand. This is because, for CPUs, TornadoVM selects by default a 1D configuration with the number of threads equal to the number of visible CPU cores.
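The default described here, one dimension with as many threads as visible cores, can be pictured as a block partition of the iteration space. The sketch below is an assumption-laden illustration, not TornadoVM's scheduler: the class name, the `range` helper, and the ceiling-division chunking are all hypothetical. It is consistent with the traces above, where the global work size is [16, 1].

```java
final class CpuBlockPartitionSketch {

    // Returns {start, end} (end exclusive) of the iterations assigned to one thread.
    // Hypothetical helper: contiguous chunks sized by ceiling division.
    static int[] range(int threadId, int numThreads, int totalIterations) {
        int chunk = (totalIterations + numThreads - 1) / numThreads;
        int start = Math.min(threadId * chunk, totalIterations);
        int end = Math.min(start + chunk, totalIterations);
        return new int[] {start, end};
    }

    public static void main(String[] args) {
        // One thread per core visible to the JVM.
        int numThreads = Runtime.getRuntime().availableProcessors();
        int totalIterations = 1000;
        for (int t = 0; t < numThreads; t++) {
            int[] r = range(t, numThreads, totalIterations);
            System.out.println("thread " + t + ": [" + r[0] + ", " + r[1] + ")");
        }
    }
}
```

Each thread then runs the remaining loop sequentially over its chunk, so the cost per thread depends on which loop was left as the inner one.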

mairooni · 2024-05-09T09:56:10Z

> The performance decreases, yes. I think we should only enable this optimization explicitly, on demand. This is because, for CPUs, TornadoVM selects by default a 1D configuration with the number of threads equal to the number of visible CPU cores.

In this case, should we add a flag to allow developers to enable it if suitable?

jjfumero · 2024-05-09T09:57:49Z

The flag already exists. In this PR it is -Dtornado.loops.reverse=False, but this has been refactored in other PRs.
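A `-D` option like this one sets a JVM system property. The sketch below shows how such a property can be read in plain Java. Only the property name `tornado.loops.reverse` comes from this PR; the class, the helper, and the default value are assumptions, and TornadoVM's own option handling may differ (the comment above notes it was refactored later).

```java
final class LoopReverseFlagSketch {

    // Property name as given in this PR.
    static final String PROPERTY = "tornado.loops.reverse";

    // Reads the property; falls back to the caller's assumed default when unset.
    // Boolean.parseBoolean ignores case, so "False" and "false" both yield false.
    static boolean isEnabled(boolean assumedDefault) {
        String value = System.getProperty(PROPERTY);
        return value == null ? assumedDefault : Boolean.parseBoolean(value);
    }

    public static void main(String[] args) {
        // Run with: java -Dtornado.loops.reverse=False LoopReverseFlagSketch
        System.out.println(PROPERTY + " = " + isEnabled(true));
    }
}
```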

stratika

LGTM. I think the problem is not obvious when the Intel oneAPI driver is used, at least in my setup. But if it is working on your system (where you did the performance analysis), we can proceed.

Improvements
~~~~~~~~~~~~~~~~~~

- beehive-lab#402: Support for TornadoNativeArrays from FFI buffers.
- beehive-lab#403: Clean-up and refactoring for the code analysis of the loop-interchange.
- beehive-lab#405: Disable Loop-Interchange for CPU offloading.
- beehive-lab#407: Debugging OpenCL Kernels builds improved.
- beehive-lab#410: CPU block scheduler disabled by default and option to switch between different thread-schedulers added.
- beehive-lab#418: TornadoOptions and TornadoLogger improved.
- beehive-lab#423: MxM using ns instead of ms to report performance.
- beehive-lab#425: Vector types for ``Float<Width>`` and ``Int<Width>`` supported.
- beehive-lab#429: Documentation of the installation process updated and improved.
- beehive-lab#432: Support for SPIR-V code generation and dispatcher using the TornadoVM OpenCL runtime.

Compatibility
~~~~~~~~~~~~~~~~~~

- beehive-lab#409: Guidelines to build the documentation.
- beehive-lab#411: Windows installer improved.
- beehive-lab#412: Python installer improved to check download all Python dependencies before the main installer.
- beehive-lab#413: Improved documentation for installing all configurations of backends and OS.
- beehive-lab#424: Use Generic GPU Scheduler for some older NVIDIA Drivers for the OpenCL runtime.
- beehive-lab#430: Improved the installer by checking that the TornadoVM environment is loaded upfront.

Bug Fixes
~~~~~~~~~~~~~~~~~~

- beehive-lab#400: Fix batch computation when the global thread indexes are used to compute the outputs.
- beehive-lab#414: Recover Test-Field unit-tests using Panama types.
- beehive-lab#415: Check style errors fixed.
- beehive-lab#416: FPGA execution with multiple tasks in a task-graph fixed.
- beehive-lab#417: Lazy-copy out fixed for Java fields.
- beehive-lab#420: Fix Mandelbrot example.
- beehive-lab#421: OpenCL 2D thread-scheduler fixed for NVIDIA GPUs.
- beehive-lab#422: Compilation for NVIDIA Jetson Nano fixed.
- beehive-lab#426: Fix Logger for all backends.
- beehive-lab#428: Math cos/sin operations supported for vector types.
- beehive-lab#431: Jenkins files fixed.

[jit][cpu] Disable loop-interchange for CPU offloading

9d802c1

jjfumero added compiler OpenCL labels May 6, 2024

jjfumero requested review from mairooni and stratika May 6, 2024 15:13

jjfumero self-assigned this May 6, 2024

mairooni approved these changes May 9, 2024

jjfumero added 3 commits May 9, 2024 15:39

Merge branch 'develop' into feat/cpu/scheduler

12cd371

Merge with develop

f2968a4

Merge branch 'develop' into feat/cpu/scheduler

4cae3df

stratika approved these changes May 10, 2024

jjfumero merged commit 304b122 into beehive-lab:develop May 10, 2024
2 checks passed

jjfumero deleted the feat/cpu/scheduler branch May 10, 2024 14:07

jjfumero mentioned this pull request May 28, 2024

[release] TornadoVM 1.0.5 #433

Merged
