[OCL] CPU Block Scheduler disabled by default and option to switch between thread schedulers #410

jjfumero · 2024-05-08T12:19:46Z

Description

This PR enhances the thread-block scheduler used when running on CPUs. The current default thread scheduler (the block scheduler) assigns one CPU thread per CPU core. This might not be the best strategy, since it depends on the OpenCL runtime/driver implementation. This patch provides a switch between the CPU block scheduler and the fine-grained scheduler, and it makes the fine-grained scheduler the default for CPUs.
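As an illustration of how such a switch could be read on the Java side, here is a minimal sketch. The property name `tornado.scheduler.block` is the one used in the commands in this description; the class and method names are hypothetical and are not taken from the TornadoVM code base.

```java
public class SchedulerSwitch {

    // Hypothetical helper, not TornadoVM's actual implementation.
    // Boolean.parseBoolean ignores case, so "True" and "true" both select the
    // block scheduler. A null value (property not set) yields false, which
    // corresponds to the fine-grained scheduler being the default.
    static boolean useBlockScheduler(String propertyValue) {
        return Boolean.parseBoolean(propertyValue);
    }

    public static void main(String[] args) {
        boolean block = useBlockScheduler(System.getProperty("tornado.scheduler.block"));
        System.out.println(block ? "block scheduler" : "fine-grained scheduler");
    }
}
```

Running this class with `-Dtornado.scheduler.block=True` selects the block scheduler in the sketch; running it without the property selects the fine-grained one.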

Problem description

The main problem is performance. For example, with the block scheduler on the CPU, the application takes ~46 seconds on average to complete with the PoCL OpenCL implementation, while it takes ~5 seconds with Intel oneAPI.

If we use the fine-grained ("iteration") scheduler instead of the CPU block scheduler, TornadoVM with PoCL runs in ~2.9-3.2 seconds per iteration.
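To make the difference between the two strategies concrete, here is a minimal sketch of a block partition. It assumes the block scheduler gives each CPU thread one contiguous chunk of the iteration space, while the fine-grained scheduler launches one work-item per iteration. The chunking formula, the class name and the method name are assumptions for illustration, not TornadoVM's code. The values 20 (threads) and 3888 (iterations) are taken from the global work sizes in the traces in this description.

```java
public class BlockPartition {

    // Returns {start, end} (end exclusive) of the chunk handled by thread `id`
    // when `total` iterations are split into contiguous blocks over `threads` threads.
    static int[] blockRange(int id, int threads, int total) {
        int chunk = (total + threads - 1) / threads; // ceiling division
        int start = Math.min(id * chunk, total);
        int end = Math.min(start + chunk, total);
        return new int[] {start, end};
    }

    public static void main(String[] args) {
        int threads = 20;   // global work size reported with the block scheduler
        int total = 3888;   // first dimension reported with the fine-grained scheduler
        for (int id = 0; id < threads; id++) {
            int[] range = blockRange(id, threads, total);
            System.out.println("thread " + id + " -> [" + range[0] + ", " + range[1] + ")");
        }
    }
}
```

With the block strategy, the OpenCL runtime only sees 20 work-items, so load balancing depends on how it maps them to cores. With the fine-grained strategy, the runtime sees the whole iteration space and can choose its own work-group decomposition.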

Here are traces with PoCL, with and without the block scheduler. The application is taken from the TornadoVM-Examples repository: https://github.com/jjfumero/tornadovm-examples

With the block scheduler:

$ tornado --jvm="-Dtornado.scheduler.block=True" --threadInfo  -cp target/tornadovm-examples-1.0-SNAPSHOT.jar io.github.jjfumero.BlurFilter tornado device=0:3 

WARNING: Using incubator modules: jdk.incubator.vector
Task info: blur.red
	Backend           : OPENCL
	Device            : cpu-haswell-12th Gen Intel(R) Core(TM) i7-12700K CL_DEVICE_TYPE_CPU (available)
	Dims              : 2
	Global work offset: [0, 0]
	Global work size  : [20, 1]
	Local  work size  : null
	Number of workgroups  : [0, 0]

Task info: blur.green
	Backend           : OPENCL
	Device            : cpu-haswell-12th Gen Intel(R) Core(TM) i7-12700K CL_DEVICE_TYPE_CPU (available)
	Dims              : 2
	Global work offset: [0, 0]
	Global work size  : [20, 1]
	Local  work size  : null
	Number of workgroups  : [0, 0]

Task info: blur.blue
	Backend           : OPENCL
	Device            : cpu-haswell-12th Gen Intel(R) Core(TM) i7-12700K CL_DEVICE_TYPE_CPU (available)
	Dims              : 2
	Global work offset: [0, 0]
	Global work size  : [20, 1]
	Local  work size  : null
	Number of workgroups  : [0, 0]

TornadoVM Total Time (ns) = 47729662509 -- seconds = 47.729662509

Using the fine-grained scheduler:

tornado --jvm="-Dtornado.scheduler.block=False" --threadInfo  -cp target/tornadovm-examples-1.0-SNAPSHOT.jar io.github.jjfumero.BlurFilter tornado device=0:3 
WARNING: Using incubator modules: jdk.incubator.vector
Task info: blur.red
	Backend           : OPENCL
	Device            : cpu-haswell-12th Gen Intel(R) Core(TM) i7-12700K CL_DEVICE_TYPE_CPU (available)
	Dims              : 2
	Global work offset: [0, 0]
	Global work size  : [3888, 5184]
	Local  work size  : [54, 64, 1]
	Number of workgroups  : [72, 81]

Task info: blur.green
	Backend           : OPENCL
	Device            : cpu-haswell-12th Gen Intel(R) Core(TM) i7-12700K CL_DEVICE_TYPE_CPU (available)
	Dims              : 2
	Global work offset: [0, 0]
	Global work size  : [3888, 5184]
	Local  work size  : [54, 64, 1]
	Number of workgroups  : [72, 81]

Task info: blur.blue
	Backend           : OPENCL
	Device            : cpu-haswell-12th Gen Intel(R) Core(TM) i7-12700K CL_DEVICE_TYPE_CPU (available)
	Dims              : 2
	Global work offset: [0, 0]
	Global work size  : [3888, 5184]
	Local  work size  : [54, 64, 1]
	Number of workgroups  : [72, 81]

TornadoVM Total Time (ns) = 3513510495 -- seconds = 3.5135104950000002

With PoCL, the fine-grained scheduler is about 13.6x faster than the previous default (block) scheduler in TornadoVM (47.7 s vs. 3.5 s).
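In the fine-grained traces, the reported number of workgroups matches the global work size divided by the local work size in each dimension. A small sketch of that arithmetic follows; the class and method names are hypothetical, and the sketch assumes the global size is an exact multiple of the local size, which holds for the values in these traces.

```java
import java.util.Arrays;

public class WorkgroupCount {

    // Number of workgroups per dimension = global work size / local work size.
    // Only the dimensions present in `global` are used, since the trace reports
    // a 3-element local size for a 2D kernel.
    static long[] workgroups(long[] global, long[] local) {
        long[] groups = new long[global.length];
        for (int i = 0; i < global.length; i++) {
            groups[i] = global[i] / local[i];
        }
        return groups;
    }

    public static void main(String[] args) {
        long[] global = {3888, 5184};
        // PoCL trace: local work size [54, 64, 1]
        System.out.println(Arrays.toString(workgroups(global, new long[] {54, 64, 1})));
        // Intel oneAPI trace: local work size [81, 81, 1]
        System.out.println(Arrays.toString(workgroups(global, new long[] {81, 81, 1})));
    }
}
```

The two lines printed correspond to the `[72, 81]` and `[48, 64]` workgroup counts shown in the PoCL and Intel oneAPI traces, respectively.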

We also see speedups when using the Intel oneAPI OpenCL runtime instead of PoCL:

With Block Scheduler:

tornado --jvm="-Dtornado.scheduler.block=True" --threadInfo  -cp target/tornadovm-examples-1.0-SNAPSHOT.jar io.github.jjfumero.BlurFilter tornado device=0:2 
WARNING: Using incubator modules: jdk.incubator.vector
Task info: blur.red
	Backend           : OPENCL
	Device            : 12th Gen Intel(R) Core(TM) i7-12700K CL_DEVICE_TYPE_CPU (available)
	Dims              : 2
	Global work offset: [0, 0]
	Global work size  : [20, 1]
	Local  work size  : null
	Number of workgroups  : [0, 0]

Task info: blur.green
	Backend           : OPENCL
	Device            : 12th Gen Intel(R) Core(TM) i7-12700K CL_DEVICE_TYPE_CPU (available)
	Dims              : 2
	Global work offset: [0, 0]
	Global work size  : [20, 1]
	Local  work size  : null
	Number of workgroups  : [0, 0]

Task info: blur.blue
	Backend           : OPENCL
	Device            : 12th Gen Intel(R) Core(TM) i7-12700K CL_DEVICE_TYPE_CPU (available)
	Dims              : 2
	Global work offset: [0, 0]
	Global work size  : [20, 1]
	Local  work size  : null
	Number of workgroups  : [0, 0]

TornadoVM Total Time (ns) = 5649483400 -- seconds = 5.6494834

With fine-grained scheduler:

tornado --jvm="-Dtornado.scheduler.block=False" --threadInfo  -cp target/tornadovm-examples-1.0-SNAPSHOT.jar io.github.jjfumero.BlurFilter tornado device=0:2 
WARNING: Using incubator modules: jdk.incubator.vector
Task info: blur.red
	Backend           : OPENCL
	Device            : 12th Gen Intel(R) Core(TM) i7-12700K CL_DEVICE_TYPE_CPU (available)
	Dims              : 2
	Global work offset: [0, 0]
	Global work size  : [3888, 5184]
	Local  work size  : [81, 81, 1]
	Number of workgroups  : [48, 64]

Task info: blur.green
	Backend           : OPENCL
	Device            : 12th Gen Intel(R) Core(TM) i7-12700K CL_DEVICE_TYPE_CPU (available)
	Dims              : 2
	Global work offset: [0, 0]
	Global work size  : [3888, 5184]
	Local  work size  : [81, 81, 1]
	Number of workgroups  : [48, 64]

Task info: blur.blue
	Backend           : OPENCL
	Device            : 12th Gen Intel(R) Core(TM) i7-12700K CL_DEVICE_TYPE_CPU (available)
	Dims              : 2
	Global work offset: [0, 0]
	Global work size  : [3888, 5184]
	Local  work size  : [81, 81, 1]
	Number of workgroups  : [48, 64]

TornadoVM Total Time (ns) = 2347847562 -- seconds = 2.347847562

With the Intel oneAPI runtime, the speedup over the block scheduler is 2.4x (5.65 s vs. 2.35 s).
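The speedups quoted for both runtimes can be recomputed from the total times in the traces. This is a minimal sketch; the class and method names are illustrative, and the nanosecond values are copied from the "TornadoVM Total Time" lines.

```java
public class Speedup {

    // Speedup of the fine-grained scheduler relative to the block scheduler.
    static double speedup(long blockNs, long fineGrainedNs) {
        return (double) blockNs / fineGrainedNs;
    }

    public static void main(String[] args) {
        double pocl = speedup(47_729_662_509L, 3_513_510_495L);
        double intel = speedup(5_649_483_400L, 2_347_847_562L);
        System.out.println("PoCL speedup:         " + pocl);
        System.out.println("Intel oneAPI speedup: " + intel);
    }
}
```

The ratios come out at roughly 13.6x for PoCL and 2.4x for Intel oneAPI, in line with the figures stated for each runtime.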

Performance Gains

The following graph compares, using PoCL, the version in develop (which uses the block-thread scheduler by default) against this branch. The baseline is the block-thread scheduler: a value higher than 1 means the iteration (fine-grained) scheduler is faster.

Interactive graph:

https://docs.google.com/spreadsheets/d/e/2PACX-1vRJiSmP8Hewlbkcm6jyfagSd0u7_X06NF8eiWNCmjpLfLd6np6uA0qO3QIhlIopg8CZ0u1bVdm__XSG/pubchart?oid=1645903604&format=interactive

Speedups range from 5% to 13x on average.

Backend/s tested

Mark the backends affected by this PR.

OpenCL
PTX
SPIRV

OS tested

Mark the OS where this PR is tested.

Linux
OSx
Windows

Did you check on FPGAs?

If it is applicable, check your changes on FPGAs.

Yes
No

How to test the new patch?

make

mikepapadim

LGTM

mairooni

LGTM

stratika

LGTM

Improvements
~~~~~~~~~~~~~~~~~~

- beehive-lab#402: Support for TornadoNativeArrays from FFI buffers.
- beehive-lab#403: Clean-up and refactoring for the code analysis of the loop-interchange.
- beehive-lab#405: Disable Loop-Interchange for CPU offloading.
- beehive-lab#407: Debugging OpenCL Kernels builds improved.
- beehive-lab#410: CPU block scheduler disabled by default and option to switch between different thread-schedulers added.
- beehive-lab#418: TornadoOptions and TornadoLogger improved.
- beehive-lab#423: MxM using ns instead of ms to report performance.
- beehive-lab#425: Vector types for ``Float<Width>`` and ``Int<Width>`` supported.
- beehive-lab#429: Documentation of the installation process updated and improved.
- beehive-lab#432: Support for SPIR-V code generation and dispatcher using the TornadoVM OpenCL runtime.

Compatibility
~~~~~~~~~~~~~~~~~~

- beehive-lab#409: Guidelines to build the documentation.
- beehive-lab#411: Windows installer improved.
- beehive-lab#412: Python installer improved to check download all Python dependencies before the main installer.
- beehive-lab#413: Improved documentation for installing all configurations of backends and OS.
- beehive-lab#424: Use Generic GPU Scheduler for some older NVIDIA Drivers for the OpenCL runtime.
- beehive-lab#430: Improved the installer by checking that the TornadoVM environment is loaded upfront.

Bug Fixes
~~~~~~~~~~~~~~~~~~

- beehive-lab#400: Fix batch computation when the global thread indexes are used to compute the outputs.
- beehive-lab#414: Recover Test-Field unit-tests using Panama types.
- beehive-lab#415: Check style errors fixed.
- beehive-lab#416: FPGA execution with multiple tasks in a task-graph fixed.
- beehive-lab#417: Lazy-copy out fixed for Java fields.
- beehive-lab#420: Fix Mandelbrot example.
- beehive-lab#421: OpenCL 2D thread-scheduler fixed for NVIDIA GPUs.
- beehive-lab#422: Compilation for NVIDIA Jetson Nano fixed.
- beehive-lab#426: Fix Logger for all backends.
- beehive-lab#428: Math cos/sin operations supported for vector types.
- beehive-lab#431: Jenkins files fixed.

jjfumero added 3 commits May 8, 2024 13:54

[JIT][ocl] option to enable/disable block-thread scheduler in the JIT compiler and runtime

13d2f18

Option forceAllGPU removed

6a2d75e

Option USE_CPU_SCHEDULER moved to USE_BLOCK_SCHEDULER

1aea567

jjfumero added compiler OpenCL runtime cpu labels May 8, 2024

jjfumero requested review from mikepapadim, mairooni and stratika May 8, 2024 12:19

jjfumero self-assigned this May 8, 2024

License header updated

355471b

mikepapadim approved these changes May 8, 2024

mairooni approved these changes May 9, 2024

Merge branch 'develop' into feat/cpu/block-scheduler

8465d87

stratika approved these changes May 10, 2024

jjfumero merged commit 65f3e1f into beehive-lab:develop May 10, 2024
2 checks passed

jjfumero deleted the feat/cpu/block-scheduler branch May 20, 2024 10:02

jjfumero mentioned this pull request May 28, 2024

[release] TornadoVM 1.0.5 #433

Merged
