Windows variant of Linux installer without MSys2 #356

Merged
merged 12 commits into from
Mar 26, 2024

Conversation

otabuzzman
Contributor

Description

The PR adds an installer script to simplify installation on Windows. The script is meant to work similarly to the Linux one: it downloads and compiles all repos necessary to build TornadoVM. It requires standard installations of the Windows tools (Visual Studio Community 2022, CMake, Maven, and Python) as well as GraalVM unpacked somewhere in the file system.

The script is stored in bin. The name is tornadovm-installer.cmd. It provides a help option (--help). Further information is in an additional section on Windows installation in the documentation (readthedocs) of TornadoVM.

The script downloads the forked beehive-lab repos of the SPIR-V Toolkit and the LevelZero JNI, and checks out the winstall branch of each. Repo URLs and branch names are hard-coded into the script. Both need to be changed after merging, if you decide to merge.

Repo URLs and branch names have also been hard-coded into the bin/compile script used by the Linux installer. This was done for testing purposes on Linux. The compile script therefore needs the same changes after merging.
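
As a rough illustration of how a clone step with a configurable URL and branch might look, the sketch below builds a `git clone --branch` command line. The function name `clone_command` and its parameters are hypothetical and are not taken from bin/compile; only the repository URL and the winstall branch name come from this PR.

```python
def clone_command(url, branch=None, target=None):
    """Build the argument list for `git clone`, optionally pinned to a branch."""
    cmd = ["git", "clone"]
    if branch:
        # --branch makes git check out the named branch right after cloning
        cmd += ["--branch", branch]
    cmd.append(url)
    if target:
        # optional directory to clone into
        cmd.append(target)
    return cmd


if __name__ == "__main__":
    # URL and branch as mentioned in this PR; everything else is illustrative
    print(clone_command("https://github.com/otabuzzman/levelzero-jni", branch="winstall"))
```

Passing the returned list to `subprocess.run(cmd, check=True)` would perform the clone. Keeping URL and branch as parameters means that switching back to the official repos after merging only touches the call site.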

Problem description

n/a

Backend/s tested

Mark the backends affected by this PR.

  • OpenCL
  • PTX
  • SPIRV

OS tested

Mark the OS where this PR is tested.

  • Linux
  • OSx
  • Windows

The unit tests provided with TornadoVM have been executed on Windows 11, Windows Server 2022 and Amazon Linux 2. Details are in this Google sheet. Some notes after a rough inspection:

  • The test method testBatchNotEven failed on every system for every backend, with the same expected/was values for each failure. This might thus be a fundamental problem.
  • The test methods testTornadoMathSinPIDouble and testTornadoMathCosPIDouble failed on every system for the PTX backend with compile errors. CosPI/SinPI might thus not be implemented at all for PTX.
  • The test method testCopyInWithDevice fails sometimes. This might be due to timing differences and a delta value in assertEqual that is too small.
  • The remaining failed test methods only affected Windows 11 and the SPIR-V backend. These need investigation.

Did you check on FPGAs?

If it is applicable, check your changes on FPGAs.

  • Yes
  • No

How to test the new patch?

On a Windows box:

  • Install Visual Studio Community 2022, CMake, Maven, GraalVM, Python (using respective Windows installer for each)
  • Run the Windows installer script bin\tornadovm-installer.cmd
  • Set up the environment with the command setvars.cmd
  • List devices with command python %TORNADO_SDK%\bin\tornado --devices
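
Before running the installer, it can help to confirm that the prerequisite tools are reachable. The following is a minimal sketch, not part of the PR: the helper `missing_tools` is hypothetical, the executable names (`cmake`, `mvn`, `python`) are assumptions about a typical installation, and the check for `TORNADO_SDK` assumes that this variable is set by setvars.cmd, as the commands above suggest.

```python
import os
import shutil


def missing_tools(tools, which=shutil.which):
    """Return the tools from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    # Assumed executable names; adjust them to the local installation.
    missing = missing_tools(["cmake", "mvn", "python"])
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    # TORNADO_SDK is the variable used in the commands above; it is assumed
    # to be defined once the environment has been set up.
    if "TORNADO_SDK" not in os.environ:
        print("TORNADO_SDK is not set; run setvars.cmd first")
```

The `which` parameter exists only so that the lookup can be replaced in tests; by default it uses `shutil.which` from the standard library.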

@CLAassistant

CLAassistant commented Mar 19, 2024

CLA assistant check
All committers have signed the CLA.

@jjfumero
Member

jjfumero commented Mar 19, 2024

Thank you @otabuzzman. This is awesome! I was planning to do something like this soon, so this is very timely. Give me a few days to check with my Windows PC and try all the instructions step by step.

@otabuzzman
Contributor Author

otabuzzman commented Mar 20, 2024 via email

@jjfumero jjfumero self-assigned this Mar 20, 2024
@jjfumero
Member

I will start with the dependencies and then switch to this main repo.

@jjfumero jjfumero requested a review from stratika March 21, 2024 13:49
@jjfumero
Member

jjfumero commented Mar 21, 2024

I could make it work. However, depending on the backend, I get errors.

OpenCL:

python %TORNADO_SDK%\bin\tornado --threadInfo -m tornado.examples/uk.ac.manchester.tornado.examples.compute.MatrixMultiplication2D

[TornadoVM-OCL-JNI] ERROR : clEnqueueNDRangeKernel[JNI] uk.ac.manchester.tornado.drivers.opencl> notify error:
[JNI] uk.ac.manchester.tornado.drivers.opencl> CL_OUT_OF_RESOURCES error executing CL_COMMAND_NDRANGE_KERNEL on NVIDIA GeForce RTX 3070 (Device 0).

 -> Returned: -5
        Single Threaded CPU Execution: 2.63 GFlops, Total time = 102 ms
        Streams Execution: 16.78 GFlops, Total time = 16 ms
        TornadoVM Execution on GPU (Accelerated): 268.44 GFlops, Total Time = 1 ms
        Speedup: 102.0x
        Verification false

But the same kernel, running with SPIR-V (Level Zero) and CUDA PTX works fine:

python %TORNADO_SDK%\bin\tornado --threadInfo -m tornado.examples/uk.ac.manchester.tornado.examples.compute.MatrixMultiplication2D

Task info: s0.t0
        Backend           : PTX
        Device            : NVIDIA GeForce RTX 3070 GPU
        Dims              : 2
        Thread dimensions : [512, 512]
        Blocks dimensions : [16, 16, 1]
        Grids dimensions  : [32, 32, 1]

        Single Threaded CPU Execution: 2.63 GFlops, Total time = 102 ms
        Streams Execution: 16.78 GFlops, Total time = 16 ms
        TornadoVM Execution on GPU (Accelerated): 268.44 GFlops, Total Time = 1 ms
        Speedup: 102.0x
        Verification true
python %TORNADO_SDK%\bin\tornado --threadInfo -m tornado.examples/uk.ac.manchester.tornado.examples.compute.MatrixMultiplication2D

Task info: s0.t0
        Backend           : SPIRV
        Device            : SPIRV LevelZero - Intel(R) UHD Graphics 770 GPU
        Dims              : 2
        Global work offset: [0, 0]
        Global work size  : [512, 512]
        Local  work size  : [512, 1, 1]
        Number of workgroups  : [1, 512]

        Single Threaded CPU Execution: 2.40 GFlops, Total time = 112 ms
        Streams Execution: 17.90 GFlops, Total time = 15 ms
        TornadoVM Execution on GPU (Accelerated): 22.37 GFlops, Total Time = 12 ms
        Speedup: 9.333333333333334x
        Verification true

It looks to me like a driver issue, but this test passes on Linux and OSx.

OpenCL devices:

python %TORNADO_SDK%\bin\tornado --devices
WARNING: Using incubator modules: jdk.incubator.vector
[TornadoVM-OCL-JNI] ERROR : clGetDeviceIDs -> Returned: 4294967295
[TornadoVM-OCL-JNI] ERROR : clGetDeviceIDs -> Returned: 4294967266
[TornadoVM-OCL-JNI] ERROR : clCreateContext -> Returned: -30

Number of Tornado drivers: 1
Driver: OpenCL
  Total number of OpenCL devices  : 4
  Tornado device=0:0  (DEFAULT)
        OPENCL --  [NVIDIA CUDA] -- NVIDIA GeForce RTX 3070
                Global Memory Size: 8.0 GB
                Local Memory Size: 48.0 KB
                Workgroup Dimensions: 3
                Total Number of Block Threads: [1024]
                Max WorkGroup Configuration: [1024, 1024, 64]
                Device OpenCL C version: OpenCL C 1.2

  Tornado device=0:1
        OPENCL --  [Intel(R) OpenCL Graphics] -- Intel(R) UHD Graphics 770
                Global Memory Size: 12.7 GB
                Local Memory Size: 64.0 KB
                Workgroup Dimensions: 3
                Total Number of Block Threads: [512]
                Max WorkGroup Configuration: [512, 512, 512]
                Device OpenCL C version: OpenCL C 1.2

  Tornado device=0:2
        OPENCL --  [Intel(R) OpenCL] -- 12th Gen Intel(R) Core(TM) i7-12700K
                Global Memory Size: 31.7 GB
                Local Memory Size: 32.0 KB
                Workgroup Dimensions: 3
                Total Number of Block Threads: [8192]
                Max WorkGroup Configuration: [8192, 8192, 8192]
                Device OpenCL C version: OpenCL C 3.0

  Tornado device=0:3
        OPENCL --  [Intel(R) FPGA Emulation Platform for OpenCL(TM)] -- Intel(R) FPGA Emulation Device
                Global Memory Size: 31.7 GB
                Local Memory Size: 256.0 KB
                Workgroup Dimensions: 3
                Total Number of Block Threads: [67108864]
                Max WorkGroup Configuration: [67108864, 67108864, 67108864]
                Device OpenCL C version: OpenCL C 1.2


[TornadoVM-OCL-JNI] ERROR : clReleaseContext -> Returned: -34

The errors seem to be related to the FPGA, which we need to access in emulation mode.
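
The large values in the log above (4294967295 and 4294967266) look like signed 32-bit OpenCL status codes printed as unsigned numbers. The sketch below reinterprets them and maps them to names; that these values are unsigned renderings of `cl_int` is an assumption on my part, while the numeric codes in the table are the ones defined in the OpenCL headers. The helper names are hypothetical.

```python
# Subset of OpenCL status codes (values as defined in CL/cl.h).
CL_ERRORS = {
    0: "CL_SUCCESS",
    -1: "CL_DEVICE_NOT_FOUND",
    -5: "CL_OUT_OF_RESOURCES",
    -30: "CL_INVALID_VALUE",
    -34: "CL_INVALID_CONTEXT",
}


def to_signed32(value):
    """Reinterpret an unsigned 32-bit value as a signed 32-bit integer."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value >= 0x80000000 else value


def describe(code):
    """Return (signed code, symbolic name) for a raw status value."""
    signed = to_signed32(code)
    return signed, CL_ERRORS.get(signed, "unknown")


if __name__ == "__main__":
    # Raw values as they appear in the logs of this conversation
    for raw in (4294967295, 4294967266, -30, -34, -5):
        print(raw, "->", describe(raw))
```

Under that assumption, the two clGetDeviceIDs values correspond to -1 (CL_DEVICE_NOT_FOUND) and -30 (CL_INVALID_VALUE), and the -34 from clReleaseContext is CL_INVALID_CONTEXT.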

@@ -144,7 +144,7 @@ def build_levelzero_jni_lib(rebuild=False):
 [
 "git",
 "clone",
-"https://github.com/beehive-lab/levelzero-jni",
+"https://github.com/otabuzzman/levelzero-jni#winstall",
Member

Just to keep a note: We should merge first the dependencies and then update this URL to the official repos.

Collaborator

Once we merge the develop of levelzero-jni to master, we can revert this link.

@@ -184,7 +184,7 @@ def build_spirv_toolkit_and_level_zero(rebuild=False):
 [
 "git",
 "clone",
-"https://github.com/beehive-lab/beehive-spirv-toolkit.git",
+"https://github.com/otabuzzman/beehive-spirv-toolkit.git#winstall",
Member

Same here

@jjfumero
Member

jjfumero commented Mar 21, 2024

Strange, with OpenCL on my setup, nothing works. It looks to me like a problem with my configuration:

python %TORNADO_SDK%\bin\tornado --devices
WARNING: Using incubator modules: jdk.incubator.vector
[TornadoVM-OCL-JNI] ERROR : clGetDeviceIDs -> Returned: 4294967295
[TornadoVM-OCL-JNI] ERROR : clGetDeviceIDs -> Returned: 4294967266
[TornadoVM-OCL-JNI] ERROR : clCreateContext -> Returned: -30

Number of Tornado drivers: 1
Driver: OpenCL
  Total number of OpenCL devices  : 4
  Tornado device=0:0  (DEFAULT)
        OPENCL --  [NVIDIA CUDA] -- NVIDIA GeForce RTX 3070
                Global Memory Size: 8.0 GB
                Local Memory Size: 48.0 KB
                Workgroup Dimensions: 3
                Total Number of Block Threads: [1024]
                Max WorkGroup Configuration: [1024, 1024, 64]
                Device OpenCL C version: OpenCL C 1.2

  Tornado device=0:1
        OPENCL --  [Intel(R) OpenCL Graphics] -- Intel(R) UHD Graphics 770
                Global Memory Size: 12.7 GB
                Local Memory Size: 64.0 KB
                Workgroup Dimensions: 3
                Total Number of Block Threads: [512]
                Max WorkGroup Configuration: [512, 512, 512]
                Device OpenCL C version: OpenCL C 1.2

  Tornado device=0:2
        OPENCL --  [Intel(R) OpenCL] -- 12th Gen Intel(R) Core(TM) i7-12700K
                Global Memory Size: 31.7 GB
                Local Memory Size: 32.0 KB
                Workgroup Dimensions: 3
                Total Number of Block Threads: [8192]
                Max WorkGroup Configuration: [8192, 8192, 8192]
                Device OpenCL C version: OpenCL C 3.0

  Tornado device=0:3
        OPENCL --  [Intel(R) FPGA Emulation Platform for OpenCL(TM)] -- Intel(R) FPGA Emulation Device
                Global Memory Size: 31.7 GB
                Local Memory Size: 256.0 KB
                Workgroup Dimensions: 3
                Total Number of Block Threads: [67108864]
                Max WorkGroup Configuration: [67108864, 67108864, 67108864]
                Device OpenCL C version: OpenCL C 1.2


[TornadoVM-OCL-JNI] ERROR : clReleaseContext -> Returned: -34

C:\Users\jjfum\source\repos\TornadoVM>python %TORNADO_SDK%\bin\tornado-test
python C:/Users/jjfum/source/repos/TornadoVM/bin/sdk/bin/tornado  --jvm "-Xmx6g -Dtornado.recover.bailout=False -Dtornado.unittests.verbose=False "  -m  tornado.unittests/uk.ac.manchester.tornado.unittests.tools.TornadoTestRunner  --params "uk.ac.manchester.tornado.unittests.foundation.TestIntegers"
WARNING: Using incubator modules: jdk.incubator.vector

[TornadoVM-OCL-JNI] ERROR : clGetDeviceIDs -> Returned: 4294967295
[TornadoVM-OCL-JNI] ERROR : clGetDeviceIDs -> Returned: 4294967266
[TornadoVM-OCL-JNI] ERROR : clCreateContext -> Returned: -30
#pragma OPENCL EXTENSION cl_khr_fp64 : enable
#pragma OPENCL EXTENSION cl_khr_fp16 : enable

@otabuzzman
Contributor Author

Strange behavior, indeed. Which oneAPI components are installed in your setup? In mine there is only the Intel® CPU Runtime for OpenCL™ Applications with SYCL support. To make it work, I had to apply the steps given on the webpage in the section Known Issues and Limitations.

What is that FPGA emulator? Can you switch it off?

@jjfumero
Member

In my case I installed the oneAPI Base Toolkit, which includes the FPGA emulation and other tools. I have also installed the Intel ARC GPU drivers, since from time to time I swap my 3070 for the ARC 750 for experiments, and this might be causing the problem.
The thing is:

  • Using Msys64 tool on Windows runs fine with OpenCL
  • Native Windows (using VS Tools) runs fine with PTX and SPIR-V. PTX runs on the same NVIDIA 3070 that "fails" with OpenCL.

I will dig in to investigate the problem, but good to know it works for you. I will also work with Thanos to try to reproduce this on a different machine.

@jjfumero
Member

jjfumero commented Mar 22, 2024

Update:

  • I updated the NVIDIA driver and removed an old installation of the oneAPI Toolkit (I had two, 2022 and 2024.0.1), and the unit tests are now passing with the OpenCL backend on the RTX 3070. Not all of them, though, and the Matrix Multiplication benchmark still fails.
> python %TORNADO_SDK%\bin\tornado --devices

Number of Tornado drivers: 1
Driver: OpenCL
  Total number of OpenCL devices  : 4
  Tornado device=0:0  (DEFAULT)
        OPENCL --  [NVIDIA CUDA] -- NVIDIA GeForce RTX 3070
                Global Memory Size: 8.0 GB
                Local Memory Size: 48.0 KB
                Workgroup Dimensions: 3
                Total Number of Block Threads: [1024]
                Max WorkGroup Configuration: [1024, 1024, 64]
                Device OpenCL C version: OpenCL C 1.2

  Tornado device=0:1
        OPENCL --  [Intel(R) OpenCL Graphics] -- Intel(R) UHD Graphics 770
                Global Memory Size: 12.7 GB
                Local Memory Size: 64.0 KB
                Workgroup Dimensions: 3
                Total Number of Block Threads: [512]
                Max WorkGroup Configuration: [512, 512, 512]
                Device OpenCL C version: OpenCL C 1.2

  Tornado device=0:2
        OPENCL --  [Intel(R) OpenCL] -- 12th Gen Intel(R) Core(TM) i7-12700K
                Global Memory Size: 31.7 GB
                Local Memory Size: 32.0 KB
                Workgroup Dimensions: 3
                Total Number of Block Threads: [8192]
                Max WorkGroup Configuration: [8192, 8192, 8192]
                Device OpenCL C version: OpenCL C 3.0

  Tornado device=0:3
        OPENCL --  [Intel(R) FPGA Emulation Platform for OpenCL(TM)] -- Intel(R) FPGA Emulation Device
                Global Memory Size: 31.7 GB
                Local Memory Size: 256.0 KB
                Workgroup Dimensions: 3
                Total Number of Block Threads: [67108864]
                Max WorkGroup Configuration: [67108864, 67108864, 67108864]
                Device OpenCL C version: OpenCL C 1.2


> python %TORNADO_SDK%\bin\tornado-test -V

Test: class uk.ac.manchester.tornado.unittests.foundation.TestIntegers
        Running test: test01                     ................  [PASS]
        Running test: test03                     ................  [PASS]
        Running test: test04                     ................  [PASS]
        Running test: test05                     ................  [PASS]
        Running test: test06                     ................  [PASS]
        Running test: test07                     ................  [PASS]
        Running test: test02                     ................  [PASS]

Test: class uk.ac.manchester.tornado.unittests.foundation.TestFloats
        Running test: testFloatsCopy             ................  [PASS]
        Running test: testVectorFloatMul         ................  [PASS]
        Running test: testVectorFloatDiv         ................  [PASS]
        Running test: testVectorFloatAdd         ................  [PASS]
        Running test: testVectorFloatSub         ................  [PASS]

Test: class uk.ac.manchester.tornado.unittests.foundation.TestDoubles
        Running test: testDoublesMul             ................  [PASS]
        Running test: testDoublesCopy            ................  [PASS]
        Running test: testDoublesAdd             ................  [PASS]
        Running test: testDoublesDiv             ................  [PASS]
        Running test: testDoublesSub             ................  [PASS]

        ...
Test: class uk.ac.manchester.tornado.unittests.compute.ComputeTests
        Running test: testNBodyBigNoWorker       ................  [PASS]
        Running test: testBlackScholes           ................  [PASS]
        Running test: testHilbert                ................  [PASS]
        Running test: testNBodySmall             ................  [PASS]
        Running test: testDFTVectorTypes         ................  [PASS]
        Running test: matrixVector               ................  [PASS]
        Running test: testDFTFloat               ................  [PASS]
        Running test: testRenderTrack            ................  [PASS]
        Running test: testDFTDouble              ................  [PASS]
        Running test: testMandelbrot             ................  [FAILED]
                \_[REASON] expected:<8> but was:<9>
        Running test: testMontecarlo             ................  [PASS]
        Running test: matrixVectorFloat4         ................  [PASS]
        Running test: testJuliaSets              ................  [FAILED]
                \_[REASON] expected:<-1000.0> but was:<1.5197569>
        Running test: testNBody                  ................  [PASS]
        Running test: testEuler                  ................  [PASS]
        ...

==================================================
              Unit tests report
==================================================

{'[PASS]': 579, '[FAILED]': 16, '[UNSUPPORTED]': 22}
Coverage [PASS/(PASS+FAIL)]: 97.31%
Coverage [PASS/(PASS+FAIL+UNSUPPORTED)]: 93.84%

==================================================
....
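
For reference, the two coverage figures in the report follow directly from the counters. A small sketch of the arithmetic (the function name `coverage` is hypothetical; the counts are the ones printed above):

```python
def coverage(passed, failed, unsupported=0):
    """Percentage of passing tests among all counted tests."""
    return 100.0 * passed / (passed + failed + unsupported)


if __name__ == "__main__":
    counts = {"[PASS]": 579, "[FAILED]": 16, "[UNSUPPORTED]": 22}
    p, f, u = counts["[PASS]"], counts["[FAILED]"], counts["[UNSUPPORTED]"]
    print(f"{coverage(p, f):.2f}%")     # PASS/(PASS+FAIL)
    print(f"{coverage(p, f, u):.2f}%")  # PASS/(PASS+FAIL+UNSUPPORTED)
```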

@jjfumero
Member

Based on the previous test, I am leaning towards a misconfiguration of OpenCL on my Windows 11 machine.

@jjfumero
Member

I used the cmd.exe tool. Later I realized that using Python would have been better since it is necessary to run and test TornadoVM interactively anyway. I now think that customizing the original installer should be possible with little effort and am considering giving it a try.

Ok. My only concern is that, as it is, it kind of branches away from the style we have for Linux and OSx. To simplify the process of merging and review, my suggestion is that, for this iteration of the code, we move on with this CMD tool, and you can open a second PR with the Python migration if you want. Is this something you would like to try?

@jjfumero
Member

More updates regarding NVIDIA OpenCL support on Windows 11:

  • I uninstalled oneAPI just to see if that was the problem. Same failure for the Matrix Multiply.
  • I installed the ARC drivers -> same behaviour.
  • On a closer look, I noticed that the Matrix Multiply in OpenCL using the RTX 3070 via NVIDIA is correct for small matrices (less than 32x32), which makes me think it is related to the block size. The same GPU is used in WSL under Linux Ubuntu and it works. The only difference is the driver. The same local workgroup is selected for the PTX CUDA backend in TornadoVM, and it works. So this suggests to me it is a matter of drivers.
  • I updated my NVIDIA driver from "stable" to "gaming", and I noticed the same behaviour.

I am running out of ideas, but at least we know it is not due to the installation of oneAPI + ARC Drivers.

@jjfumero
Member

Ok, I think I got it.

So the error is printed by the driver and captured in our JNI code that dispatches OpenCL kernels:

[TornadoVM-OCL-JNI] ERROR : clEnqueueNDRangeKernel -> Returned: -5
[JNI] uk.ac.manchester.tornado.drivers.opencl> notify error:
[JNI] uk.ac.manchester.tornado.drivers.opencl> CL_OUT_OF_RESOURCES error executing CL_COMMAND_NDRANGE_KERNEL on NVIDIA GeForce RTX 3070 (Device 0).

This mainly suggests an issue with the block size. Since I noticed that smaller block sizes are executed correctly with OpenCL, I modified the Matrix Multiplication example in TornadoVM as follows:

TaskGraph taskGraph = new TaskGraph("s0") //
        .transferToDevice(DataTransferMode.FIRST_EXECUTION, matrixA, matrixB) //
        .task("t0", MatrixMultiplication2D::matrixMultiplication, matrixA, matrixB, matrixC, size) //
        .transferToHost(DataTransferMode.EVERY_EXECUTION, matrixC);

ImmutableTaskGraph immutableTaskGraph = taskGraph.snapshot();
TornadoExecutionPlan executor = new TornadoExecutionPlan(immutableTaskGraph);

// Force a 16x16 local work size for task s0.t0 instead of the runtime's default selection
WorkerGrid workerGrid = new WorkerGrid2D(matrixA.getNumRows(), matrixA.getNumColumns());
GridScheduler gridScheduler = new GridScheduler("s0.t0", workerGrid);
workerGrid.setLocalWork(16, 16, 1);

executor.withGridScheduler(gridScheduler).withWarmUp();

Diff:

diff --git a/tornado-examples/src/main/java/uk/ac/manchester/tornado/examples/compute/MatrixMultiplication2D.java b/tornado-examples/src/main/java/uk/ac/manchester/tornado/examples/compute/MatrixMultiplication2D.java
index 0426e2dbb..a28ed57c6 100644
--- a/tornado-examples/src/main/java/uk/ac/manchester/tornado/examples/compute/MatrixMultiplication2D.java
+++ b/tornado-examples/src/main/java/uk/ac/manchester/tornado/examples/compute/MatrixMultiplication2D.java
@@ -20,9 +20,7 @@ package uk.ac.manchester.tornado.examples.compute;
 import java.util.Random;
 import java.util.stream.IntStream;

-import uk.ac.manchester.tornado.api.ImmutableTaskGraph;
-import uk.ac.manchester.tornado.api.TaskGraph;
-import uk.ac.manchester.tornado.api.TornadoExecutionPlan;
+import uk.ac.manchester.tornado.api.*;
 import uk.ac.manchester.tornado.api.annotations.Parallel;
 import uk.ac.manchester.tornado.api.enums.DataTransferMode;
 import uk.ac.manchester.tornado.api.enums.TornadoDeviceType;
@@ -97,7 +95,12 @@ public class MatrixMultiplication2D {

         ImmutableTaskGraph immutableTaskGraph = taskGraph.snapshot();
         TornadoExecutionPlan executor = new TornadoExecutionPlan(immutableTaskGraph);
-        executor.withWarmUp();
+
+        WorkerGrid workerGrid = new WorkerGrid2D(matrixA.getNumRows(), matrixA.getNumColumns());
+        GridScheduler gridScheduler = new GridScheduler("s0.t0", workerGrid);
+        workerGrid.setLocalWork(16, 16, 1);
+
+        executor.withGridScheduler(gridScheduler).withWarmUp();

         // 1. Warm up Tornado
         for (int i = 0; i < WARMING_UP_ITERATIONS; i++) {

So I forced execution in blocks of 16x16 instead of the default value of 32x32, and the result is correct.

Task info: s0.t0
        Backend           : OPENCL
        Device            : NVIDIA GeForce RTX 3070 CL_DEVICE_TYPE_GPU (available)
        Dims              : 2
        Global work offset: [0, 0, 0]
        Global work size  : [512, 512, 1]
        Local  work size  : [16, 16, 1]
        Number of workgroups  : [32, 32, 1]

        Single Threaded CPU Execution: 2.58 GFlops, Total time = 104 ms
        Streams Execution: 15.79 GFlops, Total time = 17 ms
        TornadoVM Execution on GPU (Accelerated): 268.44 GFlops, Total Time = 1 ms
        Speedup: 104.0x
        Verification true
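
The workgroup numbers in the task info above can be reproduced with a few lines of arithmetic. This is only a sketch of the relationship between global size, local size, and number of workgroups; the function name `workgroup_layout` is hypothetical and nothing here is taken from the TornadoVM runtime.

```python
def workgroup_layout(global_size, local_size):
    """Return (workgroups per dimension, work-items per workgroup)."""
    if len(global_size) != len(local_size):
        raise ValueError("global and local sizes need the same number of dimensions")
    groups = []
    items = 1
    for g, l in zip(global_size, local_size):
        if g % l != 0:
            raise ValueError(f"global size {g} is not a multiple of local size {l}")
        groups.append(g // l)  # workgroups along this dimension
        items *= l             # work-items accumulate across dimensions
    return groups, items


if __name__ == "__main__":
    print(workgroup_layout([512, 512, 1], [16, 16, 1]))  # forced 16x16 blocks
    print(workgroup_layout([512, 512, 1], [32, 32, 1]))  # 32x32 default mentioned above
```

With 16x16 blocks each workgroup holds 256 work-items in a 32x32 grid of workgroups, which matches the output above. With 32x32 blocks each workgroup holds 1024 work-items, which is exactly the maximum of 1024 reported for the RTX 3070 in the device listing. Whether running at that limit is what triggers CL_OUT_OF_RESOURCES on this driver is not established here.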

Takeaways:

  • This issue (results not correct in OpenCL for MxM) does not have anything to do with the new installation for Windows, so we can move on with the PR.
  • This is a new issue (new for us at least) regarding the block sizes. The TornadoVM Runtime selects the block size using the NVIDIA Guidelines for OpenCL: https://www.nvidia.com/content/cudazone/download/opencl/nvidia_opencl_programmingguide.pdf
    So TornadoVM does not reinvent the wheel, and that block size should be valid because TornadoVM queries the device properties first. We will investigate this for Windows in a separate issue.
  • We can have ARC drivers + oneAPI drivers in combination with NVIDIA for Windows 11.

@otabuzzman
Contributor Author

otabuzzman commented Mar 22, 2024 via email

@jjfumero jjfumero added the enhancement New feature or request label Mar 26, 2024
Collaborator

@stratika stratika left a comment

LGTM, I think we can iterate to simplify the part of installation in the documentation. Very good work for native installation in Windows.

I tested it on Windows 11.

@jjfumero
Member

I will merge this. Awesome work @otabuzzman . Thank you!

@jjfumero jjfumero merged commit 6bb982e into beehive-lab:develop Mar 26, 2024
2 checks passed
jjfumero added a commit to jjfumero/TornadoVM that referenced this pull request Mar 27, 2024
Improvements
~~~~~~~~~~~~~~~~~~

- `beehive-lab#344 <https://github.com/beehive-lab/TornadoVM/pull/344>`_: Support for Multi-threaded Execution Plans.
- `beehive-lab#347 <https://github.com/beehive-lab/TornadoVM/pull/347>`_: Enhanced examples.
- `beehive-lab#350 <https://github.com/beehive-lab/TornadoVM/pull/350>`_: Obtain internal memory segment for the Tornado Native Arrays without the object header.
- `beehive-lab#357 <https://github.com/beehive-lab/TornadoVM/pull/357>`_: API extensions to query and apply filters to backends and devices from the ``TornadoExecutionPlan``.
- `beehive-lab#359 <https://github.com/beehive-lab/TornadoVM/pull/359>`_: Support Factory Methods for FFI-based array collections to be used/composed in TornadoVM Task-Graphs.

Compatibility
~~~~~~~~~~~~~~~~~~

- `beehive-lab#351 <https://github.com/beehive-lab/TornadoVM/pull/351>`_: Compatibility of TornadoVM Native Arrays with the Java Vector API.
- `beehive-lab#352 <https://github.com/beehive-lab/TornadoVM/pull/352>`_: Refactor memory limit to take into account primitive types and wrappers.
- `beehive-lab#354 <https://github.com/beehive-lab/TornadoVM/pull/354>`_: Add DFT-sample benchmark in FP32.
- `beehive-lab#356 <https://github.com/beehive-lab/TornadoVM/pull/356>`_: Initial support for Windows 11 using Visual Studio Development tools.
- `beehive-lab#361 <https://github.com/beehive-lab/TornadoVM/pull/361>`_: Compatibility with the SPIR-V toolkit v0.0.4.
- `beehive-lab#366 <https://github.com/beehive-lab/TornadoVM/pull/363>`_: Level Zero JNI Dependency updated to 0.1.3.

Bug Fixes
~~~~~~~~~~~~~~~~~~

- `beehive-lab#346 <https://github.com/beehive-lab/TornadoVM/pull/346>`_: Computation of local-work group sizes for the Level Zero/SPIR-V backend fixed.
- `beehive-lab#360 <https://github.com/beehive-lab/TornadoVM/pull/358>`_: Fix native tests to check the JIT compiler for each backend.
- `beehive-lab#355 <https://github.com/beehive-lab/TornadoVM/pull/355>`_: Fix custom exceptions when a driver/device is not found.
@jjfumero jjfumero mentioned this pull request Mar 27, 2024
8 tasks
@otabuzzman otabuzzman deleted the winstall branch March 27, 2024 10:28
@jjfumero jjfumero mentioned this pull request May 14, 2024
8 tasks

4 participants