
GPU task switching causes computation errors for Asteroids@home when using 2 or more different models of GPU of the same type #5743

Open
sr1gh opened this issue Aug 9, 2024 · 11 comments

Comments

@sr1gh

sr1gh commented Aug 9, 2024

Describe the bug
If GPU computation is suspended during use or while an exclusive application is running, BOINC sometimes swaps which task is on which GPU when computation resumes. This causes a computation error for Asteroids@home tasks when using multiple AMD GPUs, for example an RX 7600 XT and an RX 6600. This might be an application-specific issue, but it might be a good idea to have an option to not switch tasks between GPUs if possible, unless, for example, one GPU is removed, in which case all the tasks would have to run on the remaining GPU.

Steps To Reproduce

  1. Start the Asteroids@home period search application on a system with 2 different AMD GPUs.
  2. Suspend computation partway through, observing which task is on which GPU, then resume computation.
  3. Repeat if necessary until BOINC switches tasks between GPUs, resulting in a computation error.

Expected behavior
I would expect the task to stay on the GPU it started on if that is necessary for the task to finish. An option to disable GPU task switching is a potential solution, or tasks could specify whether or not they can be switched.

Screenshots

System Information

  • OS: Windows 10 (Latest)
  • BOINC Version: 8.0.4

Additional context

@AenBleidd
Member

I'm pretty sure this is an issue with the project application, because for every task we assign at start-up the ID (0, 1, 2, etc.) of the GPU to be used.
If the project application doesn't use it but instead relies on some other mechanism, then there might be a collision.
@sr1gh, may I ask you for a favor?
Could you please go to %BOINCDATA%\slots\%N%
where

  • %BOINCDATA% is the folder where your BOINC data is located (usually C:\ProgramData\BOINC)
  • %N% is the number of the slot

Locate the two running tasks there (two different %N% folders), open their init_data.xml, and check the <gpu_device_num> value.
These numbers should be different for different tasks, but they should stay consistent after a task is suspended and run again.
If these numbers stay the same but the application crashes, then this is definitely an issue with the project application, and it should be reported to their admins.
In any case, please report these results back to us, and we will check that there is no issue on our side.
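
For reference, the value in question appears in each slot's init_data.xml as a line like `<gpu_device_num>0</gpu_device_num>`. As a minimal sketch (hypothetical code, not the Asteroids@home source), this is roughly how a project application is expected to pick up that assignment through the BOINC API, assuming the standard `boinc_get_init_data()` call and the `gpu_device_num` field of `APP_INIT_DATA`:

```cpp
// Hypothetical sketch, not the Asteroids@home source: use the GPU instance
// the client assigned (written into init_data.xml) instead of choosing a
// device by some other heuristic, which is where collisions can come from.
#include "boinc_api.h"   // boinc_get_init_data(), APP_INIT_DATA

int pick_assigned_device() {
    APP_INIT_DATA aid;
    boinc_get_init_data(aid);   // parses init_data.xml in the slot directory
    return aid.gpu_device_num;  // the instance the client assigned, e.g. 0 or 1
}
```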

@davidpanderson
Contributor

AFAIK the client doesn't have a mechanism for pinning a job to a GPU. I need to verify that it rewrites init_data.xml before restarting a job; otherwise a collision could happen.

@RichardHaselgrove
Contributor

The same issue is a significant problem at GPUGrid.

init_data contains the correct <gpu_device_num> for a running task. But if BOINC is stopped and restarted, there is no guarantee that the same GPU will be assigned by BOINC. If the new GPU is identical to the previous run, the task restarts normally.

But if it is not identical, the task crashes, potentially losing several hours of work. The crash is initiated by the project application, but could be prevented by the BOINC client remembering and reusing the device allocation at startup.

NB: consider respecting previous OpenCL device numbers too, although I've only seen the problem for CUDA apps.
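
A minimal sketch of that idea (hypothetical code with made-up names, not existing BOINC client code): persist the device assignment alongside the task state and prefer it when the task is restarted, falling back to a fresh assignment only if the device is no longer present.

```cpp
// Illustrative only: remember which coprocessor instance a result last ran on
// and reuse it on restart; the same could be done for OpenCL device indices.
#include <map>
#include <string>

struct GpuAssignment {
    std::string gpu_type;    // e.g. "ATI" or "NVIDIA"
    int device_num = -1;     // instance the task ran on last time
};

// result name -> last assignment; would have to be saved in the client state file
std::map<std::string, GpuAssignment> saved_assignments;

int choose_device(const std::string& result_name, int ndevices) {
    auto it = saved_assignments.find(result_name);
    if (it != saved_assignments.end() &&
        it->second.device_num >= 0 && it->second.device_num < ndevices) {
        return it->second.device_num;   // reuse the previous device
    }
    return 0;   // placeholder for the client's normal assignment logic
}
```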

@davidpanderson
Contributor

The issue is whether the task crashes because it runs on a different GPU than where it started, or because 2 tasks are trying to use the same GPU.

The former seems odd - why would a checkpoint file be specific to a GPU instance?

@RichardHaselgrove
Contributor

I'm looking through my recent errors for an example of the specific failure case, but I haven't found one yet.

From memory, the problem comes from the 'just in time' GPU code compiler. At GPUGrid, this produces code which is specific to the individual GPU type used in the first run. If the second GPU is different, the now pre-compiled code is incompatible with the hardware.

@RichardHaselgrove
Contributor

Can't find an error on my own machines - I know from bitter experience that I have to avoid shutdowns when GPUGrid work is running.

But see https://www.gpugrid.net/forum_thread.php?id=5461 for a report/response on their message board.

@AenBleidd
Member

The former seems odd - why would a checkpoint file be specific to a GPU instance?

@davidpanderson, as @RichardHaselgrove already mentioned, it's very important that a task that started running on a particular GPU sticks to it; otherwise it's not guaranteed that the computation can be continued, even from the checkpoint.

@sr1gh
Author

sr1gh commented Aug 9, 2024

Yes, it appears that some GPU apps generate code for the specific hardware used. Here is the error output from a failed task from Asteroids@home.

<stderr_txt>
BOINC client version 8.0.4
BOINC GPU type 'ATI', deviceId=1, slot=0
Application: period_search_10220_windows_x86_64__opencl_102_amd_win.exe
Version: 102.20.0.0
Platform name: AMD Accelerated Parallel Processing
Platform vendor: Advanced Micro Devices, Inc.
OpenCL device C version: OpenCL C 2.0 | OpenCL 2.0 AMD-APP (3617.0)
OpenCL device Id: 1
OpenCL device name: AMD Radeon RX 6600 7GB
Device driver version: 3617.0 (PAL,LC)
Multiprocessors: 14
Max Samplers: 16
Max work item dimensions: 3
Resident blocks per multiprocessor: 16
Grid dim: 448 = 2 * 14 * 16
Block dim: 128
Binary build log for AMD Radeon RX 6600:
OK (0)
Program build log for AMD Radeon RX 6600:
OK (0)
Prefered kernel work group size multiple: 32
Setting Grid Dim to 256
Platform name: AMD Accelerated Parallel Processing
Platform vendor: Advanced Micro Devices, Inc.
OpenCL device C version: OpenCL C 2.0 | OpenCL 2.0 AMD-APP (3617.0)
OpenCL device Id: 0
OpenCL device name: AMD Radeon RX 7600 XT 15GB
Device driver version: 3617.0 (PAL,LC)
Multiprocessors: 16
Max Samplers: 16
Max work item dimensions: 3
Resident blocks per multiprocessor: 16
Grid dim: 512 = 2 * 16 * 16
Block dim: 128
Build log: AMD Accelerated Parallel Processing | AMD Radeon RX 7600 XT:
Error: The program ISA amdgcn-amd-amdhsa--gfx1032 is not compatible with the device ISA amdgcn-amd-amdhsa--gfx1102
Error: create kernel metadata map using COMgr
Error: Cannot Find Global Var Sizes
Error: Cannot create kernels.

Error creating queue: build program failure(-11)

</stderr_txt>

@sr1gh
Author

sr1gh commented Aug 9, 2024

The <gpu_device_num> values appeared the same after resuming computation, but in BOINC Manager the task that said "device 0" likely said "device 1" before the error. The error happens immediately after resuming, so it is hard to tell, although I have seen this swap occur with other applications from other projects. The following error from the post above would also indicate that the tasks are sometimes swapping GPUs:

Error: The program ISA amdgcn-amd-amdhsa--gfx1032 is not compatible with the device ISA amdgcn-amd-amdhsa--gfx1102

gfx1032 is RX 6600
gfx1102 is RX 7600 XT

@davidpanderson
Contributor

One option would be for the app to compile its kernels each time it starts.
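
A minimal sketch of that approach for a plain OpenCL host program (illustrative only, not the Asteroids@home code): JIT-compile the kernels from source on every start, against whichever device the client assigned this time, instead of reusing a binary compiled for a different GPU on a previous run.

```cpp
// Sketch only: always build from source for the device we actually have now,
// which avoids the "program ISA ... not compatible with the device ISA" failure
// seen above when a binary built for gfx1032 is reused on a gfx1102 device.
#include <CL/cl.h>
#include <cstdio>

cl_program build_fresh(cl_context ctx, cl_device_id dev, const char* src) {
    cl_int err = CL_SUCCESS;
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);
    if (err != CL_SUCCESS) return NULL;

    // JIT compile for this specific device at startup.
    err = clBuildProgram(prog, 1, &dev, "", NULL, NULL);
    if (err != CL_SUCCESS) {
        char log[4096] = {0};
        clGetProgramBuildInfo(prog, dev, CL_PROGRAM_BUILD_LOG,
                              sizeof(log), log, NULL);
        std::fprintf(stderr, "kernel build failed: %s\n", log);
        clReleaseProgram(prog);
        return NULL;
    }
    return prog;
}
```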

@davidpanderson
Contributor

If we pin each GPU job to a GPU instance, the following could happen:
jobs A and B are running on GPUs 0 and 1 respectively.
Job C arrives with an early deadline, so it preempts job A and starts running on GPU 0.
Job B finishes.

We now have 2 jobs pinned to GPU 0; GPU 1 is idle.
The work fetch logic (which doesn't know about GPU assignments) thinks that both GPUs are busy, so it doesn't fetch more jobs.

To avoid this, we'd have to extend the simulation done by the work fetch logic to model GPU assignments (in addition to per-project GPU exclusions, max concurrency, etc.). This would be quite difficult.
It would be better if apps could recompile their kernels on startup.
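
For illustration only (hypothetical code, not the client's scheduler), here is the scenario above in miniature: counting unfinished GPU jobs says both GPUs are busy, while modeling the pinned assignments shows that GPU 1 is actually idle.

```cpp
// Toy model of the pinning problem: jobs A and B start on GPUs 0 and 1,
// C preempts A on GPU 0, B finishes. Two jobs remain, both pinned to GPU 0.
#include <iostream>
#include <map>
#include <string>
#include <vector>

int main() {
    const int num_gpus = 2;
    std::map<std::string, int> pinned = {{"A", 0}, {"B", 1}};
    pinned["C"] = 0;     // C preempts A and is pinned to GPU 0
    pinned.erase("B");   // B finishes; GPU 1 is now idle

    // Work fetch that only counts jobs vs. GPUs concludes "both GPUs busy".
    std::cout << "unfinished GPU jobs: " << pinned.size()
              << ", GPUs: " << num_gpus << "\n";

    // Modeling the assignments shows the real picture.
    std::vector<int> jobs_per_gpu(num_gpus, 0);
    for (const auto& kv : pinned) jobs_per_gpu[kv.second]++;
    for (int g = 0; g < num_gpus; g++)
        std::cout << "GPU " << g << ": " << jobs_per_gpu[g] << " pinned job(s)\n";
    return 0;
}
```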
