hipamd: SIGSEGV when code for particular device architecture is absent #4
Comments
Can you share the code, the compile command, and your system config (GPU name)? |
Reproduction: https://github.com/shibe2/hipamd-crash-4 Tested on multiple systems with different AMD GPUs, each with a single GPU. A real-world occurrence of this bug: PyTorch crashes if it was compiled with ROCm but without code for the particular GPU that the end user has: AUTOMATIC1111/stable-diffusion-webui#11712 |
IMO, the desired behaviour would be that a GPU for which kernels are missing is not detected as a device, but no crash happens and the other GPUs can still be used (the same effect as masking out the GPU with ROCR_VISIBLE_DEVICES). This seems particularly relevant to me for scenarios where a user might have an unsupported APU but a supported discrete GPU. |
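A rough sketch of the filtering idea described above, in plain C++ without any HIP calls. The helper `usableDevices` is hypothetical and not part of HIP, and the architecture strings in the usage note are only illustrative. It also assumes exact string matching, which ignores target feature suffixes that real code object names may carry.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Hypothetical helper, not part of HIP: returns the indices of devices whose
// architecture name appears among the architectures compiled into a binary.
// Assumption: architectures are compared as exact strings.
std::vector<std::size_t> usableDevices(const std::vector<std::string>& deviceArchs,
                                       const std::set<std::string>& compiledArchs) {
    std::vector<std::size_t> usable;
    for (std::size_t i = 0; i < deviceArchs.size(); ++i) {
        if (compiledArchs.count(deviceArchs[i]) != 0) {
            usable.push_back(i);  // code exists for this device, keep it visible
        }
        // A device without matching code is skipped rather than reported,
        // which mirrors masking it out by environment variable.
    }
    return usable;
}
```

With devices `{"gfx1036", "gfx1100"}` and a binary built only for `{"gfx1100", "gfx90a"}`, the helper returns index 1 only, so the APU would be hidden and the discrete GPU would stay usable.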
can you run the example with Also you might need |
@Epliz It must be noted that multiple fat binaries may be loaded in a single process, each with different supported architectures. @cjatin I believe, in my case, PIC is automatically enabled when needed. I used |
After adding
It might be a HIP version difference. Can you tell me the HIP version you are using? It can be seen via My makefile changes:
|
For me ./app native1.so gfx801.so native2.so hipcc -v |
I get the same behaviour as well (
I too, had traced it to that null pointer from |
I think the issue might be the iGPU present in the system. |
That certainly can be a problem. I helped somebody out on Discord who was having that issue with a Ryzen 7000 series CPU and a 7900 XTX. The software found the integrated GPU before the discrete GPU. We had to use environment variables to get it to ignore the integrated one. It's not my situation though, I only have a discrete GPU in the system. |
I tested it with ROCm 6.0.2. It no longer crashes, but it fails with hipErrorSharedObjectInitFailed. For example:
but
That is, the presence of a kernel with a missing architecture causes all other kernels to fail. It would be better if in my example native1 continued to work. This report is specifically about a crash, and that seems to be fixed, so I'm closing this. Also, it may only affect cases when modules are loaded before HIP initialization. |
ROCm 5.6.0
This bug has 2 parts.
1. PlatformState::init returns immediately if digestFatBinary fails, leaving not only the failed binary uninitialized, but also all binaries that happen to be further in the list. There is no indication of this condition to the application, and by default, no diagnostic message.
2. hip::Function::getStatFunc and other functions use the null pointer from modules_, and the program crashes.
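The two parts can be illustrated with a small standalone model. This is only a sketch under stated assumptions: `Module`, `FatBinary`, `initStopOnFailure`, `initSkipFailures` and `findModule` are hypothetical stand-ins and are not the real hipamd types or functions. The `hasCodeForDevice` flag simply models whether digesting the binary would succeed.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-ins; these are not the real hipamd types.
struct Module { std::string name; };
struct FatBinary {
    std::string name;
    bool hasCodeForDevice;           // false models a digest failure
    std::unique_ptr<Module> module;  // stays null until initialized
};

// Models part 1: return at the first failure, so every binary further in
// the list is left with a null module even if it has usable code.
bool initStopOnFailure(std::vector<FatBinary>& bins) {
    for (auto& b : bins) {
        if (!b.hasCodeForDevice) return false;
        b.module = std::make_unique<Module>(Module{b.name});
    }
    return true;
}

// One possible alternative: skip the failed binary, keep initializing the
// rest, and report how many binaries failed.
std::size_t initSkipFailures(std::vector<FatBinary>& bins) {
    std::size_t failed = 0;
    for (auto& b : bins) {
        if (!b.hasCodeForDevice) { ++failed; continue; }
        b.module = std::make_unique<Module>(Module{b.name});
    }
    return failed;
}

// Models a guard for part 2: return null to the caller instead of
// dereferencing a module that was never initialized.
const Module* findModule(const std::vector<FatBinary>& bins, std::size_t i) {
    if (i >= bins.size()) return nullptr;
    return bins[i].module.get();
}
```

For a list such as native1.so, gfx801.so, native2.so where only gfx801.so lacks code for the device, `initStopOnFailure` leaves both gfx801.so and native2.so with null modules, while `initSkipFailures` leaves only gfx801.so null. In either case a caller of `findModule` has to check the result before using it.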