hipamd: SIGSEGV when code for particular device architecture is absent #4

shibe2 · 2023-07-27T21:19:00Z

ROCm 5.6.0

This bug has 2 parts.

1. PlatformState::init returns immediately if digestFatBinary fails, leaving not only the failed binary uninitialized, but also all binaries that happen to be further in the list. There is no indication of this condition to the application, and by default, no diagnostic message.

2. hip::Function::getStatFunc and other functions then use a null pointer from modules_, and the program crashes.
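
The two parts can be modelled in a few lines of plain C++. This is a simplified sketch, not the hipamd source: `FatBinary`, `digest`, `init_early_return`, and `lookup_checked` are hypothetical stand-ins for the real `digestFatBinary`, `PlatformState::init`, and `hip::Function::getStatFunc`, and the `has_code_for_device` flag is an assumption that replaces the real architecture matching.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-in for a loaded module; not a hipamd type.
struct Module {
    std::string name;
};

// Hypothetical model of one registered fat binary.
struct FatBinary {
    std::string name;
    bool has_code_for_device;        // assumption: replaces real arch matching
    std::unique_ptr<Module> module;  // stays null until the binary is digested
};

// Stand-in for digestFatBinary: fails when no code matches the device.
bool digest(FatBinary& fb) {
    if (!fb.has_code_for_device) return false;
    fb.module = std::make_unique<Module>(Module{fb.name});
    return true;
}

// Part 1: the first failure aborts the loop, so every binary that comes
// later in the list is never digested and keeps its null module.
bool init_early_return(std::vector<FatBinary>& bins) {
    for (auto& fb : bins) {
        if (!digest(fb)) return false;
    }
    return true;
}

// Part 2: a lookup that dereferenced fb.module unconditionally would crash
// here. This variant reports the missing module to the caller instead.
const Module* lookup_checked(const FatBinary& fb) {
    return fb.module.get();  // nullptr means "not initialized"
}

void demo_early_return() {
    std::vector<FatBinary> bins;
    bins.push_back({"gfx801.so", false, nullptr});  // no code for this device
    bins.push_back({"native1.so", true, nullptr});  // would load fine alone

    const bool ok = init_early_return(bins);
    assert(!ok);
    // native1.so was skipped only because it follows the failed binary.
    assert(lookup_checked(bins[0]) == nullptr);
    assert(lookup_checked(bins[1]) == nullptr);
}
```

Calling `demo_early_return()` from a `main` function exercises both parts. The null check in `lookup_checked` only turns the crash into a reportable condition; it does not make the skipped binaries usable.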


cjatin · 2023-07-28T09:07:45Z

Can you share the code, the compile command, and your system config (GPU name)?

shibe2 · 2023-07-29T11:14:44Z

Reproduction: https://github.com/shibe2/hipamd-crash-4

Tested on multiple systems with different AMD GPUs, each has 1 GPU.

A real-world occurrence of this bug: PyTorch crashes if it was compiled with ROCm but without code for the particular GPU that the end user has: AUTOMATIC1111/stable-diffusion-webui#11712

Epliz · 2023-08-02T05:41:41Z

IMO, the desired behaviour would be that a GPU for which kernels are missing is not detected as a device, but no crash happens and other GPUs can still be used (the same effect as masking out the GPU with ROCR_VISIBLE_DEVICES).

This seems particularly relevant to me for scenarios where a user might have an unsupported APU but a supported discrete GPU.

cjatin · 2023-08-03T09:25:42Z

Can you run the example with the AMD_LOG_LEVEL=7 environment variable set and share the logs?

Also, you might need -fPIC with -shared.

shibe2 · 2023-08-05T11:32:00Z

@Epliz It must be noted that multiple fat binaries may be loaded in a single process, each with different supported architectures.

@cjatin I believe that, in my case, PIC is automatically enabled when needed. I used AMD_LOG_LEVEL when I was investigating the crash, and I put my findings in the original report. Whoever works on this issue can experiment with my reproduction code and set any options they like.

cjatin · 2023-09-04T11:38:50Z

After adding -fPIC to the Makefile

./app native1.so gfx801.so native2.so
native1.so: ok
gfx801.so: hipErrorInvalidDeviceFunction
native2.so: ok

It might be a HIP version difference. Can you tell me which HIP version you are using?

It can be seen via hipcc -v or apt show hip-dev.

My makefile changes:

native%.so: lib.cpp
	hipcc -o $@ -fPIC -shared $<

gfx%.so: lib.cpp
	hipcc --offload-arch=gfx$* -o $@ -fPIC -shared $<

shibe2 · 2023-09-04T13:06:30Z

For me -fPIC makes no difference.

./app native1.so gfx801.so native2.so
native1.so: Segmentation fault (core dumped)

hipcc -v
clang version 16.0.0
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /opt/rocm/llvm/bin
Found candidate GCC installation: /usr/lib/gcc/x86_64-pc-linux-gnu/13.2.1
Found candidate GCC installation: /usr/lib64/gcc/x86_64-pc-linux-gnu/13.2.1
Selected GCC installation: /usr/lib64/gcc/x86_64-pc-linux-gnu/13.2.1
Candidate multilib: .;@ m64
Candidate multilib: 32;@ m32
Selected multilib: .;@ m64
Found HIP installation: /opt/rocm, version 5.6.31061

WeeBull · 2023-09-17T23:59:52Z

I get the same behaviour as well (-fPIC or not). For me, I have:

GPU: gfx1102
CPU: 5900X
Kernel: 6.5.3
HIP: 5.6.31062 (hipcc -v)

> PlatformState::init returns immediately if digestFatBinary fails, leaving not only the failed binary uninitialized, but also all binaries that happen to be further in the list. There is no indication of this condition to the application, and by default, no diagnostic message.
>
> hip::Function::getStatFunc and other functions use null pointer from modules_, and the program crashes.

I, too, had traced it to that null pointer from modules_, but I hadn't discovered why it was null.

cjatin · 2023-11-15T16:27:28Z

I think the issue might be the iGPU present in the system.
Can someone who sees the failure share the logs from a run with AMD_LOG_LEVEL=7?

WeeBull · 2023-11-15T17:33:23Z

> I think the issue might be the iGPU present in the system.

That certainly can be a problem. I helped somebody on Discord who was having that issue with a Ryzen 7000 series and a 7900 XTX. The software found the integrated GPU before the discrete GPU. We had to use environment variables to get it to ignore the integrated one.

It's not my situation, though: I only have a discrete GPU in the system.

shibe2 · 2024-04-06T19:06:55Z

I tested it with ROCm 6.0.2. It no longer crashes, but it fails with hipErrorSharedObjectInitFailed. For example:

./app native1.so
native1.so: ok

but

./app native1.so gfx908.so
native1.so: hipErrorSharedObjectInitFailed
gfx908.so: hipErrorSharedObjectInitFailed

That is, the presence of a kernel with a missing architecture causes all other kernels to fail. It would be better if, in my example, native1 continued to work.

This report is specifically about a crash, and that seems to be fixed, so I'm closing this.

Also, it may only affect cases when modules are loaded before HIP initialization.
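
The behaviour requested here, where one binary without code for the device does not take down the others, can be sketched as initialization that records a status per binary. This is a minimal model under stated assumptions and not HIP's implementation: `Binary`, `Status`, and `init_all` are hypothetical names, architectures are compared as exact strings (real matching is more involved), and `gfx1102` is used as the device only because it is one of the GPUs reported in this thread.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical per-binary result; not a HIP error code.
enum class Status { Ok, NoCodeForDevice };

// Hypothetical model of a fat binary and the architectures it contains.
struct Binary {
    std::string name;
    std::set<std::string> archs;
};

// Check every binary on its own. A binary with no matching code is marked
// as failed, and the loop continues so later binaries are still set up.
std::map<std::string, Status> init_all(const std::vector<Binary>& bins,
                                       const std::string& device_arch) {
    std::map<std::string, Status> result;
    for (const auto& b : bins) {
        result[b.name] = b.archs.count(device_arch) != 0
                             ? Status::Ok
                             : Status::NoCodeForDevice;
    }
    return result;
}

void demo_isolated_init() {
    // Assumption: native1.so was built for the local device (gfx1102 here).
    const std::vector<Binary> bins = {
        {"native1.so", {"gfx1102"}},
        {"gfx908.so", {"gfx908"}},
    };
    const auto status = init_all(bins, "gfx1102");
    assert(status.at("native1.so") == Status::Ok);
    assert(status.at("gfx908.so") == Status::NoCodeForDevice);
}
```

With per-binary status, a call into native1.so could succeed while a call into gfx908.so returns an error, instead of both failing together.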

viebrix mentioned this issue Aug 4, 2023

Possible to update PyTorch build to support Torch 1.13.1 Rocm5.2? xuhuisheng/rocm-gfx803#27

Open

shibe2 closed this as completed Apr 6, 2024
