hipamd: SIGSEGV when code for particular device architecture is absent #4

Closed
shibe2 opened this issue Jul 27, 2023 · 11 comments

Comments

@shibe2

shibe2 commented Jul 27, 2023

ROCm 5.6.0

This bug has 2 parts.

1. PlatformState::init returns immediately if digestFatBinary fails, leaving not only the failed binary uninitialized, but also all binaries that happen to be further in the list. There is no indication of this condition to the application, and by default, no diagnostic message.

2. hip::Function::getStatFunc and other functions use a null pointer from modules_, and the program crashes.
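
The two parts above can be illustrated with a simplified model in plain C++. This is only a sketch of the early-return pattern as described in the report, not the actual hipamd code: `Module`, `FatBinary`, `has_code_for_device`, `init_all_early_return`, and `count_uninitialized` are hypothetical names, and the order of the real internal list is not necessarily the order in which libraries were loaded.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-ins for illustration; none of these names come from hipamd.
struct Module {
    std::string name;
};

struct FatBinary {
    std::string name;
    bool has_code_for_device;        // false models a failing digestFatBinary
    std::unique_ptr<Module> module;  // stays null until the binary is initialized
};

// Models the reported pattern: the first failure aborts the whole loop,
// so every binary after the failing one is never initialized either.
bool init_all_early_return(std::vector<FatBinary>& binaries) {
    for (auto& bin : binaries) {
        if (!bin.has_code_for_device) {
            return false;  // binaries later in the list are skipped
        }
        bin.module = std::make_unique<Module>(Module{bin.name});
    }
    return true;
}

// Counts binaries whose module pointer is still null after initialization.
// Dereferencing any of these without a check is the second part of the bug.
std::size_t count_uninitialized(const std::vector<FatBinary>& binaries) {
    std::size_t n = 0;
    for (const auto& bin : binaries) {
        if (!bin.module) {
            ++n;
        }
    }
    return n;
}
```

In this model, a list of three binaries where only the second lacks code for the device ends up with two null module pointers: the failing binary and the one after it.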

@cjatin
Contributor

cjatin commented Jul 28, 2023

Can you share the code, the compile command, and your system config (GPU name)?

@shibe2
Author

shibe2 commented Jul 29, 2023

Reproduction: https://github.com/shibe2/hipamd-crash-4

Tested on multiple systems with different AMD GPUs; each system has 1 GPU.

A real-world occurrence of this bug is that PyTorch crashes if it was compiled with ROCm but without code for the particular GPU that the end user has. See AUTOMATIC1111/stable-diffusion-webui#11712

@Epliz

Epliz commented Aug 2, 2023

IMO, the desired behaviour would be that a GPU for which kernels are missing is not detected as a device, but no crash happens and other GPUs can still be used (the same effect as masking out the GPU with ROCR_VISIBLE_DEVICES).

This seems particularly relevant to me for scenarios where a user might have an unsupported APU but a supported discrete GPU.
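
A minimal sketch of that proposal, in plain C++ rather than against the HIP runtime: devices whose architecture has no compiled code are simply left out of the usable list. `Device` and `usable_devices` are hypothetical names, the architecture strings used with it are only illustrative, and matching by exact string is an assumption (it ignores any compatibility rules between architectures). It also considers a single set of compiled architectures, whereas a process may load several binaries.

```cpp
#include <set>
#include <string>
#include <vector>

// Hypothetical device record for illustration; not a HIP type.
struct Device {
    int id;
    std::string arch;  // e.g. "gfx1100"
};

// Returns only the devices for which compiled code exists.
// Devices without matching code are hidden instead of causing a failure.
std::vector<Device> usable_devices(const std::vector<Device>& all,
                                   const std::set<std::string>& compiled_archs) {
    std::vector<Device> usable;
    for (const auto& dev : all) {
        if (compiled_archs.count(dev.arch) > 0) {
            usable.push_back(dev);
        }
    }
    return usable;
}
```

With an unsupported APU and a supported discrete GPU in the input, only the discrete GPU would be returned, which is the same outcome as masking the APU out by hand.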

@cjatin
Contributor

cjatin commented Aug 3, 2023

Can you run the example with the AMD_LOG_LEVEL=7 environment variable set and share the logs?

Also, you might need -fPIC with -shared.

@shibe2
Author

shibe2 commented Aug 5, 2023

@Epliz It must be noted that multiple fat binaries may be loaded in a single process, each with different supported architectures.

@cjatin I believe, in my case, PIC is automatically enabled when needed. I used AMD_LOG_LEVEL when I was investigating the crash. I put my findings in the original report. Whoever will be working on this issue can play with my reproduction code and set any options they like.

@cjatin
Contributor

cjatin commented Sep 4, 2023

After adding -fPIC to the Makefile

./app native1.so gfx801.so native2.so
native1.so: ok
gfx801.so: hipErrorInvalidDeviceFunction
native2.so: ok

It might be a HIP version difference. Can you tell me the HIP version you are using?

It can be seen via hipcc -v or apt show hip-dev.

My makefile changes:

native%.so: lib.cpp
	hipcc -o $@ -fPIC -shared $<

gfx%.so: lib.cpp
	hipcc --offload-arch=gfx$* -o $@ -fPIC -shared $<

@shibe2
Author

shibe2 commented Sep 4, 2023

For me -fPIC makes no difference.

./app native1.so gfx801.so native2.so
native1.so: Segmentation fault (core dumped)

hipcc -v
clang version 16.0.0
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /opt/rocm/llvm/bin
Found candidate GCC installation: /usr/lib/gcc/x86_64-pc-linux-gnu/13.2.1
Found candidate GCC installation: /usr/lib64/gcc/x86_64-pc-linux-gnu/13.2.1
Selected GCC installation: /usr/lib64/gcc/x86_64-pc-linux-gnu/13.2.1
Candidate multilib: .;@ m64
Candidate multilib: 32;@ m32
Selected multilib: .;@ m64
Found HIP installation: /opt/rocm, version 5.6.31061

@WeeBull

WeeBull commented Sep 17, 2023

I get the same behaviour as well (-fPIC or not). For me, I have:

  • GPU: gfx1102
  • CPU: 5900X
  • Kernel: 6.5.3
  • HIP: 5.6.31062 (hipcc -v)

> PlatformState::init returns immediately if digestFatBinary fails, leaving not only the failed binary uninitialized, but also all binaries that happen to be further in the list. There is no indication of this condition to the application, and by default, no diagnostic message.
>
> hip::Function::getStatFunc and other functions use a null pointer from modules_, and the program crashes.

I, too, had traced it to that null pointer from modules_, but I hadn't discovered why it was null.

@cjatin
Contributor

cjatin commented Nov 15, 2023

I think the issue might be the iGPU present in the system.
Can someone who is seeing the failure share the logs from a run with AMD_LOG_LEVEL=7?

@WeeBull

WeeBull commented Nov 15, 2023

> I think the issue might be the iGPU present in the system.

That certainly can be a problem. I helped somebody out on Discord who was having that issue with a Ryzen 7000 series and a 7900 XTX. The software found the integrated GPU before the discrete GPU. We had to use environment variables to get it to ignore the integrated one.

It's not my situation though; I only have a discrete GPU in the system.

@shibe2
Author

shibe2 commented Apr 6, 2024

I tested it with ROCm 6.0.2. It no longer crashes, but it fails with hipErrorSharedObjectInitFailed. For example:

./app native1.so
native1.so: ok

but

./app native1.so gfx908.so
native1.so: hipErrorSharedObjectInitFailed
gfx908.so: hipErrorSharedObjectInitFailed

That is, the presence of a kernel with a missing architecture causes all other kernels to fail. It would be better if, in my example, native1 continued to work.
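
The preferred behaviour could look roughly like the following plain C++ sketch, in which each binary is initialized independently and gets its own status. This is an illustration under assumptions, not a description of how the HIP runtime works or should be patched: `LoadStatus`, `BinaryInfo`, and `init_each` are hypothetical names, and matching architectures by exact string is a simplification.

```cpp
#include <algorithm>
#include <map>
#include <string>
#include <vector>

// Hypothetical types for illustration; not taken from hipamd.
enum class LoadStatus { Ok, NoCodeForDevice };

struct BinaryInfo {
    std::string name;
    std::vector<std::string> archs;  // architectures the binary was built for
};

// Initializes every binary independently and records a per-binary result,
// so one binary without matching code does not affect the others.
std::map<std::string, LoadStatus> init_each(
        const std::vector<BinaryInfo>& binaries,
        const std::string& device_arch) {
    std::map<std::string, LoadStatus> result;
    for (const auto& bin : binaries) {
        const bool found = std::find(bin.archs.begin(), bin.archs.end(),
                                     device_arch) != bin.archs.end();
        result[bin.name] = found ? LoadStatus::Ok : LoadStatus::NoCodeForDevice;
    }
    return result;
}
```

Under this model, a binary built for the installed GPU would report Ok even when a second binary built only for gfx908 is loaded alongside it, and only the second one would report an error.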

This report is specifically about a crash, and that seems to be fixed, so I'm closing this.

Also, it may only affect cases when modules are loaded before HIP initialization.

shibe2 closed this as completed Apr 6, 2024