rocminfo fails when amdgpu is built into the kernel #42

Open
FireBurn opened this issue Jun 29, 2021 · 15 comments · May be fixed by #65

Comments

@FireBurn

Is there another way of detecting that amdgpu is loaded than running lsmod?

@fxkamd

fxkamd commented Jun 29, 2021

I'm not sure what your question is about. Do you want to find out whether amdgpu loaded successfully? Or are you asking whether rocminfo could use some alternative way to detect amdgpu?

For your own troubleshooting, check dmesg or "journalctl -k -b".

@FireBurn
Author

Sorry, I should have been clearer: rocminfo doesn't get past https://github.com/RadeonOpenCompute/rocminfo/blob/10da0a71da6700c91e8cd204927cca0d9461b586/rocminfo.cc#L1041

rocminfo
ROCk module is NOT loaded, possibly no GPU devices

@skeelyamd
Collaborator

rocminfo tests for a working kfd and installation very cautiously, since many unrelated things can go wrong. If you've built amdgpu into your kernel, you can skip the lsmod check, since you already know you attempted to load the driver. If the driver fails to initialize for some reason, /dev/kfd will not be present and the next check will detect that. If you're looking for help with your local, custom build, you could simply remove the lsmod check. If, on the other hand, you're looking to contribute a rocminfo PR, then recording the lsmod failure and continuing is probably the right direction. That way rocminfo can print all the possible causes of failure it encountered along the way, yet remain quiet if hsa_init actually succeeds despite the failed checks.

@FireBurn
Author

I'll think of another way of detecting amdgpu / amdkfd being available

diff --git a/rocminfo.cc b/rocminfo.cc
index ee01f60..366982d 100755
--- a/rocminfo.cc
+++ b/rocminfo.cc
@@ -1034,17 +1034,6 @@ AcquireAndDisplayAgentInfo(hsa_agent_t agent, void* data) {
 }
 
 int CheckInitialState(void) {
-  // Check kernel module for ROCk is loaded
-  FILE *fd = popen("lsmod | grep amdgpu", "r");
-  char buf[16];
-  if (fread (buf, 1, sizeof (buf), fd) <= 0) {
-    printf("%sROCk module is NOT loaded, possibly no GPU devices%s\n",
-                                                          COL_RED, COL_RESET);
-    return -1;
-  } else {
-    printf("%sROCk module is loaded%s\n", COL_WHT, COL_RESET);
-  }
-
   // Check if user belongs to the group for /dev/kfd (e.g. "video" or
   // "render")
   // @note: User who are not members of "video"

This gets rocminfo working locally for now.

@FireBurn
Author

Is it enough to check that /sys/module/amdgpu exists?

@littlewu2508

littlewu2508 commented Aug 18, 2021

Same issue here.

> Is it enough to check that /sys/module/amdgpu exists?

I think that's a better way. On my machine, both a loaded module and a builtin module provide /sys/module/amdgpu (linux-5.13), while on another machine without amdgpu this path doesn't exist (5.10).
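
A minimal sketch of that check, assuming C++17 `std::filesystem` is available. `ModuleDirExists` is a hypothetical name, and the `root` parameter exists only so the function can be exercised against a scratch directory instead of the real /sys/module.

```cpp
#include <filesystem>
#include <string>
#include <system_error>

// Returns true if `root`/`name` is a directory. With the default root this
// asks whether /sys/module/<name> exists, which per the observation above
// holds for both a loaded and a built-in amdgpu.
bool ModuleDirExists(const std::string& name,
                     const std::filesystem::path& root = "/sys/module") {
  std::error_code ec;  // Non-throwing overload: any error counts as "absent".
  return std::filesystem::is_directory(root / name, ec);
}
```

Unlike `popen("lsmod | grep amdgpu", "r")`, this does not spawn a shell and does not depend on lsmod being installed.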

@littlewu2508

littlewu2508 commented Aug 18, 2021

> Is it enough to check that /sys/module/amdgpu exists?

I created a PR to implement this: #43

littlewu2508 added a commit to littlewu2508/rocminfo that referenced this issue Apr 2, 2022
Closes: ROCm#42

Signed-off-by: YiyangWu <xgreenlandforwyy@gmail.com>
@dmitrii-galantsev

fixed in 94b4b3f

@littlewu2508

> fixed in 94b4b3f

I think this commit does not fix the issue. The builtin amdgpu kernel module does not have /sys/module/amdgpu/initstate. So even after this fix, rocminfo still fails with

ROCk module is NOT loaded, possibly no GPU devices
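
To make the distinction concrete, here is a hedged sketch of a check that tolerates a missing initstate file. `DriverState` and `ClassifyDriver` are hypothetical names, not code from 94b4b3f or #43. Treating "directory present, initstate absent" as built-in is an assumption drawn from the behaviour reported in this thread, and `root` is a parameter only so the function can be tested outside /sys.

```cpp
#include <filesystem>
#include <fstream>
#include <string>
#include <system_error>

enum class DriverState { kAbsent, kBuiltIn, kLoaded, kNotLive };

// Classifies a driver by its sysfs entry under `root` (default /sys/module).
DriverState ClassifyDriver(const std::string& name,
                           const std::filesystem::path& root = "/sys/module") {
  namespace fs = std::filesystem;
  std::error_code ec;
  const fs::path dir = root / name;
  if (!fs::is_directory(dir, ec)) return DriverState::kAbsent;

  // Assumption from this thread: a built-in driver has the directory but
  // no initstate file.
  std::ifstream in(dir / "initstate");
  if (!in) return DriverState::kBuiltIn;

  // A loadable module reports "live" once its initialization has finished.
  std::string state;
  in >> state;
  return state == "live" ? DriverState::kLoaded : DriverState::kNotLive;
}
```

A caller could accept both `kLoaded` and `kBuiltIn` and leave it to the later /dev/kfd check to catch a built-in driver that failed to initialize.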

littlewu2508 added a commit to littlewu2508/rocminfo that referenced this issue Dec 19, 2023
Closes: ROCm#42

Signed-off-by: YiyangWu <xgreenlandforwyy@gmail.com>
@dmitrii-galantsev

Hm. I will need to compile the kernel with the driver built-in to test that this approach works.
Will respond in this thread once I make some progress.

Thank you @littlewu2508 for being insistent :)

@ppanchad-amd

@dmitrii-galantsev Any update on this issue? Thanks!

@FireBurn
Author

littlewu2508 added a commit to littlewu2508/rocminfo that referenced this issue Aug 15, 2024
Closes: ROCm#42

Signed-off-by: YiyangWu <xgreenlandforwyy@gmail.com>
@littlewu2508

@ppanchad-amd The issue persists. #43 can still be applied and fixes this issue, so I rebased it onto amd-staging. Please reopen the PR and review it.

@dmitrii-galantsev

Apologies all. rocminfo fell off my radar. All effort on https://github.com/ROCm/amdsmi...

So it works on Gentoo because it includes this patch: https://gitweb.gentoo.org/repo/gentoo.git/tree/dev-util/rocminfo/files/rocminfo-6.0.0-detect-builtin-amdgpu.patch #65

@FireBurn Thanks for that, I will try to apply it.

@dmitrii-galantsev

@littlewu2508 I'm on a modern kernel (6.8.0-1-default+) and was able to rebuild it with amdgpu built in. However, it crashed on boot, which is fine. Even though it crashed, /sys/module/amdgpu/ was created, but /sys/module/amdgpu/initstate wasn't.

On your system, could you please check whether amdgpu works at all? Specifically, is there a /sys/class/drm/card*/device/gpu_metrics file? That would be a good indicator that the amdgpu driver is working.
