Add a pre-check whether host/node is GPU-aware before running any other GPU commands #44

Closed
Tracked by #188
bast opened this issue Jun 16, 2023 · 4 comments
Labels
enhancement New feature or request

Comments

@bast
Member

bast commented Jun 16, 2023

Follow-up from #40. This pre-check should not really be needed, since the subprocess is implemented so that it should never crash or hang, but I would still feel better if we did a quick check first before running any other GPU commands.

@lars-t-hansen
Collaborator

For nvidia we could maybe:

  • check for /dev/nvidia*
  • check to see if nvidia-smi exists
  • try to run nvidia-smi -L to see if there's any output at all

I'd love for there to be something absolutely solid, but those are my best ideas so far for run-time detection. There's module avail, but then we take a dependency on module, and I know for a fact that we have non-NVIDIA machines that have the NVIDIA stack installed.
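
The three run-time checks listed above could be chained roughly as follows. This is only a sketch, not what sonar does: the function names, the 5-second timeout, and the choice to require all three checks to pass are assumptions made for illustration.

```python
import glob
import shutil
import subprocess


def nvidia_device_nodes(pattern="/dev/nvidia*"):
    """Return the device nodes matching the pattern (empty list if none)."""
    return sorted(glob.glob(pattern))


def has_output(argv, timeout_s=5.0):
    """Run argv with a timeout; True only if it exits 0 and prints something."""
    try:
        result = subprocess.run(
            argv, capture_output=True, text=True, timeout=timeout_s
        )
    except (OSError, subprocess.TimeoutExpired):
        # Missing binary, permission problem, or the probe took too long.
        return False
    return result.returncode == 0 and result.stdout.strip() != ""


def nvidia_probably_present(dev_pattern="/dev/nvidia*", smi="nvidia-smi",
                            timeout_s=5.0):
    """Hypothetical pre-check: cheapest test first, probe program last."""
    if not nvidia_device_nodes(dev_pattern):
        return False
    smi_path = shutil.which(smi)
    if smi_path is None:
        return False
    return has_output([smi_path, "-L"], timeout_s)
```

The ordering means nvidia-smi is only started on hosts that already look like they have a card. The timeout limits how long the caller normally waits, but it is a hedge rather than a guarantee that the probe can never hang.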

Other options are:

  • command line switch to sonar ps to tell it to look for nvidia cards, but this adds friction to setup
  • compile-time switch, but this adds friction to deployment

@lars-t-hansen
Collaborator

There's some variant of lspci | egrep 'VGA.*(NVIDIA Corporation|Advanced Micro Devices)', I guess. On my systems this finds all the cards. Indeed, finding the NVIDIA string means we should be able to run nvidia-smi, and finding the AMD string means we should be able to run rocm-smi.
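
A sketch of how that lspci filter could be turned into a decision about which probe to run. The regular expression is the one from the comment, used as-is; the function names, the vendor-to-probe table, and the timeout are hypothetical, and other device classes would need a broader pattern.

```python
import re
import subprocess

# Pattern taken verbatim from the egrep expression in the comment above.
GPU_PATTERN = re.compile(r"VGA.*(NVIDIA Corporation|Advanced Micro Devices)")

# Assumed mapping from vendor string to the probe program named in the thread.
PROBE_FOR_VENDOR = {
    "NVIDIA Corporation": "nvidia-smi",
    "Advanced Micro Devices": "rocm-smi",
}


def probes_from_lspci(lspci_output):
    """Return the set of probe programs suggested by lspci output text."""
    probes = set()
    for line in lspci_output.splitlines():
        match = GPU_PATTERN.search(line)
        if match:
            probes.add(PROBE_FOR_VENDOR[match.group(1)])
    return probes


def run_lspci(timeout_s=5.0):
    """Run lspci with a timeout; return its stdout, or "" on any failure."""
    try:
        result = subprocess.run(
            ["lspci"], capture_output=True, text=True, timeout=timeout_s
        )
    except (OSError, subprocess.TimeoutExpired):
        return ""
    return result.stdout if result.returncode == 0 else ""
```

Calling probes_from_lspci(run_lspci()) yields an empty set on hosts where lspci is missing or lists no matching card, so in that case no GPU probe would be started at all.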

@lars-t-hansen
Collaborator

A complementary approach is that used by the sysinfo prototype: it uses Go's LookPath primitive to resolve the paths for nvidia-smi and rocm-smi to ensure that the programs are found in the path but not in the current directory. This doesn't quite address the core of the problem - we want to be sure that the probe programs don't hang - but may in any case be desirable.
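
For illustration, the same idea in Python: resolve a program name through PATH while refusing anything that would come from the current directory. This is an approximation written for this thread, not a port of Go's LookPath and not the sysinfo prototype's code; the name look_path and the rule of skipping every non-absolute PATH entry are assumptions.

```python
import os


def look_path(program, path=None):
    """Resolve a bare program name via PATH, ignoring the current directory.

    Returns the absolute path of the first executable match, or None.
    """
    if os.sep in program:
        # Only bare names are looked up; "./nvidia-smi" and similar are refused.
        return None
    if path is None:
        path = os.environ.get("PATH", "")
    for entry in path.split(os.pathsep):
        if not os.path.isabs(entry):
            # Skips "", "." and other relative entries, all of which would
            # resolve against the current working directory.
            continue
        candidate = os.path.join(entry, program)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None
```

A caller would run the probe only when look_path("nvidia-smi") or look_path("rocm-smi") returns a path, and would still need a timeout around the probe itself, since resolving the path says nothing about whether the program hangs.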

@lars-t-hansen
Collaborator

As #188 shows, the presence of nvidia-smi is by itself insufficient.

lars-t-hansen pushed a commit to lars-t-hansen/sonar that referenced this issue Oct 7, 2024
bast closed this as completed in 09e4145 Oct 11, 2024
bast added a commit that referenced this issue Oct 11, 2024
Fix #44 - check for presence of GPUs before running probes