Skip ENOENT for vram_str_path and sdma_str_path if files not created when passing some, but not all, GPUs to a docker image #194

jamesxu2 · 2024-08-28T20:07:39Z

This PR is intended to resolve the following issue, in which rocm-smi --showpids unexpectedly returns UNKNOWN values:
ROCm/ROCm#3002

When rocm-smi --showpids is run against a process that does not have access to all GPUs (e.g. a process in a docker image where some, but not all, devices are passed through), GetProcessInfoForPID attempts to enumerate all GPUs on the host and search for vram/sdma/cu_occupancy data for each one. In the VRAM example:

  for (itr = gpu_set->begin(); itr != gpu_set->end(); itr++) {
    uint64_t gpu_id = (*itr);
    std::string vram_str_path = proc_str_path;
    vram_str_path += "/vram_";
    vram_str_path += std::to_string(gpu_id);

    err = ReadSysfsStr(vram_str_path, &tmp);  // returns ENOENT if vram_{gpu_id} was never created
    [...]

So, we attempt to access the file proc_str_path/vram_{gpu_id} for each GPU on the host (where rocm-smi is invoked). However, since the host and the monitored process have different perspectives on which GPUs exist, there is an inconsistency:

The monitored process only creates vram_{gpu_id} files for the GPUs it thinks exist, while the host expects to see vram_{gpu_id} files for all GPUs. As a result, the host attempts to read nonexistent files and returns early out of the loop with ENOENT.

This PR ignores ENOENT when enumerating the vram_ and sdma_ files, so the loop moves on to the next GPU instead of returning prematurely. Any other error still causes an early return. A previous PR, #155, handled the analogous case for cu_occupancy, which may be absent because the device does not support it.
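For reviewers, here is a minimal, self-contained sketch of the skip-on-ENOENT pattern described above. It is not the code in this PR: `SumPerGpuCounter`, the `Reader` callback type, and the summing behaviour are hypothetical names and assumptions made for illustration, with the callback standing in for ReadSysfsStr so the logic can be exercised without sysfs.

```cpp
#include <cassert>
#include <cerrno>
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-in for a sysfs reader: returns 0 on success,
// or an errno value (e.g. ENOENT) on failure.
using Reader = std::function<int(const std::string& path, uint64_t* out)>;

// Hypothetical helper: sums the values found in "<proc_dir>/<prefix><gpu_id>"
// for every GPU id. A missing file (ENOENT) is skipped, because the monitored
// process may not have been given access to that GPU. Any other error aborts
// the loop and is returned to the caller; *total then holds a partial sum.
int SumPerGpuCounter(const std::string& proc_dir, const std::string& prefix,
                     const std::vector<uint64_t>& gpu_ids,
                     const Reader& read_u64, uint64_t* total) {
  *total = 0;
  for (uint64_t gpu_id : gpu_ids) {
    const std::string path = proc_dir + "/" + prefix + std::to_string(gpu_id);
    uint64_t value = 0;
    const int err = read_u64(path, &value);
    if (err == ENOENT) {
      continue;  // file never created for this GPU: skip it, keep going
    }
    if (err != 0) {
      return err;  // real failure: still return early
    }
    *total += value;
  }
  return 0;
}
```

With a reader that reports ENOENT for one of three GPUs, the helper returns 0 and sums the other two values, whereas an error such as EACCES is still propagated.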


Skip missing vram_str_path and sdma_str_path if sysfs files not created

8cad1cf

when passing some, but not all, GPUs to a docker image.

jamesxu2 requested review from bill-shuzhou-liu, dmitrii-galantsev, charis-poag-amd and oliveiradan as code owners August 28, 2024 20:07

jamesxu2 mentioned this pull request Aug 28, 2024

[Issue]: rocmi-smi and rsmi_compute_process_info_by_pid_get() don't show values for processes that don't have access to all installed GPUs (Linux) ROCm/ROCm#3002

Open
