dutyCycle loses data #12

Open
michaelkoetter opened this issue Jul 16, 2019 · 2 comments

Comments

@michaelkoetter

dutyCycle is the GPU utilization during the last "sample period" of the driver, according to NVIDIA docs:

Percent of time over the past sample period during which one or more kernels was executing on the GPU.

Utilization information for a device. Each sample period may be between 1 second and 1/6 second, depending on the product being queried.

https://docs.nvidia.com/deploy/nvml-api/structnvmlUtilization__t.html#structnvmlUtilization__t

So the sample period can be very short, while Prometheus scrape intervals are usually 10 seconds or longer. Each scrape only sees the most recent sample period, so whatever happened between scrapes is lost.
Let's say a workload uses 100% of the GPU for one second, then sleeps for one second: the GPU is 50% busy on average. We don't know exactly when Prometheus will scrape, but there's a good chance it will see only 100% or 0% each time, so the recorded utilization will probably be wrong.
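
A rough way to see the effect is to simulate it. The sketch below is only an illustration, not exporter code: it assumes a 1-second driver sample period and a 10-second scrape interval, and the helper names (`busy_seconds`, `duty_cycle`, `scraped_gauge`) are made up for this example.

```python
def busy_seconds(t: float, on: float = 1.0, off: float = 1.0) -> float:
    """Cumulative busy time in [0, t] for a workload that is busy for
    `on` seconds, then idle for `off` seconds, repeating."""
    full_periods, rest = divmod(t, on + off)
    return full_periods * on + min(rest, on)


def duty_cycle(t: float, window: float = 1.0) -> float:
    """Percent busy over the trailing `window` seconds ending at t.
    This models the gauge: only the latest sample period is visible."""
    return 100.0 * (busy_seconds(t) - busy_seconds(t - window)) / window


def scraped_gauge(offset: float, interval: float = 10.0, scrapes: int = 6) -> list:
    """Gauge values seen by a scraper that fires every `interval` seconds,
    starting `offset` seconds into the workload cycle."""
    return [duty_cycle(offset + interval * k) for k in range(1, scrapes + 1)]


true_average = 100.0 * busy_seconds(60.0) / 60.0

print(true_average)        # 50.0
print(scraped_gauge(0.0))  # [0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
print(scraped_gauge(1.0))  # [100.0, 100.0, 100.0, 100.0, 100.0, 100.0]
```

Because the assumed scrape interval is a multiple of the workload's 2-second cycle, every scrape lands on the same phase, so the gauge reports a constant 0% or 100% while the true average is 50%. Only a scrape that happens to straddle the busy/idle boundary evenly (for example `scraped_gauge(0.5)`) would report 50%.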

Instead, it would be better to expose a ..._seconds_total counter, as is done for CPU utilization: https://www.robustperception.io/understanding-machine-cpu-usage
This way we wouldn't lose data to long Prometheus scrape intervals, but it would probably require more work in the exporter (polling the data at a higher frequency).
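
To make the counter idea concrete, here is a minimal sketch of the accumulation step. Everything in it is an assumption for illustration: `BusySecondsCounter` and `fake_duty_cycle` are hypothetical names, the fake reader stands in for the real NVML call, and the poll interval is assumed to match a 1-second driver sample period so that consecutive readings tile the timeline without gaps.

```python
class BusySecondsCounter:
    """Monotonic counter of GPU-busy seconds, built from duty-cycle readings."""

    def __init__(self) -> None:
        self.total = 0.0

    def observe(self, duty_cycle_percent: float, elapsed_seconds: float) -> None:
        # Each reading covers `elapsed_seconds` of wall time, of which
        # `duty_cycle_percent` percent was busy.
        self.total += duty_cycle_percent / 100.0 * elapsed_seconds


def fake_duty_cycle(t: float, window: float = 1.0) -> float:
    """Stand-in for the driver reading (hypothetical): percent busy over the
    trailing `window` seconds for a workload that is 1 s busy, 1 s idle."""
    def busy(x: float) -> float:
        full_periods, rest = divmod(x, 2.0)
        return full_periods * 1.0 + min(rest, 1.0)
    return 100.0 * (busy(t) - busy(t - window)) / window


counter = BusySecondsCounter()
scrapes = []
for k in range(1, 61):
    # Poll once per second, deliberately out of phase with the workload.
    counter.observe(fake_duty_cycle(0.25 + k), 1.0)
    if k % 10 == 0:
        # Every 10th poll, Prometheus scrapes the current counter value.
        scrapes.append(counter.total)

# What a rate over the counter would yield between consecutive scrapes.
rates = [(later - earlier) / 10.0 for earlier, later in zip(scrapes, scrapes[1:])]
print(rates)  # [0.5, 0.5, 0.5, 0.5, 0.5]
```

The rate between scrapes comes out at 0.5 busy seconds per second, which matches the 50% average, regardless of when the scrapes happen. If the poll interval did not match the driver's actual sample period, readings would overlap or leave gaps, so the result would only be an approximation.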

@rohitagarwal003
Member

Yes, that's why it's marked as a Gauge and not as a Counter. Unfortunately, the NVML API doesn't provide counter-like values.

@rohitagarwal003
Member

You could potentially use https://github.com/mindprince/gonvml/blob/b364b296c7320f5d3dc084aa536a3dba33b68f90/bindings.go#L250-L266, but that would make the exporter more complicated.
