CPU at 100%+ when Nvidia drivers are installed #6057

Closed
numkem opened this issue Aug 1, 2019 · 7 comments · Fixed by #6201

Comments

numkem commented Aug 1, 2019

Nomad version

Nomad 0.9.4

Operating system and Environment details

Debian 9 and 10

Issue

When the NVIDIA drivers are installed and Nomad detects that an NVIDIA card is available, Nomad uses over 100% CPU, usually hovering around 116% (on a dual-core machine).

Reproduction steps

Install Debian 10, install nvidia-driver from the backports repository, and start Nomad.

Job file (if appropriate)

N/A

Nomad Client logs (if appropriate)

I've sent the pprof file generated by Nomad to the OSS email address.

notnoop (Contributor) commented Aug 2, 2019

Thanks for reporting the issue. It looks like we collect device stats every second, and looking up the GPU temperature appears to be slow, e.g. https://github.com/hashicorp/nomad/blob/v0.9.4/plugins/shared/cmd/launcher/command/device.go#L345-L351 .

Given that stats collectors usually sample every 10 seconds or so, collecting stats every second is excessive. We should downsample to 5 or 10 seconds and see whether there are optimizations we can apply.
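
As a quick sanity check on the "slow lookup" theory, one could time a single sample; the sketch below is illustrative only, with `collect` standing in for the nvml temperature/utilization calls linked above (the sleep is just a placeholder, not a measurement):

```go
package main

import (
	"fmt"
	"time"
)

// timeStatsCall measures roughly how long one stats sample takes.
// collect is a stand-in for the real device stats lookup.
func timeStatsCall(collect func()) time.Duration {
	start := time.Now()
	collect()
	return time.Since(start)
}

func main() {
	elapsed := timeStatsCall(func() {
		// Replace with the actual nvml-backed stats collection under test.
		time.Sleep(3 * time.Millisecond)
	})
	fmt.Printf("one stats sample took %v\n", elapsed)
}
```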

numkem (Author) commented Aug 2, 2019

I should probably add that the NVIDIA card in question is a GTX 650, so rather old but good enough for transcoding.

nvidia-smi reports the right temperature.

@ionosphere80

I'm also having this issue on a p2.xlarge instance in AWS with a Tesla K80 GPU running Ubuntu 18.04.

@ionosphere80

Any plans to fix this issue?

@endocrimes (Contributor)

@ionosphere80 @numkem the short-term workaround would be to configure the Nomad telemetry stanza with a collection interval of ~10s (see the sketch below). We'll attempt to adjust this in a future release, but we may have to rethink some of our NVIDIA stats collection.
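
A minimal client configuration sketch of that workaround, assuming the `collection_interval` option of Nomad's telemetry stanza (the 10s value is just an example):

```hcl
# Client agent configuration fragment: sample stats every 10s instead of
# the 1s default.
telemetry {
  collection_interval = "10s"
}
```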

numkem (Author) commented Aug 15, 2019

@endocrimes I've tried what you suggested, and even removed the whole telemetry stanza, to no avail.

notnoop pushed a commit that referenced this issue Aug 23, 2019
Fixes a bug where the CPU is pegged at 100% due to collecting device
statistics. The passed stats interval was ignored, and the default zero
value caused a very tight loop of stats collection.

FWIW, in my testing it took 2.5-3ms to collect NVIDIA GPU stats on a
`g2.2xlarge` EC2 instance.

The stats interval defaults to 1 second and is user configurable. I
believe this is too frequent as a default, and I may advocate for
reducing it to a value closer to 5s or 10s, but I'm keeping it as is for
now.

Fixes #6057 .
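
A sketch of the failure mode described above and the shape of the guard, assuming hypothetical names (`runStatsLoop`, `collect`) rather than the actual plugin code:

```go
package device

import (
	"context"
	"time"
)

// runStatsLoop illustrates the bug: if the caller's interval is dropped and a
// zero value slips through, time.After(0) fires almost immediately on every
// pass, so stats are re-collected back to back and a core is pegged. Honoring
// the passed interval (and guarding the zero value) avoids the busy loop.
func runStatsLoop(ctx context.Context, interval time.Duration, collect func()) {
	if interval <= 0 {
		interval = time.Second // fall back instead of spinning
	}
	for {
		select {
		case <-ctx.Done():
			return
		case <-time.After(interval):
			collect()
		}
	}
}
```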
preetapan pushed a commit that referenced this issue Sep 18, 2019 (same commit message as above).
@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked as resolved and limited conversation to collaborators on Nov 19, 2022