docs: update nvidia driver documentation
notably:
- name of the compiled binary is 'nomad-device-nvidia', not 'nvidia-gpu'
- link to Nvidia docs for installing the container runtime toolkit
- list docker v19.03 as minimum version, to track with nvidia's new container runtime toolkit
shoenig committed May 2, 2022
1 parent dfda28d commit d352ab2
Showing 1 changed file with 35 additions and 84 deletions.

website/content/plugins/devices/nvidia.mdx
@@ -6,7 +6,7 @@ description: The Nvidia Device Plugin detects and makes Nvidia devices available

# Nvidia GPU Device Plugin

Name: `nvidia-gpu`
Name: `nomad-device-nvidia`

The Nvidia device plugin is used to expose Nvidia GPUs to Nomad.

@@ -97,23 +97,29 @@ documentation](https://github.com/NVIDIA/nvidia-container-runtime#environment-va

## Installation Requirements

In order to use the `nvidia-gpu` the following prerequisites must be met:
In order to use the `nomad-device-nvidia` device plugin, the following prerequisites must be met:

1. GNU/Linux x86_64 with kernel version > 3.10
2. NVIDIA GPU with Architecture > Fermi (2.1)
3. NVIDIA drivers >= 340.29 with binary `nvidia-smi`
4. Docker v19.03+
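
A quick way to sanity-check these prerequisites on a host is sketched below; exact output varies by distribution and hardware, and the `--query-gpu` fields shown are just one useful selection.

```shell-session
$ uname -mr          # kernel release and architecture; expect > 3.10 on x86_64
$ nvidia-smi --query-gpu=name,driver_version --format=csv   # expect driver >= 340.29
$ docker --version   # expect 19.03 or newer
```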

### Docker Driver Requirements
### Container Toolkit Installation

Follow Nvidia's [NVIDIA Container Toolkit installation instructions][nvidia_container_toolkit]
to prepare a machine to run Docker containers with Nvidia GPUs. Once installed, you
should be able to run the following command to test your environment; it should produce
meaningful `nvidia-smi` output:

```shell
docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi
```

The Nvidia driver plugin currently only supports the older v1.0 version of the
Docker driver provided by Nvidia. In order to use the Nvidia driver plugin with
the Docker driver, please follow the installation instructions for
[`nvidia-container-runtime`](https://github.com/nvidia/nvidia-container-runtime#installation).

## Plugin Configuration

```hcl
plugin "nvidia-gpu" {
plugin "nomad-device-nvidia" {
  config {
    enabled         = true
    ignored_gpu_ids = ["GPU-fef8089b", "GPU-ac81e44d"]
@@ -122,7 +128,7 @@ plugin "nvidia-gpu" {
}
```

The `nvidia-gpu` device plugin supports the following configuration in the agent
The `nomad-device-nvidia` device plugin supports the following configuration in the agent
config:

- `enabled` `(bool: true)` - Control whether the plugin should be enabled and running.
@@ -133,86 +139,40 @@ config:
- `fingerprint_period` `(string: "1m")` - The period in which to fingerprint for
device changes.
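
The IDs used by `ignored_gpu_ids` are the GPU UUIDs reported by the driver. One way to look them up is `nvidia-smi -L`, which lists each GPU with its UUID (the output below is illustrative, echoing the IDs from the example config):

```shell-session
$ nvidia-smi -L
GPU 0: Tesla K80 (UUID: GPU-fef8089b-...)
GPU 1: Tesla K80 (UUID: GPU-ac81e44d-...)
```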

## Restrictions
## Limitations

The Nvidia integration only works with task drivers that natively integrate with
Nvidia's [container runtime
library](https://github.com/NVIDIA/libnvidia-container).

Nomad has tested support with the [`docker` driver][docker-driver] and plans to
bring support to the built-in [`exec`][exec-driver] and [`java`][java-driver]
drivers. Support for [`lxc`][lxc-driver] should be possible by installing the
[Nvidia hook](https://github.com/lxc/lxc/blob/master/hooks/nvidia) but is not
tested or documented by Nomad.
Nomad has tested support with the [`docker` driver][docker-driver]. Support for
[`lxc`][lxc-driver] should be possible by installing the [Nvidia hook][nvidia_hook]
but is not tested or documented by Nomad.

## Source Code & Compiled Binaries

The source code for this plugin can be found at [hashicorp/nomad-device-nvidia][source]. You
can also find pre-built binaries on the [releases page][nvidia_plugin_download].
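
A rough install sketch follows; the version number and plugin directory below are assumptions, so substitute the current release and your agent's [`plugin_dir`] setting:

```shell-session
$ # 1.0.0 and /opt/nomad/plugins are placeholders
$ curl -fsSL -o nomad-device-nvidia.zip \
    https://releases.hashicorp.com/nomad-device-nvidia/1.0.0/nomad-device-nvidia_1.0.0_linux_amd64.zip
$ unzip nomad-device-nvidia.zip -d /opt/nomad/plugins
```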

## Examples

Inspect a node with a GPU:

```shell-session
$ nomad node status 4d46e59f
ID            = 4d46e59f
Name          = nomad
Class         = <none>
DC            = dc1
Drain         = false
Eligibility   = eligible
Status        = ready
Uptime        = 19m43s
Driver Status = docker,mock_driver,raw_exec

Node Events
Time                  Subsystem  Message
2019-01-23T18:25:18Z  Cluster    Node registered

Allocated Resources
CPU          Memory      Disk
0/15576 MHz  0 B/55 GiB  0 B/28 GiB

Allocation Resource Utilization
CPU          Memory
0/15576 MHz  0 B/55 GiB

Host Resource Utilization
CPU             Memory          Disk
2674/15576 MHz  1.5 GiB/55 GiB  3.0 GiB/31 GiB

// ...TRUNCATED...

Device Resource Utilization
nvidia/gpu/Tesla K80[GPU-e1f6f4f1-1ea5-7b9d-5f03-338a9dc32416]  0 / 11441 MiB

Allocations
No allocations placed
```

Display detailed statistics on a node with a GPU:

```shell-session
$ nomad node status -stats 4d46e59f
ID            = 4d46e59f
Name          = nomad
Class         = <none>
DC            = dc1
Drain         = false
Eligibility   = eligible
Status        = ready
Uptime        = 19m59s
Driver Status = docker,mock_driver,raw_exec

Node Events
Time                  Subsystem  Message
2019-01-23T18:25:18Z  Cluster    Node registered

Allocated Resources
CPU          Memory      Disk
0/15576 MHz  0 B/55 GiB  0 B/28 GiB

Allocation Resource Utilization
CPU          Memory
0/15576 MHz  0 B/55 GiB

Host Resource Utilization
CPU             Memory          Disk
2673/15576 MHz  1.5 GiB/55 GiB  3.0 GiB/31 GiB

// ...TRUNCATED...

Device Resource Utilization
nvidia/gpu/Tesla K80[GPU-e1f6f4f1-1ea5-7b9d-5f03-338a9dc32416]  0 / 11441 MiB
@@ -232,9 +192,6 @@ Memory state = 0 / 11441 MiB
Memory utilization = 0 %
Power usage        = 37 / 149 W
Temperature        = 34 C

Allocations
No allocations placed
```

Run the following example job to see that the GPU was mounted in the container:
@@ -250,7 +207,7 @@ job "gpu-test" {
driver = "docker"

config {
  image   = "nvidia/cuda:9.0-base"
  image   = "nvidia/cuda:11.0-base"
  command = "nvidia-smi"
}
@@ -280,18 +237,8 @@ $ nomad run example.nomad
==> Evaluation "21bd7584" finished with status "complete"
$ nomad alloc status d250baed
ID                  = d250baed
Eval ID             = 21bd7584
Name                = gpu-test.smi[0]
Node ID             = 4d46e59f
Job ID              = example
Job Version         = 0
Client Status       = complete
Client Description  = All tasks have completed
Desired Status      = run
Desired Description = <none>
Created             = 7s ago
Modified            = 2s ago

// ...TRUNCATED...

Task "smi" is "dead"

Task Resources
@@ -334,10 +281,14 @@ Wed Jan 23 18:25:32 2019
+-----------------------------------------------------------------------------+
```
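
For reference, a minimal complete version of the jobspec excerpted above might look like the following sketch; the `device "nvidia/gpu"` stanza in `resources` is what actually requests a GPU from the plugin:

```hcl
job "gpu-test" {
  datacenters = ["dc1"]

  group "smi" {
    task "smi" {
      driver = "docker"

      config {
        image   = "nvidia/cuda:11.0-base"
        command = "nvidia-smi"
      }

      resources {
        # request one Nvidia GPU from the device plugin
        device "nvidia/gpu" {
          count = 1
        }
      }
    }
  }
}
```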


[docker-driver]: /docs/drivers/docker 'Nomad docker Driver'
[exec-driver]: /docs/drivers/exec 'Nomad exec Driver'
[java-driver]: /docs/drivers/java 'Nomad java Driver'
[lxc-driver]: /plugins/drivers/community/lxc 'Nomad lxc Driver'
[`plugin`]: /docs/configuration/plugin
[`plugin_dir`]: /docs/configuration#plugin_dir
[nvidia_hook]: https://github.com/lxc/lxc/blob/master/hooks/nvidia
[nvidia_plugin_download]: https://releases.hashicorp.com/nomad-device-nvidia/
[nvidia_container_toolkit]: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html
[source]: https://github.com/hashicorp/nomad-device-nvidia
