variants: add aws-k8s-1.21-nvidia
Signed-off-by: Arnaldo Garcia Rincon <agarrcia@amazon.com>
arnaldo2792 committed Jan 24, 2022
1 parent 8896f88 commit 54415cb
Showing 22 changed files with 209 additions and 1 deletion.
28 changes: 28 additions & 0 deletions BUILDING.md
@@ -121,6 +121,34 @@ licenses = [
]
```

#### NVIDIA variants

If you want to build the `aws-k8s-1.21-nvidia` variant, you can follow these steps to prepare a `Licenses.toml` file using the [License for customer use of NVIDIA software](https://www.nvidia.com/en-us/drivers/nvidia-license/):

1. Create a `Licenses.toml` file in your Bottlerocket root directory, with the following content:

```toml
[nvidia]
spdx-id = "LicenseRef-NVIDIA-Customer-Use"
licenses = [
{ path = "LICENSE", license-url = "https://www.nvidia.com/en-us/drivers/nvidia-license/" }
]
```

2. Fetch the licenses with this command:

```shell
cargo make fetch-licenses -e BUILDSYS_UPSTREAM_LICENSES_FETCH=true
```

3. Build your image, setting the `BUILDSYS_UPSTREAM_SOURCE_FALLBACK` flag to `true`, if you haven't cached the driver's sources:

```shell
cargo make \
-e BUILDSYS_VARIANT=aws-k8s-1.21-nvidia \
-e BUILDSYS_UPSTREAM_SOURCE_FALLBACK="true"
```

### Register an AMI

To use the image in Amazon EC2, we need to register the image as an AMI.
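
A minimal sketch of the usual registration flow, assuming the repository's `ami` task and a `PUBLISH_REGIONS` variable — both names are assumptions here, so follow the registration instructions in the rest of BUILDING.md for the authoritative commands:

```shell
# Sketch: register the freshly built image as an AMI in one region.
# Task and variable names are assumptions, not taken from this change.
cargo make \
  -e BUILDSYS_VARIANT=aws-k8s-1.21-nvidia \
  -e PUBLISH_REGIONS=us-west-2 \
  ami
```
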
14 changes: 14 additions & 0 deletions QUICKSTART-EKS.md
@@ -369,3 +369,17 @@ Once it launches, you should be able to run pods on your Bottlerocket instance u

For example, to run busybox:
`kubectl run -i -t busybox --image=busybox --restart=Never`

### aws-k8s-1.21-nvidia variant

The `aws-k8s-1.21-nvidia` variant includes the required packages and configurations to leverage NVIDIA GPUs.
It comes with the [NVIDIA Tesla driver](https://docs.nvidia.com/datacenter/tesla/drivers/index.html) along with the libraries required by the [CUDA toolkit](https://developer.nvidia.com/cuda-toolkit) included in your orchestrated containers.
It also includes the [NVIDIA k8s device plugin](https://github.com/NVIDIA/k8s-device-plugin).
If you already have a daemonset for the device plugin in your cluster, you may need to use taints and tolerations to keep it from running on Bottlerocket nodes.
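
To confirm that GPUs are actually exposed to workloads, a pod can request the `nvidia.com/gpu` resource that the device plugin advertises. A minimal sketch — the pod name and CUDA image tag below are illustrative assumptions, not part of this change:

```shell
# Launch a throwaway pod that requests one GPU and runs nvidia-smi.
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvidia/cuda:11.4.2-base-ubuntu20.04   # illustrative tag
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
EOF

# The pod's logs should show the Tesla driver version if the GPU is visible.
kubectl logs gpu-smoke-test
```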

Additional NVIDIA tools such as [DCGM](https://github.com/NVIDIA/dcgm-exporter) and [GPU Feature Discovery](https://github.com/NVIDIA/gpu-feature-discovery) will work as expected.
You can install them in your cluster by following the `helm install` instructions provided for each project.
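
For instance, the DCGM exporter can be installed with Helm roughly as follows; the repository URL and chart name come from the project's own instructions and may have changed, so treat them as assumptions:

```shell
# Add the DCGM exporter chart repository and install it (repo URL and chart name are assumptions).
helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts
helm repo update
helm install dcgm-exporter gpu-helm-charts/dcgm-exporter
```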

The [GPU Operator](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/getting-started.html#install-nvidia-gpu-operator) can also be used to install these tools.
However, it is cumbersome to select the right subset of features to avoid conflicts with the software included in the variant.
Therefore we recommend installing the tools individually if they are required.
1 change: 1 addition & 0 deletions README.md
@@ -54,6 +54,7 @@ The following variants support EKS, as described above:
- `aws-k8s-1.19`
- `aws-k8s-1.20`
- `aws-k8s-1.21`
- `aws-k8s-1.21-nvidia`

The following variant supports ECS:

2 changes: 2 additions & 0 deletions SECURITY_FEATURES.md
@@ -134,6 +134,8 @@ All binaries are linked with the following options:

Together these enable [full RELRO support](https://www.redhat.com/en/blog/hardening-elf-binaries-using-relocation-read-only-relro) which makes [ROP](https://en.wikipedia.org/wiki/Return-oriented_programming) attacks more difficult to execute.

**Note:** Certain variants, such as the ones for NVIDIA, include precompiled binaries that may not have been built with these hardening flags.
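
One way to spot-check whether a particular binary carries these protections is to inspect its ELF headers; a quick sketch, with the binary path chosen purely for illustration:

```shell
# Full RELRO requires a GNU_RELRO segment plus the BIND_NOW dynamic flag.
readelf -lW /usr/bin/apiclient | grep GNU_RELRO
readelf -d /usr/bin/apiclient | grep -E 'BIND_NOW|FLAGS'
```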

### SELinux enabled in enforcing mode

Bottlerocket enables SELinux by default, sets it to enforcing mode, and loads the policy during boot.
1 change: 1 addition & 0 deletions sources/logdog/conf/logdog.aws-k8s-1.21-nvidia.conf
@@ -0,0 +1,7 @@
[configuration-files.containerd-config-toml]
# No override to path
template-path = "/usr/share/templates/containerd-config-toml_k8s_nvidia"

# Image registries
[metadata.settings.container-registry]
affected-services = ["containerd", "host-containers", "bootstrap-containers"]
13 changes: 13 additions & 0 deletions sources/models/shared-defaults/nvidia-oci-hooks.toml
@@ -0,0 +1,13 @@
[settings.oci-hooks]
log4j-hotpatch-enabled = false

[metadata.settings.oci-hooks]
affected-services = ["oci-hooks"]

[services.oci-hooks]
configuration-files = ["oci-hooks"]
restart-commands = []

[configuration-files.oci-hooks]
path = "/etc/shimpei/nvidia-oci-hooks.json"
template-path = "/usr/share/templates/nvidia-oci-hooks-json"
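
These shared defaults wire up the NVIDIA OCI hooks configuration and disable the log4j hotpatch hook out of the box; like other settings, it can then be changed on a running node through the API. A sketch using `apiclient` — the exact invocation is an assumption:

```shell
# Enable the log4j hotpatch OCI hook at runtime (it defaults to false above).
apiclient set oci-hooks.log4j-hotpatch-enabled=true
```
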
29 changes: 29 additions & 0 deletions sources/models/src/aws-k8s-1.21-nvidia/mod.rs
@@ -0,0 +1,29 @@
use model_derive::model;
use serde::{Deserialize, Serialize};
use std::collections::HashMap;

use crate::modeled_types::Identifier;
use crate::{
AwsSettings, BootstrapContainer, HostContainer, KernelSettings, KubernetesSettings,
MetricsSettings, NetworkSettings, NtpSettings, OciHooks, PemCertificate, RegistrySettings,
UpdatesSettings,
};

// Note: we have to use 'rename' here because the top-level Settings structure is the only one
// that uses its name in serialization; internal structures use the field name that points to it
#[model(rename = "settings", impl_default = true)]
struct Settings {
motd: String,
kubernetes: KubernetesSettings,
updates: UpdatesSettings,
host_containers: HashMap<Identifier, HostContainer>,
bootstrap_containers: HashMap<Identifier, BootstrapContainer>,
ntp: NtpSettings,
network: NetworkSettings,
kernel: KernelSettings,
aws: AwsSettings,
metrics: MetricsSettings,
pki: HashMap<Identifier, PemCertificate>,
container_registry: RegistrySettings,
oci_hooks: OciHooks,
}
50 changes: 49 additions & 1 deletion variants/Cargo.lock

Some generated files are not rendered by default.

1 change: 1 addition & 0 deletions variants/Cargo.toml
@@ -8,6 +8,7 @@ members = [
"aws-k8s-1.21",
"metal-dev",
"metal-k8s-1.21",
"aws-k8s-1.21-nvidia",
"vmware-dev",
"vmware-k8s-1.20",
"vmware-k8s-1.21",
7 changes: 7 additions & 0 deletions variants/README.md
@@ -61,6 +61,13 @@ It supports self-hosted clusters and clusters managed by [EKS](https://aws.amazon.com/eks/).

This variant is compatible with Kubernetes 1.21, 1.22, and 1.23 clusters.

### aws-k8s-1.21-nvidia: Kubernetes 1.21 node

The [aws-k8s-1.21-nvidia](aws-k8s-1.21-nvidia/Cargo.toml) variant includes the packages needed to run a Kubernetes node in AWS.
It also includes the required packages to configure containers to leverage NVIDIA GPUs.
It supports self-hosted clusters and clusters managed by [EKS](https://aws.amazon.com/eks/).
This variant is compatible with Kubernetes 1.21, 1.22, and 1.23 clusters.

### aws-ecs-1: Amazon ECS container instance

The [aws-ecs-1](aws-ecs-1/Cargo.toml) variant includes the packages needed to run an [Amazon ECS](https://ecs.aws)
39 changes: 39 additions & 0 deletions variants/aws-k8s-1.21-nvidia/Cargo.toml
@@ -0,0 +1,39 @@
[package]
# This is the aws-k8s-1.21-nvidia variant. "." is not allowed in crate names, but we
# don't use this crate name anywhere.
name = "aws-k8s-1_21-nvidia"
version = "0.1.0"
edition = "2018"
publish = false
build = "build.rs"
# Don't rebuild crate just because of changes to README.
exclude = ["README.md"]

[package.metadata.build-variant]
included-packages = [
"aws-iam-authenticator",
"cni",
"cni-plugins",
"kernel-5.10",
"kubelet-1.21",
"release",
"nvidia-container-toolkit",
"kmod-5.10-nvidia-tesla-470"
]
kernel-parameters = [
"console=tty0",
"console=ttyS0,115200n8",
]

[lib]
path = "lib.rs"

[build-dependencies]
aws-iam-authenticator = { path = "../../packages/aws-iam-authenticator" }
cni = { path = "../../packages/cni" }
cni-plugins = { path = "../../packages/cni-plugins" }
kernel-5_10 = { path = "../../packages/kernel-5.10" }
kubernetes-1_21 = { path = "../../packages/kubernetes-1.21" }
release = { path = "../../packages/release" }
nvidia-container-toolkit = { path = "../../packages/nvidia-container-toolkit" }
kmod-5_10-nvidia = { path = "../../packages/kmod-5.10-nvidia" }
9 changes: 9 additions & 0 deletions variants/aws-k8s-1.21-nvidia/build.rs
@@ -0,0 +1,9 @@
use std::process::{exit, Command};

fn main() -> Result<(), std::io::Error> {
let ret = Command::new("buildsys").arg("build-variant").status()?;
if !ret.success() {
exit(1);
}
Ok(())
}
1 change: 1 addition & 0 deletions variants/aws-k8s-1.21-nvidia/lib.rs
@@ -0,0 +1 @@
// not used
