Remove nv-peer-mem
Signed-off-by: Fred Rolland <frolland@nvidia.com>
rollandf committed Jul 10, 2023
1 parent b83f735 commit c4eebc2
Showing 20 changed files with 8 additions and 623 deletions.
13 changes: 3 additions & 10 deletions README.md
Original file line number Diff line number Diff line change
@@ -92,9 +92,6 @@ NICClusterPolicy CRD Spec includes the following sub-states:
and related configurations.
- `sriovDevicePlugin`: [SR-IOV Network Device Plugin](https://github.com/k8snetworkplumbingwg/sriov-network-device-plugin)
and related configurations.
-- `nvPeerDriver`: [NVIDIA Peer Memory client driver container](https://github.com/Mellanox/ofed-docker)
-to be deployed on RDMA & GPU supporting nodes (required for GPUDirect workloads).
-For NVIDIA GPU driver version < 465. Check [compatibility notes](#compatibility-notes) for details
- `ibKubernetes`: [InfiniBand Kubernetes](https://github.com/Mellanox/ib-kubernetes/) and related configurations.
- `secondaryNetwork`: Specifies components to deploy in order to facilitate a secondary network in Kubernetes. It consists of the following optionally deployed components:
- [Multus-CNI](https://github.com/intel/multus-cni): Delegate CNI plugin to support secondary networks in Kubernetes
@@ -109,8 +106,7 @@ to be deployed on RDMA & GPU supporting nodes (required for GPUDirect workloads)

##### Example for NICClusterPolicy resource:
-In the example below we request OFED driver to be deployed together with RDMA shared device plugin
-but without NV Peer Memory driver.
+In the example below we request OFED driver to be deployed together with RDMA shared device plugin.

```
apiVersion: mellanox.com/v1alpha1
@@ -263,8 +259,6 @@ status:
state: ignore
- name: state-RDMA-device-plugin
state: ready
-- name: state-NV-Peer
-state: ignore
- name: state-ib-kubernetes
state: ignore
- name: state-nv-ipam-cni
@@ -443,8 +437,7 @@ The following Network Adapters have been tested with NVIDIA Network Operator:

## Compatibility Notes
* NVIDIA Network Operator is compatible with NVIDIA GPU Operator v1.5.2 and above
-* Network Operator will deploy nvPeerDriver POD on a node only if NVIDIA GPU driver version < 465.
-Starting from v465 NVIDIA GPU driver includes a built-in nvidia_peermem module
+* Starting from v465 NVIDIA GPU driver includes a built-in nvidia_peermem module
which is a replacement for nv_peer_mem module. NVIDIA GPU operator manages nvidia_peermem module loading.

## Deployment Example
@@ -480,7 +473,7 @@ making them available to the kernel.

While this approach may seem odd. It provides a way to deliver drivers to immutable systems.

-[Mellanox OFED and NV Peer Memory driver container](https://github.com/Mellanox/ofed-docker)
+[Mellanox OFED container](https://github.com/Mellanox/ofed-docker)

## Upgrade
Check [Upgrade section in Helm Chart documentation](deployment/network-operator/README.md#upgrade) for details.
9 changes: 0 additions & 9 deletions api/v1alpha1/nicclusterpolicy_types.go
@@ -152,14 +152,6 @@ type DrainSpec struct {
DeleteEmptyDir bool `json:"deleteEmptyDir,omitempty"`
}

-// NVPeerDriverSpec describes configuration options for NV Peer Memory driver
-type NVPeerDriverSpec struct {
-// Image information for nv peer memory driver container
-ImageSpec `json:""`
-// GPU driver sources path - Optional
-GPUDriverSourcePath string `json:"gpuDriverSourcePath,omitempty"`
-}
-
// DevicePluginSpec describes configuration options for device plugin
// 1. Image information for device plugin
// 2. Device plugin configuration
@@ -227,7 +219,6 @@ type NicClusterPolicySpec struct {
NodeAffinity *v1.NodeAffinity `json:"nodeAffinity,omitempty"`
Tolerations []v1.Toleration `json:"tolerations,omitempty"`
OFEDDriver *OFEDDriverSpec `json:"ofedDriver,omitempty"`
-NVPeerDriver *NVPeerDriverSpec `json:"nvPeerDriver,omitempty"`
RdmaSharedDevicePlugin *DevicePluginSpec `json:"rdmaSharedDevicePlugin,omitempty"`
SriovDevicePlugin *DevicePluginSpec `json:"sriovDevicePlugin,omitempty"`
IBKubernetes *IBKubernetesSpec `json:"ibKubernetes,omitempty"`
21 changes: 0 additions & 21 deletions api/v1alpha1/zz_generated.deepcopy.go

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

25 changes: 0 additions & 25 deletions config/crd/bases/mellanox.com_nicclusterpolicies.yaml
@@ -299,31 +299,6 @@ spec:
- repository
- version
type: object
-nvPeerDriver:
-description: NVPeerDriverSpec describes configuration options for
-NV Peer Memory driver
-properties:
-gpuDriverSourcePath:
-description: GPU driver sources path - Optional
-type: string
-image:
-pattern: '[a-zA-Z0-9\-]+'
-type: string
-imagePullSecrets:
-items:
-type: string
-type: array
-repository:
-pattern: '[a-zA-Z0-9\.\-\/]+'
-type: string
-version:
-pattern: '[a-zA-Z0-9\.-]+'
-type: string
-required:
-- image
-- repository
-- version
-type: object
ofedDriver:
description: OFEDDriverSpec describes configuration options for OFED
driver
4 changes: 2 additions & 2 deletions controllers/nicclusterpolicy_controller.go
@@ -109,8 +109,8 @@ func (r *NicClusterPolicyReconciler) Reconcile(ctx context.Context, req ctrl.Req

// Create a new State service catalog
sc := state.NewInfoCatalog()
-if instance.Spec.OFEDDriver != nil || instance.Spec.NVPeerDriver != nil ||
-instance.Spec.RdmaSharedDevicePlugin != nil || instance.Spec.SriovDevicePlugin != nil {
+if instance.Spec.OFEDDriver != nil || instance.Spec.RdmaSharedDevicePlugin != nil ||
+instance.Spec.SriovDevicePlugin != nil {
// Create node infoProvider and add to the service catalog
reqLogger.V(consts.LogLevelInfo).Info("Creating Node info provider")
nodeList := &corev1.NodeList{}
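The controller hunk above narrows the condition that decides whether a node info provider is created: after this commit only `OFEDDriver`, `RdmaSharedDevicePlugin`, and `SriovDevicePlugin` are consulted. The sketch below isolates that predicate so it can be read and tested on its own. The struct types are empty stand-ins that mirror only the field names visible in the diff, and `needsNodeInfo` is a hypothetical name; the real controller performs the check inline in `Reconcile`.

```go
package main

import "fmt"

// Stand-in types: only the field names come from the diff. The real
// definitions live in api/v1alpha1 and carry many more fields.
type OFEDDriverSpec struct{}
type DevicePluginSpec struct{}

type NicClusterPolicySpec struct {
	OFEDDriver             *OFEDDriverSpec
	RdmaSharedDevicePlugin *DevicePluginSpec
	SriovDevicePlugin      *DevicePluginSpec
}

// needsNodeInfo (hypothetical helper) mirrors the post-commit condition:
// node information is gathered when any node-scoped component is requested.
func needsNodeInfo(spec NicClusterPolicySpec) bool {
	return spec.OFEDDriver != nil ||
		spec.RdmaSharedDevicePlugin != nil ||
		spec.SriovDevicePlugin != nil
}

func main() {
	empty := NicClusterPolicySpec{}
	withOFED := NicClusterPolicySpec{OFEDDriver: &OFEDDriverSpec{}}

	fmt.Println(needsNodeInfo(empty))    // false
	fmt.Println(needsNodeInfo(withOFED)) // true
}
```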
15 changes: 1 addition & 14 deletions deployment/network-operator/README.md
@@ -380,17 +380,6 @@ imagePullSecrets:
| `ofedDriver.readinessProbe.initialDelaySeconds` | int | 10 | Mellanox OFED readiness probe initial delay |
| `ofedDriver.readinessProbe.periodSeconds` | int | 30 | Mellanox OFED readiness probe interval |

-#### NVIDIA Peer memory driver
-
-| Name | Type | Default | description |
-| ---- | ---- | ------- | ----------- |
-| `nvPeerDriver.deploy` | bool | `false` | deploy NVIDIA Peer memory driver container |
-| `nvPeerDriver.repository` | string | `mellanox` | NVIDIA Peer memory driver image repository |
-| `nvPeerDriver.image` | string | `nv-peer-mem-driver` | NVIDIA Peer memory driver image name |
-| `nvPeerDriver.version` | string | `1.1-0` | NVIDIA Peer memory driver version |
-| `nvPeerDriver.imagePullSecrets` | list | `[]` | An optional list of references to secrets to use for pulling any of the NVIDIA Peer memory driver image |
-| `nvPeerDriver.gpuDriverSourcePath` | string | `/run/nvidia/driver` | GPU driver soruces root filesystem path(usually used in tandem with [gpu-operator](https://github.com/NVIDIA/gpu-operator)) |
-
#### RDMA Device Plugin

| Name | Type | Default | description |
@@ -595,7 +584,7 @@ rdmaSharedDevicePlugin:

#### Example 2

-Network Operator deployment with the default version of OFED and NV Peer mem driver, RDMA device plugin with two RDMA
+Network Operator deployment with the default version of OFED, RDMA device plugin with two RDMA
resources, the first mapped to `enp1` and `enp2`, the second mapped to `ib0`.

__values.yaml:__
@@ -604,8 +593,6 @@ __values.yaml:__
deployCR: true
ofedDriver:
deploy: true
-nvPeerDriver:
-deploy: true
rdmaSharedDevicePlugin:
deploy: true
resources:
Original file line number Diff line number Diff line change
@@ -299,31 +299,6 @@ spec:
- repository
- version
type: object
-nvPeerDriver:
-description: NVPeerDriverSpec describes configuration options for
-NV Peer Memory driver
-properties:
-gpuDriverSourcePath:
-description: GPU driver sources path - Optional
-type: string
-image:
-pattern: '[a-zA-Z0-9\-]+'
-type: string
-imagePullSecrets:
-items:
-type: string
-type: array
-repository:
-pattern: '[a-zA-Z0-9\.\-\/]+'
-type: string
-version:
-pattern: '[a-zA-Z0-9\.-]+'
-type: string
-required:
-- image
-- repository
-- version
-type: object
ofedDriver:
description: OFEDDriverSpec describes configuration options for OFED
driver
16 changes: 0 additions & 16 deletions deployment/network-operator/templates/_helpers.tpl
@@ -101,22 +101,6 @@ imagePullSecrets helpers
{{- $imagePullSecrets | toJson }}
{{- end }}

-{{- define "network-operator.nvPeerDriver.imagePullSecrets" }}
-{{- $imagePullSecrets := list }}
-{{- if .Values.nvPeerDriver.imagePullSecrets }}
-{{- range .Values.nvPeerDriver.imagePullSecrets }}
-{{- $imagePullSecrets = append $imagePullSecrets . }}
-{{- end }}
-{{- else }}
-{{- if .Values.imagePullSecrets }}
-{{- range .Values.imagePullSecrets }}
-{{- $imagePullSecrets = append $imagePullSecrets . }}
-{{- end }}
-{{- end }}
-{{- end }}
-{{- $imagePullSecrets | toJson }}
-{{- end }}
-
{{- define "network-operator.rdmaSharedDevicePlugin.imagePullSecrets" }}
{{- $imagePullSecrets := list }}
{{- if .Values.rdmaSharedDevicePlugin.imagePullSecrets }}
Original file line number Diff line number Diff line change
@@ -67,14 +67,6 @@ spec:
deleteEmptyDir: {{ .Values.ofedDriver.upgradePolicy.drain.deleteEmptyDir | default false}}
{{- end }}
{{- end }}
-{{- if .Values.nvPeerDriver.deploy }}
-nvPeerDriver:
-image: {{ .Values.nvPeerDriver.image }}
-repository: {{ .Values.nvPeerDriver.repository }}
-version: {{ .Values.nvPeerDriver.version }}
-imagePullSecrets: {{ include "network-operator.nvPeerDriver.imagePullSecrets" . }}
-gpuDriverSourcePath: {{ .Values.nvPeerDriver.gpuDriverSourcePath }}
-{{- end }}
{{- if .Values.rdmaSharedDevicePlugin.deploy }}
rdmaSharedDevicePlugin:
# {{ required "A valid value for .Values.rdmaSharedDevicePlugin.resources is required" .Values.rdmaSharedDevicePlugin.resources }}
8 changes: 0 additions & 8 deletions deployment/network-operator/values.yaml
@@ -198,14 +198,6 @@ ofedDriver:
timeoutSeconds: 300
deleteEmptyDir: false

-nvPeerDriver:
-deploy: false
-image: nv-peer-mem-driver
-repository: mellanox
-version: 1.1-0
-# imagePullSecrets: []
-gpuDriverSourcePath: /run/nvidia/driver
-
rdmaSharedDevicePlugin:
deploy: true
image: k8s-rdma-shared-dev-plugin
1 change: 0 additions & 1 deletion example/README.md
@@ -8,7 +8,6 @@ We assume familiarity with RDMA, Kubernetes and related CNI project.
network-operator, at this stage, deploys and configures the follwoing components:
* [Mellanox OFED](https://www.mellanox.com/products/infiniband-drivers/linux/mlnx_ofed) driver container
* [RDMA device plugin](https://github.com/Mellanox/k8s-rdma-shared-dev-plugin)
-* [NVIDIA peer memory client](https://github.com/Mellanox/nv_peer_memory) driver container
* SecondaryNetwork`: Specifies components to deploy in order to facilitate a secondary network in Kubernetes. It consists of the following optionally deployed components:
- [Multus-CNI](https://github.com/intel/multus-cni): Delegate CNI plugin to support secondary networks in Kubernetes
- CNI plugins: Currently only [containernetworking-plugins](https://github.com/containernetworking/plugins) is supported
10 changes: 1 addition & 9 deletions example/dgx-helm-values.yaml
@@ -14,7 +14,7 @@

# Custom Helm Chart values example to deploy Network Operator
# at NVIDIA DGX SYSTEMS.
-# We don't need to deploy MOFED and nv-peer-mem containers.
+# We don't need to deploy MOFED container.

nfd:
enabled: true
@@ -24,14 +24,6 @@ deployCR: true
ofedDriver:
deploy: false

-nvPeerDriver:
-deploy: false
-image: nv-peer-mem-driver
-repository: mellanox
-version: 1.1-0
-imagePullSecrets: []
-gpuDriverSourcePath: /run/nvidia/driver
-
secondaryNetwork:
deploy: true
cniPlugins:
8 changes: 0 additions & 8 deletions hack/templates/values/values.template
@@ -198,14 +198,6 @@ ofedDriver:
timeoutSeconds: 300
deleteEmptyDir: false

-nvPeerDriver:
-deploy: false
-image: nv-peer-mem-driver
-repository: mellanox
-version: 1.1-0
-# imagePullSecrets: []
-gpuDriverSourcePath: /run/nvidia/driver
-
rdmaSharedDevicePlugin:
deploy: true
image: {{ .RdmaSharedDevicePlugin.Image }}

This file was deleted.

16 changes: 0 additions & 16 deletions manifests/state-nv-peer-mem-driver/0020_role.openshift.yaml

This file was deleted.

18 changes: 0 additions & 18 deletions manifests/state-nv-peer-mem-driver/0030_rolebinding.openshift.yaml

This file was deleted.
