Elastic Fabric Adapter support in OpenShift

This is a proof of concept showing how to get Elastic Fabric Adapter (EFA) support working in OpenShift.

Prerequisites

  • a working OpenShift cluster on Amazon Web Services, with its kubeconfig available
  • aws-cli configured with proper AWS credentials

Node configuration

  1. Hugepages configured and reserved: manifests/hugepages.yaml
  2. Memlock ulimits set to unlimited (hard and soft): manifests/unlimited-memlock.yaml
  3. A DaemonSet running the EFA device plugin, which exposes EFA capabilities to pods: manifests/efa-k8s-device-plugin.yml
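
To confirm that the first two settings took effect, a check along the following lines can help. This is a minimal sketch, assuming bash and a Linux-style meminfo file; `check_node_settings` is a hypothetical helper that is not part of this repository, and running it from a shell inside a pod scheduled on the configured node is an assumption about where the limits matter.

```shell
#!/usr/bin/env bash
# Sketch: report the hugepage reservation and memlock limits seen by this shell.
# check_node_settings is a hypothetical helper, not part of this repository.

check_node_settings() {
    # Optional first argument: a meminfo-style file (defaults to /proc/meminfo).
    local meminfo="${1:-/proc/meminfo}"
    local total soft hard

    # HugePages_Total is the number of hugepages currently reserved.
    total=$(awk '/^HugePages_Total:/ {print $2}' "$meminfo")
    # ulimit -l prints the memlock limit in KiB, or the word "unlimited".
    soft=$(ulimit -S -l)
    hard=$(ulimit -H -l)

    echo "hugepages_total=${total:-0}"
    echo "memlock_soft=${soft}"
    echo "memlock_hard=${hard}"

    # Succeed only if hugepages are reserved and both limits are unlimited.
    [ "${total:-0}" -gt 0 ] && [ "$soft" = "unlimited" ] && [ "$hard" = "unlimited" ]
}
```

Calling `check_node_settings` with no argument reads /proc/meminfo; a non-zero exit status means at least one of the settings is not in place for the shell that ran it.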

OpenShift configuration

  1. Node Feature Discovery operator
  2. Patched MPI operator (patched to keep using kubectl exec instead of [rs]sh): manifests/mpi-operator.yaml
  3. Provide AWS ECR registry credentials (example for us-east-1 and us-west-2):
# printf avoids the trailing newline that echo would add to the encoded credentials
us_east_1_token=$(printf 'AWS:%s' "$(/usr/local/bin/aws ecr get-login-password --region us-east-1)" | base64 -w0)
us_west_2_token=$(printf 'AWS:%s' "$(/usr/local/bin/aws ecr get-login-password --region us-west-2)" | base64 -w0)

The registries used are 994408522926.dkr.ecr.us-east-1.amazonaws.com (us-east-1) and 602401143452.dkr.ecr.us-west-2.amazonaws.com (us-west-2).
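
The README does not say how the tokens are handed to the cluster. One plausible route is a registry auth document in the `.dockerconfigjson` format. The sketch below builds one from two tokens computed as above. `ecr_dockerconfig` is a hypothetical helper that is not part of this repository.

```shell
#!/usr/bin/env bash
# Sketch: build a .dockerconfigjson document from two pre-computed tokens.
# ecr_dockerconfig is a hypothetical helper; each token is the base64 of
# "AWS:<password>" for the matching region.

ecr_dockerconfig() {
    local east_token="$1" west_token="$2"
    # Base64 output contains only JSON-safe characters, so no escaping is needed.
    printf '{"auths":{"%s":{"auth":"%s"},"%s":{"auth":"%s"}}}\n' \
        "994408522926.dkr.ecr.us-east-1.amazonaws.com" "$east_token" \
        "602401143452.dkr.ecr.us-west-2.amazonaws.com" "$west_token"
}
```

The output can be written to a file and stored as a secret of type kubernetes.io/dockerconfigjson, or merged into an existing pull secret. Which of these fits this setup is an assumption left to the reader. Tokens from `aws ecr get-login-password` expire after 12 hours, so the document has to be regenerated periodically.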

Jobs configuration

  1. Container images used for the MPI jobs: container/Dockerfile-[intel|openmpi]. Prebuilt images are available at quay.io/cgament/mpi:efa and quay.io/cgament/mpi:intel.
  2. We use the MPI benchmarks from Ohio State University (which cover MPI over InfiniBand, Omni-Path, Ethernet/iWARP, and RoCE), specifically osu_latency. Example MPIJobs:
    • jobs/mpi-tcp.yaml -- an MPI job running over TCP, used to determine network connectivity between nodes
    • jobs/mpi-support.yaml -- will just show detected EFA support in the pod logs
    • jobs/mpi-latency.yaml -- latency benchmark using OpenMPI
    • jobs/mpi-intel.yaml -- latency benchmark using IntelMPI
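
To compare runs (for example TCP against EFA), it can be handy to pull a single figure out of the benchmark logs. The sketch below assumes the usual osu_latency layout: comment lines starting with `#`, followed by two-column rows of message size and latency in microseconds. `osu_latency_for_size` is a hypothetical helper that is not part of this repository.

```shell
#!/usr/bin/env bash
# Sketch: extract the latency for one message size from osu_latency output.
# osu_latency_for_size is a hypothetical helper, not part of this repository.

osu_latency_for_size() {
    # First argument: message size in bytes. Benchmark output is read from stdin.
    local size="$1"
    awk -v size="$size" '
        /^#/ { next }                        # skip the benchmark header lines
        NF == 2 && $1 == size && $2 ~ /^[0-9.]+$/ {
            print $2                         # latency column
            found = 1
            exit
        }
        END { exit (found ? 0 : 1) }         # non-zero status if size is absent
    '
}
```

A possible use is `oc logs <launcher-pod> | osu_latency_for_size 8`, where `<launcher-pod>` stands for whichever pod holds the benchmark output in your run.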