Releases: kubeflow/mpi-operator

v0.6.0

16 Oct 16:13

Changes since v0.5.0

  • Features:
    • Support the managedBy field (.spec.runPolicy.managedBy), inspired by the batch/v1 Job.
      • This allows MPIJobs to be dispatched to multiple clusters through Kueue's MultiKueue. (#650, @mszadkow)
  • Clean ups:
    • Upgrade k8s libraries to v1.31 (#664, @ArangoGutierrez)
    • Upgrade the Debian version to bookworm and the MPI versions as follows (#661, @tenzen-y):
      • OpenMPI: v4.1.0 -> v4.1.4
      • MPICH: v3.4.1 -> v4.0.2
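
As a rough illustration of the managedBy field described above, the sketch below builds a minimal MPIJob manifest as a Python dict. The job name, image, container name, and launcher command are hypothetical, and the MultiKueue controller name is an assumption that should be checked against the Kueue documentation; only the field path .spec.runPolicy.managedBy comes from these notes.

```python
import json

# Assumed value: the controller name Kueue's MultiKueue is understood to use.
# Verify it against the Kueue documentation before relying on it.
MULTIKUEUE_MANAGER = "kueue.x-k8s.io/multikueue"


def mpijob_with_manager(name, image, workers, managed_by=None):
    """Build a minimal MPIJob manifest as a plain dict.

    When managed_by is None the field is omitted, so the job is left to
    whichever controller handles MPIJobs by default.
    """
    run_policy = {}
    if managed_by is not None:
        run_policy["managedBy"] = managed_by

    def replica(count, command=None):
        # "mpi" is a hypothetical container name used only for this sketch.
        container = {"name": "mpi", "image": image}
        if command:
            container["command"] = command
        return {"replicas": count, "template": {"spec": {"containers": [container]}}}

    return {
        "apiVersion": "kubeflow.org/v2beta1",
        "kind": "MPIJob",
        "metadata": {"name": name},
        "spec": {
            "runPolicy": run_policy,
            "mpiReplicaSpecs": {
                "Launcher": replica(1, ["mpirun", "-n", str(workers), "hostname"]),
                "Worker": replica(workers),
            },
        },
    }


if __name__ == "__main__":
    job = mpijob_with_manager("demo", "example.com/mpi-demo:latest", 2, MULTIKUEUE_MANAGER)
    print(json.dumps(job, indent=2))
```

The printed JSON can be piped to kubectl apply -f - on a cluster where the MPIJob CRD is installed.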

Acknowledgments

Thank you to all the contributors (in no particular order): @mszadkow @mimowo @alculquicondor @terrytangyuan @ArangoGutierrez @tenzen-y

Full Changelog: v0.5.0...v0.6.0

v0.5.0

18 Apr 18:38

Changes since v0.4.0

  • Features:
    • Add support for MPICH (#562, @sheevy)
    • The runLauncherAsWorker field allows adding the launcher pod to the hostfile as a worker (#612, @kuizhiqing)
    • Add PodGroup minResources calculation for volcano integration (#566, @lowang-bh)
  • Bug fixes:
    • Fix panic when using PodGroups and PriorityClasses (#561, @tenzen-y)
    • Fix installation of mpijob Python module (#579, @vsoch)
    • Fix hostfile when jobs in different namespaces have the same name (#622, @kuizhiqing)
  • Clean ups:
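
To make the hostfile-related items in the v0.5.0 list more concrete (runLauncherAsWorker, MPICH support, and the same-name-in-different-namespaces fix), here is a small sketch of how such a hostfile could be assembled. The hostname pattern and the function name are assumptions made for illustration, not the operator's actual code; the entry formats follow the usual OpenMPI (`host slots=N`) and MPICH (`host:N`) hostfile conventions.

```python
def hostfile_lines(job, namespace, workers, slots,
                   run_launcher_as_worker=False, implementation="OpenMPI"):
    """Return hostfile entries for a job (illustrative sketch only).

    Hostnames are assumed to look like <pod>.<job>.<namespace>.svc, so two
    jobs with the same name in different namespaces get distinct entries.
    """
    hosts = []
    if run_launcher_as_worker:
        # The launcher pod is listed first so it also receives MPI ranks.
        hosts.append(f"{job}-launcher.{job}.{namespace}.svc")
    hosts += [f"{job}-worker-{i}.{job}.{namespace}.svc" for i in range(workers)]

    if implementation == "OpenMPI":
        return [f"{host} slots={slots}" for host in hosts]
    # MPICH-style hostfiles use "host:slots".
    return [f"{host}:{slots}" for host in hosts]


if __name__ == "__main__":
    for line in hostfile_lines("pi", "team-a", 2, 1, run_launcher_as_worker=True):
        print(line)
```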

Acknowledgments

Thank you to all the contributors (in no particular order): @sheevy @alculquicondor @terrytangyuan @tenzen-y @kuizhiqing @lowang-bh @vsoch @emsixteeen @wang-mask @benash @yeahdongcn @xhejtman @pheianox @lianghao208

v0.4.0

05 Apr 20:54
c77dfcf

Changes since v0.3.0

  • Breaking changes
    • Removed v1 operator. If you want to use MPIJob v1, you can use the training-operator.
  • Support for suspend semantics. Third-party controllers can use the suspend field to implement queuing and preemption for an MPIJob.
  • Support for the coscheduling plugin of scheduler-plugins.
  • The operator supports multiple architectures (amd64, aarch64, and ppc64le).
  • Bug fixes
    • Fix support for elastic Horovod.
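
The suspend semantics mentioned above can be pictured with the sketch below, in which a hypothetical third-party queuing controller decides which MPIJobs to resume. The field path spec.runPolicy.suspend is assumed from the v2beta1 API, and the admission policy (first come, first served by worker count) is invented for this example.

```python
import copy


def set_suspend(mpijob, suspended):
    """Return a copy of the manifest with spec.runPolicy.suspend set."""
    updated = copy.deepcopy(mpijob)
    run_policy = updated.setdefault("spec", {}).setdefault("runPolicy", {})
    run_policy["suspend"] = bool(suspended)
    return updated


def admit(queue, free_slots):
    """Hypothetical queuing policy: resume jobs in order while their workers fit.

    A job's demand is its Worker replica count; jobs that do not fit in the
    remaining capacity stay suspended.
    """
    decisions = []
    for job in queue:
        demand = job["spec"]["mpiReplicaSpecs"]["Worker"]["replicas"]
        fits = demand <= free_slots
        if fits:
            free_slots -= demand
        decisions.append(set_suspend(job, not fits))
    return decisions
```

A real controller would patch the field through the Kubernetes API instead of returning modified copies; the copies here just keep the sketch free of side effects.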

Acknowledgments

Special thanks to @tenzen-y for multiple contributions.
Thank you to all the contributors (in no particular order): @mimowo @adilhusain-s @davidLif @ArangoGutierrez @shaowei-su @ggaaooppeenngg @pugangxa @HeGaoYuan @Dimss @alculquicondor @terrytangyuan

v0.3.0

07 Sep 20:46
db6930d

Release v0.3.0

  • Scalability improvements
    • Worker start up no longer issues requests to kube-apiserver.
    • Dropped kubectl-delivery init container, reducing stress on kube-apiserver.
  • Support for Intel MPI.
  • Support for runPolicy (ttlSecondsAfterFinished, activeDeadlineSeconds, backoffLimit)
    by using a k8s Job for the launcher.
  • Samples for plain MPI applications.
  • Production readiness improvements:
    • Increased coverage throughout unit, integration and E2E tests.
    • More robust API validation.
    • Revisited v2beta1 MPIJob API.
    • Using fully-qualified label names, consistent with other Kubeflow operators.
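
As a sketch of the runPolicy fields listed above, the helper below assembles a runPolicy fragment for an MPIJob spec. The helper name and the validation rule are illustrative assumptions, and the TTL key is spelled ttlSecondsAfterFinished here on the assumption that it matches the batch/v1 Job field of the same name.

```python
def build_run_policy(ttl_seconds_after_finished=None,
                     active_deadline_seconds=None,
                     backoff_limit=None):
    """Return a runPolicy dict containing only the fields that were set.

    The non-negative integer check is a local sanity check for this sketch,
    not a copy of the operator's API validation.
    """
    fields = {
        "ttlSecondsAfterFinished": ttl_seconds_after_finished,
        "activeDeadlineSeconds": active_deadline_seconds,
        "backoffLimit": backoff_limit,
    }
    policy = {}
    for key, value in fields.items():
        if value is None:
            continue
        if isinstance(value, bool) or not isinstance(value, int) or value < 0:
            raise ValueError(f"{key} must be a non-negative integer, got {value!r}")
        policy[key] = value
    return policy


if __name__ == "__main__":
    # Keep finished jobs for an hour, cap runtime at a day, retry up to 3 times.
    print(build_run_policy(3600, 86400, 3))
```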

v0.2.3

19 May 14:16
aa96794

Enhancements

  • Added support for Red Hat OpenShift Container Platform (OCP) 4.1 and 4.2
  • Added additional installation methods
  • Added support for Go Modules and removed vendor directories
  • Added default ephemeral storage for init container
  • Overwrite NVIDIA env vars to avoid using GPUs on launcher
  • Added health check and callbacks around various leader election phases
  • Honor user-specified worker command
  • Exposed main container name as a configurable field
  • Added RunPolicy to MPIJobSpec that reuses kubeflow/common spec
  • Allow specifying the name of the gang scheduler and the priority for the pod group
  • Added error log when pod spec does not have any containers
  • Switched to use distroless images
  • Refactored the kubectl-delivery to improve the launcher performance
  • Added Prometheus metrics for job monitoring
  • Added experimental version of v1 MPIJob controller and APIs
  • Support Volcano as a scheduler
  • Switched to use pods for launcher job and statefulset workers
  • Switched to use klog for logging
  • More consistent labels with other Kubeflow operators

Fixes

  • Fixed nil pointer exceptions that could accidentally restart the pod
  • Updated status to running only when launcher is active and all workers are ready
  • Fixed the incorrect namespace for initializing informers and endpoints of leader election
  • Fixed issue in v1 controller's CRD existence check

Documentation

v0.2.2

16 Sep 16:41
  • Added default resource requirements for init container
  • Merged multiple deployment configuration files into a single YAML file
  • Switched to use JobStatus from kubeflow/common
  • Launcher and workers are now created together

v0.2.1

15 Jul 13:08
  • Switch Dockerfiles and examples to use the v1alpha2 MPI Operator.

v0.2.0

03 Jul 17:19

API Changes

  • Add v1alpha2 version of the MPI Operator, with an API spec more consistent with other Kubeflow operators
  • Support ActiveDeadlineSeconds in MPIJobSpec
  • Support custom resource types other than GPUs
  • Remove launcherOnMaster field

Enhancements

  • Support gang scheduling
  • Add StartTime and CompletionTime in job status
  • Add leader election
  • Switch to use pod group for gang scheduling
  • Add example on Apache MXNet using v1alpha1 version of the MPI Operator

Initial release

11 Jan 00:53
071a9bc

Initial release of the MPI Operator.