Releases · kubeflow/mpi-operator

v0.6.0 (Latest)

Released 16 Oct 16:13 by tenzen-y (commit c983759)

Changes since v0.5.0

Features:
- Support ManagedBy feature (.spec.runPolicy.managedBy) inspired by batch/v1 Job.
  - This allows us to dispatch MPIJobs to multiple clusters powered by Kueue's MultiKueue. (#650, @mszadkow)
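The managedBy field follows the batch/v1 Job convention: a controller acts only on jobs that name it, so the built-in operator can leave a job alone while another controller (here, MultiKueue) dispatches it elsewhere. The sketch below models that ownership check on a plain dict. It assumes the controller names "kubeflow.org/mpi-operator" and "kueue.x-k8s.io/multikueue"; the helper names are hypothetical and this is not the operator's Go code.

```python
from typing import Optional

# Assumed controller names; verify against the mpi-operator and Kueue sources.
MPI_OPERATOR = "kubeflow.org/mpi-operator"
MULTIKUEUE = "kueue.x-k8s.io/multikueue"


def managed_by(mpijob: dict) -> Optional[str]:
    """Read .spec.runPolicy.managedBy, returning None when it is unset."""
    spec = mpijob.get("spec") or {}
    run_policy = spec.get("runPolicy") or {}
    return run_policy.get("managedBy")


def should_reconcile(mpijob: dict, controller: str = MPI_OPERATOR) -> bool:
    """Hypothetical ownership check: act only on jobs this controller owns.
    As with batch/v1 Job, an unset managedBy means the built-in controller."""
    owner = managed_by(mpijob)
    if owner is None:
        return controller == MPI_OPERATOR
    return owner == controller


local_job = {"spec": {"runPolicy": {}}}
remote_job = {"spec": {"runPolicy": {"managedBy": MULTIKUEUE}}}
print(should_reconcile(local_job))               # True
print(should_reconcile(remote_job))              # False: left to MultiKueue
print(should_reconcile(remote_job, MULTIKUEUE))  # True
```

Treating an unset field as "owned by the built-in controller" keeps existing MPIJobs working without any change to their manifests.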
Clean ups:
- Upgrade k8s libraries to v1.31 (#664, @ArangoGutierrez)
- Upgrade the Debian version to bookworm and the MPI versions as follows: (#661, @tenzen-y)
  - OpenMPI: v4.1.0 -> v4.1.4
  - MPICH: 3.4.1 -> 4.0.2

Acknowledgments

Thank you to all the contributors (in no particular order): @mszadkow @mimowo @alculquicondor @terrytangyuan @ArangoGutierrez @tenzen-y

Full Changelog: v0.5.0...v0.6.0


v0.5.0

Released 18 Apr 18:38 by tenzen-y (commit 7c62516)

Changes since v0.4.0

Features:
- Add support for MPICH (#562, @sheevy)
- New field runLauncherAsWorker adds the launcher pod to the hostfile as a worker (#612, @kuizhiqing)
- Add PodGroup minResources calculation for volcano integration (#566, @lowang-bh)
Bug fixes:
- Fix panic when using PodGroups and PriorityClasses (#561, @tenzen-y)
- Fix installation of mpijob Python module (#579, @vsoch)
- Fix hostfile when jobs in different namespaces have the same name (#622, @kuizhiqing)
Clean ups:
- Upgrade k8s libraries to v1.29 (#633, @tenzen-y)
- Make the mpi-operator binary fail if access to the API is denied (#619, @emsixteeen)
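Two of the changes above affect the generated hostfile: runLauncherAsWorker adds the launcher as a host, and the namespace fix qualifies hostnames so that same-named jobs in different namespaces do not collide. The sketch below renders such a hostfile. It assumes pods named `<job>-launcher` and `<job>-worker-<i>`, DNS names of the form `<pod>.<job>.<namespace>.svc`, `slots=N` syntax for Open MPI and `host:N` for Intel MPI and MPICH; `build_hostfile` is a hypothetical helper, not the operator's API.

```python
def build_hostfile(job_name: str, namespace: str, workers: int, slots: int,
                   impl: str = "OpenMPI", launcher_as_worker: bool = False) -> str:
    """Render hostfile text for an MPIJob (hypothetical helper)."""
    if impl not in ("OpenMPI", "Intel", "MPICH"):
        raise ValueError(f"unknown MPI implementation: {impl}")
    # Assumed pod naming: <job>-launcher and <job>-worker-<i>.
    pods = [f"{job_name}-launcher"] if launcher_as_worker else []
    pods += [f"{job_name}-worker-{i}" for i in range(workers)]
    lines = []
    for pod in pods:
        # Including the namespace keeps same-named jobs in different
        # namespaces from resolving to each other's pods.
        host = f"{pod}.{job_name}.{namespace}.svc"
        if impl == "OpenMPI":
            lines.append(f"{host} slots={slots}")
        else:  # assumed "host:slots" syntax for Intel MPI and MPICH
            lines.append(f"{host}:{slots}")
    return "\n".join(lines) + "\n"


print(build_hostfile("pi", "team-a", workers=2, slots=1, launcher_as_worker=True), end="")
# pi-launcher.pi.team-a.svc slots=1
# pi-worker-0.pi.team-a.svc slots=1
# pi-worker-1.pi.team-a.svc slots=1
```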

Acknowledgments

Thank you to all the contributors (in no particular order): @sheevy @alculquicondor @terrytangyuan @tenzen-y @kuizhiqing @lowang-bh @vsoch @emsixteeen @wang-mask @benash @yeahdongcn @xhejtman @pheianox @lianghao208


v0.4.0

Released 05 Apr 20:54 by alculquicondor (commit c77dfcf)

Changes since v0.3.0

Breaking changes:
- Removed the v1 operator. If you want to use MPIJob v1, you can use the training-operator.
Features:
- Support for suspend semantics. Third-party controllers can leverage the suspend field to implement queuing and preemption for an MPIJob.
- Support for the coscheduling plugin from scheduler-plugins.
- The operator supports multiple architectures (amd64, aarch64, and ppc64le).
Bug fixes:
- Fix support for elastic Horovod.
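The suspend support lets an external controller act as a queue: it creates MPIJobs suspended, flips the field to admit them when capacity frees up, and sets it back to preempt a running job. The sketch below shows that pattern on plain dicts, assuming the field lives at `spec.runPolicy.suspend` and that the operator tears down the pods of a suspended job, as batch/v1 Job does. The capacity rule and all helper names are hypothetical; a real controller (Kueue, for example) would PATCH the objects through the API server.

```python
from typing import Dict, List


def is_suspended(job: Dict) -> bool:
    return bool(job["spec"]["runPolicy"].get("suspend", False))


def admit(jobs: List[Dict], capacity: int) -> List[str]:
    """Unsuspend queued MPIJobs, in list order, until `capacity` run at once."""
    running = sum(1 for job in jobs if not is_suspended(job))
    admitted = []
    for job in jobs:
        if running >= capacity:
            break
        if is_suspended(job):
            # A real controller would PATCH the MPIJob instead of mutating a dict.
            job["spec"]["runPolicy"]["suspend"] = False
            running += 1
            admitted.append(job["metadata"]["name"])
    return admitted


def preempt(job: Dict) -> None:
    """Suspend a running MPIJob so the operator can release its pods."""
    job["spec"]["runPolicy"]["suspend"] = True


def new_job(name: str, suspend: bool) -> Dict:
    return {"metadata": {"name": name}, "spec": {"runPolicy": {"suspend": suspend}}}


queue = [new_job("a", False), new_job("b", True), new_job("c", True)]
print(admit(queue, capacity=2))  # ['b']; "c" stays suspended
```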

Acknowledgements

Special thanks to @tenzen-y for multiple contributions.
Thank you to all the contributors (in no particular order): @mimowo @adilhusain-s @davidLif @ArangoGutierrez @shaowei-su @ggaaooppeenngg @pugangxa @HeGaoYuan @Dimss @alculquicondor @terrytangyuan


v0.3.0

Released 07 Sep 20:46 by terrytangyuan (commit db6930d)

Release v0.3.0

- Scalability improvements:
  - Worker start-up no longer issues requests to kube-apiserver.
  - Dropped the kubectl-delivery init container, reducing stress on kube-apiserver.
- Support for Intel MPI.
- Support for runPolicy (ttlSecondsAfterFinished, activeDeadlineSeconds, backoffLimit) by using a k8s Job for the launcher.
- Samples for plain MPI applications.
- Production readiness improvements:
  - Increased coverage throughout unit, integration and E2E tests.
  - More robust API validation.
  - Revisited v2beta1 MPIJob API.
  - Using fully-qualified label names, consistent with other Kubeflow operators.
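Running the launcher as a Kubernetes Job means the runPolicy fields that batch/v1 JobSpec also defines (ttlSecondsAfterFinished, activeDeadlineSeconds, backoffLimit) can be enforced by the built-in Job controller rather than reimplemented. The sketch below shows that mapping on plain dicts. `launcher_job_spec` is a hypothetical helper, and the real operator, written in Go, sets many more fields; cleanPodPolicy appears only as an example of a runPolicy field with no batch/v1 equivalent, which is therefore not copied.

```python
from typing import Any, Dict

# runPolicy fields that batch/v1 JobSpec also defines under the same names.
SHARED_FIELDS = ("ttlSecondsAfterFinished", "activeDeadlineSeconds", "backoffLimit")


def launcher_job_spec(run_policy: Dict[str, Any], pod_template: Dict[str, Any]) -> Dict[str, Any]:
    """Copy the runPolicy fields a batch/v1 Job understands onto the launcher
    Job spec; fields unset in runPolicy stay unset (hypothetical helper)."""
    spec: Dict[str, Any] = {"template": pod_template}
    for field in SHARED_FIELDS:
        value = run_policy.get(field)
        if value is not None:
            spec[field] = value
    return spec


policy = {"ttlSecondsAfterFinished": 60, "backoffLimit": 3, "cleanPodPolicy": "Running"}
print(launcher_job_spec(policy, {"spec": {"containers": []}}))
# {'template': {'spec': {'containers': []}}, 'ttlSecondsAfterFinished': 60, 'backoffLimit': 3}
```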


v0.2.3

Released 19 May 14:16 by terrytangyuan (commit aa96794)

Enhancements

- Added support for RH OCP 4.1 and RH OCP 4.2
- Added additional installation methods:
  - Using kustomize and kubeflow/manifests
  - Using a Helm chart
- Added support for Go Modules and removed vendor directories
- Added default ephemeral storage for the init container
- Overwrote NVIDIA environment variables to avoid using GPUs on the launcher
- Added health checks and callbacks around various leader election phases
- Honored the user-specified worker command
- Exposed the main container name as a configurable field
- Added RunPolicy to MPIJobSpec, reusing the kubeflow/common spec
- Allowed specifying the name of the gang scheduler and the priority for the pod group
- Added an error log when the pod spec does not have any containers
- Switched to distroless images
- Refactored kubectl-delivery to improve launcher performance
- Added Prometheus metrics for job monitoring
- Added an experimental version of the v1 MPIJob controller and APIs
- Support Volcano as a scheduler
- Switched to use pods for launcher job and statefulset workers
- Switched to klog for logging
- Made labels more consistent with other Kubeflow operators

Fixes

- Fixed nil pointer exceptions that could accidentally restart the pod
- Set the status to Running only when the launcher is active and all workers are ready
- Fixed the incorrect namespace used to initialize informers and leader election endpoints
- Fixed an issue in the v1 controller's CRD existence check

Documentation

- Added the list of adopters
- Added a roadmap document
- Revamped contributing guidelines
- Added an MPIJob API reference page on the Kubeflow website
- Added a blog post introducing MPI Operator and its industry adoption
- Added a CPU-only example
- Added the licenses used by dependencies


v0.2.2

Released 16 Sep 16:41 by terrytangyuan (commit a657fd4)

- Added default resource requirements for the init container
- Merged multiple deployment configuration files into a single YAML file
- Switched to use JobStatus from kubeflow/common
- Launcher and workers are now created together


v0.2.1

Released 15 Jul 13:08 by terrytangyuan (commit 635e145)

Switch Dockerfiles and examples to use the v1alpha2 MPI Operator.


v0.2.0

Released 03 Jul 17:19 by terrytangyuan (commit 6f627a8)

API Changes

- Add the v1alpha2 version of the MPI Operator, with an API spec more consistent with other Kubeflow operators
- Support ActiveDeadlineSeconds in MPIJobSpec
- Support custom resource types other than GPUs
- Remove the launcherOnMaster field

Enhancements

- Support gang scheduling
- Add StartTime and CompletionTime to the job status
- Add leader election
- Switch to using a pod group for gang scheduling
- Add an Apache MXNet example using the v1alpha1 version of the MPI Operator


0.1.0: Initial release

Released 11 Jan 00:53 by rongou (commit 071a9bc)

Initial release of the MPI Operator.
