Releases · kubeflow/mpi-operator
v0.6.0
Changes since v0.5.0
- Features:
- Clean ups:
- Upgrade k8s libraries to v1.31 (#664, @ArangoGutierrez)
- Upgrade the Debian base image to bookworm and upgrade the following MPI versions (#661, @tenzen-y):
- OpenMPI: v4.1.0 -> v4.1.4
- MPICH: 3.4.1 -> 4.0.2
Acknowledgments
Thank you to all the contributors (in no particular order): @mszadkow @mimowo @alculquicondor @terrytangyuan @ArangoGutierrez @tenzen-y
Full Changelog: v0.5.0...v0.6.0
v0.5.0
Changes since v0.4.0
- Features:
- Add support for MPICH (#562, @sheevy)
- New field runLauncherAsWorker adds the launcher pod to the hostfile as a worker (#612, @kuizhiqing); see the example after this list.
- Add PodGroup minResources calculation for the Volcano integration (#566, @lowang-bh)
- Bug fixes:
- Clean ups:
- Upgrade k8s libraries to v1.29 (#633, @tenzen-y)
- Fail the mpi-operator binary if access to API is denied (#619, @emsixteeen)
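For illustration, a minimal sketch of how the two v0.5.0 features above might be combined in a v2beta1 MPIJob manifest. The field names `mpiImplementation` and `runLauncherAsWorker`, the image, and the command are assumptions; verify them against the CRD shipped with your operator version.

```yaml
apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: pi-mpich
spec:
  # Assumed field names; verify against the v2beta1 CRD for your operator version.
  mpiImplementation: MPICH      # selects the MPICH support added in #562
  runLauncherAsWorker: true     # the launcher is listed in the hostfile as a worker (#612)
  slotsPerWorker: 1
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
          - name: launcher
            image: example.com/pi-mpich:latest        # placeholder image
            command: ["mpirun", "-n", "3", "/opt/pi"] # placeholder command: launcher + 2 workers
    Worker:
      replicas: 2
      template:
        spec:
          containers:
          - name: worker
            image: example.com/pi-mpich:latest        # placeholder image
```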
Acknowledgments
Thank you to all the contributors (in no particular order): @sheevy @alculquicondor @terrytangyuan @tenzen-y @kuizhiqing @lowang-bh @vsoch @emsixteeen @wang-mask @benash @yeahdongcn @xhejtman @pheianox @lianghao208
v0.4.0
Changes since v0.3.0
- Breaking changes
- Removed v1 operator. If you want to use MPIJob v1, you can use the training-operator.
- Support for suspend semantics: third-party controllers can leverage the suspend field to implement queuing and preemption for an MPIJob (see the example after this list).
- Support for the coscheduling plugin from scheduler-plugins.
- The operator supports multi-architecture (amd64, aarch64, and ppc64le).
- Bug fixes
- Fix support for elastic Horovod.
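A minimal sketch of the suspend semantics above, assuming the field lives at `spec.runPolicy.suspend` as in the v2beta1 API; a queuing controller would create the MPIJob suspended and flip the field to `false` once it admits the job. Names and images are placeholders.

```yaml
apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: queued-job
spec:
  runPolicy:
    suspend: true    # assumed field path; pods are not created while the job is suspended
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
          - name: launcher
            image: example.com/app:latest   # placeholder image
    Worker:
      replicas: 2
      template:
        spec:
          containers:
          - name: worker
            image: example.com/app:latest   # placeholder image
```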
Acknowledgments
Special thanks to @tenzen-y for multiple contributions.
Thank you to all the contributors (in no particular order): @mimowo @adilhusain-s @davidLif @ArangoGutierrez @shaowei-su @ggaaooppeenngg @pugangxa @HeGaoYuan @Dimss @alculquicondor @terrytangyuan
v0.3.0
Release v0.3.0
- Scalability improvements
- Worker startup no longer issues requests to kube-apiserver.
- Dropped kubectl-delivery init container, reducing stress on kube-apiserver.
- Support for Intel MPI.
- Support for `runPolicy` (`ttlSecondsAfterFinished`, `activeDeadlineSeconds`, `backoffLimit`) by using a k8s Job for the launcher (see the example after this list).
- Samples for plain MPI applications.
- Production readiness improvements:
- Increased coverage throughout unit, integration and E2E tests.
- More robust API validation.
- Revisited v2beta1 MPIJob API.
- Using fully-qualified label names, for consistency with other Kubeflow operators.
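A hedged sketch of the runPolicy support above in a v2beta1 MPIJob. The field names follow the shared RunPolicy type but should be verified against the CRD; the image and values are placeholders.

```yaml
apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: run-policy-example
spec:
  runPolicy:
    # Assumed field names from the shared RunPolicy type; verify against the CRD.
    ttlSecondsAfterFinished: 600   # garbage-collect the job 10 minutes after it finishes
    activeDeadlineSeconds: 3600    # fail the job if it runs longer than one hour
    backoffLimit: 3                # retry the launcher up to three times
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
          - name: launcher
            image: example.com/app:latest   # placeholder image
    Worker:
      replicas: 2
      template:
        spec:
          containers:
          - name: worker
            image: example.com/app:latest   # placeholder image
```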
v0.2.3
Enhancements
- Added support for Red Hat OpenShift (OCP) 4.1 and 4.2
- Added additional installation methods
- Using kustomize and kubeflow/manifests
- Using Helm Chart
- Added support for Go Modules and removed vendor directories
- Added default ephemeral storage for init container
- Overwrite NVIDIA env vars to avoid using GPUs on the launcher
- Added health check and callbacks around various leader election phases
- Honor user-specified worker command
- Exposed main container name as a configurable field
- Added RunPolicy to MPIJobSpec that reuses kubeflow/common spec
- Allow specifying the gang scheduler name and the priority for the pod group
- Added error log when pod spec does not have any containers
- Switched to use distroless images
- Refactored kubectl-delivery to improve launcher performance
- Added Prometheus metrics for job monitoring
- Added experimental version of v1 MPIJob controller and APIs
- Support Volcano as a scheduler
- Switched to use pods for launcher job and statefulset workers
- Switched to use klog for logging
- More consistent labels with other Kubeflow operators
Fixes
- Fixed nil pointer exceptions that could accidentally restart the pod
- Updated status to running only when launcher is active and all workers are ready
- Fixed the incorrect namespace for initializing informers and endpoints of leader election
- Fixed issue in v1 controller's CRD existence check
Documentation
- Added the list of adopters
- Added roadmap document
- Revamped contributing guidelines
- Added MPIJob API reference page on Kubeflow website
- Added a blog post for an introduction to MPI Operator and its industry adoption
- Added a CPU-only example
- Added licenses used by the dependencies
v0.2.2
- Added default resource requirements for init container
- Merged multiple deployment configuration files into a single YAML file
- Switched to use `JobStatus` from kubeflow/common
- Launcher and workers are now created together
v0.2.1
- Switch Docker files and examples to use the `v1alpha2` MPI Operator.
v0.2.0
API Changes
- Add v1alpha2 version of the MPI Operator with more consistent API spec with other Kubeflow operators
- Support `ActiveDeadlineSeconds` in `MPIJobSpec`
- Support custom resource types other than GPUs (see the sketch after this list)
- Remove `launcherOnMaster` field
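As an illustration of the custom-resource support, a fragment of a worker pod template requesting a non-GPU device-plugin resource; the resource name, image, and amounts are placeholders, and the surrounding MPIJob fields are omitted.

```yaml
# Fragment of a worker pod template (surrounding MPIJob fields omitted).
# Any resource exposed by a device plugin can be requested the same way as nvidia.com/gpu.
containers:
- name: worker
  image: example.com/app:latest       # placeholder image
  resources:
    limits:
      cpu: "4"
      memory: 8Gi
      example.com/accelerator: "1"    # hypothetical custom device-plugin resource
```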
Enhancements
- Support gang scheduling
- Add `StartTime` and `CompletionTime` in job status
- Add leader election
- Switch to use pod group for gang scheduling
- Add an example for Apache MXNet using the v1alpha1 version of the MPI Operator
Initial release
Initial release of the MPI Operator.