Training Operator 1.3 has been out for a few months, so it's time to build a wishlist for the next release. The community is also collecting roadmaps for the Kubeflow 1.5 release (Jan or Feb?). kubeflow/community#535
I think for the next release we can put more time into building a decent elastic training story.
@gaocegege is working on the PyTorch parts, and the large PR has been merged.
On the other hand, mpi-operator v1 has been integrated into training-operator, and we can enrich the elastic work (letting users choose arbitrary workers to scale in, instead of only changing the number of workers) based on what @zw0610 did in the past.
If we have additional time, we can revisit the TensorFlow elastic training story.
Some other meaningful tasks would be nice to have, such as a GenericJob that can support frameworks flexibly. Supporting different gang definitions may be worth exploring as well. Feel free to brainstorm ideas; we can then summarize a roadmap and recruit contributors and release managers.
/cc @kubeflow/wg-training-leads
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.