A new training service implementation for Kubernetes by Petuum #3022
Hi @scarlett2018, just a follow-up: will there still be a meeting sometime this week?
Hi @pw2393, sorry for the delayed response. This week is actually not working well for all the reviewers. I started a Doodle vote for next week, and there are currently 4 options; hope some of them work well for you:
Thanks @scarlett2018. Slot is selected. Could you send us the meeting invite via email?
Thanks @pw2393, this Wed did not work out in the end. I picked the same time slot for next Wed and sent out the invitation; hope it works for you.
Thanks @scarlett2018, the mail invite is accepted.
@SparkSnail done here: https://github.com/microsoft/nni/blob/7a73286699a4da148ec7319763ba472a147470c8/docs/en_US/TrainingService/AdaptDLMode.md
I'll help add an adl pipeline and test the code in the pipeline. After the pipeline is ready, this PR can be merged. I will give a notification when I make progress.
Hi NNI Developers,
We are Petuum developers and we'd like to contribute a new training service to NNI 😄: AdlTrainingService, which implements the original KubernetesTrainingService abstraction.
Background
Recently, collaborating with a CMU lab, we released an open-source project: AdaptDL, aiming to make distributed deep learning easy and efficient in dynamic-resource environments such as shared clusters and the cloud.
It consists of a Kubernetes job scheduler and an adaptive training library. The scheduler part directly schedules the trial jobs and optimizes cluster-wide training performance and resource utilization. The added training service, AdlTrainingService, uses the Kubernetes client library to interact with the AdaptDL scheduler CRD, to control the lifecycles of the trials of experiments defined upon this new training service.
We have already adopted this integration on a fork for our machine-learning engineers internally, and it's very cool and useful! Therefore, we'd like to contribute it back to NNI, so that we can grow our open-source community together.
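As a rough illustration of the lifecycle control described in the Background, here is a minimal Python sketch of creating and deleting a trial as a Kubernetes custom resource. It is not the code in this PR (the training service itself lives in NNI's TypeScript code base). The CRD coordinates (adaptdl.petuum.com/v1, AdaptDLJob, adaptdljobs), the spec layout, and the nni-trial- name prefix are assumptions that should be checked against the installed AdaptDL CRD. The api argument is expected to behave like kubernetes.client.CustomObjectsApi from the official Python client.

```python
# Assumed CRD coordinates for AdaptDL jobs; verify against the installed CRD.
GROUP = "adaptdl.petuum.com"
VERSION = "v1"
PLURAL = "adaptdljobs"


def trial_resource_name(trial_id):
    """Kubernetes object names must be lowercase; the prefix is hypothetical."""
    return f"nni-trial-{trial_id.lower()}"


def build_trial_manifest(trial_id, image, command):
    """Return a custom-resource body for one trial (field layout is assumed)."""
    return {
        "apiVersion": f"{GROUP}/{VERSION}",
        "kind": "AdaptDLJob",
        "metadata": {"name": trial_resource_name(trial_id)},
        "spec": {
            "template": {
                "spec": {
                    "containers": [
                        {"name": "main", "image": image, "command": list(command)}
                    ]
                }
            }
        },
    }


def submit_trial(api, namespace, manifest):
    """Create the custom resource so the scheduler can pick up the trial."""
    return api.create_namespaced_custom_object(
        group=GROUP, version=VERSION, namespace=namespace,
        plural=PLURAL, body=manifest,
    )


def cancel_trial(api, namespace, trial_id):
    """Delete the custom resource, which ends the trial's lifecycle."""
    return api.delete_namespaced_custom_object(
        group=GROUP, version=VERSION, namespace=namespace,
        plural=PLURAL, name=trial_resource_name(trial_id),
    )
```

With a real cluster, api would typically be constructed as kubernetes.client.CustomObjectsApi() after loading a kubeconfig; passing it in as a parameter keeps the sketch testable without one.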
Outline
While developing, we followed the principle that the changes added for the new training service do NOT affect any features that NNI already has, so that all NNI functionality remains the same as on master.
This pull request consists of the following parts:
- AdlTrainingService and its relevant features, all compatible with nnictl.
- message in job info, which displays a short status-related message for a trial job together with its status. For example, when an AdaptDL trial job is in pending status, the message can distinguish "waiting in the queue" from "pulling the image", etc. NNI status and message together are then informative enough that users no longer need to query more info directly via Kubernetes.
Test
Note that the core test cases for the new Kubernetes training service in this PR are integration tests, which need a Kubernetes cluster to run. If the integration tests are skipped, as is done for the NNI Kubeflow training service, all original unit-test cases behave the same as those in master.
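One common way to make cluster-dependent tests skippable is to gate them on an environment variable. The sketch below shows that pattern with Python's unittest. It is only an illustration, not how the tests in this PR are wired, and the variable name ADL_TEST_KUBECONFIG is hypothetical.

```python
import os
import unittest

# Hypothetical variable name; not defined by NNI or by this PR.
CLUSTER_ENV_VAR = "ADL_TEST_KUBECONFIG"


def cluster_configured(env=None):
    """Return True when the environment points at a test cluster."""
    env = os.environ if env is None else env
    return bool(env.get(CLUSTER_ENV_VAR))


class AdlIntegrationTest(unittest.TestCase):
    # The condition is evaluated once, when the class body is executed.
    @unittest.skipUnless(cluster_configured(), "no Kubernetes cluster configured")
    def test_cluster_is_reachable(self):
        # Placeholder body: a real test would submit a trial job here.
        self.assertTrue(cluster_configured())
```

When the variable is unset, the test runner reports the test as skipped rather than failed, so the unit-test results stay comparable to master.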
Hope that we could successfully merge it later after some review discussions and changes.
If you guys have any questions, let us know!
Thank you! 👍