This repository has been archived by the owner on Sep 18, 2024. It is now read-only.

A new training service implementation for Kubernetes by Petuum #3022

Merged: SparkSnail merged 45 commits into microsoft:dev-adl on Nov 23, 2020

pengwu22 commented Oct 22, 2020

Hi NNI Developers,

We are Petuum developers, and we'd like to contribute a new training service to NNI 😄: AdlTrainingService, which implements the existing KubernetesTrainingService abstraction.

Background

Recently, in collaboration with a CMU lab, we released an open-source project, AdaptDL, which aims to make distributed deep learning easy and efficient in dynamic-resource environments such as shared clusters and the cloud.

It consists of a Kubernetes job scheduler and an adaptive training library. The scheduler directly schedules the trial jobs and optimizes cluster-wide training performance and resource utilization. The added training service, AdlTrainingService, uses the Kubernetes client library to interact with the AdaptDL scheduler CRD, and thereby controls the lifecycle of each trial in experiments that run on this new training service.
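To make the lifecycle idea concrete, here is a minimal sketch of how a training service might submit and cancel trials by creating and deleting custom resources. It is not the code from this PR: `CRD_API_VERSION`, `CRD_KIND`, the manifest layout, the resource naming scheme, `InMemoryCluster`, and `TrialLifecycle` are all hypothetical names chosen for illustration, and the in-memory cluster stands in for calls that a real service would make through a Kubernetes client library.

```python
from typing import Dict, Optional

# Hypothetical placeholders: the real group, version, and kind are defined by
# the AdaptDL CRD and are not stated in this PR description.
CRD_API_VERSION = "example.com/v1"
CRD_KIND = "ExampleTrainingJob"


def build_job_manifest(experiment_id: str, trial_id: str, image: str, command: str) -> dict:
    """Build a custom-resource manifest for one trial (illustrative schema only)."""
    return {
        "apiVersion": CRD_API_VERSION,
        "kind": CRD_KIND,
        "metadata": {
            # Kubernetes object names must be lowercase, hence lower().
            "name": f"nni-exp-{experiment_id.lower()}-trial-{trial_id.lower()}",
            "labels": {"experiment": experiment_id, "trial": trial_id},
        },
        "spec": {
            "template": {
                "spec": {
                    "containers": [
                        {"name": "trial", "image": image, "command": ["sh", "-c", command]}
                    ]
                }
            }
        },
    }


class InMemoryCluster:
    """Stand-in for the Kubernetes API so the sketch runs without a cluster."""

    def __init__(self) -> None:
        self._objects: Dict[str, dict] = {}

    def create(self, manifest: dict) -> None:
        self._objects[manifest["metadata"]["name"]] = manifest

    def get(self, name: str) -> Optional[dict]:
        return self._objects.get(name)

    def delete(self, name: str) -> bool:
        return self._objects.pop(name, None) is not None


class TrialLifecycle:
    """Submit and cancel trials by creating and deleting custom resources."""

    def __init__(self, cluster: InMemoryCluster, experiment_id: str) -> None:
        self.cluster = cluster
        self.experiment_id = experiment_id
        self._names: Dict[str, str] = {}  # trial id -> resource name

    def submit(self, trial_id: str, image: str, command: str) -> str:
        manifest = build_job_manifest(self.experiment_id, trial_id, image, command)
        self.cluster.create(manifest)
        name = manifest["metadata"]["name"]
        self._names[trial_id] = name
        return name

    def cancel(self, trial_id: str) -> bool:
        # Returns False when the trial is unknown or was already cancelled.
        name = self._names.pop(trial_id, None)
        return name is not None and self.cluster.delete(name)
```

In this sketch the scheduler itself is out of scope: the service only declares the desired job as a resource, and the scheduler that watches that resource type decides when and where it runs.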

We have already adopted this integration on a fork for our machine-learning engineers internally, and it's very cool and useful! We'd therefore like to contribute it back to NNI, so that we can grow our open-source community together.

Outline

While developing, we followed the principle that the changes added for the new training service do NOT affect any feature that NNI already has, so all NNI functionality remains the same as on master.

This pull request consists of the following parts:

  • AdlTrainingService and its related features, all compatible with nnictl:
    • Core implementation and the corresponding config schema update (we also include an example config for the mnist-pytorch example)
    • Internal storage (PVC) and external storage (NFS) support
    • Tensorboard support
    • Log streaming and collection support
  • A new in-memory data field, message, in the job info, which displays a short status-related message for a trial job alongside its status. For example, when an AdaptDL trial job is pending, the message can distinguish "waiting in the queue" from "pulling the image", etc. Together, the NNI status and message are informative enough that users no longer need to query Kubernetes directly for more information.
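The message field described in the last item can be sketched as follows. This is an illustration under stated assumptions, not the PR's implementation: `TrialJobInfo`, `describe_pending`, `with_pending_message`, and the message wording are hypothetical. The reason strings `ContainerCreating`, `ErrImagePull`, and `ImagePullBackOff` are waiting reasons that Kubernetes reports for containers; how the actual service maps them to messages is not stated here.

```python
from dataclasses import dataclass, replace
from typing import Optional

# Container waiting reasons reported by Kubernetes when an image cannot be pulled.
IMAGE_PULL_REASONS = {"ErrImagePull", "ImagePullBackOff"}


@dataclass(frozen=True)
class TrialJobInfo:
    """Hypothetical job info: a status plus a short in-memory message."""
    trial_id: str
    status: str                    # e.g. "WAITING" or "RUNNING"
    message: Optional[str] = None  # short explanation shown next to the status


def describe_pending(scheduled: bool, waiting_reason: Optional[str]) -> str:
    """Turn low-level pod state into a short message (assumed mapping)."""
    if not scheduled:
        return "Waiting in the queue"
    if waiting_reason == "ContainerCreating":
        return "Creating the container (the image may still be pulling)"
    if waiting_reason in IMAGE_PULL_REASONS:
        return f"Image pull problem: {waiting_reason}"
    return "Pending"


def with_pending_message(info: TrialJobInfo, scheduled: bool,
                         waiting_reason: Optional[str]) -> TrialJobInfo:
    """Return a copy of the job info with only the message refreshed."""
    return replace(info, message=describe_pending(scheduled, waiting_reason))
```

The status stays coarse while the message carries the detail, so two pending trials with the same status can still be told apart without a separate Kubernetes query.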

Test

Note that the core test cases for the new Kubernetes training service in this PR are integration tests, which need a Kubernetes cluster to run. If the integration tests are skipped, as they are for the NNI Kubeflow training service, all original unit-test cases behave the same as those on master.


We hope to merge this after some review discussion and changes.

If you guys have any questions, let us know!

Thank you! 👍

ghost commented Oct 22, 2020

CLA assistant check
All CLA requirements met.

@pengwu22 (Author)

We want to have a review meeting with you to discuss the integration. Our PM will contact you soon.

Cool! Thanks. Looking forward to it. 👍

Hi @pw2393 - This is Scarlett, PM of the project. First, thanks for contributing to NNI; we're glad to see another training service that can benefit users of a new AI platform. As this PR is pretty big, shall we book a 30-minute to 1-hour PR review sometime this week? I will also invite the NNI dev who works on the Training Service part. Let me know your email and available time slot(s) so I can send out the calendar invitation.

Thanks @scarlett2018! Sounds good to me.

  • Email: opensource@petuum.com
  • Availability in Pacific Time: this Wednesday 12:00 - 2:00 PM, Thursday 10:30 - 2:00 PM, or Friday 10:30 - 2:00 PM.

Hi @scarlett2018, just a follow-up: will there still be a meeting sometime this week?

scarlett2018 (Member) commented Oct 30, 2020

Hi @pw2393, sorry for the delayed response; this week is actually not working well for all the reviewers. I started a Doodle poll for next week with 4 options so far, and I hope some of them work for you:
https://doodle.com/poll/5ky9scc4xmb3mkh7

pengwu22 (Author) commented Oct 30, 2020

Thanks @scarlett2018. Slot is selected. Could you send us the meeting invite via email?

@scarlett2018 (Member)

Thanks @pw2393. This Wednesday did not work out in the end, so I picked the same time slot next Wednesday and sent out the invitation. I hope it works for you.

pengwu22 (Author) commented Nov 4, 2020

Thanks @scarlett2018, the mail invite is accepted.

pengwu22 (Author) commented Nov 9, 2020

@SparkSnail

Please add an AdlMode.md doc under https://github.com/microsoft/nni/tree/master/docs/en_US/TrainingService that describes the training service and explains how to prepare the environment and start experiments.

Yup, we are working on it.

@SparkSnail done here: https://github.com/microsoft/nni/blob/7a73286699a4da148ec7319763ba472a147470c8/docs/en_US/TrainingService/AdaptDLMode.md

SparkSnail (Contributor) commented Nov 13, 2020

I'll help add an ADL pipeline and test the code in it. Once the pipeline is ready, this PR can be merged. I'll post an update when I make progress.

SparkSnail merged commit 6518e0b into microsoft:dev-adl on Nov 23, 2020