A new training service implementation for Kubernetes by Petuum #3022

pengwu22 · 2020-10-22T22:00:15Z

Hi NNI Developers,

We are developers at Petuum, and we'd like to contribute a new training service to NNI 😄: AdlTrainingService, which implements the existing KubernetesTrainingService abstraction.

Background

Recently, in collaboration with a CMU lab, we released an open-source project, AdaptDL, which aims to make distributed deep learning easy and efficient in dynamic-resource environments such as shared clusters and the cloud.

It consists of a Kubernetes job scheduler and an adaptive training library. The scheduler directly schedules the trial jobs and optimizes cluster-wide training performance and resource utilization. The added training service, AdlTrainingService, uses the Kubernetes client library to interact with the AdaptDL scheduler CRD and thereby controls the lifecycle of each trial in experiments that run on this new training service.

We have already adopted this integration on a fork for our machine-learning engineers internally, and it's very cool and useful! Therefore, we'd like to contribute it back to NNI so that we can grow our open-source community together.
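As a rough illustration of what "interacting with the scheduler CRD" could involve, the sketch below builds the custom-resource manifest for one trial as a plain dictionary. It is not the code in this PR: the helper name `build_trial_manifest`, the naming scheme, the label key, and the CRD coordinates (`adaptdl.petuum.com/v1`, `AdaptDLJob`) are assumptions made for illustration and should be checked against the AdaptDL installation in use.

```python
from typing import Any, Dict, List

# Assumed CRD coordinates for illustration; verify against the installed CRD.
GROUP = "adaptdl.petuum.com"
VERSION = "v1"
KIND = "AdaptDLJob"


def build_trial_manifest(experiment_id: str, trial_id: str,
                         image: str, command: List[str]) -> Dict[str, Any]:
    """Build a custom-resource manifest for one trial (hypothetical helper)."""
    # Kubernetes object names must be lowercase, so normalize the IDs.
    name = f"nni-exp-{experiment_id}-trial-{trial_id}".lower()
    return {
        "apiVersion": f"{GROUP}/{VERSION}",
        "kind": KIND,
        "metadata": {
            "name": name,
            # Hypothetical label so all trials of an experiment can be listed.
            "labels": {"nni-experiment": experiment_id.lower()},
        },
        "spec": {
            "template": {
                "spec": {
                    "containers": [
                        {"name": "main", "image": image, "command": command}
                    ]
                }
            }
        },
    }


manifest = build_trial_manifest("Ab12", "X9", "example/image:latest",
                                ["python", "train.py"])
print(manifest["metadata"]["name"])
```

A training service would then submit, watch, and delete such objects through the Kubernetes custom-objects API. Keeping manifest construction separate from the API calls makes that part testable without a cluster.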

Outline

During development, we followed one principle: the changes added for the new training service must NOT affect any feature that NNI already has, so that all existing NNI functionality behaves the same as on master.

This pull request consists of the following parts:

AdlTrainingService and its related features, all compatible with nnictl:
- Core implementation and the corresponding config schema update (we also include an example config for the mnist-pytorch example)
- Internal storage (PVC) and external storage (NFS) support
- Tensorboard support
- Log streaming and collection support

A new in-memory data field, message, in the job info, which displays a short status-related message for a trial job alongside its status. For example, when an AdaptDL trial job is pending, the message can tell whether it is "waiting in the queue" or "pulling the image", etc. The NNI status and message together should be informative enough that users no longer need to query Kubernetes directly for more information.
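The idea of pairing a status with a short message can be sketched as follows. This is a simplified model, not NNI's actual data structures: `TrialJobInfo`, `PENDING_MESSAGES`, `describe_pending`, and `render` are hypothetical names, and the two scheduler-side reasons are only examples of what a pending job might report.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TrialJobInfo:
    """Hypothetical, simplified stand-in for a trial job record."""
    job_id: str
    status: str                    # e.g. "WAITING", "RUNNING", "FAILED"
    message: Optional[str] = None  # short, in-memory explanation of the status


# Illustrative mapping from a cluster-side reason to a short message.
PENDING_MESSAGES = {
    "Unschedulable": "waiting in the queue",
    "ContainerCreating": "pulling the image",
}


def describe_pending(job_id: str, reason: str) -> TrialJobInfo:
    """Turn a pending reason into a status plus a human-readable message."""
    message = PENDING_MESSAGES.get(reason, f"pending: {reason}")
    return TrialJobInfo(job_id=job_id, status="WAITING", message=message)


def render(job: TrialJobInfo) -> str:
    """Format the status, appending the message when one is present."""
    if job.message:
        return f"{job.job_id}: {job.status} ({job.message})"
    return f"{job.job_id}: {job.status}"


print(render(describe_pending("trial-1", "ContainerCreating")))
```

Both pending cases share the same status, so existing status-based logic stays untouched while the message carries the extra detail.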

Test

Note that the core test cases for the new Kubernetes training service in this PR are integration tests, which need a Kubernetes cluster to run. If the integration tests are skipped, as is done for the NNI Kubeflow training service, all original unit-test cases behave the same as on master.

We hope it can be merged after review discussions and any needed changes.
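One common way to keep cluster-dependent tests out of the default run is an opt-in guard. The sketch below shows the pattern with Python's `unittest`. It is not how this PR's tests are written, and the environment variable `ADL_INTEGRATION_TEST` is a hypothetical flag, not one that NNI defines.

```python
import os
import unittest

# Hypothetical opt-in flag; not a variable that NNI itself defines.
CLUSTER_FLAG = "ADL_INTEGRATION_TEST"


def cluster_available() -> bool:
    """Treat the cluster as available only when the flag is set to "1"."""
    return os.environ.get(CLUSTER_FLAG) == "1"


class AdlIntegrationTest(unittest.TestCase):
    def test_submit_trial(self):
        if not cluster_available():
            self.skipTest("needs a Kubernetes cluster")
        # A real integration test would submit a trial job here and
        # assert on its reported status.


def run_suite() -> unittest.TestResult:
    """Run the integration tests and return the collected result."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(AdlIntegrationTest)
    result = unittest.TestResult()
    suite.run(result)
    return result


result = run_suite()
print(f"ran={result.testsRun} skipped={len(result.skipped)}")
```

With the flag unset, the test is reported as skipped rather than failed, so the remaining unit tests run unchanged.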

If you guys have any questions, let us know!

Thank you! 👍

ghost · 2020-10-22T22:00:30Z

All CLA requirements met.

pengwu22 · 2020-10-28T19:21:13Z

We want to have a review meeting with you to discuss the integration. Our PM will contact you soon.

Cool! Thanks. Looking forward to it. 👍

Hi @pw2393 - This is Scarlett, PM of the project. First, thanks for contributing to NNI; we're glad to see another training service that can benefit users of a new AI platform. As this PR is pretty big, shall we book a 30-minute to 1-hour PR review sometime this week? I will also invite the NNI devs who work on the training service part. Let me know your email and available timeslot(s) so I can send out the calendar invitation.

Thanks @scarlett2018! Sounds good to me.

Email: opensource@petuum.com

Some availabilities in Pacific Time: this Wednesday 12:00 - 2:00 PM or Thursday 10:30 - 2:00 PM or Friday 10:30 - 2:00 PM.

Hi @scarlett2018, just a follow-up: will there still be a meeting sometime this week?

scarlett2018 · 2020-10-30T09:22:57Z

Hi @pw2393, sorry for the delayed response. This week is not working out for all the reviewers, so I started a Doodle poll for next week. There are currently 4 options; I hope some of them work for you:
https://doodle.com/poll/5ky9scc4xmb3mkh7

pengwu22 · 2020-10-30T15:42:05Z

Thanks @scarlett2018. Slot is selected. Could you send us the meeting invite via email?

scarlett2018 · 2020-11-04T02:47:48Z

Thanks @pw2393. This Wednesday did not work out in the end, so I picked the same time slot next Wednesday and sent out the invitation. I hope it works for you.

pengwu22 · 2020-11-04T16:35:58Z

Thanks @scarlett2018, the mail invite is accepted.

pengwu22 · 2020-11-09T22:18:51Z

@SparkSnail

Please add an AdlMode.md doc under https://github.com/microsoft/nni/tree/master/docs/en_US/TrainingService that describes the training service and explains how to prepare the environment and start experiments.

Yup, we are working on it.

@SparkSnail done here: https://github.com/microsoft/nni/blob/7a73286699a4da148ec7319763ba472a147470c8/docs/en_US/TrainingService/AdaptDLMode.md

SparkSnail · 2020-11-13T08:42:24Z

I'll help add an ADL pipeline and test the code in it. Once the pipeline is ready, this PR can be merged. I will post an update when there is progress.

pengwu22 mentioned this pull request Oct 22, 2020

[Internal Tracking] Stage for OSS petuum/nni#4

Closed

QuanluZhang requested review from SparkSnail, chicm-ms and liuzhe-lz October 23, 2020 02:27

pengwu22 and others added 25 commits October 23, 2020 11:28

A light pipeline to build, lint, test, release internally

4ed72c4

Add AdaptDL

41bade3

Add NFS support

a89ce3c

Support show trials in WebUI without requiring metrics reported

fea7286

Resolve BE-12443 "Dev"

f45728b

Deepcopy the templates. Avoids readding mounts and volumes

9afe3fe

Handle trial msg

8b78311

[BE-12469] Support adaptdl signal handling

cdc3a76

make checkpoint optional in config file

a9cfeb9

image pulling error handling added

1974390

nnictl log

0db3111

fix backward incompatible changes after rebase

0177886

fix nnictl tensorboard start allowing optional experiment id

e21e884

waiting doesn't exist bug fix

629804f

Resolve BE-12465: add fail msg and resource config

918cbb8

BE-12510: nni-tensorboard issue fixed

7ff5fd2

AdaptDL-Compatible Python CLI APIs Example

d129e23

Hide KubeConfig for Going Public

b24ba6c

Integrate Bert finetuning model as an example in NNI

a20b33d

tensorboard ui and web ui: at the same ip

1e8c461

adaptive support (#1)

4cc7cec

Simplify Examples

2096a9e

Cleanup: General

2e9b767

cleanup: webui

565d807

cleanup: sanity

9c76d39

comments: message value-write-read

1ee7de8

scarlett2018 requested review from SparkSnail and scarlett2018 October 28, 2020 02:47

microsoft deleted a comment from NiuChan1301 Oct 28, 2020

pengwu22 added 2 commits October 28, 2020 01:17

unit test

55c25c0

unit test

af315ca

unit test

d7afaab

doc

7a73286

pengwu22 added 6 commits November 10, 2020 14:33

toc tree reference

91fd236

lint

a0dc313

toctree

6825a5e

intermediate seq

0024670

doc fix

c85798d

intermediate sequence

81d6c29

SparkSnail reviewed Nov 11, 2020

docs/en_US/TrainingService/AdaptDLMode.md (outdated, resolved)

pengwu22 added 2 commits November 11, 2020 15:59

import fix

d527a89

===

9ef620f

SparkSnail approved these changes Nov 20, 2020

hao-howard-zhang added 3 commits November 20, 2020 15:21

add adl cifar10 example

cdb268e

rename tensorboard dir env var

6a32f17

improve adaptdl doc

7dd3ba0

liuzhe-lz approved these changes Nov 23, 2020

SparkSnail merged commit 6518e0b into microsoft:dev-adl Nov 23, 2020
