Skip to content
This repository has been archived by the owner on Sep 18, 2024. It is now read-only.

DLTS integration #1945

Merged
merged 28 commits into from
Mar 2, 2020
Merged

DLTS integration #1945

merged 28 commits into from
Mar 2, 2020

Conversation

Gerhut
Copy link
Member

@Gerhut Gerhut commented Jan 10, 2020

This PR added DLTS integration to NNI.

Key Points:

  1. Run both NNI mamager / NNI Trial in DLTS job, as manager job / trial job.
  2. User should copy the trial code / NNI config to Samba, which will be synced to NFS by DLTS.
  3. Both manager job and trial job will share the same NFS mounting, includes user's trial code and config, so no copy operation needed.

@Gerhut Gerhut marked this pull request as ready for review February 19, 2020 14:12
@@ -0,0 +1,42 @@
import { DLTSClusterConfig } from "./dltsClusterConfig";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you please add license at file beginning and a empty line at end?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

this.dltsRestServerPort = restServer.clusterRestServerPort;
}

// Step 1. Prepare PAI job configuration
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment is out of date

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

const parameterFileMeta = {
experimentId: this.experimentId,
trialId: trialJobId,
// filePath: hdfsHpFilePath
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove the commented field?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

#choice: maximize, minimize
optimize_mode: maximize
trial:
command: python3 /work/nni-code/mnist.py
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

python3 mninst.py

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done and tested

@chicm-ms
Copy link
Contributor

Will you add document about dlts training service?

break;
case 'pausing':
dltsTrialJob.status = "RUNNING";
dltsTrialJob.status = "RUNNING";
Copy link
Contributor

@SparkSnail SparkSnail Feb 20, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

duplicated line. If the job in dlts is in 'pausing' status, means the job is running?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In DLTS job could be temporary "paused" (stop, release resources but easy to restart). I have not found a more suitable NNI status here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Gerhut just curious, how users restart a paused job? the job starts from the beginning or could automatically resume the state?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Starts from beginning, user should handle checkpoints by themselves.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could add a new status in NNI, I think RUNNING is not suitable here, since the job is not running.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can I make it WAITING?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is different from 'waiting'. if we show this state, users will be confused: why a job becomes waiting again after running.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure how much affection if I introduced a new state into NNI? At least shall I provide a new style of the PAUSED state to the NNI dashboard?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as we discussed, if a job is in PAUSED state, simply cancel it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

VERSION_CHECK = 'version_check',
LOG_COLLECTION = 'log_collection'
LOG_COLLECTION = 'log_collection',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove the comma

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

while (!this.stopping) {
while (!this.stopping && this.jobQueue.length > 0) {
const trialJobId: string = this.jobQueue[0];
this.log.info('Got job ' + trialJobId);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

format to Got job ${trialJobId}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

if (!this.dltsClusterConfig.cluster) {
this.dltsClusterConfig.cluster = '.default'
}
if (!this.dltsClusterConfig.email && process.env['DLWS_USER_EMAIL']) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If users does not set email in config file, and the environment does not contain DLWS_USER_EMAIL, what will happen?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

403 Forbidden

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest add validation in nnictl, and give meaningful error information.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

resolve(gpus[0])
})
});
this.dltsClusterConfig.gpuType = gpu;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can users set gpuType in config file?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes and no since there is only one choice at present, but it have to be configured correctly or DLTS could not schedule it correctly.

DLTS plans to support cluster with multiple gpuTypes in the future.

this.log.info('Stopping DLTS training service...');
this.stopping = true;

const deferred: Deferred<void> = new Deferred<void>();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove Deferred, refer

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

const version: string = this.versionCheck ? await getVersion() : '';
const nniDLTSTrialCommand: string = String.Format(
DLTS_TRIAL_COMMAND_FORMAT,
trialLocalFolder,
Copy link
Contributor

@SparkSnail SparkSnail Feb 20, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is NNI_SYS_DIR a local folder? Is the trial job running in local?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both manager job and trial job will share the same NFS mounting, they share the same directory structure.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the path which const trialLocalFolder = path.join(getExperimentRootDir(), 'trials-local', trialJobId) specified a NFS sharing folder? Does users need to know this folder path and execute mount command before they start NNI job?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TLDR; we will provide a NNI manager job template in DLTS to handle these mounting for user.

In my understanding, there will be two directories in use in NNI:

  • Config directory, since users have to put config / codes to samba directory first, which will synced to NFS by DLTS.
  • tempdir to generate transpiled trial code, we temporarily set TMPDIR=/work/tmp to make NNI generate transpiled code to mounted directories, it's better to make it configurable.


export class DLTSTrialConfig extends TrialConfig {
public constructor(
command: string,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unify code style, all fields use public statement, or all of them does not use public.

Copy link
Member Author

@Gerhut Gerhut Feb 20, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

image is actually Parameter properties and other 3 will be passed to the super constructor.

@Gerhut
Copy link
Member Author

Gerhut commented Feb 20, 2020

How did you workaround eslint/eslint#11899 ? I have lots of these lint errors currently.

@@ -0,0 +1,52 @@
**Run an Experiment on Deep Learning Training Service**
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shall we use the official name (DLWorkspace) of the project here? or the well known aka. DLTS.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ping @hongzhili for suggestion.

Copy link
Member Author

@Gerhut Gerhut Feb 21, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to DLTS

@@ -0,0 +1,52 @@
**Run an Experiment on Deep Learning Training Service**
===
NNI supports running an experiment on [Deep Learning Training Service](https://github.com/microsoft/DLWorkspace.git) (aka DLTS), called dlts mode. Before starting to use NNI dlts mode, you should have an account to access DLTS dashboard.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Deep Learning Training Service]
same comments about this name usage.

Copy link
Member Author

@Gerhut Gerhut Feb 21, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to DLTS


Step 2. Prepare a NNI config YAML like the following:

```yaml
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of directly post an yaml example, you might want to outline the new field like trainingServicePlatform: dlts, or additional keys comparing to LocalMode and RemoteMachineMode, refer to https://github.com/microsoft/nni/blob/master/docs/en_US/TrainingService/PaiMode.md

Copy link
Member Author

@Gerhut Gerhut Feb 21, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed all comments here other than DLTS specified ones.


![Submit Job](../../img/dlts-step4.png)

Step 5. Go to Endpoints tab of the newly created job, click the Port 40000 link to theck trial's information.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

theck
typo?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

@chicm-ms
Copy link
Contributor

Would you please fix following eslint error reported by IT pipeline?

/home/vsts/work/1/s/src/nni_manager/training_service/dlts/dltsTrainingService.ts
   72:25  warning  'trialJobId' is assigned a value but never used                                                                               @typescript-eslint/no-unused-vars
  111:17  error    Possible race condition: `dltsTrialJob.status` might be reassigned based on an outdated value of `dltsTrialJob.status`        require-atomic-updates
  114:17  error    Possible race condition: `dltsTrialJob.status` might be reassigned based on an outdated value of `dltsTrialJob.status`        require-atomic-updates
  116:21  error    Possible race condition: `dltsTrialJob.startTime` might be reassigned based on an outdated value of `dltsTrialJob.startTime`  require-atomic-updates
  119:21  error    Possible race condition: `dltsTrialJob.url` might be reassigned based on an outdated value of `dltsTrialJob.url`              require-atomic-updates
  123:17  error    Possible race condition: `dltsTrialJob.status` might be reassigned based on an outdated value of `dltsTrialJob.status`        require-atomic-updates
  126:17  error    Possible race condition: `dltsTrialJob.status` might be reassigned based on an outdated value of `dltsTrialJob.status`        require-atomic-updates
  130:17  error    Possible race condition: `dltsTrialJob.status` might be reassigned based on an outdated value of `dltsTrialJob.status`        require-atomic-updates
  135:21  error    Possible race condition: `dltsTrialJob.status` might be reassigned based on an outdated value of `dltsTrialJob.status`        require-atomic-updates
  138:21  error    Possible race condition: `dltsTrialJob.status` might be reassigned based on an outdated value of `dltsTrialJob.status`        require-atomic-updates
  142:17  error    Possible race condition: `dltsTrialJob.status` might be reassigned based on an outdated value of `dltsTrialJob.status`        require-atomic-updates
  334:37  warning  'key' is defined but never used                                                                                               @typescript-eslint/no-unused-vars
  474:39  warning  'res' is defined but never used  

@liuzhe-lz liuzhe-lz merged commit 134368f into microsoft:master Mar 2, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants