24e3fde to 5426c11
@@ -0,0 +1,42 @@
import { DLTSClusterConfig } from "./dltsClusterConfig";
Would you please add a license header at the beginning of the file and an empty line at the end?
Done
this.dltsRestServerPort = restServer.clusterRestServerPort;
}

// Step 1. Prepare PAI job configuration
This comment is out of date
Done
const parameterFileMeta = {
experimentId: this.experimentId,
trialId: trialJobId,
// filePath: hdfsHpFilePath
Remove the commented-out field?
Done
#choice: maximize, minimize
optimize_mode: maximize
trial:
command: python3 /work/nni-code/mnist.py
`python3 mnist.py`
Done and tested
Will you add documentation for the DLTS training service?
break;
case 'pausing':
dltsTrialJob.status = "RUNNING";
dltsTrialJob.status = "RUNNING";
Duplicated line. Also, if the job in DLTS is in 'pausing' status, does that mean the job is running?
In DLTS, a job can be temporarily "paused" (stopped with its resources released, but easy to restart). I have not found a more suitable NNI status here.
@Gerhut just curious, how do users restart a paused job? Does the job start from the beginning, or can it automatically resume its state?
It starts from the beginning; users should handle checkpoints themselves.
You could add a new status in NNI. I think `RUNNING` is not suitable here, since the job is not running.
Can I make it `WAITING`?
I think it is different from 'waiting'. If we show this state, users will be confused about why a job becomes waiting again after running.
I'm not sure how much impact introducing a new state into NNI would have. At the least, should I provide a new style for the `PAUSED` state in the NNI dashboard?
As we discussed, if a job is in the `PAUSED` state, simply cancel it.
Done
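For illustration, a minimal TypeScript sketch of where this thread ended up: a paused job is cancelled rather than reported as `RUNNING` or `WAITING`. Only the 'pausing' status string comes from the diff above. The 'paused' and 'running' strings, the choice of `SYS_CANCELED` as the resulting NNI status, and the names `StatusDecision` and `decideTrialStatus` are assumptions, not code from this PR.

```typescript
// Subset of NNI trial statuses used by this sketch.
type NniStatus = 'RUNNING' | 'SYS_CANCELED' | 'UNKNOWN';

// Hypothetical result type: the status to show, and whether to cancel the DLTS job.
interface StatusDecision {
    nniStatus: NniStatus;
    cancelOnCluster: boolean;
}

// Hypothetical helper; the real status handling in the PR may look different.
function decideTrialStatus(dltsStatus: string): StatusDecision {
    switch (dltsStatus) {
        case 'running': // assumed DLTS status string
            return { nniStatus: 'RUNNING', cancelOnCluster: false };
        case 'pausing': // from the diff under review
        case 'paused':  // assumed DLTS status string
            // A paused job restarts from the beginning, so cancel it
            // instead of inventing a new NNI state for it.
            return { nniStatus: 'SYS_CANCELED', cancelOnCluster: true };
        default:
            return { nniStatus: 'UNKNOWN', cancelOnCluster: false };
    }
}
```

Returning a decision object keeps the mapping free of side effects, so the caller decides how to issue the actual cancel request.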
VERSION_CHECK = 'version_check',
LOG_COLLECTION = 'log_collection'
LOG_COLLECTION = 'log_collection',
remove the comma
Done
while (!this.stopping) {
while (!this.stopping && this.jobQueue.length > 0) {
const trialJobId: string = this.jobQueue[0];
this.log.info('Got job ' + trialJobId);
Format as `Got job ${trialJobId}`
Done
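A small sketch of the requested change, using a template literal in place of string concatenation. The helper name `formatGotJob` is hypothetical; in the PR the message is passed straight to `this.log.info`.

```typescript
// Hypothetical helper, only here to make the formatting testable in isolation.
function formatGotJob(trialJobId: string): string {
    // Template literal instead of 'Got job ' + trialJobId
    return `Got job ${trialJobId}`;
}
```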
if (!this.dltsClusterConfig.cluster) {
this.dltsClusterConfig.cluster = '.default'
}
if (!this.dltsClusterConfig.email && process.env['DLWS_USER_EMAIL']) {
If users do not set email in the config file and the environment does not contain `DLWS_USER_EMAIL`, what will happen?
403 Forbidden
Suggest adding validation in nnictl and giving a meaningful error message.
Done
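A hedged sketch of the fallback discussed in this thread, failing early with a readable message instead of surfacing a 403 later. It is written in TypeScript for consistency with the diff; the validation the reviewer asked for lives in nnictl and is not shown here. The function name `resolveDltsEmail` and the error text are assumptions, and the environment is passed in as a plain object so the sketch does not depend on `process.env`.

```typescript
// Hypothetical helper: pick the email from the config, else from DLWS_USER_EMAIL.
function resolveDltsEmail(
    configEmail: string | undefined,
    env: Record<string, string | undefined>
): string {
    const email = configEmail || env['DLWS_USER_EMAIL'];
    if (!email) {
        // Fail before any request is sent to the DLTS REST server.
        throw new Error(
            'DLTS email is not set: add it to the config file or export DLWS_USER_EMAIL'
        );
    }
    return email;
}
```

The config value wins when both are present, which matches the `!this.dltsClusterConfig.email && process.env['DLWS_USER_EMAIL']` check in the diff.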
resolve(gpus[0])
})
});
this.dltsClusterConfig.gpuType = gpu;
Can users set `gpuType` in the config file?
Yes and no: there is only one choice at present, but it has to be configured correctly or DLTS cannot schedule the job correctly.
DLTS plans to support clusters with multiple gpuTypes in the future.
this.log.info('Stopping DLTS training service...');
this.stopping = true;

const deferred: Deferred<void> = new Deferred<void>();
remove Deferred, refer
Done
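One possible reading of this suggestion, sketched under assumptions: an `async` method already returns a `Promise<void>`, so no hand-resolved `Deferred` is needed. The class name `StoppableServiceSketch` and the `isStopping` accessor are hypothetical, and the actual shutdown work in the PR is not reproduced.

```typescript
class StoppableServiceSketch {
    private stopping: boolean = false;

    // Deferred style would create a Deferred<void>, resolve it by hand and
    // return deferred.promise. An async method gives the same Promise<void>.
    public async cleanUp(): Promise<void> {
        this.stopping = true;
        // Real shutdown work (not shown, hypothetical) would be awaited here.
    }

    public isStopping(): boolean {
        return this.stopping;
    }
}
```

Callers can still `await service.cleanUp()` exactly as they would await `deferred.promise`.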
const version: string = this.versionCheck ? await getVersion() : '';
const nniDLTSTrialCommand: string = String.Format(
DLTS_TRIAL_COMMAND_FORMAT,
trialLocalFolder,
Why is `NNI_SYS_DIR` a local folder? Is the trial job running locally?
Both the manager job and the trial job share the same NFS mount, so they share the same directory structure.
Is the path specified by `const trialLocalFolder = path.join(getExperimentRootDir(), 'trials-local', trialJobId)` an NFS shared folder? Do users need to know this folder path and run a mount command before they start an NNI job?
TL;DR: we will provide an NNI manager job template in DLTS to handle this mounting for the user.
In my understanding, there will be two directories in use in NNI:
- Config directory, since users have to put config / code in the samba directory first, which will be synced to NFS by DLTS.
- `tempdir`, used to generate transpiled trial code. We temporarily set `TMPDIR=/work/tmp` to make NNI generate transpiled code in mounted directories; it would be better to make it configurable.

export class DLTSTrialConfig extends TrialConfig {
public constructor(
command: string,
Unify the code style: either all fields use the `public` modifier, or none of them do.
`image` is actually a parameter property; the other 3 will be passed to the super constructor.
How did you work around eslint/eslint#11899? I currently have lots of these lint errors.
@@ -0,0 +1,52 @@
**Run an Experiment on Deep Learning Training Service**
Shall we use the official name of the project (DLWorkspace) here, or the well-known alias, DLTS?
Ping @hongzhili for suggestion.
Changed to DLTS
@@ -0,0 +1,52 @@
**Run an Experiment on Deep Learning Training Service**
===
NNI supports running an experiment on [Deep Learning Training Service](https://github.com/microsoft/DLWorkspace.git) (aka DLTS), called dlts mode. Before starting to use NNI dlts mode, you should have an account to access DLTS dashboard.
`[Deep Learning Training Service]`: same comment about the name usage.
Changed to DLTS

Step 2. Prepare a NNI config YAML like the following:

```yaml |
Instead of directly posting a YAML example, you might want to outline the new fields, such as trainingServicePlatform: dlts, or the additional keys compared to LocalMode and RemoteMachineMode; refer to https://github.com/microsoft/nni/blob/master/docs/en_US/TrainingService/PaiMode.md
Removed all comments here other than the DLTS-specific ones.

![Submit Job](../../img/dlts-step4.png)

Step 5. Go to Endpoints tab of the newly created job, click the Port 40000 link to theck trial's information.
`theck`: typo?
Fixed
Would you please fix the following eslint error reported by the IT pipeline?
This PR added DLTS integration to NNI.
Key Points: