-
Notifications
You must be signed in to change notification settings - Fork 1.8k
Conversation
merge master
merge master
Update evolution doc (microsoft#1493)
merge master
merge master
merge master
augment pylintrc (microsoft#1643)
fix console.log (microsoft#1636)
merge master
merge master
merge master
merge master
Filter prune algo implementation (microsoft#1655)
merge master
merge master
merge master
merge master
merge master
merge master
merge master
merge master
merge master
merge master
Support monitor mode when creating or resuming a new experiment (microsoft#1933)
Add test for documentation build (microsoft#1924)
fix pipeline status badge (microsoft#1942)
tools/nni_cmd/launcher_utils.py
Outdated
print_error('Please set paiStoragePlugin field, or set plugin in your own paiConfig!') | ||
exit(1) | ||
experiment_config['trial']['paiStoragePlugin'] = pai_config['extras']['com.microsoft.pai.runtimeplugin'][0]['plugin'] | ||
if pai_config.get('taskRoles', {}).get('taskrole', {}).get('commands', [])[0] is None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Default of commands
is []
. Using [0]
would raise an error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed.
tools/nni_cmd/launcher_utils.py
Outdated
if pai_config.get('taskRoles', {}).get('taskrole', {}).get('commands', [])[0] is None: | ||
print_error('Please set paiStoragePlugin field, or set plugin in your own paiConfig!') | ||
exit(1) | ||
experiment_config['trial']['command'] = pai_config['taskRoles']['taskrole']['commands'][0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems that the support of PAI configuration is quite limited:
- Taskrole with name
taskrole
needs to exist. - Only commands in
taskrole
will be wrapped with trial keeper. - Multi-line commands are not supported.
Better states that in docs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, uses need to provide commands
field in config file, which is under taskrole
field.
} | ||
let paiJobConfig: any = undefined; | ||
if (this.paiTrialConfig.paiConfigPath) { | ||
paiJobConfig = yaml.safeLoad(fs.readFileSync(this.paiTrialConfig.paiConfigPath, 'utf8')); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bad indentation. Will this actually pass linter?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed.
} catch (error) { | ||
this.log.error(`Error occurs during loading and merge ${this.paiTrialConfig.paiConfigPath} : ${error}`); | ||
if (this.paiTrialConfig.virtualCluster) { | ||
paiJobConfig.defaults= { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing whitespace.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is eslint correctly enabled?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
eslint is enabled, but seems didn't detect whitespace errors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@chicm-ms could you help confirm that our current eslint config does not check whitespace?
let paiJobConfig: any = undefined; | ||
if (this.paiTrialConfig.paiConfigPath) { | ||
paiJobConfig = yaml.safeLoad(fs.readFileSync(this.paiTrialConfig.paiConfigPath, 'utf8')); | ||
paiJobConfig.name = jobName; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually I found the job name nni generated are not human-readable and quite meaningless. I hope that the job name specified in PAI configuration can be used as a prefix of jobName. For example, if in my PAI yaml, name
is proxylessnas
, here paiJobConfig.name = "proxylessnas_nni_bdef1234_12345"
. However, having a long jobname is also troublesome on PAI. It's up to you to decide.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, refactor jobName
is another topic, this pr will keep consistent with original design temporarily.
@@ -144,66 +143,60 @@ class PAIK8STrainingService extends PAITrainingService { | |||
throw new Error('trial config is not initialized'); | |||
} | |||
const jobName = `nni_exp_${this.experimentId}_trial_${trialJobId}` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think showing experimentName
in jobName would be more user-friendly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Experiment Id will be shown in NNICTL commands, it is easier to get for users..
} | ||
let paiJobConfig: any = undefined; | ||
if (this.paiTrialConfig.paiConfigPath) { | ||
paiJobConfig = yaml.safeLoad(fs.readFileSync(this.paiTrialConfig.paiConfigPath, 'utf8')); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If my paiJobConfig is accidentally changed or deleted during the run of experiment, it will submit inconsistent jobs. RIght?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, paiJobConfig file is read when new trial job is created, it may be changed by users.
if (this.paiTrialConfig.paiConfigPath) { | ||
paiJobConfig = yaml.safeLoad(fs.readFileSync(this.paiTrialConfig.paiConfigPath, 'utf8')); | ||
paiJobConfig.name = jobName; | ||
paiJobConfig.taskRoles.taskrole.commands = [command] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(I can't comment on L266 of this file.) Better to escape special characters in command. But it's not necessary for this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so there is only one key (i.e., taskrole
) under taskRoles
? what if there are multiple keys?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NNI only parse the first taskrole, users should provide multiple taskroles in their configuration.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you means only the first taskrole's command is wrapped by trial keeper? other taskroles' command is kept as it is
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not the first taskrole, but the taskrole named "taskrole".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
update logic, NNI will use all of taskRoles now.
@SparkSnail please update doc to make the usage very clear. |
updated. |
* paiConfigPath | ||
* Optional key. Set the file path of pai job configuration, the file is in yaml format. | ||
If users set paiConfigPath in NNI's configuration file, the `command`, `paiStoragePlugin`, `virtualCluster`, `image`, `memoryMB`, `cpuNum`, `gpuNum` in `trial` filed will lose efficacy, and configurations from `paiConfigPath` will take effect. Notice that if users set `paiConfigPath`, NNI will check the validation of `commands` under `taskRoles/taskRole/commands`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggest to explain the merge rule when user specify paiConfigPath
, for example, jobName
is changed by NNI to become a random string?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some notes (limitations) I'm thinking of:
- Taskrole with name
taskrole
needs to exist. - Only commands in
taskrole
will be wrapped with trial keeper. Other taskroles will be kept as is. - Multi-line commands are not supported. Only the first command will be kept and wrapped with trial keeper.
- Special characters (like
"
and\
) are not supported in commands. jobName
specified in the original PAI configuration will be ignored.- PAI configuration needs to be kept unchanged during the whole experiment. If it's changed or deleted, either accidentally or not, nni will use the updated configuration file to submit new jobs.
Although some of these "limitations" are by design, I think, it's still "good to know" for users.
Also, what is "efficacy"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated.
- add explication for jobName in doc.
- remove hardcode
taskrole
, traversal all taskRoles instead. - wrap all taskRoles' command.
- multi-line command will be connected by
&&
. - use
\
to transfer meaning for"
and\
in command. - In current design, jobName in original PAI configuration will be replaced to NNI's job name, If we add original jobName in NNI experiment, it may cause name-too-long exception in PAI.
- only read original PAI configuration once in experiment, it will keep consistent in all trial jobs.
Note: | ||
|
||
1. If users set multiple taskRoles in PAI's configuration file, NNI will wrap all of these taksRoles and start multiple tasks in one trial job, users should ensure that only one taskRole report metric to NNI, otherwise there might be some conflict error. | ||
2. The job name in PAI's configuration file will be replaced by a new job name, the new job name is created by NNI, the name format is nni_exp_${this.experimentId}_trial_${trialJobId} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
better to switch point 1 and point 2.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed.
* paiConfigPath | ||
* Optional key. Set the file path of pai job configuration, the file is in yaml format. | ||
If users set paiConfigPath in NNI's configuration file, the `command`, `paiStoragePlugin`, `virtualCluster`, `image`, `memoryMB`, `cpuNum`, `gpuNum` in `trial` filed will be replaced by configurations from `paiConfigPath`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If users set paiConfigPath
in NNI's configuration file, no need to specify the fields command
, paiStoragePlugin
, virtualCluster
, image
, memoryMB
, cpuNum
, gpuNum
in trial
configuration. These fields will use the values from the config file specified by paiConfigPath
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed.
@SparkSnail , suggest to add UT or IT to test this merge feature |
No description provided.