Skip to content
This repository has been archived by the owner on Sep 18, 2024. It is now read-only.

Update pai yaml merge method #2369

Merged
merged 65 commits into from
May 6, 2020

Conversation

SparkSnail
Copy link
Contributor

No description provided.

SparkSnail and others added 30 commits August 6, 2019 18:58
Filter prune algo implementation (microsoft#1655)
Support monitor mode when creating or resuming a new experiment (microsoft#1933)
Add test for documentation build (microsoft#1924)
@SparkSnail SparkSnail requested a review from ultmaster April 24, 2020 08:49
print_error('Please set paiStoragePlugin field, or set plugin in your own paiConfig!')
exit(1)
experiment_config['trial']['paiStoragePlugin'] = pai_config['extras']['com.microsoft.pai.runtimeplugin'][0]['plugin']
if pai_config.get('taskRoles', {}).get('taskrole', {}).get('commands', [])[0] is None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Default of commands is []. Using [0] would raise an error.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed.

if pai_config.get('taskRoles', {}).get('taskrole', {}).get('commands', [])[0] is None:
print_error('Please set paiStoragePlugin field, or set plugin in your own paiConfig!')
exit(1)
experiment_config['trial']['command'] = pai_config['taskRoles']['taskrole']['commands'][0]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems that the support of PAI configuration is quite limited:

  • Taskrole with name taskrole needs to exist.
  • Only commands in taskrole will be wrapped with trial keeper.
  • Multi-line commands are not supported.

Better states that in docs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, uses need to provide commands field in config file, which is under taskrole field.

}
let paiJobConfig: any = undefined;
if (this.paiTrialConfig.paiConfigPath) {
paiJobConfig = yaml.safeLoad(fs.readFileSync(this.paiTrialConfig.paiConfigPath, 'utf8'));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bad indentation. Will this actually pass linter?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed.

} catch (error) {
this.log.error(`Error occurs during loading and merge ${this.paiTrialConfig.paiConfigPath} : ${error}`);
if (this.paiTrialConfig.virtualCluster) {
paiJobConfig.defaults= {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing whitespace.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is eslint correctly enabled?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

eslint is enabled, but seems didn't detect whitespace errors.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@chicm-ms could you help confirm that our current eslint config does not check whitespace?

let paiJobConfig: any = undefined;
if (this.paiTrialConfig.paiConfigPath) {
paiJobConfig = yaml.safeLoad(fs.readFileSync(this.paiTrialConfig.paiConfigPath, 'utf8'));
paiJobConfig.name = jobName;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually I found the job name nni generated are not human-readable and quite meaningless. I hope that the job name specified in PAI configuration can be used as a prefix of jobName. For example, if in my PAI yaml, name is proxylessnas, here paiJobConfig.name = "proxylessnas_nni_bdef1234_12345". However, having a long jobname is also troublesome on PAI. It's up to you to decide.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, refactor jobName is another topic, this pr will keep consistent with original design temporarily.

@@ -144,66 +143,60 @@ class PAIK8STrainingService extends PAITrainingService {
throw new Error('trial config is not initialized');
}
const jobName = `nni_exp_${this.experimentId}_trial_${trialJobId}`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think showing experimentName in jobName would be more user-friendly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Experiment Id will be shown in NNICTL commands, it is easier to get for users..

}
let paiJobConfig: any = undefined;
if (this.paiTrialConfig.paiConfigPath) {
paiJobConfig = yaml.safeLoad(fs.readFileSync(this.paiTrialConfig.paiConfigPath, 'utf8'));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If my paiJobConfig is accidentally changed or deleted during the run of experiment, it will submit inconsistent jobs. RIght?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, paiJobConfig file is read when new trial job is created, it may be changed by users.

if (this.paiTrialConfig.paiConfigPath) {
paiJobConfig = yaml.safeLoad(fs.readFileSync(this.paiTrialConfig.paiConfigPath, 'utf8'));
paiJobConfig.name = jobName;
paiJobConfig.taskRoles.taskrole.commands = [command]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(I can't comment on L266 of this file.) Better to escape special characters in command. But it's not necessary for this PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so there is only one key (i.e., taskrole) under taskRoles? what if there are multiple keys?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NNI only parse the first taskrole, users should provide multiple taskroles in their configuration.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you means only the first taskrole's command is wrapped by trial keeper? other taskroles' command is kept as it is

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not the first taskrole, but the taskrole named "taskrole".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

update logic, NNI will use all of taskRoles now.

@QuanluZhang
Copy link
Contributor

@SparkSnail please update doc to make the usage very clear.

@SparkSnail
Copy link
Contributor Author

@SparkSnail please update doc to make the usage very clear.

updated.

* paiConfigPath
* Optional key. Set the file path of pai job configuration, the file is in yaml format.
If users set paiConfigPath in NNI's configuration file, the `command`, `paiStoragePlugin`, `virtualCluster`, `image`, `memoryMB`, `cpuNum`, `gpuNum` in `trial` filed will lose efficacy, and configurations from `paiConfigPath` will take effect. Notice that if users set `paiConfigPath`, NNI will check the validation of `commands` under `taskRoles/taskRole/commands`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggest to explain the merge rule when user specify paiConfigPath, for example, jobName is changed by NNI to become a random string?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some notes (limitations) I'm thinking of:

  • Taskrole with name taskrole needs to exist.
  • Only commands in taskrole will be wrapped with trial keeper. Other taskroles will be kept as is.
  • Multi-line commands are not supported. Only the first command will be kept and wrapped with trial keeper.
  • Special characters (like " and \) are not supported in commands.
  • jobName specified in the original PAI configuration will be ignored.
  • PAI configuration needs to be kept unchanged during the whole experiment. If it's changed or deleted, either accidentally or not, nni will use the updated configuration file to submit new jobs.

Although some of these "limitations" are by design, I think, it's still "good to know" for users.

Also, what is "efficacy"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.

  1. add explication for jobName in doc.
  2. remove hardcode taskrole, traversal all taskRoles instead.
  3. wrap all taskRoles' command.
  4. multi-line command will be connected by &&.
  5. use \ to transfer meaning for " and \ in command.
  6. In current design, jobName in original PAI configuration will be replaced to NNI's job name, If we add original jobName in NNI experiment, it may cause name-too-long exception in PAI.
  7. only read original PAI configuration once in experiment, it will keep consistent in all trial jobs.

Note:

1. If users set multiple taskRoles in PAI's configuration file, NNI will wrap all of these taksRoles and start multiple tasks in one trial job, users should ensure that only one taskRole report metric to NNI, otherwise there might be some conflict error.
2. The job name in PAI's configuration file will be replaced by a new job name, the new job name is created by NNI, the name format is nni_exp_${this.experimentId}_trial_${trialJobId}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

better to switch point 1 and point 2.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed.

* paiConfigPath
* Optional key. Set the file path of pai job configuration, the file is in yaml format.
If users set paiConfigPath in NNI's configuration file, the `command`, `paiStoragePlugin`, `virtualCluster`, `image`, `memoryMB`, `cpuNum`, `gpuNum` in `trial` filed will be replaced by configurations from `paiConfigPath`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If users set paiConfigPath in NNI's configuration file, no need to specify the fields command, paiStoragePlugin, virtualCluster, image, memoryMB, cpuNum, gpuNum in trial configuration. These fields will use the values from the config file specified by paiConfigPath.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed.

@QuanluZhang
Copy link
Contributor

@SparkSnail , suggest to add UT or IT to test this merge feature

@SparkSnail SparkSnail merged commit ee25377 into microsoft:master May 6, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants