This repository has been archived by the owner on Sep 18, 2024. It is now read-only.

Update pai yaml merge method #2369

Merged
merged 65 commits into from
May 6, 2020
Changes from 60 commits
704b50e
Merge pull request #200 from microsoft/master
SparkSnail Aug 6, 2019
5b0034e
Merge pull request #204 from microsoft/master
SparkSnail Aug 20, 2019
8fe2588
Merge pull request #205 from microsoft/master
SparkSnail Aug 30, 2019
9fae194
Merge pull request #206 from microsoft/master
SparkSnail Sep 16, 2019
c785655
Merge pull request #207 from microsoft/master
SparkSnail Oct 21, 2019
2f5272c
Merge pull request #208 from microsoft/master
SparkSnail Oct 24, 2019
1892bc2
Merge pull request #209 from microsoft/master
SparkSnail Oct 28, 2019
7c1ab11
Merge pull request #210 from microsoft/master
SparkSnail Oct 28, 2019
8c203f3
Merge pull request #211 from microsoft/master
SparkSnail Oct 31, 2019
d7a62f6
check pylint for nni_cmd
SparkSnail Oct 31, 2019
e259d10
fix id error
SparkSnail Oct 31, 2019
4997295
Merge pull request #212 from microsoft/master
SparkSnail Nov 3, 2019
c037a7c
Merge pull request #213 from microsoft/master
SparkSnail Nov 10, 2019
7620e7c
Merge pull request #214 from microsoft/master
SparkSnail Nov 14, 2019
d16dbe9
Merge pull request #215 from microsoft/master
SparkSnail Nov 19, 2019
9ce751d
Merge pull request #216 from microsoft/master
SparkSnail Nov 21, 2019
a0846f2
Merge pull request #217 from microsoft/master
SparkSnail Nov 22, 2019
cd3a912
Merge pull request #218 from microsoft/master
SparkSnail Nov 27, 2019
32efaa3
Merge pull request #219 from microsoft/master
SparkSnail Dec 10, 2019
543239c
Merge pull request #220 from microsoft/master
SparkSnail Dec 12, 2019
36e6e35
Merge pull request #221 from microsoft/master
SparkSnail Dec 19, 2019
f9ee589
Merge pull request #222 from microsoft/master
SparkSnail Dec 24, 2019
b9a7a95
Merge pull request #223 from microsoft/master
SparkSnail Dec 25, 2019
1a5c017
Merge pull request #224 from microsoft/master
SparkSnail Jan 6, 2020
392460a
Merge pull request #225 from microsoft/master
SparkSnail Jan 8, 2020
9bafa4c
Merge pull request #226 from microsoft/master
SparkSnail Jan 8, 2020
c23b807
Merge pull request #227 from microsoft/master
SparkSnail Jan 10, 2020
4132f62
Merge pull request #228 from microsoft/master
SparkSnail Jan 10, 2020
92c2ce7
add merge config
SparkSnail Jan 16, 2020
956b413
fix comments
SparkSnail Jan 19, 2020
0a37820
use deepmerge package
SparkSnail Jan 19, 2020
d07fec1
add semicolon
SparkSnail Jan 19, 2020
a803684
add annotation
SparkSnail Jan 19, 2020
1970f15
set trial config optional
SparkSnail Jan 19, 2020
a1dab9f
add doc
SparkSnail Jan 19, 2020
c58f49b
sort package.json
SparkSnail Jan 20, 2020
4f66d0c
Merge pull request #229 from microsoft/master
SparkSnail Feb 1, 2020
129c4a5
Merge pull request #230 from microsoft/master
SparkSnail Feb 4, 2020
4163f26
add yarn.lock
Feb 5, 2020
e2ceede
revert change
Feb 5, 2020
3fe117f
Merge pull request #231 from microsoft/master
SparkSnail Feb 7, 2020
aa31674
Merge pull request #233 from microsoft/master
SparkSnail Feb 21, 2020
1d74ae5
Merge pull request #234 from microsoft/master
SparkSnail Feb 27, 2020
75028bd
Merge pull request #235 from microsoft/master
SparkSnail Mar 17, 2020
4773c91
Merge pull request #236 from microsoft/master
SparkSnail Mar 18, 2020
3ee0961
Merge pull request #237 from microsoft/master
SparkSnail Mar 20, 2020
0fb7862
Merge pull request #238 from microsoft/master
SparkSnail Mar 26, 2020
6c3148c
Merge pull request #239 from microsoft/master
SparkSnail Apr 3, 2020
b4773e1
Merge pull request #240 from microsoft/master
SparkSnail Apr 11, 2020
6728799
Merge pull request #241 from microsoft/master
SparkSnail Apr 16, 2020
1b9daa3
Merge pull request #242 from microsoft/master
SparkSnail Apr 20, 2020
e0c2c0e
Merge pull request #243 from microsoft/master
SparkSnail Apr 23, 2020
bd9fda5
fix conflict
SparkSnail Apr 23, 2020
0e1ad4f
update pai yaml merge
SparkSnail Apr 23, 2020
61c8eaf
remove deepmerge package
SparkSnail Apr 24, 2020
052b3ec
fix comments
SparkSnail Apr 29, 2020
ce11bf7
add doc for paiConfigPath
SparkSnail Apr 29, 2020
8aca557
fix comments
SparkSnail Apr 29, 2020
8a3768f
update doc
SparkSnail Apr 29, 2020
f4b4775
fix pylint
SparkSnail Apr 29, 2020
e29b58a
Merge pull request #244 from microsoft/master
SparkSnail Apr 30, 2020
caadc04
fix comments
SparkSnail Apr 30, 2020
0516655
remove unused doc
SparkSnail Apr 30, 2020
9201699
format doc
SparkSnail Apr 30, 2020
45ce90d
update yarn.loc
SparkSnail Apr 30, 2020
15 changes: 15 additions & 0 deletions docs/en_US/TrainingService/PaiMode.md
@@ -92,8 +92,17 @@ Compared with [LocalMode](LocalMode.md) and [RemoteMachineMode](RemoteMachineMod
* Required key. Set the mount path in your container used in PAI.
* paiStoragePlugin
* Optional key. Set the storage plugin name used in PAI. If it is not set in trial configuration, it should be set in the config file specified in `paiConfigPath` field.
* command
* Optional key. Set the commands used in PAI container.
* paiConfigPath
* Optional key. Set the file path of the PAI job configuration; the file is in yaml format.
If users set paiConfigPath in NNI's configuration file, the `command`, `paiStoragePlugin`, `virtualCluster`, `image`, `memoryMB`, `cpuNum`, `gpuNum` in the `trial` field will be replaced by configurations from `paiConfigPath`.
Contributor

If users set paiConfigPath in NNI's configuration file, no need to specify the fields command, paiStoragePlugin, virtualCluster, image, memoryMB, cpuNum, gpuNum in trial configuration. These fields will use the values from the config file specified by paiConfigPath.

Contributor Author

fixed.

```
Note:

1. If users set multiple taskRoles in PAI's configuration file, NNI will wrap all of these taskRoles and start multiple tasks in one trial job. Users should ensure that only one taskRole reports metrics to NNI; otherwise there might be conflict errors.
2. The job name in PAI's configuration file will be replaced by a new job name created by NNI. The name format is nni_exp_${this.experimentId}_trial_${trialJobId}
Contributor

better to switch point 1 and point 2.

Contributor Author

fixed.

```
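
To make the job-name rule in the note concrete, here is a minimal TypeScript sketch. The helper names `buildPaiJobName` and `withNniJobName`, the `PaiJobConfigSketch` shape, and the sample IDs are hypothetical; only the `nni_exp_${experimentId}_trial_${trialJobId}` format comes from the documentation text above.

```typescript
// Hypothetical helper: only the name format is taken from the note above.
function buildPaiJobName(experimentId: string, trialJobId: string): string {
    return `nni_exp_${experimentId}_trial_${trialJobId}`;
}

// Minimal stand-in for a job config loaded from paiConfigPath (illustrative shape).
interface PaiJobConfigSketch {
    name: string;
    [key: string]: unknown;
}

// Return a copy whose name is replaced; every other user-supplied field is kept.
function withNniJobName(
    config: PaiJobConfigSketch,
    experimentId: string,
    trialJobId: string
): PaiJobConfigSketch {
    return { ...config, name: buildPaiJobName(experimentId, trialJobId) };
}
```

With these assumed inputs, a config named `my-job` submitted for experiment `abc123` and trial `xYz9` would go out as `nni_exp_abc123_trial_xYz9`, whatever name the user wrote in the file.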


Once you have filled in the NNI experiment config file and saved it (for example, as exp_pai.yml), run the following command
@@ -104,6 +113,12 @@ to start the experiment in pai mode. NNI will create OpenPAI job for each trial,
You can see jobs created by NNI in the OpenPAI cluster's web portal, like:
![](../../img/nni_pai_joblist.jpg)



<center class="half">
<img src="https://github.com/JSong-Jia/NNI-Student-Program-2020/blob/master/QR%20Code.png?raw=true" />
</center>

Notice: In pai mode, NNIManager will start a rest server and listen on a port which is your NNI WebUI's port plus 1. For example, if your WebUI port is `8080`, the rest server will listen on `8081` to receive metrics from trial jobs running in Kubernetes. So you should open TCP port `8081` in your firewall rules to allow incoming traffic.
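
The port rule in the notice is simple arithmetic; a small TypeScript sketch of it follows. The function name `paiRestServerPort` and the range check are assumptions added for illustration and are not part of NNI.

```typescript
// Hypothetical helper illustrating the "WebUI port plus 1" rule from the notice.
function paiRestServerPort(webUiPort: number): number {
    const restPort = webUiPort + 1;
    // Reject non-integer ports and ports whose successor is not a valid TCP port.
    if (Math.floor(webUiPort) !== webUiPort || webUiPort < 1 || restPort > 65535) {
        throw new RangeError(`WebUI port ${webUiPort} leaves no valid port for the rest server`);
    }
    return restPort;
}
```

For a WebUI on `8080`, `paiRestServerPort(8080)` returns `8081`, which is the port the firewall rule has to allow.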

Once a trial job is completed, you can go to NNI WebUI's overview page (like http://localhost:8080/oview) to check the trial's information.
1 change: 0 additions & 1 deletion src/nni_manager/package.json
@@ -13,7 +13,6 @@
"azure-storage": "^2.10.2",
"chai-as-promised": "^7.1.1",
"child-process-promise": "^2.2.1",
- "deepmerge": "^4.2.2",
"express": "^4.16.3",
"express-joi-validator": "^2.0.0",
"js-base64": "^2.4.9",
176 changes: 94 additions & 82 deletions src/nni_manager/training_service/pai/paiK8S/paiK8STrainingService.ts
@@ -44,7 +44,6 @@ import { PAIClusterConfig, PAITrialJobDetail } from '../paiConfig';
import { PAIJobRestServer } from '../paiJobRestServer';

const yaml = require('js-yaml');
- const deepmerge = require('deepmerge');

/**
* Training Service implementation for OpenPAI (Open Platform for AI)
@@ -53,9 +52,11 @@ const deepmerge = require('deepmerge');
@component.Singleton
class PAIK8STrainingService extends PAITrainingService {
protected paiTrialConfig: NNIPAIK8STrialConfig | undefined;

private paiJobConfig: undefined;
private nniVersion: string | undefined;
constructor() {
super();

}

public async setClusterMetadata(key: string, value: string): Promise<void> {
@@ -84,9 +85,13 @@ class PAIK8STrainingService extends PAITrainingService {
this.paiTrialConfig = <NNIPAIK8STrialConfig>JSON.parse(value);
// Validate to make sure codeDir doesn't have too many files
await validateCodeDir(this.paiTrialConfig.codeDir);
if (this.paiTrialConfig.paiConfigPath) {
this.paiJobConfig = yaml.safeLoad(fs.readFileSync(this.paiTrialConfig.paiConfigPath, 'utf8'));
}
break;
case TrialConfigMetadataKey.VERSION_CHECK:
this.versionCheck = (value === 'true' || value === 'True');
this.nniVersion = this.versionCheck ? await getVersion() : '';
break;
case TrialConfigMetadataKey.LOG_COLLECTION:
this.logCollection = value;
@@ -138,72 +143,100 @@ class PAIK8STrainingService extends PAITrainingService {

return trialJobDetail;
}

public generateJobConfigInYamlFormat(trialJobId: string, command: string) {
private generateNNITrialCommand(trialJobDetail: PAITrialJobDetail, command: string) {
if (this.paiTrialConfig === undefined) {
throw new Error('trial config is not initialized');
}
const jobName = `nni_exp_${this.experimentId}_trial_${trialJobId}`
const paiJobConfig: any = {
protocolVersion: 2,
name: jobName,
type: 'job',
jobRetryCount: 0,
prerequisites: [
{
type: 'dockerimage',
uri: this.paiTrialConfig.image,
name: 'docker_image_0'
}
],
taskRoles: {
taskrole: {
instances: 1,
completion: {
minFailedInstances: 1,
minSucceededInstances: -1
},
taskRetryCount: 0,
dockerImage: 'docker_image_0',
resourcePerInstance: {
gpu: this.paiTrialConfig.gpuNum,
cpu: this.paiTrialConfig.cpuNum,
memoryMB: this.paiTrialConfig.memoryMB
},
commands: [
command
]
}
},
extras: {
'com.microsoft.pai.runtimeplugin': [
{
plugin: this.paiTrialConfig.paiStoragePlugin
}
],
submitFrom: 'submit-job-v2'
}
}
if (this.paiTrialConfig.virtualCluster) {
paiJobConfig.defaults= {
virtualCluster: this.paiTrialConfig.virtualCluster
}
const containerWorkingDir: string = `${this.paiTrialConfig.containerNFSMountPath}/${this.experimentId}/${trialJobDetail.id}`;
const nniManagerIp: string = this.nniManagerIpConfig ? this.nniManagerIpConfig.nniManagerIp : getIPV4Address();
const nniPaiTrialCommand: string = String.Format(
PAI_K8S_TRIAL_COMMAND_FORMAT,
`${containerWorkingDir}`,
`${containerWorkingDir}/nnioutput`,
trialJobDetail.id,
this.experimentId,
trialJobDetail.form.sequenceId,
this.isMultiPhase,
command,
nniManagerIp,
this.paiRestServerPort,
this.nniVersion,
this.logCollection
)
.replace(/\r\n|\n|\r/gm, '');

return nniPaiTrialCommand;

}
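
The trailing `.replace(/\r\n|\n|\r/gm, '')` in `generateNNITrialCommand` flattens a multi-line template into a single shell line. The sketch below illustrates that step under stated assumptions: `formatSketch` is a hypothetical stand-in for the `String.Format` call, and the template is invented, because the real `PAI_K8S_TRIAL_COMMAND_FORMAT` is not shown in this diff.

```typescript
// Hypothetical stand-in for String.Format: replace {0}, {1}, ... with the
// matching argument; placeholders without an argument are left untouched.
function formatSketch(template: string, ...args: Array<string | number | boolean>): string {
    return template.replace(/\{(\d+)\}/g, (match: string, index: string): string => {
        const i = parseInt(index, 10);
        return i < args.length ? String(args[i]) : match;
    });
}

// Same regex as in the diff: drop every line break so the result is one line.
function toSingleLine(command: string): string {
    return command.replace(/\r\n|\n|\r/gm, '');
}

// Invented template; each line ends with a space so the joined text stays valid.
const TEMPLATE_SKETCH: string = 'mkdir -p {0} \n&& cd {0} \n&& sh -c "{1}"';
```

Because line breaks are removed rather than replaced by spaces, a template line has to carry its own trailing space or separator, as the invented template does.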

private generateJobConfigInYamlFormat(trialJobDetail: PAITrialJobDetail) {
if (this.paiTrialConfig === undefined) {
throw new Error('trial config is not initialized');
}
const jobName = `nni_exp_${this.experimentId}_trial_${trialJobDetail.id}`

let nniJobConfig: any = undefined;
if (this.paiTrialConfig.paiConfigPath) {
try {
const additionalPAIConfig = yaml.safeLoad(fs.readFileSync(this.paiTrialConfig.paiConfigPath, 'utf8'));
//deepmerge(x, y), if an element at the same key is present for both x and y, the value from y will appear in the result.
//refer: https://github.com/TehShrike/deepmerge
const overwriteMerge = (destinationArray: any, sourceArray: any, options: any) => sourceArray;
return yaml.safeDump(deepmerge(additionalPAIConfig, paiJobConfig, { arrayMerge: overwriteMerge }));
} catch (error) {
this.log.error(`Error occurs during loading and merge ${this.paiTrialConfig.paiConfigPath} : ${error}`);
nniJobConfig = this.paiJobConfig;
nniJobConfig.name = jobName;
// Each taskRole will generate new command in NNI's command format
// Each command will be formatted to NNI style
for(const taskRoleIndex in nniJobConfig.taskRoles) {
const commands = nniJobConfig.taskRoles[taskRoleIndex].commands
const nniTrialCommand = this.generateNNITrialCommand(trialJobDetail, commands.join(" && ").replace(/(["'$`\\])/g,'\\$1'));
nniJobConfig.taskRoles[taskRoleIndex].commands = [nniTrialCommand]
}

} else {
return yaml.safeDump(paiJobConfig);
nniJobConfig = {
protocolVersion: 2,
name: jobName,
type: 'job',
jobRetryCount: 0,
prerequisites: [
{
type: 'dockerimage',
uri: this.paiTrialConfig.image,
name: 'docker_image_0'
}
],
taskRoles: {
taskrole: {
instances: 1,
completion: {
minFailedInstances: 1,
minSucceededInstances: -1
},
taskRetryCount: 0,
dockerImage: 'docker_image_0',
resourcePerInstance: {
gpu: this.paiTrialConfig.gpuNum,
cpu: this.paiTrialConfig.cpuNum,
memoryMB: this.paiTrialConfig.memoryMB
},
commands: [
this.generateNNITrialCommand(trialJobDetail, this.paiTrialConfig.command)
]
}
},
extras: {
'com.microsoft.pai.runtimeplugin': [
{
plugin: this.paiTrialConfig.paiStoragePlugin
}
],
submitFrom: 'submit-job-v2'
}
}
if (this.paiTrialConfig.virtualCluster) {
nniJobConfig.defaults = {
virtualCluster: this.paiTrialConfig.virtualCluster
}
}
}
}
return yaml.safeDump(nniJobConfig);
}
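
The `paiConfigPath` branch above can be sketched in isolation as follows. This is an illustration, not the PR's code: the interfaces and function names are hypothetical, and `wrap` stands in for `generateNNITrialCommand`. The join with `" && "` and the escaping regex are copied from the diff. One deliberate difference, which is an assumption of this sketch: it deep-copies the loaded config, whereas the diff assigns `this.paiJobConfig` directly, so here the cached object is not modified between trials.

```typescript
// Illustrative shapes; only `name`, `taskRoles` and `commands` come from the diff.
interface TaskRoleSketch {
    commands: string[];
    [key: string]: unknown;
}
interface JobConfigSketch {
    name: string;
    taskRoles: { [roleName: string]: TaskRoleSketch };
    [key: string]: unknown;
}

// Join a role's commands with " && ", then backslash-escape double quote,
// single quote, dollar sign, backtick and backslash (same regex as the diff).
function joinAndEscape(commands: string[]): string {
    return commands.join(" && ").replace(/(["'$`\\])/g, "\\$1");
}

// Build a per-trial config: new job name, one wrapped command per taskRole.
function buildTrialJobConfig(
    loaded: JobConfigSketch,
    jobName: string,
    wrap: (userCommand: string) => string
): JobConfigSketch {
    // Assumption of this sketch: deep-copy first so `loaded` is never mutated
    // and a later trial does not wrap an already wrapped command.
    const config: JobConfigSketch = JSON.parse(JSON.stringify(loaded));
    config.name = jobName;
    for (const roleName of Object.keys(config.taskRoles)) {
        const role = config.taskRoles[roleName];
        role.commands = [wrap(joinAndEscape(role.commands))];
    }
    return config;
}
```

The escaping matters when the wrapper embeds the user command inside a quoted shell string: without it, a quote or `$` in the user's command would terminate or expand inside the outer string.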

protected async submitTrialJobToPAI(trialJobId: string): Promise<boolean> {
const deferred: Deferred<boolean> = new Deferred<boolean>();
@@ -248,29 +281,8 @@ class PAIK8STrainingService extends PAITrainingService {

//Copy codeDir files to local working folder
await execCopydir(this.paiTrialConfig.codeDir, trialLocalFolder);

const nniManagerIp: string = this.nniManagerIpConfig ? this.nniManagerIpConfig.nniManagerIp : getIPV4Address();
const version: string = this.versionCheck ? await getVersion() : '';
const containerWorkingDir: string = `${this.paiTrialConfig.containerNFSMountPath}/${this.experimentId}/${trialJobId}`;
const nniPaiTrialCommand: string = String.Format(
PAI_K8S_TRIAL_COMMAND_FORMAT,
`${containerWorkingDir}`,
`${containerWorkingDir}/nnioutput`,
trialJobId,
this.experimentId,
trialJobDetail.form.sequenceId,
this.isMultiPhase,
this.paiTrialConfig.command,
nniManagerIp,
this.paiRestServerPort,
version,
this.logCollection
)
.replace(/\r\n|\n|\r/gm, '');

this.log.info(`nniPAItrial command is ${nniPaiTrialCommand.trim()}`);

const paiJobConfig = this.generateJobConfigInYamlFormat(trialJobId, nniPaiTrialCommand);
//Generate Job Configuration in yaml format
const paiJobConfig = this.generateJobConfigInYamlFormat(trialJobDetail);
this.log.debug(paiJobConfig);
// Step 3. Submit PAI job via Rest call
// Refer https://github.com/Microsoft/pai/blob/master/docs/rest-server/API.md for more detail about PAI Rest API
2 changes: 1 addition & 1 deletion tools/nni_cmd/config_schema.py
@@ -287,7 +287,7 @@ def setPathCheck(key):
'codeDir': setPathCheck('codeDir'),
'nniManagerNFSMountPath': setPathCheck('nniManagerNFSMountPath'),
'containerNFSMountPath': setType('containerNFSMountPath', str),
- 'command': setType('command', str),
+ Optional('command'): setType('command', str),
Optional('gpuNum'): setNumberRange('gpuNum', int, 0, 99999),
Optional('cpuNum'): setNumberRange('cpuNum', int, 0, 99999),
Optional('memoryMB'): setType('memoryMB', int),
33 changes: 6 additions & 27 deletions tools/nni_cmd/launcher_utils.py
@@ -266,35 +266,14 @@ def validate_pai_config_path(experiment_config):
'''validate paiConfigPath field'''
if experiment_config.get('trainingServicePlatform') == 'pai':
if experiment_config.get('trial', {}).get('paiConfigPath'):
# validate the file format of paiConfigPath, ensure it is yaml format
# validate commands
pai_config = get_yml_content(experiment_config['trial']['paiConfigPath'])
if experiment_config['trial'].get('image') is None:
if pai_config.get('prerequisites', [{}])[0].get('uri') is None:
print_error('Please set image field, or set image uri in your own paiConfig!')
exit(1)
experiment_config['trial']['image'] = pai_config['prerequisites'][0]['uri']
if experiment_config['trial'].get('gpuNum') is None:
if pai_config.get('taskRoles', {}).get('taskrole', {}).get('resourcePerInstance', {}).get('gpu') is None:
print_error('Please set gpuNum field, or set resourcePerInstance gpu in your own paiConfig!')
exit(1)
experiment_config['trial']['gpuNum'] = pai_config['taskRoles']['taskrole']['resourcePerInstance']['gpu']
if experiment_config['trial'].get('cpuNum') is None:
if pai_config.get('taskRoles', {}).get('taskrole', {}).get('resourcePerInstance', {}).get('cpu') is None:
print_error('Please set cpuNum field, or set resourcePerInstance cpu in your own paiConfig!')
exit(1)
experiment_config['trial']['cpuNum'] = pai_config['taskRoles']['taskrole']['resourcePerInstance']['cpu']
if experiment_config['trial'].get('memoryMB') is None:
if pai_config.get('taskRoles', {}).get('taskrole', {}).get('resourcePerInstance', {}).get('memoryMB', {}) is None:
print_error('Please set memoryMB field, or set resourcePerInstance memoryMB in your own paiConfig!')
exit(1)
experiment_config['trial']['memoryMB'] = pai_config['taskRoles']['taskrole']['resourcePerInstance']['memoryMB']
if experiment_config['trial'].get('paiStoragePlugin') is None:
if pai_config.get('extras', {}).get('com.microsoft.pai.runtimeplugin', [{}])[0].get('plugin') is None:
print_error('Please set paiStoragePlugin field, or set plugin in your own paiConfig!')
exit(1)
experiment_config['trial']['paiStoragePlugin'] = pai_config['extras']['com.microsoft.pai.runtimeplugin'][0]['plugin']
taskRoles_dict = pai_config.get('taskRoles')
if not taskRoles_dict:
print_error('Please set taskRoles in paiConfigPath config file!')
exit(1)
else:
- pai_trial_fields_required_list = ['image', 'gpuNum', 'cpuNum', 'memoryMB', 'paiStoragePlugin']
+ pai_trial_fields_required_list = ['image', 'gpuNum', 'cpuNum', 'memoryMB', 'paiStoragePlugin', 'command']
for trial_field in pai_trial_fields_required_list:
if experiment_config['trial'].get(trial_field) is None:
print_error('Please set {0} in trial configuration,\