Skip to content
This repository has been archived by the owner on Sep 18, 2024. It is now read-only.

Merge pai config #1965

Merged
merged 38 commits into from
Feb 7, 2020
Merged
Show file tree
Hide file tree
Changes from 36 commits
Commits
Show all changes
38 commits
Select commit Hold shift + click to select a range
704b50e
Merge pull request #200 from microsoft/master
SparkSnail Aug 6, 2019
5b0034e
Merge pull request #204 from microsoft/master
SparkSnail Aug 20, 2019
8fe2588
Merge pull request #205 from microsoft/master
SparkSnail Aug 30, 2019
9fae194
Merge pull request #206 from microsoft/master
SparkSnail Sep 16, 2019
c785655
Merge pull request #207 from microsoft/master
SparkSnail Oct 21, 2019
2f5272c
Merge pull request #208 from microsoft/master
SparkSnail Oct 24, 2019
1892bc2
Merge pull request #209 from microsoft/master
SparkSnail Oct 28, 2019
7c1ab11
Merge pull request #210 from microsoft/master
SparkSnail Oct 28, 2019
8c203f3
Merge pull request #211 from microsoft/master
SparkSnail Oct 31, 2019
d7a62f6
check pylint for nni_cmd
SparkSnail Oct 31, 2019
e259d10
fix id error
SparkSnail Oct 31, 2019
4997295
Merge pull request #212 from microsoft/master
SparkSnail Nov 3, 2019
c037a7c
Merge pull request #213 from microsoft/master
SparkSnail Nov 10, 2019
7620e7c
Merge pull request #214 from microsoft/master
SparkSnail Nov 14, 2019
d16dbe9
Merge pull request #215 from microsoft/master
SparkSnail Nov 19, 2019
9ce751d
Merge pull request #216 from microsoft/master
SparkSnail Nov 21, 2019
a0846f2
Merge pull request #217 from microsoft/master
SparkSnail Nov 22, 2019
cd3a912
Merge pull request #218 from microsoft/master
SparkSnail Nov 27, 2019
32efaa3
Merge pull request #219 from microsoft/master
SparkSnail Dec 10, 2019
543239c
Merge pull request #220 from microsoft/master
SparkSnail Dec 12, 2019
36e6e35
Merge pull request #221 from microsoft/master
SparkSnail Dec 19, 2019
f9ee589
Merge pull request #222 from microsoft/master
SparkSnail Dec 24, 2019
b9a7a95
Merge pull request #223 from microsoft/master
SparkSnail Dec 25, 2019
1a5c017
Merge pull request #224 from microsoft/master
SparkSnail Jan 6, 2020
392460a
Merge pull request #225 from microsoft/master
SparkSnail Jan 8, 2020
9bafa4c
Merge pull request #226 from microsoft/master
SparkSnail Jan 8, 2020
c23b807
Merge pull request #227 from microsoft/master
SparkSnail Jan 10, 2020
4132f62
Merge pull request #228 from microsoft/master
SparkSnail Jan 10, 2020
92c2ce7
add merge config
SparkSnail Jan 16, 2020
956b413
fix comments
SparkSnail Jan 19, 2020
0a37820
use deepmerge package
SparkSnail Jan 19, 2020
d07fec1
add semicolon
SparkSnail Jan 19, 2020
a803684
add annotation
SparkSnail Jan 19, 2020
1970f15
set trial config optional
SparkSnail Jan 19, 2020
a1dab9f
add doc
SparkSnail Jan 19, 2020
c58f49b
sort package.json
SparkSnail Jan 20, 2020
4163f26
add yarn.lock
Feb 5, 2020
e2ceede
revert change
Feb 5, 2020
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
12 changes: 7 additions & 5 deletions docs/en_US/TrainingService/PaiMode.md
Original file line number Diff line number Diff line change
Expand Up @@ -48,20 +48,22 @@ Note: You should set `trainingServicePlatform: pai` in NNI config YAML file if y

Compared with [LocalMode](LocalMode.md) and [RemoteMachineMode](RemoteMachineMode.md), trial configuration in pai mode have these additional keys:
* cpuNum
* Required key. Should be positive number based on your trial program's CPU requirement
* Optional key. Should be positive number based on your trial program's CPU requirement. If it is not set in trial configuration, it should be set in the config file specified in `paiConfigPath` field.
* memoryMB
* Required key. Should be positive number based on your trial program's memory requirement
* Optional key. Should be positive number based on your trial program's memory requirement. If it is not set in trial configuration, it should be set in the config file specified in `paiConfigPath` field.
* image
* Required key. In pai mode, your trial program will be scheduled by OpenPAI to run in [Docker container](https://www.docker.com/). This key is used to specify the Docker image used to create the container in which your trial will run.
* We already build a docker image [nnimsra/nni](https://hub.docker.com/r/msranni/nni/) on [Docker Hub](https://hub.docker.com/). It contains NNI python packages, Node modules and javascript artifact files required to start experiment, and all of NNI dependencies. The docker file used to build this image can be found at [here](https://github.com/Microsoft/nni/tree/master/deployment/docker/Dockerfile). You can either use this image directly in your config file, or build your own image based on it.
* Optional key. In pai mode, your trial program will be scheduled by OpenPAI to run in [Docker container](https://www.docker.com/). This key is used to specify the Docker image used to create the container in which your trial will run.
* We already build a docker image [nnimsra/nni](https://hub.docker.com/r/msranni/nni/) on [Docker Hub](https://hub.docker.com/). It contains NNI python packages, Node modules and javascript artifact files required to start experiment, and all of NNI dependencies. The docker file used to build this image can be found at [here](https://github.com/Microsoft/nni/tree/master/deployment/docker/Dockerfile). You can either use this image directly in your config file, or build your own image based on it. If it is not set in trial configuration, it should be set in the config file specified in `paiConfigPath` field.
* virtualCluster
* Optional key. Set the virtualCluster of OpenPAI. If omitted, the job will run on default virtual cluster.
* nniManagerNFSMountPath
* Required key. Set the mount path in your nniManager machine.
* containerNFSMountPath
* Required key. Set the mount path in your container used in PAI.
* paiStoragePlugin
* Required key. Set the storage plugin name used in PAI.
* Optional key. Set the storage plugin name used in PAI. If it is not set in trial configuration, it should be set in the config file specified in `paiConfigPath` field.
* paiConfigPath
* Optional key. Set the file path of pai job configuration, the file is in yaml format.


Once complete to fill NNI experiment config file and save (for example, save as exp_pai.yml), then run the following command
Expand Down
1 change: 1 addition & 0 deletions src/nni_manager/package.json
Original file line number Diff line number Diff line change
Expand Up @@ -13,6 +13,7 @@
"azure-storage": "^2.10.2",
"chai-as-promised": "^7.1.1",
"child-process-promise": "^2.2.1",
"deepmerge": "^4.2.2",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there should be some updates in yarn.lock needs to be checked in.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

"express": "^4.16.3",
"express-joi-validator": "^2.0.0",
"js-base64": "^2.4.9",
Expand Down
1 change: 1 addition & 0 deletions src/nni_manager/rest_server/restValidationSchemas.ts
Original file line number Diff line number Diff line change
Expand Up @@ -38,6 +38,7 @@ export namespace ValidationSchemas {
authFile: joi.string(),
nniManagerNFSMountPath: joi.string().min(1),
containerNFSMountPath: joi.string().min(1),
paiConfigPath: joi.string(),
paiStoragePlugin: joi.string().min(1),
nasMode: joi.string().valid('classic_mode', 'enas_mode', 'oneshot_mode', 'darts_mode'),
portList: joi.array().items(joi.object({
Expand Down
4 changes: 3 additions & 1 deletion src/nni_manager/training_service/pai/paiK8S/paiK8SConfig.ts
Original file line number Diff line number Diff line change
Expand Up @@ -31,10 +31,11 @@ export class NNIPAIK8STrialConfig extends TrialConfig {
public readonly nniManagerNFSMountPath: string;
public readonly containerNFSMountPath: string;
public readonly paiStoragePlugin: string;
public readonly paiConfigPath?: string;

constructor(command: string, codeDir: string, gpuNum: number, cpuNum: number, memoryMB: number,
image: string, nniManagerNFSMountPath: string, containerNFSMountPath: string,
paiStoragePlugin: string, virtualCluster?: string) {
paiStoragePlugin: string, virtualCluster?: string, paiConfigPath?: string) {
super(command, codeDir, gpuNum);
this.cpuNum = cpuNum;
this.memoryMB = memoryMB;
Expand All @@ -43,5 +44,6 @@ export class NNIPAIK8STrialConfig extends TrialConfig {
this.nniManagerNFSMountPath = nniManagerNFSMountPath;
this.containerNFSMountPath = containerNFSMountPath;
this.paiStoragePlugin = paiStoragePlugin;
this.paiConfigPath = paiConfigPath;
}
}
Original file line number Diff line number Diff line change
Expand Up @@ -44,6 +44,7 @@ import { PAIClusterConfig, PAITrialJobDetail } from '../paiConfig';
import { PAIJobRestServer } from '../paiJobRestServer';

const yaml = require('js-yaml');
const deepmerge = require('deepmerge');

/**
* Training Service implementation for OpenPAI (Open Platform for AI)
Expand Down Expand Up @@ -185,7 +186,19 @@ class PAIK8STrainingService extends PAITrainingService {
}
}

return yaml.safeDump(paiJobConfig);
if (this.paiTrialConfig.paiConfigPath) {
try {
const additionalPAIConfig = yaml.safeLoad(fs.readFileSync(this.paiTrialConfig.paiConfigPath, 'utf8'));
//deepmerge(x, y), if an element at the same key is present for both x and y, the value from y will appear in the result.
//refer: https://github.com/TehShrike/deepmerge
const overwriteMerge = (destinationArray: any, sourceArray: any, options: any) => sourceArray;
return yaml.safeDump(deepmerge(additionalPAIConfig, paiJobConfig, { arrayMerge: overwriteMerge }));
} catch (error) {
this.log.error(`Error occurs during loading and merge ${this.paiTrialConfig.paiConfigPath} : ${error}`);
}
} else {
return yaml.safeDump(paiJobConfig);
}
}

protected async submitTrialJobToPAI(trialJobId: string): Promise<boolean> {
Expand Down Expand Up @@ -254,7 +267,7 @@ class PAIK8STrainingService extends PAITrainingService {
this.log.info(`nniPAItrial command is ${nniPaiTrialCommand.trim()}`);

const paiJobConfig = this.generateJobConfigInYamlFormat(trialJobId, nniPaiTrialCommand);

this.log.debug(paiJobConfig);
// Step 3. Submit PAI job via Rest call
// Refer https://github.com/Microsoft/pai/blob/master/docs/rest-server/API.md for more detail about PAI Rest API
const submitJobRequest: request.Options = {
Expand Down
15 changes: 8 additions & 7 deletions tools/nni_cmd/config_schema.py
Original file line number Diff line number Diff line change
Expand Up @@ -271,16 +271,17 @@ def setPathCheck(key):

pai_trial_schema = {
'trial':{
'command': setType('command', str),
'codeDir': setPathCheck('codeDir'),
'gpuNum': setNumberRange('gpuNum', int, 0, 99999),
'cpuNum': setNumberRange('cpuNum', int, 0, 99999),
'memoryMB': setType('memoryMB', int),
'image': setType('image', str),
Optional('virtualCluster'): setType('virtualCluster', str),
'nniManagerNFSMountPath': setPathCheck('nniManagerNFSMountPath'),
'containerNFSMountPath': setType('containerNFSMountPath', str),
'paiStoragePlugin': setType('paiStoragePlugin', str)
'command': setType('command', str),
Optional('gpuNum'): setNumberRange('gpuNum', int, 0, 99999),
Optional('cpuNum'): setNumberRange('cpuNum', int, 0, 99999),
Optional('memoryMB'): setType('memoryMB', int),
Optional('image'): setType('image', str),
Optional('virtualCluster'): setType('virtualCluster', str),
Optional('paiStoragePlugin'): setType('paiStoragePlugin', str),
Optional('paiConfigPath'): And(os.path.exists, error=SCHEMA_PATH_ERROR % 'paiConfigPath')
}
}

Expand Down
46 changes: 45 additions & 1 deletion tools/nni_cmd/launcher_utils.py
Original file line number Diff line number Diff line change
Expand Up @@ -7,7 +7,7 @@
from schema import Schema
from .config_schema import LOCAL_CONFIG_SCHEMA, REMOTE_CONFIG_SCHEMA, PAI_CONFIG_SCHEMA, PAI_YARN_CONFIG_SCHEMA, KUBEFLOW_CONFIG_SCHEMA,\
FRAMEWORKCONTROLLER_CONFIG_SCHEMA, tuner_schema_dict, advisor_schema_dict, assessor_schema_dict
from .common_utils import print_error, print_warning, print_normal
from .common_utils import print_error, print_warning, print_normal, get_yml_content

def expand_path(experiment_config, key):
'''Change '~' to user home directory'''
Expand Down Expand Up @@ -63,6 +63,8 @@ def parse_path(experiment_config, config_path):
if experiment_config.get('machineList'):
for index in range(len(experiment_config['machineList'])):
expand_path(experiment_config['machineList'][index], 'sshKeyPath')
if experiment_config['trial'].get('paiConfigPath'):
expand_path(experiment_config['trial'], 'paiConfigPath')

#if users use relative path, convert it to absolute path
root_path = os.path.dirname(config_path)
Expand Down Expand Up @@ -94,6 +96,8 @@ def parse_path(experiment_config, config_path):
if experiment_config.get('machineList'):
for index in range(len(experiment_config['machineList'])):
parse_relative_path(root_path, experiment_config['machineList'][index], 'sshKeyPath')
if experiment_config['trial'].get('paiConfigPath'):
parse_relative_path(root_path, experiment_config['trial'], 'paiConfigPath')

def validate_search_space_content(experiment_config):
'''Validate searchspace content,
Expand Down Expand Up @@ -254,6 +258,45 @@ def validate_machine_list(experiment_config):
print_error('Please set machineList!')
exit(1)

def validate_pai_config_path(experiment_config):
'''validate paiConfigPath field'''
if experiment_config.get('trainingServicePlatform') == 'pai':
if experiment_config.get('trial', {}).get('paiConfigPath'):
# validate the file format of paiConfigPath, ensure it is yaml format
pai_config = get_yml_content(experiment_config['trial']['paiConfigPath'])
if experiment_config['trial'].get('image') is None:
if pai_config.get('prerequisites', [{}])[0].get('uri') is None:
print_error('Please set image field, or set image uri in your own paiConfig!')
exit(1)
experiment_config['trial']['image'] = pai_config['prerequisites'][0]['uri']
if experiment_config['trial'].get('gpuNum') is None:
if pai_config.get('taskRoles', {}).get('taskrole', {}).get('resourcePerInstance', {}).get('gpu') is None:
print_error('Please set gpuNum field, or set resourcePerInstance gpu in your own paiConfig!')
exit(1)
experiment_config['trial']['gpuNum'] = pai_config['taskRoles']['taskrole']['resourcePerInstance']['gpu']
if experiment_config['trial'].get('cpuNum') is None:
if pai_config.get('taskRoles', {}).get('taskrole', {}).get('resourcePerInstance', {}).get('cpu') is None:
print_error('Please set cpuNum field, or set resourcePerInstance cpu in your own paiConfig!')
exit(1)
experiment_config['trial']['cpuNum'] = pai_config['taskRoles']['taskrole']['resourcePerInstance']['cpu']
if experiment_config['trial'].get('memoryMB') is None:
if pai_config.get('taskRoles', {}).get('taskrole', {}).get('resourcePerInstance', {}).get('memoryMB', {}) is None:
print_error('Please set memoryMB field, or set resourcePerInstance memoryMB in your own paiConfig!')
exit(1)
experiment_config['trial']['memoryMB'] = pai_config['taskRoles']['taskrole']['resourcePerInstance']['memoryMB']
if experiment_config['trial'].get('paiStoragePlugin') is None:
if pai_config.get('extras', {}).get('com.microsoft.pai.runtimeplugin', [{}])[0].get('plugin') is None:
print_error('Please set paiStoragePlugin field, or set plugin in your own paiConfig!')
exit(1)
experiment_config['trial']['paiStoragePlugin'] = pai_config['extras']['com.microsoft.pai.runtimeplugin'][0]['plugin']
else:
pai_trial_fields_required_list = ['image', 'gpuNum', 'cpuNum', 'memoryMB', 'paiStoragePlugin']
for trial_field in pai_trial_fields_required_list:
if experiment_config['trial'].get(trial_field) is None:
print_error('Please set {0} in trial configuration,\
or set additional pai configuration file path in paiConfigPath!'.format(trial_field))
exit(1)

def validate_pai_trial_conifg(experiment_config):
'''validate the trial config in pai platform'''
if experiment_config.get('trainingServicePlatform') in ['pai', 'paiYarn']:
Expand All @@ -269,6 +312,7 @@ def validate_pai_trial_conifg(experiment_config):
print_warning(warning_information.format('dataDir'))
if experiment_config.get('trial').get('outputDir'):
print_warning(warning_information.format('outputDir'))
validate_pai_config_path(experiment_config)

def validate_all_content(experiment_config, config_path):
'''Validate whether experiment_config is valid'''
Expand Down