Support heterogeneous environment service #3097
Changes from 46 commits
@@ -0,0 +1,54 @@
**Run an Experiment on Heterogeneous Mode**
===
Running NNI in heterogeneous mode means that NNI runs trial jobs on multiple kinds of training platforms. For example, NNI could submit trial jobs to a remote machine and to AML simultaneously.

## Setup environment
NNI supports [local](./LocalMode.md), [remote](./RemoteMachineMode.md), [pai](./PaiMode.md) and [AML](./AMLMode.md) platforms for the heterogeneous training service. Before starting an experiment that uses these modes, users should set up the corresponding environment for each platform. More details about the environment setup can be found in the corresponding docs.

## Run an experiment
Use `examples/trials/mnist-tfv1` as an example. The NNI config YAML file's content is like:
```yaml
authorName: default
experimentName: example_mnist
trialConcurrency: 2
maxExecDuration: 1h
maxTrialNum: 10
trainingServicePlatform: heterogeneous
searchSpacePath: search_space.json
#choice: true, false
useAnnotation: false
tuner:
  #choice: TPE, Random, Anneal, Evolution, BatchTuner, MetisTuner, GPTuner
  #SMAC (SMAC should be installed through nnictl)
  builtinTunerName: TPE
  classArgs:
    #choice: maximize, minimize
    optimize_mode: maximize
trial:
  command: python3 mnist.py
  codeDir: .
  gpuNum: 1
heterogeneousConfig:
  trainingServicePlatforms:
    - local
    - remote
remoteConfig:
  reuse: true
machineList:
  - ip: 10.1.1.1
    username: bob
    passwd: bob123
    #port can be skipped if using the default ssh port 22
    #port: 22
```

> **Review comment (on `trainingServicePlatforms`):** Propose to write it in the format below, so that users can input multiple AML or PAI instances under `heterogeneousConfig`:
>
> ```yaml
> trainingServicePlatforms:
>   - type: local
>   - type: remote
>     machineList:
>       - ip: 10.1.1.1
>         username: bob
>         passwd: bob123
>         #port can be skipped if using the default ssh port 22
>         #port: 22
> ```
>
> **Reply:** This PR will keep the current `heterogeneousConfig` format, and we will have another PR to convert the new YAML file (exposed to users) to the current YAML file (internal use). Adding @liuzhe-lz to confirm.
Configurations for heterogeneous mode:

heterogeneousConfig:
* trainingServicePlatforms. Required key. This field specifies the platforms used in heterogeneous mode; the value uses the YAML list format. NNI supports setting `local`, `remote`, `aml`, and `pai` in this field.

Note:
If a platform is set in `trainingServicePlatforms`, users should also set the corresponding configuration for that platform. For example, if `remote` is set as one of the platforms, the `machineList` and `remoteConfig` configurations should also be set.
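As an illustrative sketch of that rule, a config combining `aml` and `remote` would need both platform sections filled in. The `amlConfig` field names below follow NNI's AML mode docs, and all values are placeholders, not taken from this PR:

```yaml
heterogeneousConfig:
  trainingServicePlatforms:
    - aml
    - remote
amlConfig:
  subscriptionId: ${your subscription ID}
  resourceGroup: ${your resource group}
  workspaceName: ${your workspace name}
  computeTarget: ${your compute target}
remoteConfig:
  reuse: true
machineList:
  - ip: 10.1.1.1
    username: bob
    passwd: bob123
```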
@@ -0,0 +1,32 @@
authorName: default
experimentName: example_mnist
trialConcurrency: 3
maxExecDuration: 1h
maxTrialNum: 10
trainingServicePlatform: heterogeneous
searchSpacePath: search_space.json
#choice: true, false
useAnnotation: false
tuner:
  #choice: TPE, Random, Anneal, Evolution, BatchTuner, MetisTuner, GPTuner
  #SMAC (SMAC should be installed through nnictl)
  builtinTunerName: TPE
  classArgs:
    #choice: maximize, minimize
    optimize_mode: maximize
trial:
  command: python3 mnist.py
  codeDir: .
  gpuNum: 0
heterogeneousConfig:
  trainingServicePlatforms:
    - local
    - remote
remoteConfig:
  reuse: true
machineList:
  - ip: 10.1.1.1
    username: bob
    passwd: bob123
    #port can be skipped if using the default ssh port 22
    #port: 22
@@ -118,13 +118,6 @@ def set_local_config(experiment_config, port, config_file_name):
    request_data = dict()
    if experiment_config.get('localConfig'):
        request_data['local_config'] = experiment_config['localConfig']
        if request_data['local_config']:
            if request_data['local_config'].get('gpuIndices') and isinstance(request_data['local_config'].get('gpuIndices'), int):
                request_data['local_config']['gpuIndices'] = str(request_data['local_config'].get('gpuIndices'))
            if request_data['local_config'].get('maxTrialNumOnEachGpu'):
                request_data['local_config']['maxTrialNumOnEachGpu'] = request_data['local_config'].get('maxTrialNumOnEachGpu')
            if request_data['local_config'].get('useActiveGpu'):
                request_data['local_config']['useActiveGpu'] = request_data['local_config'].get('useActiveGpu')
    response = rest_put(cluster_metadata_url(port), json.dumps(request_data), REST_TIME_OUT)
    err_message = ''
    if not response or not check_response(response):

> **Review comment:** This file is too messy......
@@ -306,6 +299,37 @@ def set_aml_config(experiment_config, port, config_file_name):
    #set trial_config
    return set_trial_config(experiment_config, port, config_file_name), err_message


def set_heterogeneous_config(experiment_config, port, config_file_name):
    '''set heterogeneous configuration'''
    heterogeneous_config_data = dict()
    heterogeneous_config_data['heterogeneous_config'] = experiment_config['heterogeneousConfig']
    platform_list = experiment_config['heterogeneousConfig']['trainingServicePlatforms']
    for platform in platform_list:
        if platform == 'aml':
            heterogeneous_config_data['aml_config'] = experiment_config['amlConfig']
        elif platform == 'remote':
            if experiment_config.get('remoteConfig'):
                heterogeneous_config_data['remote_config'] = experiment_config['remoteConfig']
            heterogeneous_config_data['machine_list'] = experiment_config['machineList']
        elif platform == 'local' and experiment_config.get('localConfig'):
            heterogeneous_config_data['local_config'] = experiment_config['localConfig']
        elif platform == 'pai':
            heterogeneous_config_data['pai_config'] = experiment_config['paiConfig']
    response = rest_put(cluster_metadata_url(port), json.dumps(heterogeneous_config_data), REST_TIME_OUT)
    err_message = None
    if not response or not response.status_code == 200:
        if response is not None:
            err_message = response.text
            _, stderr_full_path = get_log_path(config_file_name)
            with open(stderr_full_path, 'a+') as fout:
                fout.write(json.dumps(json.loads(err_message), indent=4, sort_keys=True, separators=(',', ':')))
        return False, err_message
    result, message = setNNIManagerIp(experiment_config, port, config_file_name)
    if not result:
        return result, message
    #set trial_config
    return set_trial_config(experiment_config, port, config_file_name), err_message
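The per-platform key mapping performed by `set_heterogeneous_config` can be sketched standalone. This is a simplified, hypothetical helper (the name `build_heterogeneous_metadata` is not from the PR) that only builds the metadata payload, without the REST call or error handling:

```python
def build_heterogeneous_metadata(experiment_config):
    """Collect per-platform cluster metadata, mirroring the key mapping
    used by set_heterogeneous_config in this PR."""
    data = {'heterogeneous_config': experiment_config['heterogeneousConfig']}
    for platform in experiment_config['heterogeneousConfig']['trainingServicePlatforms']:
        if platform == 'aml':
            data['aml_config'] = experiment_config['amlConfig']
        elif platform == 'remote':
            # remoteConfig is optional, but machineList is always forwarded
            if experiment_config.get('remoteConfig'):
                data['remote_config'] = experiment_config['remoteConfig']
            data['machine_list'] = experiment_config['machineList']
        elif platform == 'local' and experiment_config.get('localConfig'):
            data['local_config'] = experiment_config['localConfig']
        elif platform == 'pai':
            data['pai_config'] = experiment_config['paiConfig']
    return data

# Example: a local + remote experiment (no localConfig given) produces
# only the remote-related entries besides heterogeneous_config.
config = {
    'heterogeneousConfig': {'trainingServicePlatforms': ['local', 'remote']},
    'remoteConfig': {'reuse': True},
    'machineList': [{'ip': '10.1.1.1', 'username': 'bob', 'passwd': 'bob123'}],
}
metadata = build_heterogeneous_metadata(config)
```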

def set_experiment(experiment_config, mode, port, config_file_name):
    '''Call startExperiment (rest POST /experiment) with yaml file content'''
    request_data = dict()
@@ -387,6 +411,21 @@ def set_experiment(experiment_config, mode, port, config_file_name):
            {'key': 'aml_config', 'value': experiment_config['amlConfig']})
        request_data['clusterMetaData'].append(
            {'key': 'trial_config', 'value': experiment_config['trial']})
    elif experiment_config['trainingServicePlatform'] == 'heterogeneous':
        request_data['clusterMetaData'].append(
            {'key': 'heterogeneous_config', 'value': experiment_config['heterogeneousConfig']})
        platform_list = experiment_config['heterogeneousConfig']['trainingServicePlatforms']
        request_dict = {
            'aml': {'key': 'aml_config', 'value': experiment_config.get('amlConfig')},
            'remote': {'key': 'machine_list', 'value': experiment_config.get('machineList')},
            'pai': {'key': 'pai_config', 'value': experiment_config.get('paiConfig')},
            'local': {'key': 'local_config', 'value': experiment_config.get('localConfig')}
        }
        for platform in platform_list:
            if request_dict.get(platform):
                request_data['clusterMetaData'].append(request_dict[platform])
        request_data['clusterMetaData'].append(
            {'key': 'trial_config', 'value': experiment_config['trial']})
    response = rest_post(experiment_url(port), json.dumps(request_data), REST_TIME_OUT, show_error=True)
    if check_response(response):
        return response
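The `request_dict` dispatch above determines the order of the `clusterMetaData` list sent to the REST API: the heterogeneous config first, then one entry per configured platform, then the trial config. A simplified sketch (the helper name `build_cluster_metadata` is hypothetical, not the PR's code):

```python
def build_cluster_metadata(experiment_config):
    """Mirror the heterogeneous branch of set_experiment: heterogeneous
    config first, then per-platform entries, then the trial config."""
    meta = [{'key': 'heterogeneous_config',
             'value': experiment_config['heterogeneousConfig']}]
    request_dict = {
        'aml': {'key': 'aml_config', 'value': experiment_config.get('amlConfig')},
        'remote': {'key': 'machine_list', 'value': experiment_config.get('machineList')},
        'pai': {'key': 'pai_config', 'value': experiment_config.get('paiConfig')},
        'local': {'key': 'local_config', 'value': experiment_config.get('localConfig')},
    }
    for platform in experiment_config['heterogeneousConfig']['trainingServicePlatforms']:
        # As in the PR, any recognized platform name gets an entry appended.
        if platform in request_dict:
            meta.append(request_dict[platform])
    meta.append({'key': 'trial_config', 'value': experiment_config['trial']})
    return meta

config = {
    'heterogeneousConfig': {'trainingServicePlatforms': ['local', 'remote']},
    'localConfig': {'useActiveGpu': False},
    'machineList': [{'ip': '10.1.1.1'}],
    'trial': {'command': 'python3 mnist.py', 'codeDir': '.', 'gpuNum': 0},
}
keys = [entry['key'] for entry in build_cluster_metadata(config)]
```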
@@ -420,6 +459,8 @@ def set_platform_config(platform, experiment_config, port, config_file_name, res
        config_result, err_msg = set_dlts_config(experiment_config, port, config_file_name)
    elif platform == 'aml':
        config_result, err_msg = set_aml_config(experiment_config, port, config_file_name)
    elif platform == 'heterogeneous':
        config_result, err_msg = set_heterogeneous_config(experiment_config, port, config_file_name)
    else:
        raise Exception(ERROR_INFO % 'Unsupported platform!')
        exit(1)
> **Review comment:** Would it be more user friendly to write `trainingServicePlatform: ["local", "remote"]` in the config?
>
> **Reply:** Normal YAML configs use the block list form (`- local`, `- remote`, as in the config above) as a list.