Create experiment from Python code #3111

liuzhe-lz · 2020-11-20T08:47:59Z

I'm not sure where the example should be placed. Suggestions are welcome.

ultmaster · 2020-12-07T03:15:43Z

nni/experiment/config/convert.py

+_logger = logging.getLogger(__name__)
+
+
+def to_old_yaml(config: ExperimentConfig, skip_nnictl: bool = False) -> Dict[str, Any]:


Suggest to_v1_yaml

ultmaster · 2020-12-07T06:37:07Z

nni/experiment/nni_client.py

@@ -228,7 +228,7 @@ def __repr__(self):
                    .format(self.trialJobId, self.status, self.hyperParameters, self.logPath,
                            self.startTime, self.endTime, self.finalMetricData, self.stderrPath)

-class Experiment:
+class ExternalExperiment:


What is an external experiment?

It means an experiment created by another process.
Feel free to suggest a better name.

What scenario is it for?

ultmaster · 2020-12-07T06:39:45Z

nni/tools/nnictl/launcher.py

@@ -85,6 +85,7 @@ def start_rest_server(port, platform, mode, experiment_id, foreground=False, log
        log_header = LOG_HEADER % str(time_now)
        stdout_file.write(log_header)
        stderr_file.write(log_header)
+        print('## [nnictl] cmds:', cmds)


Is this a debug message?

ultmaster · 2020-12-07T06:43:50Z

nni/experiment/config/convert.py

+    data['trial'] = {
+        'command': data.pop('trialCommand'),
+        'codeDir': data.pop('trialCodeDirectory'),
+        'gpuNum': data.pop('trialGpuNumber', '')


What is the default value for gpuNum?

It will be None, set in nni/experiment/config/common.py.
The change is in the experiment config PR.
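
For reference, a minimal sketch of how the `dict.pop` default behaves in this snippet. The input values are made up for illustration; only the key names come from the diff above. The `''` fallback applies only when the key is absent, so a `trialGpuNumber` that is present with the value None passes through as None.

```python
def build_trial(data):
    # Same shape as the code under review.
    # The '' default is used only if 'trialGpuNumber' is missing from data.
    return {
        'command': data.pop('trialCommand'),
        'codeDir': data.pop('trialCodeDirectory'),
        'gpuNum': data.pop('trialGpuNumber', ''),
    }

# Hypothetical inputs: key present with None vs. key absent.
with_none = build_trial({
    'trialCommand': 'python3 mnist.py',
    'trialCodeDirectory': '.',
    'trialGpuNumber': None,
})
without_key = build_trial({
    'trialCommand': 'python3 mnist.py',
    'trialCodeDirectory': '.',
})
```

If the dataclass always sets the field (to None by default), the `''` branch is never taken.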

ultmaster · 2020-12-07T06:44:14Z

nni/experiment/config/convert.py

+            'reuse': ts['reuseMode']
+        }
+
+    return data


What happens to other training platforms?

That will be in the experiment config PR. This PR is based on a draft snapshot. Same below.

ultmaster · 2020-12-07T06:44:47Z

nni/experiment/config/convert.py

+        data['remoteConfig'] = {'reuse': ts['reuseMode']}
+        data['machineList'] = []
+        for machine in ts['machineList']:
+            machine = {


This seems to have no effect.
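
A minimal sketch of what this comment is probably pointing at: rebinding the loop variable builds a new dict but never stores it, so the output list stays empty unless each converted entry is appended. The function name and the `ip`/`host` fields are assumptions for illustration; `preCommand` and `trialPrepareCommand` appear elsewhere in this diff.

```python
def convert_machines(machines):
    converted = []
    for machine in machines:
        # Build the converted dict under a new name and append it.
        # Assigning it back to `machine` alone would leave `converted` empty.
        entry = {
            'ip': machine['host'],  # hypothetical field names
            'preCommand': machine['trialPrepareCommand'],
        }
        converted.append(entry)
    return converted

# Hypothetical input
result = convert_machines([
    {'host': '10.0.0.1', 'trialPrepareCommand': 'source env.sh'},
])
```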

ultmaster · 2020-12-07T06:45:30Z

examples/trials/mnist-tfv2/launch.py

@@ -0,0 +1,27 @@
+# FIXME: For demonstration only. It should not be here
+
+from pathlib import Path


Are we releasing this feature as experimental or as the recommended way to launch an experiment?

Experimental, I think.

@ultmaster What is the difference? Why is experimental better?

I suggest a full IT and UT pipeline if we are recommending this feature.

I think we should add IT and UT, but that could be done in a follow-up PR.

Given the code completeness, the feature should be experimental. I still suggest adding IT and UT in this release.

QuanluZhang · 2020-12-07T12:14:01Z

Please fix the pipeline errors.

QuanluZhang · 2020-12-08T01:54:01Z

nni/experiment/config/base.py

+            optional = any([
+                type_name.startswith('Optional['),
+                type_name.startswith('Union[') and 'NoneType' in type_name,
+                type_name == 'Any'
+            ])


Directly parsing the type with string operations seems hacky...

Generic classes are not designed for attribute access.
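
One possible alternative, sketched under the assumption that only Python 3.8+ needs to be supported: `typing.get_origin` and `typing.get_args` do not exist in earlier versions, so this may not fit a codebase that still targets 3.6. The helper name `is_optional` is hypothetical.

```python
from typing import Any, Optional, Union, get_args, get_origin

def is_optional(tp) -> bool:
    # Treat Any as optional, matching the string-based check in the diff.
    if tp is Any:
        return True
    # Optional[X] is Union[X, None], so both cases reduce to a Union containing NoneType.
    return get_origin(tp) is Union and type(None) in get_args(tp)

checks = {
    'optional_int': is_optional(Optional[int]),
    'union_with_none': is_optional(Union[int, str, None]),
    'union_without_none': is_optional(Union[int, str]),
    'plain_int': is_optional(int),
    'any': is_optional(Any),
}
```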

nni/experiment/config/common.py

J-shang · 2020-12-08T05:09:53Z

nni/experiment/config/convert.py

+    if experiment_config.get('logCollection'):
+        request_data['logCollection'] = experiment_config.get('logCollection')
+    request_data['clusterMetaData'] = []
+    if experiment_config['trainingServicePlatform'] == 'local':


Just a reminder that this place is a bit strange, and the same goes for launcher.py.

Could you give more detail?

J-shang · 2020-12-08T05:16:53Z

nni/experiment/config/util.py

+
+def camel_case(key: str) -> str:
+    words = key.split('_')
+    return words[0] + ''.join(word.title() for word in words[1:])


What about _xxx_xxx, or will this not happen? Do we need to remove the _ prefix?

I suppose this will not happen because there is no equivalent in camelCase.
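
To make the edge case concrete, here is the function from the diff applied to a regular key and to a key with a leading underscore. The input keys are examples only.

```python
def camel_case(key: str) -> str:
    words = key.split('_')
    return words[0] + ''.join(word.title() for word in words[1:])

# A regular snake_case key converts as expected.
normal = camel_case('trial_gpu_number')   # 'trialGpuNumber'

# A leading underscore yields an empty first word, so the underscore is
# dropped and the first real word is capitalized.
leading = camel_case('_xxx_xxx')          # 'XxxXxx'
```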

J-shang · 2020-12-08T07:55:59Z

nni/experiment/experiment.py

+            while True:
+                time.sleep(10)
+                status = self.get_status()
+                if status in ['ERROR', 'STOPPED', 'NO_MORE_TRIAL']:


NO_MORE_TRIAL means currSubmittedTrialNum >= experimentProfile.params.maxTrialNum while there are still unfinished jobs. We should not return in that state.

Oh, got it.

nni/experiment/config/common.py

J-shang · 2020-12-08T08:52:12Z

nni/runtime/protocol.py

@@ -32,8 +32,7 @@ class CommandType(Enum):
        _in_file = open(3, 'rb')
        _out_file = open(4, 'wb')
 except OSError:
-    _msg = 'IPC pipeline not exists, maybe you are importing tuner/assessor from trial code?'
-    logging.getLogger(__name__).warning(_msg)
+    pass


Don't we need a log here?

The warning seems to cause more trouble than benefit.
Importing the module in the wrong place doesn't introduce real problems.

J-shang · 2020-12-08T08:56:22Z

setup.py

@@ -74,7 +74,8 @@
    'pkginfo',
    'websockets',
    'filelock',
-    'prettytable'
+    'prettytable',
+    'dataclasses ; python_version < "3.7"'


Do we support python_version >= 3.7?

dataclasses is part of the standard library in 3.7+.

SparkSnail · 2020-12-08T09:40:10Z

nni/experiment/config/convert.py

+                'preCommand': machine['trialPrepareCommand']
+            }
+
+    elif ts['platform'] == 'pai':


What about other platforms, like kubeflow, aml, etc.?

It's in PR #3138.
This PR is not about the config schema. It merely includes a snapshot version so it can run end to end.

SparkSnail · 2020-12-08T09:45:54Z

nni/experiment/config/convert.py

+            {'key': 'aml_config', 'value': experiment_config['amlConfig']})
+        request_data['clusterMetaData'].append(
+            {'key': 'trial_config', 'value': experiment_config['trial']})
+    return request_data


'adl' and 'dlts' are missing here.

SparkSnail · 2020-12-08T09:50:15Z

nni/runtime/platform/local.py

@@ -21,9 +20,6 @@
    os.makedirs(_outputdir)

 _nni_platform = trial_env_vars.NNI_PLATFORM
-if _nni_platform == 'local':


Does this remove trial.log in local mode?

The logging system has been refactored. It's initialized when nni (the top-level package) gets imported.

SparkSnail · 2020-12-08T09:50:54Z

nni/runtime/protocol.py

@@ -32,8 +32,7 @@ class CommandType(Enum):
        _in_file = open(3, 'rb')
        _out_file = open(4, 'wb')
 except OSError:
-    _msg = 'IPC pipeline not exists, maybe you are importing tuner/assessor from trial code?'
-    logging.getLogger(__name__).warning(_msg)
+    pass


Why remove the warning?

It causes more problems than benefit. Importing the package in the wrong place won't cause a real problem.

nni/experiment/config/common.py

nni/experiment/config/local.py

QuanluZhang · 2020-12-08T11:53:04Z

nni/runtime/protocol.py

@@ -32,8 +32,7 @@ class CommandType(Enum):
        _in_file = open(3, 'rb')
        _out_file = open(4, 'wb')
 except OSError:
-    _msg = 'IPC pipeline not exists, maybe you are importing tuner/assessor from trial code?'
-    logging.getLogger(__name__).warning(_msg)
+    pass


Why remove this log?

It causes many problems (I have fixed related issues multiple times) and has little real benefit.
It's more of a debug message from nni's early stage, when I had no confidence in the correctness of the ipc modules.

liuzhe-lz added 6 commits November 20, 2020 16:30

first draft

c8ffed8

second ver

bfbec3d

refactor logging

e8648eb

fix cluster metadata

bacd496

clean up

4307ee8

use foreground in example

b177b79

liuzhe-lz marked this pull request as ready for review November 26, 2020 23:47

liuzhe-lz added 5 commits November 27, 2020 07:57

Merge branch 'master' into exp

e38278d

add missing file

97d370c

fix pylint

8c55d21

update ts timestamp to match python format

7f96326

try to fix ts version differnce

b052411

QuanluZhang requested review from chicm-ms, SparkSnail, ultmaster, J-shang and QuanluZhang November 29, 2020 13:05

liuzhe-lz mentioned this pull request Nov 30, 2020

v2.0 Release Plan #2935

Closed

77 tasks

QuanluZhang removed the request for review from chicm-ms December 4, 2020 09:09

liuzhe-lz and others added 7 commits December 7, 2020 08:21

stop on python exit

bb36624

Merge branch 'master' into exp

7f303b0

improve message

2158fb1

add dep

4f646b4

fix ut

eed0a39

fix ut

a5b737e

fix windows

df43dab

ultmaster reviewed Dec 7, 2020

liuzhe added 3 commits December 7, 2020 15:13

fix comment

13aa4b5

remove debug message

7453bd7

fix windows color

b100720