Skip to content
This repository has been archived by the owner on Sep 18, 2024. It is now read-only.

Align nni.experiment tuner behavior with nnictl #3419

Merged
merged 5 commits into from
Mar 8, 2021

Conversation

liuzhe-lz
Copy link
Contributor

This PR also refactors NNI manager to support quit from REST request.

'tuner_gpu_indices': lambda value: [int(idx) for idx in value.split(',')] if isinstance(value, str) else value,
'tuner': lambda config: None if config.name == '_none_' else config,
'assessor': lambda config: None if config.name == '_none_' else config,
'advisor': lambda config: None if config.name == '_none_' else config,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

where 'none' is assigned to name and in which case?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

found...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In Experiment.__init__(). The purpose is to let users omit calling AlgorithmConfig() explicitly.
AlgorithmConfig with name "none" is designed to be equivalent to None, just like Path('/home') is equivalent to str('/home/'). This means everywhere tuner can be None, it can also be AlgorithmConfig(name='_none_').

@@ -97,7 +86,8 @@ def _start_rest_server(config: ExperimentConfig, port: int, debug: bool, experim
from subprocess import CREATE_NEW_PROCESS_GROUP
proc = Popen(cmd, cwd=node_dir, creationflags=CREATE_NEW_PROCESS_GROUP)
else:
proc = Popen(cmd, cwd=node_dir)
import os
proc = Popen(cmd, cwd=node_dir, preexec_fn=os.setpgrp)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why preexec_fn is not set in the previous version?

Copy link
Contributor Author

@liuzhe-lz liuzhe-lz Mar 5, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because it's not needed.
This is about terminating subprocess. Previously tuner is not a subprocess.

const ds: DataStore = component.get(DataStore);
await ds.close();
const restServer: NNIRestServer = component.get(NNIRestServer);
await restServer.stop();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is a little strange that these components are created in this file, but stopped in nnimanager.ts

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe. But I don't want to make huge refactor at this point.

@@ -14,7 +14,7 @@


def to_v1_yaml(config: ExperimentConfig, skip_nnictl: bool = False) -> Dict[str, Any]:
config.validate(skip_nnictl)
config.validate(False)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

skip_nnictl is not used in this function. what is the meaning of skip_nnictl?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function is likely to be removed soon. I'm too lazy to make it elegant.

router.delete('/experiment', (req: Request, res: Response) => {
this.nniManager.stopExperimentTopHalf().then(() => {
res.send();
this.nniManager.stopExperimentBottomHalf();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can wait for the whole experiment to stop, then res.send()? To avoid something like the dispatcher has an independent process in the future. Or we need to put this kind of independent process cleanup() in stopExperimentTopHalf()? Anyway, it is fine in this version.

Copy link
Contributor Author

@liuzhe-lz liuzhe-lz Mar 5, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because killing trials is too time consuming. I'm afraid users (like Quanlu) will kill the process with second ctrl-c if we let them wait that long.
In theory if a clean up routine should block the requester, it goes into top half; if the routine takes too much time which harms the user experience, it goes into bottom half.

await this.stopExperimentBottomHalf();
}

public async stopExperimentTopHalf(): Promise<void> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The API sounds very internal. Can we think of a formal name?

Copy link
Contributor Author

@liuzhe-lz liuzhe-lz Mar 8, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The name comes from Linux interrupt handler. Top-half means it should be completed before handling control to caller, while bottom-half means it can be handled later in background.

@J-shang J-shang closed this Mar 5, 2021
@J-shang J-shang reopened this Mar 5, 2021
@QuanluZhang QuanluZhang self-requested a review March 6, 2021 13:40
@J-shang J-shang merged commit bc55eec into microsoft:master Mar 8, 2021
@liuzhe-lz liuzhe-lz deleted the exp-tuner branch March 17, 2021 22:34
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants