-
Notifications
You must be signed in to change notification settings - Fork 1.8k
Align nni.experiment
tuner behavior with nnictl
#3419
Conversation
'tuner_gpu_indices': lambda value: [int(idx) for idx in value.split(',')] if isinstance(value, str) else value, | ||
'tuner': lambda config: None if config.name == '_none_' else config, | ||
'assessor': lambda config: None if config.name == '_none_' else config, | ||
'advisor': lambda config: None if config.name == '_none_' else config, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
where 'none' is assigned to name
and in which case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
found...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In Experiment.__init__()
. The purpose is to let users omit calling AlgorithmConfig()
explicitly.
AlgorithmConfig
with name "none" is designed to be equivalent to None
, just like Path('/home')
is equivalent to str('/home/')
. This means everywhere tuner
can be None
, it can also be AlgorithmConfig(name='_none_')
.
@@ -97,7 +86,8 @@ def _start_rest_server(config: ExperimentConfig, port: int, debug: bool, experim | |||
from subprocess import CREATE_NEW_PROCESS_GROUP | |||
proc = Popen(cmd, cwd=node_dir, creationflags=CREATE_NEW_PROCESS_GROUP) | |||
else: | |||
proc = Popen(cmd, cwd=node_dir) | |||
import os | |||
proc = Popen(cmd, cwd=node_dir, preexec_fn=os.setpgrp) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why preexec_fn
is not set in the previous version?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because it's not needed.
This is about terminating subprocess. Previously tuner is not a subprocess.
const ds: DataStore = component.get(DataStore); | ||
await ds.close(); | ||
const restServer: NNIRestServer = component.get(NNIRestServer); | ||
await restServer.stop(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it is a little strange that these components are created in this file, but stopped in nnimanager.ts
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe. But I don't want to make huge refactor at this point.
@@ -14,7 +14,7 @@ | |||
|
|||
|
|||
def to_v1_yaml(config: ExperimentConfig, skip_nnictl: bool = False) -> Dict[str, Any]: | |||
config.validate(skip_nnictl) | |||
config.validate(False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
skip_nnictl
is not used in this function. what is the meaning of skip_nnictl
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function is likely to be removed soon. I'm too lazy to make it elegant.
router.delete('/experiment', (req: Request, res: Response) => { | ||
this.nniManager.stopExperimentTopHalf().then(() => { | ||
res.send(); | ||
this.nniManager.stopExperimentBottomHalf(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we can wait for the whole experiment to stop, then res.send()
? To avoid something like the dispatcher has an independent process in the future. Or we need to put this kind of independent process cleanup()
in stopExperimentTopHalf()
? Anyway, it is fine in this version.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because killing trials is too time consuming. I'm afraid users (like Quanlu) will kill the process with second ctrl-c if we let them wait that long.
In theory if a clean up routine should block the requester, it goes into top half; if the routine takes too much time which harms the user experience, it goes into bottom half.
await this.stopExperimentBottomHalf(); | ||
} | ||
|
||
public async stopExperimentTopHalf(): Promise<void> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The API sounds very internal. Can we think of a formal name?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The name comes from Linux interrupt handler. Top-half means it should be completed before handling control to caller, while bottom-half means it can be handled later in background.
This PR also refactors NNI manager to support quit from REST request.