-
Notifications
You must be signed in to change notification settings - Fork 1.8k
support display trial log on local mode #2718
Conversation
Approve for webui changes. |
@@ -325,6 +325,10 @@ class NNIManager implements Manager { | |||
// FIXME: unit test | |||
} | |||
|
|||
public async getTrialLog(trialJobId: string, logType: LogType): Promise<string> { | |||
return this.trainingService.getTrialLog(trialJobId, logType); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it might be better to catch MethodNotImplementedError
here, then return log message like XXX training service does not support retrieving log
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could you explain how did you fix this one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the latest commit, I revert this to the original version (i.e., not handle the error here) since the front end will check the training service now. The error here could also be an file operation error, so I let the restHander to catch the error throwed here.
} else { | ||
throw new Error('unexpected log type'); | ||
} | ||
return fs.promises.readFile(logPath, 'utf8'); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does this readFile() function need to catch exception?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This exception should be catched by getTrialLog
in restHandler.ts
@@ -8,6 +8,8 @@ | |||
*/ | |||
type TrialJobStatus = 'UNKNOWN' | 'WAITING' | 'RUNNING' | 'SUCCEEDED' | 'FAILED' | 'USER_CANCELED' | 'SYS_CANCELED' | 'EARLY_STOPPED'; | |||
|
|||
type LogType = 'TRIAL_LOG' | 'TRIAL_STDERR'; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does TRIAL_LOG means stdout log?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, corresponding to trial.log
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we discussed and decided that it should be ERROR_MESSAGE
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
change to TRIAL_LOG
and TRIAL_ERROR
localTrainingService.run() | ||
await delay(5000); | ||
//chai.expect(jobDetail.status).to.be.equals('RUNNING'); | ||
chai.expect(await localTrainingService.getTrialLog(jobDetail.id, 'TRIAL_LOG')).to.be.a('string'); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not compare with 'This is trial log'?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This part need refactor, since the submitted jobs are not actually "running". The run
function doesn't take effect. After sync with liuzhe, it may be better to add this test in IT. Maybe I should arrange an meeting to discuss issues about this pr.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in line 91, you already run the training service, why it does not take effect? @liuzhe-lz @chicm-ms
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I add a generate_logfile
function in mockedTrial.py to generate mock logfiles. However, these files seems not generated after run()
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why?
fs.rmdirSync(path.join(rootDir, 'trials')) | ||
|
||
await localTrainingService.cancelTrialJob(jobDetail.id); | ||
}).timeout(20000); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is testing training service, do we need to also test nnimanager or restserver?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This UT for getTrialLog
looks OK for me, we only want to test getTrialLog here, we do not care about other parts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok
@@ -72,6 +74,38 @@ describe('Unit Test for LocalTrainingService', () => { | |||
chai.expect(jobDetail.status).to.be.equals('USER_CANCELED'); | |||
}).timeout(20000); | |||
|
|||
it('Get trial log', async () => { | |||
// set meta data | |||
const trialConfig: string = `{\"command\":\"python3 mockedTrial.py\", \"codeDir\":\"${localCodeDir}\",\"gpuNum\":0}` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am wondering whether this mockedTrial.py
will cause some conflicts regarding log content. It may generate log which make the log file content is not as expected.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about change the command to sleep 5
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed. now use the original trialConifig in line 23
(cherry picked from commit 10c177c)
No description provided.