-
Notifications
You must be signed in to change notification settings - Fork 443
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for parallel studyjobs #404
Add support for parallel studyjobs #404
Conversation
/assign @hougangliu |
/assign @YujiOshima |
pkg/suggestion/nasrl_service.py
Outdated
if request.study_id not in self.registered_studies: | ||
self.setup_controller(request) | ||
self.is_first_run = True | ||
self.registered_studies.append(request.study_id) | ||
self.is_first_run[studyID] = True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
all self.dict will always increase, and nasrl-suggestion service will consume all memory (I think before that OOM killer will kill the pod)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@hougangliu Is it possible to get this value: https://github.com/kubeflow/katib/blob/master/pkg/api/operators/apis/studyjob/v1alpha1/studyjob_types.go#L34
inside Suggestion, right now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
RequestCount is the only information to determine whether a job has ended or not. If we cannot get that, we could not delete the StudyJob information.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In fact, a studyjob may stop calling GetSuggestion from suggestion service when: 1. changed to failed
; 2. changed to completed
(exceed RequestCount or exceed goal).
I think we had to add a new api like StopSuggestion
in above two conditions to make suggestionService to clear something. @YujiOshima @richardsliu @johnugeorge WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@hougangliu @YujiOshima Should we ignore this parameter at this time?
It will work, but we can't clean dictionary with studyIDs and information about it.
After we add RequestCount parameter to the API, we can modify this functionality.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please trace cleaning unused controller by an issue. thanks!
/assign @YujiOshima @richardsliu @johnugeorge |
/lgtm |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: hougangliu The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/retest |
/test kubeflow-katib-presubmit |
/retest |
Now the suggestion of neural architecture search with reinforcement learning supports multiple studyjobs running in parallel.
Fixes #382
This change is