Add Kubeflow MXJob example #1688
/assign @kubeflow/wg-training-leads
@andreyvelich
Sure, nice catch!
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: andreyvelich, terrytangyuan
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
(looks like you are still working on it) /hold
/retest
/retest
This PR is ready.
/retest
/retest
I added a Kubeflow MXJob example with BytePS.
@kubeflow/wg-training-leads Is it possible to redirect training logs in distributed MXNet from the `Worker` to the `Scheduler`? I can't find any information about it in the doc: https://mxnet.apache.org/versions/1.8.0/api/faq/distributed_training.
If not, we have to collect logs from the Workers, which only works with `cleanPodPolicy: None`, since the Metrics Collector sidecar must be finished.
/assign @kubeflow/wg-training-leads