Add Kubeflow MXJob example #1688

Merged

@andreyvelich andreyvelich commented Sep 29, 2021

I added a Kubeflow MXJob example that uses BytePS.

@kubeflow/wg-training-leads Is it possible to redirect training logs from the Workers to the Scheduler in distributed MXNet?
I can't find any information about it in the docs: https://mxnet.apache.org/versions/1.8.0/api/faq/distributed_training

If not, we have to collect logs from the Workers, which only works with cleanPodPolicy: None, since the Metrics Collector sidecar must be allowed to finish.

/assign @kubeflow/wg-training-leads

@andreyvelich
Member Author

/assign @kubeflow/wg-training-leads

@tenzen-y
Member

@andreyvelich
Member Author

@andreyvelich Could you fix the following mistyped section name in this PR?

Elastic Kubernetes Serice

https://github.com/kubeflow/katib/blob/29409198ff7a13d1984e48a0b14d6954296950d6/examples/v1beta1/fpga/README.md#simplifying-fpga-management-in-eks-elastic-kubernetes-serice

Sure, nice catch!

Member

@terrytangyuan terrytangyuan left a comment

/lgtm

@google-oss-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich, terrytangyuan

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@terrytangyuan
Member

terrytangyuan commented Oct 7, 2021

(looks like you are still working on it)

/hold

@andreyvelich
Member Author

/retest

@tenzen-y
Member

tenzen-y commented Oct 7, 2021

/retest

@andreyvelich
Member Author

This PR is ready.
/retest
/hold cancel

@andreyvelich
Member Author

/retest

@andreyvelich
Member Author

/retest

@google-oss-robot google-oss-robot merged commit 60baacd into kubeflow:master Oct 8, 2021
@andreyvelich andreyvelich deleted the add-mxnet-kubeflow-example branch October 8, 2021 00:33

4 participants