
feature: add support for mpijob in katib #1183

Conversation

YuxiJin-tobeyjin

What this PR does / why we need it:
Add support for mpijob in katib

Which issue(s) this PR fixes (optional, in "fixes #<issue_number>, fixes #<issue_number>, ..." format; will close the issue(s) when the PR is merged):
Fixes #1181

Release note:

None

cc @gaocegege @johnugeorge @andreyvelich

@googlebot

Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please visit https://cla.developers.google.com/ to sign.

Once you've signed (or fixed any issues), please reply here with @googlebot I signed it! and we'll verify it.



@kubeflow-bot

This change is Reviewable

@k8s-ci-robot

Hi @YuxiJin-tobeyjin. Thanks for your PR.

I'm waiting for a kubeflow member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign johnugeorge
You can assign the PR to them by writing /assign @johnugeorge in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot

@YuxiJin-tobeyjin: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

In response to this:

/retest

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@YuxiJin-tobeyjin
Author

@googlebot I signed it!

@googlebot

CLAs look good, thanks!


@gaocegege (Member) left a comment

@andreyvelich (Member) left a comment

@YuxiJin-tobeyjin Thank you for doing this!
Can you add e2e test for MPI Job, please?


	// built-in JobRoles
	JobRole        = "job-role"
	JobRoleTF      = "tf-job-role"
	JobRolePyTorch = "pytorch-job-role"
	JobRoleMpi     = "mpi_role_type"
Member

Job roles in MPI are named with "_", not with "-"?


Member

@terrytangyuan Hi, will we change the label in the future, maybe v1beta1 or v1?

Member

Agree with @gaocegege. I think MPI Job should follow the same patterns as TF Job and PyTorch Job.

Member

Thanks. I changed it in kubeflow/mpi-operator#252 for the v1 candidate of the MPI Operator. Perhaps this PR can add support for the v1 candidate directly? The API should be relatively stable even though there isn't an official release yet.

Member

Agree with @gaocegege.
@YuxiJin-tobeyjin, can you change this PR to support the v1 MPI Operator version?

Author

OK, but I need to run some tests on the latest mpi-operator master branch to ensure the changes work as expected; I will do it ASAP.

Member

Thanks for your contribution! 🎉 👍

@YuxiJin-tobeyjin (Author) commented May 14, 2020

@terrytangyuan I've tried the mpi-operator master branch on our cluster with Kubernetes 1.14 and encountered some problems.

First, when creating the new MPIJob CRD, it throws:

unknown field "additionalPrinterColumns" in io.k8s.apiextensions-apiserver.pkg.apis.apiextensions.v1beta1.CustomResourceValidation

After working around that, the CRD was created OK, but the mpi-operator failed with:

the server could not find the requested resource (get mpijobs.kubeflow.org)

Maybe the problem is here: it still uses KubeflowV1alpha2 to get the v1 CRD?

https://github.com/kubeflow/mpi-operator/blob/acddf3028ce922e24ac3e735a2928ff4487be28f/cmd/mpi-operator.v1/app/server.go#L288

After working around that as well, mpi-operator throws:

Failed to list *v1beta1.PodGroup: podgroups.scheduling.volcano.sh

So, to use mpi-operator v1, must we now use Volcano as the batch scheduler, since the latest kube-batch release does not support v1beta1.PodGroup? We've actually been using the latest kube-batch release for some time...

Member

Thanks! I fixed the CRD validation issue in kubeflow/mpi-operator#257 (comment). Yes, Volcano has better support and maintenance. You may want to update your MPI Operator deployment YAML, which includes the permissions for Volcano resources: https://github.com/kubeflow/mpi-operator/blob/master/deploy/v1/mpi-operator.yaml

@@ -30,7 +30,7 @@ const (
 	TrainingCompleted = "completed"

-	DefaultFilter = `([\w|-]+)\s*=\s*((-?\d+)(\.\d+)?)`
+	DefaultFilter = `([\w|-]+)\s*[:=]\s*((-?\d+)(\.\d+)?)`
Member

Why do you want to change DefaultFilter?

Author

In most cases, the printed logs look like "filtername : *%" or "filtername = *%", so I prefer the default filter to support both ":" and "=". Would that be better? @andreyvelich WDYT

Member

I'm happy with this modification.

Member

If we change this, I think we should update the documentation about the default filter for the Metrics Collector, for example here: https://www.kubeflow.org/docs/components/hyperparameter-tuning/experiment/#metrics-collector.

Do you think specifying a filter with the (:) format will not be handy for Katib users? E.g., like we did for the DARTS example: https://github.com/kubeflow/katib/blob/master/examples/v1alpha3/nas/darts-example-gpu.yaml#L19.
Your thoughts, @gaocegege @johnugeorge?

Member

I think we can keep it to enhance the default filter. But we need to open an issue to update the doc.
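
For illustration, here is a small standalone Go sketch (an editor's example, not part of this PR) showing how the broadened filter matches metric lines written with either separator; the sample log lines are made up.

	package main

	import (
		"fmt"
		"regexp"
	)

	func main() {
		// Proposed DefaultFilter: the character class [:=] accepts both "=" and ":".
		filter := regexp.MustCompile(`([\w|-]+)\s*[:=]\s*((-?\d+)(\.\d+)?)`)

		lines := []string{
			"accuracy=0.94",  // "=" style, already matched by the old filter
			"loss : -0.1234", // ":" style, matched only by the new filter
		}
		for _, line := range lines {
			if m := filter.FindStringSubmatch(line); m != nil {
				fmt.Printf("metric=%s value=%s\n", m[1], m[2])
			}
		}
	}

With the old pattern, the second line would not match at all, which is why experiments that log metrics with ":" (such as the DARTS example linked above) need a custom filter today.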

OwnerReferences: []metav1.OwnerReference{
	{
		APIVersion: "batch/v1",
		Kind:       "Job",
@terrytangyuan (Member) commented May 12, 2020

Not sure if it's related here, but just FYI: we removed Job and StatefulSet for the v1 candidate, so now MPI Operator only creates pods: kubeflow/mpi-operator#203
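
For comparison, a hedged sketch of roughly what the owner reference above would become once trial pods are owned directly by a v1 MPIJob; the name is a placeholder, and the kubeflow.org/v1 group/version is an assumption about the v1 candidate.

	OwnerReferences: []metav1.OwnerReference{
		{
			// Pods owned directly by the MPIJob, with no intermediate
			// batch/v1 Job or StatefulSet in between.
			APIVersion: "kubeflow.org/v1", // assumed group/version of the v1 candidate
			Kind:       "MPIJob",
			Name:       "example-mpijob",  // placeholder name
		},
	},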

Author

(⊙o⊙)… we developed some features based on the workers' StatefulSet (sts) properties...

Author

@terrytangyuan Is there a plan for when a new mpi-operator version will be released?

@terrytangyuan (Member) commented May 13, 2020

No concrete timeline yet but it should be relatively stable now as we are getting close to graduation.

@carmark (Member) commented Sep 23, 2020

@andreyvelich The proposal was merged; could we move this PR forward or submit a new one?

@YuxiJin-tobeyjin (Author)

Closing because #1273 will provide a better solution.

@andreyvelich (Member) commented Oct 3, 2020

@carmark Sure, I have already created a PR to support the MPI Operator in Katib: #1342.

Successfully merging this pull request may close these issues: Katib doesn't support mpijob.