feature: add support for mpijob in katib #1183
An RBAC rules manifest adds `mpijobs` to the `kubeflow.org` resources the Katib controller may manage:

```diff
@@ -64,6 +64,7 @@ rules:
   - kubeflow.org
   resources:
   - tfjobs
+  - mpijobs
   - pytorchjobs
   verbs:
   - "*"
```
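For context, the complete rule after this change would read roughly as follows (a sketch; the `apiGroups` key above the hunk is assumed from the surrounding manifest):

```yaml
rules:
- apiGroups:
  - kubeflow.org
  resources:
  - tfjobs
  - mpijobs
  - pytorchjobs
  verbs:
  - "*"
```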
In the metrics collector constants, the default metrics filter is widened to accept `:` as well as `=` between the metric name and its value:

```diff
@@ -30,7 +30,7 @@ const (
 	TrainingCompleted = "completed"

-	DefaultFilter = `([\w|-]+)\s*=\s*((-?\d+)(\.\d+)?)`
+	DefaultFilter = `([\w|-]+)\s*[:=]\s*((-?\d+)(\.\d+)?)`
 )
```
Review discussion on the `DefaultFilter` change:

- "Why do you want to change this?"
- "In most cases the printed logs look like `filtername : *%` or `filtername = *%`, so I would prefer the default filter to support both `:` and `=`. Would that be better? @andreyvelich WDYT?"
- "I'm happy with this modification."
- "If we change this, I think we should update the documentation about the default filter for the Metrics Collector, for example here: https://www.kubeflow.org/docs/components/hyperparameter-tuning/experiment/#metrics-collector. Do you think specifying a filter in the `:` format will be inconvenient for Katib users?"
- "I think we can keep it to enhance the default filter, but we need to open an issue to update the doc."
`TestGetKatibJob` gains a case where the pod is owned by a `batch/v1` launcher Job named `<mpijob-name>-launcher`:

```diff
@@ -480,6 +480,23 @@ func TestGetKatibJob(t *testing.T) {
 			Err:  true,
 			Name: "Invalid Kind",
 		},
+		{
+			Pod: v1.Pod{
+				ObjectMeta: metav1.ObjectMeta{
+					OwnerReferences: []metav1.OwnerReference{
+						{
+							APIVersion: "batch/v1",
+							Kind:       "Job",
+							Name:       "OwnerName-launcher",
+						},
+					},
+				},
+			},
+			ExpectedJobKind: "Job",
+			ExpectedJobName: "OwnerName",
+			Err:             false,
+			Name:            "Valid Pod",
+		},
 	}

 	for _, tc := range testCases {
```

Review discussion on the `Kind: "Job"` owner reference:

- "Not sure if it's related here, but just FYI: we removed Job and StatefulSet for the v1 candidate, so now MPI Operator only creates pods: kubeflow/mpi-operator#203"
- "(⊙o⊙)… we developed some features based on the worker's StatefulSet property..."
- "@terrytangyuan Is there any plan for when a new mpi-operator version will be released?"
- "No concrete timeline yet, but it should be relatively stable now, as we are getting close to graduation."
`TestIsMasterRole` is extended with an MPIJob case, where the launcher pod is identified by a dedicated role label rather than the `job-role` label used by TFJob and PyTorchJob:

```diff
@@ -498,6 +515,7 @@ func TestGetKatibJob(t *testing.T) {
 func TestIsMasterRole(t *testing.T) {
 	masterRoleLabel := make(map[string]string)
 	masterRoleLabel[consts.JobRole] = MasterRole
+	masterRoleLabel[consts.JobRoleMpi] = LanucherRole
 	invalidLabel := make(map[string]string)
 	invalidLabel["invalid-label"] = "invalid"
 	testCases := []struct {
```

```diff
@@ -521,6 +539,16 @@ func TestIsMasterRole(t *testing.T) {
 			IsMaster: true,
 			Name:     "Pytorch Master Pod",
 		},
+		{
+			Pod: v1.Pod{
+				ObjectMeta: metav1.ObjectMeta{
+					Labels: masterRoleLabel,
+				},
+			},
+			JobKind:  "MPIJob",
+			IsMaster: true,
+			Name:     "MPI Launcher Pod",
+		},
 		{
 			Pod: v1.Pod{
 				ObjectMeta: metav1.ObjectMeta{
```
Review discussion on the MPI role label:

- "Job roles in MPI are named with `_`, not with `-`?"
- "Yes, refer to https://github.com/kubeflow/mpi-operator/blob/master/pkg/controllers/v1alpha2/mpi_job_controller.go#L81"
- "@terrytangyuan Hi, will the label change in the future, maybe in v1beta1 or v1?"
- "Agree with @gaocegege. I think MPI should follow the same patterns as TFJob and PyTorchJob."
- "Thanks. I changed it in kubeflow/mpi-operator#252 for the v1 candidate of MPI Operator. Perhaps this PR can add support for the v1 candidate directly? The API should be relatively stable even though there isn't an official release yet."
- "Agree with @gaocegege. @YuxiJin-tobeyjin, can you change this PR to support the v1 MPI Operator version?"
- "OK, but I need to run some tests against the latest mpi master branch to make sure the changes work as expected; I will do it ASAP."
- "Thanks for your contribution! 🎉 👍"
- "@terrytangyuan I've tried the mpi-operator master branch on our cluster with Kubernetes 1.14 and hit some problems. First, creating the new MPIJob CRD throws an error. After working around that, the CRD was created fine, but the mpi-operator itself failed to come up. Maybe the problem is here: it still uses KubeflowV1alpha2 to get the v1 CRD? https://github.com/kubeflow/mpi-operator/blob/acddf3028ce922e24ac3e735a2928ff4487be28f/cmd/mpi-operator.v1/app/server.go#L288 After working around that as well, the mpi-operator throws another error. So, to use mpi-operator v1, must we use Volcano as the batch scheduler? The latest kube-batch release does not support v1beta1.PodGroup, and we've actually been using the latest kube-batch release for some time..."
- "Thanks! I fixed the CRD validation issue in kubeflow/mpi-operator#257 (comment). Yes, Volcano has better support and maintenance. You may also want to update your MPI Operator deployment YAML, which includes permissions for Volcano resources: https://github.com/kubeflow/mpi-operator/blob/master/deploy/v1/mpi-operator.yaml"