feature: add support for mpijob in katib #1183

Closed
1 change: 1 addition & 0 deletions manifests/v1alpha3/katib-controller/rbac.yaml
@@ -64,6 +64,7 @@ rules:
- kubeflow.org
resources:
- tfjobs
- mpijobs
- pytorchjobs
verbs:
- "*"
3 changes: 3 additions & 0 deletions pkg/controller.v1alpha3/consts/const.go
@@ -112,11 +112,14 @@ const (
JobKindTF = "TFJob"
// JobKindPyTorch is the kind of PyTorchJob.
JobKindPyTorch = "PyTorchJob"
// JobKindMpi is the kind of MpiJob.
JobKindMpi = "MPIJob"

// built-in JobRoles
JobRole = "job-role"
JobRoleTF = "tf-job-role"
JobRolePyTorch = "pytorch-job-role"
JobRoleMpi = "mpi_role_type"
Member

Are job roles in MPI named with "_" rather than "-"?

Author

Member

@terrytangyuan Hi, will the label change in a future version, maybe v1beta1 or v1?

Member

Agree with @gaocegege. I think MPI should follow the same pattern as TFJob and PyTorchJob.

Member

Thanks. I changed it in kubeflow/mpi-operator#252 for the v1 candidate of MPI Operator. Perhaps this PR can add support for the v1 candidate directly? The API should be relatively stable even though there isn't an official release yet.

Member

Agree with @gaocegege.
@YuxiJin-tobeyjin can you change this PR to support the v1 version of MPI Operator?

Author

OK, but I need to run some tests against the latest mpi-operator master branch to make sure the changes work as expected. I will do it ASAP.

Member

Thanks for your contribution! 🎉 👍

Author

@YuxiJin-tobeyjin, May 14, 2020

@terrytangyuan I tried the mpi-operator master branch on our cluster, which runs Kubernetes 1.14, and ran into some problems.

First, creating the new MPIJob CRD fails with:

unknown field "additionalPrinterColumns" in io.k8s.apiextensions-apiserver.pkg.apis.apiextensions.v1beta1.CustomResourceValidation

After working around that, the CRD was created successfully, but creating the mpi-operator failed with:

the server could not find the requested resource (get mpijobs.kubeflow.org)

Maybe the problem is here: is KubeflowV1alpha2 still being used to get the v1 CRD?

https://github.com/kubeflow/mpi-operator/blob/acddf3028ce922e24ac3e735a2928ff4487be28f/cmd/mpi-operator.v1/app/server.go#L288

After working around that as well, the mpi-operator reports:

Failed to list *v1beta1.PodGroup: podgroups.scheduling.volcano.sh

So with mpi-operator v1, must we now use Volcano as the batch scheduler? The latest kube-batch release does not support v1beta1.PodGroup, and we have already been using that release for some time.

Member

Thanks! I fixed the CRD validation issue in kubeflow/mpi-operator#257 (comment). Yes, Volcano has better support and maintenance. You may want to update your MPI Operator deployment YAML, which includes permissions for Volcano resources: https://github.com/kubeflow/mpi-operator/blob/master/deploy/v1/mpi-operator.yaml


// AnnotationIstioSidecarInjectName is the annotation of Istio Sidecar
AnnotationIstioSidecarInjectName = "sidecar.istio.io/inject"
7 changes: 7 additions & 0 deletions pkg/job/v1alpha3/kubeflow.go
@@ -100,4 +100,11 @@ func init() {
Kind: consts.JobKindPyTorch,
}
JobRoleMap[consts.JobKindPyTorch] = []string{consts.JobRole, consts.JobRolePyTorch}
ProviderRegistry[consts.JobKindMpi] = &Kubeflow{}
SupportedJobList[consts.JobKindMpi] = schema.GroupVersionKind{
Group: "kubeflow.org",
Version: "v1alpha2",
Kind: consts.JobKindMpi,
}
JobRoleMap[consts.JobKindMpi] = []string{consts.JobRole, consts.JobRoleMpi}
}
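
The registration in this hunk can be pictured with plain maps. The following is a simplified, standalone sketch and not Katib's actual code: `groupVersionKind` is a hypothetical stand-in for `schema.GroupVersionKind` (so no external modules are needed), and `supportedJobs`, `jobRoleLabels`, and `lookup` are names invented for illustration. Only the string values are taken from the diff.

```go
package main

import "fmt"

// groupVersionKind is a hypothetical stand-in for schema.GroupVersionKind
// from k8s.io/apimachinery, used here to keep the sketch dependency-free.
type groupVersionKind struct {
	Group, Version, Kind string
}

// supportedJobs and jobRoleLabels mirror the shape of SupportedJobList and
// JobRoleMap in the diff. The names are assumptions made for this sketch.
var (
	supportedJobs = map[string]groupVersionKind{}
	jobRoleLabels = map[string][]string{}
)

func init() {
	// Values copied from the hunk: MPIJob is registered under v1alpha2 and
	// its pods are identified through two candidate role label keys.
	supportedJobs["MPIJob"] = groupVersionKind{
		Group:   "kubeflow.org",
		Version: "v1alpha2",
		Kind:    "MPIJob",
	}
	jobRoleLabels["MPIJob"] = []string{"job-role", "mpi_role_type"}
}

// lookup is a hypothetical helper. It returns the registered GVK and role
// label keys for a kind, or false when the kind was never registered.
func lookup(kind string) (groupVersionKind, []string, bool) {
	gvk, ok := supportedJobs[kind]
	if !ok {
		return groupVersionKind{}, nil, false
	}
	return gvk, jobRoleLabels[kind], true
}

func main() {
	if gvk, labels, ok := lookup("MPIJob"); ok {
		fmt.Printf("%s/%s %s, role labels %v\n", gvk.Group, gvk.Version, gvk.Kind, labels)
	}
}
```

Under this model, moving to the v1 candidate discussed in the review would presumably mean changing the registered `Version` and the role label key, though the PR as shown registers v1alpha2 only.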
2 changes: 1 addition & 1 deletion pkg/metricscollector/v1alpha3/common/const.go
@@ -30,7 +30,7 @@ const (

TrainingCompleted = "completed"

- DefaultFilter = `([\w|-]+)\s*=\s*((-?\d+)(\.\d+)?)`
+ DefaultFilter = `([\w|-]+)\s*[:=]\s*((-?\d+)(\.\d+)?)`
Member

Why do you want to change DefaultFilter?

Author

In most cases, the printed logs look like “filtername : *%” or “filtername = *%”, so I would prefer the default filter to support both “:” and “=”. Would that be better? @andreyvelich WDYT

Member

I'm happy with this modification.

Member

If we change this, I think we should update the documentation about the default filter for the Metrics Collector. For example, here: https://www.kubeflow.org/docs/components/hyperparameter-tuning/experiment/#metrics-collector.

Do you think that explicitly specifying a filter for the ":" format would be inconvenient for Katib users?
E.g., as we did for the DARTS example: https://github.com/kubeflow/katib/blob/master/examples/v1alpha3/nas/darts-example-gpu.yaml#L19.
Your thoughts @gaocegege @johnugeorge?

Member

I think we can keep it to enhance the default filter. But we need to open an issue to update the doc.
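
To make the effect of the proposed pattern concrete, here is a small standalone sketch that runs it against a few sample log lines. The regular expression is copied from the diff; `parseMetrics` and the sample lines are assumptions made up for illustration and are not part of Katib.

```go
package main

import (
	"fmt"
	"regexp"
)

// proposedFilter is the DefaultFilter value introduced by this PR.
var proposedFilter = regexp.MustCompile(`([\w|-]+)\s*[:=]\s*((-?\d+)(\.\d+)?)`)

// parseMetrics is a hypothetical helper. It returns a name -> value map
// built from every match of the filter in a single log line.
func parseMetrics(line string) map[string]string {
	out := map[string]string{}
	for _, m := range proposedFilter.FindAllStringSubmatch(line, -1) {
		// m[1] is the metric name, m[2] the numeric value as text.
		out[m[1]] = m[2]
	}
	return out
}

func main() {
	// Sample lines are invented for this sketch.
	for _, line := range []string{
		"accuracy=0.93",
		"Validation-accuracy : 0.98",
		"2020-05-14T10:35:02 step done",
	} {
		fmt.Printf("%q -> %v\n", line, parseMetrics(line))
	}
}
```

The first two lines are parsed as intended with either separator. The third line is worth noting when updating the docs: because `\w` also matches digits, accepting ":" means a timestamp such as `10:35` can be picked up as a name/value pair, which the "="-only pattern would not do.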

)

var (
5 changes: 3 additions & 2 deletions pkg/webhook/v1alpha3/pod/const.go
@@ -21,8 +21,9 @@ import (
)

const (
- MasterRole = "master"
- BatchJob = "Job"
+ MasterRole = "master"
+ LanucherRole = "launcher"
+ BatchJob = "Job"
)

var (
28 changes: 28 additions & 0 deletions pkg/webhook/v1alpha3/pod/inject_webhook_test.go
@@ -480,6 +480,23 @@ func TestGetKatibJob(t *testing.T) {
Err: true,
Name: "Invalid Kind",
},
{
Pod: v1.Pod{
ObjectMeta: metav1.ObjectMeta{
OwnerReferences: []metav1.OwnerReference{
{
APIVersion: "batch/v1",
Kind: "Job",
Member

@terrytangyuan, May 12, 2020

Not sure if it's related here, but just FYI: we removed Job and StatefulSet for the v1 candidate, so MPI Operator now only creates pods: kubeflow/mpi-operator#203

Author

(⊙o⊙)… we developed some features based on the workers' StatefulSet (sts) properties...

Author

@terrytangyuan Is there a plan for when a new mpi-operator version will be released?

Member

@terrytangyuan, May 13, 2020

No concrete timeline yet, but it should be relatively stable now as we are getting close to graduation.

Name: "OwnerName-launcher",
},
},
},
},
ExpectedJobKind: "Job",
ExpectedJobName: "OwnerName",
Err: false,
Name: "Valid Pod",
},
}

for _, tc := range testCases {
@@ -498,6 +515,7 @@ func TestGetKatibJob(t *testing.T) {
func TestIsMasterRole(t *testing.T) {
masterRoleLabel := make(map[string]string)
masterRoleLabel[consts.JobRole] = MasterRole
masterRoleLabel[consts.JobRoleMpi] = LanucherRole
invalidLabel := make(map[string]string)
invalidLabel["invalid-label"] = "invalid"
testCases := []struct {
@@ -521,6 +539,16 @@ func TestIsMasterRole(t *testing.T) {
IsMaster: true,
Name: "Pytorch Master Pod",
},
{
Pod: v1.Pod{
ObjectMeta: metav1.ObjectMeta{
Labels: masterRoleLabel,
},
},
JobKind: "MPIJob",
IsMaster: true,
Name: "MPI Launcher Pod",
},
{
Pod: v1.Pod{
ObjectMeta: metav1.ObjectMeta{
12 changes: 11 additions & 1 deletion pkg/webhook/v1alpha3/pod/utils.go
@@ -19,6 +19,7 @@ package pod
import (
"errors"
"fmt"
"strings"

"github.com/google/go-containerregistry/pkg/authn"
"github.com/google/go-containerregistry/pkg/authn/k8schain"
@@ -37,6 +38,15 @@ func getKatibJob(pod *v1.Pod) (string, string, error) {
owners := pod.GetOwnerReferences()
for _, owner := range owners {
if isMatchGVK(owner, gvk) {
if strings.Contains(owner.Name, LanucherRole) {
// The launcher pod is owned by a Job, not by the MPIJob directly.
// The Job's name, e.g. "mpi-example-wf2hx8lr-launcher",
// consists of the MPIJob name plus "-launcher",
// so the trial name is the owner name with "-launcher" removed.
tn := strings.Split(owner.Name, "-")
trialName := strings.Join(tn[:len(tn)-1], "-")
return owner.Kind, trialName, nil
}
return owner.Kind, owner.Name, nil
}
}
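
One thing to watch in this hunk: `strings.Contains` matches "launcher" anywhere in the owner name, so an owner called, say, `launcher-test-abc123` would also lose its last dash-separated segment and resolve to `launcher-test`. A suffix-based variant avoids that. The following is a standalone sketch of the idea, not code from the PR; `launcherSuffix` and `trialNameFromOwner` are hypothetical names.

```go
package main

import (
	"fmt"
	"strings"
)

// launcherSuffix is a hypothetical constant for this sketch. It is the
// suffix that the comment in the diff says is appended to the MPIJob name.
const launcherSuffix = "-launcher"

// trialNameFromOwner is a hypothetical helper. It removes a trailing
// "-launcher" from the owner name and leaves every other name unchanged.
func trialNameFromOwner(ownerName string) string {
	return strings.TrimSuffix(ownerName, launcherSuffix)
}

func main() {
	// The first name comes from the comment in the diff; the others are
	// invented to show names that should pass through untouched.
	for _, name := range []string{
		"mpi-example-wf2hx8lr-launcher",
		"launcher-test-abc123",
		"tfjob-example",
	} {
		fmt.Printf("%s -> %s\n", name, trialNameFromOwner(name))
	}
}
```

`strings.TrimSuffix` returns the input unchanged when the suffix is absent, so the separate `Contains` check and the split/join step are not needed in this variant.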
@@ -62,7 +72,7 @@ func isMasterRole(pod *v1.Pod, jobKind string) bool {
}
for _, label := range labels {
if v, err := getLabel(pod, label); err == nil {
- if v == MasterRole {
+ if v == MasterRole || v == LanucherRole {
return true
}
}
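
The label check in this hunk can be tried in isolation with the sketch below. It is a simplified stand-in that works on a plain label map rather than a `*v1.Pod`, and covers only the loop shown above; `isPrimaryPod`, the lower-case constants, and the `worker` label value are assumptions made for illustration.

```go
package main

import "fmt"

// Role values copied from the diff; the lower-case names are local to
// this sketch.
const (
	masterRole   = "master"
	launcherRole = "launcher"
)

// isPrimaryPod is a hypothetical helper. It reports whether any of the
// given role label keys carries the master or launcher role value.
func isPrimaryPod(podLabels map[string]string, roleKeys []string) bool {
	for _, key := range roleKeys {
		switch podLabels[key] {
		case masterRole, launcherRole:
			return true
		}
	}
	return false
}

func main() {
	// Role label keys registered for MPIJob in this PR.
	mpiKeys := []string{"job-role", "mpi_role_type"}

	fmt.Println(isPrimaryPod(map[string]string{"mpi_role_type": "launcher"}, mpiKeys))
	fmt.Println(isPrimaryPod(map[string]string{"mpi_role_type": "worker"}, mpiKeys))
}
```

A missing label reads as the empty string from the map, so it never matches either role and needs no separate error path in this simplified form.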