Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make Katib generic for operator support #341

Closed
1 task done
johnugeorge opened this issue Jan 23, 2019 · 12 comments
Closed
1 task done

Make Katib generic for operator support #341

johnugeorge opened this issue Jan 23, 2019 · 12 comments

Comments

@johnugeorge
Copy link
Member

johnugeorge commented Jan 23, 2019

Currently in order to support custom K8s to Katib, explicit support has to be added. There are explicit checks for each resource in the current code. Remove such checks so that it is generic to support new operators.

eg: In getJobWorkerStatus, https://github.com/kubeflow/katib/blob/master/pkg/controller/studyjob/studyjob_controller.go#L421

Related PRs:

@richardsliu
Copy link
Contributor

StudyJobController currently implements different logic for each worker type to check if the worker has completed: https://github.com/kubeflow/katib/blob/master/pkg/controller/studyjob/studyjob_controller.go#L413

BatchJob also supports the "last condition" format, so in theory we should be able to make this code generic: https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/api/batch/v1/generated.proto#L158

@richardsliu richardsliu added this to New in 0.5.0 via automation Feb 4, 2019
@richardsliu richardsliu moved this from New to Hyperparameter Tuning in 0.5.0 Feb 4, 2019
@johnugeorge
Copy link
Member Author

/assign

@johnugeorge
Copy link
Member Author

/area 0.5.0

@richardsliu
Copy link
Contributor

I looked a bit into this, seems like we are dependent on other operator binaries in a few ways:

  1. Watch on resources. Currently we are doing:
	err = c.Watch(
		&source.Kind{Type: &tfjobv1beta1.TFJob{}},
		&handler.EnqueueRequestForOwner{
			IsController: true,
			OwnerType:    &katibv1alpha1.StudyJob{},
		})
	if isFatalWatchError(err, TFJobWorker) {
		return err
	}

Instead, can we do something like:

	u := &unstructured.Unstructured{}
	u.SetGroupVersionKind(schema.GroupVersionKind{
		Kind:    "TFJob",
		Group:   "",
		Version: "v1beta1",
	})
	err = c.Watch(
		&source.Kind{Type: u},
		&handler.EnqueueRequestForOwner{
			IsController: true,
			OwnerType:    &katibv1alpha1.StudyJob{},
		})
	if err != nil {
		return err
	}
  1. Get/Delete resources at runtime, e.g. https://github.com/kubeflow/katib/blob/master/pkg/controller/studyjob/studyjob_controller.go#L313

Seems like these calls just need a pointer to a runtime.Object, and do not need any specific linking to operators.

  1. GetStatus, e.g. https://github.com/kubeflow/katib/blob/master/pkg/controller/studyjob/studyjob_controller.go#L417

We just need to define an interface with Status.Conditions, and it should just work across operators. The contract is that each resource needs to support Status.Conditions.

  1. Decode, e.g. https://github.com/kubeflow/katib/blob/master/pkg/controller/studyjob/studyjob_controller.go#L612

Do we really need this? Can we just let each operator handle this part?

@YujiOshima @johnugeorge @gaocegege @hougangliu What do you think?

@johnugeorge
Copy link
Member Author

I had a look at this earlier.
From my quick code walk,
1,2,3 can solved using unstructured. It needs higher controller-runtime version. It is now possible as we had upgraded k8s version of operators.

Can you explain your solution for 4? Unstructured might work for 4. Not sure about it.

I will raise a PR this week wrt to version upgrade in katib which is a pre requisite and then take this up.

@YujiOshima
Copy link
Contributor

For 1, Do we need to watch each job type with unstructured?

u.SetGroupVersionKind(schema.GroupVersionKind{
		Kind:    "TFJob",
		Group:   "",
		Version: "v1beta1",
	})
	err = c.Watch(
...

then

u.SetGroupVersionKind(schema.GroupVersionKind{
		Kind:    "PytorchJob",
		Group:   "",
		Version: "v1beta1",
	})
	err = c.Watch(

and for Job, CronJob?

For 4, I agree with @johnugeorge .

@richardsliu
Copy link
Contributor

For 4, we are trying to decode the yaml into a specific job instance, and raise an error if the decoding fails. Another place where we use this is https://github.com/kubeflow/katib/blob/master/pkg/controller/studyjob/utils.go#L83.

What if katib skips this altogether and just use unstructured to create the job instance? Then just let the specific operator handle marshalling errors. Does that work?

@YujiOshima
Copy link
Contributor

@richardsliu Thanks. Though I haven't checked, we may be able to use UnstructuredJSONScheme of k8s.io/apimachinery/pkg/apis/meta/v1/unstructured for decoding unstructed obj.

@johnugeorge
Copy link
Member Author

Agree.

@johnugeorge
Copy link
Member Author

All operator specific code is moved to unstructured. Watch will be made dynamic in next controller runtime version which is tracked in #317.

Closing this as #387 is merged.

@johnugeorge
Copy link
Member Author

/close

@k8s-ci-robot
Copy link

@johnugeorge: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

0.5.0 automation moved this from Hyperparameter Tuning to Done Mar 4, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
No open projects
0.5.0
  
Done
Development

No branches or pull requests

5 participants