StudyJob controller shouldn't crash if PyTorch (or other job operators not installed) #317

jlewi · 2019-01-07T01:40:27Z

A user reports that the StudyJob controller crashes if the PyTorch operator isn't installed on the cluster.

Logs for StudyJobController

$ kubectl logs -n kubeflow studyjob-controller-7d77f959-pjfzf
2019/01/06 22:43:34 Registering Components.
2019/01/06 22:43:34 controller.AddToManager(mgr)
2019/01/06 22:43:34 no matches for kind "PyTorchJob" in version "kubeflow.org/v1beta1"

This seems like a bug. If a particular job controller isn't installed I would still expect katib and StudyJobs to work with other types of job controllers.


johnugeorge · 2019-01-07T02:53:11Z

https://github.com/kubeflow/katib/blob/master/pkg/api/operators/apis/addtoscheme_tfjob_v1beta1.go#L21

Operator deployments are not needed if the corresponding workers are not needed in Katib. However, the operator CRDs are still required to be installed.

https://github.com/kubeflow/katib/blob/master/scripts/deploy.sh#L36

/cc @richardsliu

jlewi · 2019-01-07T14:59:07Z

@johnugeorge Can you elaborate? Why does linking in the spec for TFJob (https://github.com/kubeflow/katib/blob/master/pkg/api/operators/apis/addtoscheme_tfjob_v1beta1.go#L21) create a runtime dependency on the CRD being installed in the cluster? I would have thought that just creates a compile time dependency?

I guess my bigger question is: what's the long-term plan for Katib in terms of how it's going to fire off K8s resources to do the actual training? Is this covered in either of the design docs for Katib?

johnugeorge · 2019-01-07T16:14:02Z

@jlewi Runtime error should be where we do a watch on the jobs during init

https://github.com/kubeflow/katib/blob/master/pkg/controller/studyjob/studyjob_controller.go#L114

Can you elaborate more on "to fire off K8s resources to do the actual training"?

jlewi · 2019-01-14T00:54:09Z

@johnugeorge

> Operator deployments are not needed if corresponding workers are not needed in Katib. However, there is a requirement of having operator crds to be installed.

Why is there a requirement to have operator CRDs to be installed if they aren't being used? This seems like a limitation of the current implementation.

The StudyJob CRD takes a template for the K8s resource (PyTorchJob, TFJob, ChainerJob, Job, ....) that will be used to train the model. Why does Katib need to have explicit support for any of these resources?

Do you agree that Katib shouldn't crash if the CR for one of these resources isn't defined in the cluster?

Kubernetes provides a REST API that can be used to create any K8s resource. This REST API is generic; i.e. it's fully possible to write a client that can create/delete any K8s resource described by a K8s YAML manifest. Explicit support for a given resource doesn't need to be linked into the client.
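To illustrate the point about the generic REST API, here is a minimal Go sketch that builds the REST path for an arbitrary namespaced resource from strings alone, with no resource-specific types linked in. The helper names `collectionPath` and `objectPath` are hypothetical (they are not Katib or client-go functions), and the plural resource name is assumed to be supplied by the caller.

```go
package main

import (
	"fmt"
	"strings"
)

// collectionPath returns the REST path for a namespaced resource collection.
// apiVersion is the manifest's apiVersion field ("v1" or "group/version").
// plural is the lowercase plural resource name; it is assumed to be supplied
// by the caller (or discovered from the API server).
func collectionPath(apiVersion, namespace, plural string) string {
	prefix := "/apis/" + apiVersion
	if !strings.Contains(apiVersion, "/") {
		// The legacy core group ("v1") is served under /api, not /apis.
		prefix = "/api/" + apiVersion
	}
	return fmt.Sprintf("%s/namespaces/%s/%s", prefix, namespace, plural)
}

// objectPath returns the REST path for a single named object.
func objectPath(apiVersion, namespace, plural, name string) string {
	return collectionPath(apiVersion, namespace, plural) + "/" + name
}

func main() {
	fmt.Println(collectionPath("kubeflow.org/v1beta1", "default", "pytorchjobs"))
	fmt.Println(objectPath("batch/v1", "kubeflow", "jobs", "worker-1"))
}
```

The same two functions cover TFJob, PyTorchJob, a plain Job, or any custom resource, since nothing in them depends on the kind.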

gaocegege · 2019-01-14T06:30:06Z

I think the design of the informer does not allow us to register a resource at the time it is first used.

The controller implementation here https://github.com/kubernetes-sigs/controller-runtime/blob/master/pkg/internal/controller/controller.go#L126 will watch all resources that were registered beforehand. AFAIK, we cannot register CRDs dynamically. This is not a constraint of the k8s API; it is a limitation of the controller implementation, IMO.

johnugeorge · 2019-01-14T07:44:12Z

@jlewi I agree with you that Katib shouldn't crash if the user hasn't installed these CRs in the cluster. Ideally, users should have to install CRs only when they need them.

However, as @gaocegege said, it is difficult to implement dynamic registration of a resource watch in the controller.

E.g. during init, we can skip the resource watch (TFJob/PyTorchJob) if the CRDs are not installed. This will avoid the reported crash. However, if the user installs the CRD/operator later to add TFJob support, we need to dynamically add the watch on that resource. I couldn't find a nice way to handle this in the controller. I will investigate more.

Btw if the user installs Katib via Kfctl, crds are installed by default.

jlewi · 2019-01-15T18:41:38Z

The Go client libraries provide a REST interface
https://github.com/kubernetes/client-go/blob/master/rest/client.go

that allows us to perform the basic REST operations without linking in any CRD-specific generated client libraries.

Here's some sample code from the early days of TFOperator.
https://github.com/kubeflow/tf-operator/blob/e4a436da92e198dcb88c89c33010608e0c8a23bf/pkg/util/k8sutil/tf_job_client.go#L88

So given a YAML file for an arbitrary K8s resource or custom resource, we should be able to write generic logic to Create/Delete that object. We can do this without limiting ourselves to a fixed list of K8s objects a priori.

So I think the only question is how do we extend that so that the studyjob controller can be efficiently notified about events for the resources it is waiting on.

Does the informer library not have a similar unstructured version that would allow us to dynamically instantiate it at runtime for some resource?

Isn't the TFJob operator using an unstructured informer
https://github.com/kubeflow/tf-operator/blob/1fa0779840816b772a1c113c14220a2464d04ac0/pkg/util/unstructured/informer.go?
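As a rough illustration of the "generic logic" idea in the comment above, the following Go sketch decodes a manifest into an untyped map and extracts the fields needed to address the object, without any generated types. It uses JSON rather than YAML only so that the example stays within the standard library. `objectRef` and `refFromManifest` are hypothetical names, and falling back to the `default` namespace is an assumption of this sketch.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// objectRef holds the fields needed to address an object generically.
type objectRef struct {
	APIVersion, Kind, Namespace, Name string
}

// refFromManifest decodes a JSON manifest into an untyped map and extracts
// the identifying fields, without any resource-specific Go types.
func refFromManifest(manifest []byte) (objectRef, error) {
	var obj map[string]interface{}
	if err := json.Unmarshal(manifest, &obj); err != nil {
		return objectRef{}, err
	}
	// Assumption of this sketch: objects without a namespace go to "default".
	ref := objectRef{Namespace: "default"}
	ref.APIVersion, _ = obj["apiVersion"].(string)
	ref.Kind, _ = obj["kind"].(string)
	if meta, ok := obj["metadata"].(map[string]interface{}); ok {
		ref.Name, _ = meta["name"].(string)
		if ns, ok := meta["namespace"].(string); ok && ns != "" {
			ref.Namespace = ns
		}
	}
	if ref.APIVersion == "" || ref.Kind == "" || ref.Name == "" {
		return objectRef{}, errors.New("manifest needs apiVersion, kind and metadata.name")
	}
	return ref, nil
}

func main() {
	manifest := []byte(`{"apiVersion":"kubeflow.org/v1beta1","kind":"PyTorchJob","metadata":{"name":"trial-1","namespace":"kubeflow"}}`)
	ref, err := refFromManifest(manifest)
	fmt.Println(ref, err)
}
```

Because the manifest is never converted into a typed struct, a new job kind needs no code change in this part of the flow.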

gyliu513 · 2019-01-21T08:36:53Z

FYI @hougangliu @jinchihe

johnugeorge · 2019-01-21T09:19:40Z

A few solutions that I can think of:

1. Start a watch on TFJob/PyTorchJob using the REST API. When watch events are received, call the reconcile method of the current controller-runtime. [We would not call the controller-runtime watch API https://github.com/kubeflow/katib/blob/master/pkg/controller/studyjob/studyjob_controller.go#L113 . Instead, our custom function would be responsible for the watch.]
2. Start a watch on CRDs using the REST API. When watch events are received (i.e. when the job operator CRD is installed), start the controller-runtime watch mechanism on the corresponding job operator.
3. Install the TFJob/PyTorch CRDs during StudyJob controller init. No other changes are required here.

johnugeorge · 2019-01-21T18:12:15Z

@jlewi @richardsliu WDYT?

hougangliu · 2019-01-21T23:07:22Z

Maybe a simple solution is: if the TFJob/PyTorch CRDs are not installed when the StudyJob controller starts, ignore them and log a warning. Then, when a StudyJob is created with a TFJob/PyTorch job that is not being watched, mark the StudyJob as invalid.
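A minimal Go sketch of this proposal, under stated assumptions: `crdInstalled` and `watch` are hypothetical hooks standing in for the real CRD check and the controller's watch call, and `startWatches` / `validateWorkerKind` are illustrative names rather than Katib functions.

```go
package main

import (
	"fmt"
	"log"
)

// startWatches tries to watch every kind and skips those whose CRD is absent.
// crdInstalled and watch are hypothetical hooks, not real Katib APIs.
func startWatches(kinds []string, crdInstalled func(string) bool, watch func(string) error) (map[string]bool, error) {
	watched := make(map[string]bool)
	for _, kind := range kinds {
		if !crdInstalled(kind) {
			// Missing CRD: warn and carry on instead of crashing.
			log.Printf("warning: CRD for %s not found; skipping watch", kind)
			continue
		}
		if err := watch(kind); err != nil {
			return nil, fmt.Errorf("watch %s: %w", kind, err)
		}
		watched[kind] = true
	}
	return watched, nil
}

// validateWorkerKind rejects a StudyJob whose worker kind is not watched.
func validateWorkerKind(kind string, watched map[string]bool) error {
	if !watched[kind] {
		return fmt.Errorf("worker kind %q is not watched", kind)
	}
	return nil
}

func main() {
	installed := map[string]bool{"Job": true, "TFJob": true}
	watched, err := startWatches(
		[]string{"Job", "TFJob", "PyTorchJob"},
		func(kind string) bool { return installed[kind] },
		func(kind string) error { return nil },
	)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(validateWorkerKind("TFJob", watched))
	fmt.Println(validateWorkerKind("PyTorchJob", watched))
}
```

The set of watched kinds is fixed at startup in this sketch, so a CRD installed later is only picked up after the controller restarts.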

hougangliu · 2019-01-21T23:07:28Z

/assign

johnugeorge · 2019-01-22T01:51:10Z

@hougangliu I am not sure that is the right solution. It will force the user to reinstall Katib if they need to create a StudyJob with TFJob/PyTorchJob in the future (which should be the usual case).

I feel that we need to support dynamic watch on them at runtime so that behavior remains the same

jlewi · 2019-01-22T02:04:24Z

@johnugeorge regarding the options listed in
#317 (comment)

What's the difference between options 1 & 2? Does 1 require the supported job types to be explicitly enumerated whereas 2 doesn't?

Which of these options if any allow us to immediately support new operators e.g. (chainer, mpi, etc...) without having to change Katib code?

johnugeorge · 2019-01-22T02:21:48Z

> @johnugeorge regarding the options listed in
> #317 (comment)
>
> What's the difference between options 1 & 2? Does 1 require the supported job types to be explicitly enumerated whereas 2 doesn't?
>
> Which of these options if any allow us to immediately support new operators e.g. (chainer, mpi, etc...) without having to change Katib code?

1. refers to a watch on the TFJob/PyTorchJob resource itself,
e.g. watch /apis/kubeflow.org/v1beta1/namespaces/default/pytorchjobs and call the event handler directly.
2. refers to a watch on the CRD itself,
e.g. watch /apis/apiextensions.k8s.io/v1beta1/customresourcedefinitions and check whether the TFJob/PyTorch CRDs are installed. Then start a watch on the TFJob/PyTorch resource (using the controller watch). I haven't verified this option.

Both 1 and 2 require job types to be explicitly enumerated in the Katib code. However, it should be a minor addition when integrating an operator into Katib.

hougangliu · 2019-01-22T03:12:39Z

The StudyJob controller is implemented with controller-runtime now, and controller-runtime cannot support dynamic watch. Replacing controller-runtime may take some time.
I think this is a critical issue, since if the user doesn't install TFJob or PyTorchJob when installing Kubeflow, Katib fails to work. We can provide a short-term solution for Katib 0.4.0 (as mentioned: if the TFJob/PyTorch CRDs are not installed when the StudyJob controller starts, ignore them and log a warning. When a StudyJob is created with a TFJob/PyTorch job that is not being watched, mark the StudyJob as invalid. If the user installs TFJob/PyTorchJob later and wants to re-watch the CRD, they can just restart studyjob-controller, for example by deleting studyjob-controller, for it to take effect).

For the long-term solution, maybe we should replace controller-runtime.

@here any comment?

jlewi · 2019-01-22T12:52:01Z

If we can't support dynamic watch, can we make the list of resources a command line argument of Katib so that if a user wants to use additional resources with Katib we just have to update a command line argument and reapply?
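One way the command-line suggestion could look, as a small Go sketch. The flag name `watched-kinds` and the helper `parseKinds` are hypothetical; they are not existing Katib options.

```go
package main

import (
	"flag"
	"fmt"
	"strings"
)

// parseKinds splits a comma-separated flag value into a clean list of kinds,
// dropping surrounding whitespace and empty entries.
func parseKinds(value string) []string {
	var kinds []string
	for _, k := range strings.Split(value, ",") {
		if k = strings.TrimSpace(k); k != "" {
			kinds = append(kinds, k)
		}
	}
	return kinds
}

func main() {
	// "watched-kinds" is a hypothetical flag name, not an existing Katib option.
	watched := flag.String("watched-kinds", "Job", "comma-separated resource kinds to watch")
	flag.Parse()
	for _, kind := range parseKinds(*watched) {
		fmt.Println("would watch:", kind)
	}
}
```

With this shape, supporting an additional resource means editing the container arguments in the deployment and reapplying, with no change to the controller code.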

hougangliu · 2019-01-23T02:50:44Z

> If we can't support dynamic watch, can we make the list of resources a command line argument of Katib so that if a user wants to use additional resources with Katib we just have to update a command line argument and reapply?

Sounds good! However, for the long-term solution, I think dynamic watch gives a better user experience.

jlewi · 2019-01-23T16:38:19Z

@hougangliu Can you elaborate on the dynamic watch? Can you provide a reference to the code where the StudyJob controller sets up a watch on TFJob and PyTorch resources?

It looks like the current implementation is just polling for job status

katib/pkg/controller/studyjob/studyjob_controller.go

Line 404 in 8545970

runtimejob := createWorkerJobObj(w.Kind)

Can we update that code to use the REST API so we can easily support any K8s object?

I think polling the APIServer might eventually create too much load on the APIServer.
To solve that problem can we (eventually) switch to using a SharedInformer to cache events and status?

Here is the TFJob controller code:
https://github.com/kubeflow/tf-operator/blob/31e7169cfd77575c5b5ec38a8dc38f72cc309358/pkg/controller.v2/tensorflow/informer.go#L34

We are creating an unstructured informer based on the REST information (e.g. resource name and kind).

So could the StudyJob controller instantiate an unstructured informer for each type of resource the first time it sees a resource of a given type?

/cc @richardsliu
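The "first time it sees a resource of a given type" idea from the comment above can be sketched in Go as a registry that starts at most one watch per kind, on demand. This only shows the bookkeeping: `lazyWatchers` is a hypothetical type, and its `start` hook stands in for whatever would actually create an unstructured informer.

```go
package main

import (
	"fmt"
	"sync"
)

// lazyWatchers starts at most one watch per resource kind, on first use.
type lazyWatchers struct {
	mu      sync.Mutex
	started map[string]bool
	// start is a hypothetical hook that would create the informer for a kind.
	start func(kind string) error
}

func newLazyWatchers(start func(string) error) *lazyWatchers {
	return &lazyWatchers{started: make(map[string]bool), start: start}
}

// ensure starts the watch for kind if none exists yet. A failed start is not
// recorded, so a later call retries (for example after the CRD is installed).
func (l *lazyWatchers) ensure(kind string) error {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.started[kind] {
		return nil
	}
	if err := l.start(kind); err != nil {
		return err
	}
	l.started[kind] = true
	return nil
}

func main() {
	calls := 0
	w := newLazyWatchers(func(kind string) error {
		calls++
		fmt.Println("starting informer for", kind)
		return nil
	})
	for _, kind := range []string{"TFJob", "PyTorchJob", "TFJob"} {
		if err := w.ensure(kind); err != nil {
			fmt.Println("error:", err)
		}
	}
	fmt.Println("informers started:", calls)
}
```

The reconcile loop would call `ensure` with the worker kind before creating the worker, so no kind has to be registered at startup.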

johnugeorge · 2019-01-23T20:05:13Z

> @hougangliu Can you elaborate on the dynamic watch? Can you provide a reference to the code where the StudyJob controller sets up a watch on TFJob and PyTorch resources?

Resources are watched in https://github.com/kubeflow/katib/blob/master/pkg/controller/studyjob/studyjob_controller.go#L113

> It looks like the current implementation is just polling for job status
>
> katib/pkg/controller/studyjob/studyjob_controller.go
> Line 404 in 8545970
> runtimejob := createWorkerJobObj(w.Kind)
>
> Can we update that code to use the REST API so we can easily support any K8s object?
>
> I think polling the APIServer might eventually create too much load on the APIServer.
> To solve that problem can we (eventually) switch to using a SharedInformer to cache events and status?
>
> Here is the TFJob controller code:
> https://github.com/kubeflow/tf-operator/blob/31e7169cfd77575c5b5ec38a8dc38f72cc309358/pkg/controller.v2/tensorflow/informer.go#L34
>
> We are creating an unstructured informer based on the REST information (e.g. resource name and kind).
>
> So could the StudyJob controller instantiate an unstructured informer for each type of resource the first time it sees a resource of a given type?

Can you explain more on "for each type of resource the first time it sees a resource of a given type"?

> /cc @richardsliu

controller-runtime

hougangliu · 2019-01-24T00:05:35Z

> @hougangliu Can you elaborate on the dynamic watch? Can you provide a reference to the code where the StudyJob controller sets up a watch on TFJob and PyTorch resources?
>
> It looks like the current implementation is just polling for job status
>
> katib/pkg/controller/studyjob/studyjob_controller.go
> Line 404 in 8545970
> runtimejob := createWorkerJobObj(w.Kind)
>
> Can we update that code to use the REST API so we can easily support any K8s object?
>
> I think polling the APIServer might eventually create too much load on the APIServer.
> To solve that problem can we (eventually) switch to using a SharedInformer to cache events and status?
>
> Here is the TFJob controller code:
> https://github.com/kubeflow/tf-operator/blob/31e7169cfd77575c5b5ec38a8dc38f72cc309358/pkg/controller.v2/tensorflow/informer.go#L34
>
> We are creating an unstructured informer based on the REST information (e.g. resource name and kind).
>
> So could the StudyJob controller instantiate an unstructured informer for each type of resource the first time it sees a resource of a given type?
>
> /cc @richardsliu

I will consider using an unstructured informer, but I wonder whether an unstructured informer can reduce load on the APIServer.
The reason controller-runtime cannot support dynamic watch is that once controller-runtime has started, a lock is created, so we cannot Watch a new resource when needed.

hougangliu · 2019-01-24T00:27:33Z

Is PR #335 still needed? I thought of it as a solution for 0.4, considering this issue's severity.

richardsliu · 2019-01-24T00:29:05Z

I would still like to keep #335 as a temporary fix. Meanwhile we can investigate how to use unstructured informer.

hougangliu · 2019-01-25T00:30:25Z

/reopen
Trace long-term solution

k8s-ci-robot · 2019-01-25T00:30:26Z

@hougangliu: Reopened this issue.

In response to this:

/reopen
Trace long-term solution

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

jlewi · 2019-01-25T15:19:53Z

Thanks @johnugeorge for the pointers and @hougangliu for the short term fix.

Regarding the long-term solution:

@johnugeorge @hougangliu @richardsliu Is it possible to make Watch work in a dynamic fashion?

It looks like StudyJob controller creates the manager here:

katib/cmd/studyjobcontroller/main.go

Line 45 in f4026e4

mgr, err := manager.New(cfg, manager.Options{})

It looks like the manager config options allow specifying the NewCacheFunc, which is used to create the informer.

https://github.com/kubernetes-sigs/controller-runtime/blob/6ada5f3055493a6c2fdafe240a7ae00bbbb7048a/pkg/manager/manager.go#L125

jlewi · 2019-02-19T13:23:34Z

@johnugeorge @richardsliu is the long term fix #341? Can we close this issue?

johnugeorge · 2019-03-04T09:37:18Z

Everything except the watch was moved to the unstructured type in #341. Since dynamic watch is supported only in newer controller-runtime versions (kubernetes-sigs/kubebuilder#422), the unstructured watch can be created dynamically once the controller-runtime version is upgraded.

johnugeorge · 2019-03-04T09:40:22Z

Currently, controllers don't crash if operators are not installed. However, a restart is needed after a CRD is installed.

richardsliu · 2019-03-06T21:05:20Z

@johnugeorge Should we rename this issue? The controllers aren't crashing anymore.

johnugeorge · 2019-03-07T02:55:24Z

I will close this issue then, as it is no longer valid. Opened #422 to track the operator watch.

johnugeorge · 2019-03-07T02:55:34Z

/close

k8s-ci-robot · 2019-03-07T02:55:35Z

@johnugeorge: Closing this issue.

In response to this:

/close


jlewi added priority/p1 area/katib kind/bug labels Jan 7, 2019

jlewi added this to New in 0.5.0 via automation Jan 7, 2019

richardsliu moved this from New to Releasing & Testing in 0.5.0 Jan 7, 2019

richardsliu moved this from Releasing & Testing to Hyperparameter Tuning in 0.5.0 Jan 7, 2019

gyliu513 mentioned this issue Jan 21, 2019

studyjob controller stays in CrashLoopbackoff kubeflow/kubeflow#2308

Closed

k8s-ci-robot assigned hougangliu Jan 21, 2019

hougangliu mentioned this issue Jan 22, 2019

ignore tfjob/pytorch job if corresponding CRD not created #335

Merged

hougangliu mentioned this issue Jan 23, 2019

studyJob cannot recover once Completed or Failed #291

Closed

johnugeorge mentioned this issue Jan 23, 2019

Make Katib generic for operator support #341

Closed

1 task

k8s-ci-robot closed this as completed in #335 Jan 25, 2019

0.5.0 automation moved this from Hyperparameter Tuning to Done Jan 25, 2019

k8s-ci-robot reopened this Jan 25, 2019

0.5.0 automation moved this from Done to New Jan 25, 2019

richardsliu mentioned this issue Jan 30, 2019

Katib 2019 Roadmap #348

Merged

jlewi moved this from New to Hyperparameter Tuning in 0.5.0 Feb 4, 2019

johnugeorge mentioned this issue Feb 17, 2019

Removing Operator specific handling during a StudyJob run #387

Merged

johnugeorge mentioned this issue Mar 7, 2019

Create a dynamic watch on Job operators #422

Closed

k8s-ci-robot closed this as completed Mar 7, 2019

0.5.0 automation moved this from Hyperparameter Tuning to Done Mar 7, 2019
