Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

StudyJob controller shouldn't crash if PyTorch (or other job operators not installed) #317

Closed
jlewi opened this issue Jan 7, 2019 · 33 comments

Comments

@jlewi
Copy link
Contributor

jlewi commented Jan 7, 2019

See: kubeflow/kubeflow#2212

User reports that the StudyJob controller crashes if pytorch operator isn't installed on the cluster.

Logs for StudyJobController

$ kubectl logs -n kubeflow studyjob-controller-7d77f959-pjfzf
2019/01/06 22:43:34 Registering Components.
2019/01/06 22:43:34 controller.AddToManager(mgr)
2019/01/06 22:43:34 no matches for kind "PyTorchJob" in version "kubeflow.org/v1beta1"

This seems like a bug. If a particular job controller isn't installed I would still expect katib and StudyJobs to work with other types of job controllers.

@jlewi jlewi added this to New in 0.5.0 via automation Jan 7, 2019
@johnugeorge
Copy link
Member

johnugeorge commented Jan 7, 2019

https://github.com/kubeflow/katib/blob/master/pkg/api/operators/apis/addtoscheme_tfjob_v1beta1.go#L21

Operator deployments are not needed if corresponding workers are not needed in Katib. However, there is a requirement of having operator crds to be installed.

https://github.com/kubeflow/katib/blob/master/scripts/deploy.sh#L36

/cc @richardsliu

@jlewi
Copy link
Contributor Author

jlewi commented Jan 7, 2019

@johnugeorge Can you elaborate? Why does linking in the spec for TFJob (https://github.com/kubeflow/katib/blob/master/pkg/api/operators/apis/addtoscheme_tfjob_v1beta1.go#L21) create a runtime dependency on the CRD being installed in the cluster? I would have thought that just creates a compile time dependency?

I guess my bigger question is; what's the long term plan here for Katib in terms of how its going to fire of K8s resources to do the actual training? Is this covered in either of the design docs for Katib?

@johnugeorge
Copy link
Member

@jlewi Runtime error should be where we do a watch on the jobs during init

https://github.com/kubeflow/katib/blob/master/pkg/controller/studyjob/studyjob_controller.go#L114

can you elaborate more on "to fire off K8s resources to do the actual training" ?

@richardsliu richardsliu moved this from New to Releasing & Testing in 0.5.0 Jan 7, 2019
@richardsliu richardsliu moved this from Releasing & Testing to Hyperparameter Tuning in 0.5.0 Jan 7, 2019
@jlewi
Copy link
Contributor Author

jlewi commented Jan 14, 2019

@johnugeorge

Operator deployments are not needed if corresponding workers are not needed in Katib. However, there is a requirement of having operator crds to be installed.

Why is there a requirement to have operator CRDs to be installed if they aren't being used? This seems like a limitation of the current implementation.

The StudyJob CRD takes a template for the K8s resource (PyTorchJob, TFJob, ChainerJob, Job, ....) that will be used to train the model. Why does Katib need to have explict support for any of these resources?

Do you agree that Katib shouldn't crash if the CR for one of these resources isn't defined in the cluster?

Kubernetes provides a REST API that can be used to create any K8s API. This REST API is generic; i.e. its fully possible to create a client that can create/delete any K8s resource described by the K8s YAML manifest. The client doesn't need to have explicit support for a given resource to be linked into it.

@gaocegege
Copy link
Member

I think the design of informer does not allow us to register the resource when it is used.

The controller implementation here https://github.com/kubernetes-sigs/controller-runtime/blob/master/pkg/internal/controller/controller.go#L126 will watch all resources which are registered previously. AFAIK, we cannot register CRDs dynamically. It is not constrained by the k8s API, it is limited by the implementation of the controller, IMO.

@johnugeorge
Copy link
Member

@jlewi I agree with you that Katib shouldn't crash if the user hasn't installed these CRs in the cluster . Ideally, users have to install CRs only when they need to.

However as @gaocegege told, it is difficult to implement dynamic registration of resource watch in the controller.

Eg: During the init, we can skip the resource watch(TFjob/PyTorchJob) if CRDs are not installed. This will avoid the reported crash. However if user installs the crd/operator in the future for adding TFJob support, we need to dynamically add the watch on that resource. I couldn't find a nice way to handle this in the controller. I will investigate more.

Btw if the user installs Katib via Kfctl, crds are installed by default.

@jlewi
Copy link
Contributor Author

jlewi commented Jan 15, 2019

The GoClient libraries provide a REST interface
https://github.com/kubernetes/client-go/blob/master/rest/client.go

That allow us to perform the basic rest operations without linking in any CRD specific client side generated client generated libraries.

Here's some sample code from the early days of TFOperator.
https://github.com/kubeflow/tf-operator/blob/e4a436da92e198dcb88c89c33010608e0c8a23bf/pkg/util/k8sutil/tf_job_client.go#L88

So given a YAML file for an arbitrary K8s resource or custom resource. We should be able to write generic logic to Create/Delete that object. We can do this without limiting ourselves to a fixed list of K8s objects apriori.

So I think the only question is how do we extend that so that the studyjob controller can be efficiently notified about events for the resources it is waiting on.

Does the informer library not have a similar unstructured version that would allow us to dynamically instantiate it at runtime for some resource?

Isn't the TFJob operator using an unstructured informer
https://github.com/kubeflow/tf-operator/blob/1fa0779840816b772a1c113c14220a2464d04ac0/pkg/util/unstructured/informer.go?

@gyliu513
Copy link
Member

FYI @hougangliu @jinchihe

@johnugeorge
Copy link
Member

Few solutions that I can think of

  1. Start a watch of TFJob/PyTorch Job using rest api. When watch events are received, call the reconcile method of the current controller run time. [We are not calling the controller-runtime watch api https://github.com/kubeflow/katib/blob/master/pkg/controller/studyjob/studyjob_controller.go#L113 . Instead our custom function is responsible for the watch]
  2. Start a watch on CRD using rest api. When watch events are received(ie when the job operator crd is installed), start the controller-runtime watch mechanism on the corresponding job operator.
  3. Install TFJob/PyTorch CRDs during the study job controller init. No other changes are required here.

@johnugeorge
Copy link
Member

@jlewi @richardsliu WDYT?

@hougangliu
Copy link
Member

hougangliu commented Jan 21, 2019

Maybe a simple solution is: when TFJob/PyTorch CRDs does not installed when studyjob controller starts, ignore it but a warning log. And when we creates a studyjob with TFJob/PyTorch job, if it not watched, mark the studyjobb as invalid.

@hougangliu
Copy link
Member

/assign

@johnugeorge
Copy link
Member

@hougangliu I am not sure if it is the right solution. This will force the user to reinstall katib if it needs to create study job with TFJob/PyTorch Job in the future(which should be the usual case)

I feel that we need to support dynamic watch on them at runtime so that behavior remains the same

@jlewi
Copy link
Contributor Author

jlewi commented Jan 22, 2019

@johnugeorge regarding the options listed in
#317 (comment)

What's the difference between options 1 & 2? Does 1 require the supported job types to be explicitly enumerated whereas 2 doesn't?

Which of these options if any allow us to immediately support new operators e.g. (chainer, mpi, etc...) without having to change Katib code?

@johnugeorge
Copy link
Member

johnugeorge commented Jan 22, 2019

@johnugeorge regarding the options listed in
#317 (comment)

What's the difference between options 1 & 2? Does 1 require the supported job types to be explicitly enumerated whereas 2 doesn't?

Which of these options if any allow us to immediately support new operators e.g. (chainer, mpi, etc...) without having to change Katib code?

1 refers to watch on TFjob/PyTorchJob resource
eg:watch on /apis/kubeflow.org/v1beta1/namespaces/default/pytorchjobs and call event handler directly.
2. refers to watch on CRD itself
eg: watch on apis/apiextensions.k8s.io/v1beta1/customresourcedefinitions and check if TFJob/PyTorch CRDs are installed. Then, start watch on the TFJob/PyTorch resource(using controller watch). I haven't verified this option.

Both 1 and 2 will require job types to be explicitly enumerated in the Katib code. However, it should be minor addition when integrating the operator in Katib.

@hougangliu
Copy link
Member

studyJob controller is implemented by controller-runtime now, and controller-runtime cannot support dynamic watch. replacing controller-runtime maybe take some time.
I think this is a critical issue since once user doesn't install tfjob or pytorch job when install kubeflow, katib fails to work. we can provide a short-term solution for katib 0.4.0 (as mentioned: when TFJob/PyTorch CRDs does not installed when studyjob controller starts, ignore it but a warning log. And when we creates a studyjob with TFJob/PyTorch job, if it not watched, mark the studyjobb as invalid. In fact, when user installs TFJob/PyTorch job later, if he wants to re-watch the CRD, he can just restart studyjob-controller , for example delete studyjob-controller, to take it effect)

For long-term solution, maybe we should replace controller-runtime.

@here any comment?

@jlewi
Copy link
Contributor Author

jlewi commented Jan 22, 2019

If we can't support dynamic watch, can we make the list of resources a command line argument of Katib so that if a user wants to use additional resources with Katib we just have to update a command line argument and reapply?

@hougangliu
Copy link
Member

If we can't support dynamic watch, can we make the list of resources a command line argument of Katib so that if a user wants to use additional resources with Katib we just have to update a command line argument and reapply?

Sound good! However for long-term solution, I think dynamic watch is better for a good UE

@jlewi
Copy link
Contributor Author

jlewi commented Jan 23, 2019

@hougangliu Can you elaborate on the dynamic watch? Can you provide a reference to the code where the StudyJob controller sets up a watch on TFJob and PyTorch resources?

It looks like the current implementation is just polling for job status

runtimejob := createWorkerJobObj(w.Kind)

Can we update that code to use the REST API so we can easily support any K8s object?

I think polling the APIServer might eventually create too much load on the APIServer.
To solve that problem can we (eventually) switch to using a SharedInformer to cache events and status?

Here is the TFJob controller code:
https://github.com/kubeflow/tf-operator/blob/31e7169cfd77575c5b5ec38a8dc38f72cc309358/pkg/controller.v2/tensorflow/informer.go#L34

We are creating an unstructured informer based on the REST information (e.g. resource name and kind).

So could the StudyJob controller instantiate an unstructured informer for each type of resource the first time it sees a resource of a given type?

/cc @richardsliu

@johnugeorge
Copy link
Member

@hougangliu Can you elaborate on the dynamic watch? Can you provide a reference to the code where the StudyJob controller sets up a watch on TFJob and PyTorch resources?

resources are watched in https://github.com/kubeflow/katib/blob/master/pkg/controller/studyjob/studyjob_controller.go#L113

It looks like the current implementation is just polling for job status

katib/pkg/controller/studyjob/studyjob_controller.go
Line 404 in 8545970
runtimejob := createWorkerJobObj(w.Kind)
Can we update that code to use the REST API so we can easily support any K8s object?

I think polling the APIServer might eventually create too much load on the APIServer.
To solve that problem can we (eventually) switch to using a SharedInformer to cache events and status?

Here is the TFJob controller code:
https://github.com/kubeflow/tf-operator/blob/31e7169cfd77575c5b5ec38a8dc38f72cc309358/pkg/controller.v2/tensorflow/informer.go#L34

We are creating an unstructured informer based on the REST information (e.g. resource name and kind).

So could the StudyJob controller instantiate an unstructured informer for each type of resource the first time it sees a resource of a given type?

can you explain more on "for each type of resource the first time it sees a resource of a given type?"

/cc @richardsliu

controller-runtime

@hougangliu
Copy link
Member

@hougangliu Can you elaborate on the dynamic watch? Can you provide a reference to the code where the StudyJob controller sets up a watch on TFJob and PyTorch resources?

It looks like the current implementation is just polling for job status

katib/pkg/controller/studyjob/studyjob_controller.go
Line 404 in 8545970
runtimejob := createWorkerJobObj(w.Kind)
Can we update that code to use the REST API so we can easily support any K8s object?

I think polling the APIServer might eventually create too much load on the APIServer.
To solve that problem can we (eventually) switch to using a SharedInformer to cache events and status?

Here is the TFJob controller code:
https://github.com/kubeflow/tf-operator/blob/31e7169cfd77575c5b5ec38a8dc38f72cc309358/pkg/controller.v2/tensorflow/informer.go#L34

We are creating an unstructured informer based on the REST information (e.g. resource name and kind).

So could the StudyJob controller instantiate an unstructured informer for each type of resource the first time it sees a resource of a given type?

/cc @richardsliu

I will consider using unstructured informer, but I wonder if unstructured informer can reduce load on the APIServer.
And why controller-runtime cannot support dynamic watch is that once controller-runtime started, a lock created so that we cannot Watch the new resource in need.

@hougangliu
Copy link
Member

Does PR #335 need any more? I have thought it is a solution for 0.4 considering this issue's severity

@richardsliu
Copy link
Contributor

I would still like to keep #335 as a temporary fix. Meanwhile we can investigate how to use unstructured informer.

0.5.0 automation moved this from Hyperparameter Tuning to Done Jan 25, 2019
@hougangliu
Copy link
Member

/reopen
Trace long-term solution

@k8s-ci-robot k8s-ci-robot reopened this Jan 25, 2019
@k8s-ci-robot
Copy link

@hougangliu: Reopened this issue.

In response to this:

/reopen
Trace long-term solution

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

0.5.0 automation moved this from Done to New Jan 25, 2019
@jlewi
Copy link
Contributor Author

jlewi commented Jan 25, 2019

Thanks @johnugeorge for the pointers and @hougangliu for the short term fix.

Regardling long term solution

@johnugeorge @hougangliu @richardsliu It it possible to make Watch work in a dynamic fashion?

It looks like StudyJob controller creates the manager here:

mgr, err := manager.New(cfg, manager.Options{})

It looks like the manager config options allows specifying the NewCacheFunc which is used to create the informer.

https://github.com/kubernetes-sigs/controller-runtime/blob/6ada5f3055493a6c2fdafe240a7ae00bbbb7048a/pkg/manager/manager.go#L125

@jlewi
Copy link
Contributor Author

jlewi commented Feb 19, 2019

@johnugeorge @richardsliu is the long term fix #341? Can we close this issue?

@johnugeorge
Copy link
Member

johnugeorge commented Mar 4, 2019

Except watch, everything else is moved to unstructured type in #341. Since dynamic watch is supported only in the newer controller runtime version(
kubernetes-sigs/kubebuilder#422) , unstructured watch can be dynamically created once the controller runtime version is upgraded.

@johnugeorge
Copy link
Member

johnugeorge commented Mar 4, 2019

Currently controllers don't crash if operators are not installed. However, a restart is needed after CRD is installed.

@richardsliu
Copy link
Contributor

@johnugeorge Should we rename this issue? The controllers aren't crashing anymore.

@johnugeorge
Copy link
Member

I will close this issue then as it is no more valid. Opened #422 to track the operator watch

@johnugeorge
Copy link
Member

/close

@k8s-ci-robot
Copy link

@johnugeorge: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

0.5.0 automation moved this from Hyperparameter Tuning to Done Mar 7, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
No open projects
0.5.0
  
Done
Development

No branches or pull requests

7 participants