-
Notifications
You must be signed in to change notification settings - Fork 430
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
StudyJob controller shouldn't crash if PyTorch (or other job operators not installed) #317
Comments
Operator deployments are not needed if corresponding workers are not needed in Katib. However, there is a requirement of having operator crds to be installed. https://github.com/kubeflow/katib/blob/master/scripts/deploy.sh#L36 /cc @richardsliu |
@johnugeorge Can you elaborate? Why does linking in the spec for TFJob (https://github.com/kubeflow/katib/blob/master/pkg/api/operators/apis/addtoscheme_tfjob_v1beta1.go#L21) create a runtime dependency on the CRD being installed in the cluster? I would have thought that just creates a compile time dependency? I guess my bigger question is; what's the long term plan here for Katib in terms of how its going to fire of K8s resources to do the actual training? Is this covered in either of the design docs for Katib? |
@jlewi Runtime error should be where we do a watch on the jobs during init can you elaborate more on "to fire off K8s resources to do the actual training" ? |
Why is there a requirement to have operator CRDs to be installed if they aren't being used? This seems like a limitation of the current implementation. The StudyJob CRD takes a template for the K8s resource (PyTorchJob, TFJob, ChainerJob, Job, ....) that will be used to train the model. Why does Katib need to have explict support for any of these resources? Do you agree that Katib shouldn't crash if the CR for one of these resources isn't defined in the cluster? Kubernetes provides a REST API that can be used to create any K8s API. This REST API is generic; i.e. its fully possible to create a client that can create/delete any K8s resource described by the K8s YAML manifest. The client doesn't need to have explicit support for a given resource to be linked into it. |
I think the design of informer does not allow us to register the resource when it is used. The controller implementation here https://github.com/kubernetes-sigs/controller-runtime/blob/master/pkg/internal/controller/controller.go#L126 will watch all resources which are registered previously. AFAIK, we cannot register CRDs dynamically. It is not constrained by the k8s API, it is limited by the implementation of the controller, IMO. |
@jlewi I agree with you that Katib shouldn't crash if the user hasn't installed these CRs in the cluster . Ideally, users have to install CRs only when they need to. However as @gaocegege told, it is difficult to implement dynamic registration of resource watch in the controller. Eg: During the init, we can skip the resource watch(TFjob/PyTorchJob) if CRDs are not installed. This will avoid the reported crash. However if user installs the crd/operator in the future for adding TFJob support, we need to dynamically add the watch on that resource. I couldn't find a nice way to handle this in the controller. I will investigate more. Btw if the user installs Katib via Kfctl, crds are installed by default. |
The GoClient libraries provide a REST interface That allow us to perform the basic rest operations without linking in any CRD specific client side generated client generated libraries. Here's some sample code from the early days of TFOperator. So given a YAML file for an arbitrary K8s resource or custom resource. We should be able to write generic logic to Create/Delete that object. We can do this without limiting ourselves to a fixed list of K8s objects apriori. So I think the only question is how do we extend that so that the studyjob controller can be efficiently notified about events for the resources it is waiting on. Does the informer library not have a similar unstructured version that would allow us to dynamically instantiate it at runtime for some resource? Isn't the TFJob operator using an unstructured informer |
FYI @hougangliu @jinchihe |
Few solutions that I can think of
|
@jlewi @richardsliu WDYT? |
Maybe a simple solution is: when TFJob/PyTorch CRDs does not installed when studyjob controller starts, ignore it but a warning log. And when we creates a studyjob with TFJob/PyTorch job, if it not watched, mark the studyjobb as invalid. |
/assign |
@hougangliu I am not sure if it is the right solution. This will force the user to reinstall katib if it needs to create study job with TFJob/PyTorch Job in the future(which should be the usual case) I feel that we need to support dynamic watch on them at runtime so that behavior remains the same |
@johnugeorge regarding the options listed in What's the difference between options 1 & 2? Does 1 require the supported job types to be explicitly enumerated whereas 2 doesn't? Which of these options if any allow us to immediately support new operators e.g. (chainer, mpi, etc...) without having to change Katib code? |
1 refers to watch on TFjob/PyTorchJob resource Both 1 and 2 will require job types to be explicitly enumerated in the Katib code. However, it should be minor addition when integrating the operator in Katib. |
studyJob controller is implemented by controller-runtime now, and controller-runtime cannot support dynamic watch. replacing controller-runtime maybe take some time. For long-term solution, maybe we should replace controller-runtime. @here any comment? |
If we can't support dynamic watch, can we make the list of resources a command line argument of Katib so that if a user wants to use additional resources with Katib we just have to update a command line argument and reapply? |
Sound good! However for long-term solution, I think dynamic watch is better for a good UE |
@hougangliu Can you elaborate on the dynamic watch? Can you provide a reference to the code where the StudyJob controller sets up a watch on TFJob and PyTorch resources? It looks like the current implementation is just polling for job status
Can we update that code to use the REST API so we can easily support any K8s object? I think polling the APIServer might eventually create too much load on the APIServer. Here is the TFJob controller code: We are creating an unstructured informer based on the REST information (e.g. resource name and kind). So could the StudyJob controller instantiate an unstructured informer for each type of resource the first time it sees a resource of a given type? /cc @richardsliu |
resources are watched in https://github.com/kubeflow/katib/blob/master/pkg/controller/studyjob/studyjob_controller.go#L113
can you explain more on "for each type of resource the first time it sees a resource of a given type?"
controller-runtime |
I will consider using unstructured informer, but I wonder if unstructured informer can reduce load on the APIServer. |
Does PR #335 need any more? I have thought it is a solution for 0.4 considering this issue's severity |
I would still like to keep #335 as a temporary fix. Meanwhile we can investigate how to use unstructured informer. |
/reopen |
@hougangliu: Reopened this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
Thanks @johnugeorge for the pointers and @hougangliu for the short term fix. Regardling long term solution @johnugeorge @hougangliu @richardsliu It it possible to make Watch work in a dynamic fashion? It looks like StudyJob controller creates the manager here: katib/cmd/studyjobcontroller/main.go Line 45 in f4026e4
It looks like the manager config options allows specifying the |
@johnugeorge @richardsliu is the long term fix #341? Can we close this issue? |
Except watch, everything else is moved to unstructured type in #341. Since dynamic watch is supported only in the newer controller runtime version( |
Currently controllers don't crash if operators are not installed. However, a restart is needed after CRD is installed. |
@johnugeorge Should we rename this issue? The controllers aren't crashing anymore. |
I will close this issue then as it is no more valid. Opened #422 to track the operator watch |
/close |
@johnugeorge: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
See: kubeflow/kubeflow#2212
User reports that the StudyJob controller crashes if pytorch operator isn't installed on the cluster.
Logs for StudyJobController
This seems like a bug. If a particular job controller isn't installed I would still expect katib and StudyJobs to work with other types of job controllers.
The text was updated successfully, but these errors were encountered: