-
Notifications
You must be signed in to change notification settings - Fork 1.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add the Viewer CRD controller for managing web views such as Tensorboard instances from within the Pipelines UI. #449
Conversation
Fixes #443 |
/retest |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great! Thanks for finding a new way to create CRDs more easily. I have a few comments.
} | ||
|
||
// New returns a new Reconciler. | ||
func New(cli client.Client, scheme *runtime.Scheme) *Reconciler { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you only creating/deleting the instances?
What about limiting the number of total instances?
What about automatically terminating an instance?
Does it fit with this scheme?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! I updated the logic to limit the number of instances as discussed. I default to 50 but it's configurable via a flag. When we hit the limit, we delete the oldest one before creating the next one. I also added a test for this behaviour.
package viewer | ||
|
||
const ( | ||
Kind string = "Viewer" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: the name does not really fit what it currently does. Any other idea? If not, let's keep this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I agree, it's not a great name. I thought of viewer since it's meant to proxy views of another webapp through ambassador that we launch and control. I can't think of another one right now, and I'm open to suggestions :-)
return reconcile.Result{}, nil | ||
} | ||
|
||
func setPodSpecForTensorboard(view *viewerV1alpha1.Viewer, s *corev1.PodSpec) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you setting the owner references for the created resources so that garbage collection is taken care of?
Do you need labels so that we can query by type of resource? Otherwise the UI may list all the resources which could be a problem once we add additional types.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the owner reference ensures Kubernetes will take care of garbage collection of the deployment+service whenever we delete a viewer instance.
Good idea on the labels. I added labels to the created deployment+service so we can easily query them for them by asking for all viewer created ones, as well as those created for a given viewer type.
} | ||
} | ||
glog.Infof("Created new deployment with spec: %+v", dpl) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a way to write simple fakes like in the case of the persistence agent:
https://github.com/kubeflow/pipelines/tree/master/backend/src/agent/persistence/client
So that we can test end to end?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, there is. This would be what we'd use:
https://godoc.org/github.com/kubernetes-sigs/controller-runtime/pkg/reconcile/reconciletest
I will add an e2e test in a follow up PR. For now, I added unit tests testing the behaviour of the reconciler using the fake kubernetes client here:
https://godoc.org/github.com/kubernetes-sigs/controller-runtime/pkg/client/fake
Added unit tests and CRD definitions. This is ready to be looked at again. /retest |
a605de8
to
9aed2f6
Compare
/retest |
/assign @IronPan |
type TensorboardSpec struct { | ||
// LogDir is the location of the log directory to be read by tensorboard, i.e., | ||
// ---log_dir. | ||
LogDir string `json:"logDir"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be an array (slice?) of strings. Tensorboard supports comparing outputs by passing a list of logdirs, we're using this feature in run comparison.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's just a comma separated list right? Which can be passed in directly here. This may be more flexible, in that I'm not reinterpreting the arguments to Tensorboard. For example, TB allows naming some of those runs as well. Rather than structure that here, just pass in the expected format to Tensorboard. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's fine I guess, although structuring it would give some value, because the structure is not very straightforward. Something like "name1:<encoded_path_1>,name2:<encoded_path_2>" is easy to mistype.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True, but if we're planning on creating the viewer through UI, then we can try to construct it for the user there instead before sending the create request?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup, we can have clients do it.
const ( | ||
// ViewerTypeTensorboard is the ViewerType constant used to indicate that the | ||
// underlying type is Tensorboard. An instance named `instance123` will serve | ||
// under /tensorboard/instance123. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should also consider mapping logdirs to instances in the route if we can, so that the user can just navigate to /tensorboard/encoded-paths-here, and they get routed to a new or existing instance by the CRD. Is this doable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I looked into this a bit, and at this point, I don't think we can. In that, the CRD isn't watching the routes, so it can't create a new one when the user navigates to it. The other thing is, I think it's worth having short names (right now, auto-generated by k8s) to give a shorter user-facing url. Mapping the logdirs to a name may result in a longer string, which I'm not in favour of.
Tangential to all this is keeping a unique a instance of TB for a given set of logdirs. I think this is a nice to have optimization, and I'm happy to look into adding this functionality in a follow up PR. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Who keeps track of that mapping though?
This isn't P0 by any means for this implementation, but it's good to keep it in mind while we're working on it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I imagine the right approach would be to label the viewer with the logdirs, which you can select for to ensure it's not already created. The reconciler here aims to be stateless as much as possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SGTM.
/lgtm |
backend/src/crd/README.md
Outdated
``` | ||
|
||
The Tensorboard instance should be accessible via Ambassador. Set up port |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i am wondering if ambassador should be mentioned at this level. It's a kubeflow component. Ideally the TB CRD should be able to run and accessible without ambassador, instead port-forwarding the tb instance pod.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought one of the reasons for having a CRD is to manage the routing for the user? We also don't want to be re-inventing routing which is why I am depending on ambassador, which seems reasonable since it's part of a kubeflow install. Port-forwarding directly to the pod is still do-able. The routing here enables us to serve all TB instances under /tensorboard/, on the same address as the pipeline UI, which is under /pipeline/.
backend/src/crd/README.md
Outdated
viewer.kubeflow.org/viewer-75tkf created | ||
|
||
$ kube getctl -n kubeflow vi |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is kube a customized command alias?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same as below
spec: | ||
type: tensorboard | ||
tensorboardSpec: | ||
logDir: "gs://ml-pipeline-playground/mnist/" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is the trailing / needed here?
http://localhost:8000/tensorboard/viewer-75tkf/. Note that the last path | ||
corresponds to the new viewer name, and the URL must end with the trailing | ||
slash. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do you know why it need a trailing slash?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems to be quirk of tensorboard serving under a specific path-prefix. I don't know why exactly.
// Package reconciler describes a Reconciler for working with Viewer CRDs. The | ||
// reconciler takes care of the main logic for ensuring every Viewer CRD | ||
// corresponds to a unique deployment and backing service. The service is | ||
// annotated such that it is compatible with Ambassador managed routing. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is it a requirement to have ambassador as dependency? Can we decouple pipeline from ambassador?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't the pipeline UI also use ambassador currently for routing tensorboard instances?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As discussed offline, I removed explicit references to Ambassador. This now works both with and without (via direct port forwarding). I added instructions in the README for how to use the latter approach as well.
s.Containers = append(s.Containers, c) | ||
} | ||
|
||
func deploymentFrom(view *viewerV1alpha1.Viewer) (*appsv1.Deployment, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is it possible to launch a TB pod with a volume mounted, instead of a gcs path?
I guess it's as simple as passing right spec to PodTemplateSpec. is that correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pretty much. I did have to make a slight change to how I set up the container so I can reuse the existing volume mount if one is specified. I also added a unit test for this, and verified it works in my own cluster. I will add testing this functionality as part of my e2e test in an upcoming PR.
Do you mind having a test/sample about using local logdir with persistent volume? |
Also add a sample YAML to show how to mount and use a GCE persistent disk in the viewer CRD.
Added tests and sample for persistent volumes, and also removed the explicit dependency on Ambassador. Routing should work through direct port forwarding now as well, and I added this to the README. PTAL. |
/retest |
1 similar comment
/retest |
/test kubeflow-pipeline-e2e-test |
/lgtm |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: IronPan The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: IronPan The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
* fix AnySequencer syntax * Update tekton-catalog/any-sequencer/README.md Co-authored-by: Tommy Li <Tommy.chaoping.li@ibm.com> Co-authored-by: Tommy Li <Tommy.chaoping.li@ibm.com>
This PR adds a basic controller for managing a new type of CRD for managing instances of web views such as Tensorboard from within the Pipelines system.
Currently, the controller only supports Tensorboard views, but additional views should be trivial to add.
This PR makes use of the new Kubernetes controller-runtime libraries, which is the set of libraries underlying kubebuilder. These libraries make creating CRD controllers a lot easier by managing internal data structures like informers and workqueues, so we can focus on writing just the business logic.
I still need to do the following before merging this PR:
For now, I'd like to get some feedback on this approach, before adding any tests.
Once this PR goes in, I will send a separate PR for bundling this controller and deploying it as part of pipelines.
This change is