Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add TensorBoard Integration #15

Merged
merged 6 commits into from
Aug 16, 2017
Merged

Add TensorBoard Integration #15

merged 6 commits into from
Aug 16, 2017

Conversation

wbuchwalter
Copy link
Contributor

@wbuchwalter wbuchwalter commented Aug 6, 2017

Fix #13 .

Notes:

  • To add some consistency, I modify the existing spec to be camelCase everywhere instead of snake_case, all standard k8s specs already being in camelCase.
  • We discussed assuming that we don't need to expose the service outside the cluster. But since it was a really small change to add I added the option. By default TensorBoard's service is using ClusterIP, but you can change the ServiceType to be what you want. For example:
tensorboard:
    logDir: /tmp/tensorflow
    serviceType: LoadBalancer

Let me know if you are okay with the 2 points above!

Todos:

  • Base logic
  • Tests
  • Rebase master
  • Update documentation

Copy link
Contributor

@jlewi jlewi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great. Thank you very much.

Do you think its worth writing a unittest for TBReplicaSet?

Do you think its worth updating the helm E2E tests to verify TensorBoard is created and deleted? I'd be ok with filing a PR for that and leaving it for the future.

glide.yaml Outdated
@@ -10,7 +10,7 @@ import:
version: 6d8c23d2e66ce61e2a7e34f92e8eb3eec13279fa
- package: github.com/coreos/etcd
version: v3.1.9
- package: github.com/Sirupsen/logrus
- package: github.com/sirupsen/logrus
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we actually use this package?
The original etcd operator used the Sirupsen package. I liked the idea of structured logging because I think we want to tag log entries so we know which worker they came from. But it seemed like that package didn't support logging filename and linenumber. So I settled on using the Google log package for now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll remove it. Glog is perfectly fine so far.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was this removed? (I'm still new to GitHub code review).

"mlkube.io/pkg/spec"
)

//TBReplicaSet represent the RS for the TensorBoard instance
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit space after //

}

func (s *TBReplicaSet) Create() error {
//By default we assume TensorBoard's service will be a ClusterIP
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit space after //

st = s.Spec.ServiceType
}

//create the service exposing TensorBoard
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

space after //

failures = true
}

err = s.ClientSet.CoreV1().Pods(NAMESPACE).DeleteCollection(&meta_v1.DeleteOptions{}, listOpts)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to delete pods? Won't deleting the deployment take care of it?

failures = true
}

err = s.ClientSet.ExtensionsV1beta1().ReplicaSets(NAMESPACE).DeleteCollection(&meta_v1.DeleteOptions{}, listOpts)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to delete the replica set? Won't deleting the deployment take care of it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not by default it seems. All my tests showed that this was not the case. Deleting the Deployment did not delete the RS, and deleting the RS did not delete the pods.
I'm guessing I need to look at the DeleteOptions and tune the PropagationPolicy for that to happen. I'll take a look.

},
Ports: []v1.ContainerPort{
{
ContainerPort: 6006,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit could we define a constant for 6006 since we use it in 2 different places.

func (s *TBReplicaSet) getDeploymentSpecTemplate() v1.PodTemplateSpec {
c := &v1.Container{
Name: s.jobName(),
Image: "tensorflow/tensorflow",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add a TODO here to make the TensorFlow image a parameter of the job operator. I think that there's a couple places where we might want to just use the default TensorFlow image. For example in parameter servers we can just start the TensorFlow server.

However, users might want to specify which version of TensorFlow they want to use e.g. 1.0/1.2 etc...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point.
What about this being a parameter of the TfJob and not the a param of the operator?
For large teams that may want to share a single kubernetes cluster, different projects might be running different versions of TF for example. Also CNTK supports TensorBoard but maybe only specific versions etc.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we want the TfJob operator to specify a default and then allow it to be overwritten on a per job basis.

For overloading per job, I'm not sure whether we should have a specific TfJob field to specify a default TF image for the job that overwrites the default provided by the operator. Or if we should just allow e.g. subcomponent (e.g. TBReplicaSet, TfReplicaSet to specify a container image).

@wbuchwalter
Copy link
Contributor Author

Do you think its worth writing a unittest for TBReplicaSet?

Do you think its worth updating the helm E2E tests to verify TensorBoard is created and deleted? I'd be ok with filing a PR for that and leaving it for the future.

Yes I think I should add some test. If you would like this to be merged early, I can do it in a separate PR, otherwise I'll do it here.

@jlewi
Copy link
Contributor

jlewi commented Aug 7, 2017

Lets do the tests in this PR if you don't mind. Thank you.

@wbuchwalter wbuchwalter changed the title [WIP] Add TensorBoard Integration Add TensorBoard Integration Aug 14, 2017
@wbuchwalter
Copy link
Contributor Author

@jlewi I implemented your suggestions, let me know what you think! Sorry for the delay, last week was pretty busy on my side.
Also I update Travis's configuration to install glide with apt as the build was failing with Sorry, we dont have a dist for your system: linux amd64. See Masterminds/glide#639.

FYI, i'll be travelling for the next two weeks with very limited internet access, so In case you require further changes feel free to update the PR if you don't want to wait.

Copy link
Contributor

@jlewi jlewi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! I just had a quick question about the test. If I don't hear from you, I'll go ahead and merge it anyway.

}

// Check that a service was created.
sList, err := clientSet.CoreV1().Services(NAMESPACE).List(meta_v1.ListOptions{})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any particular reason you use list here and not get?

Copy link
Contributor Author

@wbuchwalter wbuchwalter Aug 15, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, my bad I should change that.
I'm currently in the plane so I added a Todo as I won't have time to test. Since there should never be more than one service the test still work as expected, but I will change it later for improved clarity.

glide.yaml Outdated
@@ -10,7 +10,7 @@ import:
version: 6d8c23d2e66ce61e2a7e34f92e8eb3eec13279fa
- package: github.com/coreos/etcd
version: v3.1.9
- package: github.com/Sirupsen/logrus
- package: github.com/sirupsen/logrus
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was this removed? (I'm still new to GitHub code review).

Copy link
Contributor

@jlewi jlewi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM thank you.

@jlewi jlewi merged commit 392d54c into kubeflow:master Aug 16, 2017
Syulin7 added a commit to Syulin7/training-operator that referenced this pull request Jun 5, 2023
Signed-off-by: Syulin7 <735122171@qq.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants