-
Notifications
You must be signed in to change notification settings - Fork 687
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add TensorBoard Integration #15
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is great. Thank you very much.
Do you think its worth writing a unittest for TBReplicaSet?
Do you think its worth updating the helm E2E tests to verify TensorBoard is created and deleted? I'd be ok with filing a PR for that and leaving it for the future.
glide.yaml
Outdated
@@ -10,7 +10,7 @@ import: | |||
version: 6d8c23d2e66ce61e2a7e34f92e8eb3eec13279fa | |||
- package: github.com/coreos/etcd | |||
version: v3.1.9 | |||
- package: github.com/Sirupsen/logrus | |||
- package: github.com/sirupsen/logrus |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we actually use this package?
The original etcd operator used the Sirupsen package. I liked the idea of structured logging because I think we want to tag log entries so we know which worker they came from. But it seemed like that package didn't support logging filename and linenumber. So I settled on using the Google log package for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll remove it. Glog
is perfectly fine so far.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was this removed? (I'm still new to GitHub code review).
pkg/trainer/tensorboard.go
Outdated
"mlkube.io/pkg/spec" | ||
) | ||
|
||
//TBReplicaSet represent the RS for the TensorBoard instance |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit space after //
pkg/trainer/tensorboard.go
Outdated
} | ||
|
||
func (s *TBReplicaSet) Create() error { | ||
//By default we assume TensorBoard's service will be a ClusterIP |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit space after //
pkg/trainer/tensorboard.go
Outdated
st = s.Spec.ServiceType | ||
} | ||
|
||
//create the service exposing TensorBoard |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
space after //
pkg/trainer/tensorboard.go
Outdated
failures = true | ||
} | ||
|
||
err = s.ClientSet.CoreV1().Pods(NAMESPACE).DeleteCollection(&meta_v1.DeleteOptions{}, listOpts) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to delete pods? Won't deleting the deployment take care of it?
pkg/trainer/tensorboard.go
Outdated
failures = true | ||
} | ||
|
||
err = s.ClientSet.ExtensionsV1beta1().ReplicaSets(NAMESPACE).DeleteCollection(&meta_v1.DeleteOptions{}, listOpts) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to delete the replica set? Won't deleting the deployment take care of it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not by default it seems. All my tests showed that this was not the case. Deleting the Deployment did not delete the RS, and deleting the RS did not delete the pods.
I'm guessing I need to look at the DeleteOptions
and tune the PropagationPolicy
for that to happen. I'll take a look.
pkg/trainer/tensorboard.go
Outdated
}, | ||
Ports: []v1.ContainerPort{ | ||
{ | ||
ContainerPort: 6006, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit could we define a constant for 6006 since we use it in 2 different places.
pkg/trainer/tensorboard.go
Outdated
func (s *TBReplicaSet) getDeploymentSpecTemplate() v1.PodTemplateSpec { | ||
c := &v1.Container{ | ||
Name: s.jobName(), | ||
Image: "tensorflow/tensorflow", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe add a TODO here to make the TensorFlow image a parameter of the job operator. I think that there's a couple places where we might want to just use the default TensorFlow image. For example in parameter servers we can just start the TensorFlow server.
However, users might want to specify which version of TensorFlow they want to use e.g. 1.0/1.2 etc...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point.
What about this being a parameter of the TfJob
and not the a param of the operator?
For large teams that may want to share a single kubernetes cluster, different projects might be running different versions of TF for example. Also CNTK supports TensorBoard but maybe only specific versions etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we want the TfJob operator to specify a default and then allow it to be overwritten on a per job basis.
For overloading per job, I'm not sure whether we should have a specific TfJob field to specify a default TF image for the job that overwrites the default provided by the operator. Or if we should just allow e.g. subcomponent (e.g. TBReplicaSet, TfReplicaSet to specify a container image).
Yes I think I should add some test. If you would like this to be merged early, I can do it in a separate PR, otherwise I'll do it here. |
Lets do the tests in this PR if you don't mind. Thank you. |
@jlewi I implemented your suggestions, let me know what you think! Sorry for the delay, last week was pretty busy on my side. FYI, i'll be travelling for the next two weeks with very limited internet access, so In case you require further changes feel free to update the PR if you don't want to wait. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you! I just had a quick question about the test. If I don't hear from you, I'll go ahead and merge it anyway.
pkg/trainer/tensorboard_test.go
Outdated
} | ||
|
||
// Check that a service was created. | ||
sList, err := clientSet.CoreV1().Services(NAMESPACE).List(meta_v1.ListOptions{}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any particular reason you use list here and not get?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, my bad I should change that.
I'm currently in the plane so I added a Todo as I won't have time to test. Since there should never be more than one service the test still work as expected, but I will change it later for improved clarity.
glide.yaml
Outdated
@@ -10,7 +10,7 @@ import: | |||
version: 6d8c23d2e66ce61e2a7e34f92e8eb3eec13279fa | |||
- package: github.com/coreos/etcd | |||
version: v3.1.9 | |||
- package: github.com/Sirupsen/logrus | |||
- package: github.com/sirupsen/logrus |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was this removed? (I'm still new to GitHub code review).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM thank you.
Signed-off-by: Syulin7 <735122171@qq.com>
Fix #13 .
Notes:
spec
to becamelCase
everywhere instead ofsnake_case
, all standard k8s specs already being incamelCase
.ClusterIP
, but you can change theServiceType
to be what you want. For example:Let me know if you are okay with the 2 points above!
Todos: