
TensorBoard Integration #13

Closed · wbuchwalter opened this issue Aug 2, 2017 · 4 comments

wbuchwalter (Contributor) commented Aug 2, 2017

How do you see TensorBoard integrating with this solution?
It would be really cool if I could create a template, ask for TensorBoard to be deployed as well, and receive either a ClusterIP or a public IP.

For example, this could look like:

apiVersion: "mlkube.io/v1beta1"
kind: "TfJob"
metadata:
  name: "tf-smoke-gpu"
spec:
  addons:
    - tensorboard:
        ip-type: LoadBalancer
  replica_specs:
    - replicas: 1
      tf_port: 2222
      tf_replica_type: MASTER
      template:
        spec:
          containers:
            - image: gcr.io/tf-on-k8s-dogfood/tf_sample_gpu:latest
              name: tensorflow
              resources:
                limits:
                  alpha.kubernetes.io/nvidia-gpu: 1
          restartPolicy: OnFailure

TensorBoard would then run as a sidecar in the master's pod.
The main issue here is accessing the log files.
An easy way would be to document a convention: for example, assume the log files are saved under /var/tensorflow/logs, and mount this directory into the TensorBoard container through the node.
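
A minimal sketch of what that sidecar pattern could look like, assuming the two containers share the log directory via an emptyDir volume (a hostPath volume on the node would work similarly; the container names and the tensorboard image are illustrative, not part of the proposal):

# Hypothetical pod spec fragment: TensorBoard as a sidecar next to the
# TensorFlow container, both mounting the same log volume.
spec:
  containers:
    - name: tensorflow
      image: gcr.io/tf-on-k8s-dogfood/tf_sample_gpu:latest
      volumeMounts:
        - name: tf-logs                    # shared log directory
          mountPath: /var/tensorflow/logs
    - name: tensorboard
      image: tensorflow/tensorflow:latest
      command: ["tensorboard", "--logdir=/var/tensorflow/logs", "--port=6006"]
      ports:
        - containerPort: 6006
      volumeMounts:
        - name: tf-logs
          mountPath: /var/tensorflow/logs
  volumes:
    - name: tf-logs
      emptyDir: {}                         # gone when the pod goes away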

This also raises the question of data persistence: as things stand, once the job shuts down, all data is lost. Do you think we need to address this right away, or could it be discussed later on?

Happy to work on this if you approve.

jlewi (Contributor) commented Aug 2, 2017

I was thinking that it would work something like the following.

The user would just specify the location of the checkpoints in the TfJob, e.g. something like this:

apiVersion: "mlkube.io/v1beta1"
kind: "TfJob"
metadata:
  name: "tf-smoke-gpu"
spec:
  tensorboard:
    log_dir: gs://output_dir/check_point_dir
  replica_specs:
    - replicas: 1
      tf_port: 2222
      tf_replica_type: MASTER
      template:
        spec:
          containers:
            - image: gcr.io/tf-on-k8s-dogfood/tf_sample_gpu:latest
              name: tensorflow
              resources:
                limits:
                  alpha.kubernetes.io/nvidia-gpu: 1
          restartPolicy: OnFailure

The TfJob operator would then create and manage a ReplicaSet running TensorBoard with the specified log_dir. The lifetime of that ReplicaSet would be tied to the lifetime of the TfJob.
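
For illustration, the generated object might look roughly like this; an ownerReference is one way to tie the two lifetimes together (the uid is a placeholder, and the names and image are illustrative, not anything specified in this proposal):

# Hypothetical ReplicaSet the operator could generate for the TfJob above.
# The ownerReference ties its lifetime to the TfJob, so deleting the job
# garbage-collects the TensorBoard replicas.
apiVersion: extensions/v1beta1   # apps/v1 on current clusters
kind: ReplicaSet
metadata:
  name: tf-smoke-gpu-tensorboard
  ownerReferences:
    - apiVersion: mlkube.io/v1beta1
      kind: TfJob
      name: tf-smoke-gpu
      uid: <tfjob-uid>             # placeholder
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: tf-smoke-gpu-tensorboard
    spec:
      containers:
        - name: tensorboard
          image: tensorflow/tensorflow:latest
          command: ["tensorboard", "--logdir=gs://output_dir/check_point_dir"]
          ports:
            - containerPort: 6006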

My first assumption is that users are already using some sort of storage not tied to the pod lifetime to preserve their models, e.g. GCS, HDFS, or NFS. So I don't think we need to run TensorBoard in a sidecar.

To support this we might need to extend the spec for TensorBoard to allow specifying volume mounts where the checkpoints are located.

My second simplifying assumption is that we don't need to expose the TensorBoard service outside of the cluster. My expectation is that users will have some method of connecting to services running on the cluster, e.g. kubectl proxy or, on GKE, IAP.
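
A sketch of the kind of in-cluster Service this implies (ClusterIP only, reachable through e.g. kubectl proxy; the names are illustrative):

# Hypothetical ClusterIP Service in front of the TensorBoard ReplicaSet.
apiVersion: v1
kind: Service
metadata:
  name: tf-smoke-gpu-tensorboard
spec:
  type: ClusterIP
  selector:
    app: tf-smoke-gpu-tensorboard
  ports:
    - port: 80
      targetPort: 6006   # TensorBoard's default port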

For more complicated scenarios, e.g. tying TensorBoard to a load balancer, it might make sense for users to set that job up separately.

Thoughts?

wbuchwalter (Contributor, Author) commented:

Regarding storage, to make sure I understand correctly:
Your example gs://output_dir/check_point_dir assumes some kind of in-cluster distributed storage where no credentials or URL are needed to access the storage.

And when dealing with something like Azure Files or GCE Persistent Disks, the template would instead look like:

apiVersion: "mlkube.io/v1beta1"
kind: "TfJob"
metadata:
  name: "tf-smoke-gpu"
spec:
  tensorboard:
    log_dir: /var/tensorflow/check_point_dir
    volumes:
      - name: azurefile
        azureFile:
          secretName: azure-secret
          shareName: data
    volumeMounts:
      - name: azurefile
        mountPath: "/var/tensorflow/"
  replica_specs:
    - replicas: 1
      tf_port: 2222
      tf_replica_type: MASTER
      template:
        spec:
          containers:
            - image: gcr.io/tf-on-k8s-dogfood/tf_sample_gpu:latest
              name: tensorflow
              resources:
                limits:
                  alpha.kubernetes.io/nvidia-gpu: 1
              volumeMounts:
                - name: azurefile
                  mountPath: "/var/tensorflow/"
          volumes:
            - name: azurefile
              azureFile:
                secretName: azure-secret
                shareName: data

Am I understanding this correctly?

> My second simplifying assumption is that we don't need to expose the TensorBoard service outside of the cluster. My expectation is that users will have some method of connecting to services running on the cluster, e.g. kubectl proxy or, on GKE, IAP.
>
> For more complicated scenarios, e.g. tying TensorBoard to a load balancer, it might make sense for users to set that job up separately.

Agreed 👍

jlewi (Contributor) commented Aug 3, 2017

Your spec looks good to me. If you want to take that one, that would be great.

On GCP you don't have to mount GCS, because GCS is GCP's object storage system: any process can read and write to GCS as long as it has the appropriate credentials. So there is no need to explicitly mount GCS as a volume; TensorFlow supports reading and writing directly to GCS.
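
As a hypothetical illustration (the secret name, key path, and image are assumptions, not anything in this repo), supplying those credentials to a TensorBoard container could look like:

# TensorBoard reading gs:// directly; GCP credentials come from a secret
# rather than from mounting the storage itself as a volume.
containers:
  - name: tensorboard
    image: tensorflow/tensorflow:latest
    command: ["tensorboard", "--logdir=gs://output_dir/check_point_dir"]
    env:
      - name: GOOGLE_APPLICATION_CREDENTIALS   # standard GCP auth env var
        value: /secrets/gcp/key.json
    volumeMounts:
      - name: gcp-key
        mountPath: /secrets/gcp
        readOnly: true
volumes:
  - name: gcp-key
    secret:
      secretName: gcp-service-account           # assumed to exist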

You could do something similar with HDFS, since TensorFlow supports reading/writing to HDFS.

Can Azure mount a volume in multiple pods simultaneously?

On GCP a PD can only be mounted with write permissions on a single VM. So if you wanted to use a PD, you would probably want to set up an NFS server backed by the PD and then use the K8s support for NFS volumes. (That's just an FYI, not really related to this issue.)
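
For reference, the K8s NFS volume support mentioned here looks roughly like this in a pod spec (server and path are placeholders):

# NFS volume, e.g. an NFS server backed by a GCP PD, mountable read/write
# from many pods at once.
volumes:
  - name: tf-logs
    nfs:
      server: nfs-server.default.svc.cluster.local   # placeholder
      path: /exports/tensorflow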

wbuchwalter (Contributor, Author) commented Aug 3, 2017

> Your spec looks good to me. If you want to take that one, that would be great.

Sure thing!

> Can Azure mount a volume in multiple pods simultaneously?

Yes, Azure Files supports mounting the same share in multiple pods simultaneously, which is quite nice for this use case.

jlewi closed this as completed in #15 on Aug 16, 2017.