add a tensorflow example and update task documentation #253

Merged Sep 13, 2023 (3 commits)

Conversation

kannon92
Contributor

Fixes #78

I also wanted to update the tasks documentation: the linked issue asks us "to add a task", but we had no write-up under tasks. I decided to link to the examples and add some explanatory text around them.

Added a really simple example for TensorFlow, using https://github.com/kubeflow/training-operator/blob/master/examples/tensorflow/simple.yaml as a template.

There are some issues with this one, though.

The pod logs show some deprecation warnings:

[ec2-user@ip-172-31-93-184 jobset-kevin]$ kubectl logs tensorflow-tensorflow-0-0-5l6bb
WARNING:tensorflow:From /var/tf_mnist/mnist_with_summaries.py:39: read_data_sets (from tensorflow.contrib.learn.python.learn.datasets.mnist) is deprecated and will be removed in a future version.
Instructions for updating:
Please use alternatives such as official/mnist/dataset.py from tensorflow/models.
WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/datasets/mnist.py:260: maybe_download (from tensorflow.contrib.learn.python.learn.datasets.base) is deprecated and will be removed in a future version.
Instructions for updating:
Please write your own downloading logic.
WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/datasets/base.py:252: wrapped_fn (from tensorflow.contrib.learn.python.learn.datasets.base) is deprecated and will be removed in a future version.
Instructions for updating:
Please use urllib or similar directly.
WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/datasets/mnist.py:262: extract_images (from tensorflow.contrib.learn.python.learn.datasets.mnist) is deprecated and will be removed in a future version.
Instructions for updating:
Please use tf.data to implement this functionality.
WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/datasets/mnist.py:267: extract_labels (from tensorflow.contrib.learn.python.learn.datasets.mnist) is deprecated and will be removed in a future version.
Instructions for updating:
Please use tf.data to implement this functionality.
WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/datasets/mnist.py:290: __init__ (from tensorflow.contrib.learn.python.learn.datasets.mnist) is deprecated and will be removed in a future version.
Instructions for updating:
Please use alternatives such as official/mnist/dataset.py from tensorflow/models.
2023-08-10 13:42:32.288488: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
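
When checking whether a run like this is healthy, it can help to separate the deprecation noise from everything else. The following is a minimal sketch, not part of the PR: the function name `split_log` is hypothetical, and it assumes each deprecation warning is followed by an "Instructions for updating:" line and one instruction line, as in the output above.

```python
import re

# Matches TensorFlow deprecation warnings shaped like the ones in the pod log.
DEPRECATION = re.compile(r"^WARNING:tensorflow:.*is deprecated")


def split_log(lines):
    """Split log lines into (deprecation warnings, everything else).

    Assumption: a warning is followed by "Instructions for updating:" and
    exactly one instruction line; those two follow-up lines are dropped.
    """
    deprecations, other = [], []
    skip = 0  # number of follow-up lines still expected after a warning
    for line in lines:
        if DEPRECATION.search(line):
            deprecations.append(line)
            skip = 2
        elif skip and (skip == 1 or line.startswith("Instructions for updating:")):
            skip -= 1
        else:
            skip = 0
            other.append(line)
    return deprecations, other
```

Feeding it the lines of `kubectl logs <pod>` leaves only the non-deprecation lines in the second list, which is where a real failure would show up.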

@k8s-ci-robot added the "cncf-cla: yes" label (Indicates the PR's author has signed the CNCF CLA.) Aug 10, 2023
@k8s-ci-robot added the "size/M" label (Denotes a PR that changes 30-99 lines, ignoring generated files.) Aug 10, 2023

- [mnist](examples/tensorflow/mnist.yaml)

This is an example of running tensorflow.

Contributor

Can you add a note about what the expected output should be / how to check if the example is working?

spec:
  containers:
  - name: tensorflow
    image: docker.io/kubeflowkatib/tf-mnist-with-summaries:latest

Contributor

Just noticed this is using Kubeflow, which IMO kind of defeats the purpose of using JobSet. Kubeflow is presumably creating TFJobs under the hood: a CRD that encapsulates the TF distributed training primitives (the leader, workers, and parameter server). Modeled as a JobSet, each of those would be a different ReplicatedJob, so they could have different success/failure policies applied to them, and so on.

It's also handling the setting of the TF_CONFIG env var which is something the user still needs to configure when using JobSets, although automating the configuration of these kinds of env vars is part of the future plans for JobSet.
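
To make the TF_CONFIG point concrete, here is a minimal sketch of how a user might build that value by hand. It is not part of the PR. The assumptions are: one ReplicatedJob per TensorFlow role, named after the role; one pod per Job (pod index 0); pod hostnames of the form `<jobset>-<replicatedJob>-<jobIndex>-<podIndex>.<jobset>` (inferred from the pod name in the logs, with the subdomain assumed to match the JobSet name); and an arbitrary port of 2222. The helper names are hypothetical.

```python
import json


def pod_hostname(jobset, replicated_job, job_index, pod_index):
    # Assumed pattern: <jobset>-<replicatedJob>-<jobIndex>-<podIndex>.<jobset>
    return f"{jobset}-{replicated_job}-{job_index}-{pod_index}.{jobset}"


def build_tf_config(jobset, roles, task_type, task_index, port=2222):
    """Return a TF_CONFIG JSON string.

    roles maps a TensorFlow task type (e.g. "chief", "worker", "ps") to the
    number of replicas of the ReplicatedJob assumed to carry the same name.
    """
    cluster = {
        role: [f"{pod_hostname(jobset, role, i, 0)}:{port}" for i in range(count)]
        for role, count in roles.items()
    }
    task = {"type": task_type, "index": task_index}
    return json.dumps({"cluster": cluster, "task": task})
```

Each pod would need its own value, since the `task` entry differs per pod, for example `build_tf_config("tensorflow", {"worker": 2, "ps": 1}, "worker", 1)` for the second worker.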

Contributor Author

So the image is mostly just a Kubeflow-hosted image that builds/deploys TensorFlow. It's not using Kubeflow.

If you'd prefer to just host a TensorFlow image in JobSet (like we did for PyTorch), we could go that route. I have very little experience with TensorFlow, so I was grabbing workable examples from the internet.

Contributor

Yeah, if the user wants to use this image I don't see any reason to use JobSet here: none of its features are being used, and it's just running a single Indexed Job. Is the headless service at least being used for pod-to-pod communication via pod hostnames? If we set enableDNSHostnames: false to explicitly skip creating the headless service, does the example still work?
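
One rough way to answer the hostname question from inside a running pod is to try resolving a peer's hostname. This is only a sketch using the Python standard library; `can_resolve` is a hypothetical helper, and the peer name in the comment is an assumed example, not something taken from the manifest.

```python
import socket


def can_resolve(hostname):
    """Return True if the hostname resolves to at least one address."""
    try:
        socket.getaddrinfo(hostname, None)
        return True
    except socket.gaierror:
        return False


# Hypothetical usage from inside a pod of the example JobSet:
#   can_resolve("tensorflow-tensorflow-0-1.tensorflow")
# If peer hostnames stop resolving once the headless service is disabled but
# training still succeeds, the example is not relying on pod hostnames.
```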

Contributor Author

I'll have to check. Where is the source for your PyTorch image, btw?

Contributor

I don't have the Dockerfile or code for it published publicly but I can share it somewhere. Perhaps I should include the training script and Dockerfile in examples/pytorch along with the JobSet yaml that is already there?

Contributor Author

Yeah, I think that's a good idea. I've noticed that these images and their code can rot pretty quickly.

I raised the out-of-date TensorFlow example with Kubeflow in kubeflow/training-operator#1884.

Looks like they have the source defined in the repo, along with some pushes to the images. Maybe we don't need to go that far, but it would be nice to have a way to update these if they rot too much.

Contributor

I think this is fine. The image has nothing to do with Kubeflow itself; it is just the training script.

Contributor

@ahg-g left a comment

one minor nit

docs/tasks/README.md (outdated, resolved)
Co-authored-by: Abdullah Gharaibeh <40361897+ahg-g@users.noreply.github.com>
@kannon92
Contributor Author

@danielvegamyhre or @ahg-g any outstanding items on this PR?

@ahg-g
Contributor

ahg-g commented Sep 13, 2023

/lgtm
/approve
/label tide/merge-method-squash

@k8s-ci-robot added the "tide/merge-method-squash" label (Denotes a PR that should be squashed by tide when it merges.) Sep 13, 2023
@k8s-ci-robot added the "lgtm" label ("Looks good to me", indicates that a PR is ready to be merged.) Sep 13, 2023
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ahg-g, kannon92

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the "approved" label (Indicates a PR has been approved by an approver from all required OWNERS files.) Sep 13, 2023
@k8s-ci-robot merged commit bfe46f3 into kubernetes-sigs:main Sep 13, 2023
@danielvegamyhre mentioned this pull request Dec 12, 2023 (20 tasks)
Labels
- approved: Indicates a PR has been approved by an approver from all required OWNERS files.
- cncf-cla: yes: Indicates the PR's author has signed the CNCF CLA.
- lgtm: "Looks good to me", indicates that a PR is ready to be merged.
- size/M: Denotes a PR that changes 30-99 lines, ignoring generated files.
- tide/merge-method-squash: Denotes a PR that should be squashed by tide when it merges.
Development

Successfully merging this pull request may close these issues.

Add a task explaining how to run TF training on JobSet
5 participants