add a tensorflow example and update task documentation #253

kannon92 · 2023-08-10T13:44:53Z

Fixes #78

I also wanted to update the tasks documentation: the issue above says "to add a task", but we had no writeup under tasks. I decided to link to the examples and write a few words around them.

Added a really simple example for Tensorflow. I used https://github.com/kubeflow/training-operator/blob/master/examples/tensorflow/simple.yaml as a template.

There are some issues with this one actually.

The logs of the pods have some warnings around deprecations.

[ec2-user@ip-172-31-93-184 jobset-kevin]$ kubectl logs tensorflow-tensorflow-0-0-5l6bb
WARNING:tensorflow:From /var/tf_mnist/mnist_with_summaries.py:39: read_data_sets (from tensorflow.contrib.learn.python.learn.datasets.mnist) is deprecated and will be removed in a future version.
Instructions for updating:
Please use alternatives such as official/mnist/dataset.py from tensorflow/models.
WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/datasets/mnist.py:260: maybe_download (from tensorflow.contrib.learn.python.learn.datasets.base) is deprecated and will be removed in a future version.
Instructions for updating:
Please write your own downloading logic.
WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/datasets/base.py:252: wrapped_fn (from tensorflow.contrib.learn.python.learn.datasets.base) is deprecated and will be removed in a future version.
Instructions for updating:
Please use urllib or similar directly.
WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/datasets/mnist.py:262: extract_images (from tensorflow.contrib.learn.python.learn.datasets.mnist) is deprecated and will be removed in a future version.
Instructions for updating:
Please use tf.data to implement this functionality.
WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/datasets/mnist.py:267: extract_labels (from tensorflow.contrib.learn.python.learn.datasets.mnist) is deprecated and will be removed in a future version.
Instructions for updating:
Please use tf.data to implement this functionality.
WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/datasets/mnist.py:290: __init__ (from tensorflow.contrib.learn.python.learn.datasets.mnist) is deprecated and will be removed in a future version.
Instructions for updating:
Please use alternatives such as official/mnist/dataset.py from tensorflow/models.
2023-08-10 13:42:32.288488: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA

danielvegamyhre · 2023-08-10T18:12:39Z

docs/tasks/README.md

+
+- [mnist](examples/tensorflow/mnist.yaml)
+
+This is an example of running tensorflow.


Can you add a note about what the expected output should be / how to check if the example is working?
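
One possible way to check the example, as a rough sketch: a JobSet reports its terminal state through `status.conditions`, so a script could parse the output of `kubectl get jobset <name> -o json` and look for a condition of type `Completed` or `Failed` with status `True`. The helper name `jobset_state` and the sample payload are illustrative assumptions, not part of this PR, and the condition names should be confirmed against the installed JobSet version.

```python
import json


def jobset_state(jobset):
    """Return 'Completed', 'Failed', or 'Running' for a parsed JobSet object."""
    conditions = jobset.get("status", {}).get("conditions", [])
    for cond in conditions:
        # Only a condition whose status is "True" is currently in effect.
        if cond.get("status") == "True" and cond.get("type") in ("Completed", "Failed"):
            return cond["type"]
    # No terminal condition yet: treat the JobSet as still running.
    return "Running"


# Hypothetical payload, shaped like `kubectl get jobset <name> -o json` output.
sample = json.loads(
    '{"status": {"conditions": [{"type": "Completed", "status": "True"}]}}'
)
print(jobset_state(sample))
```

In practice the JSON would come from `kubectl`, and the pod logs (`kubectl logs <pod>`) would still be the place to confirm that training itself ran.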

danielvegamyhre · 2023-08-14T21:13:53Z

examples/tensorflow/mnist.yaml

+          spec:
+            containers:
+            - name: tensorflow
+              image: docker.io/kubeflowkatib/tf-mnist-with-summaries:latest


Just noticed this is using Kubeflow, which IMO kind of defeats the purpose of using JobSet. Kubeflow is presumably creating TFJobs under the hood: a CRD that encapsulates the TF distributed training primitives (the leader, workers, and parameter server). Modeled as a JobSet, each of these would be a different ReplicatedJob, so they could have different success/failure policies applied to them, and so on.

It's also handling the setting of the TF_CONFIG env var, which is something the user still needs to configure when using JobSets, although automating the configuration of these kinds of env vars is part of the future plans for JobSet.
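
To make the TF_CONFIG point concrete, here is a minimal sketch of what a user would have to generate by hand today. It assumes one ReplicatedJob per TensorFlow role (`chief`, TensorFlow's name for the leader, plus `worker` and `ps`), each running a single Indexed Job, and it assumes pod hostnames of the form `<jobset>-<replicatedJob>-<jobIndex>-<podIndex>.<subdomain>` with the subdomain defaulting to the JobSet name. The role names, the port 2222, and the helper names are assumptions for illustration only.

```python
import json

TF_PORT = 2222  # assumed port; use whatever the training script listens on


def role_hosts(jobset, role, pods, port=TF_PORT):
    """Addresses for one ReplicatedJob running a single Indexed Job with `pods` pods."""
    # Assumed pattern: <jobset>-<replicatedJob>-<jobIndex>-<podIndex>.<subdomain>,
    # with job index 0 and the subdomain equal to the JobSet name.
    return [f"{jobset}-{role}-0-{i}.{jobset}:{port}" for i in range(pods)]


def build_tf_config(jobset, roles, task_type, task_index):
    """Build the TF_CONFIG JSON string for one task in the cluster."""
    cluster = {role: role_hosts(jobset, role, pods) for role, pods in roles.items()}
    return json.dumps(
        {"cluster": cluster, "task": {"type": task_type, "index": task_index}}
    )


# Hypothetical topology: 1 chief, 2 workers, 1 parameter server.
roles = {"chief": 1, "worker": 2, "ps": 1}
print(build_tf_config("tensorflow", roles, "worker", 1))
```

Each pod would need its own `task` entry, which is the per-pod configuration that JobSet does not yet automate.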

So the image is mostly just a Kubeflow-hosted image that builds/deploys TensorFlow. It's not using Kubeflow.

If you'd prefer to just host a TensorFlow image in JobSet (like we did for PyTorch) we could go that route. I have very little experience with TensorFlow, so I was grabbing workable examples from the internet.

Yeah, if the user wants to use this image I don't see any reason to use JobSet here: none of its features are being used, it's just running a single Indexed Job. Is the headless service at least being used for pod-to-pod communication via pod hostnames? What happens when we set enableDNSHostnames: false to explicitly not create the headless service? Does the example still work?
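
One way to probe the headless-service question, sketched under assumptions: run a small resolver check from inside one of the pods and see which peer hostnames resolve with and without `enableDNSHostnames`. The hostname pattern and the peer list below are hypothetical (the pod count of this example is not stated in the thread), and `unresolved_peers` is an illustrative helper, not something shipped with JobSet.

```python
import socket


def unresolved_peers(hostnames, resolve=socket.gethostbyname):
    """Return the hostnames that fail DNS resolution."""
    missing = []
    for host in hostnames:
        try:
            resolve(host)
        except OSError:
            # socket.gaierror (name not found) is a subclass of OSError.
            missing.append(host)
    return missing


# Hypothetical peers, assuming <jobset>-<replicatedJob>-<jobIndex>-<podIndex>.<subdomain>
peers = [f"tensorflow-tensorflow-0-{i}.tensorflow" for i in range(2)]
```

Calling `unresolved_peers(peers)` inside a pod should return an empty list if the headless service exists and the hostnames are correct. If the training script never contacts its peers, the example would keep working even when every name fails to resolve.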

I'll have to check. Where is your source for the pytorch image btw?

I don't have the Dockerfile or code for it published publicly but I can share it somewhere. Perhaps I should include the training script and Dockerfile in examples/pytorch along with the JobSet yaml that is already there?

Yeah, I think that is a good idea. I've noticed that these images and code can rot pretty quickly.

I raised the out-of-date TensorFlow example with Kubeflow in kubeflow/training-operator#1884.

Looks like they have the source defined in the repo and some pushes to the images. Maybe we don't need to go that far, but it would be nice to have a way to update these if they rot too much.

I think this is fine, the image has nothing to do with kubeflow itself, it is just the training script.

ahg-g

one minor nit

docs/tasks/README.md

Co-authored-by: Abdullah Gharaibeh <40361897+ahg-g@users.noreply.github.com>

kannon92 · 2023-09-13T20:14:57Z

@danielvegamyhre or @ahg-g any outstanding items on this PR?

ahg-g · 2023-09-13T20:35:33Z

/lgtm
/approve
/label tide/merge-method-squash

k8s-ci-robot · 2023-09-13T20:35:39Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ahg-g, kannon92

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [ahg-g]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

add a tensorflow example and update task documentation

c75d34d

k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Aug 10, 2023

k8s-ci-robot requested review from ahg-g and danielvegamyhre August 10, 2023 13:44

k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Aug 10, 2023

kannon92 mentioned this pull request Aug 10, 2023

Tensorflow example image has a lot of warnings and python 2.7. kubeflow/training-operator#1884

Closed

danielvegamyhre reviewed Aug 10, 2023

tenzen-y reviewed Aug 11, 2023

examples/tensorflow/mnist.yaml (outdated, resolved)

add more up to date tensorflow example

ccda609

danielvegamyhre reviewed Aug 14, 2023

danielvegamyhre mentioned this pull request Aug 15, 2023

Add source code and Dockerfiles for container images used in JobSet examples #257

Closed

ahg-g reviewed Sep 12, 2023

docs/tasks/README.md (outdated, resolved)

Update docs/tasks/README.md

0cf365d

Co-authored-by: Abdullah Gharaibeh <40361897+ahg-g@users.noreply.github.com>

kannon92 mentioned this pull request Sep 12, 2023

Document fungible jobs with JobSet #294

Closed

kannon92 requested review from ahg-g, danielvegamyhre and tenzen-y September 13, 2023 13:58

k8s-ci-robot added the tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges. label Sep 13, 2023

k8s-ci-robot assigned ahg-g Sep 13, 2023

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 13, 2023

k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 13, 2023

k8s-ci-robot merged commit bfe46f3 into kubernetes-sigs:main Sep 13, 2023

danielvegamyhre mentioned this pull request Dec 12, 2023

Release v0.3.0 #347

Closed

20 tasks
