-
Notifications
You must be signed in to change notification settings - Fork 54
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add a tensorflow example and update task documentation #253
add a tensorflow example and update task documentation #253
Conversation
docs/tasks/README.md
Outdated
|
||
- [mnist](examples/tensorflow/mnist.yaml) | ||
|
||
This is an example of running tensorflow. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a note about what the expected output should be / how to check if the example is working?
spec: | ||
containers: | ||
- name: tensorflow | ||
image: docker.io/kubeflowkatib/tf-mnist-with-summaries:latest |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just noticed this is using kubeflow, which imo kind of defeats the purpose of using jobset, since kubeflow is presumably creating TFJobs under the hood, a CRD which encapsulates the TF distributed training primitives like the leader, workers, parameter server, which when modeled as a JobSet would each be a different ReplicatedJob, so they can have different success/failure policies applied to them and so on.
It's also handling the setting of the TF_CONFIG env var which is something the user still needs to configure when using JobSets, although automating the configuration of these kinds of env vars is part of the future plans for JobSet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So the image is mostly just a KubeFlow hosted image that builds/deploys tensorflow. It’s not using KubeFlow.
If you’d prefer to just host a tensorflow image in Jobset (like we did for PyTorch) we could go that route. I have very little experience with tensorflow so I was grabbing workable examples from the internet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, if the user wants to use this image I don't see any reason to use JobSet here, as none of its features are being used, it's just running a single Indexed Job. Is the headless service being utilized via pod-to-pod communication via pod hostnames at least? What happens when we set enableDNSHostnames: false
to explicitly not create the headless service, does the example still work?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll have to check. Where is your source for the pytorch image btw?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't have the Dockerfile or code for it published publicly but I can share it somewhere. Perhaps I should include the training script and Dockerfile in examples/pytorch along with the JobSet yaml that is already there?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea I think that is good idea IMO. I notice that these images/code can code rot pretty quickly.
I brought the tensorflow example being out-of-date on kubeflow ie kubeflow/training-operator#1884.
Looks like they have the source defined in the repo and some pushes to the images. Maybe we don't need to go that far but it would be nice to have a way to up-date these if they rot too much.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is fine, the image has nothing to do with kubeflow itself, it is just the training script.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
one minor nit
Co-authored-by: Abdullah Gharaibeh <40361897+ahg-g@users.noreply.github.com>
@danielvegamyhre or @ahg-g any outstanding items on this PR? |
/lgtm |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: ahg-g, kannon92 The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Fixes #78
I wanted to add some documentation updates to tasks also as the above story says "to add a task" but we had no writeup in tasks. I decided to link to examples and write some words around these examples.
Added a really simple example for Tensorflow. I used https://github.com/kubeflow/training-operator/blob/master/examples/tensorflow/simple.yaml as a template.
There are some issues with this one actually.
The logs of the pods have some warnings around deprecations.