Update tasks documentation #453

Merged 3 commits on Mar 14, 2024
4 changes: 2 additions & 2 deletions site/content/en/docs/faq/_index.md
@@ -1,7 +1,7 @@
---

title: "Faqs"
linkTitle: "Faqs"
title: "Troubleshooting"
linkTitle: "Troubleshooting"
weight: 10
date: 2022-02-14
description: >
35 changes: 14 additions & 21 deletions site/content/en/docs/tasks/_index.md
@@ -9,41 +9,34 @@ description: >
no_list: true
---

## PyTorch Examples
## PyTorch Example

In [pytorch](examples/pytorch), there are two examples using pytorch
- [Distributed Training of a CNN on the MNIST dataset using PyTorch and JobSet](https://github.com/kubernetes-sigs/jobset/blob/1ae6c0c039c21d29083de38ae70d13c2c8ec613f/examples/pytorch/cnn-mnist/mnist.yaml)

- [mnist](examples/pytorch/mnist.yaml)
- [resnet](examples/pytorch/resnet.yaml)

Each of these examples demonstrate how you use the JobSet API to run pytorch jobs.

Machine Learning images can be quite large so it may take some time to pull the images.
**Note**: Machine learning images can be quite large, so it may take some time to pull the images.

## Simple Examples

In [simple](examples/simple), we have some examples demonstrating features for the JobSet.
Here we have some simple examples demonstrating core JobSet features.

- [success-policy](examples/simple/driver-worker-success-policy.yaml)
- [max-restarts](examples/simple/max-restarts.yaml)
- [paralleljobs](examples/simple/paralleljobs.yaml)

[Success Policy](examples/simple/driver-worker-success-policy.yaml) demonstrates an example of utilizing `successPolicy`.
- [Success Policy](https://github.com/kubernetes-sigs/jobset/blob/release-0.4/examples/simple/driver-worker-success-policy.yaml) demonstrates an example of utilizing `successPolicy`.
Success Policy allows one to specify when to mark a JobSet as success.
This example showcases an example of using the success policy to mark the JobSet as successful if the worker replicated job completes.

[Max Restarts](examples/simple/max-restarts.yaml) demonstrates an example of utilizing `failurePolicy`.
Failure Policy allows one to control how many restarts a JobSet can do before declaring the JobSet as failed.
- [Failure Policy with Max Restarts](https://github.com/kubernetes-sigs/jobset/blob/release-0.4/examples/simple/max-restarts.yaml) demonstrates an example of utilizing `failurePolicy`. Failure Policy allows one to control how many restarts a JobSet can do before declaring the JobSet as failed.

- [Exclusive Job Placement](https://github.com/kubernetes-sigs/jobset/blob/release-0.4/examples/simple/exclusive-placement.yaml) demonstrates how you can configure a JobSet to have a 1:1 mapping between each child Job and a particular topology domain, such as a datacenter rack or zone. This means that all the pods belonging to a child job will be colocated in the same topology domain, while pods from other jobs will not be allowed to run within this domain. This gives the child job exclusive access to computer resources in this domain.

[Parallel Jobs](examples/simple/paralleljobs.yaml) demonstates how we can submit multiple replicated jobs in a jobset.
- [Parallel Jobs](https://github.com/kubernetes-sigs/jobset/blob/release-0.4/examples/simple/paralleljobs.yaml) demonstrates how we can submit multiple replicated jobs in a jobset.

## Tensorflow Examples
- [Startup Policy](https://github.com/kubernetes-sigs/jobset/blob/release-0.4/examples/startup-policy/startup-driver-ready.yaml) demonstrates how we can define a startup order for ReplicatedJobs in order to ensure a "leader"
pod is running before the "workers" are created. This is important for enabling the leader-worker paradigm in distributed ML training, where the workers will attempt to register with the leader as soon as they spawn.
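
To make the policies in the list above concrete, here is a minimal sketch of a single manifest that combines `successPolicy`, `failurePolicy`, and a startup order. It is not copied from the linked examples: the JobSet name, the ReplicatedJob names (`leader`, `workers`), the image, and the commands are all hypothetical, and the `jobset.x-k8s.io/v1alpha2` API version and field names are assumed to match the release the links point to.

```yaml
# Illustrative sketch only; all names, images, and commands are assumptions.
apiVersion: jobset.x-k8s.io/v1alpha2   # assumed API version for this release
kind: JobSet
metadata:
  name: policy-sketch                  # hypothetical name
spec:
  # Start ReplicatedJobs in the order listed below, so that "leader"
  # is ready before the "workers" jobs are created.
  startupPolicy:
    startupPolicyOrder: InOrder
  # Mark the JobSet as successful once all "workers" jobs complete.
  successPolicy:
    operator: All
    targetReplicatedJobs:
    - workers
  # Allow at most 3 restarts before the JobSet is declared failed.
  failurePolicy:
    maxRestarts: 3
  replicatedJobs:
  - name: leader
    replicas: 1
    template:
      spec:
        parallelism: 1
        completions: 1
        backoffLimit: 0
        template:
          spec:
            restartPolicy: Never
            containers:
            - name: leader
              image: busybox:1.36      # placeholder image
              command: ["sh", "-c", "sleep 120"]
  - name: workers
    replicas: 2                        # two child Jobs created from this template
    template:
      spec:
        parallelism: 1
        completions: 1
        backoffLimit: 0
        template:
          spec:
            restartPolicy: Never
            containers:
            - name: worker
              image: busybox:1.36      # placeholder image
              command: ["sh", "-c", "echo working && sleep 30"]
```

With a manifest along these lines, the JobSet should be reported as successful when both `workers` jobs finish, even if the `leader` job is still running. The linked example files remain the authoritative versions.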

In [tensorflow](examples/tensorflow), we have some examples demonstrating how to use Tensorflow with a JobSet.
## Tensorflow Example

- [mnist](examples/tensorflow/mnist.yaml)
- [Distributed Training of a Handwritten Digit Classifier on the MNIST dataset using Tensorflow and JobSet](https://github.com/kubernetes-sigs/jobset/blob/release-0.4/examples/tensorflow/mnist.yaml)

[mnist](examples/tensorflow/mnist.yaml) runs an example job for a single epoch.
This example runs an example job for a single epoch.
You can view the progress of your jobs via `kubectl logs jobs/tensorflow-tensorflow-0`.

```