update tasks
danielvegamyhre committed Mar 14, 2024
1 parent 8c673be commit a0c4fee
Showing 1 changed file with 10 additions and 17 deletions.
27 changes: 10 additions & 17 deletions site/content/en/docs/tasks/_index.md
@@ -11,38 +11,31 @@ no_list: true
 
 ## PyTorch Example
 
-In [pytorch](examples/pytorch), there are two examples using pytorch
-
 - [Distributed Training of a CNN on the MNIST dataset using PyTorch and JobSet](https://github.com/kubernetes-sigs/jobset/blob/1ae6c0c039c21d29083de38ae70d13c2c8ec613f/examples/pytorch/cnn-mnist/mnist.yaml)
 
 **Note**: machine Learning images can be quite large so it may take some time to pull the images.
 
 ## Simple Examples
 
-In the [simple](https://github.com/kubernetes-sigs/jobset/blob/1ae6c0c039c21d29083de38ae70d13c2c8ec613f/examples/simple) examples directory, we have some examples demonstrating core JobSet features.
-
-- [success-policy](https://github.com/kubernetes-sigs/jobset/blob/1ae6c0c039c21d29083de38ae70d13c2c8ec613f/examples/simple/driver-worker-success-policy.yaml)
-- [max-restarts](https://github.com/kubernetes-sigs/jobset/blob/1ae6c0c039c21d29083de38ae70d13c2c8ec613f/examples/simple/max-restarts.yaml)
-- [paralleljobs](https://github.com/kubernetes-sigs/jobset/blob/1ae6c0c039c21d29083de38ae70d13c2c8ec613f/examples/simple/paralleljobs.yaml)
+Here we have some simple examples demonstrating core JobSet features.
 
-[Success Policy](https://github.com/kubernetes-sigs/jobset/blob/1ae6c0c039c21d29083de38ae70d13c2c8ec613f/examples/simple/driver-worker-success-policy.yaml) demonstrates an example of utilizing `successPolicy`.
+- [Success Policy](https://github.com/kubernetes-sigs/jobset/blob/release-0.4/examples/simple/driver-worker-success-policy.yaml) demonstrates an example of utilizing `successPolicy`.
 Success Policy allows one to specify when to mark a JobSet as success.
 This example showcases an example of using the success policy to mark the JobSet as successful if the worker replicated job completes.
 
-[Max Restarts](https://github.com/kubernetes-sigs/jobset/blob/1ae6c0c039c21d29083de38ae70d13c2c8ec613f/examples/simple/max-restarts.yaml) demonstrates an example of utilizing `failurePolicy`.
-Failure Policy allows one to control how many restarts a JobSet can do before declaring the JobSet as failed.
+- [Failure Policy with Max Restarts](https://github.com/kubernetes-sigs/jobset/blob/release-0.4/examples/simple/max-restarts.yaml) demonstrates an example of utilizing `failurePolicy`. Failure Policy allows one to control how many restarts a JobSet can do before declaring the JobSet as failed.
 
-[Parallel Jobs](https://github.com/kubernetes-sigs/jobset/blob/1ae6c0c039c21d29083de38ae70d13c2c8ec613f/examples/simple/paralleljobs.yaml) demonstates how we can submit multiple replicated jobs in a jobset.
+- [Exclusive Job Placement](https://github.com/kubernetes-sigs/jobset/blob/release-0.4/examples/simple/exclusive-placement.yaml) demonstrates how you can configure a JobSet to run 1 child Job per topology domain
+(e.g., data center rack, zone, etc.). This is useful for use cases such as TPU MultiSlice training. Imagine we want to use a distributed data parallel (DDP) training strategy to train a model using multiple TPU Slices, running 1 model replica in each accelerator island (TPU slice), ensuring the forward and backward passes themselves occur within a single model replica occurs over the high bandwidth ICI mesh linking TPU chips within a TPU slice, and only the gradient synchronization between model replicas occurs across accelerator islands over the lower bandwidth DCN (data center network).
 
-[Startup Policy](https://github.com/kubernetes-sigs/jobset/blob/1ae6c0c039c21d29083de38ae70d13c2c8ec613f/examples/startup-policy/startup-driver-ready.yaml) demonstrates how we can define a startup order for ReplicatedJobs in order to ensure a "leader"
-pod is running before the "workers" are created. This is important for enabling the leader-worker paradigm in distributed ML
-training, where the workers will attempt to register with the leader as soon as they spawn.
+- [Parallel Jobs](https://github.com/kubernetes-sigs/jobset/blob/release-0.4/examples/simple/paralleljobs.yaml) demonstrates how we can submit multiple replicated jobs in a jobset.
 
-## Tensorflow Example
+- [Startup Policy](https://github.com/kubernetes-sigs/jobset/blob/release-0.4/examples/startup-policy/startup-driver-ready.yaml) demonstrates how we can define a startup order for ReplicatedJobs in order to ensure a "leader"
+pod is running before the "workers" are created. This is important for enabling the leader-worker paradigm in distributed ML training, where the workers will attempt to register with the leader as soon as they spawn.
 
-In [tensorflow](https://github.com/kubernetes-sigs/jobset/blob/1ae6c0c039c21d29083de38ae70d13c2c8ec613f/examples/tensorflow), we have an example demonstrating how to use Tensorflow with a JobSet.
+## Tensorflow Example
 
-- [Distributed Training of a Handwritten Digit Classifier on the MNIST dataset using Tensorflow and JobSet](https://github.com/kubernetes-sigs/jobset/blob/1ae6c0c039c21d29083de38ae70d13c2c8ec613f/examples/tensorflow/mnist.yaml)
+- [Distributed Training of a Handwritten Digit Classifier on the MNIST dataset using Tensorflow and JobSet](https://github.com/kubernetes-sigs/jobset/blob/release-0.4/examples/tensorflow/mnist.yaml)
 
 This example runs an example job for a single epoch.
 You can view the progress of your jobs via `kubectl logs jobs/tensorflow-tensorflow-0`.
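
For orientation, the `successPolicy`, `failurePolicy`, and startup-order fields described in the changed text might be combined in a single manifest roughly as sketched below. This is not part of the commit and is not copied from the linked examples: the JobSet name `policy-sketch`, the replicated job names `driver` and `workers`, the image, the commands, and the replica counts are all assumptions chosen for illustration. The linked files under `examples/` remain the authoritative versions.

```yaml
# Illustrative sketch only: names, image, commands, and counts are assumptions.
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: policy-sketch
spec:
  # Mark the JobSet successful once all Jobs of the "workers" replicated job complete.
  successPolicy:
    operator: All
    targetReplicatedJobs:
    - workers
  # Allow up to 3 restarts of the JobSet before it is declared failed.
  failurePolicy:
    maxRestarts: 3
  # Start replicated jobs in list order, so "driver" is ready before "workers" are created.
  startupPolicy:
    startupPolicyOrder: InOrder
  replicatedJobs:
  - name: driver
    replicas: 1
    template:
      spec:
        parallelism: 1
        completions: 1
        backoffLimit: 0
        template:
          spec:
            restartPolicy: Never
            containers:
            - name: driver
              image: bash:latest
              command: ["bash", "-c", "sleep 100"]
  - name: workers
    replicas: 2
    template:
      spec:
        parallelism: 1
        completions: 1
        backoffLimit: 0
        template:
          spec:
            restartPolicy: Never
            containers:
            - name: worker
              image: bash:latest
              command: ["bash", "-c", "sleep 10"]
```

In this sketch the workers exit after a short sleep while the driver is still running, so the success policy, rather than completion of every Job, is what would mark the JobSet as successful.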