update tasks

kubernetes-sigs · Mar 14, 2024 · a0c4fee
1 parent 8c673be
Showing 1 changed file with 10 additions and 17 deletions.
diff --git a/site/content/en/docs/tasks/_index.md b/site/content/en/docs/tasks/_index.md
@@ -11,38 +11,31 @@ no_list: true
 
 ## PyTorch Example
 
-In [pytorch](examples/pytorch), there are two examples using pytorch
-
 - [Distributed Training of a CNN on the MNIST dataset using PyTorch and JobSet](https://github.com/kubernetes-sigs/jobset/blob/1ae6c0c039c21d29083de38ae70d13c2c8ec613f/examples/pytorch/cnn-mnist/mnist.yaml)
 
 **Note**: Machine learning images can be quite large, so it may take some time to pull them.
 
 ## Simple Examples
 
-In the [simple](https://github.com/kubernetes-sigs/jobset/blob/1ae6c0c039c21d29083de38ae70d13c2c8ec613f/examples/simple) examples directory, we have some examples demonstrating core JobSet features.
-
-- [success-policy](https://github.com/kubernetes-sigs/jobset/blob/1ae6c0c039c21d29083de38ae70d13c2c8ec613f/examples/simple/driver-worker-success-policy.yaml)
-- [max-restarts](https://github.com/kubernetes-sigs/jobset/blob/1ae6c0c039c21d29083de38ae70d13c2c8ec613f/examples/simple/max-restarts.yaml)
-- [paralleljobs](https://github.com/kubernetes-sigs/jobset/blob/1ae6c0c039c21d29083de38ae70d13c2c8ec613f/examples/simple/paralleljobs.yaml)
+Here we have some simple examples demonstrating core JobSet features.
 
-[Success Policy](https://github.com/kubernetes-sigs/jobset/blob/1ae6c0c039c21d29083de38ae70d13c2c8ec613f/examples/simple/driver-worker-success-policy.yaml) demonstrates an example of utilizing `successPolicy`.
+- [Success Policy](https://github.com/kubernetes-sigs/jobset/blob/release-0.4/examples/simple/driver-worker-success-policy.yaml) demonstrates the use of `successPolicy`.
 Success Policy lets you specify when a JobSet is marked as successful.  
 This example uses the success policy to mark the JobSet as successful once the worker replicated job completes.
 
-[Max Restarts](https://github.com/kubernetes-sigs/jobset/blob/1ae6c0c039c21d29083de38ae70d13c2c8ec613f/examples/simple/max-restarts.yaml) demonstrates an example of utilizing `failurePolicy`.
-Failure Policy allows one to control how many restarts a JobSet can do before declaring the JobSet as failed.
+- [Failure Policy with Max Restarts](https://github.com/kubernetes-sigs/jobset/blob/release-0.4/examples/simple/max-restarts.yaml) demonstrates the use of `failurePolicy`. Failure Policy lets you control how many times a JobSet can restart before it is declared failed.
 
-[Parallel Jobs](https://github.com/kubernetes-sigs/jobset/blob/1ae6c0c039c21d29083de38ae70d13c2c8ec613f/examples/simple/paralleljobs.yaml) demonstates how we can submit multiple replicated jobs in a jobset.
+- [Exclusive Job Placement](https://github.com/kubernetes-sigs/jobset/blob/release-0.4/examples/simple/exclusive-placement.yaml) demonstrates how you can configure a JobSet to run 1 child Job per topology domain
+(e.g., data center rack or zone). This is useful for use cases such as TPU MultiSlice training. Imagine we want to use a distributed data parallel (DDP) training strategy to train a model using multiple TPU slices, running 1 model replica in each accelerator island (TPU slice). This ensures that the forward and backward passes within a single model replica occur over the high-bandwidth ICI mesh linking TPU chips within a TPU slice, and that only the gradient synchronization between model replicas crosses accelerator islands over the lower-bandwidth DCN (data center network).
 
-[Startup Policy](https://github.com/kubernetes-sigs/jobset/blob/1ae6c0c039c21d29083de38ae70d13c2c8ec613f/examples/startup-policy/startup-driver-ready.yaml) demonstrates how we can define a startup order for ReplicatedJobs in order to ensure a "leader"
-pod is running before the "workers" are created. This is important for enabling the leader-worker paradigm in distributed ML
-training, where the workers will attempt to register with the leader as soon as they spawn.
+- [Parallel Jobs](https://github.com/kubernetes-sigs/jobset/blob/release-0.4/examples/simple/paralleljobs.yaml) demonstrates how we can submit multiple replicated jobs in a JobSet.
 
-## Tensorflow Example
+- [Startup Policy](https://github.com/kubernetes-sigs/jobset/blob/release-0.4/examples/startup-policy/startup-driver-ready.yaml) demonstrates how we can define a startup order for ReplicatedJobs in order to ensure a "leader"
+pod is running before the "workers" are created. This is important for enabling the leader-worker paradigm in distributed ML training, where the workers will attempt to register with the leader as soon as they spawn.
 
-In [tensorflow](https://github.com/kubernetes-sigs/jobset/blob/1ae6c0c039c21d29083de38ae70d13c2c8ec613f/examples/tensorflow), we have an example demonstrating how to use Tensorflow with a JobSet.
+## Tensorflow Example
 
-- [Distributed Training of a Handwritten Digit Classifier on the MNIST dataset using Tensorflow and JobSet](https://github.com/kubernetes-sigs/jobset/blob/1ae6c0c039c21d29083de38ae70d13c2c8ec613f/examples/tensorflow/mnist.yaml)
+- [Distributed Training of a Handwritten Digit Classifier on the MNIST dataset using Tensorflow and JobSet](https://github.com/kubernetes-sigs/jobset/blob/release-0.4/examples/tensorflow/mnist.yaml)
 
 This example runs a training job for a single epoch.
 You can view the progress of your jobs via `kubectl logs jobs/tensorflow-tensorflow-0`.
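
As a rough illustration of the fields the simple examples in this file exercise (`successPolicy`, `failurePolicy`, and a startup order for ReplicatedJobs), a minimal JobSet manifest might look like the sketch below. It is not a copy of any file in the repository. The JobSet name, the `driver` and `workers` ReplicatedJob names, the `busybox` image, and all numeric values are assumptions chosen for illustration, and the field names reflect our understanding of the `v1alpha2` API, so check them against the linked release-0.4 examples before relying on them.

```yaml
# Hypothetical manifest for illustration only; names, image, and values are assumptions.
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: policy-sketch
spec:
  # Mark the JobSet successful once all Jobs of the "workers" ReplicatedJob complete.
  successPolicy:
    operator: All
    targetReplicatedJobs:
    - workers
  # Allow up to 3 restarts of the JobSet before it is declared failed.
  failurePolicy:
    maxRestarts: 3
  # Start ReplicatedJobs in the order listed below (driver before workers).
  startupPolicy:
    startupPolicyOrder: InOrder
  replicatedJobs:
  - name: driver
    replicas: 1
    template:
      spec:
        template:
          spec:
            restartPolicy: Never
            containers:
            - name: driver
              image: busybox:1.36
              command: ["sh", "-c", "sleep 30"]
  - name: workers
    replicas: 2
    template:
      spec:
        template:
          spec:
            restartPolicy: Never
            containers:
            - name: worker
              image: busybox:1.36
              command: ["sh", "-c", "echo done"]
```

Assuming the JobSet controller is installed in the cluster, a manifest like this could be submitted with `kubectl apply -f <file>`, and the resulting child Jobs inspected with `kubectl get jobs`.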