docs: updating docs for local development (kubeflow#2074)

* adding new demo for arm64 and updating docs for local development
* updated cpu
* updated readme to add link to pytorch
* removed links
* fixed typo
* adjusting based on feedback from Yuki
* Removing the make run command from developer guide
* updated dev docs with latest notes
* Removing mnist2 example
* added requirements to developer guide
* adding link to Lima
* moved note about namespace

Signed-off-by: Francisco Javier Arceo <4163062+franciscojavierarceo@users.noreply.github.com>
johnugeorge · Apr 28, 2024 · ce6ecf1
1 parent 3beb7ed
commit ce6ecf1
Showing 1 changed file with 89 additions and 23 deletions.
diff --git a/docs/development/developer_guide.md b/docs/development/developer_guide.md
@@ -5,6 +5,16 @@ Kubeflow Training Operator is currently at v1.
 ## Requirements
 
 - [Go](https://golang.org/) (1.22 or later)
+- [Docker](https://docs.docker.com/) (20.10 or later)
+- [Docker Buildx](https://docs.docker.com/build/buildx/) (0.8.0 or later)
+- [Python](https://www.python.org/) (3.11 or later)
+- [kustomize](https://kustomize.io/) (4.0.5 or later)
+- [Kind](https://kind.sigs.k8s.io/) (0.22.0 or later)
+- [Lima](https://github.com/lima-vm/lima?tab=readme-ov-file#adopters) (an alternative to Docker Desktop) (0.21.0 or later)
+  - [Colima](https://github.com/abiosoft/colima) (Lima specifically for macOS) (0.6.8 or later)
+
+Note that the Lima link points to the project's Adopters section, which lists several container environments built on Lima.
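
A quick way to confirm that the tools above are on your `PATH` before building is a small shell check. This is a minimal sketch: `missing_tools` is a hypothetical helper name introduced here, it only checks for presence and not for the minimum versions listed above, and the binary names in the example (`go`, `docker`, `python3`, `kustomize`, `kind`) are the usual defaults and may differ on your system.

```sh
# missing_tools: print each named command that is not found on PATH.
# Hypothetical helper for illustration; it does not check versions.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Example: missing_tools go docker python3 kustomize kind
```

An empty result means every listed command was found.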
 
 ## Building the operator
 
@@ -23,7 +33,7 @@ Install dependencies
 go mod tidy
 ```
 
-Build it
+Build the library
 
 ```sh
 go install github.com/kubeflow/training-operator/cmd/training-operator.v1
@@ -35,47 +45,103 @@ Running the operator locally (as opposed to deploying it on a K8s cluster) is co
 
 ### Run a Kubernetes cluster
 
-First, you need to run a Kubernetes cluster locally. There are lots of choices:
-
-- [kind](https://kind.sigs.k8s.io)
+First, you need to run a Kubernetes cluster locally. We recommend [Kind](https://kind.sigs.k8s.io).
 
+You can create a `kind` cluster by running:
+```sh
+kind create cluster
+```
+This adds the new cluster to your kubeconfig file and sets it as the current context.
 
-### Configure KUBECONFIG and KUBEFLOW_NAMESPACE
+After creating the cluster, you can list its nodes with the command below, which should show the `kind-control-plane` node.
+```sh
+kubectl get nodes
+```
+The output should look similar to the following:
+```
+$ kubectl get nodes
+NAME                 STATUS   ROLES           AGE   VERSION
+kind-control-plane   Ready    control-plane   32s   v1.27.3
+```
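
If you script the cluster setup, you may want to check the `STATUS` column instead of reading it by eye. The sketch below assumes the default column layout shown above (name first, status second); `not_ready_nodes` is a hypothetical helper name, not part of `kubectl` or `kind`.

```sh
# not_ready_nodes: read `kubectl get nodes` output on stdin and print the
# names of nodes whose STATUS column is anything other than exactly "Ready".
# Assumes the default columns: NAME STATUS ROLES AGE VERSION.
not_ready_nodes() {
  awk 'NR > 1 && $2 != "Ready" { print $1 }'
}

# Example: kubectl get nodes | not_ready_nodes
```

No output means every node reports `Ready`.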
+Note that the example job below, a PyTorchJob, runs in the `kubeflow` namespace.
 
-We can configure the operator to run locally using the configuration available in your kubeconfig to communicate with
-a K8s cluster. Set your environment:
+From here we can apply the manifests to the cluster.
+```sh
+kubectl apply -k "github.com/kubeflow/training-operator/manifests/overlays/standalone"
+```
 
+Then we can patch it with the latest operator image.
 ```sh
-export KUBECONFIG=$(echo ~/.kube/config)
-export KUBEFLOW_NAMESPACE=$(your_namespace)
+kubectl patch -n kubeflow deployments training-operator --type json -p '[{"op": "replace", "path": "/spec/template/spec/containers/0/image", "value": "kubeflow/training-operator:latest"}]'
 ```
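
The `-p` argument above is a JSON Patch document with a single `replace` operation on the first container's `image` field. If you switch images often, a small helper can generate it. `image_patch` below is a hypothetical name, and the sketch assumes the image reference contains no characters that need JSON escaping.

```sh
# image_patch IMAGE: print a JSON Patch that replaces the image of the
# first container in a Deployment's pod template.
# Assumes IMAGE has no quotes or backslashes (no JSON escaping is done).
image_patch() {
  printf '[{"op": "replace", "path": "/spec/template/spec/containers/0/image", "value": "%s"}]\n' "$1"
}

# Example:
#   kubectl patch -n kubeflow deployments training-operator \
#     --type json -p "$(image_patch kubeflow/training-operator:latest)"
```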
+Next, we can submit the example job with the following command.
 
-- KUBEFLOW_NAMESPACE is used when deployed on Kubernetes, we use this variable to create other resources (e.g. the resource lock) internal in the same namespace. It is optional, use `default` namespace if not set.
+```sh 
+kubectl apply -f https://raw.githubusercontent.com/kubeflow/training-operator/master/examples/pytorch/simple.yaml
+```
+The job's output appears in the logs. It may take some time to show up, and should look similar to the following.
+```
+$ kubectl logs -n kubeflow -l training.kubeflow.org/job-name=pytorch-simple --follow
+Defaulted container "pytorch" out of: pytorch, init-pytorch (init)
+2024-04-19T19:00:29Z INFO     Train Epoch: 1 [4480/60000 (7%)]	loss=2.2295
+2024-04-19T19:00:32Z INFO     Train Epoch: 1 [5120/60000 (9%)]	loss=2.1790
+2024-04-19T19:00:35Z INFO     Train Epoch: 1 [5760/60000 (10%)]	loss=2.1150
+2024-04-19T19:00:38Z INFO     Train Epoch: 1 [6400/60000 (11%)]	loss=2.0294
+2024-04-19T19:00:41Z INFO     Train Epoch: 1 [7040/60000 (12%)]	loss=1.9156
+2024-04-19T19:00:44Z INFO     Train Epoch: 1 [7680/60000 (13%)]	loss=1.7949
+2024-04-19T19:00:47Z INFO     Train Epoch: 1 [8320/60000 (14%)]	loss=1.5567
+2024-04-19T19:00:50Z INFO     Train Epoch: 1 [8960/60000 (15%)]	loss=1.3715
+2024-04-19T19:00:54Z INFO     Train Epoch: 1 [9600/60000 (16%)]	loss=1.3385
+2024-04-19T19:00:57Z INFO     Train Epoch: 1 [10240/60000 (17%)]	loss=1.1650
+2024-04-19T19:00:29Z INFO     Train Epoch: 1 [4480/60000 (7%)]	loss=2.2295
+2024-04-19T19:00:32Z INFO     Train Epoch: 1 [5120/60000 (9%)]	loss=2.1790
+2024-04-19T19:00:35Z INFO     Train Epoch: 1 [5760/60000 (10%)]	loss=2.1150
+2024-04-19T19:00:38Z INFO     Train Epoch: 1 [6400/60000 (11%)]	loss=2.0294
+2024-04-19T19:00:41Z INFO     Train Epoch: 1 [7040/60000 (12%)]	loss=1.9156
+2024-04-19T19:00:44Z INFO     Train Epoch: 1 [7680/60000 (13%)]	loss=1.7949
+2024-04-19T19:00:47Z INFO     Train Epoch: 1 [8320/60000 (14%)]	loss=1.5567
+2024-04-19T19:00:50Z INFO     Train Epoch: 1 [8960/60000 (15%)]	loss=1.3715
+2024-04-19T19:00:53Z INFO     Train Epoch: 1 [9600/60000 (16%)]	loss=1.3385
+2024-04-19T19:00:57Z INFO     Train Epoch: 1 [10240/60000 (17%)]	loss=1.1650
+```
 
-### Create the TFJob CRD
+## Testing changes locally
 
-After the cluster is up, the TFJob CRD should be created on the cluster.
+Now that you have confirmed that you can spin up an operator locally, you can test your local changes to the operator.
+You do this by building a new operator image and loading it into your kind cluster.
 
-```bash
-make install
+### Build Operator Image
+```sh
+make docker-build IMG=my-username/training-operator:my-pr-01
 ```
+You can replace `my-username/training-operator:my-pr-01` with any image name and tag you like.
 
-### Run Operator
+### Load Docker image
+```sh
+kind load docker-image my-username/training-operator:my-pr-01
+``` 
 
-Now we are ready to run operator locally:
+### Modify operator image with new one
 
 ```sh
-make run
+cd ./manifests/overlays/standalone
+kustomize edit set image my-username/training-operator=my-username/training-operator:my-pr-01
 ```
+Update the `newTag` key in `./manifests/overlays/standalone/kustomization.yaml` with the new image tag.
 
-To verify local operator is working, create an example job and you should see jobs created by it.
-
+Deploy the operator with: 
+```sh 
+kubectl apply -k ./manifests/overlays/standalone
+```
+And now we can submit jobs to the operator.
 ```sh
-cd ./examples/tensorflow/dist-mnist
-docker build -f Dockerfile -t kubeflow/tf-dist-mnist-test:1.0 .
-kubectl create -f ./tf_job_mnist.yaml
+kubectl patch -n kubeflow deployments training-operator --type json -p '[{"op": "replace", "path": "/spec/template/spec/containers/0/image", "value": "my-username/training-operator:my-pr-01"}]'
+kubectl apply -f https://raw.githubusercontent.com/kubeflow/training-operator/master/examples/pytorch/simple.yaml
+```
+You should then be able to see the logs of the job's pods in the `kubeflow` namespace using:
+```
+kubectl logs -n kubeflow -l training.kubeflow.org/job-name=pytorch-simple 
 ```
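
When you iterate on a change, the build, load, and deploy steps above can be chained into one helper. This is only a sketch: `redeploy_operator`, `run`, and the `DRY_RUN` switch are names introduced here for illustration and are not part of the repository. It skips the `kustomize edit` step and relies on `kubectl patch` to set the image, prints the commands by default, runs them only when `DRY_RUN=0`, and is meant to be called from the repository root.

```sh
IMG="${IMG:-my-username/training-operator:my-pr-01}"
DRY_RUN="${DRY_RUN:-1}"

# run: print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

# redeploy_operator: build the image, load it into kind, apply the
# standalone overlay, then point the Deployment at the new image.
redeploy_operator() {
  patch=$(printf '[{"op": "replace", "path": "/spec/template/spec/containers/0/image", "value": "%s"}]' "$IMG")
  run make docker-build IMG="$IMG" &&
    run kind load docker-image "$IMG" &&
    run kubectl apply -k ./manifests/overlays/standalone &&
    run kubectl patch -n kubeflow deployments training-operator --type json -p "$patch"
}
```

Calling `redeploy_operator` with the defaults only prints the four commands, which makes it easy to review them before setting `DRY_RUN=0`.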
-
 ## Go version
 
 On Ubuntu, the default Go package appears to be gccgo-go, which has problems (see [issue](https://github.com/golang/go/issues/15429)). The golang-go package is also quite old, so install Go from the official tarballs instead.
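
Installing from a tarball follows the steps on the official Go download page. The sketch below assumes Linux on amd64 and uses Go 1.22.2 purely as an example version. `go_tarball_url` and `install_go` are hypothetical helper names, and `install_go` needs `curl` and `sudo`.

```sh
# go_tarball_url VERSION OS ARCH: print the download URL for a Go release.
go_tarball_url() {
  printf 'https://go.dev/dl/go%s.%s-%s.tar.gz\n' "$1" "$2" "$3"
}

# install_go VERSION: download the tarball and unpack it into /usr/local/go,
# replacing any previous install there. Assumes linux/amd64.
install_go() {
  url=$(go_tarball_url "$1" linux amd64)
  curl -fsSL -o /tmp/go.tar.gz "$url" &&
    sudo rm -rf /usr/local/go &&
    sudo tar -C /usr/local -xzf /tmp/go.tar.gz
}

# Example: install_go 1.22.2, then add /usr/local/go/bin to PATH.
```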