Commit

docs: updating docs for local development (kubeflow#2074)
* adding new demo for arm64 and updating docs for local development

Signed-off-by: Francisco Javier Arceo <4163062+franciscojavierarceo@users.noreply.github.com>

* updated cpu

Signed-off-by: Francisco Javier Arceo <4163062+franciscojavierarceo@users.noreply.github.com>

* updated readme to add link to pytorch

Signed-off-by: Francisco Javier Arceo <4163062+franciscojavierarceo@users.noreply.github.com>

* removed links

Signed-off-by: Francisco Javier Arceo <4163062+franciscojavierarceo@users.noreply.github.com>

* fixed typo

Signed-off-by: Francisco Javier Arceo <4163062+franciscojavierarceo@users.noreply.github.com>

* adjusting based on feedback from Yuki

Signed-off-by: Francisco Javier Arceo <4163062+franciscojavierarceo@users.noreply.github.com>

* Removing the make run command from developer guide

Signed-off-by: Francisco Javier Arceo <4163062+franciscojavierarceo@users.noreply.github.com>

* updated dev docs with lates notes

Signed-off-by: Francisco Javier Arceo <4163062+franciscojavierarceo@users.noreply.github.com>

* Removing mnist2 example

Signed-off-by: Francisco Javier Arceo <4163062+franciscojavierarceo@users.noreply.github.com>

* added requirements to developer guide

Signed-off-by: Francisco Javier Arceo <4163062+franciscojavierarceo@users.noreply.github.com>

* adding link to Lima

Signed-off-by: Francisco Javier Arceo <4163062+franciscojavierarceo@users.noreply.github.com>

* moved note about namesapce

Signed-off-by: Francisco Javier Arceo <4163062+franciscojavierarceo@users.noreply.github.com>

---------

Signed-off-by: Francisco Javier Arceo <4163062+franciscojavierarceo@users.noreply.github.com>
franciscojavierarceo authored and johnugeorge committed Apr 28, 2024
1 parent 3beb7ed commit ce6ecf1
Showing 1 changed file with 89 additions and 23 deletions.
112 changes: 89 additions & 23 deletions docs/development/developer_guide.md
@@ -5,6 +5,16 @@ Kubeflow Training Operator is currently at v1.
## Requirements

- [Go](https://golang.org/) (1.22 or later)
- [Docker](https://docs.docker.com/)
- [Docker](https://docs.docker.com/) (20.10 or later)
- [Docker Buildx](https://docs.docker.com/build/buildx/) (0.8.0 or later)
- [Python](https://www.python.org/) (3.11 or later)
- [kustomize](https://kustomize.io/) (4.0.5 or later)
- [Kind](https://kind.sigs.k8s.io/) (0.22.0 or later)
- [Lima](https://github.com/lima-vm/lima?tab=readme-ov-file#adopters) (an alternative to Docker Desktop) (0.21.0 or later)
- [Colima](https://github.com/abiosoft/colima) (Lima specifically for macOS) (0.6.8 or later)

Note that the Lima link points to its Adopters section, which covers several different container environments.
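
Before building, it can help to confirm that the required tools are on your `PATH`. The following is a minimal sketch, not part of the repository: `missing_tools` is a hypothetical helper, and the binary names (`python3`, `kubectl`, and so on) are assumptions about how the tools are installed on your machine. It does not check versions, and it skips Lima and Colima because they are alternatives to Docker Desktop rather than hard requirements.

```sh
#!/bin/sh
# Print the name of every tool passed as an argument that is not on PATH.
# missing_tools is a hypothetical helper written for this guide.
missing_tools() {
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
        fi
    done
}

# Binary names are assumptions; adjust them to match your installation.
missing=$(missing_tools go docker python3 kustomize kind kubectl)
if [ -n "$missing" ]; then
    echo "Missing tools:" $missing
else
    echo "All required tools were found."
fi
```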

## Building the operator

@@ -23,7 +33,7 @@ Install dependencies
go mod tidy
```

Build it
Build the library

```sh
go install github.com/kubeflow/training-operator/cmd/training-operator.v1
@@ -35,47 +45,103 @@ Running the operator locally (as opposed to deploying it on a K8s cluster) is co

### Run a Kubernetes cluster

First, you need to run a Kubernetes cluster locally. There are lots of choices:

- [kind](https://kind.sigs.k8s.io)
First, you need to run a Kubernetes cluster locally. We recommend [Kind](https://kind.sigs.k8s.io).

You can create a `kind` cluster by running:
```sh
kind create cluster
```
This adds the new cluster to your Kubernetes config file (kubeconfig).

### Configure KUBECONFIG and KUBEFLOW_NAMESPACE
After creating the cluster, list its nodes with the command below; the output should include `kind-control-plane`.
```sh
kubectl get nodes
```
The output should look similar to the following:
```
$ kubectl get nodes
NAME                 STATUS   ROLES           AGE   VERSION
kind-control-plane   Ready    control-plane   32s   v1.27.3
```
Note that the example PyTorchJob below runs in the `kubeflow` namespace.

We can configure the operator to run locally using the configuration available in your kubeconfig to communicate with
a K8s cluster. Set your environment:
From here we can apply the manifests to the cluster.
```sh
kubectl apply -k "github.com/kubeflow/training-operator/manifests/overlays/standalone"
```

Then we can patch it with the latest operator image.
```sh
export KUBECONFIG=$(echo ~/.kube/config)
export KUBEFLOW_NAMESPACE=$(your_namespace)
kubectl patch -n kubeflow deployments training-operator --type json -p '[{"op": "replace", "path": "/spec/template/spec/containers/0/image", "value": "kubeflow/training-operator:latest"}]'
```
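
The JSON patch above replaces the image of the first container in the deployment. Since only the image name changes between runs, the patch body can be generated from a variable. This is a minimal sketch: `image_patch` is a hypothetical helper, not something the repository provides, and the `kubectl` call is left as a comment so that the block has no side effects.

```sh
#!/bin/sh
# Build the JSON patch that swaps the image of the first container.
# image_patch is a hypothetical helper; $1 is the image reference.
image_patch() {
    printf '[{"op": "replace", "path": "/spec/template/spec/containers/0/image", "value": "%s"}]' "$1"
}

# Print the patch so you can inspect it before applying.
image_patch "kubeflow/training-operator:latest"
echo

# To apply it (same command as above, with the patch generated):
# kubectl patch -n kubeflow deployments training-operator --type json \
#     -p "$(image_patch kubeflow/training-operator:latest)"
```
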
Then we can run the job with the following command.

- KUBEFLOW_NAMESPACE is used when deployed on Kubernetes, we use this variable to create other resources (e.g. the resource lock) internal in the same namespace. It is optional, use `default` namespace if not set.
```sh
kubectl apply -f https://raw.githubusercontent.com/kubeflow/training-operator/master/examples/pytorch/simple.yaml
```
The job's output appears in its logs. It may take some time to show up, but it should look similar to the following.
```
$ kubectl logs -n kubeflow -l training.kubeflow.org/job-name=pytorch-simple --follow
Defaulted container "pytorch" out of: pytorch, init-pytorch (init)
2024-04-19T19:00:29Z INFO Train Epoch: 1 [4480/60000 (7%)] loss=2.2295
2024-04-19T19:00:32Z INFO Train Epoch: 1 [5120/60000 (9%)] loss=2.1790
2024-04-19T19:00:35Z INFO Train Epoch: 1 [5760/60000 (10%)] loss=2.1150
2024-04-19T19:00:38Z INFO Train Epoch: 1 [6400/60000 (11%)] loss=2.0294
2024-04-19T19:00:41Z INFO Train Epoch: 1 [7040/60000 (12%)] loss=1.9156
2024-04-19T19:00:44Z INFO Train Epoch: 1 [7680/60000 (13%)] loss=1.7949
2024-04-19T19:00:47Z INFO Train Epoch: 1 [8320/60000 (14%)] loss=1.5567
2024-04-19T19:00:50Z INFO Train Epoch: 1 [8960/60000 (15%)] loss=1.3715
2024-04-19T19:00:54Z INFO Train Epoch: 1 [9600/60000 (16%)] loss=1.3385
2024-04-19T19:00:57Z INFO Train Epoch: 1 [10240/60000 (17%)] loss=1.1650
2024-04-19T19:00:29Z INFO Train Epoch: 1 [4480/60000 (7%)] loss=2.2295
2024-04-19T19:00:32Z INFO Train Epoch: 1 [5120/60000 (9%)] loss=2.1790
2024-04-19T19:00:35Z INFO Train Epoch: 1 [5760/60000 (10%)] loss=2.1150
2024-04-19T19:00:38Z INFO Train Epoch: 1 [6400/60000 (11%)] loss=2.0294
2024-04-19T19:00:41Z INFO Train Epoch: 1 [7040/60000 (12%)] loss=1.9156
2024-04-19T19:00:44Z INFO Train Epoch: 1 [7680/60000 (13%)] loss=1.7949
2024-04-19T19:00:47Z INFO Train Epoch: 1 [8320/60000 (14%)] loss=1.5567
2024-04-19T19:00:50Z INFO Train Epoch: 1 [8960/60000 (15%)] loss=1.3715
2024-04-19T19:00:53Z INFO Train Epoch: 1 [9600/60000 (16%)] loss=1.3385
2024-04-19T19:00:57Z INFO Train Epoch: 1 [10240/60000 (17%)] loss=1.1650
```
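
To check quickly that training is progressing, you can pull the most recent loss value out of the log stream. This is a rough sketch that assumes the log lines keep the `loss=<number>` format shown above; `last_loss` is a hypothetical helper, and the sample input is copied from the output above.

```sh
#!/bin/sh
# Read log lines on stdin and print the last loss value seen.
# Assumes each training line contains "loss=<number>" as in the sample output.
last_loss() {
    sed -n 's/.*loss=\([0-9.]*\).*/\1/p' | tail -n 1
}

# Example with two of the sample lines from above:
printf '%s\n' \
    '2024-04-19T19:00:29Z INFO Train Epoch: 1 [4480/60000 (7%)] loss=2.2295' \
    '2024-04-19T19:00:32Z INFO Train Epoch: 1 [5120/60000 (9%)] loss=2.1790' \
    | last_loss

# Against a live job (not executed here):
# kubectl logs -n kubeflow -l training.kubeflow.org/job-name=pytorch-simple | last_loss
```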

### Create the TFJob CRD
## Testing changes locally

After the cluster is up, the TFJob CRD should be created on the cluster.
Now that you have confirmed that the operator runs on a local cluster, you can test your own changes to it.
You do this by building a new operator image and loading it into your kind cluster.

```bash
make install
### Build Operator Image
```sh
make docker-build IMG=my-username/training-operator:my-pr-01
```
You can replace `my-username/training-operator:my-pr-01` with any image name and tag you like.

### Run Operator
### Load docker image
```sh
kind load docker-image my-username/training-operator:my-pr-01
```

Now we are ready to run operator locally:
### Modify operator image with new one

```sh
make run
cd ./manifests/overlays/standalone
kustomize edit set image my-username/training-operator=my-username/training-operator:my-pr-01
```
Update the `newTag` key in `./manifests/overlays/standalone/kustomization.yaml` with the new image tag.

To verify local operator is working, create an example job and you should see jobs created by it.

Deploy the operator with:
```sh
kubectl apply -k ./manifests/overlays/standalone
```
And now we can submit jobs to the operator.
```sh
cd ./examples/tensorflow/dist-mnist
docker build -f Dockerfile -t kubeflow/tf-dist-mnist-test:1.0 .
kubectl create -f ./tf_job_mnist.yaml
kubectl patch -n kubeflow deployments training-operator --type json -p '[{"op": "replace", "path": "/spec/template/spec/containers/0/image", "value": "my-username/training-operator:my-pr-01"}]'
kubectl apply -f https://raw.githubusercontent.com/kubeflow/training-operator/master/examples/pytorch/simple.yaml
```
You can then check that the job is running by viewing its logs:
```sh
kubectl logs -n kubeflow -l training.kubeflow.org/job-name=pytorch-simple
```
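
When iterating on a change, the build, load, and patch steps above are repeated for every new image. One way to reduce typing is to wrap them in a small script. This is a sketch under stated assumptions: `run`, `redeploy_operator`, and the `DRY_RUN` switch are hypothetical helpers, not part of the repository, and `IMG` defaults to the placeholder tag used in this guide. With `DRY_RUN=1` (the default here) the commands are only printed.

```sh
#!/bin/sh
# Sketch only: wraps the build, load, and patch steps in one function.
# IMG is the placeholder tag from this guide; DRY_RUN and the function
# names are hypothetical helpers, not part of the repository.
IMG="${IMG:-my-username/training-operator:my-pr-01}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = execute them

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "+ $*"
    else
        "$@"
    fi
}

# Build the image, load it into the kind cluster, and point the
# deployment at it. Stops at the first step that fails.
redeploy_operator() {
    run make docker-build "IMG=$IMG" &&
    run kind load docker-image "$IMG" &&
    run kubectl patch -n kubeflow deployments training-operator --type json \
        -p "[{\"op\": \"replace\", \"path\": \"/spec/template/spec/containers/0/image\", \"value\": \"$IMG\"}]"
}

redeploy_operator
```

Run it from the repository root with `DRY_RUN=0` once the printed commands look right.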

## Go version

On Ubuntu, the default Go package appears to be `gccgo-go`, which has problems (see this [issue](https://github.com/golang/go/issues/15429)). The `golang-go` package is also really old, so install Go from the official tarballs instead.