docs: updating docs for local development #2074

Merged
merged 14 commits on Apr 26, 2024
Changes from 8 commits
103 changes: 80 additions & 23 deletions docs/development/developer_guide.md
@@ -23,7 +23,7 @@ Install dependencies
go mod tidy
```

Build the library

```sh
go install github.com/kubeflow/training-operator/cmd/training-operator.v1
@@ -35,47 +35,104 @@ Running the operator locally (as opposed to deploying it on a K8s cluster) is co

### Run a Kubernetes cluster

First, you need to run a Kubernetes cluster locally. We recommend [Kind](https://kind.sigs.k8s.io).

You can create a `kind` cluster by running
```sh
kind create cluster
```
This adds the new cluster's context to your kubeconfig file.
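You can confirm that the new context is active. For the default cluster name, kind names the context `kind-kind`:
```sh
# Show the kubectl context created by kind (default name: kind-kind).
kubectl config current-context
```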

After creating the cluster, you can list its nodes with the command below; you should see a `kind-control-plane` node.
```sh
kubectl get nodes
```
The output should look something like this:
```
$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
kind-control-plane Ready control-plane 32s v1.27.3
```

From here we can apply the manifests to the cluster.
```sh
kubectl apply -k "github.com/kubeflow/training-operator/manifests/overlays/standalone"
```
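To check that the operator came up before patching it, something like the following should work (the standalone overlay installs into the `kubeflow` namespace):
```sh
# Confirm the training-operator deployment exists and becomes available.
kubectl -n kubeflow get deployment training-operator
kubectl -n kubeflow wait --for=condition=Available deployment/training-operator --timeout=120s
```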

Then we can patch the deployment to use the latest operator image.
```sh
kubectl patch -n kubeflow deployments training-operator --type json -p '[{"op": "replace", "path": "/spec/template/spec/containers/0/image", "value": "kubeflow/training-operator:latest"}]'
```
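As a quick sanity check, you can confirm the deployment now references the expected image:
```sh
# Print the container image currently set on the deployment.
kubectl -n kubeflow get deployment training-operator \
  -o jsonpath='{.spec.template.spec.containers[0].image}'
```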
Then we can submit an example job with the following command.

```sh
kubectl apply -f https://raw.githubusercontent.com/kubeflow/training-operator/master/examples/pytorch/simple.yaml
```
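Before tailing the logs, you can confirm the job and its pods were created, for example:
```sh
# List the PyTorchJob and the pods it created.
kubectl -n kubeflow get pytorchjobs
kubectl -n kubeflow get pods -l training.kubeflow.org/job-name=pytorch-simple
```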
We can then see the output of the job in the logs; they may take some time to appear, but should look something like this:
```
$ kubectl logs -n kubeflow -l training.kubeflow.org/job-name=pytorch-simple --follow
Defaulted container "pytorch" out of: pytorch, init-pytorch (init)
2024-04-19T19:00:29Z INFO Train Epoch: 1 [4480/60000 (7%)] loss=2.2295
2024-04-19T19:00:32Z INFO Train Epoch: 1 [5120/60000 (9%)] loss=2.1790
2024-04-19T19:00:35Z INFO Train Epoch: 1 [5760/60000 (10%)] loss=2.1150
2024-04-19T19:00:38Z INFO Train Epoch: 1 [6400/60000 (11%)] loss=2.0294
2024-04-19T19:00:41Z INFO Train Epoch: 1 [7040/60000 (12%)] loss=1.9156
2024-04-19T19:00:44Z INFO Train Epoch: 1 [7680/60000 (13%)] loss=1.7949
2024-04-19T19:00:47Z INFO Train Epoch: 1 [8320/60000 (14%)] loss=1.5567
2024-04-19T19:00:50Z INFO Train Epoch: 1 [8960/60000 (15%)] loss=1.3715
2024-04-19T19:00:54Z INFO Train Epoch: 1 [9600/60000 (16%)] loss=1.3385
2024-04-19T19:00:57Z INFO Train Epoch: 1 [10240/60000 (17%)] loss=1.1650
```

## Testing changes locally

Now that you have confirmed you can spin up the operator locally, you can test your local changes to the operator.
You do this by building a new operator image and loading it into your kind cluster.

Note that the example PyTorchJob below uses the `kubeflow` namespace.

### Build Operator Image
```sh
make docker-build IMG=my-username/training-operator:my-pr-01
```
You can replace `my-username/training-operator:my-pr-01` with any image name and tag you like.
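If you want to double-check the build before loading it, listing your local images should show the new tag:
```sh
# List local images for the repository you just built.
docker images my-username/training-operator
```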

### Load the Docker Image
```sh
kind load docker-image my-username/training-operator:my-pr-01
```
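To verify the image landed inside the cluster, you can list the images on the node (assuming the default node name `kind-control-plane`):
```sh
# crictl runs inside the kind node and lists the images the kubelet can use.
docker exec kind-control-plane crictl images | grep training-operator
```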

### Modify the Operator Image

```sh
cd ./manifests/overlays/standalone
kustomize edit set image kubeflow/training-operator=my-username/training-operator:my-pr-01
```
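You can inspect the result; the new image should appear under the `images:` section of the kustomization file:
```sh
# Show the images section that kustomize edit just updated.
grep -A 3 'images:' kustomization.yaml
```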
Alternatively, you can update the `newTag` key in `./manifests/overlays/standalone/kustomization.yaml` directly with the new image tag.

To verify that your local changes work, you can deploy the operator with the new image and create an example job.

Deploy the operator with:
```sh
kubectl apply -k ./manifests/overlays/standalone
```
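You can wait for the rollout to finish before submitting jobs:
```sh
# Block until the patched deployment is fully rolled out.
kubectl -n kubeflow rollout status deployment/training-operator
```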
And now we can submit jobs to the operator.
```sh
kubectl patch -n kubeflow deployments training-operator --type json -p '[{"op": "replace", "path": "/spec/template/spec/containers/0/image", "value": "my-username/training-operator:my-pr-01"}]'
kubectl apply -f https://raw.githubusercontent.com/kubeflow/training-operator/master/examples/pytorch/simple.yaml
```
You should then be able to see the logs from the job created by your local operator:
```sh
kubectl logs -n kubeflow -l training.kubeflow.org/job-name=pytorch-simple
```
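Once you are done testing, you can clean up the example job and, if you like, the whole kind cluster:
```sh
# Remove the example job; delete the cluster only if you no longer need it.
kubectl -n kubeflow delete pytorchjob pytorch-simple
kind delete cluster
```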

## Go version

On Ubuntu, the default Go package appears to be gccgo-go, which has known problems (see this [issue](https://github.com/golang/go/issues/15429)). The golang-go package is also quite old, so install Go from the official tarballs instead.
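For example, a typical tarball install looks roughly like this (the version below is only an illustration; pick the latest release from https://go.dev/dl/):
```sh
# Download and unpack an official Go release (example version, not a requirement).
curl -LO https://go.dev/dl/go1.22.2.linux-amd64.tar.gz
sudo rm -rf /usr/local/go
sudo tar -C /usr/local -xzf go1.22.2.linux-amd64.tar.gz
export PATH=$PATH:/usr/local/go/bin
go version
```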
17 changes: 17 additions & 0 deletions examples/pytorch/mnist2/Dockerfile.cpu
Member: What is the reason that this new example is needed?

In my local arm64 env, this PyTorch example worked fine since the example uses the same image:

image: docker.io/kubeflowkatib/pytorch-mnist:v1beta1-45c5727
https://github.com/kubeflow/training-operator/blob/7345e33b333ba5084127efe027774dd7bed8f6e6/examples/pytorch/simple.yaml

Did you use the master branch?
@@ -0,0 +1,17 @@
FROM python:3.11-slim

ADD . /opt/pytorch-mnist

WORKDIR /opt/pytorch-mnist

# Add folder for the logs.
RUN mkdir /katib
RUN pip install --prefer-binary --no-cache-dir torch==2.2.1 torchvision==0.17.1
RUN pip install --prefer-binary --no-cache-dir -r requirements.txt

RUN chgrp -R 0 /opt/pytorch-mnist \
&& chmod -R g+rwX /opt/pytorch-mnist \
&& chgrp -R 0 /katib \
&& chmod -R g+rwX /katib

ENTRYPOINT ["python3", "/opt/pytorch-mnist/mnist.py"]
5 changes: 5 additions & 0 deletions examples/pytorch/mnist2/README.md
@@ -0,0 +1,5 @@
# PyTorch MNIST Image Classification Example

This is the [PyTorch MNIST](https://github.com/pytorch/examples/blob/main/mnist/main.py) image classification training container, which saves metrics to a file or prints them to stdout. It uses a convolutional neural network to train the model.
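As a rough sketch of how this container might be built for local testing (the image name and tag below are placeholders, not part of this example), you could do:
```sh
# Build the CPU image from this directory and load it into a kind cluster.
docker build -f Dockerfile.cpu -t my-username/pytorch-mnist:local .
kind load docker-image my-username/pytorch-mnist:local
```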
205 changes: 205 additions & 0 deletions examples/pytorch/mnist2/mnist.py
@@ -0,0 +1,205 @@
# Copyright 2022 The Kubeflow Authors.
Member: Should this be PyTorch authors?

Contributor Author: I copied this over from the demos linked so I'm just taking what they listed. I can change of course.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from __future__ import print_function

import argparse
import logging
import os

import hypertune
from torchvision import datasets, transforms
import torch
import torch.distributed as dist
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

WORLD_SIZE = int(os.environ.get("WORLD_SIZE", 1))


class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 20, 5, 1)
        self.conv2 = nn.Conv2d(20, 50, 5, 1)
        self.fc1 = nn.Linear(4*4*50, 500)
        self.fc2 = nn.Linear(500, 10)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = F.max_pool2d(x, 2, 2)
        x = F.relu(self.conv2(x))
        x = F.max_pool2d(x, 2, 2)
        x = x.view(-1, 4*4*50)
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return F.log_softmax(x, dim=1)


def train(args, model, device, train_loader, optimizer, epoch):
    model.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        output = model(data)
        loss = F.nll_loss(output, target)
        loss.backward()
        optimizer.step()
        if batch_idx % args.log_interval == 0:
            msg = "Train Epoch: {} [{}/{} ({:.0f}%)]\tloss={:.4f}".format(
                epoch, batch_idx * len(data), len(train_loader.dataset),
                100. * batch_idx / len(train_loader), loss.item())
            logging.info(msg)
            niter = epoch * len(train_loader) + batch_idx


def test(args, model, device, test_loader, epoch, hpt):
    model.eval()
    test_loss = 0
    correct = 0
    with torch.no_grad():
        for data, target in test_loader:
            data, target = data.to(device), target.to(device)
            output = model(data)
            test_loss += F.nll_loss(output, target, reduction="sum").item()  # sum up batch loss
            pred = output.max(1, keepdim=True)[1]  # get the index of the max log-probability
            correct += pred.eq(target.view_as(pred)).sum().item()

    test_loss /= len(test_loader.dataset)
    test_accuracy = float(correct) / len(test_loader.dataset)
    logging.info("{{metricName: accuracy, metricValue: {:.4f}}};{{metricName: loss, metricValue: {:.4f}}}\n".format(
        test_accuracy, test_loss))

    if args.logger == "hypertune":
        hpt.report_hyperparameter_tuning_metric(
            hyperparameter_metric_tag='loss',
            metric_value=test_loss,
            global_step=epoch)
        hpt.report_hyperparameter_tuning_metric(
            hyperparameter_metric_tag='accuracy',
            metric_value=test_accuracy,
            global_step=epoch)


def should_distribute():
    return dist.is_available() and WORLD_SIZE > 1


def is_distributed():
    return dist.is_available() and dist.is_initialized()


def main():
    # Training settings
    parser = argparse.ArgumentParser(description="PyTorch MNIST Example")
    parser.add_argument("--batch-size", type=int, default=64, metavar="N",
                        help="input batch size for training (default: 64)")
    parser.add_argument("--test-batch-size", type=int, default=1000, metavar="N",
                        help="input batch size for testing (default: 1000)")
    parser.add_argument("--epochs", type=int, default=10, metavar="N",
                        help="number of epochs to train (default: 10)")
    parser.add_argument("--lr", type=float, default=0.01, metavar="LR",
                        help="learning rate (default: 0.01)")
    parser.add_argument("--momentum", type=float, default=0.5, metavar="M",
                        help="SGD momentum (default: 0.5)")
    parser.add_argument("--no-cuda", action="store_true", default=False,
                        help="disables CUDA training")
    parser.add_argument("--seed", type=int, default=1, metavar="S",
                        help="random seed (default: 1)")
    parser.add_argument("--log-interval", type=int, default=10, metavar="N",
                        help="how many batches to wait before logging training status")
    parser.add_argument("--log-path", type=str, default="",
                        help="Path to save logs. Print to StdOut if log-path is not set")
    parser.add_argument("--save-model", action="store_true", default=False,
                        help="For Saving the current Model")
    parser.add_argument("--logger", type=str, choices=["standard", "hypertune"],
                        help="Logger", default="standard")

    if dist.is_available():
        parser.add_argument("--backend", type=str, help="Distributed backend",
                            choices=[dist.Backend.GLOO, dist.Backend.NCCL, dist.Backend.MPI],
                            default=dist.Backend.GLOO)
    args = parser.parse_args()

    # Use this format (%Y-%m-%dT%H:%M:%SZ) to record timestamp of the metrics.
    # If log_path is empty print log to StdOut, otherwise print log to the file.
    if args.log_path == "" or args.logger == "hypertune":
        logging.basicConfig(
            format="%(asctime)s %(levelname)-8s %(message)s",
            datefmt="%Y-%m-%dT%H:%M:%SZ",
            level=logging.DEBUG)
    else:
        logging.basicConfig(
            format="%(asctime)s %(levelname)-8s %(message)s",
            datefmt="%Y-%m-%dT%H:%M:%SZ",
            level=logging.DEBUG,
            filename=args.log_path)

    if args.logger == "hypertune" and args.log_path != "":
        os.environ['CLOUD_ML_HP_METRIC_FILE'] = args.log_path

    # For JSON logging
    hpt = hypertune.HyperTune()

    use_cuda = not args.no_cuda and torch.cuda.is_available()
    if use_cuda:
        print("Using CUDA")

    torch.manual_seed(args.seed)

    device = torch.device("cuda" if use_cuda else "cpu")

    if should_distribute():
        print("Using distributed PyTorch with {} backend".format(args.backend))
        dist.init_process_group(backend=args.backend)

    kwargs = {"num_workers": 1, "pin_memory": True} if use_cuda else {}

    train_loader = torch.utils.data.DataLoader(
        datasets.FashionMNIST("./data",
                              train=True,
                              download=True,
                              transform=transforms.Compose([
                                  transforms.ToTensor()
                              ])),
        batch_size=args.batch_size, shuffle=True, **kwargs)

    test_loader = torch.utils.data.DataLoader(
        datasets.FashionMNIST("./data",
                              train=False,
                              transform=transforms.Compose([
                                  transforms.ToTensor()
                              ])),
        batch_size=args.test_batch_size, shuffle=False, **kwargs)

    model = Net().to(device)

    if is_distributed():
        # DistributedDataParallelCPU no longer exists in recent PyTorch releases;
        # DistributedDataParallel handles both CPU (gloo) and GPU backends.
        model = nn.parallel.DistributedDataParallel(model)

    optimizer = optim.SGD(model.parameters(), lr=args.lr, momentum=args.momentum)

    for epoch in range(1, args.epochs + 1):
        train(args, model, device, train_loader, optimizer, epoch)
        test(args, model, device, test_loader, epoch, hpt)

    if (args.save_model):
        torch.save(model.state_dict(), "mnist_cnn.pt")


if __name__ == "__main__":
    main()
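As a quick local smoke test outside the cluster (flags per the argparse options above; the Fashion-MNIST data is downloaded automatically), you could run:
```sh
# Install the example's dependencies, then run a single short epoch.
pip install -r requirements.txt torch torchvision
python mnist.py --epochs 1 --log-interval 100
```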
2 changes: 2 additions & 0 deletions examples/pytorch/mnist2/requirements.txt
Original file line number Diff line number Diff line change
@@ -0,0 +1,2 @@
cloudml-hypertune==0.1.0.dev6
Pillow>=9.1.1