Add Katib early stopping documentation #2336

Merged (7 commits) on Nov 13, 2020
208 changes: 208 additions & 0 deletions content/en/docs/components/katib/early-stopping.md
@@ -0,0 +1,208 @@
+++
title = "Using Early Stopping"
description = "How to use early stopping in Katib experiments"
weight = 60

+++

This guide shows how you can use
[early stopping](https://en.wikipedia.org/wiki/Early_stopping) to improve your
Katib experiments. Early stopping helps you avoid overfitting when you
train your model during Katib experiments. It also saves computing
resources and reduces experiment execution time by stopping the experiment's
trials when the target metric(s) no longer improve, before the training process
is complete.

The major advantage of using early stopping in Katib is that you don't
need to modify your
[training container package](/docs/components/katib/experiment/#packaging-your-training-code-in-a-container-image).
All you have to do is make necessary changes in your experiment's YAML file.

Early stopping works in the same way as Katib's
[metrics collector](/docs/components/katib/experiment/#metrics-collector).
It analyses the required metrics from `stdout` or from an arbitrary output file,
and an early stopping algorithm decides whether the trial needs to be
stopped. Currently, early stopping works only with the
`StdOut` or `File` metrics collectors.

**Note**: Your training container must print training logs with a timestamp,
because early stopping algorithms need to know the order of the reported metrics.
Check the
[`MXNet` example](https://github.com/kubeflow/katib/blob/master/examples/v1beta1/mxnet-mnist/mnist.py#L36)
to learn how to add a date format to your logs.
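
As a minimal sketch of the note above, the following Python snippet configures
the standard `logging` module so that every line written to `stdout` starts
with a timestamp. The format strings, the helper names `make_formatter` and
`configure_logging`, and the metric names are assumptions for illustration;
match them to your own training code and to the metrics your experiment
collects.

```python
import logging
import sys
import time

# Assumed layout: timestamp, level, then the message carrying the metric.
LOG_FORMAT = "%(asctime)s %(levelname)-8s %(message)s"
DATE_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def make_formatter() -> logging.Formatter:
    """Build a formatter that prefixes every record with a UTC timestamp."""
    formatter = logging.Formatter(fmt=LOG_FORMAT, datefmt=DATE_FORMAT)
    # The trailing "Z" in DATE_FORMAT claims UTC, so format times in UTC.
    formatter.converter = time.gmtime
    return formatter


def configure_logging() -> None:
    """Send timestamped records to stdout."""
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(make_formatter())
    logging.basicConfig(level=logging.INFO, handlers=[handler])


if __name__ == "__main__":
    configure_logging()
    # Hypothetical metric names; use the ones from your experiment's objective.
    logging.info("Validation-accuracy=%f", 0.912)
    logging.info("Train-accuracy=%f", 0.954)
```

Call `configure_logging()` once at the start of the training script, then log
each metric on its own line as training progresses.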

## Configure the experiment with early stopping

As a reference, you can use the YAML file of the
[early stopping example](https://github.com/kubeflow/katib/blob/master/examples/v1beta1/early-stopping/median-stop.yaml).

1. Follow the
[guide](/docs/components/katib/experiment/#configuring-the-experiment)
to configure your Katib experiment.

2. Next, to apply early stopping for your experiment, specify the `.spec.earlyStopping`
parameter, similar to the `.spec.algorithm`. Refer to the
[`EarlyStoppingSpec` type](https://github.com/kubeflow/katib/blob/master/pkg/apis/controller/common/v1beta1/common_types.go#L41-L58)
for more information.

- `.earlyStopping.algorithmName` - the name of the early stopping algorithm.

- `.earlyStopping.algorithmSettings` - the settings for the early stopping algorithm.

Here is what happens: your experiment's suggestion produces new trials. After that,
the early stopping algorithm generates early stopping rules for the created
trials. Once a trial meets all the rules, it is stopped and its status
is changed to `EarlyStopped`. Then, Katib calls the suggestion again to
ask for new trials.
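
The `.spec.earlyStopping` fields described above can be sketched as a Python
dictionary, for example when you generate or patch an experiment manifest
programmatically. The helper `early_stopping_spec` is hypothetical, and the
`name`/`value` keys with string values are an assumption about how each
setting is serialized; check the `EarlyStoppingSpec` type linked above before
relying on it.

```python
import json


def early_stopping_spec(algorithm_name: str, settings: dict) -> dict:
    """Build the earlyStopping stanza of an experiment spec.

    Assumption: each setting is a {"name", "value"} pair with a string value.
    """
    return {
        "algorithmName": algorithm_name,
        "algorithmSettings": [
            {"name": name, "value": str(value)} for name, value in settings.items()
        ],
    }


# Fragment to merge into an existing experiment manifest.
experiment_patch = {
    "spec": {
        "earlyStopping": early_stopping_spec(
            "medianstop", {"min_trials_required": 3, "start_step": 4}
        )
    }
}

print(json.dumps(experiment_patch, indent=2))
```

The printed fragment mirrors what you would write under `spec:` in the
experiment's YAML file.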

Learn more about Katib concepts
in the [overview guide](/docs/components/katib/overview/#katib-concepts).

Follow the
[Katib configuration guide](/docs/components/katib/katib-config/#early-stopping-settings)
to specify your own image for the early stopping algorithm.

### Early stopping algorithms in detail

Here’s a list of the early stopping algorithms available in Katib:

- [Median Stopping Rule](#median-stopping-rule)

More algorithms are under development.

You can add an early stopping algorithm to Katib yourself. Check the
[developer guide](https://github.com/kubeflow/katib/blob/master/docs/developer-guide.md)
to contribute.

<a id="median-stopping-rule"></a>

### Median Stopping Rule

The early stopping algorithm name in Katib is `medianstop`.

The median stopping rule stops a pending trial `X` at step `S` if the trial's
best objective value by step `S` is worse than the median value of the running
averages of all completed trials' objectives reported up to step `S`.

To learn more about it, check
[Google Vizier: A Service for Black-Box Optimization](https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/46180.pdf).

Katib supports the following early stopping settings:

<div class="table-responsive">
<table class="table table-bordered">
<thead class="thead-light">
<tr>
<th>Setting Name</th>
<th>Description</th>
<th>Default Value</th>
</tr>
</thead>
<tbody>
<tr>
<td>min_trials_required</td>
<td>Minimum number of succeeded trials required to compute the median value</td>
<td>3</td>
</tr>
<tr>
<td>start_step</td>
<td>Number of intermediate results a trial must report before it can be stopped</td>
<td>4</td>
</tr>
</tbody>
</table>
</div>
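
To make the rule and its two settings concrete, here is a small Python sketch.
It is not Katib's implementation: the function names, the `maximize` flag, and
the representation of each trial as a list of per-step objective values are
assumptions chosen for illustration.

```python
from statistics import median
from typing import Sequence


def running_average(values: Sequence[float], step: int) -> float:
    """Mean of the objective values reported in the first `step` steps."""
    window = values[:step]
    return sum(window) / len(window)


def should_stop(
    trial_values: Sequence[float],
    completed_trials: Sequence[Sequence[float]],
    step: int,
    *,
    min_trials_required: int = 3,
    start_step: int = 4,
    maximize: bool = True,
) -> bool:
    """Return True if the median stopping rule would stop the trial at `step`."""
    observed = trial_values[:step]
    # Never stop a trial before it has reported `start_step` results.
    if not observed or len(observed) < start_step:
        return False
    # Ignore completed trials that reported nothing; require enough of the rest.
    reference = [t for t in completed_trials if len(t) > 0]
    if len(reference) < min_trials_required:
        return False
    threshold = median([running_average(t, step) for t in reference])
    # Compare the trial's best value so far with the median running average.
    if maximize:
        return max(observed) < threshold
    return min(observed) > threshold
```

For example, with three completed trials whose running averages at step 4 are
0.55, 0.65, and 0.75, a trial whose best accuracy after four reports is 0.4
falls below the median (0.65) and would be stopped.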

### Submit an early stopping experiment from the UI

You can use Katib UI to submit an early stopping experiment. Follow
[these steps](/docs/components/katib/experiment/#running-the-experiment-from-the-katib-ui)
to create an experiment from the UI.

Once you reach the early stopping section, select the appropriate values:

<img src="/docs/images/katib/katib-early-stopping-parameter.png"
alt="UI form to deploy an early stopping Katib experiment"
class="mt-3 mb-3 border border-info rounded">

## View the early stopping experiment results

First, make sure you have [jq](https://stedolan.github.io/jq/download/)
installed.

Check the early stopped trials in your experiment:

```shell
kubectl get experiment <experiment-name> -n <experiment-namespace> -o json | jq -r ".status"
```

The last part of the above command output looks similar to this:

```yaml
. . .
"earlyStoppedTrialList": [
"median-stop-2ml8h96d",
"median-stop-cgjkq8zn",
"median-stop-pvn5p54p",
"median-stop-sjc9tcgc"
],
"startTime": "2020-11-05T03:03:43Z",
"succeededTrialList": [
"median-stop-2kmh57qf",
"median-stop-7ccstz4z",
"median-stop-7sqt7556",
"median-stop-lgvhfch2",
"median-stop-mkfjtwbj",
"median-stop-nfmgqd7w",
"median-stop-nsbxw5m9",
"median-stop-nsmhg4p2",
"median-stop-rp88xflk",
"median-stop-xl7dlf5n",
"median-stop-ztc58kwq"
],
"trials": 15,
"trialsEarlyStopped": 4,
"trialsSucceeded": 11
}
```
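
If you prefer to post-process the status in a script, the sketch below
summarizes the early stopped trials from the JSON shown above. The function
name `summarize_early_stopping` and the embedded sample are illustrative; the
field names are taken from the output above.

```python
import json

# Shortened sample of the experiment status shown above.
SAMPLE_STATUS = """
{
  "earlyStoppedTrialList": [
    "median-stop-2ml8h96d",
    "median-stop-cgjkq8zn",
    "median-stop-pvn5p54p",
    "median-stop-sjc9tcgc"
  ],
  "trials": 15,
  "trialsEarlyStopped": 4,
  "trialsSucceeded": 11
}
"""


def summarize_early_stopping(status: dict) -> dict:
    """Collect the early stopped trial names and their share of all trials."""
    stopped = status.get("earlyStoppedTrialList", [])
    total = status.get("trials", 0)
    return {
        "stopped_trials": sorted(stopped),
        "stopped_count": len(stopped),
        "total_trials": total,
        "stopped_ratio": len(stopped) / total if total else 0.0,
    }


summary = summarize_early_stopping(json.loads(SAMPLE_STATUS))
print(json.dumps(summary, indent=2))
```

To run this against a live experiment, replace `SAMPLE_STATUS` with the output
of the `kubectl ... | jq -r ".status"` command above, for example by reading
it from standard input.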

Check the status of the early stopped trial by running this command:

```shell
kubectl get trial median-stop-2ml8h96d -n <experiment-namespace>
```

You should see the `EarlyStopped` status for the trial:

```shell
NAME                   TYPE           STATUS   AGE
median-stop-2ml8h96d   EarlyStopped   True     15m
```

In addition, you can check your results on the Katib UI.
The trial statuses on the experiment monitor page should look as follows:

<img src="/docs/images/katib/katib-early-stopping-trials.png"
alt="UI form to view trials"
class="mt-3 mb-3 border border-info rounded">

Click the name of an early stopped trial to view the metrics it reported
before it was stopped:

<img src="/docs/images/katib/katib-early-stopping-trial-info.png"
alt="UI form to view trial info"
class="mt-3 mb-3 border border-info rounded">

## Next steps

- Learn how to
[configure and run your Katib experiments](/docs/components/katib/experiment/).

- Learn how to
[restart your experiment and use the resume policies](/docs/components/katib/resume-experiment/).

- Check the
[Katib Configuration (Katib config)](/docs/components/katib/katib-config/).

- Learn how to [set up environment variables](/docs/components/katib/env-variables/)
for each Katib component.
2 changes: 1 addition & 1 deletion content/en/docs/components/katib/env-variables.md
@@ -1,7 +1,7 @@
+++
title = "Environment Variables for Katib Components"
description = "How to set up environment variables for each Katib component"
weight = 60
weight = 80

+++

13 changes: 8 additions & 5 deletions content/en/docs/components/katib/experiment.md
@@ -1,5 +1,5 @@
+++
title = "Running an experiment"
title = "Running an Experiment"
description = "How to configure and run a hyperparameter tuning or neural architecture search experiment in Katib"
weight = 30

@@ -177,8 +177,7 @@ Katib currently supports several search algorithms.
Refer to the
[`AlgorithmSpec` type](https://github.com/kubeflow/katib/blob/master/pkg/apis/controller/common/v1beta1/common_types.go#L22-L39).

Here's a list of the search algorithms available in Katib. The links lead to
descriptions on this page:
Here's a list of the search algorithms available in Katib:

- [Grid search](#grid-search)
- [Random search](#random-search)
@@ -189,8 +188,9 @@ descriptions on this page:
- [Neural Architecture Search based on ENAS](#enas)
- [Differentiable Architecture Search (DARTS)](#darts)

More algorithms are under development. You can add an algorithm to Katib
yourself. Check the guide to
More algorithms are under development.

You can add an algorithm to Katib yourself. Check the guide to
[adding a new algorithm](https://github.com/kubeflow/katib/blob/master/docs/new-algorithm-service.md)
and the
[developer guide](https://github.com/kubeflow/katib/blob/master/docs/developer-guide.md).
@@ -815,6 +815,9 @@ View the results of the experiment in the Katib UI:
neural architecture search, check the
[introduction to Katib](/docs/components/katib/overview/).

- Boost your hyperparameter tuning experiment with
the [early stopping guide](/docs/components/katib/early-stopping/).

- Check the
[Katib Configuration (Katib config)](/docs/components/katib/katib-config/).

2 changes: 1 addition & 1 deletion content/en/docs/components/katib/hyperparameter.md
@@ -1,5 +1,5 @@
+++
title = "Getting started with Katib"
title = "Getting Started with Katib"
description = "How to set up Katib and perform hyperparameter tuning"
weight = 20

97 changes: 91 additions & 6 deletions content/en/docs/components/katib/katib-config.md
@@ -1,7 +1,7 @@
+++
title = "Katib Configuration Overview"
description = "How to make changes in Katib configuration"
weight = 50
weight = 70

+++

@@ -10,8 +10,17 @@ This guide describes
the Kubernetes
[Config Map](https://kubernetes.io/docs/tasks/configure-pod-container/configure-pod-configmap/) that contains information about:

1. Current [metrics collectors](/docs/components/katib/experiment/#metrics-collector) (`key = metrics-collector-sidecar`).
1. Current [algorithms](/docs/components/katib/experiment/#search-algorithms-in-detail) (suggestions) (`key = suggestion`).
1. Current
[metrics collectors](/docs/components/katib/experiment/#metrics-collector)
(`key = metrics-collector-sidecar`).

1. Current
[algorithms](/docs/components/katib/experiment/#search-algorithms-in-detail)
(suggestions) (`key = suggestion`).

1. Current
[early stopping algorithms](/docs/components/katib/early-stopping/#early-stopping-algorithms-in-detail)
(`key = early-stopping`).

The Katib Config Map must be deployed in the
[`KATIB_CORE_NAMESPACE`](/docs/components/katib/env-variables/#katib-controller)
@@ -119,16 +128,16 @@ suggestion: |-
}
```

All of these settings except **`image`** can be omitted. If you don't specify any other settings,
a default value is set automatically.
All of these settings except **`image`** can be omitted. If you don't specify
any other settings, a default value is set automatically.

1. `image` - a Docker image for the suggestion's container with a `random`
algorithm (**must be specified**).

Image example: `docker.io/kubeflowkatib/<suggestion-name>`

For each algorithm (suggestion) you can specify one of the following
suggestion names in Docker image:
suggestion names in the Docker image:

<div class="table-responsive">
<table class="table table-bordered">
@@ -216,3 +225,79 @@ a default value is set automatically.
in which case, the pod uses the
[default](https://kubernetes.io/docs/tasks/configure-pod-container/configure-service-account/#use-the-default-service-account-to-access-the-api-server)
service account.

**Note:** If you want to run your experiments with
[early stopping](/docs/components/katib/early-stopping/),
the suggestion's deployment must have permission to update the experiment's
trial status. If you don't specify a service account in the Katib config,
the Katib controller creates the required
[Kubernetes role-based access control (RBAC)](https://kubernetes.io/docs/reference/access-authn-authz/rbac)
resources for the suggestion.

If you need your own service account for the experiment's
suggestion with early stopping, follow these rules:

- The service account name can't be equal to
`<experiment-name>-<experiment-algorithm>`.

- The service account must have sufficient permissions to update
the experiment's trial status.

## Early stopping settings

These settings are related to Katib early stopping, where:

- key: `early-stopping`
- value: corresponding JSON settings for each early stopping algorithm name

If you want to use a new early stopping algorithm, you need to update the
Katib config. For example, using a `medianstop` early stopping algorithm with
all settings looks as follows:

```json
early-stopping: |-
{
"medianstop": {
"image": "docker.io/kubeflowkatib/earlystopping-medianstop",
"imagePullPolicy": "Always"
},
...
}
```

All of these settings except **`image`** can be omitted. If you don't specify
any other settings, a default value is set automatically.

1. `image` - a Docker image for the early stopping's container with a
`medianstop` algorithm (**must be specified**).

Image example: `docker.io/kubeflowkatib/<early-stopping-name>`

For each early stopping algorithm you can specify one of the following
early stopping names in the Docker image:

<div class="table-responsive">
<table class="table table-bordered">
<thead class="thead-light">
<tr>
<th>Early stopping name</th>
<th>Early stopping algorithm</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>earlystopping-medianstop</code></td>
<td><code>medianstop</code></td>
<td><a href="https://github.com/kubeflow/katib/tree/master/pkg/earlystopping/v1beta1/medianstop">Katib
Median Stopping</a> implementation</td>
</tr>
</tbody>
</table>
</div>

1. `imagePullPolicy` - an
[image pull policy](https://kubernetes.io/docs/concepts/configuration/overview/#container-images)
for the early stopping's container with a `medianstop` algorithm.

The default value is `IfNotPresent`.
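
The rules above (`image` is required, `imagePullPolicy` defaults to
`IfNotPresent`) can be checked before you apply the config map. The following
Python sketch is only an illustrative pre-check, not how the Katib controller
reads its config; the function name `load_early_stopping_config` is
hypothetical.

```python
import json

DEFAULT_IMAGE_PULL_POLICY = "IfNotPresent"


def load_early_stopping_config(raw: str) -> dict:
    """Parse the `early-stopping` JSON value and fill in documented defaults."""
    config = json.loads(raw)
    resolved = {}
    for algorithm_name, settings in config.items():
        # `image` is the only setting that must be specified.
        if not settings.get("image"):
            raise ValueError(f"'{algorithm_name}' must specify an image")
        # Explicit settings override the default pull policy.
        resolved[algorithm_name] = {
            "imagePullPolicy": DEFAULT_IMAGE_PULL_POLICY,
            **settings,
        }
    return resolved


raw_value = """
{
  "medianstop": {
    "image": "docker.io/kubeflowkatib/earlystopping-medianstop"
  }
}
"""
print(load_early_stopping_config(raw_value))
```

Here `raw_value` stands for the JSON stored under the `early-stopping` key of
the Katib config map.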