Fix legacy docs
andreyvelich committed Oct 9, 2021
1 parent fdc86d1 commit 835eb4d
Showing 4 changed files with 70 additions and 48 deletions.
35 changes: 25 additions & 10 deletions docs/monitoring/README.md
@@ -1,91 +1,106 @@
# Prometheus Monitoring for TF operator
# Prometheus Monitoring for TFJob

## Available Metrics

Currently available metrics to monitor are listed below.

### Metrics for Each Component Container for TF operator
### Metrics for Each Component Container for TFJob

Component Containers:
* tf-operator
* tf-chief
* tf-ps
* tf-worker

- tf-operator
- tf-chief
- tf-ps
- tf-worker

#### Each Container Reports on its:

Use the Prometheus graph view to run the following example queries and visualize the metrics.

*Note*: These metrics are derived from the [cAdvisor](https://github.com/google/cadvisor) kubelet integration, which reports to Prometheus through our prometheus-operator installation. You can see the complete list of available metrics on the `/metrics` page of your Prometheus web UI and use them to compose your own queries.
_Note_: These metrics are derived from the [cAdvisor](https://github.com/google/cadvisor) kubelet integration, which reports to Prometheus through our prometheus-operator installation. You can see the complete list of available metrics on the `/metrics` page of your Prometheus web UI and use them to compose your own queries.

**CPU usage**

```
sum (rate (container_cpu_usage_seconds_total{pod_name=~"tfjob-name-.*"}[1m])) by (pod_name)
```

**GPU Usage**

```
sum (rate (container_accelerator_memory_used_bytes{pod_name=~"tfjob-name-.*"}[1m])) by (pod_name)
```

**Memory Usage**

```
sum (rate (container_memory_usage_bytes{pod_name=~"tfjob-name-.*"}[1m])) by (pod_name)
```

**Network Usage**

```
sum (rate (container_network_transmit_bytes_total{pod_name=~"tfjob-name-.*"}[1m])) by (pod_name)
```

**I/O Usage**

```
sum (rate (container_fs_write_seconds_total{pod_name=~"tfjob-name-.*"}[1m])) by (pod_name)
```

**Keep-Alive check**
**Keep-Alive check**

```
up
```

Prometheus maintains the `up` metric on its own; it is described in the documentation [here](https://prometheus.io/docs/concepts/jobs_instances/#automatically-generated-labels-and-time-series).

**Is Leader check**

```
tf_operator_is_leader
```

*Note*: In the example queries above, replace `tfjob-name` with the name of the TFJob you want to monitor.
_Note_: In the example queries above, replace `tfjob-name` with the name of the TFJob you want to monitor.
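
The substitution described in the note can be scripted. The following is a minimal sketch, assuming the `pod_name` label and the metric names used in the example queries; the helper `container_queries`, the name check, and the job name `mnist` are hypothetical and not part of the repository.

```python
import re

# Assumption for this sketch: accept only lowercase alphanumerics and dashes,
# so the name needs no escaping inside the PromQL regex matcher.
VALID_NAME = re.compile(r"[a-z0-9]([-a-z0-9]*[a-z0-9])?")

# Templates copied from the example queries; "{job}" marks the TFJob name.
QUERY_TEMPLATES = {
    "cpu": 'sum (rate (container_cpu_usage_seconds_total{{pod_name=~"{job}-.*"}}[1m])) by (pod_name)',
    "network": 'sum (rate (container_network_transmit_bytes_total{{pod_name=~"{job}-.*"}}[1m])) by (pod_name)',
    "io": 'sum (rate (container_fs_write_seconds_total{{pod_name=~"{job}-.*"}}[1m])) by (pod_name)',
}


def container_queries(job_name):
    """Return the example PromQL queries with the TFJob name filled in."""
    if not VALID_NAME.fullmatch(job_name):
        raise ValueError(f"unexpected TFJob name: {job_name!r}")
    return {key: tpl.format(job=job_name) for key, tpl in QUERY_TEMPLATES.items()}
```

Calling `container_queries("mnist")` returns a dictionary of query strings that can be pasted into the Prometheus graph view.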

### Report TFJob metrics:

*Note*: If you are using the v1 release of tf-operator, these TFJob metrics do not have the `_total` suffix, so use metric names such as `tf_operator_jobs_created` instead. See the [PR](https://github.com/kubeflow/training-operator/pull/1055) for more information.
_Note_: If you are using the v1 release of tf-operator, these TFJob metrics do not have the `_total` suffix, so use metric names such as `tf_operator_jobs_created` instead. See the [PR](https://github.com/kubeflow/training-operator/pull/1055) for more information.

**Job Creation**

```
tf_operator_jobs_created_total
```

**Job Creation Rate**

```
sum (rate (tf_operator_jobs_created_total[60m]))
```

**Job Deletion**

```
tf_operator_jobs_deleted_total
```

**Successful Job Completions**

```
tf_operator_jobs_successful_total
```

**Failed Jobs**

```
tf_operator_jobs_failed_total
```

**Restarted Jobs**

```
tf_operator_jobs_restarted_total
```
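
The naming rule from the note on release v1 can be captured in a small helper. This is only a sketch: the metric base names come from the list above, while the functions `job_metric_names` and `hourly_rate_query` are hypothetical names introduced for illustration.

```python
# Base names taken from the TFJob metrics listed above, without the suffix.
JOB_METRICS = (
    "tf_operator_jobs_created",
    "tf_operator_jobs_deleted",
    "tf_operator_jobs_successful",
    "tf_operator_jobs_failed",
    "tf_operator_jobs_restarted",
)


def job_metric_names(release_v1=False):
    """Return the metric names, dropping the `_total` suffix for release v1."""
    suffix = "" if release_v1 else "_total"
    return [base + suffix for base in JOB_METRICS]


def hourly_rate_query(metric):
    """Wrap a counter in the same sum/rate form as the 60m creation query."""
    return f"sum (rate ({metric}[60m]))"
```

For example, `hourly_rate_query(job_metric_names()[0])` reproduces the 60-minute job creation query shown above.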
51 changes: 26 additions & 25 deletions docs/quick-start-v1.md
@@ -1,6 +1,7 @@
# Testing v1

Tf-operator is currently in v1. The quick start shows an example of v1 of TF operator. For more details please refer to [developer_guide.md](../developer_guide.md).
TFJob is currently in v1. The quick start shows an example of TFJob.
For more details please refer to [developer_guide.md](../developer_guide.md).

## Create a TFJob

@@ -38,12 +39,12 @@ spec:
creationTimestamp: null
spec:
containers:
- image: kubeflow/tf-dist-mnist-test:1.0
name: tensorflow
ports:
- containerPort: 2222
name: tfjob-port
resources: {}
- image: kubeflow/tf-dist-mnist-test:1.0
name: tensorflow
ports:
- containerPort: 2222
name: tfjob-port
resources: {}
Worker:
replicas: 4
restartPolicy: Never
@@ -52,26 +53,26 @@ spec:
creationTimestamp: null
spec:
containers:
- image: kubeflow/tf-dist-mnist-test:1.0
name: tensorflow
ports:
- containerPort: 2222
name: tfjob-port
resources: {}
- image: kubeflow/tf-dist-mnist-test:1.0
name: tensorflow
ports:
- containerPort: 2222
name: tfjob-port
resources: {}
status:
conditions:
- lastTransitionTime: 2019-03-06T09:50:36Z
lastUpdateTime: 2019-03-06T09:50:36Z
message: TFJob dist-mnist-for-e2e-test is created.
reason: TFJobCreated
status: "True"
type: Created
- lastTransitionTime: 2019-03-06T09:50:57Z
lastUpdateTime: 2019-03-06T09:50:57Z
message: TFJob dist-mnist-for-e2e-test is running.
reason: TFJobRunning
status: "True"
type: Running
- lastTransitionTime: 2019-03-06T09:50:36Z
lastUpdateTime: 2019-03-06T09:50:36Z
message: TFJob dist-mnist-for-e2e-test is created.
reason: TFJobCreated
status: "True"
type: Created
- lastTransitionTime: 2019-03-06T09:50:57Z
lastUpdateTime: 2019-03-06T09:50:57Z
message: TFJob dist-mnist-for-e2e-test is running.
reason: TFJobRunning
status: "True"
type: Running
replicaStatuses:
PS:
active: 2
28 changes: 17 additions & 11 deletions docs/testing/e2e_testing.md
Original file line number Diff line number Diff line change
@@ -1,18 +1,18 @@
# How to Write an E2E Test for TF Operator
# How to Write an E2E Test for Kubeflow Training Operator

The E2E tests for TF operator are implemented as Argo workflows. For more background and details
The E2E tests for Kubeflow Training operator are implemented as Argo workflows. For more background and details
about Argo (not required for understanding the rest of this document), please take a look at
[this link](https://github.com/kubeflow/testing/blob/master/README.md).

Test results can be monitored at the [Prow dashboard](https://prow.k8s.io/?repo=kubeflow%2Ftraining-operator).

At a high level, the E2E test suites are structured as Python test classes. Each test class contains
one or more tests. A test typically runs the following:
* Creates a ksonnet component using a TFJob spec;
* Creates the specified TFJob;
* Verifies some expected results (e.g. number of pods started, job status);
* Deletes the TFJob.

- Creates a ksonnet component using a TFJob spec;
- Creates the specified TFJob;
- Verifies some expected results (e.g. number of pods started, job status);
- Deletes the TFJob.
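
The create, verify, delete flow listed above can be sketched as a self-contained test. Everything here is an assumption for illustration: `FakeTFJobClient` is a hypothetical in-memory stand-in rather than the repository's client, `unittest.TestCase` stands in for the project's own test base class, and the ksonnet step is omitted.

```python
import unittest


class FakeTFJobClient:
    """Hypothetical in-memory stand-in for a real TFJob API client."""

    def __init__(self):
        self.jobs = {}

    def create(self, name, workers):
        # Record the job as if the operator had started all of its pods.
        self.jobs[name] = {"status": "Running", "pods": workers}

    def get(self, name):
        return self.jobs.get(name)

    def delete(self, name):
        self.jobs.pop(name, None)


class SimpleTFJobTest(unittest.TestCase):
    def setUp(self):
        self.client = FakeTFJobClient()

    def test_create_verify_delete(self):
        # Create the specified TFJob.
        self.client.create("simple-tfjob", workers=4)
        # Verify expected results: pod count and job status.
        job = self.client.get("simple-tfjob")
        self.assertEqual(job["pods"], 4)
        self.assertEqual(job["status"], "Running")
        # Delete the TFJob and confirm it is gone.
        self.client.delete("simple-tfjob")
        self.assertIsNone(self.client.get("simple-tfjob"))
```

Saved as a module, the sketch can be run with `python -m unittest`.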

## Adding a Test Method

@@ -23,11 +23,12 @@ starting or deleting a TFJob), and performs verifications of expected results (e
correct status, pods are deleted, etc).

Test classes should follow this pattern:

```python
class MyTest(test_util.TestCase):
  def __init__(self, args):
    # Initialize environment

  def test_case_1(self):
    # Test code

@@ -40,17 +41,18 @@ if __name__ == "__main__"

The code here ideally should only contain API calls. Any common functionalities used by the test code should
be added to one of the helper modules:
* k8s_util - for K8s operations like querying/deleting a pod
* ks_util - for ksonnet operations
* tf_job_client - for TFJob-specific operations, such as waiting for the job to be in a certain phase

- k8s_util - for K8s operations like querying/deleting a pod
- ks_util - for ksonnet operations
- tf_job_client - for TFJob-specific operations, such as waiting for the job to be in a certain phase

## Adding a TFJob Spec

This is needed if you want to use your own TFJob spec instead of an existing one. An example can be found
[here](https://github.com/kubeflow/training-operator/tree/master/test/workflows/components/simple_tfjob_v1.jsonnet).
All TFJob specs should be placed in the same directory.

These are similar to actual TFJob specs. Note that many of these are using the
These are similar to actual TFJob specs. Note that many of these are using the
[training-operator-test-server](https://github.com/kubeflow/training-operator/tree/master/test/test-server) as the test image.
This gives us more control over when each replica exits, and allows us to send specific requests like fetching the
runtime TensorFlow config.
@@ -64,19 +66,23 @@ New test classes should be added as Argo workflow steps to the
[workflows.libsonnet](https://github.com/kubeflow/training-operator/blob/master/test/workflows/components/workflows.libsonnet) file.

Under the templates section, add the following to the dag:

```
{
  name: "my-test",
  template: "my-test",
  dependencies: ["setup-kubeflow"],
},
```

This will configure Argo to run `my-test` after setting up the Kubeflow cluster.

Next, add the following lines toward the end of the file:

```
$.parts(namespace, name, overrides).e2e(prow_env, bucket).buildTestTemplate(
  "my-test"),
```

This assumes that there is a corresponding Python file named `my_test.py` (note the difference between dashes and
underscores).
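
The dash-to-underscore convention can be expressed as a one-line mapping. This is a sketch only: `module_filename_for` and `dag_entry` are hypothetical helpers, and the dictionary merely mirrors the fields of the dag entry shown above.

```python
def module_filename_for(step_name):
    """Map an Argo step name such as "my-test" to its Python file name."""
    return step_name.replace("-", "_") + ".py"


def dag_entry(step_name, dependencies=("setup-kubeflow",)):
    """Build a dict with the same fields as the dag entry shown above."""
    return {
        "name": step_name,
        "template": step_name,
        "dependencies": list(dependencies),
    }
```

Here `module_filename_for("my-test")` yields `my_test.py`, matching the file name the workflow expects.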
4 changes: 2 additions & 2 deletions scripts/setup-tf-operator.sh
@@ -30,11 +30,11 @@ GO_DIR=${GOPATH}/src/github.com/${REPO_OWNER}/${REPO_NAME}
echo "Configuring kubeconfig.."
aws eks update-kubeconfig --region=${REGION} --name=${CLUSTER_NAME}

echo "Update tf operator manifest with new name $REGISTRY and tag $VERSION"
echo "Update Training Operator manifest with new name $REGISTRY and tag $VERSION"
cd manifests/overlays/standalone
kustomize edit set image public.ecr.aws/j1r0q0g6/training/training-operator=${REGISTRY}:${VERSION}

echo "Installing tf operator manifests"
echo "Installing Training Operator manifests"
kustomize build . | kubectl apply -f -

TIMEOUT=30