Fix legacy docs

kubeflow · Oct 9, 2021 · 835eb4d
1 parent fdc86d1
Showing 4 changed files with 70 additions and 48 deletions.
diff --git a/docs/monitoring/README.md b/docs/monitoring/README.md
@@ -1,91 +1,106 @@
-# Prometheus Monitoring for TF operator
+# Prometheus Monitoring for TFJob
 
 ## Available Metrics
 
 Currently available metrics to monitor are listed below.
 
-### Metrics for Each Component Container for TF operator
+### Metrics for Each Component Container for TFJob
 
 Component Containers:
-* tf-operator
-* tf-chief
-* tf-ps
-* tf-worker
+
+- tf-operator
+- tf-chief
+- tf-ps
+- tf-worker
 
 #### Each Container Reports on its:
 
 Use prometheus graph to run the following example commands to visualize metrics.
 
-*Note*: These metrics are derived from [cAdvisor](https://github.com/google/cadvisor) kubelet integration which reports to Prometheus through our prometheus-operator installation. You may see a complete list of metrics available in `\metrics` page of your Prometheus web UI which you can further use to compose your own queries.
+_Note_: These metrics are derived from the [cAdvisor](https://github.com/google/cadvisor) kubelet integration, which reports to Prometheus through our prometheus-operator installation. You can see a complete list of available metrics on the `/metrics` page of your Prometheus web UI, which you can use to compose your own queries.
 
 **CPU usage**
+
 ```
 sum (rate (container_cpu_usage_seconds_total{pod_name=~"tfjob-name-.*"}[1m])) by (pod_name)
 ```
 
 **GPU Usage**
+
 ```
 sum (rate (container_accelerator_memory_used_bytes{pod_name=~"tfjob-name-.*"}[1m])) by (pod_name)
 ```
 
 **Memory Usage**
+
 ```
 sum (rate (container_memory_usage_bytes{pod_name=~"tfjob-name-.*"}[1m])) by (pod_name)
 ```
 
 **Network Usage**
+
 ```
 sum (rate (container_network_transmit_bytes_total{pod_name=~"tfjob-name-.*"}[1m])) by (pod_name)
 ```
 
 **I/O Usage**
+
 ```
 sum (rate (container_fs_write_seconds_total{pod_name=~"tfjob-name-.*"}[1m])) by (pod_name)
 ```
 
-**Keep-Alive check**  
+**Keep-Alive check**
+
 ```
 up
 ```
+
 This is maintained by Prometheus on its own with its `up` metric detailed in the documentation [here](https://prometheus.io/docs/concepts/jobs_instances/#automatically-generated-labels-and-time-series).
 
 **Is Leader check**
+
 ```
 tf_operator_is_leader
 ```
 
-*Note*: Replace `tfjob-name` with your own TF Job name you want to monitor for the example queries above.
+_Note_: In the example queries above, replace `tfjob-name` with the name of the TFJob you want to monitor.
 
 ### Report TFJob metrics:
 
-*Note*: If you are using release v1 tf-operator, these TFJob metrics don't have suffix `total`. So you have to use metric name like `tf_operator_jobs_created` to get your metrics. See [PR](https://github.com/kubeflow/training-operator/pull/1055) to get more information.
+_Note_: If you are using release v1 tf-operator, these TFJob metrics don't have the suffix `total`, so you have to use a metric name like `tf_operator_jobs_created` to get your metrics. See this [PR](https://github.com/kubeflow/training-operator/pull/1055) for more information.
 
 **Job Creation**
+
 ```
 tf_operator_jobs_created_total
 ```
 
 **Job Creation**
+
 ```
 sum (rate (tf_operator_jobs_created_total[60m]))
 ```
 
 **Job Deletion**
+
 ```
 tf_operator_jobs_deleted_total
 ```
 
 **Successful Job Completions**
+
 ```
 tf_operator_jobs_successful_total
 ```
 
 **Failed Jobs**
+
 ```
 tf_operator_jobs_failed_total
 ```
 
 **Restarted Jobs**
+
 ```
 tf_operator_jobs_restarted_total
 ```
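
As a small illustration of the note about replacing `tfjob-name`, the sketch below fills a job name into the container-level queries from this README. The function name `build_queries`, the dictionary keys, and the `pod_label` parameter are assumptions made for this example and are not part of the repository. The label is a parameter because newer kubelet/cAdvisor versions may expose it as `pod` rather than `pod_name`.

```python
# Metric names copied from the README above; the dict keys are illustrative.
CONTAINER_METRICS = {
    "cpu": "container_cpu_usage_seconds_total",
    "gpu_memory": "container_accelerator_memory_used_bytes",
    "memory": "container_memory_usage_bytes",
    "network_tx": "container_network_transmit_bytes_total",
    "io_write": "container_fs_write_seconds_total",
}


def build_queries(tfjob_name, pod_label="pod_name", window="1m"):
    """Return the README's example PromQL queries for one TFJob.

    Hypothetical helper: it only formats strings and never contacts Prometheus.
    """
    return {
        key: (
            f'sum (rate ({metric}{{{pod_label}=~"{tfjob_name}-.*"}}[{window}])) '
            f"by ({pod_label})"
        )
        for key, metric in CONTAINER_METRICS.items()
    }


if __name__ == "__main__":
    for name, query in build_queries("dist-mnist").items():
        print(f"{name}: {query}")
```

The resulting strings can be pasted into the Prometheus graph page in the same way as the hand-written queries.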
diff --git a/docs/quick-start-v1.md b/docs/quick-start-v1.md
@@ -1,6 +1,7 @@
 # Testing v1
 
-Tf-operator is currently in v1. The quick start shows an example of v1 of TF operator. For more details please refer to [developer_guide.md](../developer_guide.md).
+TFJob is currently in v1. The quick start shows an example of TFJob.
+For more details please refer to [developer_guide.md](../developer_guide.md).
 
 ## Create a TFJob
 
@@ -38,12 +39,12 @@ spec:
           creationTimestamp: null
         spec:
           containers:
-          - image: kubeflow/tf-dist-mnist-test:1.0
-            name: tensorflow
-            ports:
-            - containerPort: 2222
-              name: tfjob-port
-            resources: {}
+            - image: kubeflow/tf-dist-mnist-test:1.0
+              name: tensorflow
+              ports:
+                - containerPort: 2222
+                  name: tfjob-port
+              resources: {}
     Worker:
       replicas: 4
       restartPolicy: Never
@@ -52,26 +53,26 @@ spec:
           creationTimestamp: null
         spec:
           containers:
-          - image: kubeflow/tf-dist-mnist-test:1.0
-            name: tensorflow
-            ports:
-            - containerPort: 2222
-              name: tfjob-port
-            resources: {}
+            - image: kubeflow/tf-dist-mnist-test:1.0
+              name: tensorflow
+              ports:
+                - containerPort: 2222
+                  name: tfjob-port
+              resources: {}
 status:
   conditions:
-  - lastTransitionTime: 2019-03-06T09:50:36Z
-    lastUpdateTime: 2019-03-06T09:50:36Z
-    message: TFJob dist-mnist-for-e2e-test is created.
-    reason: TFJobCreated
-    status: "True"
-    type: Created
-  - lastTransitionTime: 2019-03-06T09:50:57Z
-    lastUpdateTime: 2019-03-06T09:50:57Z
-    message: TFJob dist-mnist-for-e2e-test is running.
-    reason: TFJobRunning
-    status: "True"
-    type: Running
+    - lastTransitionTime: 2019-03-06T09:50:36Z
+      lastUpdateTime: 2019-03-06T09:50:36Z
+      message: TFJob dist-mnist-for-e2e-test is created.
+      reason: TFJobCreated
+      status: "True"
+      type: Created
+    - lastTransitionTime: 2019-03-06T09:50:57Z
+      lastUpdateTime: 2019-03-06T09:50:57Z
+      message: TFJob dist-mnist-for-e2e-test is running.
+      reason: TFJobRunning
+      status: "True"
+      type: Running
   replicaStatuses:
     PS:
       active: 2
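
To make the `status.conditions` block above concrete, here is a minimal sketch of how a client might read a TFJob's current state from it. It assumes the status has already been loaded into a plain Python dict; the helper name `current_condition` is hypothetical, and picking the latest condition with status `"True"` is one reasonable reading of the data, not necessarily how the operator itself reports state.

```python
def current_condition(status):
    """Return the type of the most recent condition whose status is "True"."""
    active = [c for c in status.get("conditions", []) if c.get("status") == "True"]
    if not active:
        return None
    # UTC timestamps in this fixed ISO 8601 format sort correctly as strings.
    return max(active, key=lambda c: c["lastTransitionTime"])["type"]


# Values taken from the example status in the quick start above.
status = {
    "conditions": [
        {"lastTransitionTime": "2019-03-06T09:50:36Z", "status": "True", "type": "Created"},
        {"lastTransitionTime": "2019-03-06T09:50:57Z", "status": "True", "type": "Running"},
    ],
    "replicaStatuses": {"PS": {"active": 2}},
}

if __name__ == "__main__":
    print(current_condition(status))  # Running
```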

diff --git a/docs/testing/e2e_testing.md b/docs/testing/e2e_testing.md
@@ -1,18 +1,18 @@
-# How to Write an E2E Test for TF Operator
+# How to Write an E2E Test for Kubeflow Training Operator
 
-The E2E tests for TF operator are implemented as Argo workflows. For more background and details
+The E2E tests for Kubeflow Training Operator are implemented as Argo workflows. For more background and details
 about Argo (not required for understanding the rest of this document), please take a look at
 [this link](https://github.com/kubeflow/testing/blob/master/README.md).
 
 Test results can be monitored at the [Prow dashboard](https://prow.k8s.io/?repo=kubeflow%2Ftraining-operator).
 
 At a high level, the E2E test suites are structured as Python test classes. Each test class contains
 one or more tests. A test typically runs the following:
-* Create a ksonnet component using a TFJob spec;
-* Creates the specified TFJob;
-* Verifies some expected results (e.g. number of pods started, job status);
-* Deletes the TFJob.
 
+- Creates a ksonnet component using a TFJob spec;
+- Creates the specified TFJob;
+- Verifies some expected results (e.g. number of pods started, job status);
+- Deletes the TFJob.
 
 ## Adding a Test Method
 
@@ -23,11 +23,12 @@ starting or deleting a TFJob), and performs verifications of expected results (e
 correct status, pods are deleted, etc).
 
 Test classes should follow this pattern:
+
 ```python
 class MyTest(test_util.TestCase):
   def __init__(self, args):
     # Initialize environment
- 
+
   def test_case_1(self):
     # Test code
 
@@ -40,17 +41,18 @@ if __name__ == "__main__"
 
 The code here ideally should only contain API calls. Any common functionalities used by the test code should
 be added to one of the helper modules:
-* k8s_util - for K8s operations like querying/deleting a pod
-* ks_util - for ksonnet operations
-* tf_job_client - for TFJob-specific operations, such as waiting for the job to be in a certain phase
+
+- k8s_util - for K8s operations like querying/deleting a pod
+- ks_util - for ksonnet operations
+- tf_job_client - for TFJob-specific operations, such as waiting for the job to be in a certain phase
 
 ## Adding a TFJob Spec
 
 This is needed if you want to use your own TFJob spec instead of an existing one. An example can be found
 [here](https://github.com/kubeflow/training-operator/tree/master/test/workflows/components/simple_tfjob_v1.jsonnet).
 All TFJob specs should be placed in the same directory.
 
-These are similar to actual TFJob specs. Note that many of these are using the 
+These are similar to actual TFJob specs. Note that many of these are using the
 [training-operator-test-server](https://github.com/kubeflow/training-operator/tree/master/test/test-server) as the test image.
 This gives us more control over when each replica exits, and allows us to send specific requests like fetching the
 runtime TensorFlow config.
@@ -64,19 +66,23 @@ New test classes should be added as Argo workflow steps to the
 [workflows.libsonnet](https://github.com/kubeflow/training-operator/blob/master/test/workflows/components/workflows.libsonnet) file.
 
 Under the templates section, add the following to the dag:
+
 ```
   {
     name: "my-test",
     template: "my-test",
     dependencies: ["setup-kubeflow"],
   },
 ```
+
 This will configure Argo to run `my-test` after setting up the Kubeflow cluster.
 
 Next, add the following lines toward the end of the file:
+
 ```
   $.parts(namespace, name, overrides).e2e(prow_env, bucket).buildTestTemplate(
          "my-test"),
 ```
+
 This assumes that there is a corresponding Python file named `my_test.py` (note the difference between dashes and
 underscores).
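
The dash/underscore convention above can be captured in a single line. The sketch below is illustrative only: `module_file_for` is a hypothetical helper, not something defined in the repository, and it assumes the Python file name is simply the template name with dashes replaced by underscores, as in the `my-test` example.

```python
def module_file_for(template_name):
    """Map an Argo template name such as "my-test" to its Python file name."""
    return template_name.replace("-", "_") + ".py"


if __name__ == "__main__":
    print(module_file_for("my-test"))  # my_test.py
```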
diff --git a/scripts/setup-tf-operator.sh b/scripts/setup-tf-operator.sh
@@ -30,11 +30,11 @@ GO_DIR=${GOPATH}/src/github.com/${REPO_OWNER}/${REPO_NAME}
 echo "Configuring kubeconfig.."
 aws eks update-kubeconfig --region=${REGION} --name=${CLUSTER_NAME}
 
-echo "Update tf operator manifest with new name $REGISTRY and tag $VERSION"
+echo "Update Training Operator manifest with new name $REGISTRY and tag $VERSION"
 cd manifests/overlays/standalone
 kustomize edit set image public.ecr.aws/j1r0q0g6/training/training-operator=${REGISTRY}:${VERSION}
 
-echo "Installing tf operator manifests"
+echo "Installing Training Operator manifests"
 kustomize build . | kubectl apply -f -
 
 TIMEOUT=30