Katib example in docs is not working #1425

azarezade · 2021-01-31T10:55:07Z

/kind bug

What steps did you take and what happened:
I have a running on-prem Kubernetes cluster (two nodes) and installed Kubeflow using the kfctl_k8s_istio config. Following Getting Started with Katib, I created the TensorFlow example and went through all 3 steps. This is my tfjob-example.yaml file:

apiVersion: "kubeflow.org/v1beta1"
kind: Experiment
metadata:
  namespace: kubeflow
  name: tfjob-example
spec:
  parallelTrialCount: 3
  maxTrialCount: 12
  maxFailedTrialCount: 3
  objective:
    type: maximize
    goal: 0.99
    objectiveMetricName: accuracy_1
  algorithm:
    algorithmName: random
  metricsCollectorSpec:
    source:
      fileSystemPath:
        path: /train
        kind: Directory
    collector:
      kind: TensorFlowEvent
  parameters:
    - name: learning_rate
      parameterType: double
      feasibleSpace:
        min: "0.01"
        max: "0.05"
    - name: batch_size
      parameterType: int
      feasibleSpace:
        min: "100"
        max: "200"
  trialTemplate:
    primaryContainerName: tensorflow
    trialParameters:
      - name: learningRate
        description: Learning rate for the training model
        reference: learning_rate
      - name: batchSize
        description: Batch Size
        reference: batch_size
    trialSpec:
      apiVersion: "kubeflow.org/v1"
      kind: TFJob
      spec:
        tfReplicaSpecs:
          Worker:
            replicas: 2
            restartPolicy: OnFailure
            template:
              metadata:
                annotations:
                  sidecar.istio.io/inject: "false"
              spec:
                containers:
                  - name: tensorflow
                    image: gcr.io/kubeflow-ci/tf-mnist-with-summaries:1.0
                    imagePullPolicy: Always
                    command:
                      - "python"
                      - "/var/tf_mnist/mnist_with_summaries.py"
                      - "--log_dir=/train/metrics"
                      - "--learning_rate=${trialParameters.learningRate}"
                      - "--batch_size=${trialParameters.batchSize}"
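
For readers following the template: each `${trialParameters.<name>}` placeholder in `trialSpec` is tied through `reference` to an entry under `parameters`, and the value chosen for a trial replaces the placeholder. Below is a rough Python sketch of that mapping. `render_command` is a hypothetical helper written for illustration (it is not Katib's code), and the assignment values are made up.

```python
import re
from typing import Dict, List

# Matches placeholders such as ${trialParameters.learningRate}.
PLACEHOLDER = re.compile(r"\$\{trialParameters\.(\w+)\}")


def render_command(command: List[str],
                   trial_parameters: List[Dict[str, str]],
                   assignments: Dict[str, str]) -> List[str]:
    """Replace each placeholder with the value assigned to its referenced parameter."""
    # trialParameters[].name -> parameters[].name (via `reference`)
    name_to_reference = {p["name"]: p["reference"] for p in trial_parameters}

    def substitute(match):
        return assignments[name_to_reference[match.group(1)]]

    return [PLACEHOLDER.sub(substitute, arg) for arg in command]


# Copied from the manifest above.
TRIAL_PARAMETERS = [
    {"name": "learningRate", "reference": "learning_rate"},
    {"name": "batchSize", "reference": "batch_size"},
]
COMMAND = [
    "python",
    "/var/tf_mnist/mnist_with_summaries.py",
    "--log_dir=/train/metrics",
    "--learning_rate=${trialParameters.learningRate}",
    "--batch_size=${trialParameters.batchSize}",
]
# Made-up values inside the feasibleSpace ranges, for illustration only.
EXAMPLE_ASSIGNMENTS = {"learning_rate": "0.03", "batch_size": "128"}

print(render_command(COMMAND, TRIAL_PARAMETERS, EXAMPLE_ASSIGNMENTS))
```

The sketch only illustrates the name-to-reference indirection; it says nothing about how the random algorithm picks the values.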

What did you expect to happen:
I expected to see the graphs and results of the experiments in Katib, but all experiments remained in the Running status, although the logs of the experiment containers show that they are Completed.

Anything else you would like to add:
It seems the observation_logs table is empty:

$ kubectl -n kubeflow exec -it katib-mysql-5df4dddc57-jzdqs -- bash

root@katib-mysql-5df4dddc57-jzdqs:/# mysql -D ${MYSQL_DATABASE} -u root -p${MYSQL_ROOT_PASSWORD} -e 'show tables;'
mysql: [Warning] Using a password on the command line interface can be insecure.
+------------------+
| Tables_in_katib  |
+------------------+
| observation_logs |
+------------------+

root@katib-mysql-5df4dddc57-jzdqs:/# mysql -D ${MYSQL_DATABASE} -u root -p${MYSQL_ROOT_PASSWORD} 
mysql> select * from observation_logs;
Empty set (0.00 sec)

But I don't know why this happened or how to trace it. Everything else seems to be all right.
Some other logs and debugging output that I collected:

$ kubectl get pods --all-namespaces | grep tfj
kubeflow               tfjob-example-9sxb2jtg-worker-0                              0/1     Completed   0          58m
kubeflow               tfjob-example-9sxb2jtg-worker-1                              0/1     Completed   0          58m
kubeflow               tfjob-example-jtf9d96w-worker-0                              0/1     Completed   0          58m
kubeflow               tfjob-example-jtf9d96w-worker-1                              0/1     Completed   0          58m
kubeflow               tfjob-example-random-585dfc8499-r9g4x                        1/1     Running     0          58m
kubeflow               tfjob-example-twd8tsdk-worker-0                              0/1     Completed   0          58m
kubeflow               tfjob-example-twd8tsdk-worker-1                              0/1     Completed   0          58m

$ kubectl -n kubeflow get experiments
NAME            TYPE      STATUS   AGE
tfjob-example   Running   True     60m

$ kubectl -n kubeflow get trials
NAME                     TYPE      STATUS   AGE
tfjob-example-9sxb2jtg   Running   True     60m
tfjob-example-jtf9d96w   Running   True     60m
tfjob-example-twd8tsdk   Running   True     60m

$ kubectl -n kubeflow logs tfjob-example-9sxb2jtg-worker-0 --all-containers --tail=10
Accuracy at step 910: 0.9444
Accuracy at step 920: 0.9405
Accuracy at step 930: 0.9443
Accuracy at step 940: 0.9459
Accuracy at step 950: 0.9462
Accuracy at step 960: 0.9373
Accuracy at step 970: 0.9404
Accuracy at step 980: 0.945
Accuracy at step 990: 0.9485
Adding run metadata for 999

$ kubectl -n kubeflow logs -f katib-db-manager-59445ff6cb-wkcdp --all-containers
I0125 14:10:19.491012       1 init.go:11] Initializing v1beta1 DB schema
I0125 14:10:19.776431       1 main.go:92] Start Katib manager: 0.0.0.0:6789

$ kubectl -n kubeflow logs katib-controller-545bdfdb46-k6mlr --all-containers --tail=10
{"level":"info","ts":1612085695.0365138,"logger":"trial-controller","msg":"Trial status changed to Running","Trial":"kubeflow/tfjob-example-twd8tsdk"}
2021/01/31 09:34:55 http: TLS handshake error from 10.244.0.0:35509: remote error: tls: bad certificate
2021/01/31 09:34:55 http: TLS handshake error from 10.244.0.0:32642: remote error: tls: bad certificate
{"level":"info","ts":1612085695.1833804,"logger":"trial-controller","msg":"Creating Job","Trial":"kubeflow/tfjob-example-9sxb2jtg","kind":"TFJob","name":"tfjob-example-9sxb2jtg"}
{"level":"info","ts":1612085695.2675023,"logger":"trial-controller","msg":"Trial status changed to Running","Trial":"kubeflow/tfjob-example-9sxb2jtg"}
2021/01/31 09:34:56 http: TLS handshake error from 10.244.0.0:7037: remote error: tls: bad certificate
2021/01/31 09:34:56 http: TLS handshake error from 10.244.0.0:12279: remote error: tls: bad certificate
{"level":"info","ts":1612086144.2860768,"logger":"suggestion-controller","msg":"Sync assignments","Suggestion":"kubeflow/tfjob-example","Suggestion Requests":3,"Suggestion Count":3}
{"level":"info","ts":1612086144.2967129,"logger":"suggestion-controller","msg":"Sync assignments","Suggestion":"kubeflow/tfjob-example","Suggestion Requests":3,"Suggestion Count":3}
{"level":"info","ts":1612086144.3100634,"logger":"suggestion-controller","msg":"Sync assignments","Suggestion":"kubeflow/tfjob-example","Suggestion Requests":3,"Suggestion Count":3}

$ kubectl -n kubeflow get experiment tfjob-example -o yaml
apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"kubeflow.org/v1beta1","kind":"Experiment","metadata":{"annotations":{},"name":"tfjob-example","namespace":"kubeflow"},"spec":{"algorithm":{"algorithmName":"random"},"maxFailedTrialCount":3,"maxTrialCount":12,"metricsCollectorSpec":{"collector":{"kind":"TensorFlowEvent"},"source":{"fileSystemPath":{"kind":"Directory","path":"/train"}}},"objective":{"goal":0.99,"objectiveMetricName":"accuracy_1","type":"maximize"},"parallelTrialCount":3,"parameters":[{"feasibleSpace":{"max":"0.05","min":"0.01"},"name":"learning_rate","parameterType":"double"},{"feasibleSpace":{"max":"200","min":"100"},"name":"batch_size","parameterType":"int"}],"trialTemplate":{"primaryContainerName":"tensorflow","trialParameters":[{"description":"Learning rate for the training model","name":"learningRate","reference":"learning_rate"},{"description":"Batch Size","name":"batchSize","reference":"batch_size"}],"trialSpec":{"apiVersion":"kubeflow.org/v1","kind":"TFJob","spec":{"tfReplicaSpecs":{"Worker":{"replicas":2,"restartPolicy":"OnFailure","template":{"metadata":{"annotations":{"sidecar.istio.io/inject":"false"}},"spec":{"containers":[{"command":["python","/var/tf_mnist/mnist_with_summaries.py","--log_dir=/train/metrics","--learning_rate=${trialParameters.learningRate}","--batch_size=${trialParameters.batchSize}"],"image":"gcr.io/kubeflow-ci/tf-mnist-with-summaries:1.0","imagePullPolicy":"Always","name":"tensorflow"}]}}}}}}}}}
  creationTimestamp: "2021-01-31T09:34:38Z"
  finalizers:
  - update-prometheus-metrics
  generation: 1
  managedFields:
  - apiVersion: kubeflow.org/v1beta1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .: {}
          f:kubectl.kubernetes.io/last-applied-configuration: {}
      f:spec:
        .: {}
        f:algorithm:
          .: {}
          f:algorithmName: {}
        f:maxFailedTrialCount: {}
        f:maxTrialCount: {}
        f:metricsCollectorSpec:
          .: {}
          f:collector:
            .: {}
            f:kind: {}
          f:source:
            .: {}
            f:fileSystemPath:
              .: {}
              f:kind: {}
              f:path: {}
        f:objective:
          .: {}
          f:goal: {}
          f:objectiveMetricName: {}
          f:type: {}
        f:parallelTrialCount: {}
        f:parameters: {}
        f:trialTemplate:
          .: {}
          f:primaryContainerName: {}
          f:trialParameters: {}
          f:trialSpec:
            .: {}
            f:apiVersion: {}
            f:kind: {}
            f:spec:
              .: {}
              f:tfReplicaSpecs:
                .: {}
                f:Worker:
                  .: {}
                  f:replicas: {}
                  f:restartPolicy: {}
                  f:template:
                    .: {}
                    f:metadata:
                      .: {}
                      f:annotations:
                        .: {}
                        f:sidecar.istio.io/inject: {}
                    f:spec:
                      .: {}
                      f:containers: {}
    manager: kubectl-client-side-apply
    operation: Update
    time: "2021-01-31T09:34:38Z"
  - apiVersion: kubeflow.org/v1beta1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:finalizers: {}
      f:status:
        .: {}
        f:conditions: {}
        f:currentOptimalTrial:
          .: {}
          f:bestTrialName: {}
          f:observation:
            .: {}
            f:metrics: {}
          f:parameterAssignments: {}
        f:runningTrialList: {}
        f:startTime: {}
        f:trials: {}
        f:trialsRunning: {}
    manager: katib-controller
    operation: Update
    time: "2021-01-31T09:34:55Z"
  name: tfjob-example
  namespace: kubeflow
  resourceVersion: "5381129"
  uid: e6aedc20-d3ed-4829-ba49-c2a957427249
spec:
  algorithm:
    algorithmName: random
  maxFailedTrialCount: 3
  maxTrialCount: 12
  metricsCollectorSpec:
    collector:
      kind: TensorFlowEvent
    source:
      fileSystemPath:
        kind: Directory
        path: /train
  objective:
    goal: 0.99
    objectiveMetricName: accuracy_1
    type: maximize
  parallelTrialCount: 3
  parameters:
  - feasibleSpace:
      max: "0.05"
      min: "0.01"
    name: learning_rate
    parameterType: double
  - feasibleSpace:
      max: "200"
      min: "100"
    name: batch_size
    parameterType: int
  trialTemplate:
    primaryContainerName: tensorflow
    trialParameters:
    - description: Learning rate for the training model
      name: learningRate
      reference: learning_rate
    - description: Batch Size
      name: batchSize
      reference: batch_size
    trialSpec:
      apiVersion: kubeflow.org/v1
      kind: TFJob
      spec:
        tfReplicaSpecs:
          Worker:
            replicas: 2
            restartPolicy: OnFailure
            template:
              metadata:
                annotations:
                  sidecar.istio.io/inject: "false"
              spec:
                containers:
                - command:
                  - python
                  - /var/tf_mnist/mnist_with_summaries.py
                  - --log_dir=/train/metrics
                  - --learning_rate=${trialParameters.learningRate}
                  - --batch_size=${trialParameters.batchSize}
                  image: gcr.io/kubeflow-ci/tf-mnist-with-summaries:1.0
                  imagePullPolicy: Always
                  name: tensorflow
status:
  conditions:
  - lastTransitionTime: "2021-01-31T09:34:38Z"
    lastUpdateTime: "2021-01-31T09:34:38Z"
    message: Experiment is created
    reason: ExperimentCreated
    status: "True"
    type: Created
  - lastTransitionTime: "2021-01-31T09:34:54Z"
    lastUpdateTime: "2021-01-31T09:34:54Z"
    message: Experiment is running
    reason: ExperimentRunning
    status: "True"
    type: Running
  currentOptimalTrial:
    bestTrialName: ""
    observation:
      metrics: null
    parameterAssignments: null
  runningTrialList:
  - tfjob-example-9sxb2jtg
  - tfjob-example-jtf9d96w
  - tfjob-example-twd8tsdk
  startTime: "2021-01-31T09:34:38Z"
  trials: 3
  trialsRunning: 3

Environment:

Kubeflow version: kfctl v1.2.0-0-gbc038f9
Kubernetes version:

Client Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.1", GitCommit:"c4d752765b3bbac2237bf87cf0b1c2e307844666", GitTreeState:"clean", BuildDate:"2020-12-18T12:09:25Z", GoVersion:"go1.15.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.2", GitCommit:"faecb196815e248d3ecfb03c680a4507229c2a56", GitTreeState:"clean", BuildDate:"2021-01-13T13:20:00Z", GoVersion:"go1.15.5", Compiler:"gc", Platform:"linux/amd64"}

OS: Ubuntu 20.04.1 LTS

Gorosia · 2021-02-03T07:04:13Z

Hello, azarezade!

Have you solved this problem?

I have the same error. 😢

azarezade · 2021-02-03T09:03:53Z

Hi @Gorosia, no success yet. Do you have Kubernetes on an on-premise cluster or a single-node machine? I suspect the issue may be related to the connection between pods, since I have a two-node cluster and the pods that run my experiments are on a different node from the one where the katib-controller pod runs.

gaocegege · 2021-02-03T09:14:11Z

cc @johnugeorge @andreyvelich

Gorosia · 2021-02-03T10:50:27Z

@azarezade
Thank you for the reply.
I have Kubernetes on an 'on-premise' single-node machine.

josepholaide · 2021-02-06T06:42:00Z

@azarezade I am trying to run the example from the official Katib documentation using Kubeflow deployed through microk8s, and I am getting the error below.
I have tried "kubeflow.org/v1", which still gives the same error. When I try "kubeflow.org/v1alpha3", it creates an experiment, but the experiment doesn't run, nothing shows in the Katib UI, and no trials are generated.

error: unable to recognize "random-example.yaml": no matches for kind "Experiment" in version "kubeflow.org/v1beta1"

azarezade · 2021-02-06T06:45:19Z

@Josepholaidepetro I think you should try kubeflow.org/v1beta1. I mean, the first line of your_example_experiment.yaml should be apiVersion: "kubeflow.org/v1beta1". For me, it runs the experiment, but I still have the aforementioned issue.

josepholaide · 2021-02-06T06:47:30Z

@azarezade That's what I did, and I still got the error.

azarezade · 2021-02-06T07:13:22Z

I think you may need to open a new issue, unless you get the same results from the debugging commands that I posted in my first message, such as kubectl -n kubeflow get experiments, kubectl -n kubeflow get trials, and so on.

josepholaide · 2021-02-06T07:40:38Z

@azarezade The experiment is created but nothing is running

andreyvelich · 2021-02-08T15:36:50Z

Thank you for creating this @azarezade.

It seems that you are creating the Experiment in the kubeflow namespace, where the webhook doesn't work, so metrics are not collected.
If you deploy Katib as part of Kubeflow, you should create Experiments only in your Profile namespace.
Please check the tutorial here.
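
One way to cross-check this against the `kubectl get pods` output in the first message: every worker pod there reports `0/1`, that is, a single container. Assuming Katib's webhook would inject the metrics collector as an extra container into at least one pod of each trial (an assumption about this setup, not something confirmed in this thread), a trial whose pods all show one container has no collector attached. Below is a minimal Python sketch. `trials_without_sidecar` is an illustrative helper, and the `-worker-N` naming pattern is taken only from the pod names above.

```python
import re
from collections import defaultdict
from typing import List

# Worker pods of a trial are named "<trial>-worker-<index>" in the output above.
WORKER_POD = re.compile(r"^(?P<trial>.+)-worker-\d+$")


def trials_without_sidecar(pods_output: str) -> List[str]:
    """Return trials for which no worker pod lists more than one container."""
    max_containers = defaultdict(int)
    for line in pods_output.splitlines():
        fields = line.split()
        # Columns with --all-namespaces: NAMESPACE NAME READY STATUS RESTARTS AGE
        if len(fields) < 3:
            continue
        name, ready = fields[1], fields[2]
        match = WORKER_POD.match(name)
        if not match or "/" not in ready:
            continue  # header line, suggestion pod, or unrelated pod
        total = int(ready.split("/")[1])  # READY is "<ready>/<total containers>"
        trial = match.group("trial")
        max_containers[trial] = max(max_containers[trial], total)
    return sorted(t for t, n in max_containers.items() if n < 2)


# Pod listing copied from the first message.
SAMPLE_PODS = """\
kubeflow  tfjob-example-9sxb2jtg-worker-0        0/1  Completed  0  58m
kubeflow  tfjob-example-9sxb2jtg-worker-1        0/1  Completed  0  58m
kubeflow  tfjob-example-jtf9d96w-worker-0        0/1  Completed  0  58m
kubeflow  tfjob-example-jtf9d96w-worker-1        0/1  Completed  0  58m
kubeflow  tfjob-example-random-585dfc8499-r9g4x  1/1  Running    0  58m
kubeflow  tfjob-example-twd8tsdk-worker-0        0/1  Completed  0  58m
kubeflow  tfjob-example-twd8tsdk-worker-1        0/1  Completed  0  58m
"""

print(trials_without_sidecar(SAMPLE_PODS))
```

With the listing from the first message, all three trials are reported, which is consistent with metrics never reaching observation_logs. The check is only a heuristic over text output and does not replace inspecting the pod spec itself.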

azarezade · 2021-02-09T04:42:07Z

Thanks @andreyvelich for the reply. I also tried to create the experiment in the namespace that I set when logging in to the Kubeflow dashboard for the first time, but it returned an error:

Error from server (InternalError): error when creating "tfjob-example.yaml": Internal error occurred: failed calling webhook "mutating.experiment.katib.kubeflow.org": Post "https://katib-controller.kubeflow.svc:443/mutate-experiments?timeout=30s": x509: certificate relies on legacy Common Name field, use SANs or temporarily enable Common Name matching with GODEBUG=x509ignoreCN=0

Gorosia · 2021-02-09T06:47:57Z

@azarezade
Check here 👍
I fixed the same error as yours.

azarezade · 2021-02-09T06:52:17Z

Thanks @Gorosia. So I'm closing this issue.

k8s-ci-robot added the kind/bug label Jan 31, 2021

azarezade closed this as completed Feb 9, 2021
