Katib example in docs is not working #1425

Closed
azarezade opened this issue Jan 31, 2021 · 13 comments

azarezade commented Jan 31, 2021

/kind bug

What steps did you take and what happened:
I have a running Kubernetes cluster (two nodes, on-prem) and installed Kubeflow using the kfctl_k8s_istio config. Following Getting Started with Katib, I created the TensorFlow example and went through all 3 steps. This is my tfjob-example.yaml file:

apiVersion: "kubeflow.org/v1beta1"
kind: Experiment
metadata:
  namespace: kubeflow
  name: tfjob-example
spec:
  parallelTrialCount: 3
  maxTrialCount: 12
  maxFailedTrialCount: 3
  objective:
    type: maximize
    goal: 0.99
    objectiveMetricName: accuracy_1
  algorithm:
    algorithmName: random
  metricsCollectorSpec:
    source:
      fileSystemPath:
        path: /train
        kind: Directory
    collector:
      kind: TensorFlowEvent
  parameters:
    - name: learning_rate
      parameterType: double
      feasibleSpace:
        min: "0.01"
        max: "0.05"
    - name: batch_size
      parameterType: int
      feasibleSpace:
        min: "100"
        max: "200"
  trialTemplate:
    primaryContainerName: tensorflow
    trialParameters:
      - name: learningRate
        description: Learning rate for the training model
        reference: learning_rate
      - name: batchSize
        description: Batch Size
        reference: batch_size
    trialSpec:
      apiVersion: "kubeflow.org/v1"
      kind: TFJob
      spec:
        tfReplicaSpecs:
          Worker:
            replicas: 2
            restartPolicy: OnFailure
            template:
              metadata:
                annotations:
                  sidecar.istio.io/inject: "false"
              spec:
                containers:
                  - name: tensorflow
                    image: gcr.io/kubeflow-ci/tf-mnist-with-summaries:1.0
                    imagePullPolicy: Always
                    command:
                      - "python"
                      - "/var/tf_mnist/mnist_with_summaries.py"
                      - "--log_dir=/train/metrics"
                      - "--learning_rate=${trialParameters.learningRate}"
                      - "--batch_size=${trialParameters.batchSize}"
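
For readers unfamiliar with the trial template above: each `${trialParameters.<name>}` placeholder is expected to be replaced by the value assigned to the search-space parameter named in the matching `reference` field. The following is a minimal Python sketch of that substitution, not Katib's actual implementation; the function name `render_command` and the sample assignment values are hypothetical.

```python
import re

# Matches placeholders of the form ${trialParameters.<name>}.
PLACEHOLDER = re.compile(r"\$\{trialParameters\.([A-Za-z0-9_]+)\}")


def render_command(command, trial_parameters, assignments):
    """Substitute ${trialParameters.<name>} placeholders in a command list.

    trial_parameters maps a placeholder name to the search-space parameter it
    references (the `reference` field); assignments maps search-space
    parameter names to the values chosen for one trial.
    """
    def substitute(match):
        reference = trial_parameters[match.group(1)]
        return str(assignments[reference])

    return [PLACEHOLDER.sub(substitute, arg) for arg in command]


# Command and trialParameters copied from the manifest above.
command = [
    "python",
    "/var/tf_mnist/mnist_with_summaries.py",
    "--log_dir=/train/metrics",
    "--learning_rate=${trialParameters.learningRate}",
    "--batch_size=${trialParameters.batchSize}",
]
trial_parameters = {"learningRate": "learning_rate", "batchSize": "batch_size"}

# Hypothetical values inside the feasible space; a real trial gets its own.
assignments = {"learning_rate": "0.03", "batch_size": "150"}

rendered = render_command(command, trial_parameters, assignments)
print(rendered[-2:])
```

Arguments without a placeholder pass through unchanged, so only the last two entries of the command differ between trials.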

What did you expect to happen:
I expected to see the graphs and results of the experiments in Katib, but all experiments remained in the Running status, although the logs of the experiment containers show that they are Completed.

Anything else you would like to add:
It seems the observation_logs table is empty:

$ kubectl -n kubeflow exec -it katib-mysql-5df4dddc57-jzdqs -- bash

root@katib-mysql-5df4dddc57-jzdqs:/# mysql -D ${MYSQL_DATABASE} -u root -p${MYSQL_ROOT_PASSWORD} -e 'show tables;'
mysql: [Warning] Using a password on the command line interface can be insecure.
+------------------+
| Tables_in_katib  |
+------------------+
| observation_logs |
+------------------+

root@katib-mysql-5df4dddc57-jzdqs:/# mysql -D ${MYSQL_DATABASE} -u root -p${MYSQL_ROOT_PASSWORD} 
mysql> select * from observation_logs;
Empty set (0.00 sec) 

But I don't know why it happened or how to trace it. Everything else seems to be all right.
Some other logs and debugging that I tried:

$ kubectl get pods --all-namespaces | grep tfj
kubeflow               tfjob-example-9sxb2jtg-worker-0                              0/1     Completed   0          58m
kubeflow               tfjob-example-9sxb2jtg-worker-1                              0/1     Completed   0          58m
kubeflow               tfjob-example-jtf9d96w-worker-0                              0/1     Completed   0          58m
kubeflow               tfjob-example-jtf9d96w-worker-1                              0/1     Completed   0          58m
kubeflow               tfjob-example-random-585dfc8499-r9g4x                        1/1     Running     0          58m
kubeflow               tfjob-example-twd8tsdk-worker-0                              0/1     Completed   0          58m
kubeflow               tfjob-example-twd8tsdk-worker-1                              0/1     Completed   0          58m
$ kubectl -n kubeflow get experiments
NAME            TYPE      STATUS   AGE
tfjob-example   Running   True     60m
$ kubectl -n kubeflow get trials
NAME                     TYPE      STATUS   AGE
tfjob-example-9sxb2jtg   Running   True     60m
tfjob-example-jtf9d96w   Running   True     60m
tfjob-example-twd8tsdk   Running   True     60m
$ kubectl -n kubeflow logs tfjob-example-9sxb2jtg-worker-0 --all-containers --tail=10
Accuracy at step 910: 0.9444
Accuracy at step 920: 0.9405
Accuracy at step 930: 0.9443
Accuracy at step 940: 0.9459
Accuracy at step 950: 0.9462
Accuracy at step 960: 0.9373
Accuracy at step 970: 0.9404
Accuracy at step 980: 0.945
Accuracy at step 990: 0.9485
Adding run metadata for 999
$ kubectl -n kubeflow logs -f katib-db-manager-59445ff6cb-wkcdp --all-containers
I0125 14:10:19.491012       1 init.go:11] Initializing v1beta1 DB schema
I0125 14:10:19.776431       1 main.go:92] Start Katib manager: 0.0.0.0:6789
$ kubectl -n kubeflow logs katib-controller-545bdfdb46-k6mlr --all-containers --tail=10
{"level":"info","ts":1612085695.0365138,"logger":"trial-controller","msg":"Trial status changed to Running","Trial":"kubeflow/tfjob-example-twd8tsdk"}
2021/01/31 09:34:55 http: TLS handshake error from 10.244.0.0:35509: remote error: tls: bad certificate
2021/01/31 09:34:55 http: TLS handshake error from 10.244.0.0:32642: remote error: tls: bad certificate
{"level":"info","ts":1612085695.1833804,"logger":"trial-controller","msg":"Creating Job","Trial":"kubeflow/tfjob-example-9sxb2jtg","kind":"TFJob","name":"tfjob-example-9sxb2jtg"}
{"level":"info","ts":1612085695.2675023,"logger":"trial-controller","msg":"Trial status changed to Running","Trial":"kubeflow/tfjob-example-9sxb2jtg"}
2021/01/31 09:34:56 http: TLS handshake error from 10.244.0.0:7037: remote error: tls: bad certificate
2021/01/31 09:34:56 http: TLS handshake error from 10.244.0.0:12279: remote error: tls: bad certificate
{"level":"info","ts":1612086144.2860768,"logger":"suggestion-controller","msg":"Sync assignments","Suggestion":"kubeflow/tfjob-example","Suggestion Requests":3,"Suggestion Count":3}
{"level":"info","ts":1612086144.2967129,"logger":"suggestion-controller","msg":"Sync assignments","Suggestion":"kubeflow/tfjob-example","Suggestion Requests":3,"Suggestion Count":3}
{"level":"info","ts":1612086144.3100634,"logger":"suggestion-controller","msg":"Sync assignments","Suggestion":"kubeflow/tfjob-example","Suggestion Requests":3,"Suggestion Count":3}
$ kubectl -n kubeflow get experiment tfjob-example -o yaml
apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"kubeflow.org/v1beta1","kind":"Experiment","metadata":{"annotations":{},"name":"tfjob-example","namespace":"kubeflow"},"spec":{"algorithm":{"algorithmName":"random"},"maxFailedTrialCount":3,"maxTrialCount":12,"metricsCollectorSpec":{"collector":{"kind":"TensorFlowEvent"},"source":{"fileSystemPath":{"kind":"Directory","path":"/train"}}},"objective":{"goal":0.99,"objectiveMetricName":"accuracy_1","type":"maximize"},"parallelTrialCount":3,"parameters":[{"feasibleSpace":{"max":"0.05","min":"0.01"},"name":"learning_rate","parameterType":"double"},{"feasibleSpace":{"max":"200","min":"100"},"name":"batch_size","parameterType":"int"}],"trialTemplate":{"primaryContainerName":"tensorflow","trialParameters":[{"description":"Learning rate for the training model","name":"learningRate","reference":"learning_rate"},{"description":"Batch Size","name":"batchSize","reference":"batch_size"}],"trialSpec":{"apiVersion":"kubeflow.org/v1","kind":"TFJob","spec":{"tfReplicaSpecs":{"Worker":{"replicas":2,"restartPolicy":"OnFailure","template":{"metadata":{"annotations":{"sidecar.istio.io/inject":"false"}},"spec":{"containers":[{"command":["python","/var/tf_mnist/mnist_with_summaries.py","--log_dir=/train/metrics","--learning_rate=${trialParameters.learningRate}","--batch_size=${trialParameters.batchSize}"],"image":"gcr.io/kubeflow-ci/tf-mnist-with-summaries:1.0","imagePullPolicy":"Always","name":"tensorflow"}]}}}}}}}}}
  creationTimestamp: "2021-01-31T09:34:38Z"
  finalizers:
  - update-prometheus-metrics
  generation: 1
  managedFields:
  - apiVersion: kubeflow.org/v1beta1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .: {}
          f:kubectl.kubernetes.io/last-applied-configuration: {}
      f:spec:
        .: {}
        f:algorithm:
          .: {}
          f:algorithmName: {}
        f:maxFailedTrialCount: {}
        f:maxTrialCount: {}
        f:metricsCollectorSpec:
          .: {}
          f:collector:
            .: {}
            f:kind: {}
          f:source:
            .: {}
            f:fileSystemPath:
              .: {}
              f:kind: {}
              f:path: {}
        f:objective:
          .: {}
          f:goal: {}
          f:objectiveMetricName: {}
          f:type: {}
        f:parallelTrialCount: {}
        f:parameters: {}
        f:trialTemplate:
          .: {}
          f:primaryContainerName: {}
          f:trialParameters: {}
          f:trialSpec:
            .: {}
            f:apiVersion: {}
            f:kind: {}
            f:spec:
              .: {}
              f:tfReplicaSpecs:
                .: {}
                f:Worker:
                  .: {}
                  f:replicas: {}
                  f:restartPolicy: {}
                  f:template:
                    .: {}
                    f:metadata:
                      .: {}
                      f:annotations:
                        .: {}
                        f:sidecar.istio.io/inject: {}
                    f:spec:
                      .: {}
                      f:containers: {}
    manager: kubectl-client-side-apply
    operation: Update
    time: "2021-01-31T09:34:38Z"
  - apiVersion: kubeflow.org/v1beta1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:finalizers: {}
      f:status:
        .: {}
        f:conditions: {}
        f:currentOptimalTrial:
          .: {}
          f:bestTrialName: {}
          f:observation:
            .: {}
            f:metrics: {}
          f:parameterAssignments: {}
        f:runningTrialList: {}
        f:startTime: {}
        f:trials: {}
        f:trialsRunning: {}
    manager: katib-controller
    operation: Update
    time: "2021-01-31T09:34:55Z"
  name: tfjob-example
  namespace: kubeflow
  resourceVersion: "5381129"
  uid: e6aedc20-d3ed-4829-ba49-c2a957427249
spec:
  algorithm:
    algorithmName: random
  maxFailedTrialCount: 3
  maxTrialCount: 12
  metricsCollectorSpec:
    collector:
      kind: TensorFlowEvent
    source:
      fileSystemPath:
        kind: Directory
        path: /train
  objective:
    goal: 0.99
    objectiveMetricName: accuracy_1
    type: maximize
  parallelTrialCount: 3
  parameters:
  - feasibleSpace:
      max: "0.05"
      min: "0.01"
    name: learning_rate
    parameterType: double
  - feasibleSpace:
      max: "200"
      min: "100"
    name: batch_size
    parameterType: int
  trialTemplate:
    primaryContainerName: tensorflow
    trialParameters:
    - description: Learning rate for the training model
      name: learningRate
      reference: learning_rate
    - description: Batch Size
      name: batchSize
      reference: batch_size
    trialSpec:
      apiVersion: kubeflow.org/v1
      kind: TFJob
      spec:
        tfReplicaSpecs:
          Worker:
            replicas: 2
            restartPolicy: OnFailure
            template:
              metadata:
                annotations:
                  sidecar.istio.io/inject: "false"
              spec:
                containers:
                - command:
                  - python
                  - /var/tf_mnist/mnist_with_summaries.py
                  - --log_dir=/train/metrics
                  - --learning_rate=${trialParameters.learningRate}
                  - --batch_size=${trialParameters.batchSize}
                  image: gcr.io/kubeflow-ci/tf-mnist-with-summaries:1.0
                  imagePullPolicy: Always
                  name: tensorflow
status:
  conditions:
  - lastTransitionTime: "2021-01-31T09:34:38Z"
    lastUpdateTime: "2021-01-31T09:34:38Z"
    message: Experiment is created
    reason: ExperimentCreated
    status: "True"
    type: Created
  - lastTransitionTime: "2021-01-31T09:34:54Z"
    lastUpdateTime: "2021-01-31T09:34:54Z"
    message: Experiment is running
    reason: ExperimentRunning
    status: "True"
    type: Running
  currentOptimalTrial:
    bestTrialName: ""
    observation:
      metrics: null
    parameterAssignments: null
  runningTrialList:
  - tfjob-example-9sxb2jtg
  - tfjob-example-jtf9d96w
  - tfjob-example-twd8tsdk
  startTime: "2021-01-31T09:34:38Z"
  trials: 3
  trialsRunning: 3
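
The outputs above show the mismatch at the heart of this report: every worker pod is Completed, yet every trial is still Running. The following Python sketch flags such trials from the text of `kubectl get pods` and `kubectl get trials`. The function names are hypothetical, and the parsing assumes the column layout shown in this issue (namespace column present in the pod listing, header row present in the trial listing).

```python
import re
from collections import defaultdict

# Worker pods are named <trial>-worker-<index> in the listing above.
WORKER_POD = re.compile(r"^(?P<trial>.+)-worker-\d+$")

PODS = """\
kubeflow  tfjob-example-9sxb2jtg-worker-0        0/1  Completed  0  58m
kubeflow  tfjob-example-9sxb2jtg-worker-1        0/1  Completed  0  58m
kubeflow  tfjob-example-jtf9d96w-worker-0        0/1  Completed  0  58m
kubeflow  tfjob-example-jtf9d96w-worker-1        0/1  Completed  0  58m
kubeflow  tfjob-example-random-585dfc8499-r9g4x  1/1  Running    0  58m
kubeflow  tfjob-example-twd8tsdk-worker-0        0/1  Completed  0  58m
kubeflow  tfjob-example-twd8tsdk-worker-1        0/1  Completed  0  58m
"""

TRIALS = """\
NAME                     TYPE      STATUS   AGE
tfjob-example-9sxb2jtg   Running   True     60m
tfjob-example-jtf9d96w   Running   True     60m
tfjob-example-twd8tsdk   Running   True     60m
"""


def pod_statuses_by_trial(pods_output):
    """Group worker pod statuses by the trial name embedded in the pod name."""
    statuses = defaultdict(list)
    for line in pods_output.strip().splitlines():
        columns = line.split()
        if len(columns) < 4:
            continue
        name, status = columns[1], columns[3]
        match = WORKER_POD.match(name)
        if match:  # skips the suggestion pod, which is not a worker
            statuses[match.group("trial")].append(status)
    return statuses


def stuck_trials(trials_output, pods_output):
    """Return trials reported as Running whose worker pods are all Completed."""
    statuses = pod_statuses_by_trial(pods_output)
    stuck = []
    for line in trials_output.strip().splitlines()[1:]:  # skip the header row
        columns = line.split()
        if len(columns) < 2:
            continue
        name, trial_type = columns[0], columns[1]
        pods = statuses.get(name, [])
        if trial_type == "Running" and pods and all(s == "Completed" for s in pods):
            stuck.append(name)
    return stuck


print(stuck_trials(TRIALS, PODS))
```

With the data from this issue, all three trials are flagged. That fits the empty observation_logs table: the training finished, but no metrics were reported back for the trials.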

Environment:

  • Kubeflow version: kfctl v1.2.0-0-gbc038f9
  • Kubernetes version:
Client Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.1", GitCommit:"c4d752765b3bbac2237bf87cf0b1c2e307844666", GitTreeState:"clean", BuildDate:"2020-12-18T12:09:25Z", GoVersion:"go1.15.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.2", GitCommit:"faecb196815e248d3ecfb03c680a4507229c2a56", GitTreeState:"clean", BuildDate:"2021-01-13T13:20:00Z", GoVersion:"go1.15.5", Compiler:"gc", Platform:"linux/amd64"}
  • OS: Ubuntu 20.04.1 LTS

Gorosia commented Feb 3, 2021

Hello, azarezade!

Have you solved this problem?

I have the same error. 😢

Author

azarezade commented Feb 3, 2021

Hi @Gorosia, no success yet. Do you have Kubernetes on an on-premise cluster or on a single-node machine? I suspect the issue may be related to the connection between pods, since I have a two-node cluster and the pods that run my experiments are on a different node from the one where the katib-controller pod runs.

@gaocegege
Member

cc @johnugeorge @andreyvelich


Gorosia commented Feb 3, 2021

@azarezade
Thank you for the reply.
I have Kubernetes on an 'on-premise' single-node machine.

@josepholaide

@azarezade I am trying to run the official Katib documentation example using Kubeflow deployed through microk8s, and I am getting the error below.
I have tried "kubeflow.org/v1", which still gives the same error. When I try "kubeflow.org/v1alpha3", it creates an experiment, but the experiment doesn't run, nothing shows in the Katib UI, and no trials are generated.

error: unable to recognize "random-example.yaml": no matches for kind "Experiment" in version "kubeflow.org/v1beta1"

Author

azarezade commented Feb 6, 2021

@Josepholaidepetro I think you should try kubeflow.org/v1beta1. I mean, the first line of your_example_experiment.yaml should be apiVersion: "kubeflow.org/v1beta1". For me, it runs the experiment, but I still have the aforementioned issue.

@josepholaide

@azarezade That's what I did, and I still got the error.

Author

azarezade commented Feb 6, 2021

I think you may need to open a new issue, unless you get the same results from the debugging commands that I posted in my first message, such as kubectl -n kubeflow get experiments, kubectl -n kubeflow get trials, and so on.

@josepholaide

@azarezade The experiment is created but nothing is running

@andreyvelich
Member

Thank you for creating this @azarezade.

It seems that you are creating the Experiment in the kubeflow namespace, where the webhook doesn't work and metrics are not collected.
If you deploy Katib as part of Kubeflow, you should create Experiments only in your Profile namespace.
Please check the tutorial here.
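
As a minimal sketch of the change suggested here, the manifest's `metadata.namespace` can be switched from `kubeflow` to the Profile namespace before it is applied. The helper `with_namespace` and the namespace name `my-profile` are hypothetical, and the dictionary contains only the top of the manifest from the first post.

```python
import copy


def with_namespace(manifest, namespace):
    """Return a copy of a manifest dict with metadata.namespace replaced."""
    updated = copy.deepcopy(manifest)
    updated.setdefault("metadata", {})["namespace"] = namespace
    return updated


# Top-level fields of the Experiment from the first post (spec omitted).
experiment = {
    "apiVersion": "kubeflow.org/v1beta1",
    "kind": "Experiment",
    "metadata": {"namespace": "kubeflow", "name": "tfjob-example"},
}

# Hypothetical Profile namespace; use the one created for your own user.
PROFILE_NAMESPACE = "my-profile"

moved = with_namespace(experiment, PROFILE_NAMESPACE)
print(moved["metadata"])
```

The original dictionary is left untouched, so the same manifest can be reused for several namespaces.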

@azarezade
Author

Thanks @andreyvelich for the reply. I also tried to create the experiment in the namespace with the name that I set when logging in to the Kubeflow dashboard for the first time, but it returned an error:

Error from server (InternalError): error when creating "tfjob-example.yaml": Internal error occurred: failed calling webhook "mutating.experiment.katib.kubeflow.org": Post "https://katib-controller.kubeflow.svc:443/mutate-experiments?timeout=30s": x509: certificate relies on legacy Common Name field, use SANs or temporarily enable Common Name matching with GODEBUG=x509ignoreCN=0
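
The wording of this error comes from Go's certificate verification. Starting with Go 1.15, a server certificate that carries only a Common Name and no Subject Alternative Names is rejected unless `GODEBUG=x509ignoreCN=0` is set, and the Kubernetes server in the environment section above is built with go1.15.5. So the certificate served by the katib-controller webhook most likely lacks SANs. The following Python sketch only recognises this class of error in a message; the function name is hypothetical.

```python
# Text emitted by Go 1.15+ when a certificate has a Common Name but no SANs.
LEGACY_CN_MARKER = "certificate relies on legacy Common Name field"


def is_legacy_cn_error(message):
    """Return True if the message reports a CN-only certificate rejection."""
    return "x509:" in message and LEGACY_CN_MARKER in message


# Error text copied from the comment above.
webhook_error = (
    'Error from server (InternalError): error when creating "tfjob-example.yaml": '
    'Internal error occurred: failed calling webhook '
    '"mutating.experiment.katib.kubeflow.org": Post '
    '"https://katib-controller.kubeflow.svc:443/mutate-experiments?timeout=30s": '
    "x509: certificate relies on legacy Common Name field, use SANs or temporarily "
    "enable Common Name matching with GODEBUG=x509ignoreCN=0"
)

# A different failure from earlier in the thread, for contrast.
api_version_error = (
    'error: unable to recognize "random-example.yaml": no matches for kind '
    '"Experiment" in version "kubeflow.org/v1beta1"'
)

print(is_legacy_cn_error(webhook_error), is_legacy_cn_error(api_version_error))
```

A check like this separates the certificate problem from the unrelated apiVersion error reported by another participant.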


Gorosia commented Feb 9, 2021

@azarezade
Check here 👍
I fixed the same error that you had.

@azarezade
Author

Thanks @Gorosia. I will close this issue then.
