
fix: update the api to latest kubeflow pipelines for katib sample[ Fixes #467] #468

Merged: 9 commits into kubeflow:master on Mar 3, 2021

Conversation

@ScrapCodes (Contributor) commented Feb 15, 2021

Which issue is resolved by this Pull Request:
Resolves #467 and maybe #411

Description of your changes:
Update the API to run with KFP 1.3.0
Environment tested:

  • Python Version (use python --version): Python 3.8.5
  • Tekton Version (use tkn version): Client version: 0.15.0
  • Kubernetes Version (use kubectl version): 1.18.0
  • OS (e.g. from /etc/os-release): Mac OS X
(.venv) MacBook-Pro:pipelines prashantsharma$ pip3 list
Package                  Version
------------------------ ---------
attrs                    20.3.0
cachetools               4.2.1
certifi                  2020.12.5
cffi                     1.14.5
chardet                  4.0.0
click                    7.1.2
cloudpickle              1.6.0
Deprecated               1.2.11
docstring-parser         0.7.3
google-api-core          1.26.0
google-auth              1.26.1
google-cloud-core        1.6.0
google-cloud-storage     1.36.0
google-crc32c            1.1.2
google-resumable-media   1.2.0
googleapis-common-protos 1.52.0
idna                     2.10
iniconfig                1.1.1
jsonschema               3.2.0
kfp                      1.3.0
kfp-pipeline-spec        0.1.5
kfp-server-api           1.3.0
kfp-tekton               0.6.0
kubeflow-katib           0.10.1
kubernetes               10.0.1
numpy                    1.20.1
oauthlib                 3.1.0
packaging                20.9
pip                      20.1.1
pluggy                   0.13.1
protobuf                 3.14.0
py                       1.10.0
pyasn1                   0.4.8
pyasn1-modules           0.2.8
pycparser                2.20
pyparsing                2.4.7
pyrsistent               0.17.3
pytest                   6.2.2
python-dateutil          2.8.1
pytz                     2021.1
PyYAML                   5.4.1
requests                 2.25.1
requests-oauthlib        1.3.0
requests-toolbelt        0.9.1
rsa                      4.7
setuptools               47.1.0
six                      1.15.0
strip-hints              0.1.9
table-logger             0.3.6
tabulate                 0.8.7
toml                     0.10.2
urllib3                  1.26.3
websocket-client         0.57.0
wheel                    0.36.2
wrapt                    1.12.1


@k8s-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: ScrapCodes
To complete the pull request process, please assign tomcli after the PR has been reviewed.
You can assign the PR to them by writing /assign @tomcli in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ScrapCodes (Contributor, Author) commented Feb 15, 2021

Currently, I am running into:

(.venv) MacBook-Pro:pipelines prashantsharma$ dsl-compile-tekton --py samples/katib/katib.py --output katib.yaml
KFP-Tekton Compiler 0.6.0
/Users/prashantsharma/go/src/github.com/kubeflow/pipelines/.venv/lib/python3.8/site-packages/kfp/components/_data_passing.py:168: UserWarning: Missing type name was inferred as "Float" based on the value "0.99".
  warnings.warn('Missing type name was inferred as "{}" based on the value "{}".'.format(type_name, str(value)))
/Users/prashantsharma/go/src/github.com/kubeflow/pipelines/.venv/lib/python3.8/site-packages/kfp/components/_data_passing.py:168: UserWarning: Missing type name was inferred as "Integer" based on the value "3".
  warnings.warn('Missing type name was inferred as "{}" based on the value "{}".'.format(type_name, str(value)))
/Users/prashantsharma/go/src/github.com/kubeflow/pipelines/.venv/lib/python3.8/site-packages/kfp/components/_data_passing.py:168: UserWarning: Missing type name was inferred as "Integer" based on the value "12".
  warnings.warn('Missing type name was inferred as "{}" based on the value "{}".'.format(type_name, str(value)))
/Users/prashantsharma/go/src/github.com/kubeflow/pipelines/.venv/lib/python3.8/site-packages/kfp/components/_data_passing.py:168: UserWarning: Missing type name was inferred as "Integer" based on the value "60".
  warnings.warn('Missing type name was inferred as "{}" based on the value "{}".'.format(type_name, str(value)))
/Users/prashantsharma/go/src/github.com/kubeflow/pipelines/.venv/lib/python3.8/site-packages/kfp/components/_data_passing.py:168: UserWarning: Missing type name was inferred as "Boolean" based on the value "True".
  warnings.warn('Missing type name was inferred as "{}" based on the value "{}".'.format(type_name, str(value)))
Traceback (most recent call last):
  File "/Users/prashantsharma/go/src/github.com/kubeflow/pipelines/.venv/bin/dsl-compile-tekton", line 8, in <module>
    sys.exit(main())
  File "/Users/prashantsharma/go/src/github.com/kubeflow/pipelines/.venv/lib/python3.8/site-packages/kfp_tekton/compiler/main.py", line 89, in main
    compile_pyfile(args.py, args.function, args.output, not args.disable_type_check)
  File "/Users/prashantsharma/go/src/github.com/kubeflow/pipelines/.venv/lib/python3.8/site-packages/kfp_tekton/compiler/main.py", line 77, in compile_pyfile
    _compile_pipeline_function(pipeline_funcs, function_name, output_path, type_check)
  File "/Users/prashantsharma/go/src/github.com/kubeflow/pipelines/.venv/lib/python3.8/site-packages/kfp_tekton/compiler/main.py", line 68, in _compile_pipeline_function
    TektonCompiler().compile(pipeline_func, output_path, type_check)
  File "/Users/prashantsharma/go/src/github.com/kubeflow/pipelines/.venv/lib/python3.8/site-packages/kfp_tekton/compiler/compiler.py", line 759, in compile
    super().compile(pipeline_func, package_path, type_check, pipeline_conf=pipeline_conf)
  File "/Users/prashantsharma/go/src/github.com/kubeflow/pipelines/.venv/lib/python3.8/site-packages/kfp/compiler/compiler.py", line 948, in compile
    self._create_and_write_workflow(
  File "/Users/prashantsharma/go/src/github.com/kubeflow/pipelines/.venv/lib/python3.8/site-packages/kfp_tekton/compiler/compiler.py", line 852, in _create_and_write_workflow
    workflow = self._create_workflow(
  File "/Users/prashantsharma/go/src/github.com/kubeflow/pipelines/.venv/lib/python3.8/site-packages/kfp_tekton/compiler/compiler.py", line 680, in _create_workflow
    pipeline_func(*args_list)
  File "samples/katib/katib.py", line 158, in mnist_hpo
    experiment_spec=ApiClient().sanitize_for_serialization(experiment_spec),
  File "/Users/prashantsharma/go/src/github.com/kubeflow/pipelines/.venv/lib/python3.8/site-packages/kubeflow/katib/api_client.py", line 218, in sanitize_for_serialization
    return {key: self.sanitize_for_serialization(val)
  File "/Users/prashantsharma/go/src/github.com/kubeflow/pipelines/.venv/lib/python3.8/site-packages/kubeflow/katib/api_client.py", line 218, in <dictcomp>
    return {key: self.sanitize_for_serialization(val)
  File "/Users/prashantsharma/go/src/github.com/kubeflow/pipelines/.venv/lib/python3.8/site-packages/kubeflow/katib/api_client.py", line 215, in sanitize_for_serialization
    for attr, _ in six.iteritems(obj.swagger_types)
AttributeError: 'PipelineParam' object has no attribute 'swagger_types'

I have tried different kubernetes client versions, e.g.:

pip3 install kubernetes==11.0.0
pip3 install kubernetes==9.0.0

Both result in the same error as above.

@ScrapCodes (Contributor, Author)

After making these changes (5075821), I now get another error:

Traceback (most recent call last):
  File "samples/katib/katib.py", line 183, in <module>
    TektonCompiler().compile(mnist_hpo, __file__.replace('.py', '.yaml'))
  File "/Users/prashantsharma/go/src/github.com/kubeflow/pipelines/.venv/lib/python3.8/site-packages/kfp_tekton/compiler/compiler.py", line 762, in compile
    super().compile(pipeline_func, package_path, type_check, pipeline_conf=pipeline_conf)
  File "/Users/prashantsharma/go/src/github.com/kubeflow/pipelines/.venv/lib/python3.8/site-packages/kfp/compiler/compiler.py", line 948, in compile
    self._create_and_write_workflow(
  File "/Users/prashantsharma/go/src/github.com/kubeflow/pipelines/.venv/lib/python3.8/site-packages/kfp_tekton/compiler/compiler.py", line 855, in _create_and_write_workflow
    workflow = self._create_workflow(
  File "/Users/prashantsharma/go/src/github.com/kubeflow/pipelines/.venv/lib/python3.8/site-packages/kfp_tekton/compiler/compiler.py", line 690, in _create_workflow
    self._sanitize_and_inject_artifact(dsl_pipeline, pipeline_conf)
  File "/Users/prashantsharma/go/src/github.com/kubeflow/pipelines/.venv/lib/python3.8/site-packages/kfp_tekton/compiler/compiler.py", line 635, in _sanitize_and_inject_artifact
    sanitize_k8s_object(op.container)
  File "/Users/prashantsharma/go/src/github.com/kubeflow/pipelines/.venv/lib/python3.8/site-packages/kfp_tekton/compiler/_k8s_helper.py", line 196, in sanitize_k8s_object
    type = k8s_obj.openapi_types[attr]
AttributeError: 'Container' object has no attribute 'openapi_types'
(.venv) MacBook-Pro:pipelines prashantsharma$ 
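
For context, the 'Container' object has no attribute 'openapi_types' error comes from the kubernetes Python client renaming its model-introspection attribute between major releases: 10.x models expose swagger_types, while 11.x+ models expose openapi_types. A version-agnostic lookup would be (a minimal sketch, not part of this PR):

def model_types(k8s_obj):
    # kubernetes client 10.x generates models with `swagger_types`;
    # 11.x and later generate `openapi_types` instead, so code written
    # against one client version breaks on objects built by the other.
    return getattr(k8s_obj, "openapi_types", None) or getattr(k8s_obj, "swagger_types", None)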

@Tomcli (Member) commented Feb 15, 2021

Thanks @ScrapCodes. The KFP DSL can only accept basic types like string and int, because it needs to convert each of them into a new type called PipelineParam.

You can try using this example and lock the Katib SDK to 0.10.1, since I'm not sure whether they introduced any new spec recently:
https://github.com/kubeflow/pipelines/blob/master/samples/contrib/kubeflow-katib/early-stopping.ipynb
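
A minimal sketch of what that conversion means in practice (illustrative names, not from the sample):

import kfp.dsl as dsl

@dsl.pipeline(name="demo")
def demo(goal: float = 0.99):
    # KFP invokes this function at compile time with placeholder objects,
    # so `goal` is a dsl.PipelineParam here, not a float. Feeding it to
    # ApiClient().sanitize_for_serialization(...) hits an object with no
    # `swagger_types`, which is the AttributeError in the traceback above.
    print(type(goal))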

@ScrapCodes (Contributor, Author)

@Tomcli Thank you for taking a look.

I have updated the existing sample, which is the one at the link you posted above, so my guess is that won't work either. Even if I remove all the items from the experiment spec, I get the same error, so I think it could be related to something else.
The error is similar to kubernetes-client/python#1112, where @jinchihe solved it by using the dynamic client. I could not fully understand that part. Any help would make this attempt to update the SDK for the samples complete.

Thanks!!

@Tomcli (Member) commented Feb 17, 2021

https://github.com/kubeflow/katib/blob/master/sdk/python/v1beta1/requirements.txt#L6
I'm not sure why they locked the kubernetes client to 10.0.1. For KFP we need kubernetes client 11.0.0+.

You can compile the pipeline after running pip3 install kubernetes==11.0.0

After compiling, I tried to run the pipeline and saw this error. If you think this is a Katib issue, then we should open it on the Katib repo: https://github.com/kubeflow/katib

HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Internal error occurred: failed calling webhook \"mutating.experiment.katib.kubeflow.org\": Post https://katib-controller.kubeflow.svc:443/mutate-experiments?timeout=30s: EOF","reason":"InternalError","details":{"causes":[{"message":"failed calling webhook \"mutating.experiment.katib.kubeflow.org\": Post https://katib-controller.kubeflow.svc:443/mutate-experiments?timeout=30s: EOF"}]},"code":500}

Since the Katib component is not actively maintained, if you encounter too many issues with Katib we can consider moving some examples to use TFJob instead.

@ScrapCodes (Contributor, Author) commented Feb 17, 2021

@Tomcli I now get the same error you posted above.

Probing the katib-controller logs, I found:

2021/02/17 08:37:23 http: panic serving 172.17.47.2:55394: runtime error: invalid memory address or nil pointer dereference
goroutine 678780 [running]:
net/http.(*conn).serve.func1(0xc00019c280)
	/usr/local/go/src/net/http/server.go:1801 +0x147
panic(0x1507140, 0x2229490)
	/usr/local/go/src/runtime/panic.go:975 +0x47a
github.com/kubeflow/katib/pkg/apis/controller/experiments/v1beta1.(*Experiment).setDefaultObjective(0xc0003c6840)
	/go/src/github.com/kubeflow/katib/pkg/apis/controller/experiments/v1beta1/experiment_defaults.go:53 +0x55
github.com/kubeflow/katib/pkg/apis/controller/experiments/v1beta1.(*Experiment).SetDefault(0xc0003c6840)
	/go/src/github.com/kubeflow/katib/pkg/apis/controller/experiments/v1beta1/experiment_defaults.go:33 +0x66
github.com/kubeflow/katib/pkg/webhook/v1beta1/experiment.(*experimentDefaulter).Handle(0xc000878ca0, 0x18ce8e0, 0xc000190000, 0xc000758160, 0xffffffffffffffff, 0xc000efb8c8, 0xa3ef45, 0x1894c00)
	/go/src/github.com/kubeflow/katib/pkg/webhook/v1beta1/experiment/mutate_webhook.go:47 +0xa5
github.com/kubeflow/katib/vendor/sigs.k8s.io/controller-runtime/pkg/webhook/admission.(*Webhook).handleMutating(0xc0001fd780, 0x18ce8e0, 0xc000190000, 0xc000758160, 0x3, 0xc00044d300, 0xc00044d300, 0x0)
	/go/src/github.com/kubeflow/katib/vendor/sigs.k8s.io/controller-runtime/pkg/webhook/admission/webhook.go:133 +0xd8
github.com/kubeflow/katib/vendor/sigs.k8s.io/controller-runtime/pkg/webhook/admission.(*Webhook).Handle(0xc0001fd780, 0x18ce8e0, 0xc000190000, 0xc000758160, 0x0, 0x18b2120, 0xc0007523c0, 0x18b2120)
	/go/src/github.com/kubeflow/katib/vendor/sigs.k8s.io/controller-runtime/pkg/webhook/admission/webhook.go:120 +0x1fa
github.com/kubeflow/katib/vendor/sigs.k8s.io/controller-runtime/pkg/webhook/admission.(*Webhook).ServeHTTP(0xc0001fd780, 0x18c7ea0, 0xc00164e0e0, 0xc00042df00)
	/go/src/github.com/kubeflow/katib/vendor/sigs.k8s.io/controller-runtime/pkg/webhook/admission/http.go:93 +0x9f5
net/http.(*ServeMux).ServeHTTP(0xc0008bc780, 0x18c7ea0, 0xc00164e0e0, 0xc00042df00)
	/usr/local/go/src/net/http/server.go:2417 +0x1ad
net/http.serverHandler.ServeHTTP(0xc000903960, 0x18c7ea0, 0xc00164e0e0, 0xc00042df00)
	/usr/local/go/src/net/http/server.go:2843 +0xa3
net/http.(*conn).serve(0xc00019c280, 0x18ce8a0, 0xc00060e000)
	/usr/local/go/src/net/http/server.go:1925 +0x8ad
created by net/http.(*Server).Serve
	/usr/local/go/src/net/http/server.go:2969 +0x36c

@ScrapCodes (Contributor, Author) commented Feb 17, 2021

Trying the mpi-job-horovod sample produced the same error as above. Filed kubeflow/katib#1435

@ScrapCodes changed the title from "[WIP] fix: update the api to latest kubeflow pipelines for katib sample[ Fixes #467]" to "fix: update the api to latest kubeflow pipelines for katib sample[ Fixes #467]" on Feb 19, 2021
@ScrapCodes (Contributor, Author)

Now this runs fine, but it fails to finish even with a very large timeout.

@Tomcli (Member) commented Feb 19, 2021

@ScrapCodes we only need an example of how to run Katib with KFP-Tekton. You can replace the example in this PR with the new Katib example from KFP and see whether you can run it with Tekton: https://github.com/kubeflow/pipelines/blob/master/samples/contrib/kubeflow-katib/mpi-job-horovod.py

The old example was created by a previous Katib committer and is deprecated now.

@andreyvelich (Member)

Yes, you can try to take the MPI job example from KFP.
Check the Katib Experiment status and the Trial statuses.
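
For example, with the Katib SDK client (a sketch; method names assumed from kubeflow-katib 0.10.x):

from kubeflow.katib import KatibClient

client = KatibClient()
# Experiment-level condition, e.g. "Running", "Succeeded" or "Failed".
print(client.get_experiment_status("mnist56", namespace="prashant"))
# Per-trial statuses for the same experiment.
print(client.list_trials("mnist56", namespace="prashant"))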

@ScrapCodes (Contributor, Author) commented Feb 23, 2021

@andreyvelich and @Tomcli The jobs are never marked successful; they continue to run until the timeout is reached, leading to failure.

@ScrapCodes (Contributor, Author)

@Tomcli As per issue #411, do you think this PR makes sense to merge? Or, if we do not want to maintain this example, should we get rid of it altogether, since other (updated) Katib samples are available?

@ScrapCodes (Contributor, Author) commented Feb 26, 2021

Looks like I am again running into some errors when running on another IKS cluster with k8s version v1.19.8+IKS

kubernetes.client.rest.ApiException: (500)
Reason: Internal Server Error
HTTP response headers: HTTPHeaderDict({'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'Date': 'Fri, 26 Feb 2021 12:55:22 GMT', 'Content-Length': '747'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Internal error occurred: failed calling webhook \"mutating.experiment.katib.kubeflow.org\": Post \"https://katib-controller.kubeflow.svc:443/mutate-experiments?timeout=30s\": x509: certificate relies on legacy Common Name field, use SANs or temporarily enable Common Name matching with GODEBUG=x509ignoreCN=0","reason":"InternalError","details":{"causes":[{"message":"failed calling webhook \"mutating.experiment.katib.kubeflow.org\": Post \"https://katib-controller.kubeflow.svc:443/mutate-experiments?timeout=30s\": x509: certificate relies on legacy Common Name field, use SANs or temporarily enable Common Name matching with GODEBUG=x509ignoreCN=0"}]},"code":500}

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "src/launch_experiment.py", line 115, in <module>
    output = katib_client.create_experiment(experiment, namespace=experiment_namespace)
  File "/usr/local/lib/python3.6/site-packages/kubeflow/katib/api/katib_client.py", line 78, in create_experiment
    %s\n" % e)
RuntimeError: Exception when calling CustomObjectsApi->create_namespaced_custom_object:         (500)
Reason: Internal Server Error
HTTP response headers: HTTPHeaderDict({'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'Date': 'Fri, 26 Feb 2021 12:55:22 GMT', 'Content-Length': '747'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Internal error occurred: failed calling webhook \"mutating.experiment.katib.kubeflow.org\": Post \"https://katib-controller.kubeflow.svc:443/mutate-experiments?timeout=30s\": x509: certificate relies on legacy Common Name field, use SANs or temporarily enable Common Name matching with GODEBUG=x509ignoreCN=0","reason":"InternalError","details":{"causes":[{"message":"failed calling webhook \"mutating.experiment.katib.kubeflow.org\": Post \"https://katib-controller.kubeflow.svc:443/mutate-experiments?timeout=30s\": x509: certificate relies on legacy Common Name field, use SANs or temporarily enable Common Name matching with GODEBUG=x509ignoreCN=0"}]},"code":500}

Katib-controller logs look like this:

(.venv) MacBook-Pro:katib prashantsharma$ kk logs -f pod/katib-controller-7fcc95676b-bz6w2
{"level":"info","ts":1614344042.9566338,"logger":"entrypoint","msg":"Config:","experiment-suggestion-name":"default","cert-local-filesystem":false,"webhook-port":8443,"metrics-addr":":8080","inject-security-context":false,"enable-grpc-probe-in-suggestion":true,"trial-resources":[{"Group":"batch","Version":"v1","Kind":"Job"},{"Group":"kubeflow.org","Version":"v1","Kind":"TFJob"},{"Group":"kubeflow.org","Version":"v1","Kind":"PyTorchJob"},{"Group":"kubeflow.org","Version":"v1","Kind":"MPIJob"},{"Group":"tekton.dev","Version":"v1beta1","Kind":"PipelineRun"}]}
{"level":"info","ts":1614344043.3818798,"logger":"entrypoint","msg":"Registering Components."}
{"level":"info","ts":1614344043.382442,"logger":"entrypoint","msg":"Setting up controller"}
{"level":"info","ts":1614344043.3824809,"logger":"experiment-controller","msg":"Using the default suggestion implementation"}
{"level":"info","ts":1614344043.3826084,"logger":"kubebuilder.controller","msg":"Starting EventSource","controller":"experiment-controller","source":"kind source: /, Kind="}
{"level":"info","ts":1614344043.3828194,"logger":"kubebuilder.controller","msg":"Starting EventSource","controller":"experiment-controller","source":"kind source: /, Kind="}
{"level":"info","ts":1614344043.382991,"logger":"kubebuilder.controller","msg":"Starting EventSource","controller":"experiment-controller","source":"kind source: /, Kind="}
{"level":"info","ts":1614344043.383147,"logger":"experiment-controller","msg":"Experiment controller created"}
{"level":"info","ts":1614344043.3832188,"logger":"kubebuilder.controller","msg":"Starting EventSource","controller":"suggestion-controller","source":"kind source: /, Kind="}
{"level":"info","ts":1614344043.3832498,"logger":"kubebuilder.controller","msg":"Starting EventSource","controller":"suggestion-controller","source":"kind source: /, Kind="}
{"level":"info","ts":1614344043.3833654,"logger":"kubebuilder.controller","msg":"Starting EventSource","controller":"suggestion-controller","source":"kind source: /, Kind="}
{"level":"info","ts":1614344043.3834949,"logger":"kubebuilder.controller","msg":"Starting EventSource","controller":"suggestion-controller","source":"kind source: /, Kind="}
{"level":"info","ts":1614344043.3836193,"logger":"suggestion-controller","msg":"Suggestion controller created"}
{"level":"info","ts":1614344043.3837295,"logger":"kubebuilder.controller","msg":"Starting EventSource","controller":"trial-controller","source":"kind source: /, Kind="}
{"level":"info","ts":1614344043.383765,"logger":"kubebuilder.controller","msg":"Starting EventSource","controller":"trial-controller","source":"kind source: batch/v1, Kind=Job"}
{"level":"info","ts":1614344043.3839185,"logger":"trial-controller","msg":"Job watch added successfully","CRD Group":"batch","CRD Version":"v1","CRD Kind":"Job"}
{"level":"info","ts":1614344043.3839483,"logger":"kubebuilder.controller","msg":"Starting EventSource","controller":"trial-controller","source":"kind source: kubeflow.org/v1, Kind=TFJob"}
{"level":"info","ts":1614344043.3840609,"logger":"trial-controller","msg":"Job watch added successfully","CRD Group":"kubeflow.org","CRD Version":"v1","CRD Kind":"TFJob"}
{"level":"info","ts":1614344043.3840866,"logger":"kubebuilder.controller","msg":"Starting EventSource","controller":"trial-controller","source":"kind source: kubeflow.org/v1, Kind=PyTorchJob"}
{"level":"info","ts":1614344043.3841786,"logger":"trial-controller","msg":"Job watch added successfully","CRD Group":"kubeflow.org","CRD Version":"v1","CRD Kind":"PyTorchJob"}
{"level":"info","ts":1614344043.3842037,"logger":"kubebuilder.controller","msg":"Starting EventSource","controller":"trial-controller","source":"kind source: kubeflow.org/v1, Kind=MPIJob"}
{"level":"error","ts":1614344043.3842342,"logger":"kubebuilder.source","msg":"if kind is a CRD, it should be installed before calling Start","kind":{"Group":"kubeflow.org","Kind":"MPIJob"},"error":"no matches for kind \"MPIJob\" in version \"kubeflow.org/v1\"","stacktrace":"github.com/kubeflow/katib/vendor/github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/src/github.com/kubeflow/katib/vendor/github.com/go-logr/zapr/zapr.go:128\ngit.luolix.top/kubeflow/katib/vendor/sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start\n\t/go/src/github.com/kubeflow/katib/vendor/sigs.k8s.io/controller-runtime/pkg/source/source.go:89\ngit.luolix.top/kubeflow/katib/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Watch\n\t/go/src/github.com/kubeflow/katib/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:122\ngit.luolix.top/kubeflow/katib/pkg/controller.v1beta1/trial.add\n\t/go/src/github.com/kubeflow/katib/pkg/controller.v1beta1/trial/trial_controller.go:106\ngit.luolix.top/kubeflow/katib/pkg/controller.v1beta1/trial.Add\n\t/go/src/github.com/kubeflow/katib/pkg/controller.v1beta1/trial/trial_controller.go:65\ngit.luolix.top/kubeflow/katib/pkg/controller%2ev1beta1.AddToManager\n\t/go/src/github.com/kubeflow/katib/pkg/controller.v1beta1/controller.go:28\nmain.main\n\t/go/src/github.com/kubeflow/katib/cmd/katib-controller/v1beta1/main.go:112\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:204"}
{"level":"info","ts":1614344043.384385,"logger":"trial-controller","msg":"Job watch error. CRD might be missing. Please install CRD and restart katib-controller","CRD Group":"kubeflow.org","CRD Version":"v1","CRD Kind":"MPIJob"}
{"level":"info","ts":1614344043.3843982,"logger":"kubebuilder.controller","msg":"Starting EventSource","controller":"trial-controller","source":"kind source: tekton.dev/v1beta1, Kind=PipelineRun"}
{"level":"info","ts":1614344043.3845088,"logger":"trial-controller","msg":"Job watch added successfully","CRD Group":"tekton.dev","CRD Version":"v1beta1","CRD Kind":"PipelineRun"}
{"level":"info","ts":1614344043.38453,"logger":"trial-controller","msg":"Trial controller created"}
{"level":"info","ts":1614344043.3845363,"logger":"entrypoint","msg":"Setting up webhooks"}
{"level":"info","ts":1614344043.3846786,"logger":"entrypoint","msg":"Starting the Cmd."}
{"level":"info","ts":1614344043.4860818,"logger":"kubebuilder.webhook","msg":"installing webhook configuration in cluster"}
{"level":"info","ts":1614344043.486201,"logger":"kubebuilder.controller","msg":"Starting Controller","controller":"experiment-controller"}
{"level":"info","ts":1614344043.4863234,"logger":"kubebuilder.controller","msg":"Starting Controller","controller":"trial-controller"}
{"level":"info","ts":1614344043.4864097,"logger":"kubebuilder.controller","msg":"Starting Controller","controller":"suggestion-controller"}
{"level":"info","ts":1614344043.5864823,"logger":"kubebuilder.controller","msg":"Starting workers","controller":"trial-controller","worker count":1}
{"level":"info","ts":1614344043.586536,"logger":"kubebuilder.controller","msg":"Starting workers","controller":"experiment-controller","worker count":1}
{"level":"info","ts":1614344043.6880174,"logger":"kubebuilder.admission.cert.writer","msg":"cert is invalid or expiring, regenerating a new one"}
{"level":"info","ts":1614344043.690113,"logger":"kubebuilder.controller","msg":"Starting workers","controller":"suggestion-controller","worker count":1}
2021/02/26 12:55:16 http: TLS handshake error from 172.30.26.202:45184: remote error: tls: bad certificate
2021/02/26 12:55:22 http: TLS handshake error from 172.30.26.202:45358: remote error: tls: bad certificate
^C

@ScrapCodes (Contributor, Author)

Now exploring why the experiment never finishes.


pod/mnist56-5qhlvcj2-8g2wz                                            0/2     Completed   0          7m7s
pod/mnist56-l28xsfzt-ws58l                                            0/2     Completed   0          7m7s
pod/mnist56-n6tklfz5-966xl                                            0/2     Completed   0          7m6s
pod/mnist56-random-8668f4c66b-vx48s                                   1/1     Running     0          7m41s

(base) MacBook-Pro:pipelines prashantsharma$ kp logs -f pod/mnist56-random-8668f4c66b-vx48s 
INFO:pkg.suggestion.v1beta1.hyperopt.base_service:GetSuggestions returns 3 new Trial

^C
(base) MacBook-Pro:pipelines prashantsharma$ kp describe pod/mnist56-random-8668f4c66b-vx48s 
Name:         mnist56-random-8668f4c66b-vx48s
Namespace:    prashant
Priority:     0
Node:         10.240.128.7/10.240.128.7
Start Time:   Fri, 26 Feb 2021 19:20:13 +0530
Labels:       deployment=mnist56-random
              experiment=mnist56
              pod-template-hash=8668f4c66b
              suggestion=mnist56
Annotations:  cni.projectcalico.org/podIP: 172.17.52.15/32
              cni.projectcalico.org/podIPs: 172.17.52.15/32
              kubernetes.io/psp: ibm-privileged-psp
              sidecar.istio.io/inject: false
Status:       Running
IP:           172.17.52.15
IPs:
  IP:           172.17.52.15
Controlled By:  ReplicaSet/mnist56-random-8668f4c66b
Containers:
  suggestion:
    Container ID:   containerd://2d2f21415b86b3703974070212898fb7a1998b0545120385357fcd60e9837d24
    Image:          docker.io/kubeflowkatib/suggestion-hyperopt:v1beta1-a96ff59
    Image ID:       docker.io/kubeflowkatib/suggestion-hyperopt@sha256:ba2de63dee57dda8f03770190f1b6355897ae200f47b28d43eae0ccc5bc2a848
    Port:           6789/TCP
    Host Port:      0/TCP
    State:          Running
      Started:      Fri, 26 Feb 2021 19:20:14 +0530
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:                500m
      ephemeral-storage:  5Gi
      memory:             100Mi
    Requests:
      cpu:                50m
      ephemeral-storage:  500Mi
      memory:             10Mi
    Liveness:             exec [/bin/grpc_health_probe -addr=:6789 -service=manager.v1beta1.Suggestion] delay=10s timeout=1s period=120s #success=1 #failure=12
    Readiness:            exec [/bin/grpc_health_probe -addr=:6789 -service=manager.v1beta1.Suggestion] delay=10s timeout=1s period=10s #success=1 #failure=3
    Environment:          <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-fhwf4 (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True 
Volumes:
  default-token-fhwf4:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-fhwf4
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 600s
                 node.kubernetes.io/unreachable:NoExecute for 600s
Events:
  Type    Reason     Age    From                   Message
  ----    ------     ----   ----                   -------
  Normal  Scheduled  8m32s  default-scheduler      Successfully assigned prashant/mnist56-random-8668f4c66b-vx48s to 10.240.128.7
  Normal  Pulled     8m31s  kubelet, 10.240.128.7  Container image "docker.io/kubeflowkatib/suggestion-hyperopt:v1beta1-a96ff59" already present on machine
  Normal  Created    8m31s  kubelet, 10.240.128.7  Created container suggestion
  Normal  Started    8m31s  kubelet, 10.240.128.7  Started container suggestion

Even though the suggestions are generated, the pod mnist56-random-8668f4c66b-vx48s is stuck running forever.

@andreyvelich (Member)

Looks like I am again running into some errors when running on another IKS cluster with k8s version v1.19.8+IKS
[the ApiException and katib-controller logs quoted above]

Yes, that is a known issue: kubeflow/katib#1395.
We are working on kubeflow/katib#1450 to be able to run Katib with k8s v1.19.

@Tomcli (Member) commented Feb 26, 2021

@Tomcli As per issue #411, do you think this PR makes sense to merge? Or, if we do not want to maintain this example, should we get rid of it altogether, since other (updated) Katib samples are available?

For KFP-Tekton, the goal is to have an example of running a Katib experiment using a Tekton pipeline. We don't want to maintain the experiment itself, so I recommended replacing the existing example with mpi-job-horovod.py.

## Acknowledgements

Thanks [Hougang Liu](https://github.com/hougangliu) for creating the original katib example.
- Compile compressed YAML definition of the Pipeline using Katib Experiment with
Member:

For this example, maybe add a note that we need to install the MPIJob controller, since it doesn't come with Kubeflow 1.2 by default.

Member:

Also add a note that, with Kubeflow 1.2, Katib cannot run on k8s 1.19+.

Member:

Hmm, I couldn't run MPIJobs on k8s 1.18. This is the error I got:

W0301 21:04:17.685243       1 reflector.go:302] pkg/mod/k8s.io/client-go@v0.15.10/tools/cache/reflector.go:98: watch of *v1.Pod ended with: too old resource version: 232224066 (232226844)

KFP-Tekton needs a minimum of k8s 1.17 to run. Did you have any success with this?

Contributor (Author):

I did not install the MPIJob controller, nor did I see that error. I am wondering how it worked on my k8s version 1.18.

@@ -551,7 +551,8 @@
}
],
"source": [
"kfp.Client().create_run_from_pipeline_func(median_stop, arguments={})"
"from kfp_tekton._client import TektonClient",
Member:

You need a newline for this code, otherwise the Jupyter notebook will not be able to run this line.

Member:

Also, at the top, run !pip install kfp-tekton==0.4.0 to let it run on Kubeflow 1.2 with Tekton.
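
Put together, the fixed cell source would look roughly like this (an illustrative notebook JSON fragment, mirroring the diff above), with the !pip install kfp-tekton==0.4.0 line added to the setup cell at the top of the notebook:

"source": [
    "from kfp_tekton._client import TektonClient\n",
    "TektonClient().create_run_from_pipeline_func(median_stop, arguments={})"
]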

@Tomcli (Member) commented Mar 1, 2021

@ScrapCodes the notebook is good after adding the right dependencies (kfp-tekton==0.4.0).
I didn't have any luck with the MPI example; it seems the MPI operator couldn't run on my k8s 1.18 cluster.

@ScrapCodes (Contributor, Author) commented Mar 2, 2021

@ScrapCodes the notebook is good after adding the right dependencies (kfp-tekton==0.4.0).

I was successful in running it without adding the above dependency. It could be that I installed it as part of some previous notebooks, and my notebook server has been running for weeks.

I will add it, thanks for catching it.

I didn't have any luck with the MPI example; it seems the MPI operator couldn't run on my k8s 1.18 cluster.

It seemed to work on my 1.18 cluster.

(base) MacBook-Pro:pipelines prashantsharma$ kp describe suggestion.kubeflow.org/mpi-horovod-mnist1
Name:         mpi-horovod-mnist1
Namespace:    prashant
Labels:       <none>
Annotations:  <none>
API Version:  kubeflow.org/v1beta1
Kind:         Suggestion
Metadata:
  Creation Timestamp:  2021-03-01T07:38:15Z
  Generation:          1
  Managed Fields:
    API Version:  kubeflow.org/v1beta1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:ownerReferences:
      f:spec:
        .:
        f:algorithm:
          .:
          f:algorithmName:
          f:algorithmSettings:
        f:requests:
        f:resumePolicy:
      f:status:
        .:
        f:conditions:
        f:startTime:
        f:suggestionCount:
        f:suggestions:
    Manager:    katib-controller
    Operation:  Update
    Time:       2021-03-01T07:38:49Z
  Owner References:
    API Version:           kubeflow.org/v1beta1
    Block Owner Deletion:  true
    Controller:            true
    Kind:                  Experiment
    Name:                  mpi-horovod-mnist1
    UID:                   9d4547ec-1860-40a1-8af1-3cdf3c25b0d5
  Resource Version:        17284198
  Self Link:               /apis/kubeflow.org/v1beta1/namespaces/prashant/suggestions/mpi-horovod-mnist1
  UID:                     1f375931-a031-43e3-8505-df3d5b10879a
Spec:
  Algorithm:
    Algorithm Name:  bayesianoptimization
    Algorithm Settings:
      Name:       random_state
      Value:      10
  Requests:       2
  Resume Policy:  LongRunning
Status:
  Conditions:
    Last Transition Time:  2021-03-01T07:38:15Z
    Last Update Time:      2021-03-01T07:38:15Z
    Message:               Suggestion is created
    Reason:                SuggestionCreated
    Status:                True
    Type:                  Created
    Last Transition Time:  2021-03-01T07:38:28Z
    Last Update Time:      2021-03-01T07:38:28Z
    Message:               Deployment is ready
    Reason:                DeploymentReady
    Status:                True
    Type:                  DeploymentReady
    Last Transition Time:  2021-03-01T07:38:49Z
    Last Update Time:      2021-03-01T07:38:49Z
    Message:               Suggestion is running
    Reason:                SuggestionRunning
    Status:                True
    Type:                  Running
  Start Time:              2021-03-01T07:38:15Z
  Suggestion Count:        2
  Suggestions:
    Name:  mpi-horovod-mnist1-bnz7qbjk
    Parameter Assignments:
      Name:   lr
      Value:  0.0013884981186857927
      Name:   num-steps
      Value:  99
    Name:     mpi-horovod-mnist1-bzjzmxcn
    Parameter Assignments:
      Name:   lr
      Value:  0.0016269411644444262
      Name:   num-steps
      Value:  133
Events:       <none>

Suggestions are generated as above, but the status always remains Running. (I would say the run is buggy, as the UI reports it as timed out.) I have tried it several times, with the same observations.

@Tomcli (Member) commented Mar 2, 2021

The suggestion is always running because the MPIJob was never created on your cluster, since you don't have the MPI operator. In my case, the suggestion also runs forever because my MPIJobs aren't able to create pods to run the actual workloads.

I think this is an MPI operator issue, as Katib runs fine in the other example.
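
One quick way to confirm whether the MPIJob CRD (and hence the operator) is installed, a sketch using the kubernetes Python client (not part of the sample):

from kubernetes import client, config

config.load_kube_config()
crd_names = {crd.metadata.name
             for crd in client.ApiextensionsV1beta1Api().list_custom_resource_definition().items}
# False here means the MPI operator and its CRD are not installed.
print("mpijobs.kubeflow.org" in crd_names)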

@Tomcli (Member) commented Mar 2, 2021

Maybe we can remove the MPI job example for now, since it doesn't come with the default installation.

@@ -201,6 +201,7 @@
"# Update the PIP version.\n",
"!python -m pip install --upgrade pip\n",
"!pip install kfp==1.1.1\n",
"!pip install kfp-tekton==0.4.0\n",
Member:

Maybe put it under kubeflow-katib; I realized kubeflow-katib is using kubernetes==10.0.1, which breaks the kfp client package.
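
That is, something like the following install order (illustrative; the assumption is that installing kfp-tekton after kubeflow-katib leaves a compatible kubernetes client in place):

"!python -m pip install --upgrade pip\n",
"!pip install kfp==1.1.1\n",
"!pip install kubeflow-katib==0.10.1\n",
"!pip install kfp-tekton==0.4.0\n",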

@ScrapCodes (Contributor, Author) commented Mar 3, 2021

@Tomcli The MPIJob-based sample is removed.

Do you think we should look at converting it into a Job-based sample at a later point?

Also, I have included notebook outputs from an actual notebook run; if this is not required, the last commit can be removed.

@Tomcli (Member) commented Mar 3, 2021

@ScrapCodes We only want to make sure Katib can work with KFP-Tekton. For the MPI Job example, the issue is with the MPI operator, so we don't have to maintain it in our repo.

/lgtm
/approve

@google-oss-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ScrapCodes, Tomcli

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-oss-robot merged commit 4ebca9d into kubeflow:master on Mar 3, 2021
@ScrapCodes (Contributor, Author)

@Tomcli Thanks a lot for reviewing and getting it merged :)

Successfully merging this pull request may close these issues:

Sample: katib is not up to date with kfp 1.3.0, and thus fails to run.