fix: update the api to latest kubeflow pipelines for katib sample [Fixes #467] #468
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: ScrapCodes The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Currently, I am running into:
I have tried with different kubernetes client versions. Results in same error as above. |
After making these changes: 5075821
Thanks @ScrapCodes. The KFP DSL can only accept basic types such as string and int, because it needs to convert them into a new type. You can try to use this example and lock the Katib SDK to 0.10.1, since I'm not sure whether they introduced any new spec recently. |
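One way to stay within that basic-type limit is to pass a nested spec as a single JSON string and decode it where it is consumed. The following is a minimal sketch under that assumption; the helper names `spec_to_param` and `param_to_spec` and the trimmed spec fields are hypothetical and are not taken from the Katib SDK or from this PR.

```python
import json


def spec_to_param(spec: dict) -> str:
    """Serialise a nested spec into a JSON string, a basic type a parameter can carry."""
    return json.dumps(spec, sort_keys=True)


def param_to_spec(param: str) -> dict:
    """Decode the JSON string back into a dict on the consuming side."""
    return json.loads(param)


# Hypothetical, heavily trimmed experiment spec used only for illustration.
example_spec = {
    "maxTrialCount": 6,
    "parallelTrialCount": 2,
    "objective": {"type": "maximize", "objectiveMetricName": "accuracy"},
}

param = spec_to_param(example_spec)
```

The round trip is lossless for JSON-compatible values, so the consumer sees the same structure that was built before submission.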
@Tomcli Thank you for taking a look. I have updated the existing sample, which is the one posted in the link you pasted above. So my guess is that it won't work either. Even if I remove all the items from the experiment spec, I get the above error, so I think it could be related to something else. Thanks! |
https://github.com/kubeflow/katib/blob/master/sdk/python/v1beta1/requirements.txt#L6 You can compile the pipeline after running After compiling, I tried to run the pipeline and saw this error. If you think this is a Katib issue, then we should open it on the Katib repo https://github.com/kubeflow/katib
Since the Katib component is not actively maintained, if you encounter too many issues with Katib we can consider moving some examples to use tfjob instead. |
@Tomcli Now, I have the same error as you have posted above. On probing katib controller logs, I have found:
Trying the mpi-job-horovod sample produced the same error as above. Filed kubeflow/katib#1435 |
Now this runs fine, but fails to finish even with very large timeout. |
@ScrapCodes we only need an example of how to run Katib with KFP-Tekton. You can replace the PR example with the new Katib example from KFP and see whether you can run it with Tekton. https://github.com/kubeflow/pipelines/blob/master/samples/contrib/kubeflow-katib/mpi-job-horovod.py The old example was created by a previous Katib committer and is now deprecated. |
Yes you can try to take MPI job example from KFP. |
@andreyvelich and @Tomcli The jobs are never marked successful and they continue to run until timeout is reached - leading to failure. |
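The behaviour described here (a job that stays in a running state until the timeout fires) can be pictured with a small polling loop. This is only a sketch of the general pattern, not the code Katib or KFP-Tekton uses; `get_status` is a hypothetical callable and the status strings "Succeeded" and "Failed" are placeholder labels.

```python
import time


def wait_for_terminal_status(get_status, timeout_s, poll_s=1.0,
                             sleep=time.sleep, clock=time.monotonic):
    """Poll get_status() until it reports a terminal state or the deadline passes."""
    deadline = clock() + timeout_s
    while True:
        status = get_status()
        if status in ("Succeeded", "Failed"):
            return status
        if clock() >= deadline:
            # The job never reached a terminal state, so the caller sees a timeout.
            return "TimedOut"
        sleep(poll_s)
```

A job whose status never leaves "Running" always ends in the timeout branch, no matter how large the timeout is, which matches the symptom reported above.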
Looks like, I am again running into some errors - when run on another IKS cluster with k8s version
Katib-controller logs look like this:
Now exploring why experiment never finishes.
Even though the suggestions are generated, this pod |
Yes, that is a known issue: kubeflow/katib#1395. |
For KFP-Tekton, the goal is to have an example for running the Katib experiment using Tekton pipeline. We don't want to maintain the experiment itself, so I recommended to replace the existing example with |
samples/katib/README.md
Outdated
## Acknowledgements

Thanks [Hougang Liu](https://github.com/hougangliu) for creating the original katib example.
- Compile compressed YAML definition of the Pipeline using Katib Experiment with
For this example, maybe add a note that we need to install mpijob controller since it doesn't come with Kubeflow 1.2 by default.
Also add a note for Kubeflow 1.2, Katib cannot run on k8s 1.19+
Hmm, I couldn't run mpijobs on k8s 1.18. These are the errors I got:
W0301 21:04:17.685243 1 reflector.go:302] pkg/mod/k8s.io/client-go@v0.15.10/tools/cache/reflector.go:98: watch of *v1.Pod ended with: too old resource version: 232224066 (232226844)
kfp-tekton needs at least k8s 1.17 to run. Have you had any success with this?
I did not install the mpi job controller, nor did I see that error. I am wondering how it worked on my k8s version 1.18.
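The two version constraints mentioned in this thread (kfp-tekton needing at least k8s 1.17, and Katib on Kubeflow 1.2 not running on k8s 1.19+) can be written down as a small range check. This is a sketch only; the function and constant names are hypothetical, and the bounds simply restate the comments above rather than any official support matrix.

```python
def parse_major_minor(version):
    """Turn a string such as '1.18.0' or 'v1.18.0' into the tuple (1, 18)."""
    parts = version.lstrip("v").split(".")
    return int(parts[0]), int(parts[1])


# Bounds restated from the review comments; treat them as assumptions.
MIN_FOR_KFP_TEKTON = (1, 17)          # "kfp-tekton needs minimum k8s 1.17"
FIRST_UNSUPPORTED_BY_KATIB = (1, 19)  # "Katib cannot run on k8s 1.19+" (Kubeflow 1.2)


def in_supported_range(version):
    """True when the cluster version satisfies both constraints at once."""
    return MIN_FOR_KFP_TEKTON <= parse_major_minor(version) < FIRST_UNSUPPORTED_BY_KATIB
```

Under these bounds only 1.17 and 1.18 clusters qualify, which is consistent with the sample being exercised on 1.18 in this PR.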
samples/katib/early-stopping.ipynb
Outdated
@@ -551,7 +551,8 @@
}
],
"source": [
"kfp.Client().create_run_from_pipeline_func(median_stop, arguments={})"
"from kfp_tekton._client import TektonClient",
You need a newline at the end of this line of code, otherwise the Jupyter notebook will not be able to run it.
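The reason is that a notebook file stores each cell's code as a list of strings, and those strings are concatenated as-is when the cell is loaded. The sketch below illustrates this with stand-in statements; the helper names are hypothetical, and the real cell imports `TektonClient` rather than `json`.

```python
def cell_source_to_code(source_lines):
    """Join a cell's "source" entries the way they are stored: with no separator."""
    return "".join(source_lines)


def is_valid_python(code):
    """Report whether the joined cell text parses as Python."""
    try:
        compile(code, "<cell>", "exec")
        return True
    except SyntaxError:
        return False


# Stand-in statements used only to show the effect of the missing "\n".
missing_newline = ["import json", "print(json.dumps({}))"]
with_newline = ["import json\n", "print(json.dumps({}))"]

broken = cell_source_to_code(missing_newline)  # both statements fused onto one line
fixed = cell_source_to_code(with_newline)      # two separate statements
```

Only the last entry of a cell's source list can safely omit the trailing newline.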
Also, at the top, run !pip install kfp-tekton==0.4.0 to let it run on Kubeflow 1.2 with Tekton.
@ScrapCodes the notebook is good after adding the right dependencies (kfp-tekton==0.4.0). |
I was able to run it without adding the above dependency. It could be that I installed it as part of some previous notebooks, and my notebook server has been running for weeks. I will add it, thanks for catching it.
It seemed to work on 1.18 cluster.
Suggestions are generated as above, but the status always remains running. (I would say the run is buggy, as the UI reports it as "timed out".) I have tried it several times, with the same observations. |
The suggestion is always running because the mpijob was never created on your cluster, since you don't have the MPI operator. In my case, the suggestion is also running forever because my mpijobs aren't able to create pods to run the actual workloads. I think this is an MPI operator issue, as Katib runs fine in the other example. |
Maybe we can remove the MPI job example for now since this won't come as the default installation. |
samples/katib/early-stopping.ipynb
Outdated
@@ -201,6 +201,7 @@
"# Update the PIP version.\n",
"!python -m pip install --upgrade pip\n",
"!pip install kfp==1.1.1\n",
"!pip install kfp-tekton==0.4.0\n",
Maybe put it under kubeflow-katib. I realized kubeflow-katib is using kubernetes==10.0.1, which is breaking the kfp client package.
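Because the order of the install lines decides which pin wins, it can help to print what actually ended up installed after the cell runs. The following is a small sketch using only the standard library (Python 3.8+); the helper names are hypothetical and the package list just mirrors the names mentioned in this thread.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string, or None when the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def report_versions(packages):
    """Map each package name to its installed version (or None)."""
    return {name: installed_version(name) for name in packages}


# Package names taken from the discussion above.
versions = report_versions(["kfp", "kfp-tekton", "kubeflow-katib", "kubernetes"])
```

Printing `versions` in the notebook right after the install cell shows which `kubernetes` client version was left in place by the last install.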
@Tomcli The MPIJob-based sample is removed. Do you think we should look at converting it into a job-based sample at a later point? Also, I have included notebook outputs from an actual notebook run; if this is not required, the last commit can be removed. |
@ScrapCodes We only want to make sure Katib can work with KFP-Tekton. The problem with the MPI Job example lies in the MPI operator, so we don't have to maintain it in our repo. /lgtm |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: ScrapCodes, Tomcli The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
@Tomcli Thanks a lot for reviewing and getting it merged :) |
Which issue is resolved by this Pull Request:
Resolves #467 and maybe #411
Description of your changes:
Update the API to run with KFP 1.3.0
Environment tested:
python --version: Python 3.8.5
tkn version: Client version: 0.15.0
kubectl version: 1.18.0
/etc/os-release: Mac OS X
Checklist:
Do you want this pull request (PR) cherry-picked into the current release branch?
Learn more about cherry-picking updates into the release branch.