Make the kubeflow-m2m-oidc-configurator a CronJob #2667

kromanow94
Contributor

Which issue is resolved by this Pull Request:
Resolves #2646

Description of your changes:
Changing the Job to a CronJob makes the setup more robust in case the JWKS changes or a user accidentally overwrites the RequestAuthentication.

Checklist:

  • Tested on kind and on vcluster.

@kromanow94
Contributor Author

kromanow94 commented Apr 4, 2024

@juliusvonkohout or @kimwnasptd, can we restart the tests? Both of them failed because of an unrelated issue:

timed out waiting for the condition on pods/kubeflow-m2m-oidc-configurator-28537425-s8kzm
timed out waiting for the condition on pods/activator-bd5fdc585-rrnqf
timed out waiting for the condition on pods/autoscaler-5655dd9df5-4knpj
timed out waiting for the condition on pods/controller-5447f77dc5-ljx5r
timed out waiting for the condition on pods/domain-mapping-757799d898-knf69
timed out waiting for the condition on pods/domainmapping-webhook-5d875ccb7d-z2qjv
timed out waiting for the condition on pods/net-istio-controller-5f89595bcb-dv7h2
timed out waiting for the condition on pods/net-istio-webhook-dc448cfc4-rws5f
timed out waiting for the condition on pods/webhook-578c5cf66f-25sf9
timed out waiting for the condition on pods/coredns-5dd5756b68-hpg77
timed out waiting for the condition on pods/coredns-5dd5756b68-vv66m
timed out waiting for the condition on pods/etcd-kind-control-plane
timed out waiting for the condition on pods/kindnet-9l886
timed out waiting for the condition on pods/kindnet-pftsz
timed out waiting for the condition on pods/kindnet-z5qpl
timed out waiting for the condition on pods/kube-apiserver-kind-control-plane
timed out waiting for the condition on pods/kube-controller-manager-kind-control-plane
timed out waiting for the condition on pods/kube-proxy-64vj7
timed out waiting for the condition on pods/kube-proxy-vk4lr
timed out waiting for the condition on pods/kube-proxy-xwm8d
timed out waiting for the condition on pods/kube-scheduler-kind-control-plane
timed out waiting for the condition on pods/local-path-provisioner-7577fdbbfb-7zv5k
timed out waiting for the condition on pods/oauth2-proxy-86d8c97455-hvjl8
timed out waiting for the condition on pods/oauth2-proxy-86d8c97455-z9vjw
Error: Process completed with exit code 1.

@juliusvonkohout
Member

@KRomanov, I restarted the tests. If they fail again, we might have to increase the timeouts in this PR.

@kromanow94 kromanow94 force-pushed the make-the-oidc-configurator-a-cronjob branch from b98a24d to 4abca40 Compare April 11, 2024 13:41
@kromanow94
Contributor Author

@juliusvonkohout this is super weird. I limited the CronJob with concurrencyPolicy: Forbid. I don't know whether this should be handled by increasing the timeout or by increasing the resources for the CICD jobs... I can also try to split the installation steps to limit how many pods are created at the same time...

@juliusvonkohout
Member

juliusvonkohout commented Apr 15, 2024

I restarted the tests. Yeah, our CICD is a bit problematic at the moment. If we can specify more resources in this public repository, then yes; otherwise we have to increase the timeouts. https://docs.github.com/en/actions/using-github-hosted-runners/about-github-hosted-runners/about-github-hosted-runners#standard-github-hosted-runners-for-public-repositories

@kromanow94
Contributor Author

@juliusvonkohout maybe the issue is with CICD resource sharing? If memory and CPU are shared between multiple workflows, that may be problematic. I see that one of the failing tests completed successfully. Can you restart the last test workflow?

Also, is this something I could do myself, for example through the GitHub bot with commands in a comment?

name: kubeflow-m2m-oidc-configurator
namespace: istio-system
spec:
schedule: '* * * * *'
Member

Should we not go with every 5 minutes instead of every minute?

Contributor Author

I can change it to every 5 minutes. The CronJob is also configured not to start a new job until the previous one has completed, and the latest logs from the CICD workflows show that no more than one job is created at a time.

defaultMode: 0777
items:
- key: script.sh
path: script.sh
Member

Are you sure that script.sh is idempotent?

Contributor Author

Huh, well, it doesn't verify whether the JWKS is present and always performs the patch, so this could be improved. I think the JWKS value should also be compared and only patched if it differs.

Contributor Author

I made changes so that the script first checks the JWKS present in the RequestAuthentication and only patches it if it is not equal to the desired JWKS.
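
For illustration, a compare-before-patch check of this kind could look roughly like the sketch below. This is not the script.sh from this PR: the function names and arguments are hypothetical, and it assumes that the JWKS is a single-line JSON document stored at spec.jwtRules[0].jwks of the RequestAuthentication and that this rule already exists.

```shell
#!/bin/sh
# Hypothetical sketch, not the script.sh from this PR.
# Assumption: the JWKS lives at spec.jwtRules[0].jwks and is a single line.

# Escape backslashes and double quotes so the JWKS can be embedded
# as a JSON string value inside the patch document.
json_escape() {
  printf '%s' "$1" | sed 's/\\/\\\\/g; s/"/\\"/g'
}

# Patch the RequestAuthentication only when the stored JWKS differs
# from the desired one, so repeated runs are no-ops.
# Arguments: <name> <namespace> <desired JWKS>
reconcile_jwks() {
  name="$1"
  namespace="$2"
  desired="$3"

  current="$(kubectl get requestauthentication "$name" -n "$namespace" \
    -o jsonpath='{.spec.jwtRules[0].jwks}')"

  if [ "$current" = "$desired" ]; then
    echo "JWKS already up to date, skipping patch"
    return 0
  fi

  echo "JWKS differs, patching"
  kubectl patch requestauthentication "$name" -n "$namespace" --type=json \
    -p "[{\"op\":\"add\",\"path\":\"/spec/jwtRules/0/jwks\",\"value\":\"$(json_escape "$desired")\"}]"
}
```

A call such as `reconcile_jwks "$RA_NAME" istio-system "$DESIRED_JWKS"` (both variables are placeholders) would then leave the resource untouched on every run where the JWKS has not changed.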

@juliusvonkohout
Member

> @juliusvonkohout maybe the issue is with CICD resource sharing? If the memory and cpu is shared between multiple workflows, it may be problematic. I see one of the failing tests completed with success. Can you restart the last test workflow?
>
> Also, is this something I could do myself, for example with the github bot with commands in comment?

I did restart it, and it failed again. In the KFP repository that was possible with /retest or /retest-failed or similar. It is probably something I can investigate in the next few weeks when I am less busy.

@kromanow94
Contributor Author

@juliusvonkohout maybe we could add verbosity to the logs in CICD GH Workflows? We currently know that the pods aren't ready but what is the actual reason? DockerHub pull rate limits? Not enough resources? Failing Pod?
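
As a rough idea of what such extra verbosity could look like, the sketch below dumps pod status, recent events, and pod descriptions for a namespace after a failed wait. The helper name is hypothetical and nothing like it is confirmed to exist in the repository. It only uses standard kubectl subcommands. Events would typically show image pull errors or scheduling failures caused by insufficient resources.

```shell
#!/bin/sh
# Hypothetical helper, not part of the existing CICD workflows.
# Prints diagnostics for a namespace whose pods did not become ready.
# Arguments: <namespace>
dump_pod_diagnostics() {
  namespace="$1"

  echo "=== Pods in ${namespace} ==="
  kubectl get pods -n "$namespace" -o wide

  echo "=== Recent events in ${namespace} ==="
  kubectl get events -n "$namespace" --sort-by=.lastTimestamp

  echo "=== Pod descriptions in ${namespace} ==="
  kubectl describe pods -n "$namespace"
}

# Possible usage in a workflow step (namespace and timeout are examples):
#   kubectl wait --for=condition=Ready pods --all -n istio-system --timeout=180s \
#     || { dump_pod_diagnostics istio-system; exit 1; }
```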

@juliusvonkohout
Member

> @juliusvonkohout maybe we could add verbosity to the logs in CICD GH Workflows? We currently know that the pods aren't ready but what is the actual reason? DockerHub pull rate limits? Not enough resources? Failing Pod?

Yes, let's do that in a separate PR with @codablock as well.

@juliusvonkohout
Member

The tests in #2696 were successful, so I reran the tests and hope that the CICD is happy now. If not, please rebase the PR against the master branch.

@juliusvonkohout
Member

https://github.com/kubeflow/manifests/actions/runs/8891109875 here is the successful test.

@juliusvonkohout
Member

So we need a rebase and step by step debugging with minimal changes.

@juliusvonkohout
Member

/hold

@juliusvonkohout
Member

/retest

@kromanow94 kromanow94 force-pushed the make-the-oidc-configurator-a-cronjob branch from 4abca40 to 0a707ba Compare June 12, 2024 07:08
@kromanow94 kromanow94 force-pushed the make-the-oidc-configurator-a-cronjob branch from 0a707ba to e791d53 Compare June 12, 2024 07:12
kromanow94 and others added 5 commits June 13, 2024 15:21
Signed-off-by: Krzysztof Romanowski <krzysztof.romanowski.kr3@roche.com>
Signed-off-by: Krzysztof Romanowski <krzysztof.romanowski94@gmail.com>
Signed-off-by: Krzysztof Romanowski <krzysztof.romanowski94@gmail.com>
It was tested with self-hosted runner using custom dockerconfig credentials for debugging.

Signed-off-by: Krzysztof Romanowski <krzysztof.romanowski94@gmail.com>
…tor.yaml

Signed-off-by: Krzysztof Romanowski <krzysztof.romanowski94@gmail.com>
Signed-off-by: Krzysztof Romanowski <krzysztof.romanowski94@gmail.com>
@kromanow94 kromanow94 force-pushed the make-the-oidc-configurator-a-cronjob branch from 4eb270b to 8d54066 Compare June 13, 2024 15:23
@kromanow94
Contributor Author

@diegolovison this is the PR we discussed on the Manifests WG call.

@juliusvonkohout
Member

juliusvonkohout commented Jun 13, 2024

@kimwnasptd @rimolive please also review

@juliusvonkohout juliusvonkohout self-assigned this Jun 13, 2024
@juliusvonkohout
Member

/lgtm

@google-oss-prow google-oss-prow bot added the lgtm label Jun 13, 2024
@juliusvonkohout
Member

/hold

@kromanow94
Contributor Author

Huh, the CICD failed again with:

Reason: Unauthorized
HTTP response headers: HTTPHeaderDict({'x-powered-by': 'Express', 'www-authenticate': '***"http://10.96.103.33:8888/pipeline/apis/v2beta1/experiments?filter=%7B%22predicates%22%3A+%5B%7B%22operation%22%3A+1%2C+%22key%22%3A+%22display_name%22%2C+%22stringValue%22%3A+%22m2m-test%22%7D%5D%7D&namespace=kubeflow-user-example-com", error="invalid_token"', 'content-length': '22', 'content-type': 'text/plain', 'date': 'Thu, 13 Jun 2024 15:35:11 GMT', 'server': 'istio-envoy', 'x-envoy-upstream-service-time': '3'})

https://github.com/kubeflow/manifests/actions/runs/9502404627/job/26190294755?pr=2667

However, only one of the three m2m CICD workflows failed... I'll have another look and try to pinpoint why this is happening. I guess the fix should be a rather small change.

@diegolovison
Contributor

Going to wait for the feedback

@google-oss-prow google-oss-prow bot removed the lgtm label Jun 13, 2024
@kromanow94 kromanow94 force-pushed the make-the-oidc-configurator-a-cronjob branch from aa466e7 to 6972652 Compare June 13, 2024 19:19
Signed-off-by: Krzysztof Romanowski <krzysztof.romanowski94@gmail.com>
@kromanow94 kromanow94 force-pushed the make-the-oidc-configurator-a-cronjob branch from 6972652 to c830a7a Compare June 13, 2024 19:22
@kromanow94
Contributor Author

I changed the script so that it patches and then verifies that the patch with the JWKS persisted. If it did not persist, the Pod finishes with a failure. The CronJob is configured to restart a failed Job Pod 3 times. If that also fails, ./tests/gh-actions/wait_for_kubeflow_m2m_oidc_configurator.sh will also fail.
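
The verification step described here might be sketched as follows. Again, this is an illustration and not the code merged in this PR: the function name and arguments are hypothetical, and the sketch assumes the JWKS is stored at spec.jwtRules[0].jwks of the RequestAuthentication.

```shell
#!/bin/sh
# Hypothetical sketch of a post-patch check, not the merged script.
# Assumption: the JWKS lives at spec.jwtRules[0].jwks.

# Re-read the RequestAuthentication and fail when the stored JWKS
# does not match the desired one.
# Arguments: <name> <namespace> <desired JWKS>
verify_jwks_persisted() {
  name="$1"
  namespace="$2"
  desired="$3"

  current="$(kubectl get requestauthentication "$name" -n "$namespace" \
    -o jsonpath='{.spec.jwtRules[0].jwks}')"

  if [ "$current" != "$desired" ]; then
    echo "JWKS patch did not persist" >&2
    return 1
  fi

  echo "JWKS patch persisted"
}

# Possible usage at the end of the script, so the Pod exits non-zero
# and the Job's retry handling takes over:
#   verify_jwks_persisted "$RA_NAME" istio-system "$DESIRED_JWKS" || exit 1
```

Returning a non-zero status here is what would let the failure propagate to the Pod, the Job retries, and eventually the wait script.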

@diegolovison , @juliusvonkohout , @kimwnasptd , please review.

@juliusvonkohout
Member

@diegolovison can you test now?

@rimolive we have to merge this for rc.2

@juliusvonkohout
Member

/lgtm
/approve

There was no feedback for a week and we need it in the next RC @rimolive

@google-oss-prow google-oss-prow bot added the lgtm label Jun 21, 2024
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: juliusvonkohout

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@juliusvonkohout
Member

/unhold

@google-oss-prow google-oss-prow bot merged commit a1dbf47 into kubeflow:master Jun 21, 2024
8 checks passed
@kromanow94 kromanow94 deleted the make-the-oidc-configurator-a-cronjob branch June 21, 2024 08:18
Successfully merging this pull request may close these issues.

Make the oidc-issuer configurator a CronJob to ensure correct JWKS for the in-cluster self-signed OIDC Issuer
3 participants