Add doc to note how to cleanup Volcano completely #2079

Closed
Yikun opened this issue Mar 12, 2022 · 10 comments
Labels
help wanted: Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines.
kind/docs
kind/feature: Categorizes issue or PR as related to a new feature.
priority/important-soon: Must be staffed and worked on either currently, or very soon, ideally in time for the next release.

Comments

@Yikun
Member

Yikun commented Mar 12, 2022

What would you like to be added:

Add a doc describing how to clean up Volcano completely.

I0312 06:48:38.683192       1 event.go:291] "Event occurred" object="volcano-system/volcano-admission-54b4798bff" kind="ReplicaSet" apiVersion="apps/v1" type="Warning" reason="FailedCreate" message="Error creating: Internal error occurred: failed calling webhook \"mutatepod.volcano.sh\": failed to call webhook: Post \"https://volcano-admission-service.volcano-system.svc:443/pods/mutate?timeout=10s\": dial tcp xxx.xxx.xxx.xxx:443: connect: connection refused"

Why is this needed:

kubectl apply -f ./installer/volcano-development-arm64.yaml
kubectl delete -f ./installer/volcano-development-arm64.yaml

When reinstalling Volcano with apply and delete, we still need to clean up the validatingwebhookconfigurations and mutatingwebhookconfigurations by hand before re-applying:

kubectl delete validatingwebhookconfigurations volcano-admission-service-jobs-validate volcano-admission-service-pods-validate volcano-admission-service-queues-validate
kubectl delete mutatingwebhookconfigurations volcano-admission-service-jobs-mutate volcano-admission-service-podgroups-mutate volcano-admission-service-pods-mutate volcano-admission-service-queues-mutate

#2102 (comment): Supplement: volcano-admission-init also creates a secret, which is not cleaned up thoroughly either.
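The manual cleanup above could be wrapped in a small script. This is only a sketch: the webhook configuration names are the ones listed in this issue, while the `DRY_RUN` switch, the `run` helper, and the `cleanup_volcano_webhooks` function are hypothetical names introduced for illustration. It does not touch the secret created by volcano-admission-init, since its name is not given here.

```shell
#!/bin/sh
# Sketch: delete the cluster-scoped webhook configurations that
# `kubectl delete -f` leaves behind. Names are taken from this issue.
# DRY_RUN=1 (the default here, an assumption chosen for safety) only
# prints the commands; set DRY_RUN=0 to actually run them.
DRY_RUN="${DRY_RUN:-1}"

VALIDATING="volcano-admission-service-jobs-validate volcano-admission-service-pods-validate volcano-admission-service-queues-validate"
MUTATING="volcano-admission-service-jobs-mutate volcano-admission-service-podgroups-mutate volcano-admission-service-pods-mutate volcano-admission-service-queues-mutate"

# Hypothetical helper: print the command in dry-run mode, execute it otherwise.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

cleanup_volcano_webhooks() {
  # --ignore-not-found makes the cleanup safe to repeat on a clean cluster.
  # Word splitting of the name lists is intended, so they stay unquoted.
  run kubectl delete validatingwebhookconfigurations $VALIDATING --ignore-not-found
  run kubectl delete mutatingwebhookconfigurations $MUTATING --ignore-not-found
}

cleanup_volcano_webhooks
```

Running it once with the default dry-run setting shows exactly which objects would be removed before anything is deleted.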

@LeonardAukea

For some reason I still see some events like:

Error creating: Internal error occurred: failed calling webhook "mutatepod.volcano.sh": Post "https://volcano-admission-service.volcano-system.svc:443/pods/mutate?timeout=10s": x509: certificate has expired or is not yet valid: current time 2022-03-19T07:53:35Z is after 2021-07-03T10:00:41Z

This happens even after deleting Volcano and its validatingwebhookconfigurations and mutatingwebhookconfigurations.
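One way to check whether any Volcano webhook configuration is still registered is to list both kinds and filter by name. This is a rough sketch; `filter_volcano` and `list_leftover_webhooks` are hypothetical helper names, and it assumes the leftover objects have "volcano" in their names, as the ones in this issue do.

```shell
#!/bin/sh
# Sketch: list webhook configurations whose names still mention Volcano.

# Keep only lines containing "volcano"; never fail when nothing matches.
filter_volcano() {
  grep -i 'volcano' || true
}

list_leftover_webhooks() {
  # -o name prints one "<kind>.<group>/<name>" entry per line.
  kubectl get validatingwebhookconfigurations,mutatingwebhookconfigurations \
    -o name --request-timeout=10s | filter_volcano
}

# Only query the cluster when kubectl is installed.
if command -v kubectl >/dev/null 2>&1; then
  list_leftover_webhooks
fi
```

An empty result suggests the configurations are gone; any remaining entry can still route pod creation to a webhook that no longer exists.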

@hwdef
Member

hwdef commented Apr 1, 2022

Here are some improvements I would like to see for the webhook:

  1. Instead of generating the webhook configurations in code, write them in YAML and deploy them with YAML. This makes it easy to change the webhook's scope and to exclude some pods.
  2. If possible, narrow the webhook's scope. It currently intercepts all pods, so once Volcano has a problem, the entire cluster cannot create pods.
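To see how broad the current scope is, the registered configuration can be inspected directly. The sketch below prints each webhook's name, `failurePolicy`, and `namespaceSelector` (standard fields of a webhook configuration). The object name comes from this issue; `DRY_RUN`, `run`, `SCOPE_TEMPLATE`, and `show_webhook_scope` are hypothetical names used only for illustration.

```shell
#!/bin/sh
# Sketch: show the scope-related fields of a webhook configuration.
# DRY_RUN=1 (the default here, an assumption) only prints the command.
DRY_RUN="${DRY_RUN:-1}"

# One output line per webhook: name, failurePolicy, namespaceSelector.
SCOPE_TEMPLATE='jsonpath={range .webhooks[*]}{.name}{"\t"}{.failurePolicy}{"\t"}{.namespaceSelector}{"\n"}{end}'

# Hypothetical helper: print the command in dry-run mode, execute it otherwise.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# $1: resource kind, $2: object name
show_webhook_scope() {
  run kubectl get "$1" "$2" -o "$SCOPE_TEMPLATE"
}

show_webhook_scope mutatingwebhookconfigurations volcano-admission-service-pods-mutate
```

An empty `namespaceSelector` combined with a `failurePolicy` of `Fail` would match the behavior described above, where an unavailable webhook blocks pod creation across the cluster.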

@whybeyoung
Contributor

Here are some improvements I would like to see for the webhook:

  1. Instead of generating the webhook configurations in code, write them in YAML and deploy them with YAML. This makes it easy to change the webhook's scope and to exclude some pods.
  2. If possible, narrow the webhook's scope. It currently intercepts all pods, so once Volcano has a problem, the entire cluster cannot create pods.

Agree with you.

On the first point: the certificate is probably difficult to generate in YAML. I understand why we use a Job to generate it now: it makes installation easy for users, but it does not consider how to clean things up gracefully. My suggestion is to add a separate Job for cleanup, just as the installation does.

On the second point: I think it may affect the whole design because of how pod resource usage is calculated, so it is not a simple problem (I'm not very familiar with it yet). All in all, running Volcano in HA can avoid this.

@shinytang6
Member

I think the simplest solution for now is to add a delete hook (Job) that deletes these webhooks individually :)

@Thor-wl Thor-wl added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Apr 28, 2022
@hwdef
Member

hwdef commented Apr 29, 2022

@pietermarsman

pietermarsman commented May 3, 2022

For future users with this issue, I resolved it with:

kubectl delete validatingwebhookconfigurations volcano-admission-service-pods-validate volcano-admission-service-jobs-validate volcano-admission-service-queues-validate
kubectl delete mutatingwebhookconfigurations volcano-admission-service-podgroups-mutate volcano-admission-service-jobs-mutate volcano-admission-service-pods-mutate volcano-admission-service-queues-mutate

@kongjibai

For future users with this issue, I resolved it with:

kubectl delete validatingwebhookconfigurations volcano-admission-service-pods-validate volcano-admission-service-jobs-validate volcano-admission-service-queues-validate
kubectl delete mutatingwebhookconfigurations volcano-admission-service-podgroups-mutate volcano-admission-service-jobs-mutate volcano-admission-service-pods-mutate volcano-admission-service-queues-mutate

I have deleted the validatingwebhookconfigurations and mutatingwebhookconfigurations as you mentioned, but when I run "kubectl apply -f mpi-example.yaml", it only outputs "job.batch.volcano.sh/lm-mpi-job configured", and "kubectl get pod" outputs "No resources found in default namespace". How can I solve this problem?

@stale

stale bot commented Aug 10, 2022

Hello 👋 Looks like there was no activity on this issue for the last 90 days.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity for 60 days, this issue will be closed (we can always reopen an issue if we need!).

@stale stale bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 10, 2022
@hwdef
Member

hwdef commented Aug 12, 2022

fix in #2346
/close

@volcano-sh-bot
Contributor

@hwdef: Closing this issue.

In response to this:

fix in #2346
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@stale stale bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 12, 2022
9 participants