
How do I run an experiment on a k8s cluster with taint? #1174

Closed

hyeonsangjeon opened this issue Apr 29, 2020 · 10 comments

@hyeonsangjeon

hyeonsangjeon commented Apr 29, 2020

I have a question about running Katib on a tainted k8s cluster.
I ran Katib on a k8s cluster with 5 nodes.
Currently, all 5 nodes have a taint.
When the experiment is executed, it gets stuck in the Pending state.

Can you tell me how to run the experiment when all the nodes in the k8s cluster are tainted?
Can I put the toleration setting in the experiment YAML?

 kubectl describe node [MY_NODE] | grep Taints

Taints:             DHPWORK=CLUSTER:NoSchedule
Taints:             dedicated=infra:NoSchedule
Taints:             dedicated=infra:NoSchedule
Taints:             DHPWORK=CLUSTER:NoSchedule
Taints:             dedicated=infra:NoSchedule
kubectl describe pod jhstest-random-585d5fcfb7-snx7c -n kubeflow
Name:               jhstest-random-585d5fcfb7-snx7c
Namespace:          kubeflow
Priority:           0
PriorityClassName:  <none>
Node:               <none>
Labels:             deployment=jhstest-random
                    experiment=jhstest
                    pod-template-hash=585d5fcfb7
                    suggestion=jhstest
Annotations:        kubernetes.io/psp: ibm-anyuid-hostpath-psp
Status:             Pending
IP:                 
Controlled By:      ReplicaSet/jhstest-random-585d5fcfb7
Containers:
  suggestion:
    Image:      gcr.io/kubeflow-images-public/katib/v1alpha3/suggestion-hyperopt
    Port:       6789/TCP
    Host Port:  0/TCP
    Limits:
      cpu:     500m
      memory:  100Mi
    Requests:
      cpu:        50m
      memory:     10Mi
    Liveness:     exec [/bin/grpc_health_probe -addr=:6789 -service=manager.v1alpha3.Suggestion] delay=10s timeout=1s period=120s #success=1 #failure=12
    Readiness:    exec [/bin/grpc_health_probe -addr=:6789 -service=manager.v1alpha3.Suggestion] delay=10s timeout=1s period=10s #success=1 #failure=3
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-rj72j (ro)
Conditions:
  Type           Status
  PodScheduled   False 
Volumes:
  default-token-rj72j:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-rj72j
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/memory-pressure:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason            Age                     From               Message
  ----     ------            ----                    ----               -------
  Warning  FailedScheduling  4m26s (x124 over 110m)  default-scheduler  0/5 nodes are available: 5 node(s) had taints that the pod didn't tolerate.

I ran the test YAML below, but it didn't work.

apiVersion: "kubeflow.org/v1alpha3"
kind: Experiment
metadata:
  namespace: kubeflow
  labels:
    controller-tools.k8s.io: "1.0"
  name: random-example
spec:
  tolerations:
  - effect: NoSchedule
    key: dedicated
    operator: Equal
    value: infra
  objective:
    type: maximize
    goal: 0.99
    objectiveMetricName: Validation-accuracy
    additionalMetricNames:
      - Train-accuracy
   .....

The pod also stays in the Pending state when the toleration is set in the trialTemplate:

  trialTemplate:
    goTemplate:
        rawTemplate: |-
          apiVersion: batch/v1
          kind: Job
          metadata:
            name: {{.Trial}}
            namespace: {{.NameSpace}}
          spec:
            template:
              spec:
                containers:
                - name: {{.Trial}}
                  image: docker.io/kubeflowkatib/mxnet-mnist
                  command:
                  - "python3"
                  - "/opt/mxnet-mnist/mnist.py"
                  - "--batch-size=64"
                  {{- with .HyperParameters}}
                  {{- range .}}
                  - "{{.Name}}={{.Value}}"
                  {{- end}}
                  {{- end}}
                restartPolicy: Never
                tolerations:
                - key: DHPWORK
                  operator: Equal
                  value: CLUSTER
                  effect: NoSchedule
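
The Pending pod in the describe output above is the Suggestion pod (it is controlled by a ReplicaSet and runs the suggestion container), not a Trial Job pod, so the toleration in the trialTemplate does not apply to it. A quick way to confirm which pods are stuck and which tolerations they actually carry (a sketch; the pod name is taken from the output above):

kubectl get pods -n kubeflow --field-selector=status.phase=Pending
kubectl get pod jhstest-random-585d5fcfb7-snx7c -n kubeflow -o jsonpath='{.spec.tolerations}'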
@issue-label-bot

Issue-Label Bot is automatically applying the labels:

Label Probability
question 0.89


@issue-label-bot

Issue-Label Bot is automatically applying the labels:

Label Probability
area/katib 0.96


@jsga

jsga commented Nov 10, 2020

Hi @hyeonsangjeon,
it has been a while since you opened the issue. Did you manage to solve it?

@andreyvelich
Member

@hyeonsangjeon @jsga Currently you can't specify tolerations for the Experiment Suggestion's Pod, but you can do it for the Experiment Trials' pods, since the Trial spec is just a template.

Check here: https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/#concepts for how you can remove the taint from one of your Nodes.
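
For example, a minimal sketch of removing one of the taints listed above (the trailing "-" removes the taint; substitute your node name for the placeholder):

kubectl taint nodes [MY_NODE] dedicated=infra:NoSchedule-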

@andreyvelich
Member

Feel free to re-open the issue if it is needed.

@ekesken

ekesken commented Oct 9, 2022

I believe we need a solution for the Experiment Suggestion's Pod as well; we can't force people to remove the taints from their nodes.

For example, in our specific case, we are the team providing Katib as a service in customer clusters. We have our own nodes for deploying the Katib service, which are protected by taints so that user workloads are not placed on them, and we can't guarantee there will be a non-tainted node to run our end-to-end tests in those clusters.

So it's a blocker issue for us.

@tenzen-y
Member

tenzen-y commented Oct 11, 2022

@ekesken Thanks for your comments.
We have a similar feature request in #1737.

@tom-pavz

@andreyvelich @tenzen-y

I see that PR #2000 was included in the v0.15.0 release (https://github.com/kubeflow/katib/releases/tag/v0.15.0), but I do not think those changes addressed the feature request of this specific issue: being able to set the nodeSelector / tolerations of the Suggestion pod.

Can we re-open this issue, please? It seems like some of the features in #1737 were addressed, but not those of this issue. Please let me know if I am misunderstanding anything. I would just love to see the ability to set the nodeSelector / tolerations of the Suggestion pod, and it seems like there are others with the same request, so I would love to see it get on the roadmap.

@andreyvelich
Member

Hi @tom-pavz, that's right, we don't support nodeSelector yet.
Our long-term plan is to give users control over the whole Suggestion Deployment spec.
I re-opened issue #1737.
All contributions are welcome!
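
In the meantime, one possible stopgap (not an official Katib feature, just a sketch; the Deployment name below is the Suggestion Deployment generated for the example earlier in this thread, and the Katib controller may revert the change when it reconciles) is to patch tolerations into the generated Suggestion Deployment after the Experiment is created:

kubectl patch deployment jhstest-random -n kubeflow -p \
  '{"spec":{"template":{"spec":{"tolerations":[{"key":"dedicated","operator":"Equal","value":"infra","effect":"NoSchedule"}]}}}}'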

@tom-pavz

Hi @tom-pavz, that's right, we don't support nodeSelector yet. Our long-term plan is to give users control over the whole Suggestion Deployment spec. I re-opened issue #1737. All contributions are welcome!

Thanks so much! 🤞
