
How do I run an experiment on a k8s cluster with taint? #1174

Closed

hyeonsangjeon opened this issue Apr 29, 2020 · 10 comments

@hyeonsangjeon

hyeonsangjeon commented Apr 29, 2020

I have a question about running Katib on a tainted k8s cluster.
I ran Katib on a k8s cluster with 5 nodes.
Currently, all 5 nodes have a taint.
When the experiment is executed, it gets stuck in the Pending state.

Can you tell me how to run the experiment when all the nodes in the k8s cluster are tainted?
Can I put the toleration setting in the experiment YAML?

 kubectl describe node [MY_NODE] | grep Taints

Taints:             DHPWORK=CLUSTER:NoSchedule
Taints:             dedicated=infra:NoSchedule
Taints:             dedicated=infra:NoSchedule
Taints:             DHPWORK=CLUSTER:NoSchedule
Taints:             dedicated=infra:NoSchedule
kubectl describe pod jhstest-random-585d5fcfb7-snx7c -n kubeflow
Name:               jhstest-random-585d5fcfb7-snx7c
Namespace:          kubeflow
Priority:           0
PriorityClassName:  <none>
Node:               <none>
Labels:             deployment=jhstest-random
                    experiment=jhstest
                    pod-template-hash=585d5fcfb7
                    suggestion=jhstest
Annotations:        kubernetes.io/psp: ibm-anyuid-hostpath-psp
Status:             Pending
IP:                 
Controlled By:      ReplicaSet/jhstest-random-585d5fcfb7
Containers:
  suggestion:
    Image:      gcr.io/kubeflow-images-public/katib/v1alpha3/suggestion-hyperopt
    Port:       6789/TCP
    Host Port:  0/TCP
    Limits:
      cpu:     500m
      memory:  100Mi
    Requests:
      cpu:        50m
      memory:     10Mi
    Liveness:     exec [/bin/grpc_health_probe -addr=:6789 -service=manager.v1alpha3.Suggestion] delay=10s timeout=1s period=120s #success=1 #failure=12
    Readiness:    exec [/bin/grpc_health_probe -addr=:6789 -service=manager.v1alpha3.Suggestion] delay=10s timeout=1s period=10s #success=1 #failure=3
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-rj72j (ro)
Conditions:
  Type           Status
  PodScheduled   False 
Volumes:
  default-token-rj72j:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-rj72j
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/memory-pressure:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason            Age                     From               Message
  ----     ------            ----                    ----               -------
  Warning  FailedScheduling  4m26s (x124 over 110m)  default-scheduler  0/5 nodes are available: 5 node(s) had taints that the pod didn't tolerate.

I ran the test YAML below, but it didn't work.

apiVersion: "kubeflow.org/v1alpha3"
kind: Experiment
metadata:
  namespace: kubeflow
  labels:
    controller-tools.k8s.io: "1.0"
  name: random-example
spec:
  tolerations:
  - effect: NoSchedule
    key: dedicated
    operator: Equal
    value: infra
  objective:
    type: maximize
    goal: 0.99
    objectiveMetricName: Validation-accuracy
    additionalMetricNames:
      - Train-accuracy
   .....

The pod also stays in the Pending state when the toleration is set in the trialTemplate:

  trialTemplate:
    goTemplate:
        rawTemplate: |-
          apiVersion: batch/v1
          kind: Job
          metadata:
            name: {{.Trial}}
            namespace: {{.NameSpace}}
          spec:
            template:
              spec:
                containers:
                - name: {{.Trial}}
                  image: docker.io/kubeflowkatib/mxnet-mnist
                  command:
                  - "python3"
                  - "/opt/mxnet-mnist/mnist.py"
                  - "--batch-size=64"
                  {{- with .HyperParameters}}
                  {{- range .}}
                  - "{{.Name}}={{.Value}}"
                  {{- end}}
                  {{- end}}
                restartPolicy: Never
                tolerations:
                - key: DHPWORK
                  operator: Equal
                  value: CLUSTER
                  effect: NoSchedule
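
The Pending pod in the describe output above is the Suggestion pod (it is controlled by a ReplicaSet and runs the suggestion container), not a Trial Job pod, so the toleration in the trialTemplate does not apply to it. A quick way to confirm which pods are stuck and which tolerations they actually carry (a sketch; the pod name is taken from the output above):

kubectl get pods -n kubeflow --field-selector=status.phase=Pending
kubectl get pod jhstest-random-585d5fcfb7-snx7c -n kubeflow -o jsonpath='{.spec.tolerations}'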
@issue-label-bot

Issue-Label Bot is automatically applying the labels:

Label Probability
question 0.89


@issue-label-bot

Issue-Label Bot is automatically applying the labels:

Label Probability
area/katib 0.96


@jsga

jsga commented Nov 10, 2020

Hi @hyeonsangjeon,
it has been a while since you opened the issue. Did you manage to solve it?

@andreyvelich
Member

@hyeonsangjeon @jsga Currently you can't specify tolerations for the Experiment Suggestion's Pod, but you can do it for the Experiment Trials' pods, since the Trial spec is just a template.

Check here: https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/#concepts for how you can remove the taint from one of your Nodes.
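
For example, a minimal sketch of removing one of the taints listed above (the trailing "-" removes the taint; substitute your node name for the placeholder):

kubectl taint nodes [MY_NODE] dedicated=infra:NoSchedule-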

@andreyvelich
Member

Feel free to re-open the issue if it is needed.

@ekesken

ekesken commented Oct 9, 2022

I believe we need a solution for the Experiment Suggestion's Pod as well; we can't force people to remove the taints from their nodes.

For example, in our specific case, we are the team providing Katib as a service in customer clusters. We have our own nodes for deploying the Katib service, which are protected by taints so that user workloads are not placed on them, and we can't guarantee there will be a non-tainted node to run our end-to-end tests in those clusters.

So it's a blocker issue for us.

@tenzen-y
Member

tenzen-y commented Oct 11, 2022

@ekesken Thanks for your comments.
We have a similar feature request in #1737.

@tom-pavz

@andreyvelich @tenzen-y

I see that PR #2000 was included in the v0.15.0 release (https://github.com/kubeflow/katib/releases/tag/v0.15.0), but I do not think those changes addressed the feature request of this specific issue: being able to set the nodeSelector / tolerations of the Suggestion pod.

Can we re-open this issue, please? It seems like some of the features in #1737 were addressed, but not those of this issue. Please let me know if I am misunderstanding anything. I would just love to see the ability to set the nodeSelector / tolerations of the Suggestion pod, and it seems like there are others with the same request, so I would love to see it get on the roadmap.

@andreyvelich
Member

Hi @tom-pavz, that's right, we don't support nodeSelector yet.
Our long-term plan is to give users control over the whole Suggestion Deployment spec.
I re-opened issue #1737.
All contributions are welcome!
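
In the meantime, one possible stopgap (not an official Katib feature, just a sketch; the Deployment name below is the Suggestion Deployment generated for the example earlier in this thread, and the Katib controller may revert the change when it reconciles) is to patch tolerations into the generated Suggestion Deployment after the Experiment is created:

kubectl patch deployment jhstest-random -n kubeflow -p \
  '{"spec":{"template":{"spec":{"tolerations":[{"key":"dedicated","operator":"Equal","value":"infra","effect":"NoSchedule"}]}}}}'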

@tom-pavz

Hi @tom-pavz, that's right, we don't support nodeSelector yet. Our long-term plan is to give users control over the whole Suggestion Deployment spec. I re-opened issue #1737. All contributions are welcome!

Thanks so much! 🤞
