studyJob should fail when specifying invalid worker kind #285

Closed
hougangliu opened this issue Dec 12, 2018 · 2 comments · Fixed by #287

@hougangliu
Member

When I specify an invalid WorkerSpec kind (TFJobx in the example below), no worker is created, but the StudyJob status stays Running with the trials still attached.

kubectl get studyjob tfjob-example -n kubeflow -o yaml

apiVersion: kubeflow.org/v1alpha1
kind: StudyJob
metadata:
  creationTimestamp: 2018-12-12T06:12:21Z
  generation: 1
  labels:
    controller-tools.k8s.io: "1.0"
  name: tfjob-example
  namespace: kubeflow
  resourceVersion: "13722220"
  selfLink: /apis/kubeflow.org/v1alpha1/namespaces/kubeflow/studyjobs/tfjob-example
  uid: e2cdec25-fdd4-11e8-af31-005056ad997c
spec:
  metricsCollectorSpec:
    goTemplate:
      rawTemplate: |-
        apiVersion: batch/v1beta1
        kind: CronJob
        metadata:
          name: {{.WorkerID}}
          namespace: kubeflow
        spec:
          schedule: "*/1 * * * *"
          successfulJobsHistoryLimit: 0
          failedJobsHistoryLimit: 1
          jobTemplate:
            spec:
              template:
                spec:
                  containers:
                  - name: {{.WorkerID}}
                    image: gcr.io/kubeflow-ci/katib/tfevent-metrics-collector:v0.1.2-alpha-77-g9324cad
                    args:
                    - "python"
                    - "main.py"
                    - "-m"
                    - "vizier-core"
                    - "-s"
                    - "{{.StudyID}}"
                    - "-w"
                    - "{{.WorkerID}}"
                    - "-d"
                    - "/train/{{.WorkerID}}"
                    volumeMounts:
                    - mountPath: "/train"
                      name: "train"
                  volumes:
                  - name: "train"
                    persistentVolumeClaim:
                      claimName: "tfevent-volume"
                  restartPolicy: Never
                  serviceAccountName: metrics-collector
  metricsnames:
  - accuracy_1
  objectivevaluename: accuracy_1
  optimizationgoal: 0.99
  optimizationtype: maximize
  owner: crd
  parameterconfigs:
  - feasible:
      max: "0.05"
      min: "0.01"
    name: --learning_rate
    parametertype: double
  - feasible:
      max: "200"
      min: "100"
    name: --batch_size
    parametertype: int
  requestcount: 4
  studyName: tfjob-example
  suggestionSpec:
    requestNumber: 3
    suggestionAlgorithm: random
    suggestionParameters:
    - name: SuggestionCount
      value: "0"
  workerSpec:
    goTemplate:
      rawTemplate: |-
        apiVersion: "kubeflow.org/v1beta1"
        kind: TFJobx
        metadata:
          name: {{.WorkerID}}
          namespace: kubeflow
        spec:
          tfReplicaSpecs:
            Worker:
              replicas: 1
              restartPolicy: Never
              template:
                spec:
                  containers:
                  - name: tensorflow
                    image: gcr.io/kubeflow-ci/tf-mnist-with-summaries:1.0
                    command:
                    - "python"
                    - "/var/tf_mnist/mnist_with_summaries.py"
                    - "--log_dir=/train/{{.WorkerID}}"
                    {{- with .HyperParameters}}
                    {{- range .}}
                    - "{{.Name}}={{.Value}}"
                    {{- end}}
                    {{- end}}
                    volumeMounts:
                    - mountPath: "/train"
                      name: "train"
                  volumes:
                  - name: "train"
                    persistentVolumeClaim:
                      claimName: "tfevent-volume"
status:
  conditon: Running
  earlyStoppingParameterId: ""
  studyid: ab5bcacd42e3d071
  suggestionCount: 2
  suggestionParameterId: j51d3c50a38c7c74
  trials:
  - trialid: ma6bbb7db6889bd6
    workeridlist:
    - completionTime: null
      conditon: Created
      kind: TFJobx
      startTime: 2018-12-12T06:12:19Z
      workerid: bb7e3421bb2845bd
  - trialid: o6acac42ab646f0c
    workeridlist:
    - completionTime: null
      conditon: Created
      kind: TFJobx
      startTime: 2018-12-12T06:12:19Z
      workerid: ef09022bc257869a
  - trialid: f98d0b68acd29650
    workeridlist:
    - completionTime: null
      conditon: Created
      kind: TFJobx
      startTime: 2018-12-12T06:12:19Z
      workerid: o3c4d9191287eb50
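
The stuck state is visible in the status above: each worker is recorded with `conditon: Created` and a null `completionTime`, while its `kind` is the invalid `TFJobx`. As a rough sketch, the check below scans a status shaped like that dump and lists such workers. The `SUPPORTED_KINDS` set and the function name `find_orphaned_workers` are assumptions made for illustration, not Katib identifiers; the field names (including the `conditon` spelling) are copied from the output.

```python
# Hypothetical set of worker kinds the controller knows how to create.
SUPPORTED_KINDS = {"Job", "TFJob"}


def find_orphaned_workers(status, supported_kinds=SUPPORTED_KINDS):
    """Return (trial_id, worker_id, kind) for workers with an unsupported kind."""
    orphaned = []
    for trial in status.get("trials") or []:
        for worker in trial.get("workeridlist") or []:
            kind = worker.get("kind")
            if kind not in supported_kinds:
                orphaned.append((trial["trialid"], worker["workerid"], kind))
    return orphaned


# Status trimmed from the kubectl output above (first two trials only).
status = {
    "conditon": "Running",
    "trials": [
        {"trialid": "ma6bbb7db6889bd6",
         "workeridlist": [{"completionTime": None, "conditon": "Created",
                           "kind": "TFJobx", "workerid": "bb7e3421bb2845bd"}]},
        {"trialid": "o6acac42ab646f0c",
         "workeridlist": [{"completionTime": None, "conditon": "Created",
                           "kind": "TFJobx", "workerid": "ef09022bc257869a"}]},
    ],
}

for trial_id, worker_id, kind in find_orphaned_workers(status):
    print(f"trial {trial_id}: worker {worker_id} has unsupported kind {kind!r}")
```

Any non-empty result for a study that still reports Running would point at the behavior described in this issue.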

kubectl logs studyjob-controller-56588dc6f9-c4rnc -n kubeflow

2018/12/12 03:22:03 Registering Components.
2018/12/12 03:22:03 Starting the Cmd.
2018/12/12 05:58:56 Study kubeflow/tfjob-example was deleted. Resouces will be released.
2018/12/12 05:59:38 Create Study tfjob-example
....
2018/12/12 06:12:19 Create Study tfjob-example
2018/12/12 06:12:19 Study ID ab5bcacd42e3d071
2018/12/12 06:12:19 Study ID ab5bcacd42e3d071 StudyConfname:"tfjob-example" owner:"crd" optimization_type:MAXIMIZE optimization_goal:0.99 parameter_configs:<configs:<name:"--learning_rate" parameter_type:DOUBLE feasible:<max:"0.05" min:"0.01" > > configs:<name:"--batch_size" parameter_type:INT feasible:<max:"200" min:"100" > > > objective_value_name:"accuracy_1" metrics:"accuracy_1" jobId:"e2cdec25-fdd4-11e8-af31-005056ad997c"
2018/12/12 06:12:19 Study: ab5bcacd42e3d071 Suggestion Spec &{random [] 3}
2018/12/12 06:12:19 Study: ab5bcacd42e3d071 setSuggesitonParameterReply param_id:"j51d3c50a38c7c74"
2018/12/12 06:12:19 Study: ab5bcacd42e3d071 CreatedTrials :
2018/12/12 06:12:19 trial_id:"ma6bbb7db6889bd6" study_id:"ab5bcacd42e3d071" parameter_set:<name:"--learning_rate" parameter_type:DOUBLE value:"0.0111" > parameter_set:<name:"--batch_size" parameter_type:INT value:"108" >
2018/12/12 06:12:19 trial_id:"o6acac42ab646f0c" study_id:"ab5bcacd42e3d071" parameter_set:<name:"--learning_rate" parameter_type:DOUBLE value:"0.0462" > parameter_set:<name:"--batch_size" parameter_type:INT value:"146" >
2018/12/12 06:12:19 trial_id:"f98d0b68acd29650" study_id:"ab5bcacd42e3d071" parameter_set:<name:"--learning_rate" parameter_type:DOUBLE value:"0.0420" > parameter_set:<name:"--batch_size" parameter_type:INT value:"145" >
2018/12/12 06:12:19 Study: ab5bcacd42e3d071 Suggestions trials:<trial_id:"ma6bbb7db6889bd6" study_id:"ab5bcacd42e3d071" parameter_set:<name:"--learning_rate" parameter_type:DOUBLE value:"0.0111" > parameter_set:<name:"--batch_size" parameter_type:INT value:"108" > > trials:<trial_id:"o6acac42ab646f0c" study_id:"ab5bcacd42e3d071" parameter_set:<name:"--learning_rate" parameter_type:DOUBLE value:"0.0462" > parameter_set:<name:"--batch_size" parameter_type:INT value:"146" > > trials:<trial_id:"f98d0b68acd29650" study_id:"ab5bcacd42e3d071" parameter_set:<name:"--learning_rate" parameter_type:DOUBLE value:"0.0420" > parameter_set:<name:"--batch_size" parameter_type:INT value:"145" > >

@hougangliu
Member Author

/assign @hougangliu

@hougangliu
Member Author

I will submit a PR to fix it.
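
One possible shape for such a fix, sketched in Python purely for illustration: read the `kind` from the rendered worker manifest and reject it before any trial is recorded, so the StudyJob can be marked as failed instead of staying in Running. This is not the content of #287; `SUPPORTED_WORKER_KINDS`, `InvalidWorkerKindError`, and `validate_worker_kind` are hypothetical names, the allow-list contents are assumed, and the regular-expression lookup is a simplification that only considers an unindented `kind:` line.

```python
import re

# Hypothetical allow-list; the real set depends on what the controller supports.
SUPPORTED_WORKER_KINDS = frozenset({"Job", "TFJob"})

# Matches a top-level (unindented) "kind: <value>" line, with optional quotes.
_KIND_RE = re.compile(
    r"""^kind:[ \t]*["']?([A-Za-z0-9]+)["']?[ \t]*$""", re.MULTILINE
)


class InvalidWorkerKindError(ValueError):
    """Raised when a worker manifest names a kind that cannot be created."""


def validate_worker_kind(manifest, supported=SUPPORTED_WORKER_KINDS):
    """Return the manifest's kind, or raise if it is missing or unsupported."""
    match = _KIND_RE.search(manifest)
    if match is None:
        raise InvalidWorkerKindError("worker manifest has no top-level kind")
    kind = match.group(1)
    if kind not in supported:
        raise InvalidWorkerKindError(
            f"unsupported worker kind {kind!r}; expected one of {sorted(supported)}"
        )
    return kind


# Rendered form of the workerSpec template from this issue (header only).
manifest = """\
apiVersion: "kubeflow.org/v1beta1"
kind: TFJobx
metadata:
  name: bb7e3421bb2845bd
  namespace: kubeflow
"""

try:
    validate_worker_kind(manifest)
except InvalidWorkerKindError as err:
    # A controller would move the study to a failed condition at this point.
    print(f"study should fail: {err}")
```

Validating up front keeps the failure close to its cause: the study is rejected when the template is rendered, rather than leaving trials in Created with no worker behind them.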
