When I specify an invalid WorkerSpec kind (for example `TFJobx`, as below), no worker is created, yet the StudyJob status stays `Running` with trials attached.
kubectl get studyjob tfjob-example -n kubeflow -o yaml
apiVersion: kubeflow.org/v1alpha1
kind: StudyJob
metadata:
  creationTimestamp: 2018-12-12T06:12:21Z
  generation: 1
  labels:
    controller-tools.k8s.io: "1.0"
  name: tfjob-example
  namespace: kubeflow
  resourceVersion: "13722220"
  selfLink: /apis/kubeflow.org/v1alpha1/namespaces/kubeflow/studyjobs/tfjob-example
  uid: e2cdec25-fdd4-11e8-af31-005056ad997c
spec:
  metricsCollectorSpec:
    goTemplate:
      rawTemplate: |-
        apiVersion: batch/v1beta1
        kind: CronJob
        metadata:
          name: {{.WorkerID}}
          namespace: kubeflow
        spec:
          schedule: "*/1 * * * *"
          successfulJobsHistoryLimit: 0
          failedJobsHistoryLimit: 1
          jobTemplate:
            spec:
              template:
                spec:
                  containers:
                  - name: {{.WorkerID}}
                    image: gcr.io/kubeflow-ci/katib/tfevent-metrics-collector:v0.1.2-alpha-77-g9324cad
                    args:
                    - "python"
                    - "main.py"
                    - "-m"
                    - "vizier-core"
                    - "-s"
                    - "{{.StudyID}}"
                    - "-w"
                    - "{{.WorkerID}}"
                    - "-d"
                    - "/train/{{.WorkerID}}"
                    volumeMounts:
                    - mountPath: "/train"
                      name: "train"
                  volumes:
                  - name: "train"
                    persistentVolumeClaim:
                      claimName: "tfevent-volume"
                  restartPolicy: Never
                  serviceAccountName: metrics-collector
  metricsnames:
  - accuracy_1
  objectivevaluename: accuracy_1
  optimizationgoal: 0.99
  optimizationtype: maximize
  owner: crd
  parameterconfigs:
  - feasible:
      max: "0.05"
      min: "0.01"
    name: --learning_rate
    parametertype: double
  - feasible:
      max: "200"
      min: "100"
    name: --batch_size
    parametertype: int
  requestcount: 4
  studyName: tfjob-example
  suggestionSpec:
    requestNumber: 3
    suggestionAlgorithm: random
    suggestionParameters:
    - value: "0"
  workerSpec:
    goTemplate:
      rawTemplate: |-
        apiVersion: "kubeflow.org/v1beta1"
        kind: TFJobx
        metadata:
          name: {{.WorkerID}}
          namespace: kubeflow
        spec:
          tfReplicaSpecs:
            Worker:
              replicas: 1
              restartPolicy: Never
              template:
                spec:
                  containers:
                  - name: tensorflow
                    image: gcr.io/kubeflow-ci/tf-mnist-with-summaries:1.0
                    command:
                    - "python"
                    - "/var/tf_mnist/mnist_with_summaries.py"
                    - "--log_dir=/train/{{.WorkerID}}"
                    {{- with .HyperParameters}}
                    {{- range .}}
                    - "{{.Name}}={{.Value}}"
                    {{- end}}
                    {{- end}}
                    volumeMounts:
                    - mountPath: "/train"
                      name: "train"
                  volumes:
                  - name: "train"
                    persistentVolumeClaim:
                      claimName: "tfevent-volume"
status:
  conditon: Running
  earlyStoppingParameterId: ""
  studyid: ab5bcacd42e3d071
  suggestionCount: 2
  suggestionParameterId: j51d3c50a38c7c74
  **trials:
  - workeridlist:
    - conditon: Created
      kind: TFJobx
      startTime: 2018-12-12T06:12:19Z
      workerid: bb7e3421bb2845bd
  - workeridlist:
    - conditon: Created
      kind: TFJobx
      startTime: 2018-12-12T06:12:19Z
      workerid: ef09022bc257869a
  - workeridlist:
    - conditon: Created
      kind: TFJobx
      startTime: 2018-12-12T06:12:19Z
      workerid: o3c4d9191287eb50**
kubectl logs studyjob-controller-56588dc6f9-c4rnc -n kubeflow
2018/12/12 03:22:03 Registering Components.
2018/12/12 03:22:03 Starting the Cmd.
2018/12/12 05:58:56 Study kubeflow/tfjob-example was deleted. Resouces will be released.
2018/12/12 05:59:38 Create Study tfjob-example
....
2018/12/12 06:12:19 Create Study tfjob-example
2018/12/12 06:12:19 Study ID ab5bcacd42e3d071
2018/12/12 06:12:19 Study ID ab5bcacd42e3d071 StudyConfname:"tfjob-example" owner:"crd" optimization_type:MAXIMIZE optimization_goal:0.99 parameter_configs:<configs:<name:"--learning_rate" parameter_type:DOUBLE feasible:<max:"0.05" min:"0.01" > > configs:<name:"--batch_size" parameter_type:INT feasible:<max:"200" min:"100" > > > objective_value_name:"accuracy_1" metrics:"accuracy_1" jobId:"e2cdec25-fdd4-11e8-af31-005056ad997c"
2018/12/12 06:12:19 Study: ab5bcacd42e3d071 Suggestion Spec &{random [] 3}
2018/12/12 06:12:19 Study: ab5bcacd42e3d071 setSuggesitonParameterReply param_id:"j51d3c50a38c7c74"
2018/12/12 06:12:19 Study: ab5bcacd42e3d071 CreatedTrials :
2018/12/12 06:12:19 trial_id:"ma6bbb7db6889bd6" study_id:"ab5bcacd42e3d071" parameter_set:<name:"--learning_rate" parameter_type:DOUBLE value:"0.0111" > parameter_set:<name:"--batch_size" parameter_type:INT value:"108" >
2018/12/12 06:12:19 trial_id:"o6acac42ab646f0c" study_id:"ab5bcacd42e3d071" parameter_set:<name:"--learning_rate" parameter_type:DOUBLE value:"0.0462" > parameter_set:<name:"--batch_size" parameter_type:INT value:"146" >
2018/12/12 06:12:19 trial_id:"f98d0b68acd29650" study_id:"ab5bcacd42e3d071" parameter_set:<name:"--learning_rate" parameter_type:DOUBLE value:"0.0420" > parameter_set:<name:"--batch_size" parameter_type:INT value:"145" >
2018/12/12 06:12:19 Study: ab5bcacd42e3d071 Suggestions trials:<trial_id:"ma6bbb7db6889bd6" study_id:"ab5bcacd42e3d071" parameter_set:<name:"--learning_rate" parameter_type:DOUBLE value:"0.0111" > parameter_set:<name:"--batch_size" parameter_type:INT value:"108" > > trials:<trial_id:"o6acac42ab646f0c" study_id:"ab5bcacd42e3d071" parameter_set:<name:"--learning_rate" parameter_type:DOUBLE value:"0.0462" > parameter_set:<name:"--batch_size" parameter_type:INT value:"146" > > trials:<trial_id:"f98d0b68acd29650" study_id:"ab5bcacd42e3d071" parameter_set:<name:"--learning_rate" parameter_type:DOUBLE value:"0.0420" > parameter_set:<name:"--batch_size" parameter_type:INT value:"145" > >
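One possible direction for a fix is to check the worker `kind` before any trial is created and fail the StudyJob immediately when it is not recognized. The snippet below is only a rough sketch of that idea in Python, not the controller's actual code. The names `SUPPORTED_WORKER_KINDS`, `worker_kind` and `validate_worker_kind` are hypothetical, and the allow-list is an assumption: `TFJob` is implied by the spec above, `Job` is a guess. It reads the top-level `kind:` line straight from the raw template, so the Go template placeholders such as `{{.WorkerID}}` do not have to be rendered first.

```python
import re

# Hypothetical allow-list of worker kinds the controller can create.
# "TFJob" is implied by this report; "Job" is an assumption.
SUPPORTED_WORKER_KINDS = frozenset({"Job", "TFJob"})

# Match only an unindented `kind:` line, i.e. the manifest's top-level kind.
_KIND_RE = re.compile(r"^kind:[ \t]*(\S+)[ \t]*$", re.MULTILINE)


def worker_kind(raw_template: str) -> str:
    """Return the top-level `kind` declared in a worker manifest template."""
    match = _KIND_RE.search(raw_template)
    if match is None:
        raise ValueError("worker template has no top-level 'kind' field")
    # Tolerate quoted values such as kind: "TFJob".
    return match.group(1).strip("\"'")


def validate_worker_kind(raw_template: str, supported=SUPPORTED_WORKER_KINDS) -> str:
    """Return the kind if it is supported, otherwise raise ValueError."""
    kind = worker_kind(raw_template)
    if kind not in supported:
        raise ValueError(
            f"unsupported worker kind {kind!r}; expected one of {sorted(supported)}"
        )
    return kind
```

With the `rawTemplate` from this report, `validate_worker_kind` would raise a `ValueError` naming `TFJobx`, which the controller could surface as a failed condition instead of leaving the StudyJob in `Running`.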