First, GetSupportedJobList in common.go needs to be updated.
func GetSupportedJobList() []schema.GroupVersionKind {
	supportedJobList := []schema.GroupVersionKind{
		{
			Group:   "batch",
			Version: "v1",
			Kind:    "Job",
		},
		{
			Group:   "kubeflow.org",
			Version: "v1",
			Kind:    "TFJob",
		},
		{
			Group:   "kubeflow.org",
			Version: "v1",
			Kind:    "PyTorchJob",
		},
	}
	return supportedJobList
}
This function defines the Kubernetes GroupVersionKinds that Katib supports. To add a new kind, append it to supportedJobList.
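As a minimal sketch of that step, the snippet below appends one more entry to the list. The kind ExampleJob and its group example.org are hypothetical placeholders, isSupported is a helper invented for illustration, and the local GroupVersionKind struct is only a stand-in that mirrors the fields of schema.GroupVersionKind so the snippet compiles without Kubernetes dependencies.

```go
package main

import "fmt"

// GroupVersionKind is a local stand-in that mirrors the fields of
// schema.GroupVersionKind from k8s.io/apimachinery (assumption made so
// that this sketch compiles on its own).
type GroupVersionKind struct {
	Group   string
	Version string
	Kind    string
}

// GetSupportedJobList returns the kinds accepted as trial jobs.
func GetSupportedJobList() []GroupVersionKind {
	supportedJobList := []GroupVersionKind{
		{Group: "batch", Version: "v1", Kind: "Job"},
		{Group: "kubeflow.org", Version: "v1", Kind: "TFJob"},
		{Group: "kubeflow.org", Version: "v1", Kind: "PyTorchJob"},
		// Hypothetical new kind: replace with the real group, version, and kind.
		{Group: "example.org", Version: "v1", Kind: "ExampleJob"},
	}
	return supportedJobList
}

// isSupported is a hypothetical helper that reports whether gvk appears
// in the supported list.
func isSupported(gvk GroupVersionKind) bool {
	for _, s := range GetSupportedJobList() {
		if s == gvk {
			return true
		}
	}
	return false
}

func main() {
	gvk := GroupVersionKind{Group: "example.org", Version: "v1", Kind: "ExampleJob"}
	fmt.Println("ExampleJob supported:", isSupported(gvk))
}
```

Group, version, and kind must all match, so a different version of the same kind would still be rejected until it is listed as well.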
GetDeployedJobStatus in trial_controller_util.go needs to be updated. It determines whether the trial has completed (Succeeded or Failed).
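The body of GetDeployedJobStatus is not shown here, so the following is only an illustrative sketch of the kind of check it implies: mapping the conditions reported by a deployed job to a completed or still-running state. The JobCondition type, the deployedJobStatus function, and the condition names are all hypothetical and do not reflect Katib's actual code.

```go
package main

import "fmt"

// JobCondition is a hypothetical, simplified view of one status condition
// reported by a deployed job.
type JobCondition struct {
	Type   string // assumed names such as "Succeeded" or "Failed"
	Status bool   // whether the condition currently holds
}

// deployedJobStatus is a hypothetical helper. It returns "Succeeded" or
// "Failed" once the job has completed, and "Running" otherwise.
func deployedJobStatus(conditions []JobCondition) string {
	for _, c := range conditions {
		if !c.Status {
			continue // ignore conditions that do not currently hold
		}
		if c.Type == "Succeeded" || c.Type == "Failed" {
			return c.Type
		}
	}
	return "Running"
}

func main() {
	fmt.Println(deployedJobStatus([]JobCondition{{Type: "Created", Status: true}}))
	fmt.Println(deployedJobStatus([]JobCondition{{Type: "Succeeded", Status: true}}))
}
```

A new job kind may report completion through different fields or condition names, which is why this function needs a matching update.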
isWorkerContainer in inject_webhook.go needs to be updated.
func isWorkerContainer(jobKind string, index int, c v1.Container) bool {
	switch jobKind {
	case BatchJob:
		if index == 0 {
			// for Job worker, the first container will be taken as worker container,
			// katib document should note it
			return true
		}
	case TFJob:
		if c.Name == TFJobWorkerContainerName {
			return true
		}
	case PyTorchJob:
		if c.Name == PyTorchJobWorkerContainerName {
			return true
		}
	default:
		log.Info("Invalid Katib worker kind", "JobKind", jobKind)
		return false
	}
	return false
}
This function determines which container in the job is the main (worker) container.
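A new kind would need its own case in that switch. The sketch below shows one way this could look, assuming the new kind identifies its worker container by name. ExampleJob and ExampleJobWorkerContainerName are hypothetical, the container name values are assumptions, and Container is a local stand-in for v1.Container so that the snippet compiles without Kubernetes dependencies.

```go
package main

import "fmt"

// Container is a local stand-in for v1.Container; only Name is needed here.
type Container struct {
	Name string
}

const (
	BatchJob   = "Job"
	TFJob      = "TFJob"
	PyTorchJob = "PyTorchJob"
	ExampleJob = "ExampleJob" // hypothetical new kind

	// Container name values below are assumptions made for this sketch.
	TFJobWorkerContainerName      = "tensorflow"
	PyTorchJobWorkerContainerName = "pytorch"
	ExampleJobWorkerContainerName = "example" // hypothetical
)

// isWorkerContainer reports whether container c, at position index in the
// pod spec, is the main container of a job of the given kind.
func isWorkerContainer(jobKind string, index int, c Container) bool {
	switch jobKind {
	case BatchJob:
		// For a batch Job, the first container is taken as the worker.
		return index == 0
	case TFJob:
		return c.Name == TFJobWorkerContainerName
	case PyTorchJob:
		return c.Name == PyTorchJobWorkerContainerName
	case ExampleJob:
		// New case for the hypothetical kind.
		return c.Name == ExampleJobWorkerContainerName
	default:
		return false
	}
}

func main() {
	fmt.Println(isWorkerContainer(ExampleJob, 1, Container{Name: "example"}))
	fmt.Println(isWorkerContainer(ExampleJob, 0, Container{Name: "sidecar"}))
}
```

Matching by name rather than by index keeps the check independent of container order, which matters for kinds whose pod templates may list several containers.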
Katib injects the metrics collector sidecar only into the master pod (see metrics-collector.md for more details), so JobRoleMap in const.go also needs to be updated.
var JobRoleMap = map[string][]string{
	"TFJob":      {JobRoleLabel, TFJobRoleLabel},
	"PyTorchJob": {JobRoleLabel, PyTorchJobRoleLabel},
	"Job":        {},
}
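To illustrate how such a map could drive the master-pod check, here is a rough sketch that adds an entry for a hypothetical ExampleJob kind. The label keys, the "master" value, and the isMasterRole helper are assumptions made for this example and may differ from what Katib actually uses.

```go
package main

import "fmt"

// Label keys and the master role value are assumed for this sketch.
const (
	JobRoleLabel        = "job-role"
	TFJobRoleLabel      = "tf-job-role"
	PyTorchJobRoleLabel = "pytorch-job-role"
	ExampleJobRoleLabel = "example-job-role" // hypothetical
	MasterRole          = "master"
)

// JobRoleMap lists, per job kind, the pod label keys that carry the role.
var JobRoleMap = map[string][]string{
	"TFJob":      {JobRoleLabel, TFJobRoleLabel},
	"PyTorchJob": {JobRoleLabel, PyTorchJobRoleLabel},
	"Job":        {},
	"ExampleJob": {JobRoleLabel, ExampleJobRoleLabel}, // hypothetical entry
}

// isMasterRole is a hypothetical helper. A kind with no role labels is
// treated as having a single pod, which is therefore always the master.
func isMasterRole(podLabels map[string]string, jobKind string) bool {
	roleLabels, ok := JobRoleMap[jobKind]
	if !ok {
		return false // unknown kind: never inject
	}
	if len(roleLabels) == 0 {
		return true
	}
	for _, key := range roleLabels {
		if podLabels[key] == MasterRole {
			return true
		}
	}
	return false
}

func main() {
	master := map[string]string{"example-job-role": "master"}
	worker := map[string]string{"example-job-role": "worker"}
	fmt.Println(isMasterRole(master, "ExampleJob"), isMasterRole(worker, "ExampleJob"))
}
```

Under these assumptions, a kind that is missing from the map would never receive the sidecar, so the map entry is as necessary as the other changes.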