-
Notifications
You must be signed in to change notification settings - Fork 705
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
KEP-2170: Implement Initializer builders in the JobSet plugin #2316
Changes from 4 commits
8e670ef
ba9e215
58d563d
e9f017c
1e14800
b9f9e8a
c35b8d1
cd0bc23
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -167,8 +167,9 @@ type DatasetConfig struct { | |
// These values will be merged with the TrainingRuntime's dataset initializer environments. | ||
Env []corev1.EnvVar `json:"env,omitempty"` | ||
|
||
// Reference to the TrainJob's secrets to download dataset. | ||
SecretRef *corev1.SecretReference `json:"secretRef,omitempty"` | ||
// Reference to the secret with credentials to download dataset. | ||
// Secret must be created in the TrainJob's namespace. | ||
SecretRef *corev1.LocalObjectReference `json:"secretRef,omitempty"` | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @tenzen-y I updated the API to use |
||
} | ||
|
||
// ModelConfig represents the desired model configuration. | ||
|
@@ -193,8 +194,9 @@ type InputModel struct { | |
// These values will be merged with the TrainingRuntime's model initializer environments. | ||
Env []corev1.EnvVar `json:"env,omitempty"` | ||
|
||
// Reference to the TrainJob's secrets to download model. | ||
SecretRef *corev1.SecretReference `json:"secretRef,omitempty"` | ||
// Reference to the secret with credentials to download model. | ||
// Secret must be created in the TrainJob's namespace. | ||
SecretRef *corev1.LocalObjectReference `json:"secretRef,omitempty"` | ||
} | ||
|
||
// OutputModel represents the desired trained model configuration. | ||
|
@@ -206,8 +208,9 @@ type OutputModel struct { | |
// These values will be merged with the TrainingRuntime's model exporter environments. | ||
Env []corev1.EnvVar `json:"env,omitempty"` | ||
|
||
// Reference to the TrainJob's secrets to export model. | ||
SecretRef *corev1.SecretReference `json:"secretRef,omitempty"` | ||
// Reference to the secret with credentials to export model. | ||
// Secret must be created in the TrainJob's namespace. | ||
SecretRef *corev1.LocalObjectReference `json:"secretRef,omitempty"` | ||
} | ||
|
||
// PodSpecOverride represents the custom overrides that will be applied for the TrainJob's resources. | ||
|
Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -4,6 +4,7 @@ import ( | |
"fmt" | ||
|
||
batchv1 "k8s.io/api/batch/v1" | ||
corev1 "k8s.io/api/core/v1" | ||
) | ||
|
||
const ( | ||
|
@@ -26,12 +27,19 @@ const ( | |
// JobInitializer is the Job name for the initializer. | ||
JobInitializer string = "initializer" | ||
|
||
// VolumeNameInitializer is the name for the initializer Pod's Volume and VolumeMount. | ||
// TODO (andreyvelich): Add validation to check that initializer Pod has the correct volume. | ||
VolumeNameInitializer string = "initializer" | ||
|
||
// ContainerModelInitializer is the container name for the model initializer. | ||
ContainerModelInitializer string = "model-initializer" | ||
|
||
// ContainerDatasetInitializer is the container name for the dataset initializer. | ||
ContainerDatasetInitializer string = "dataset-initializer" | ||
|
||
// InitializerEnvStorageUri is the env name for the initializer storage uri. | ||
InitializerEnvStorageUri string = "STORAGE_URI" | ||
|
||
// PodGroupKind is the Kind name for the PodGroup. | ||
PodGroupKind string = "PodGroup" | ||
|
||
|
@@ -56,4 +64,36 @@ const ( | |
var ( | ||
// JobCompletionIndexFieldPath is the field path for the Job completion index annotation. | ||
JobCompletionIndexFieldPath string = fmt.Sprintf("metadata.annotations['%s']", batchv1.JobCompletionIndexAnnotation) | ||
|
||
// This is the temporary container that we use in the initializer ReplicatedJob. | ||
// TODO (andreyvelich): Once JobSet supports execution policy, we can remove it. | ||
// TODO (andreyvelich): Add validation to check that initializer ReplicatedJob has this container. | ||
ContainerBusyBox corev1.Container = corev1.Container{ | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It seems that this is only for the testing wrappers. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. As I mentioned here, I think we should use these consts for Webhook validation: #2316 (comment) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I see. In that case, could we move these default values into the JobSet plugins since these are semantically tied into the JobSet, then we can reuse these in the JobSet There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Sure, that makes sense! |
||
Name: "busybox", | ||
Image: "busybox", | ||
andreyvelich marked this conversation as resolved.
Show resolved
Hide resolved
|
||
} | ||
|
||
// VolumeMountModelInitializer is the volume mount for the model initializer container. | ||
// TODO (andreyvelich): Add validation to check that initializer ReplicatedJob has the following volumes. | ||
VolumeMountModelInitializer = corev1.VolumeMount{ | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. ditto |
||
Name: VolumeNameInitializer, | ||
MountPath: "/workspace/model", | ||
} | ||
|
||
// VolumeMountModelInitializer is the volume mount for the dataset initializer container. | ||
VolumeMountDatasetInitializer = corev1.VolumeMount{ | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. ditto |
||
Name: VolumeNameInitializer, | ||
MountPath: "/workspace/dataset", | ||
} | ||
|
||
// VolumeInitializer is the volume for the initializer ReplicatedJob. | ||
// TODO (andreyvelich): We should make VolumeSource configurable. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. ditto |
||
VolumeInitializer = corev1.Volume{ | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @akshaychitneni @tenzen-y I think, we should validate that runtime has the correct configuration for the volume if it has the |
||
Name: VolumeNameInitializer, | ||
VolumeSource: corev1.VolumeSource{ | ||
PersistentVolumeClaim: &corev1.PersistentVolumeClaimVolumeSource{ | ||
ClaimName: VolumeNameInitializer, | ||
}, | ||
}, | ||
} | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you remove this?
training-operator/.github/workflows/test-go.yaml
Lines 32 to 35 in 7c5ea70
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tenzen-y Oh, you want to separate it.
Don't we want to include
manifests
command into generate script ?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wanted to say removing "Check manifests" from workflow because we will perform manifests verification with
make generate
since this PR.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, you are right!