Add Coordinator field to JobSet spec #618

Merged
merged 3 commits into from
Jul 26, 2024

Conversation

danielvegamyhre
Contributor

Part of #617. I plan on adding validation logic for this new field in a follow-up PR, as well as more tests.

  • Add coordinator field to JobSet spec, allowing the user to define the ReplicatedJob, JobIndex, and PodIndex of the pod they wish to be the global coordinator.
  • Add a new label/annotation, jobset.sigs.k8s.io/coordinator, which will store the stable network endpoint where the coordinator pod can be reached, when the jobset.spec.coordinator field is defined.
  • Minor refactoring of test code so it is more flexible, allowing us to do a "set or merge" operation when adding labels/annotations to Job or Pod wrappers.
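As a rough illustration of the first two bullets, here is a minimal sketch of what the new field and the derived endpoint might look like. The struct name, field types, JSON tags, the `sketchCoordinatorEndpoint` helper, and the `<jobset>-<replicatedJob>-<jobIndex>-<podIndex>.<subdomain>` format are all assumptions for illustration; the conversation does not show the actual type definition or the body of `coordinatorEndpoint`.

```go
package main

import "fmt"

// CoordinatorSketch mirrors the three fields named in the PR description.
// Field types and JSON tags are assumptions, not copied from the PR.
type CoordinatorSketch struct {
	ReplicatedJob string `json:"replicatedJob"`
	JobIndex      int    `json:"jobIndex,omitempty"`
	PodIndex      int    `json:"podIndex,omitempty"`
}

// sketchCoordinatorEndpoint is a hypothetical helper. It assumes the stable
// endpoint is the pod hostname followed by a subdomain; the real format used
// by the controller is not shown in this conversation.
func sketchCoordinatorEndpoint(jobSetName, subdomain string, c CoordinatorSketch) string {
	return fmt.Sprintf("%s-%s-%d-%d.%s", jobSetName, c.ReplicatedJob, c.JobIndex, c.PodIndex, subdomain)
}

func main() {
	// JobIndex and PodIndex are left unset, so they take Go's zero value (0),
	// matching the "Defaults to 0 if unset" wording in the generated docs.
	c := CoordinatorSketch{ReplicatedJob: "driver"}
	fmt.Println(sketchCoordinatorEndpoint("train", "train", c))
}
```

The names `train` and `driver` are placeholders. The resulting string is what would be stored under the jobset.sigs.k8s.io/coordinator key in this sketch.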

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: danielvegamyhre

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jul 15, 2024
@k8s-ci-robot k8s-ci-robot requested review from ahg-g and kannon92 July 15, 2024 19:56
@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jul 15, 2024

netlify bot commented Jul 15, 2024

Deploy Preview for kubernetes-sigs-jobset canceled.

Name Link
🔨 Latest commit 8a1b2ee
🔍 Latest deploy log https://app.netlify.com/sites/kubernetes-sigs-jobset/deploys/66a400f9a531ec0008bc2f4f

@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jul 15, 2024
@@ -44,6 +44,11 @@ const (
// JobSetControllerName is the reserved value for the managedBy field for the built-in
// JobSet controller.
JobSetControllerName = "jobset.sigs.k8s.io/jobset-controller"

// CoordinatorKey is used as an annotation and label on Jobs and Pods. If the JobSet spec
// defines the .spec.coordinator field, this annotation/label will be added to store a stable
Member

Hello, I have a small question, why do we need to put this tag on both the annotation and the label? Is it not enough to just use the label?

Contributor Author

@danielvegamyhre danielvegamyhre Jul 16, 2024

Yes, but inconsistency between labels and annotations, and between Job and Pod labels/annotations, has led to bugs in the past, as well as confusion for users. So while the API is still in alpha, I've been adding everything as both a label and an annotation on both Jobs and Pods, so there is full consistency across Jobs, Pods, labels, and annotations. For graduation to v1, I plan on reviewing these, assigning them more thoughtfully, and documenting them on the JobSet docsite.
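A minimal sketch of the pattern described in this reply, writing the same key as both a label and an annotation on both Jobs and Pods. The `objectMeta` type, `setOrMerge`, and `applyToBoth` are hypothetical stand-ins and are not taken from the PR; `setOrMerge` only illustrates the "set or merge" idea mentioned in the PR description.

```go
package main

import "fmt"

// objectMeta is a hypothetical stand-in for the metadata of a Job or Pod.
type objectMeta struct {
	Labels      map[string]string
	Annotations map[string]string
}

// setOrMerge sets dst to a new map if it is nil, otherwise merges extra into
// it. Existing keys in dst are overwritten by entries in extra.
func setOrMerge(dst, extra map[string]string) map[string]string {
	if dst == nil {
		dst = make(map[string]string, len(extra))
	}
	for k, v := range extra {
		dst[k] = v
	}
	return dst
}

// applyToBoth writes the same key/value as both a label and an annotation.
func applyToBoth(obj *objectMeta, key, value string) {
	kv := map[string]string{key: value}
	obj.Labels = setOrMerge(obj.Labels, kv)
	obj.Annotations = setOrMerge(obj.Annotations, kv)
}

func main() {
	job := &objectMeta{Labels: map[string]string{"app": "demo"}}
	pod := &objectMeta{}
	// Apply the key uniformly to every object so Jobs and Pods stay consistent.
	for _, obj := range []*objectMeta{job, pod} {
		applyToBoth(obj, "jobset.sigs.k8s.io/coordinator", "train-driver-0-0.train")
	}
	fmt.Println(job.Labels, job.Annotations)
	fmt.Println(pod.Labels, pod.Annotations)
}
```

The endpoint value here is a placeholder. The trade-off is some duplicated metadata in exchange for not having to remember which object carries which key.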

Member

Understood, thanks for your reply

@googs1025
Member

I have a small question: Does this feature require the use of featureGate?

@danielvegamyhre
Contributor Author

> I have a small question: Does this feature require the use of featureGate?

I don't think we'll add a feature gate for this one, since it isn't a risky feature. It just sets a new annotation and label on Jobs and Pods if the field is set.

@danielvegamyhre
Contributor Author

danielvegamyhre commented Jul 25, 2024

cc @kannon92 would you mind taking a look at this if you have time? Abdullah is OOO currently.

Happy to discuss the feature/context further if you'd like.

@@ -746,6 +746,12 @@ func labelAndAnnotateObject(obj metav1.Object, js *jobset.JobSet, rjob *jobset.R
annotations[jobset.JobIndexKey] = strconv.Itoa(jobIdx)
annotations[jobset.JobKey] = jobHashKey(js.Namespace, jobName)

// Apply coordinator annotation/label if a coordinator is defined in the JobSet spec.
if js.Spec.Coordinator != nil {
labels[jobset.CoordinatorKey] = coordinatorEndpoint(js)
Contributor

One subtle bug here would be that if someone specifies these labels themselves, we would overwrite them.

Do we want a warning if someone uses these labels?

Contributor Author

Yep, I have a follow up PR with webhook changes ready to go after this is merged. Will tag you on it shortly.
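For illustration only, the overwrite concern raised in this thread could be surfaced with a check along these lines. The `reservedKeyWarnings` function and the warning wording are hypothetical; this is not the follow-up webhook change, which is not shown in this conversation.

```go
package main

import "fmt"

// coordinatorKey is the label/annotation key named in the PR description.
const coordinatorKey = "jobset.sigs.k8s.io/coordinator"

// reservedKeyWarnings returns one warning for each reserved key that the user
// has already set, since the controller would overwrite that value.
func reservedKeyWarnings(userLabels map[string]string, reserved []string) []string {
	var warnings []string
	for _, key := range reserved {
		if v, ok := userLabels[key]; ok {
			warnings = append(warnings, fmt.Sprintf(
				"label %q is managed by the controller; user value %q will be overwritten", key, v))
		}
	}
	return warnings
}

func main() {
	userLabels := map[string]string{
		"team":         "ml",
		coordinatorKey: "my-custom-endpoint",
	}
	for _, w := range reservedKeyWarnings(userLabels, []string{coordinatorKey}) {
		fmt.Println(w)
	}
}
```

Whether such a collision should produce a warning or a hard validation error is a design choice left to the follow-up PR.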

------------ | ------------- | ------------- | -------------
**job_index** | **int** | JobIndex is the index of Job which contains the coordinator pod (i.e., for a ReplicatedJob with N replicas, there are Job indexes 0 to N-1). Defaults to 0 if unset. | [optional]
**pod_index** | **int** | PodIndex is the Job completion index of the coordinator pod. Defaults to 0 if unset. | [optional]
**replicated_job** | **str** | ReplicatedJob is the name of the ReplicatedJob which contains the coordinator pod. | [default to '']
Contributor

Where do we get "default to ''" from?

Contributor Author

I'm not sure. I think it's because the replicatedJob field does not have "omitempty", so it will default to an empty string if unset. In the follow-up PR with webhook changes, I will validate that it is set to a valid ReplicatedJob name, though.
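The "omitempty" behavior this reply refers to can be seen with Go's standard `encoding/json` package. The struct below is a hypothetical stand-in with assumed JSON tags, not the type from the PR, and it only shows Go's serialization behavior. Whether the SDK doc generator derives "default to ''" from this is the author's guess and is not confirmed here.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// coordinatorTags is a hypothetical struct: replicatedJob has no omitempty,
// while the two index fields do.
type coordinatorTags struct {
	ReplicatedJob string `json:"replicatedJob"`
	JobIndex      int    `json:"jobIndex,omitempty"`
	PodIndex      int    `json:"podIndex,omitempty"`
}

// marshalCoordinator serializes the struct to a JSON string.
func marshalCoordinator(c coordinatorTags) (string, error) {
	b, err := json.Marshal(c)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	// With every field unset, only replicatedJob is emitted, as an empty
	// string; the omitempty fields are dropped from the output.
	out, err := marshalCoordinator(coordinatorTags{})
	if err != nil {
		panic(err)
	}
	fmt.Println(out) // prints {"replicatedJob":""}
}
```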

@kannon92
Contributor

/hold

/lgtm

Code looks good. I have some minor nits. Feel free to unhold.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 26, 2024
@k8s-ci-robot k8s-ci-robot added lgtm "Looks good to me", indicates that a PR is ready to be merged. and removed lgtm "Looks good to me", indicates that a PR is ready to be merged. labels Jul 26, 2024
@danielvegamyhre
Contributor Author

@kannon92 can you reapply LGTM when you have a sec, please? I made changes to address the comments, and the label was removed.

@kannon92
Contributor

/hold cancel
/lgtm

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 26, 2024
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 26, 2024
@k8s-ci-robot k8s-ci-robot merged commit a1bbff3 into kubernetes-sigs:main Jul 26, 2024
13 checks passed
@danielvegamyhre danielvegamyhre mentioned this pull request Aug 19, 2024
20 tasks
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.

4 participants