Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Include first failed job name in event emitted when JobSet fails, as well as the JobSet failure condition #477

Merged
merged 5 commits into from
Mar 27, 2024

Conversation

danielvegamyhre
Copy link
Contributor

@danielvegamyhre danielvegamyhre commented Mar 26, 2024

Fixes #466

  • Include first failed job name in event emitted when JobSet fails, as well as the JobSet failure condition.
  • Move constants into a separate constants package
  • Move all success policy related helper functions into a separate file

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Mar 26, 2024
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: danielvegamyhre

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot requested review from ahg-g and kannon92 March 26, 2024 17:05
@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Mar 26, 2024
Copy link

netlify bot commented Mar 26, 2024

Deploy Preview for kubernetes-sigs-jobset canceled.

Name Link
🔨 Latest commit 5b33e6f
🔍 Latest deploy log https://app.netlify.com/sites/kubernetes-sigs-jobset/deploys/66044a50517914000792513e

@@ -916,3 +914,42 @@ func findReplicatedStatus(replicatedJobStatus []jobset.ReplicatedJobStatus, repl
}
return jobset.ReplicatedJobStatus{}
}

// messageWithFirstFailedJob appends the first failed job to the original event message in human readable way.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we move these out of this file?

They don't require the controller so it may be useful to move them to a separate file.

Copy link
Contributor Author

@danielvegamyhre danielvegamyhre Mar 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah jobset_controller.go is overdue for refactoring. To start I think we can refactor some of the many helper functions into separate files based on the feature (similar to what you did with startup policy).

I did some refactoring in this PR (e.g. moving some functions into success_policy.go, adding a constants pkg, etc.)

However, for these particular functions, I'm not sure of the best place to put them yet. They are about finding the first failed job for a Jobset and generating an event message for it, which doesn't fit into any existing (or new) logical grouping.

I think for now we should leave these 3 functions here and maybe in a separate PR we can refactor some more, I don't want to go overboard splitting things up.

"sigs.k8s.io/jobset/pkg/util/collections"
)

// TODO: add unit tests for the functions in this file.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you want this done as part of this PR? or a TODO for others?

Copy link
Contributor Author

@danielvegamyhre danielvegamyhre Mar 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll create a ticket to do it in follow up PR once this is merged so the file exists in main and the issue can link to it

@kannon92
Copy link
Contributor

/hold
/lgtm

I have a minor nit around constants naming but otherwise this looks good to me.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 27, 2024
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 27, 2024
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 27, 2024
@danielvegamyhre
Copy link
Contributor Author

/retest

@danielvegamyhre
Copy link
Contributor Author

All tests are passing locally but the build fails in e2e CI? odd...

@kannon92
Copy link
Contributor

All tests are passing locally but the build fails in e2e CI? odd...

Not sure why we did this but it bit us again...

https://github.com/kubernetes-sigs/jobset/blob/main/Dockerfile

Docker file is copying elements one by one in pkg.

@danielvegamyhre
Copy link
Contributor Author

danielvegamyhre commented Mar 27, 2024

All tests are passing locally but the build fails in e2e CI? odd...

Not sure why we did this but it bit us again...

https://github.com/kubernetes-sigs/jobset/blob/main/Dockerfile

Docker file is copying elements one by one in pkg.

Ahhhh not again... this dockerfile will be the death of me. Fixing.

Let's just change it to pkg/, the only unnecessary part of pkg/ is the pkg/testing and it's just one file.

@danielvegamyhre
Copy link
Contributor Author

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 27, 2024
@danielvegamyhre
Copy link
Contributor Author

@kannon92 I fixed the dockerfile but lgtm was removed because of this, can you take another look please?

@kannon92
Copy link
Contributor

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 27, 2024
@k8s-ci-robot k8s-ci-robot merged commit 9968a0a into kubernetes-sigs:main Mar 27, 2024
12 checks passed
@danielvegamyhre danielvegamyhre mentioned this pull request Apr 15, 2024
20 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Emit event containing name of first failed Job which ultimately triggered JobSet failure
3 participants