
Job pending condition #1

Closed · wants to merge 2 commits into from

Conversation

kannon92 (Owner)

What type of PR is this?

What this PR does / why we need it:

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?


Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@@ -795,7 +796,11 @@ func (jm *Controller) syncJob(ctx context.Context, key string) (forget bool, rEr

    deleted, err := jm.deleteActivePods(ctx, &job, activePods)
    if uncounted == nil {
        // Legacy behavior: pretend all active pods were successfully removed.
        deleted = active
        if strings.Contains(finishedCondition.Message, "has condition for pending") {


this is the legacy behavior when there is no tracking with finalizers... the plan is to remove this code in 1.27.

kannon92 (Owner Author)

So I notice that this code https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/job/job_controller_test.go#L2010 basically doesn't set the finalizers path to true at all. I think it is set in one case and the rest may default to false.

So when I was debugging the integration tests I was hitting this path quite a bit. Maybe we should look at those tests and make sure they are testing the finalizer route?


kannon92 (Owner Author)

I see now!

Comment on lines +1751 to +1753

    _, _, action := matchPodFailurePolicy(job.Spec.PodFailurePolicy, p)
    failJob := batch.PodFailurePolicyActionFailJob
    return action != nil && *action == failJob


doesn't this mean that there is no timeout at all?
That as soon as the pod has the condition, even if transient, it matches the policy?
I don't think we want that.

kannon92 (Owner Author)

What should be the next steps? I think I am hearing that we should include a deadline for PodFailurePolicy to match on a particular condition.

Is it possible to add this to the retriable jobs KEP as an enhancement found during Beta? Or is this a separate KEP on top of a beta feature?


I think there should be a mechanism that, after a certain time has passed, marks a Pending pod as Failed.

As discussed in the WG Batch meeting, this could be an external controller. But it could also be kubelet if there is an API for it.

That's definitely material for a separate KEP. But I would suggest bringing up this discussion back to SIG Node again.

@mimowo (Dec 21, 2022)

Transitioning Pending pods with DeletionTimestamp != nil is planned as part of KEP-3329: Retriable and non-retriable Pod failures for Jobs: "Also, extend PodGC to mark as failed (and delete) terminating pods (with set deletionTimestamp) which stuck in the pending.". We planned to do this in PodGC. However, this only covers Pending pods that are being deleted, so I am not sure it applies in your case.


Right, so the external controller can simply issue a DELETE based on its timeout.

kannon92 pushed a commit that referenced this pull request Oct 16, 2023
These were found with a modified klog that enables "go vet" to check klog call
parameters:

    cmd/kubeadm/app/features/features.go:149:4: printf: k8s.io/klog/v2.Warningf format %t has arg v of wrong type string (govet)
    			klog.Warningf("Setting deprecated feature gate %s=%t. It will be removed in a future release.", k, v)
    test/images/sample-device-plugin/sampledeviceplugin.go:147:5: printf: k8s.io/klog/v2.Errorf does not support error-wrapping directive %w (govet)
    				klog.Errorf("error: %w", err)
    test/images/sample-device-plugin/sampledeviceplugin.go:155:3: printf: k8s.io/klog/v2.Errorf does not support error-wrapping directive %w (govet)
    		klog.Errorf("Failed to add watch to %q: %w", triggerPath, err)
    staging/src/k8s.io/code-generator/cmd/prerelease-lifecycle-gen/prerelease-lifecycle-generators/status.go:207:5: printf: k8s.io/klog/v2.Fatalf does not support error-wrapping directive %w (govet)
    				klog.Fatalf("Package %v: unsupported %s value: %q :%w", i, tagEnabledName, ptag.value, err)
    staging/src/k8s.io/legacy-cloud-providers/vsphere/nodemanager.go:286:3: printf: (k8s.io/klog/v2.Verbose).Infof format %s reads arg #1, but call has 0 args (govet)
    		klog.V(4).Infof("Node %s missing in vSphere cloud provider cache, trying node informer")
    staging/src/k8s.io/legacy-cloud-providers/vsphere/nodemanager.go:302:3: printf: (k8s.io/klog/v2.Verbose).Infof format %s reads arg #1, but call has 0 args (govet)
    		klog.V(4).Infof("Node %s missing in vSphere cloud provider caches, trying the API server")
@kannon92 kannon92 closed this Jan 10, 2024