
Add KEP for job tracking in implementable status #2308

Merged
2 commits merged into kubernetes:master on Feb 8, 2021

Conversation


@alculquicondor alculquicondor commented Jan 21, 2021

@k8s-ci-robot k8s-ci-robot added sig/apps Categorizes an issue or PR as relevant to SIG Apps. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory labels Jan 21, 2021
@alculquicondor
Member Author

/assign @erictune
for sig-apps
/assign @lavalamp
for api changes

Member

@lavalamp lavalamp left a comment

Looks good, I think the downgrade scenario is the only important comment.

// UncountedPodUIDs holds UIDs of Pods that have finished but haven't been
// accounted in Job status counters.
type UncountedPodUIDs struct {
// Enabled indicates whether Job tracking uses this struct.
Member

Perhaps just make the UncountedPodUIDs field optional, then you don't need this?

Member Author

something like the following?

type JobStatus struct {
  // +optional
  UncountedPodUIDs *UncountedPodUIDs
}

Would that differentiate nil from empty struct?

Member

Yes. But after some more thought, using an explicit "Enabled" field is probably clearer, so I guess let's keep it how you have it.

Member Author

What I don't like about the boolean is that it will eventually be useless. And the struct is only supposed to be handled by the Job controller, so I think I'm fine with the pointer if we can differentiate between nil and empty struct.

Contributor

👍 for pointer, this is only meant for tracking status and when not used it'll be just empty which is sufficient.

Member Author

changed to pointer.
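
For reference, a minimal sketch of the pointer-based shape agreed on in this thread (the struct was later renamed `UncountedTerminatedPods`; this is illustrative, not the final API text):

```go
// Sketch only: a nil pointer means the Job is tracked with the legacy
// algorithm; a non-nil (possibly empty) struct means finalizer-based tracking
// is in use, so no Enabled boolean is needed and the distinction between
// "unset" and "empty" is preserved.
type JobStatus struct {
	// ... existing fields (Active, Succeeded, Failed, ...) ...

	// +optional
	UncountedPodUIDs *UncountedPodUIDs
}
```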

them afterwards. The controller sends a status update.

In the case where a user or another controller removes a Pod, which sets a
deletion timestamp, the Job controller removes the finalizer. This is along with
Member

I'm not sure I follow what you're trying to say here? It's not wrong, I just don't understand why it needs to be said here.

Member Author

Moved to its own section and added an unresolved note.

might not have a finalizer. We can:
1. Prevent adoption if the Pod doesn't have the finalizer. This is a breaking
change.
1. Add the finalizer in the same patch request that modifies the owner
Member

Should go with this option.

Member

+1 - I don't see any disadvantages of this option.

Member Author

I'm thinking of incremental changes. 1 would be easier, so I can do it in the first implementation PR. If I have time until the codefreeze, I can do 2. But if I miss the deadline, then I think it's fine to leave it for beta. WDYT?

Member

I don't think this should be much harder than 1.

Member Author

Removed the unresolved marker; taking option 2.
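
For illustration, a rough sketch of what option 2 could look like with client-go; the helper name and exact patch body are assumptions, not the controller's final adoption code. A strategic merge patch can set the controller owner reference and add the tracking finalizer in a single request:

```go
import (
	"context"
	"fmt"

	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// adoptWithFinalizer (hypothetical helper) adopts a Pod and adds the tracking
// finalizer in the same PATCH, so an adopted Pod is never left without it.
func adoptWithFinalizer(ctx context.Context, c kubernetes.Interface, job *batchv1.Job, pod *corev1.Pod) error {
	patch := []byte(fmt.Sprintf(`{
  "metadata": {
    "ownerReferences": [{
      "apiVersion": "batch/v1",
      "kind": "Job",
      "name": %q,
      "uid": %q,
      "controller": true,
      "blockOwnerDeletion": true
    }],
    "finalizers": ["batch.kubernetes.io/job-completion"]
  }
}`, job.Name, job.UID))
	_, err := c.CoreV1().Pods(pod.Namespace).Patch(
		ctx, pod.Name, types.StrategicMergePatchType, patch, metav1.PatchOptions{})
	return err
}
```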

`batch.kubernetes.io/job-completion` finalizer. It would be hard to add the
finalizer to all Pods while preventing race conditions.

We use `.status.uncountedPodUIDs.enabled` to indicate whether the Job was
Member

What happens if we roll back to a version of the job controller that doesn't know about the finalizer?

Member

And what happens when we upgrade again :)

Member Author

That would cause many issues. There is nothing that can remove the finalizers.

So my answer would be: do not enable the feature gate in production clusters while it's in alpha. Disabling the feature gate once the feature is enabled by default should be fine, as the controller (and an N-1 version of it) knows how to remove finalizers.

Is this acceptable?

Member

Unfortunately, I think you might need to roll out in two releases; the first release will remove the finalizer if it is present, but doesn't add it. The second release can then start adding the finalizer and implement the rest of the feature.

The alternative is to do both of a) turning off the feature flag won't disable the finalizer removal logic and b) give users a command/script which they can use to manually remove all finalizers in the event of a downgrade.

If you're going to have an alpha release anyway, then I think it's fine to omit b) since the vast majority of users won't be in this situation.

Member Author

I'm fine with the alternative. Is there a directory where we already have scripts like the one described?

Member

No, the release notes are probably the place to put it. But if it's just alpha in the first release, there's probably no need.

@liggitt may have an opinion

Member

@liggitt liggitt Jan 22, 2021

> turning off the feature flag won't disable the finalizer removal logic

I haven't read the full context, but auto-added finalizers not being removed on rollback or failed HA upgrade or feature disable is a really nasty failure mode. Ungated logic to clean up a finalizer starting in alpha stage could be reasonable. This is similar to what was done with the kubernetes.io/pvc-protection finalizer in kubernetes/kubernetes#61324 / kubernetes/kubernetes#61370

Member Author

> Ungated logic to clean up a finalizer starting in alpha stage could be reasonable.

That is what's being proposed.
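
For illustration, a minimal sketch (under assumed names; imports as in the earlier sketch; not the real controller code) of what the cleanup being discussed boils down to, whether it lives in ungated controller logic or in a release-note script: strip the tracking finalizer from a Job's Pods with plain updates.

```go
// removeTrackingFinalizers (hypothetical helper) strips the tracking finalizer
// from every Pod matching the given label selector. A real implementation
// would retry on conflicts and paginate the List call.
func removeTrackingFinalizers(ctx context.Context, c kubernetes.Interface, namespace, selector string) error {
	const finalizer = "batch.kubernetes.io/job-completion"
	pods, err := c.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{LabelSelector: selector})
	if err != nil {
		return err
	}
	for i := range pods.Items {
		pod := &pods.Items[i]
		var kept []string
		for _, f := range pod.Finalizers {
			if f != finalizer {
				kept = append(kept, f)
			}
		}
		if len(kept) == len(pod.Finalizers) {
			continue // nothing to remove
		}
		pod.Finalizers = kept
		if _, err := c.CoreV1().Pods(namespace).Update(ctx, pod, metav1.UpdateOptions{}); err != nil {
			return err
		}
	}
	return nil
}
```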

Member

@wojtek-t wojtek-t left a comment

Agree with Daniel - that looks reasonable to me.


Comment on lines 209 to 212
3. The Job controller removes the `batch.kubernetes.io/job-completion` finalizer
from all Pods on Succeeded or Failed phase. We can assume that all of them
have been added to the lists if the previous step succeeded. The controller
keeps note of the Pod updates that succeeded.
Member

Not all, since you are planning to do them in batches. We will remove the ones we added to the lists in the previous step.

Member Author

adjusted

Comment on lines 215 to 216
- had no finalizer at the beginning of the sync, or
- had the patch in step 3 succeed, or
Member

This is an implementation detail; the algorithm is: "count pods listed in uncountedPodUIDs that either have no finalizer or are deleted".

Member Author

I don't see harm in the detail.

Member

@ahg-g ahg-g Jan 22, 2021

Sometimes the detail adds confusion rather than clarity. As a reader, I paused a little reading this, wondering why we are distinguishing between the two. When I read a proposal, I expect to read about the semantics or behavior we are trying to achieve; if there is uncertainty about how to achieve that behavior, then I expect to see an implementation detail section, for example. This is a nit, so take it for what it is worth.

Member Author

The high-level overview is in Proposal and Notes. This is the section for nitty-gritty details.

Member

The whole doc is a proposal (it is a key enhancement proposal) :)

there are limits to nitty-gritty details, there are ones that are useful and ones that add confusion because they don't add to the understanding of what is being proposed, my feedback is that this added confusion... again, I am replying not because I am insisting on the change, but just as a response to the justification you provided.

Contributor

I agree with @ahg-g here, I'd rather have a high level algorithm described here and we can put all the details in the actual implementation. This should reflect the high-level flow.

Member Author

I added this as a request from API reviewer to be explicit about continuing to process the Pods that succeed.
Moved it as a note a paragraph below.

Comment on lines 233 to 234
This allows for deleted Pods to not count towards the failed counter, like it
is today.
Member

not clear to me which part is "like it is today", is this a change of behavior? if not, I would s/like it is today/and so this doesn't change the current behavior

Member Author

it's not a change of behavior. Adjusting wording

Comment on lines 235 to 237
Note that step 2 doesn't pay attention to deletion timestamps. So the Job
controller might account the Pod in the lists and counters if the Pods
enter Failed or Succeeded phase before or while being deleted.
Member

out of curiosity, why is this worth noting?

Member Author

This is how the controller works today. Except that once the Pods are completely removed, they don't count anymore.

So maybe we should just stop counting pods with a termination timestamp altogether? That's the discussion for this UNRESOLVED section.

Member Author

@soltysh @erictune

I would appreciate your input here. There is one case where I think it might be worth reconsidering the semantics. Let's say a Node becomes unavailable and gets removed. In this case, the garbage collection also removes the Pod. Should this be considered a failure or not? Today, it's not.

Member

I think users ideally want failures caused by non-zero exit code (a problem within the user's control) to be counted separately from failures out of their control (preemption, node failure).

Does Failure today include both or just the former? If it is just the former, we should try to keep it that way. If it is a mixture, then needs a bit more thought.

Member Author

Today, it only includes failures from non-zero exit code. This is because other failures get the Pod removed.

Then, it sounds like we should be skipping Pods that have a termination timestamp from any calculations.

Member Author

Wait, but there's a problem with the above:
The Pod GC doesn't take into account finalizers. That means that Pods that this controller marks for deletion would also not be counted towards calculations.
Then I think the solution would be: not to count Pods that have a termination timestamp and modify the Pod GC to skip Pods with finalizers.

Member

The impression I've gathered from what I've seen from user reports is that users get quite grumpy when a pod starts and doesn't complete for some reason, and would like to know about that. Would they like the two classes of failures conflated? I'm less sure about the answer to that.

Contributor

Like I said above, controller-created pods that failed (removed, evicted, non-zero exit) should all be treated as failed. The fact that the current controller doesn't do it is one of the limitations this proposal is trying to address. IOW, with improved status calculation we'll get a more accurate signal of failure. For example, if there's a bad actor in the system consistently removing pods from a job, an increased failure rate on that job might alarm the job author, and that's a good thing, imo.

Member Author

I agree that this can be seen as a limitation of the legacy tracking.

Only when the Job controller itself removes the Pod should it NOT be considered a failure. This happens when parallelism is reduced or when the Job controller created extra unnecessary Pods. Added a note.

Comment on lines 243 to 244
If a Job with `.status.uncountedPodUIDs.enabled = true` can adopt a Pod, this
might not have a finalizer. We can:
Member

I think this doesn't read well. Also, isn't this possible only if the status is not yet set to Failed/Succeeded? if so, please clarify that.

Member Author

The qualifications for Pod adoption don't change from current behavior. Clarified.

Member

Does the current qualification criteria include pods that failed/succeeded? If the pod was in finished/succeeded status, the absence of the finalizer could be because the pod was already processed by the controller in a previous cycle.

Member Author

In that scenario, the Pod would already have an updated owner reference.

Member

@ahg-g ahg-g Jan 22, 2021

What guarantees that? Is it possible that an owner reference gets cleared after the pod was accounted for?

Member Author

@alculquicondor alculquicondor Jan 22, 2021

Actually there is currently no criteria other than the Job selector, which is mutable.

In theory, a Pod could finish, be accounted for Job A, lose its finalizer and then be claimed by another Job B. At this point, I think it's fair to adopt this Pod and account for it in Job B.
But what if Job A claims the Pod again? Then we have no way of adopting it without counting it again.

@wojtek-t @lavalamp thoughts?

Perhaps we should actually only adopt a Pod if it has the finalizer.

Member

You should really not have this problem, because only one job can have the "controller reference"; you should not clear that, and no one else should either.

Member

right, users should not, but they could. While unlikely, I think this is worth a note because in practice this is a regression compared to current situation.

Member Author

@alculquicondor alculquicondor Jan 22, 2021

Are you referring to this being documented as unexpected behavior? https://kubernetes.io/docs/concepts/workloads/controllers/job/#specifying-your-own-pod-selector

Although I'm proposing a scenario where ownership is transferred from Job A to Job B and then back to Job A by carefully updating the selectors so that there are no conflicts at a given time. But maybe we can document this as unexpected behavior too.

Steps 2 to 4 might deal with a potentially big number of Pods. Thus, status
updates can potentially stress the kube-apiserver. For this reason, the Job
controller repeats steps 2 to 4, dealing with up to 100 Pods each time, until
the controller processes all the Pods.
Member

I think you added this in response to Daniel's comment.
@lavalamp , were you worried about too frequent of updates (not batching) or about the size of the diff from a single update (which the above seems concerned with).

Member

I don't think I'm concerned about either. I do think it's probably good to limit the size of the chunks we process just to guarantee progress (partial update is better than never completing a huge update). But I personally would probably have a time limit-- process all the pods you can in 1m. I would say 15s but we have to consider the possibility of apiserver taking the maximum amount of time to respond to a request (currently 60s). The tests probably need to test under the "apiserver is really slow" condition.

Member Author

> process all the pods you can in 1m.

Uhmm. That would have to look like:

  • repeatedly update K Pods in parallel while 1m hasn't passed.
  • setup a channel and K workers, continue processing for 1m.

So then we have to decide on K. Which exact method we use can probably be left for code reviews.

Member Author

cc @wojtek-t since you were asking about batching updates in the other KEP

Member

Hmm - I don't understand why updating the status (a) with N pods and (b) with M pods is different from the kube-apiserver's perspective.

I think what we should do is: (a) take time into account, (b) have a cap on the number of pods added.

We need a limit to ensure that the object won't grow super large, e.g. if the job controller was down for some time and in the meantime gazillions of pods finished.

Member Author

So I take this as: we need to cap by time and number. Updated the KEP.
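
A sketch of the batching shape this converges on, with both caps; the limits and the callback are placeholders, since the KEP deliberately leaves exact numbers open:

```go
import (
	"time"

	"k8s.io/apimachinery/pkg/types"
)

// syncInBatches processes finished Pod UIDs in chunks of at most maxUIDs and
// stops once the time budget is spent; leftover Pods are picked up by the
// next sync. processBatch stands in for: append the UIDs to the Job's
// uncounted lists (status update), remove the Pod finalizers, then move the
// UIDs into the counters and clear the lists (another status update).
func syncInBatches(finished []types.UID, maxUIDs int, budget time.Duration,
	processBatch func([]types.UID) error) error {
	deadline := time.Now().Add(budget)
	for len(finished) > 0 && time.Now().Before(deadline) {
		n := maxUIDs
		if n > len(finished) {
			n = len(finished)
		}
		if err := processBatch(finished[:n]); err != nil {
			return err
		}
		finished = finished[n:]
	}
	return nil
}
```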

- were removed from the system.
The counts increment the `.status.failed` and `.status.succeeded` and clears
counted Pods from `.status.uncountedPodsUIDs` lists. The controller sends a
status update.
Member

Talk about if there are changes to the logic where the controller decides whether to start new pods.
Does it look at the current .status.succeeded and compare that to .spec.completions? Or does it look at .status.succeeded + len(.status.uncountedPodUIDs.succeeded)? The list could be non-empty due to the throttling mentioned below. Or does it block starting new pods if there are any uncounted pods?

Member Author

Added to step 1.

I agree that the correct name should have terminated in it. What about uncountedTerminatedPodUIDs?

Member

This is a good point, "uncounted" isn't quite right, it's "terminatedAndUncounted". Yes, that logic needs to look at 3 sources of information, a) the counts, b) the existing pods which are not in c) the terminatedAndUncounted list.

Member Author

changed
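
For reference, the renamed shape as it shows up later in this review (a sketch; `types.UID` comes from `k8s.io/apimachinery/pkg/types`):

```go
// UncountedTerminatedPods holds UIDs of Pods that have terminated but whose
// termination hasn't been reflected in the Job status counters yet.
type UncountedTerminatedPods struct {
	// Succeeded holds UIDs of succeeded Pods.
	Succeeded []types.UID
	// Failed holds UIDs of failed Pods.
	Failed []types.UID
}
```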

@alculquicondor alculquicondor force-pushed the job-tracking branch 2 times, most recently from 410f451 to 4164b63 on January 27, 2021 15:33
the controller adds finished Pods to a list in the Job status. After removing
the Pod finalizers, the controller clears the list and updates the counters.

### Risks and Mitigations
Member

is increasing object size a risk that we should mention?

Member Author

yes, added.

Member

What level of scale do you want to support? Do you have estimates of the sizes that these lists could be, in both ordinary operation and in different failure modes?

There are several problems with very large Job statuses:

  1. The default request size limit on etcd is 1.5MiB, not sure if that's what most folks use. But at that size and 36 characters for a UID that's ~ 40,000 UIDs.
  2. In addition to the strict size limit, every time any field is updated on the status, it requires the transmission of the entire list. So large objects increase load on the network and API server more than lots of small objects, which can be updated independently.
  3. Using kubectl will be very painful if there are 40,000 UIDs in the list.

I think instead it would be better to use a design that doesn't have an inherent limitation on job scale. Something like the way Endpoints were broken into EndpointSlices. In this case, you would have a new resource, JobSlice or JobTracker or something, that is used to manage these states. Those would be fixed size; say, corresponding to say 500 pods or so each. They would have a selector for the job, and pods would get selectors for the JobSlice instead of or in addition to the Job.

This allows arbitrarily high numbers of Pods per Job. It can also position us for sharding reaping the Pod changes across slices by having different controller instances/replicas manage different
JobSlices.

Member

We've been discussing that too.

The difference with Endpoints here is that Endpoints have gazillions of watches (e.g. every kube-proxy). It's not the case for Jobs - which changes a lot.
Just a single update is generally not that expensive, even if it is for that huge object.

There are ways of mitigating stuff;

  • for UIDs storing, we can have a limit, e.g. say 100 UIDs - we fully control that on the controller side and can do whatever we want
  • in general, we control which pods and how fast we "finalize", so if we say we don't want Job object to exceed e.g. 10kB, we can achieve it - there is a tradeoff on "batch size vs number of batches"

So I actually think it isn't as bad a design as it looks to be :).
TBH - I'm actually supportive of it from a scalability POV if we implement it in a good way.

Member Author

FYI, the KEP already mentions limiting the UIDs in the controller

Member

Which are you doing:

Option 1:

  • Say 100 pods recently finished.
  • make 100 calls to apiserver for each pods resource.
  • make 100 calls to apiserver for the job resource

Option 2:

  • Say 100 pods recently finished.
  • make 100 calls to apiserver for each pods resource.
  • make 1 call to apiserver for the job resource

Member Author

option 2, but it's actually 2 api calls to the job resource

Member Author

Added expected number of processed Pods per minute and a target of 5000 for beta graduation.

Member

Thanks, and I see GA grad criteria has 10^5 for completions (for a single job I presume). So the job tracking takes about 20 minutes for a job of that size, in this plan, in the degenerate scenario where every pod just exits. Is that 5000 per job, or across all jobs? If there are 10 jobs of 100k pods running, accounting takes 200 minutes? Is that OK for our users? (really asking, I have no idea. it sounds reasonable to me).

Member Author

Clarified 5000 to be across any number of Jobs.

As for 10^5 criteria, clarified to be per Job and in the order of minutes. I think we can aim to be 10min, but we can tune that after alpha is implemented.



controller repeats steps 2 to 4, capping the number of Pods each time, until
the controller processes all the Pods. The number of Pods is capped by:
- time: in each iteration, the job controller removes all the Pods' finalizers
it can in 1 minute. This allows to throttle the number of status updates.
Contributor

Similarly to previous example, I'd suggest saying in a small unit of time, w/o actually specifying how small that is. This might change over time so it's better not to put it explicitly here.

Contributor

But 👍 for the idea itself

Member Author

changed to "order of minutes". It's actually not a small unit of time.

controller tracks Pods using the legacy algorithm.

The kube-apiserver sets `.status.uncountedTerminatedPodUIDs.enabled = true` when
the feature gate `JobTrackingWithoutLingeringPods` is enabled, at Job creation.
Contributor

I'm not sure if this is actually needed, the controller should be capable of reacting to the current state of the job and update its status accordingly.

Member Author

We need this for any new API field to be able to handle partial upgrades in a HA control plane.

Comment on lines 160 to 161
However, the ability to handle higher number of Pods and Jobs at a given
point justifies the added cost of the new API calls.
Member

I don't understand the above two lines, can you clarify?

Member Author

Updated


### Risks and Mitigations

1. The new implementation produces new API calls: one per Pod to remove
Member

Can we have an estimate of the number of extra qps updates that will be used to implement this approach?

Workloads that run many short running single-pod jobs may not benefit too much from this new approach, at the same time it will negatively impact throughput since we are using some of the controller qps limit towards adding and removing the finalizer from the pod and perhaps extra updates to the job status.

/cc @wojtek-t

Member Author

Not sure what else I can say: there is one more API call for each Pod created by the controller. And one for each Job, each time a group of Pods finish.

Other than that, there is no other option for single-pod jobs. We still need to guarantee that the Pod is accounted, so the finalizer is also needed.

I considered just adding the Pod to the uncounted lists at Pod creation. But that doesn't work, because we don't know if the Pod will be successful or not.

Member

We can estimate the impact on throughput assuming an x qps limit for the following two extremes: 1) for one pod per job and 2) many pods per job. The former is probably the more important one especially for short running jobs.

How many updates per job do we currently send for a single pod job? two (one for creating the job and one for updating the status when the job completes)?

Member Author

added more details and justification

Member Author

And @lavalamp gave me the idea of one alleviation: to skip updates sometimes. See latest commit.

Member

So for one pod per job we will see a ~50% drop in throughput? It used to be 1 query to create the pod and 1 to update the job status; now we add two more: another job status update and 1 to remove the finalizer?

@alculquicondor alculquicondor changed the title from "Add KEP for job tracking in provisional status" to "Add KEP for job tracking in implementable status" on Jan 28, 2021
@wojtek-t wojtek-t self-assigned this Jan 29, 2021
account for deleted Pods. This is a limitation that this KEP also wants to
solve.

However, if it's the Job controller what deletes the Pod (when parallelism
Contributor

Suggested change
However, if it's the Job controller what deletes the Pod (when parallelism
However, if the Job controller deletes the Pod (when parallelism

Member Author

Done

Contributor

@soltysh soltysh left a comment

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 29, 2021
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 29, 2021
@alculquicondor
Member Author

squashed


* **Will enabling / using this feature result in any new API calls?**

- PATCH Pods, to remove finalizers.
- estimated throughput: one per created Pod, when Pod finishes or is removed.
Member

one per Pod created by the Job controller, correct? Not for every Pod created.

Member Author

correct, added clarification.

- Pod
- Estimated increase: new finalizer of 33 bytes.
- Job status
- Estimated increase: new array temporarily containing terminated Pod UIDs.
Member

See discussion above. Would like to see estimates on how big these arrays can get in both normal operation and failure modes.

Member

I added comments below - I think we can control this (and we should) - and I agree I want to see a real number here.
But I think this is fine if we do that the right way.

Member Author

estimating ~10kB for about 500 UIDs

keps/prod-readiness/sig-apps/2307.yaml (resolved review comments)

- time: in each iteration, the job controller removes all the Pods' finalizers
it can in a unit of time in the order of minutes. This allows to throttle the
number of Job status updates.
- count: Preventing big writes to etcd.
Member

We may want to provide some estimate, e.g. no more than 500 UIDs can be stored in UncountedPods at any point in time.

Member Author

saying "order of hundreds"

// haven't been accounted in the counters.
// If nil, Job tracking doesn't make use of this structure.
// +optional
UncountedTerminatedPods *UncountedTerminatedPods
Member

@lavalamp @erictune - IIUC, in the past we were using a convention that status contains information that can be computed based on the cluster state.
This is more or less why things like "AcquireTime" or "RenewTime" are in LeaseSpec:
https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/api/coordination/v1/types.go#L40
[that was Brian's suggestion back then].

Did we give up on this rule? If not, then all Succeeded, Failed, and UncountedTerminatedPods technically no longer belong to status...

[I don't have strong opinion here - just want to flag this for discussion.]

@liggitt - for his API reviewer/approver opinion

Member

Actually this makes the Success / Finished numbers un-reproducible from the environment. UncountedTerminatedPods is (at least partially) reproducible! I think we are just giving up on this criteria to make this work. It's worth a discussion but I don't see other options. The reproducibility criterion is (IMO) mostly aesthetic (we've never made the system do anything that relied on it and it seems quite unlikely that we will in the future) and it's not worth it to keep it for this object.

Member

> Actually this makes the Success / Finished numbers un-reproducible from the environment. UncountedTerminatedPods is (at least partially) reproducible! I think we are just giving up on this criteria to make this work. It's worth a discussion but I don't see other options. The reproducibility criterion is (IMO) mostly aesthetic (we've never made the system do anything that relied on it and it seems quite unlikely that we will in the future) and it's not worth it to keep it for this object.

agreed.

Member

@erictune erictune Feb 1, 2021

AIUI, the intent of this API Convention was that status could be stored in some less reliable storage for performance. 5+ years on from this intention, we haven't made use of that possibility. And I really doubt that all clients are prepared for the event where status disappears and then is later reconstructed. I am thinking of e.g. kubectl waiting for a Ready condition, and then all the status disappears. Is the client code going to do the right thing? What is the right thing? We never designed for that. So we should remove that API convention.

Member

Thanks - for me it was also kind of aesthetic thing, but I wanted to ensure that there is a general agreement on that.
It seems there is, which is what I wanted to achieve.

Member

I think the intent is also that this rule forces requirements on the cluster introspection mechanism to be able to surface anything important enough to track in status. If we have the ability to reconstruct status, then we know resources are not leaked out; our reconciliation is fully closed, without drift. If we put something in status that we cannot discover from the cluster itself we no longer have that guarantee - status could diverge from reality.

That said, in the case of job tracking, there are no resources leaked or lost track of. We lose track of some job stats potentially, but not things that are consuming underlying resources. So, only issue is the slippery slope.


* **Are there any tests for feature enablement/disablement?**

Yes, there are integration tests.
Member

Are there? I'm assuming these are planned, but let's not mention they already exist....

Member Author

reworded

- originating component: kube-controller-manager
- PUT Job status, to keep track of uncounted Pods.
- estimated throughput: at least one per Job sync. Extra calls are throttled
at 1 per minute.
Member

1 per minute might be too aggressive

Member

will this throttling delay reflecting job completion in the API object? I didn't see tests explicitly for that.

Member Author

Added to test plan as integration.


Member

@wojtek-t wojtek-t left a comment

I'm fine with this modulo one small comment.

Please gather SIG approval and I will approve.


- originating component: kube-controller-manager
- PUT Job status, to keep track of uncounted Pods.
- estimated throughput: at least one per Job sync. The job controller
throttles more calls at 1 per a few minutes (precise throughput TBD from
Member

nit: per few seconds, not few minutes;

[or to be more specific I think what we should do is:

  • as soon as we gather N pods (to feed the 20kB limit you mentioned below)
  • but at least every X(=10?) seconds, to make it somewhat responsive

Member Author

Changed to seconds and added more detail in the Algorithm section.

Some reviewers requested that the KEP leaves room for experimentation before settling on specific numbers.

@alculquicondor
Member Author

@janetkuo this is ready for SIG approval

@alculquicondor alculquicondor force-pushed the job-tracking branch 2 times, most recently from 3462c95 to 43dd18f on February 2, 2021 21:21
Signed-off-by: Aldo Culquicondor <acondor@google.com>
@erictune
Member

erictune commented Feb 3, 2021

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 3, 2021
type UncountedTerminatedPods struct {
// Succeeded holds UIDs of succeeded Pods.
Succeeded []types.UID
// Succeeded holds UIDs of failed Pods.
Member

nit: Failed

Member Author

done

cycle.

1. The Job controller calculates the number of succeeded Pods as the sum of:
- `.status.succeeded`,
Member

So the behavior of .status.succeeded is slightly different now. Does this break API compatibility?

Contributor

@janetkuo I don't think we are changing this significantly; there are slight variances in how this will get calculated during execution, but the final result will be the same. Nobody should ever rely on how this number is being updated, nor on partial results. I'm not personally worried about this breaking API compatibility.

Member

SGTM

1. The Job controller calculates the number of succeeded Pods as the sum of:
- `.status.succeeded`,
- the size of `job.status.uncountedTerminatedPods.succeeded` and
- the number of finished Pods that are not in `job.status.uncountedTerminatedPods.succeeded`.
Member

Don't these overlap with the pods counted in .status.succeeded? Would you rephrase?

Member Author

Right, they should have a finalizer too. Fixed.
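
A sketch of the counting rule after that fix; the helper is illustrative, assumes the UncountedTerminatedPods type proposed in this KEP lands in k8s.io/api/batch/v1, and the real controller bookkeeping may differ. A finished Pod only adds to the total if it still carries the tracking finalizer and is not already in the uncounted set:

```go
// countSucceeded (illustrative) combines the three sources discussed above.
func countSucceeded(job *batchv1.Job, pods []*corev1.Pod, finalizer string) int32 {
	uncounted := job.Status.UncountedTerminatedPods
	if uncounted == nil {
		uncounted = &batchv1.UncountedTerminatedPods{}
	}
	inList := map[types.UID]bool{}
	for _, uid := range uncounted.Succeeded {
		inList[uid] = true
	}
	succeeded := job.Status.Succeeded + int32(len(uncounted.Succeeded))
	for _, p := range pods {
		// Pods without the finalizer were already counted in .status.succeeded;
		// Pods in the uncounted list were already added above.
		if p.Status.Phase == corev1.PodSucceeded && hasFinalizer(p, finalizer) && !inList[p.UID] {
			succeeded++
		}
	}
	return succeeded
}

func hasFinalizer(p *corev1.Pod, name string) bool {
	for _, f := range p.Finalizers {
		if f == name {
			return true
		}
	}
	return false
}
```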


However, if the Job controller deletes the Pod (when parallelism is decreased,
for example), the controller removes the finalizer before deleting it. Thus,
these deletions don't count towards the failures.
Member

From the description here, it'd apply to suspended Jobs (KEP 2232) as well, I assume?

Member Author

Yes, but it applies to any other future feature that might delete pods.
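
A minimal sketch of that ordering (names illustrative; imports as in the earlier sketches): remove the tracking finalizer first, then delete, so the Pod's disappearance never reaches the failed counters.

```go
// deleteWithoutCounting (illustrative): the controller strips its finalizer
// before issuing the delete, so this Pod is never treated as a failure.
func deleteWithoutCounting(ctx context.Context, c kubernetes.Interface, pod *corev1.Pod, finalizer string) error {
	var kept []string
	for _, f := range pod.Finalizers {
		if f != finalizer {
			kept = append(kept, f)
		}
	}
	pod.Finalizers = kept
	if _, err := c.CoreV1().Pods(pod.Namespace).Update(ctx, pod, metav1.UpdateOptions{}); err != nil {
		return err
	}
	return c.CoreV1().Pods(pod.Namespace).Delete(ctx, pod.Name, metav1.DeleteOptions{})
}
```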

during API review.

### Algorithm

Member

If the user deletes the Job, how are the finalizers of its Pods handled?

Member Author

Added a section.

#### Beta -> GA Graduation

- E2E test graduates to conformance.
- Job tracking scales to 10^5 completions per Job processed within an order of
Member

If we assume that "order of minutes" is 5 minutes - that would mean O(300) QPS to delete finalizers.

I don't think we should be that aggressive - it's much more than we support now.

Member Author

I was thinking more like 10min. Is that still too much? This would be assuming that this is the only Job though.

Member

IMO in the situation where you have a job that's that big, 300 QPS is perfectly reasonable-- that's probably a large fraction of what you're running on the cluster anyway. Compare it to the writes it took to actually get all those pods through their lifecycle. It's not adding much.

Member

If you're saying how it should be in an ideal world - sure, I believe even more should be possible.
If you're asking where we are now - I doubt we will be able to do that (I mean, we can handle 300 qps of course, but we can't blindly bump qps limits until we have P&F supporting things like LISTs, and also some rate-limiting that would rate-limit watches - whatever that means).

Also note that it highly depends on how long a particular pod takes to run - if those are very short-running, then it matters; if they are long-running, then it's not that big of a concern.

Member Author

This is still in the Beta -> GA graduation, so I think it should be fine to have this as tentative goal

Signed-off-by: Aldo Culquicondor <acondor@google.com>
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 4, 2021
@wojtek-t
Member

wojtek-t commented Feb 8, 2021

/approve PRR

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alculquicondor, erictune, janetkuo, soltysh, wojtek-t


@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 8, 2021
@wojtek-t
Member

wojtek-t commented Feb 8, 2021

Adding lgtm, as Janet's approval was the last thing in this PR.

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 8, 2021
@k8s-ci-robot k8s-ci-robot merged commit 5d98e09 into kubernetes:master Feb 8, 2021
@k8s-ci-robot k8s-ci-robot added this to the v1.21 milestone Feb 8, 2021