
Add KEP for job tracking in implementable status #2308

Merged
2 commits merged into kubernetes:master on Feb 8, 2021

Conversation


@alculquicondor alculquicondor commented Jan 21, 2021

@k8s-ci-robot k8s-ci-robot added sig/apps Categorizes an issue or PR as relevant to SIG Apps. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory labels Jan 21, 2021
@alculquicondor
Member Author

/assign @erictune
for sig-apps
/assign @lavalamp
for api changes

Member

@lavalamp lavalamp left a comment

Looks good, I think the downgrade scenario is the only important comment.

// UncountedPodUIDs holds UIDs of Pods that have finished but haven't been
// accounted in Job status counters.
type UncountedPodUIDs struct {
// Enabled indicates whether Job tracking uses this struct.
Member

Perhaps just make the UncountedPodUIDs field optional, then you don't need this?

Member Author

something like the following?

type JobStatus struct {
  // +optional
  UncountedPodUIDs *UncountedPodUIDs
}

Would that differentiate nil from empty struct?

Member

Yes. But after some more thought, using an explicit "Enabled" field is probably clearer, so I guess let's keep it how you have it.

Member Author

What I don't like about the boolean is that it will eventually be useless. And the struct is only supposed to be handled by the Job controller, so I think I'm fine with the pointer if we can differentiate between nil and empty struct.

Contributor

👍 for pointer, this is only meant for tracking status and when not used it'll be just empty which is sufficient.

Member Author

changed to pointer.
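
For reference, a minimal sketch of the pointer-based shape agreed on in this thread (the struct was later renamed `UncountedTerminatedPods`; this is illustrative, not the final API text):

```go
// Sketch only: a nil pointer means the Job is tracked with the legacy
// algorithm; a non-nil (possibly empty) struct means finalizer-based tracking
// is in use, so no Enabled boolean is needed and the distinction between
// "unset" and "empty" is preserved.
type JobStatus struct {
	// ... existing fields (Active, Succeeded, Failed, ...) ...

	// +optional
	UncountedPodUIDs *UncountedPodUIDs
}
```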

them afterwards. The controller sends a status update.

In the case where a user or another controller removes a Pod, which sets a
deletion timestamp, the Job controller removes the finalizer. This is along with
Member

I'm not sure I follow what you're trying to say here? It's not wrong, I just don't understand why it needs to be said here.

Member Author

Moved to its own section and added an unresolved note.

might not have a finalizer. We can:
1. Prevent adoption if the Pod doesn't have the finalizer. This is a breaking
change.
1. Add the finalizer in the same patch request that modifies the owner
Member

Should go with this option.

Member

+1 - I don't see any disadvantages of this option.

Member Author

I'm thinking of incremental changes. 1 would be easier, so I can do it in the first implementation PR. If I have time until the codefreeze, I can do 2. But if I miss the deadline, then I think it's fine to leave it for beta. WDYT?

Member

I don't think this should be much harder than 1.

Member Author

Removed the unresolved marker; taking option 2.
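
For illustration, a rough sketch of what option 2 could look like with client-go; the helper name and exact patch body are assumptions, not the controller's final adoption code. A strategic merge patch can set the controller owner reference and add the tracking finalizer in a single request:

```go
import (
	"context"
	"fmt"

	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// adoptWithFinalizer (hypothetical helper) adopts a Pod and adds the tracking
// finalizer in the same PATCH, so an adopted Pod is never left without it.
func adoptWithFinalizer(ctx context.Context, c kubernetes.Interface, job *batchv1.Job, pod *corev1.Pod) error {
	patch := []byte(fmt.Sprintf(`{
  "metadata": {
    "ownerReferences": [{
      "apiVersion": "batch/v1",
      "kind": "Job",
      "name": %q,
      "uid": %q,
      "controller": true,
      "blockOwnerDeletion": true
    }],
    "finalizers": ["batch.kubernetes.io/job-completion"]
  }
}`, job.Name, job.UID))
	_, err := c.CoreV1().Pods(pod.Namespace).Patch(
		ctx, pod.Name, types.StrategicMergePatchType, patch, metav1.PatchOptions{})
	return err
}
```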

`batch.kubernetes.io/job-completion` finalizer. It would be hard to add the
finalizer to all Pods while preventing race conditions.

We use `.status.uncountedPodUIDs.enabled` to indicate whether the Job was
Member

What happens if we roll back to a version of the job controller that doesn't know about the finalizer?

Member

And what happens when we upgrade again :)

Member Author

That would cause many issues. There is nothing that can remove the finalizers.

So my answer would be: do not enable the feature gate in production clusters while it's in alpha. Disabling the feature gate once the feature is enabled by default should be fine, as the controller (and an N-1 version of it) knows how to remove finalizers.

Is this acceptable?

Member

Unfortunately, I think you might need to roll out in two releases; the first release will remove the finalizer if it is present, but doesn't add it. The second release can then start adding the finalizer and implement the rest of the feature.

The alternative is to do both of a) turning off the feature flag won't disable the finalizer removal logic and b) give users a command/script which they can use to manually remove all finalizers in the event of a downgrade.

If you're going to have an alpha release anyway, then I think it's fine to omit b) since the vast majority of users won't be in this situation.

Member Author

I'm fine with the alternative. Is there a directory where we already have scripts like the one described?

Member

No, the release notes are probably the place to put it. But if it's just alpha in the first release, there's probably no need.

@liggitt may have an opinion

Member

@liggitt liggitt Jan 22, 2021

> turning off the feature flag won't disable the finalizer removal logic

I haven't read the full context, but auto-added finalizers not being removed on rollback or failed HA upgrade or feature disable is a really nasty failure mode. Ungated logic to clean up a finalizer starting in alpha stage could be reasonable. This is similar to what was done with the kubernetes.io/pvc-protection finalizer in kubernetes/kubernetes#61324 / kubernetes/kubernetes#61370

Member Author

> Ungated logic to clean up a finalizer starting in alpha stage could be reasonable.

That is what's being proposed.
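
For illustration, a minimal sketch (under assumed names; imports as in the earlier sketch; not the real controller code) of what the cleanup being discussed boils down to, whether it lives in ungated controller logic or in a release-note script: strip the tracking finalizer from a Job's Pods with plain updates.

```go
// removeTrackingFinalizers (hypothetical helper) strips the tracking finalizer
// from every Pod matching the given label selector. A real implementation
// would retry on conflicts and paginate the List call.
func removeTrackingFinalizers(ctx context.Context, c kubernetes.Interface, namespace, selector string) error {
	const finalizer = "batch.kubernetes.io/job-completion"
	pods, err := c.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{LabelSelector: selector})
	if err != nil {
		return err
	}
	for i := range pods.Items {
		pod := &pods.Items[i]
		var kept []string
		for _, f := range pod.Finalizers {
			if f != finalizer {
				kept = append(kept, f)
			}
		}
		if len(kept) == len(pod.Finalizers) {
			continue // nothing to remove
		}
		pod.Finalizers = kept
		if _, err := c.CoreV1().Pods(namespace).Update(ctx, pod, metav1.UpdateOptions{}); err != nil {
			return err
		}
	}
	return nil
}
```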

Member

@wojtek-t wojtek-t left a comment

Agree with Daniel - that looks reasonable to me.


Comment on lines 209 to 212
3. The Job controller removes the `batch.kubernetes.io/job-completion` finalizer
from all Pods on Succeeded or Failed phase. We can assume that all of them
have been added to the lists if the previous step succeeded. The controller
keeps note of the Pod updates that succeeded.
Member

Not all, since you are planning to do them in batches. We will remove the ones we added to the lists in the previous step.

Member Author

adjusted

Comment on lines 215 to 216
- had no finalizer at the beginning of the sync, or
- had the patch in step 3 succeed, or
Member

This is an implementation detail; the algorithm is: "count pods listed in uncountedPodUIDs that either have no finalizer or are deleted".

Member Author

I don't see harm in the detail.

Member

@ahg-g ahg-g Jan 22, 2021

Sometimes the detail adds confusion rather than clarity. As a reader, I paused a little reading this, wondering why we are distinguishing between the two. When I read a proposal, I expect to read about the semantics or behavior we are trying to achieve; if there is uncertainty about how to achieve that behavior, then I expect to see an implementation detail section, for example. This is a nit, so take it for what it is worth.

Member Author

The high-level overview is in Proposal and Notes. This is the section for nitty-gritty details.

Member

The whole doc is a proposal (it is a key enhancement proposal) :)

there are limits to nitty-gritty details, there are ones that are useful and ones that add confusion because they don't add to the understanding of what is being proposed, my feedback is that this added confusion... again, I am replying not because I am insisting on the change, but just as a response to the justification you provided.

Contributor

I agree with @ahg-g here, I'd rather have a high level algorithm described here and we can put all the details in the actual implementation. This should reflect the high-level flow.

Member Author

I added this as a request from API reviewer to be explicit about continuing to process the Pods that succeed.
Moved it as a note a paragraph below.

Comment on lines 233 to 234
This allows for deleted Pods to not count towards the failed counter, like it
is today.
Member

not clear to me which part is "like it is today", is this a change of behavior? if not, I would s/like it is today/and so this doesn't change the current behavior

Member Author

it's not a change of behavior. Adjusting wording

Comment on lines 235 to 237
Note that step 2 doesn't pay attention to deletion timestamps. So the Job
controller might account the Pod in the lists and counters if the Pods
enter Failed or Succeeded phase before or while being deleted.
Member

out of curiosity, why is this worth noting?

Member Author

This is how the controller works today. Except that once the Pods are completely removed, they don't count anymore.

So maybe we should just stop counting pods with a termination timestamp altogether? That's the discussion for this UNRESOLVED section.

Member Author

@soltysh @erictune

I would appreciate your input here. There is one case where I think it might be worth reconsidering the semantics. Let's say a Node becomes unavailable and gets removed. In this case, the garbage collection also removes the Pod. Should this be considered a failure or not? Today, it's not.

Member

I think users ideally want failures caused by non-zero exit code (a problem within the user's control) to be counted separately from failures out of their control (preemption, node failure).

Does Failure today include both or just the former? If it is just the former, we should try to keep it that way. If it is a mixture, then needs a bit more thought.

Member Author

Today, it only includes failures from non-zero exit code. This is because other failures get the Pod removed.

Then, it sounds like we should be skipping Pods that have a termination timestamp from any calculations.

Member Author

Wait, but there's a problem with the above:
The Pod GC doesn't take into account finalizers. That means that Pods that this controller marks for deletion would also not be counted towards calculations.
Then I think the solution would be: not to count Pods that have a termination timestamp and modify the Pod GC to skip Pods with finalizers.

Member

The impression I've gathered from what I've seen from user reports is that users get quite grumpy when a pod starts and doesn't complete for some reason, and would like to know about that. Would they like the two classes of failures conflated? I'm less sure about the answer to that.

Contributor

Like I said above, controller-created pods that failed (removed, evicted, non-zero exit) should all be treated as failed. The fact that the current controller doesn't do it is one of the limitations this proposal is trying to address. IOW, with improved status calculation we'll get a more accurate signal of failure. For example, if there's a bad actor in the system consistently removing pods from a job, an increased failure rate on that job might alarm the job author, and that's a good thing, imo.

Member Author

I agree that this can be seen as a limitation of the legacy tracking.

Only when the Job controller itself removes the Pod should it NOT be considered a failure. This happens when parallelism is reduced or when the Job controller created extra unnecessary Pods. Added a note.

Comment on lines 243 to 244
If a Job with `.status.uncountedPodUIDs.enabled = true` can adopt a Pod, this
might not have a finalizer. We can:
Member

I think this doesn't read well. Also, isn't this possible only if the status is not yet set to Failed/Succeeded? if so, please clarify that.

Member Author

The qualifications for Pod adoption don't change from current behavior. Clarified.

Member

Does the current qualification criteria include pods that failed/succeeded? If the pod was in finished/succeeded status, the absence of the finalizer could be because the pod was already processed by the controller in a previous cycle.

Member Author

In that scenario, the Pod would already have an updated owner reference.

Member

@ahg-g ahg-g Jan 22, 2021

What guarantees that? Is it possible that an owner reference gets cleared after the pod was accounted for?

Member Author

@alculquicondor alculquicondor Jan 22, 2021

Actually there is currently no criteria other than the Job selector, which is mutable.

In theory, a Pod could finish, be accounted for Job A, lose its finalizer and then be claimed by another Job B. At this point, I think it's fair to adopt this Pod and account for it in Job B.
But what if Job A claims the Pod again? Then we have no way of adopting it without counting it again.

@wojtek-t @lavalamp thoughts?

Perhaps we should actually only adopt a Pod if it has the finalizer.

Member

You should really not have this problem, because only one job can have the "controller reference"; you should not clear that, and no one else should either.

Member

right, users should not, but they could. While unlikely, I think this is worth a note because in practice this is a regression compared to current situation.

Member Author

@alculquicondor alculquicondor Jan 22, 2021

Are you referring to this being documented as unexpected behavior? https://kubernetes.io/docs/concepts/workloads/controllers/job/#specifying-your-own-pod-selector

Although I'm proposing a scenario where ownership is transferred from Job A to Job B and then back to Job A by carefully updating the selectors so that there are no conflicts at a given time. But maybe we can document this as unexpected behavior too.

Steps 2 to 4 might deal with a potentially big number of Pods. Thus, status
updates can potentially stress the kube-apiserver. For this reason, the Job
controller repeats steps 2 to 4, dealing with up to 100 Pods each time, until
the controller processes all the Pods.
Member

I think you added this in response to Daniel's comment.
@lavalamp , were you worried about too frequent of updates (not batching) or about the size of the diff from a single update (which the above seems concerned with).

Member

I don't think I'm concerned about either. I do think it's probably good to limit the size of the chunks we process just to guarantee progress (partial update is better than never completing a huge update). But I personally would probably have a time limit-- process all the pods you can in 1m. I would say 15s but we have to consider the possibility of apiserver taking the maximum amount of time to respond to a request (currently 60s). The tests probably need to test under the "apiserver is really slow" condition.

Member Author

> process all the pods you can in 1m.

Uhmm. That would have to look like:

  • repeatedly update K Pods in parallel while 1m hasn't passed.
  • setup a channel and K workers, continue processing for 1m.

So then we have to decide on K. Which exact method we use can probably be left for code reviews.

Member Author

cc @wojtek-t since you were asking about batching updates in the other KEP

Member

Hmm - I don't understand why updating the status (a) with N pods and (b) with M pods is different from the kube-apiserver's perspective.

I think what we should do is: (a) take time into account, (b) have a cap on the number of pods added.

We need a limit to ensure that the object won't grow super large, e.g. if the job controller was down for some time and in the meantime gazillions of pods finished.

Member Author

So I take this as: we need to cap by time and number. Updated the KEP.
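
A sketch of the batching shape this converges on, with both caps; the limits and the callback are placeholders, since the KEP deliberately leaves exact numbers open:

```go
import (
	"time"

	"k8s.io/apimachinery/pkg/types"
)

// syncInBatches processes finished Pod UIDs in chunks of at most maxUIDs and
// stops once the time budget is spent; leftover Pods are picked up by the
// next sync. processBatch stands in for: append the UIDs to the Job's
// uncounted lists (status update), remove the Pod finalizers, then move the
// UIDs into the counters and clear the lists (another status update).
func syncInBatches(finished []types.UID, maxUIDs int, budget time.Duration,
	processBatch func([]types.UID) error) error {
	deadline := time.Now().Add(budget)
	for len(finished) > 0 && time.Now().Before(deadline) {
		n := maxUIDs
		if n > len(finished) {
			n = len(finished)
		}
		if err := processBatch(finished[:n]); err != nil {
			return err
		}
		finished = finished[n:]
	}
	return nil
}
```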

- were removed from the system.
The counts increment the `.status.failed` and `.status.succeeded` and clears
counted Pods from `.status.uncountedPodsUIDs` lists. The controller sends a
status update.
Member

Talk about if there are changes to the logic where the controller decides whether to start new pods.
Does it look at the current .status.succeeded and compare that to .spec.completions? Or does it look at .status.succeeded + len(.status.uncountedPodUIDs.succeeded)? The list could be non-empty due to the throttling mentioned below. Or does it block starting new pods if there are any uncounted pods?

Member Author

Added to step 1.

I agree that the correct name should have terminated in it. What about uncountedTerminatedPodUIDs?

Member

This is a good point, "uncounted" isn't quite right, it's "terminatedAndUncounted". Yes, that logic needs to look at 3 sources of information, a) the counts, b) the existing pods which are not in c) the terminatedAndUncounted list.

Member Author

changed
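
For reference, the renamed shape as it shows up later in this review (a sketch; `types.UID` comes from `k8s.io/apimachinery/pkg/types`):

```go
// UncountedTerminatedPods holds UIDs of Pods that have terminated but whose
// termination hasn't been reflected in the Job status counters yet.
type UncountedTerminatedPods struct {
	// Succeeded holds UIDs of succeeded Pods.
	Succeeded []types.UID
	// Failed holds UIDs of failed Pods.
	Failed []types.UID
}
```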

@alculquicondor alculquicondor force-pushed the job-tracking branch 2 times, most recently from 410f451 to 4164b63 on January 27, 2021 15:33
the controller adds finished Pods to a list in the Job status. After removing
the Pod finalizers, the controller clears the list and updates the counters.

### Risks and Mitigations
Member

is increasing object size a risk that we should mention?

Member Author

yes, added.

Member

What level of scale do you want to support? Do you have estimates of the sizes that these lists could be, in both ordinary operation and in different failure modes?

There are several problems with very large Job statuses:

  1. The default request size limit on etcd is 1.5MiB, not sure if that's what most folks use. But at that size and 36 characters for a UID that's ~ 40,000 UIDs.
  2. In addition to the strict size limit, every time any field is updated on the status, it requires the transmission of the entire list. So large objects increase load on the network and API server more than lots of small objects, which can be updated independently.
  3. Using kubectl will be very painful if there are 40,000 UIDs in the list.

I think instead it would be better to use a design that doesn't have an inherent limitation on job scale. Something like the way Endpoints were broken into EndpointSlices. In this case, you would have a new resource, JobSlice or JobTracker or something, that is used to manage these states. Those would be fixed size; say, corresponding to say 500 pods or so each. They would have a selector for the job, and pods would get selectors for the JobSlice instead of or in addition to the Job.

This allows arbitrarily high numbers of Pods per Job. It can also position us for sharding reaping the Pod changes across slices by having different controller instances/replicas manage different
JobSlices.

Member

We've been discussing that too.

The difference with Endpoints here is that Endpoints have gazillions of watches (e.g. every kube-proxy). It's not the case for Jobs - which changes a lot.
Just a single update is generally not that expensive, even if it is for that huge object.

There are ways of mitigating stuff;

  • for UIDs storing, we can have a limit, e.g. say 100 UIDs - we fully control that on the controller side and can do whatever we want
  • in general, we control which pods and how fast we "finalize", so if we say we don't want Job object to exceed e.g. 10kB, we can achieve it - there is a tradeoff on "batch size vs number of batches"

So I actually think it isn't as bad a design as it looks to be :).
TBH - I'm actually supportive of it from a scalability POV if we implement it in a good way.

Member Author

FYI, the KEP already mentions limiting the UIDs in the controller

Member

Which are you doing:

Option 1:

  • Say 100 pods recently finished.
  • make 100 calls to apiserver for each pods resource.
  • make 100 calls to apiserver for the job resource

Option 2:

  • Say 100 pods recently finished.
  • make 100 calls to apiserver for each pods resource.
  • make 1 call to apiserver for the job resource

Member Author

option 2, but it's actually 2 api calls to the job resource

Member Author

Added expected number of processed Pods per minute and a target of 5000 for beta graduation.

Member

Thanks, and I see GA grad criteria has 10^5 for completions (for a single job I presume). So the job tracking takes about 20 minutes for a job of that size, in this plan, in the degenerate scenario where every pod just exits. Is that 5000 per job, or across all jobs? If there are 10 jobs of 100k pods running, accounting takes 200 minutes? Is that OK for our users? (really asking, I have no idea. it sounds reasonable to me).

Member Author

Clarified 5000 to be across any number of Jobs.

As for 10^5 criteria, clarified to be per Job and in the order of minutes. I think we can aim to be 10min, but we can tune that after alpha is implemented.



controller repeats steps 2 to 4, capping the number of Pods each time, until
the controller processes all the Pods. The number of Pods is capped by:
- time: in each iteration, the job controller removes all the Pods' finalizers
it can in 1 minute. This allows to throttle the number of status updates.
Contributor

Similarly to previous example, I'd suggest saying in a small unit of time, w/o actually specifying how small that is. This might change over time so it's better not to put it explicitly here.

Contributor

But 👍 for the idea itself

Member Author

changed to "order of minutes". It's actually not a small unit of time.

controller tracks Pods using the legacy algorithm.

The kube-apiserver sets `.status.uncountedTerminatedPodUIDs.enabled = true` when
the feature gate `JobTrackingWithoutLingeringPods` is enabled, at Job creation.
Contributor

I'm not sure if this is actually needed, the controller should be capable of reacting to the current state of the job and update its status accordingly.

Member Author

We need this for any new API field to be able to handle partial upgrades in a HA control plane.

Comment on lines 160 to 161
However, the ability to handle higher number of Pods and Jobs at a given
point justifies the added cost of the new API calls.
Member

I don't understand the above two lines, can you clarify?

Member Author

Updated


### Risks and Mitigations

1. The new implementation produces new API calls: one per Pod to remove
Member

Can we have an estimate of the number of extra qps updates that will be used to implement this approach?

Workloads that run many short running single-pod jobs may not benefit too much from this new approach, at the same time it will negatively impact throughput since we are using some of the controller qps limit towards adding and removing the finalizer from the pod and perhaps extra updates to the job status.

/cc @wojtek-t

Member Author

Not sure what else I can say: there is one more API call for each Pod created by the controller. And one for each Job, each time a group of Pods finish.

Other than that, there is no other option for single-pod jobs. We still need to guarantee that the Pod is accounted, so the finalizer is also needed.

I considered just adding the Pod to the uncounted lists at Pod creation. But that doesn't work, because we don't know if the Pod will be successful or not.

Member

We can estimate the impact on throughput assuming an x qps limit for the following two extremes: 1) for one pod per job and 2) many pods per job. The former is probably the more important one especially for short running jobs.

How many updates per job do we currently send for a single pod job? two (one for creating the job and one for updating the status when the job completes)?

Member Author

added more details and justification

Member Author

And @lavalamp gave me the idea of one alleviation: to skip updates sometimes. See latest commit.

Member

So for one pod per job we will see a ~50% drop in throughput? It used to be 1 query to create the pod and 1 to update the job status; now we add two more: another job status update and 1 to remove the finalizer?

@alculquicondor alculquicondor changed the title from "Add KEP for job tracking in provisional status" to "Add KEP for job tracking in implementable status" on Jan 28, 2021
@wojtek-t wojtek-t self-assigned this Jan 29, 2021
account for deleted Pods. This is a limitation that this KEP also wants to
solve.

However, if it's the Job controller what deletes the Pod (when parallelism
Contributor

Suggested change
However, if it's the Job controller what deletes the Pod (when parallelism
However, if the Job controller deletes the Pod (when parallelism

Member Author

Done

Contributor

@soltysh soltysh left a comment

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 29, 2021
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 29, 2021
@alculquicondor
Member Author

squashed


* **Will enabling / using this feature result in any new API calls?**

- PATCH Pods, to remove finalizers.
- estimated throughput: one per created Pod, when Pod finishes or is removed.
Member

one per Pod created by the Job controller, correct? Not for every Pod created.

Member Author

correct, added clarification.

- Pod
- Estimated increase: new finalizer of 33 bytes.
- Job status
- Estimated increase: new array temporarily containing terminated Pod UIDs.
Member

See discussion above. Would like to see estimates on how big these arrays can get in both normal operation and failure modes.

Member

I added comments below - I think we can control this (and we should) - and I agree I want to see a real number here.
But I think this is fine if we do that the right way.

Member Author

estimating ~10kB for about 500 UIDs

keps/prod-readiness/sig-apps/2307.yaml (resolved review comments)

- time: in each iteration, the job controller removes all the Pods' finalizers
it can in a unit of time in the order of minutes. This allows to throttle the
number of Job status updates.
- count: Preventing big writes to etcd.
Member

We may want to provide some estimate, e.g. no more than 500 UIDs can be stored in UncountedPods at any point in time.

Member Author

saying "order of hundreds"

// haven't been accounted in the counters.
// If nil, Job tracking doesn't make use of this structure.
// +optional
UncountedTerminatedPods *UncountedTerminatedPods
Member

@lavalamp @erictune - IIUC, in the past we were using a convention that status contains information that can be computed based on the cluster state.
This is more or less why things like "AcquireTime" or "RenewTime" are in LeaseSpec:
https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/api/coordination/v1/types.go#L40
[that was Brian's suggestion back then].

Did we give up on this rule? If not, then all Succeeded, Failed, and UncountedTerminatedPods technically no longer belong to status...

[I don't have strong opinion here - just want to flag this for discussion.]

@liggitt - for his API reviewer/approver opinion

Member

Actually this makes the Success / Finished numbers un-reproducible from the environment. UncountedTerminatedPods is (at least partially) reproducible! I think we are just giving up on this criteria to make this work. It's worth a discussion but I don't see other options. The reproducibility criterion is (IMO) mostly aesthetic (we've never made the system do anything that relied on it and it seems quite unlikely that we will in the future) and it's not worth it to keep it for this object.

Member

> Actually this makes the Success / Finished numbers un-reproducible from the environment. UncountedTerminatedPods is (at least partially) reproducible! I think we are just giving up on this criteria to make this work. It's worth a discussion but I don't see other options. The reproducibility criterion is (IMO) mostly aesthetic (we've never made the system do anything that relied on it and it seems quite unlikely that we will in the future) and it's not worth it to keep it for this object.

agreed.

Member

@erictune erictune Feb 1, 2021

AIUI, the intent of this API Convention was that status could be stored in some less reliable storage for performance. 5+ years on from this intention, we haven't made use of that possibility. And I really doubt that all clients are prepared for the event where status disappears and then is later reconstructed. I am thinking of e.g. kubectl waiting for a Ready condition, and then all the status disappears. Is the client code going to do the right thing? What is the right thing? We never designed for that. So we should remove that API convention.

Member

Thanks - for me it was also kind of aesthetic thing, but I wanted to ensure that there is a general agreement on that.
It seems there is, which is what I wanted to achieve.

Member

I think the intent is also that this rule forces requirements on the cluster introspection mechanism to be able to surface anything important enough to track in status. If we have the ability to reconstruct status, then we know resources are not leaked out; our reconciliation is fully closed, without drift. If we put something in status that we cannot discover from the cluster itself we no longer have that guarantee - status could diverge from reality.

That said, in the case of job tracking, there are no resources leaked or lost track of. We lose track of some job stats potentially, but not things that are consuming underlying resources. So, only issue is the slippery slope.


* **Are there any tests for feature enablement/disablement?**

Yes, there are integration tests.
Member

Are there? I'm assuming these are planned, but let's not mention they already exist....

Member Author

reworded

- originating component: kube-controller-manager
- PUT Job status, to keep track of uncounted Pods.
- estimated throughput: at least one per Job sync. Extra calls are throttled
at 1 per minute.
Member

1 per minute might be too aggressive

Member

will this throttling delay reflecting job completion in the API object? I didn't see tests explicitly for that.

Member Author

Added to test plan as integration.


Member

@wojtek-t wojtek-t left a comment

I'm fine with this modulo one small comment.

Please gather SIG approval and I will approve.


- originating component: kube-controller-manager
- PUT Job status, to keep track of uncounted Pods.
- estimated throughput: at least one per Job sync. The job controller
throttles more calls at 1 per a few minutes (precise throughput TBD from
Member

nit: per few seconds, not few minutes;

[or to be more specific I think what we should do is:

  • as soon as we gather N pods (to feed the 20kB limit you mentioned below)
  • but at least every X(=10?) seconds, to make it somewhat responsive

Member Author

Changed to seconds and added more detail in the Algorithm section.

Some reviewers requested that the KEP leaves room for experimentation before settling on specific numbers.

@alculquicondor
Member Author

@janetkuo this is ready for SIG approval

@alculquicondor alculquicondor force-pushed the job-tracking branch 2 times, most recently from 3462c95 to 43dd18f on February 2, 2021 21:21
Signed-off-by: Aldo Culquicondor <acondor@google.com>
@erictune
Member

erictune commented Feb 3, 2021

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 3, 2021
type UncountedTerminatedPods struct {
// Succeeded holds UIDs of succeeded Pods.
Succeeded []types.UID
// Succeeded holds UIDs of failed Pods.
Member

nit: Failed

Member Author

done

cycle.

1. The Job controller calculates the number of succeeded Pods as the sum of:
- `.status.succeeded`,
Member

So the behavior of .status.succeeded is slightly different now. Does this break API compatibility?

Contributor

@janetkuo I don't think we are changing this significantly; there are slight variances in how this will get calculated during execution, but the final result will be the same. Nobody should ever rely on how this number is being updated, nor on partial results. I'm not personally worried about this breaking API compatibility.

Member

SGTM

1. The Job controller calculates the number of succeeded Pods as the sum of:
- `.status.succeeded`,
- the size of `job.status.uncountedTerminatedPods.succeeded` and
- the number of finished Pods that are not in `job.status.uncountedTerminatedPods.succeeded`.
Member

Don't these overlap with the pods counted in .status.succeeded? Would you rephrase?

Member Author

Right, they should have a finalizer too. Fixed.
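
A sketch of the counting rule after that fix; the helper is illustrative, assumes the UncountedTerminatedPods type proposed in this KEP lands in k8s.io/api/batch/v1, and the real controller bookkeeping may differ. A finished Pod only adds to the total if it still carries the tracking finalizer and is not already in the uncounted set:

```go
// countSucceeded (illustrative) combines the three sources discussed above.
func countSucceeded(job *batchv1.Job, pods []*corev1.Pod, finalizer string) int32 {
	uncounted := job.Status.UncountedTerminatedPods
	if uncounted == nil {
		uncounted = &batchv1.UncountedTerminatedPods{}
	}
	inList := map[types.UID]bool{}
	for _, uid := range uncounted.Succeeded {
		inList[uid] = true
	}
	succeeded := job.Status.Succeeded + int32(len(uncounted.Succeeded))
	for _, p := range pods {
		// Pods without the finalizer were already counted in .status.succeeded;
		// Pods in the uncounted list were already added above.
		if p.Status.Phase == corev1.PodSucceeded && hasFinalizer(p, finalizer) && !inList[p.UID] {
			succeeded++
		}
	}
	return succeeded
}

func hasFinalizer(p *corev1.Pod, name string) bool {
	for _, f := range p.Finalizers {
		if f == name {
			return true
		}
	}
	return false
}
```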


However, if the Job controller deletes the Pod (when parallelism is decreased,
for example), the controller removes the finalizer before deleting it. Thus,
these deletions don't count towards the failures.
Member

From the description here, it'd apply to suspended Jobs (KEP 2232) as well, I assume?

Member Author

Yes, but it applies to any other future feature that might delete pods.
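
A minimal sketch of that ordering (names illustrative; imports as in the earlier sketches): remove the tracking finalizer first, then delete, so the Pod's disappearance never reaches the failed counters.

```go
// deleteWithoutCounting (illustrative): the controller strips its finalizer
// before issuing the delete, so this Pod is never treated as a failure.
func deleteWithoutCounting(ctx context.Context, c kubernetes.Interface, pod *corev1.Pod, finalizer string) error {
	var kept []string
	for _, f := range pod.Finalizers {
		if f != finalizer {
			kept = append(kept, f)
		}
	}
	pod.Finalizers = kept
	if _, err := c.CoreV1().Pods(pod.Namespace).Update(ctx, pod, metav1.UpdateOptions{}); err != nil {
		return err
	}
	return c.CoreV1().Pods(pod.Namespace).Delete(ctx, pod.Name, metav1.DeleteOptions{})
}
```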

during API review.

### Algorithm

Member

If the user deletes the Job, how are the finalizers of its Pods handled?

Member Author

Added a section.

#### Beta -> GA Graduation

- E2E test graduates to conformance.
- Job tracking scales to 10^5 completions per Job processed within an order of
Member

If we assume that "order of minutes" is 5 minutes - that would mean O(300) QPS to delete finalizers.

I don't think we should be that aggressive - it's much more than we support now.

Member Author

I was thinking more like 10min. Is that still too much? This would be assuming that this is the only Job though.

Member

IMO in the situation where you have a job that's that big, 300 QPS is perfectly reasonable-- that's probably a large fraction of what you're running on the cluster anyway. Compare it to the writes it took to actually get all those pods through their lifecycle. It's not adding much.

Member

If you're saying how it should be in an ideal world - sure, I believe even more should be possible.
If you're asking where we are now - I doubt we will be able to do that (I mean, we can handle 300 qps of course, but we can't blindly bump qps limits until we have P&F supporting things like LISTs, and also some rate-limiting that would rate-limit watches - whatever that means).

Also note that it highly depends on how long a particular pod takes to run - if those are very short-running, then it matters; if they are long-running, then it's not that big of a concern.

Member Author

This is still in the Beta -> GA graduation, so I think it should be fine to have this as tentative goal

Signed-off-by: Aldo Culquicondor <acondor@google.com>
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 4, 2021
@wojtek-t
Member

wojtek-t commented Feb 8, 2021

/approve PRR

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alculquicondor, erictune, janetkuo, soltysh, wojtek-t


@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 8, 2021
@wojtek-t
Member

wojtek-t commented Feb 8, 2021

Adding lgtm, as Janet's approval was the last thing in this PR.

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 8, 2021
@k8s-ci-robot k8s-ci-robot merged commit 5d98e09 into kubernetes:master Feb 8, 2021
@k8s-ci-robot k8s-ci-robot added this to the v1.21 milestone Feb 8, 2021