KEP update: Allow replacement pods in groups of pods #1338

Merged · 7 commits · Nov 29, 2023
91 changes: 58 additions & 33 deletions keps/976-plain-pods/README.md
@@ -24,6 +24,8 @@
- [Groups of Pods created beforehand](#groups-of-pods-created-beforehand)
- [Groups of pods where driver generates workers](#groups-of-pods-where-driver-generates-workers)
- [Tracking admitted and finished Pods](#tracking-admitted-and-finished-pods)
- [Dynamically reclaiming Quota](#dynamically-reclaiming-quota)
- [Retrying Failed Pods](#retrying-failed-pods)
- [Metrics](#metrics)
- [Test Plan](#test-plan)
- [Prerequisite testing updates](#prerequisite-testing-updates)
@@ -66,14 +68,11 @@ use of this API to implement queuing semantics for Pods.
- Support queueing of individual Pods.
- Support queueing of groups of Pods of fixed size, identified by a common label.
- Opt-in or opt-out Pods from specific namespaces from queueing.
- Support for [dynamic reclaiming of quota](https://kueue.sigs.k8s.io/docs/concepts/workload/#dynamic-reclaim)
for succeeded Pods.

### Non-Goals

- Support for [dynamic reclaiming quota](https://github.com/kubernetes-sigs/kueue/issues/78)

This feature is incompatible with supporting Pod replacements without knowing the behavior of a
parent controller for the Pods.

- Support for [partial-admission](https://github.com/kubernetes-sigs/kueue/issues/420).

Since all pods are already created, an implementation of partial admission would imply the
@@ -372,12 +371,12 @@ matchExpressions:
Once the webhook has marked Pods subject to queuing with the `kueue.x-k8s.io/managed: true` label,
the Pod reconciler can create the corresponding Workload object to feed the Kueue admission logic.
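
For illustration, this is a minimal sketch of what a queued Pod could look like after the
webhook has mutated it. The label and scheduling-gate names (`kueue.x-k8s.io/managed`,
`kueue.x-k8s.io/admission`) are the ones used in this KEP; the Pod name, image and resources
are placeholders:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: sample-group-0                # placeholder name
  labels:
    kueue.x-k8s.io/managed: "true"    # added by the webhook
spec:
  schedulingGates:
  - name: kueue.x-k8s.io/admission    # keeps the Pod pending until Kueue admits the Workload
  containers:
  - name: main
    image: registry.k8s.io/pause:3.9  # placeholder image
    resources:
      requests:
        cpu: "1"
```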

Note that the Workload cannot be owned by the Pod. Otherwise any cascade deletion of the Pod would
delete the Workload object, even before the Pod terminates (if it has a grace period).
This means that the controller needs to manually delete the Workload object once it has the
Finished condition (after we have determined that all pods have finished).
The Workload will be owned by all Pods. Once all the Pods that own the workload
are deleted (and their finalizers are removed), the Workload will be
automatically cleaned up.

A possible extension here is to add a TTL, which we can consider based on user feedback.
If individual Pods in the group fail and a replacement Pod comes in,
the replacement Pod will be added as an owner of the Workload as well.
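
As a sketch of the ownership model described above (not prescribed by the KEP in this exact
form), a Workload for a two-Pod group would carry one owner reference per Pod, and a
replacement Pod is simply appended to the list. All names and UIDs below are made up:

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: Workload
metadata:
  name: pod-group-sample                         # hypothetical name
  ownerReferences:                               # one entry per Pod in the group
  - apiVersion: v1
    kind: Pod
    name: sample-group-0
    uid: 0a1b2c3d-0000-0000-0000-000000000000    # placeholder UID
  - apiVersion: v1
    kind: Pod
    name: sample-group-1-retry                   # replacement for a failed Pod, added as an owner later
    uid: 4e5f6a7b-0000-0000-0000-000000000000    # placeholder UID
```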

#### Single Pods

@@ -538,39 +537,62 @@ spec:
Pods need to have finalizers so that we can reliably track how many of them run to completion and be
able to determine when the Workload is Finished.

A Pod-group reconciler would keep track of the groups of Pods and their respective Workload object,
based on the `jobframework.Reconciler`.
After a Workload is admitted, as the Pod-group reconciler ungates pods, it would keep an in-memory
cache of expected ungated pods: the number of ungated pods that are not reflected in the informers
yet, per Pod-group. This number decreases as the event handler observes Pods transition from being
gated to ungated.
The Pod reconciler will run in a "composable" mode: a mode where a Workload is
composed of multiple objects.
The `jobframework.Reconciler` will be reworked to accommodate this.

In the Pod event handler, we decrement the counter when we see a transition from having
the scheduling gate `kueue.x-k8s.io/admission` to not having it.
After a Workload is admitted, each Pod that owns the workload enters the
reconciliation loop.
The reconciliation loop collects all the Pods that are not Failed and constructs
an in-memory Workload. If there is an existing Workload in the cache and it
has smaller Pod counts than the in-memory Workload, the stored Workload is
considered to no longer match and it is evicted.
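
For example (a sketch with an illustrative pod set name), the stored Workload would be
evicted when the Pods currently in the group no longer fit its admitted counts:

```yaml
# Stored (admitted) Workload, spec excerpt:
spec:
  podSets:
  - name: main    # illustrative pod set name
    count: 3
---
# In-memory Workload built from the current non-Failed Pods, spec excerpt:
spec:
  podSets:
  - name: main
    count: 4      # larger than the stored count, so the stored Workload is evicted
```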

In the Pod-group reconciler:
1. If the Pod is not terminated,
create a Workload for the pod group if one does not exist.
2. If the Pod is terminated,
- If the Workload doesn't exist or the workload is finished, remove the finalizer.
3. ungated_pods_in_client: the number of non-terminated pods in the client that are admitted.
We only look at non-terminated pods to allow for terminated pods to be replaced.
4. ungated_pods = ungated_pods_in_client + expected_ungated_pods. Note that this might temporarily
lead to double counting.
2. For gated pods:
- If ungated_pods < admission.count, remove the gate, set nodeSelector, and increase
expected_ungated_pods
- Else,
- If ungated_pods_in_informer < admission.count, we can't admit this Pod now to prevent
overbooking, but requeue this Pod for retry.
- Else, remove finalizer and delete the Pod, as it's beyond the allowed admission.
2. If the Pod is terminated and (the Workload doesn't exist or the workload is finished),
then remove the finalizer.
3. Build the in-memory Workload. If its podset counters are greater than those of the
stored Workload, then evict the stored Workload.
4. For gated pods:
- remove the gate, set nodeSelector
5. If the number of terminated pods with a finalizer is greater than or equal to the admission
count, and there are no non-terminated Pods, mark the Workload as Finished and remove the
finalizers from the Pods.
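
To make the final step above concrete, this is a sketch of the condition the reconciler could
add to the Workload status once the group completes; the `Finished` condition type is the one
referenced throughout this KEP, while the reason, message and timestamp are illustrative:

```yaml
status:
  conditions:
  - type: Finished
    status: "True"
    reason: PodsCompleted                       # illustrative reason
    message: "All Pods in the group finished"   # illustrative message
    lastTransitionTime: "2023-11-20T10:00:00Z"  # illustrative timestamp
```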

Note that we are only removing Pod finalizers once the Workload is finished. This is a simple way
of managing finalizers, but it might lead to too many Pods lingering in etcd for a long time after
terminated. In a future version, we can consider a better scheme similar to [Pod tracking in Jobs](https://kubernetes.io/blog/2022/12/29/scalable-job-tracking-ga/).
terminated.
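
For illustration, a Succeeded Pod held back by its finalizer would linger roughly like this
until the Workload is Finished; the finalizer name is an assumption for this sketch, not
something this KEP fixes:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: sample-group-0
  finalizers:
  - kueue.x-k8s.io/managed   # assumed finalizer name; keeps the terminated Pod around for tracking
status:
  phase: Succeeded
```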


### Dynamically reclaiming Quota

Succeeded Pods will not be considered replaceable. In other words, the quota
from Succeeded Pods will be released by filling [reclaimablePods](https://kueue.sigs.k8s.io/docs/concepts/workload/#dynamic-reclaim)
in the Workload status.
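
A sketch of the resulting Workload status for a group of 3 Pods after one of them Succeeded;
the field layout follows the documented Workload API, while the pod set name is illustrative:

```yaml
status:
  reclaimablePods:
  - name: main    # illustrative pod set name
    count: 1      # quota for one Succeeded Pod can be reclaimed while the others keep running
```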

### Retrying Failed Pods

The Pod group will generally only be considered finished if all the Pods finish
with a Succeeded phase.
This allows the user to send replacement Pods when a Pod in the group fails.
The Pods need to have a different name, because the Failed ones will stay
in the apiserver with a finalizer.

However, this implies that a group can only succeed. In other words, there is
no way for the user to declare the group failed and stop retrying.

This can be solved by adding an annotation to any Pod in the group:
`kueue.x-k8s.io/last-in-group: true`.
The annotation can be added to a Pod that was already terminated, or it
can be added when the user creates the Pod.
When added on creation, this Pod will run, but it will be considered the last
attempt.

As a result, Kueue can consider a group finished if there are no running Pods,
and at least one terminated Pod (Failed or Succeeded) has the `last-in-group`
annotation.

Review discussion on this section:

**@ahaysx** (Nov 18, 2023):
> What about pod groups that do not replace failed pods? We have this use case where a user does
> not allow retries (and I imagine it is a common batch case). So a pod group would be finished
> if all pods have exited, regardless of success or failure.
>
> I think one way to do this is for every pod in the group to have the
> `kueue.x-k8s.io/last-in-group: true` annotation, if "group finished" does not mean the Workload
> is cleaned up or other running pods in the group are affected.
>
> This seems weird given the name, but would this be the recommendation? Is it worth some kind of
> alternative annotation configuring this, like `pod-group-mode: Batch` or `pod-group-retry: false`?

**Contributor Author**:
> Just having one pod with `last-in-group` has the semantics you want: the pod group will be
> considered Finished when there are no more Running Pods and there is at least one pod with the
> annotation.
>
> But I agree that the name is maybe not the best for this use case. I'm thinking of alternatives.
>
> Definitely `pod-group-mode: Batch` is not accurate, as batch doesn't imply that retries are not
> possible.
>
> `pod-group-retry` doesn't fit the semantics I originally wanted that well.

**Contributor Author**:
> I decided to go with `retriable-in-group: false`, similar to your proposal.
> But I also added another mechanism for termination: just delete the Workload.
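
Tying this together, a sketch of a replacement Pod that the user marks as the final attempt.
The annotation name follows the text of this section; per the review discussion above it was
later renamed to `kueue.x-k8s.io/retriable-in-group: "false"`. The Pod name and image are
illustrative:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: sample-group-1-retry              # replacement Pod; must not reuse the failed Pod's name
  labels:
    kueue.x-k8s.io/managed: "true"
  annotations:
    kueue.x-k8s.io/last-in-group: "true"  # no further replacements will be sent for this group
spec:
  schedulingGates:
  - name: kueue.x-k8s.io/admission
  containers:
  - name: main
    image: registry.k8s.io/pause:3.9      # placeholder image
```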

### Metrics

@@ -621,7 +643,8 @@ The integration tests should cover the following scenarios:
- Multiple Pods with one shape:
- queued and admitted
- failed pods recreated can use the same quota
- Workload finished when all pods finish (failed or succeeded)
- Workload finished when all pods finish successfully
- Workload finished when a Pod with `last-in-group` annotation finishes.
- Driver Pod creates workers:
- queued and admitted.
- worker pods beyond the count are rejected (deleted)
@@ -653,6 +676,8 @@ Major milestones might include:
- when the KEP was retired or superseded
-->

- Sep 29th: Implemented single Pod support (story 1) [#1103](https://github.com/kubernetes-sigs/kueue/pulls/1103).

## Drawbacks

The proposed labels and annotations for groups of pods can be complex to build manually.