
KEP-2879: Add count of ready Pods in Job status #2880

Merged: 4 commits merged into kubernetes:master on Sep 6, 2021

Conversation

@alculquicondor (Member) commented Aug 19, 2021:

Ref #2879

Proposes the addition of the field `Job.status.ready`.

Includes the PRR (Production Readiness Review).
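
For context, a minimal sketch of what the proposed addition could look like in the Job API; the field comment and exact shape here are inferred from this thread, not copied from the merged API:

```go
package batchv1

// JobStatus sketch: only the proposed addition is shown; existing fields
// (Active, Succeeded, Failed, ...) are omitted.
type JobStatus struct {
	// Ready is the number of Pods created by the Job controller that have a
	// Ready condition; per the review below, a Pod without a readiness probe
	// would count as ready once it is running.
	// +optional
	Ready *int32 `json:"ready,omitempty"`
}
```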

@k8s-ci-robot added the cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA) and size/L (Denotes a PR that changes 100-499 lines, ignoring generated files) labels on Aug 19, 2021
@k8s-ci-robot added the kind/kep (Categorizes KEP tracking issues and PRs modifying the KEP directory) and sig/apps (Categorizes an issue or PR as relevant to SIG Apps) labels on Aug 19, 2021
@alculquicondor (Member Author):

/assign @soltysh @ehashman

/api-review

@alculquicondor (Member Author):

/label api-review

@k8s-ci-robot added the api-review (Categorizes an issue or PR as actively needing an API review) label on Aug 19, 2021
@alculquicondor force-pushed the job-running branch 2 times, most recently from 9bca8ec to afbd037, on August 19, 2021 19:48
@alculquicondor (Member Author):

@gaocegege I would appreciate your feedback

@gaocegege left a comment:

Thanks for the proposal. I think it is helpful.

Inline review context (quoted KEP text): "when the Pod doesn't define a readiness probe."

@gaocegege left a comment:

LGTM. Thanks for the enhancement.

@wojtek-t (Member):

/assign

@lavalamp (Member):

API lgtm, if you can avoid having the first/second version difference.

@alculquicondor force-pushed the job-running branch 2 times, most recently from 858e302 to a38669a, on August 30, 2021 15:02
@kikisdeliveryservice changed the title from "Add count of ready Pods in Job status" to "KEP-2879: Add count of ready Pods in Job status" on Aug 30, 2021
@soltysh (Contributor) left a comment:

Mostly nits.
/lgtm
/approve
from the sig-apps perspective

Quoted kep.yaml excerpt:

latest-milestone: "v1.23"

milestone:
  beta: "v1.23"
A Contributor replied:

1.24

@alculquicondor (Member Author) replied:

Oops. Fixed

@k8s-ci-robot added the lgtm ("Looks good to me", indicates that a PR is ready to be merged) label on Aug 31, 2021
@k8s-ci-robot removed the lgtm label on Aug 31, 2021
Two review threads on keps/sig-apps/2879-ready-pods-job-status/README.md were marked outdated and resolved.
Quoted KEP text:

- The job controller is updating other status fields.
- The number of ready Pods equals `Job.spec.parallelism`.
- The increase of ready Pods is greater than or equal to 10% of `Job.spec.parallelism`.
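
As a sketch, the quoted policy could be expressed as a predicate like the following; the helper name and the lastReported argument are hypothetical, introduced only for illustration:

```go
// shouldReportReady sketches the quoted policy: piggyback the ready count on
// other status updates, report when all requested Pods are ready, or report
// when the count grew by at least 10% of parallelism since the last report.
func shouldReportReady(updatingOtherFields bool, ready, lastReported, parallelism int32) bool {
	if updatingOtherFields || ready == parallelism {
		return true
	}
	return float64(ready-lastReported) >= 0.1*float64(parallelism)
}
```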
A Member commented:

Let's say that, because of cluster capacity, 100% of my job's Pods will never have a place to run, but 99% do. With the policies above, it may happen that we update ready to 90%, but we will never change it to 99%. So the system isn't eventually consistent, which I think is problematic.

I think you need another rule, i.e. when something changes, it will be applied within X seconds/minutes (i.e. we batch changes for such a period).

[FWIW, such logic is super simple to implement, e.g. https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/endpoint/endpoints_controller.go#L224]

@alculquicondor (Member Author) replied:

True, being eventually consistent should be a requirement.

Could the solution for the job controller be the same? We would delay/accumulate any sync coming from Pod creations, updates, or deletions. This might actually be good for the overall performance of the controller. The delay for endpoint slices is configurable; should we do the same? Otherwise, I'm proposing a 100ms window.
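
For illustration, the batching pattern linked above could look roughly like this in the job controller; the Controller type, field names, and the enqueue helper are hypothetical stand-ins modeled on the endpoints controller, not the actual implementation:

```go
package jobcontroller

import (
	"time"

	batchv1 "k8s.io/api/batch/v1"
	utilruntime "k8s.io/apimachinery/pkg/util/runtime"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/util/workqueue"
)

// Controller is a hypothetical stand-in for the job controller.
type Controller struct {
	queue workqueue.RateLimitingInterface
	// podUpdateBatchPeriod is the batching window discussed in this thread
	// (somewhere between 100ms and 1s).
	podUpdateBatchPeriod time.Duration
}

// enqueueJobAfterPodChange delays the sync triggered by a Pod
// creation/update/deletion. Because the delaying workqueue deduplicates
// keys, a burst of Pod events for the same Job within the batch period
// collapses into a single sync (and thus a single status update), while
// the last change is still eventually applied.
func (c *Controller) enqueueJobAfterPodChange(job *batchv1.Job) {
	key, err := cache.MetaNamespaceKeyFunc(job)
	if err != nil {
		utilruntime.HandleError(err)
		return
	}
	c.queue.AddAfter(key, c.podUpdateBatchPeriod)
}
```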


Quoted KEP text:

###### Are there any tests for feature enablement/disablement?

Yes, at unit and integration level.
A Member commented:

I'm assuming "they will be added", right?

@alculquicondor (Member Author) replied:

Yes, changed wording.


Quoted KEP text:

###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?

The 99th percentile of Job status updates below 1s, when the controller doesn't […]
A Member commented:

Are you talking about API calls or about the processing logic in the controller? It's not clear to me.

@alculquicondor (Member Author) replied:

It refers to the E2E latency of a sync. Reworded. Maybe it should be 2s, because the API call alone is 1s.
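
To make the reworded SLO concrete, here is a hypothetical way such sync latency could be measured; the metric name, buckets, and timedSync helper are invented for this sketch and are not actual job controller code:

```go
package jobcontroller

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// jobSyncDuration records the end-to-end latency of a single Job sync,
// including the status update API call. The 99th percentile derived from
// this histogram is what the SLO above would be checked against.
var jobSyncDuration = prometheus.NewHistogram(prometheus.HistogramOpts{
	Name:    "job_controller_sync_duration_seconds",
	Help:    "End-to-end latency of a single Job sync.",
	Buckets: prometheus.ExponentialBuckets(0.001, 2, 15), // 1ms up to ~16s
})

// timedSync wraps a sync function with latency measurement.
func timedSync(syncJob func() error) error {
	start := time.Now()
	defer func() { jobSyncDuration.Observe(time.Since(start).Seconds()) }()
	return syncJob()
}
```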

@alculquicondor force-pushed the job-running branch 3 times, most recently from aa9e6fe to f86af30, on September 1, 2021 18:37
Quoted KEP text:

### Risks and Mitigations

- An increase in Job status updates. To mitigate this, the job controller holds the Pod updates that happen in 100ms before syncing a Job.
A Member commented:

100ms is negligible imho.

I would use at least 1s as a batching period.

@alculquicondor (Member Author) replied on Sep 1, 2021:

How about I leave it open until I have the integration tests to do some experiments?

1s starts to sound a bit too long considering that we have to hold any Pod updates. See the updated KEP.

Quoted kep.yaml diff:

@@ -0,0 +1,3 @@
kep-number: 2879
alpha:
A Member commented:

beta

A Contributor replied:

beta - yes, but we need to make sure that this is properly linked with #2307, which this expands.

A Contributor replied:

make sure to add that ref in the kep.yaml, pls.

@alculquicondor (Member Author) replied:

I forgot to update the description: I'm no longer proposing to start at beta.

I don't see this KEP as an expansion of #2307.

A Member replied:

Yeah - I think this should be alpha.
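
Putting the resolution together, the corrected kep.yaml fragment would look something like this (a sketch: alpha in v1.23 follows this thread; the beta milestone shown is an assumption, not confirmed here):

```yaml
kep-number: 2879
latest-milestone: "v1.23"
milestone:
  alpha: "v1.23"
  beta: "v1.24"  # assumed follow-up milestone, not confirmed in this thread
```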


Quoted KEP text (updated):

- An increase in Job status updates. To mitigate this, the job controller holds the Pod updates that happen in X ms before syncing a Job. X will be determined from experiments on integration tests, but we expect it to be between 100ms […]
A Member commented:

I think a 100ms delay isn't meaningful, as it's way below the SLO for an API call.

I personally recommend not even considering anything lower than 0.5s.
TBH, in this particular case I would say that even a couple of seconds might be fine, but I can imagine counterarguments too, so let's maybe stick to a 500ms-1s interval for now.

WDYT?

@alculquicondor (Member Author) replied:

Sounds good. Updated to 500ms-1s.

@wojtek-t (Member) commented Sep 6, 2021:

/lgtm
/approve

@k8s-ci-robot added the lgtm ("Looks good to me", indicates that a PR is ready to be merged) label on Sep 6, 2021
@k8s-ci-robot (Contributor):

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alculquicondor, soltysh, wojtek-t

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the approved (Indicates a PR has been approved by an approver from all required OWNERS files) label on Sep 6, 2021
@k8s-ci-robot merged commit 61f92a6 into kubernetes:master on Sep 6, 2021
@k8s-ci-robot added this to the v1.23 milestone on Sep 6, 2021

Quoted KEP text:

## Implementation History

- 2021-08-19: Proposed KEP starting in beta status.
A Member commented:

shouldn't this be marked as starting at Alpha status? @alculquicondor

@alculquicondor (Member Author) replied:

It should. I'll fix it in the next update.

Labels:
- api-review: Categorizes an issue or PR as actively needing an API review.
- approved: Indicates a PR has been approved by an approver from all required OWNERS files.
- cncf-cla: yes: Indicates the PR's author has signed the CNCF CLA.
- kind/kep: Categorizes KEP tracking issues and PRs modifying the KEP directory.
- lgtm: "Looks good to me", indicates that a PR is ready to be merged.
- sig/apps: Categorizes an issue or PR as relevant to SIG Apps.
- size/L: Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
Status: API review completed, 1.23