fix(backend): stop heartbeat status updates for ScheduledWorkflows. Fixes #8757 #11363

demarna1 · 2024-11-07T21:35:35Z

Goal

Fix high ETCD usage of Kubeflow ScheduledWorkflows. Closes #8757

Context

Every time the ScheduledWorkflow (SWF) controller syncs an SWF resource, it sets the Last Heartbeat Time (LastProbeTime) and Last Transition Time (LastTransitionTime) in the status block to the current time.

Status:
  Conditions:
    Last Heartbeat Time:   2024-11-07T11:16:33Z
    Last Transition Time:  2024-11-07T11:16:33Z
    Message:               The schedule is disabled.
    Reason:                Disabled
    Status:                True
    Type:                  Disabled

These heartbeat updates result in an infinite reconciliation loop:

1. The SWF is added to the controller work queue.
2. The controller processes the SWF and updates the status's LastProbeTime and LastTransitionTime to the current time.
3. The object is rewritten to ETCD and its resourceVersion is updated.
4. The shared informer detects that the resourceVersion has changed.
5. The controller event handler re-adds the SWF to the work queue.

This reconciliation loop occurs every 10 seconds for every SWF resource on the cluster. The interval is 10s rather than 1s because the controller has a default queue backoff of 10s, so events are always queued for a minimum of 10s.
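
To make the loop concrete, here is a minimal Python sketch of it. It is a toy model, not the controller's actual code: `FakeStore`, `simulate`, and the status keys are hypothetical names chosen for illustration, and the store is assumed to bump resourceVersion only when the written status differs from the stored one.

```python
import heapq

BACKOFF_SECONDS = 10  # queue delay stated in the PR description


class FakeStore:
    """Toy stand-in for the API server and ETCD (an assumption, not real client code)."""

    def __init__(self):
        self.status = {}
        self.resource_version = 0
        self.writes = 0

    def update_status(self, status):
        if status == self.status:
            return False  # unchanged content: nothing is written
        self.status = status
        self.resource_version += 1
        self.writes += 1
        return True


def simulate(duration_seconds):
    """Run the loop for a simulated time window and return the number of writes."""
    store = FakeStore()
    queue = [(0, "swf")]  # (time at which the item becomes ready, key)
    while queue:
        now, key = heapq.heappop(queue)
        if now >= duration_seconds:
            break
        # Controller sync: stamp the condition with the current time.
        status = {"lastHeartbeatTime": now, "lastTransitionTime": now}
        if store.update_status(status):
            # The informer sees the new resourceVersion and the handler
            # re-queues the SWF, which becomes ready after the backoff.
            heapq.heappush(queue, (now + BACKOFF_SECONDS, key))
    return store.writes


writes_per_minute = simulate(60)
print(writes_per_minute)  # 6
```

Because the timestamp differs on every sync, every sync is a write, and every write schedules the next sync. In this model that yields six writes per simulated minute for a single SWF.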

Description of the fix

The LastProbeTime and LastTransitionTime fields in the ScheduledWorkflow Status are unused by Kubeflow so it is safe to set these fields to 0 for now in order to fix the ETCD performance issues (which for us have resulted in ETCD outages). With these fields held constant, reconciling the object no longer changes it, so the writes to ETCD stop. The schedules continue to function as before, and verbose logging is significantly reduced in several pods. A long-term plan for these fields should be determined (it may be best to remove them from the CRD entirely).
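
The effect of holding the timestamps constant can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the patch itself: `build_status` and `count_writes` are hypothetical helpers, the status keys mirror the condition shown in the Context section, and a write is assumed to happen only when the serialized status differs from what is already stored.

```python
import json


def build_status(now, freeze_timestamps):
    """Build a condition like the one in the PR description (illustrative keys)."""
    stamp = 0 if freeze_timestamps else now
    return {
        "conditions": [{
            "type": "Disabled",
            "status": "True",
            "reason": "Disabled",
            "message": "The schedule is disabled.",
            "lastHeartbeatTime": stamp,
            "lastTransitionTime": stamp,
        }]
    }


def count_writes(sync_times, freeze_timestamps):
    """Count the syncs whose serialized status differs from the stored copy."""
    stored = None
    writes = 0
    for now in sync_times:
        encoded = json.dumps(build_status(now, freeze_timestamps), sort_keys=True)
        if encoded != stored:
            stored = encoded
            writes += 1
    return writes


sync_times = [0, 10, 20, 30, 40, 50]  # one minute of syncs at the 10s backoff
before = count_writes(sync_times, freeze_timestamps=False)
after = count_writes(sync_times, freeze_timestamps=True)
print(before, after)  # 6 1
```

With changing timestamps every sync produces a different status, so all six syncs write. With the timestamps fixed at 0, only the first sync writes and the rest are no-ops, which is the behavior the fix relies on.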

ETCD performance before & after

I measured ETCD bytes written for all resources on our cluster over a 10 minute time span. Once this fix was instituted, we saw a dramatic decrease in ETCD usage (see chart below).

The chart roughly agrees with the back-of-the-napkin math:

The average size of our SWF objects is 270 KB.
The controller rewrites the object every 10 seconds (6x/min).
Bytes written to ETCD per minute = 270 KB x 6/min = 1.6 MB/min per SWF.
Our cluster had 54 SWFs at the time of the analysis.
ETCD write throughput is 54 x 1.6 MB/min = 86 MB/min = 430 MB every 5 min.
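
The same estimate can be reproduced with a short Python calculation. The inputs are the figures quoted above; the variable names and the decimal convention (1 MB = 1000 KB) are assumptions made for this sketch.

```python
SWF_SIZE_KB = 270        # average SWF object size from the PR description
WRITES_PER_MINUTE = 6    # one rewrite every 10 seconds
SWF_COUNT = 54           # SWFs on the cluster at the time of the analysis
KB_PER_MB = 1000         # assumption: decimal units

per_swf_mb_per_min = SWF_SIZE_KB * WRITES_PER_MINUTE / KB_PER_MB
cluster_mb_per_min = per_swf_mb_per_min * SWF_COUNT
cluster_mb_per_5_min = cluster_mb_per_min * 5

print(f"{per_swf_mb_per_min:.2f} MB/min per SWF")
print(f"{cluster_mb_per_min:.1f} MB/min for the cluster")
print(f"{cluster_mb_per_5_min:.1f} MB every 5 min")
```

Carried through without intermediate rounding, the result is about 87 MB/min and 437 MB every 5 minutes. That is slightly above the 86 MB/min and 430 MB figures, which round the per-SWF rate to 1.6 MB/min before multiplying.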

Checklist:

You have signed off your commits
The title for your pull request (PR) should follow our title convention. Learn more about the pull request title convention used in this repository.

google-oss-prow · 2024-11-07T21:35:47Z

Hi @demarna1. Thanks for your PR.

I'm waiting for a kubeflow member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

droctothorpe · 2024-11-08T01:45:59Z

/ok-to-test

hbelmiro

> The LastProbeTime and LastTransitionTime fields in the ScheduledWorkflow Status are unused by Kubeflow so it is safe to set these fields to 0 (...)
> A long-term plan for these fields should be determined (it may be best to remove them from the CRD entirely).

Any reason for not removing them right now?

hbelmiro · 2024-11-08T12:16:35Z

Also @demarna1, can you please link the PR to the issue?

demarna1 · 2024-11-08T14:50:29Z

@hbelmiro linked the PR to the issue.

> > The LastProbeTime and LastTransitionTime fields in the ScheduledWorkflow Status are unused by Kubeflow so it is safe to set these fields to 0 (...)
> > A long-term plan for these fields should be determined (it may be best to remove them from the CRD entirely).
>
> Any reason for not removing them right now?

My first priority is addressing the ETCD performance issue and I didn't want a CRD change to delay it. But I see no reason we can't remove them and I'd be happy to do that in a follow-on PR!

droctothorpe · 2024-11-08T16:57:45Z

@hbelmiro do you happen to know why the ok-to-test label is no longer triggering the workflows / CI? It used to be sufficient as recently as a few weeks ago.

hbelmiro · 2024-11-08T17:50:00Z

> @hbelmiro do you happen to know why the ok-to-test label is no longer triggering the workflows / CI? It used to be sufficient as recently as a few weeks ago.

@droctothorpe I don't know :(
It seems like something has changed in the repo's permissions.
The following used to work for first-time contributors.

/rerun-all
/ok-to-test

droctothorpe · 2024-11-08T18:14:56Z

Thanks, @hbelmiro! @HumairAK @zijianjoy do you happen to know if this change was intentional? It's out of sync with this documentation about membership privileges.

droctothorpe · 2024-11-19T01:49:05Z

Bump.

demarna1 · 2024-11-25T16:40:26Z

@HumairAK can you re-run CI?

hbelmiro · 2024-11-26T11:53:11Z

@demarna1 can you please check the failing tests?

demarna1 · 2024-11-26T17:55:33Z

@hbelmiro I checked, but the failure doesn't appear to be related to my change. It looks like a timeout of some sort. Can we try re-running?

HumairAK · 2024-11-26T18:52:32Z

> ETCD write throughput is 54*1.6mb/min = 86mb/min = 430MB every 5 min.

Oh. My. God. 🤦🏾

Awesome work, folks! I agree we should either drop these fields or only update them when actual non-status-related updates occur. Can we get a follow-up issue?

/lgtm
/approve

google-oss-prow · 2024-11-26T18:53:22Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: HumairAK

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~backend/OWNERS~~ [HumairAK]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

fix(backend): stop heartbeat status updates in ScheduledWorkflow controller. Fixes kubeflow#8757

23fcac3

Signed-off-by: demarna1 <noah.demarco@gmail.com>

google-oss-prow bot requested review from hbelmiro and rimolive November 7, 2024 21:35

google-oss-prow bot added the size/XS label Nov 7, 2024

google-oss-prow bot added needs-ok-to-test ok-to-test and removed needs-ok-to-test labels Nov 7, 2024

hbelmiro reviewed Nov 8, 2024

demarna1 added 2 commits November 24, 2024 12:33

Update scheduled_workflow_test.go

c19e8f4

Signed-off-by: demarna1 <noah.demarco@gmail.com>

Merge branch 'kubeflow:master' into stop-heartbeat

cb01fd1

github-actions bot added the ci-passed All CI tests on a pull request have passed label Nov 26, 2024

google-oss-prow bot assigned HumairAK Nov 26, 2024

google-oss-prow bot added the lgtm label Nov 26, 2024

HumairAK added this to the KFP 2.4.0 milestone Nov 26, 2024

google-oss-prow bot added the approved label Nov 26, 2024

google-oss-prow bot merged commit 9ccec4c into kubeflow:master Nov 26, 2024
16 checks passed

demarna1 deleted the stop-heartbeat branch November 26, 2024 18:58
