
Rechecking pending Pods (conflict resolved) #375

Merged

Conversation

nicklesimba
Collaborator

This fix resolves an issue where, after a forceful node reboot, force-deleting a pod in a StatefulSet causes the pod to be recreated and remain indefinitely in the Pending state.

This is a rebase of #195

Collaborator

@maiqueb maiqueb left a comment


Maybe we should re-write the reconciler (which is cron-triggered) to use informers instead of this retry when the pod is pending.

We would be able to re-queue that pod liveness check to a later date.

Looking at the code, I'm especially worried about the blocking wait until the pod is no longer in pending state.

@maiqueb
Collaborator

maiqueb commented Aug 10, 2023

Maybe we should re-write the reconciler (which is cron-triggered) to use informers instead of this retry when the pod is pending.

We would be able to re-queue that pod liveness check to a later date.

Looking at the code, I'm especially worried about the blocking wait until the pod is no longer in pending state.

EDIT: another thing that could be helpful is to use pagination when listing the resources. Of course this would only make sense if we manage to correlate the number of the returned API results to the OOM kills.
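The pagination idea maps onto client-go's chunked listing via `ListOptions.Limit` and `Continue`. A minimal sketch, with an arbitrary chunk size; the accumulation into a single slice is only to keep the example short (in practice each chunk would be processed and then dropped):

```go
package pending

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// listPodsPaginated retrieves pods in chunks of `limit`, so a single List
// call never asks the API server for the entire result set at once.
func listPodsPaginated(ctx context.Context, clientset kubernetes.Interface, namespace string, limit int64) ([]corev1.Pod, error) {
	var pods []corev1.Pod
	opts := metav1.ListOptions{Limit: limit}
	for {
		list, err := clientset.CoreV1().Pods(namespace).List(ctx, opts)
		if err != nil {
			return nil, err
		}
		pods = append(pods, list.Items...)
		if list.Continue == "" {
			return pods, nil
		}
		// Resume the listing from where the server left off.
		opts.Continue = list.Continue
	}
}
```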

@nicklesimba
Collaborator Author

> Looking at the code, I'm especially worried about the blocking wait until the pod is no longer in pending state.

@maiqueb can you point out where a blocking wait is happening? AFAIU, there's only a 500ms wait at a time, and at most three retries, totalling 1.5s of waiting per pod. To me this seems very reasonable even if the issue were to crop up in several pods.
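For reference, the bounded wait being discussed boils down to roughly the pattern below. This is a paraphrase rather than the actual PR code; `waitForPodScheduled` and the `getPod` callback are illustrative stand-ins.

```go
package pending

import (
	"time"

	corev1 "k8s.io/api/core/v1"
)

const (
	maxRetries = 3
	retryDelay = 500 * time.Millisecond
)

// waitForPodScheduled polls a pod's phase up to maxRetries times, sleeping
// retryDelay between attempts, so the worst case is 1.5s of blocked
// reconcile time per pod.
func waitForPodScheduled(getPod func() (*corev1.Pod, error)) (*corev1.Pod, error) {
	var pod *corev1.Pod
	var err error
	for i := 0; i < maxRetries; i++ {
		pod, err = getPod()
		if err != nil {
			return nil, err
		}
		if pod.Status.Phase != corev1.PodPending {
			return pod, nil
		}
		time.Sleep(retryDelay) // the blocking wait under discussion
	}
	return pod, nil
}
```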

@nicklesimba
Collaborator Author

Fixed a unit test (a test that was previously expected to fail should now pass) with the latest force push. It was too small a change to keep as its own commit, so I squashed it down.

@maiqueb
Collaborator

maiqueb commented Aug 17, 2023

> > Looking at the code, I'm especially worried about the blocking wait until the pod is no longer in pending state.
>
> @maiqueb can you point out where a blocking wait is happening? AFAIU, there's only a 500ms wait at a time, and at most three retries, totalling 1.5s of waiting per pod. To me this seems very reasonable even if the issue were to crop up in several pods.

here: https://github.com/k8snetworkplumbingwg/whereabouts/pull/375/files#diff-8a16f9c8d1f2a9d01692f7cf9b2ee6a6ceceef840a9daaf4ee7e7e173aaf7ebfR133

It's a sleep: we block the thread for that duration.

What I'm trying to say is that we should re-queue the request and check, in the next iteration, whether the pod we read is still in the pending state.

Is there something preventing this approach?
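A minimal sketch of the re-queue approach being suggested, using a shared pod informer plus a rate-limited workqueue whose `AddAfter` defers the next check instead of sleeping in the reconcile thread. The wiring below is illustrative and not taken from the whereabouts code:

```go
package pending

import (
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/util/workqueue"
)

// wirePodQueue hooks a shared pod informer up to a rate-limited workqueue.
// Pods that are still Pending are re-queued with AddAfter, so nothing
// sleeps inside the reconcile thread. The caller is expected to start the
// factory and run workers that drain the queue.
func wirePodQueue(factory informers.SharedInformerFactory) workqueue.RateLimitingInterface {
	queue := workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter())
	podInformer := factory.Core().V1().Pods().Informer()

	podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		UpdateFunc: func(_, newObj interface{}) {
			pod, ok := newObj.(*corev1.Pod)
			if !ok {
				return
			}
			key, err := cache.MetaNamespaceKeyFunc(pod)
			if err != nil {
				return
			}
			if pod.Status.Phase == corev1.PodPending {
				// Revisit this pod later instead of blocking on a sleep.
				queue.AddAfter(key, 500*time.Millisecond)
				return
			}
			queue.Add(key)
		},
	})
	return queue
}
```

A worker would then pop keys off the queue on each iteration and re-check only those pods, rather than blocking the whole reconcile loop.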

@nicklesimba
Collaborator Author

To summarize some discussion: we have decided to proceed with merging this for now, and to track Miguel's suggested implementation as a separate task. The downside of the current implementation is that having a lot of pending pods at the same time will cause the reconcile cycle to take a long time. However, the current implementation still solves pods stuck in pending state, and is overall better than not having a fix.

To do things the "proper" way, we will need to keep a list of the pending pods in the reconcile looper struct, and retry for them. This would also need to be integrated with the ip-control-loop to sync retries.
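A rough sketch of what "keep a list of the pending pods in the reconcile looper struct" could look like; `ReconcileLooper` here is a stand-in name, not the actual whereabouts type:

```go
package pending

import (
	"sync"

	"k8s.io/apimachinery/pkg/types"
)

// ReconcileLooper (illustrative) remembers which pods were still Pending in
// this cycle so the next reconcile cycle, or the ip-control-loop, can retry
// them instead of sleeping inside the current cycle.
type ReconcileLooper struct {
	mu          sync.Mutex
	pendingPods map[types.NamespacedName]struct{}
}

// RememberPending records a pod that was Pending during this cycle.
func (l *ReconcileLooper) RememberPending(key types.NamespacedName) {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.pendingPods == nil {
		l.pendingPods = map[types.NamespacedName]struct{}{}
	}
	l.pendingPods[key] = struct{}{}
}

// TakePending returns the pods to re-check and clears the list.
func (l *ReconcileLooper) TakePending() []types.NamespacedName {
	l.mu.Lock()
	defer l.mu.Unlock()
	keys := make([]types.NamespacedName, 0, len(l.pendingPods))
	for k := range l.pendingPods {
		keys = append(keys, k)
	}
	l.pendingPods = nil
	return keys
}
```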

Member

@dougbtv dougbtv left a comment


Thanks for working through this to reach consensus, and I'm glad we'll follow up on the blocking issues Miguel mentioned.

@nicklesimba nicklesimba merged commit 16baf31 into k8snetworkplumbingwg:master Aug 31, 2023
10 checks passed