
Rechecking pending Pods (conflict resolved) #375

Merged

Conversation

nicklesimba
Collaborator

This fix resolves an issue where, after a forceful node reboot, force-deleting a pod in a StatefulSet causes the pod to be recreated and remain indefinitely in the Pending state.

This is a rebase of #195

Collaborator

@maiqueb maiqueb left a comment


Maybe we should re-write the reconciler (which is cron-triggered) to use informers instead of this retry when the pod is pending.

We would be able to re-queue that pod liveness check to a later date.

Looking at the code, I'm especially worried about the blocking wait until the pod is no longer in pending state.

@maiqueb
Collaborator

maiqueb commented Aug 10, 2023

Maybe we should re-write the reconciler (which is cron-triggered) to use informers instead of this retry when the pod is pending.

We would be able to re-queue that pod liveness check to a later date.

Looking at the code, I'm especially worried about the blocking wait until the pod is no longer in pending state.

EDIT: another thing that could be helpful is to use pagination when listing the resources. Of course this would only make sense if we manage to correlate the number of the returned API results to the OOM kills.
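The pagination idea maps onto client-go's chunked listing via `ListOptions.Limit` and `Continue`. A minimal sketch, with an arbitrary chunk size; the accumulation into a single slice is only to keep the example short (in practice each chunk would be processed and then dropped):

```go
package pending

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// listPodsPaginated retrieves pods in chunks of `limit`, so a single List
// call never asks the API server for the entire result set at once.
func listPodsPaginated(ctx context.Context, clientset kubernetes.Interface, namespace string, limit int64) ([]corev1.Pod, error) {
	var pods []corev1.Pod
	opts := metav1.ListOptions{Limit: limit}
	for {
		list, err := clientset.CoreV1().Pods(namespace).List(ctx, opts)
		if err != nil {
			return nil, err
		}
		pods = append(pods, list.Items...)
		if list.Continue == "" {
			return pods, nil
		}
		// Resume the listing from where the server left off.
		opts.Continue = list.Continue
	}
}
```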

@nicklesimba
Collaborator Author

> Looking at the code, I'm especially worried about the blocking wait until the pod is no longer in pending state.

@maiqueb can you point out where a blocking wait is happening? AFAIU, there's only a 500ms wait at a time, and at most three retries, totalling 1.5s of waiting per pod. To me this seems very reasonable even if the issue were to crop up in several pods.
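For reference, the bounded wait being discussed boils down to roughly the pattern below. This is a paraphrase rather than the actual PR code; `waitForPodScheduled` and the `getPod` callback are illustrative stand-ins.

```go
package pending

import (
	"time"

	corev1 "k8s.io/api/core/v1"
)

const (
	maxRetries = 3
	retryDelay = 500 * time.Millisecond
)

// waitForPodScheduled polls a pod's phase up to maxRetries times, sleeping
// retryDelay between attempts, so the worst case is 1.5s of blocked
// reconcile time per pod.
func waitForPodScheduled(getPod func() (*corev1.Pod, error)) (*corev1.Pod, error) {
	var pod *corev1.Pod
	var err error
	for i := 0; i < maxRetries; i++ {
		pod, err = getPod()
		if err != nil {
			return nil, err
		}
		if pod.Status.Phase != corev1.PodPending {
			return pod, nil
		}
		time.Sleep(retryDelay) // the blocking wait under discussion
	}
	return pod, nil
}
```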

@nicklesimba
Collaborator Author

Fixed a unit test (a test that was previously expected to fail should now pass) with the latest force push. It was too small a change to keep as its own commit, so I squashed it down.

@maiqueb
Collaborator

maiqueb commented Aug 17, 2023

> > Looking at the code, I'm especially worried about the blocking wait until the pod is no longer in pending state.
>
> @maiqueb can you point out where a blocking wait is happening? AFAIU, there's only a 500ms wait at a time, and at most three retries, totalling 1.5s of waiting per pod. To me this seems very reasonable even if the issue were to crop up in several pods.

here: https://github.com/k8snetworkplumbingwg/whereabouts/pull/375/files#diff-8a16f9c8d1f2a9d01692f7cf9b2ee6a6ceceef840a9daaf4ee7e7e173aaf7ebfR133

It's a sleep: we block the thread for that duration.

What I'm trying to say is that we should re-queue the request and check, in the next iteration, whether the pod we read is still in the pending state.

Is there something preventing this approach?
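A minimal sketch of the re-queue approach being suggested, using a shared pod informer plus a rate-limited workqueue whose `AddAfter` defers the next check instead of sleeping in the reconcile thread. The wiring below is illustrative and not taken from the whereabouts code:

```go
package pending

import (
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/util/workqueue"
)

// wirePodQueue hooks a shared pod informer up to a rate-limited workqueue.
// Pods that are still Pending are re-queued with AddAfter, so nothing
// sleeps inside the reconcile thread. The caller is expected to start the
// factory and run workers that drain the queue.
func wirePodQueue(factory informers.SharedInformerFactory) workqueue.RateLimitingInterface {
	queue := workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter())
	podInformer := factory.Core().V1().Pods().Informer()

	podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		UpdateFunc: func(_, newObj interface{}) {
			pod, ok := newObj.(*corev1.Pod)
			if !ok {
				return
			}
			key, err := cache.MetaNamespaceKeyFunc(pod)
			if err != nil {
				return
			}
			if pod.Status.Phase == corev1.PodPending {
				// Revisit this pod later instead of blocking on a sleep.
				queue.AddAfter(key, 500*time.Millisecond)
				return
			}
			queue.Add(key)
		},
	})
	return queue
}
```

A worker would then pop keys off the queue on each iteration and re-check only those pods, rather than blocking the whole reconcile loop.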

@nicklesimba
Collaborator Author

To summarize some discussion: we have decided to proceed with merging this for now, and to track Miguel's suggested implementation as a separate task. The downside of the current implementation is that having a lot of pending pods at the same time will cause the reconcile cycle to take a long time. However, the current implementation still solves pods stuck in pending state, and is overall better than not having a fix.

To do things the "proper" way, we will need to keep a list of the pending pods in the reconcile looper struct, and retry for them. This would also need to be integrated with the ip-control-loop to sync retries.
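A rough sketch of what "keep a list of the pending pods in the reconcile looper struct" could look like; `ReconcileLooper` here is a stand-in name, not the actual whereabouts type:

```go
package pending

import (
	"sync"

	"k8s.io/apimachinery/pkg/types"
)

// ReconcileLooper (illustrative) remembers which pods were still Pending in
// this cycle so the next reconcile cycle, or the ip-control-loop, can retry
// them instead of sleeping inside the current cycle.
type ReconcileLooper struct {
	mu          sync.Mutex
	pendingPods map[types.NamespacedName]struct{}
}

// RememberPending records a pod that was Pending during this cycle.
func (l *ReconcileLooper) RememberPending(key types.NamespacedName) {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.pendingPods == nil {
		l.pendingPods = map[types.NamespacedName]struct{}{}
	}
	l.pendingPods[key] = struct{}{}
}

// TakePending returns the pods to re-check and clears the list.
func (l *ReconcileLooper) TakePending() []types.NamespacedName {
	l.mu.Lock()
	defer l.mu.Unlock()
	keys := make([]types.NamespacedName, 0, len(l.pendingPods))
	for k := range l.pendingPods {
		keys = append(keys, k)
	}
	l.pendingPods = nil
	return keys
}
```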

Member

@dougbtv dougbtv left a comment


Thanks for working through this to reach consensus, and I'm glad we'll follow up on the blocking issues Miguel mentioned.

@nicklesimba nicklesimba merged commit 16baf31 into k8snetworkplumbingwg:master Aug 31, 2023
10 checks passed