
Customize node configuration: add pod-max-pids to avoid PID exhaustion #2276

Closed
tdihp opened this issue Apr 18, 2021 · 14 comments
Labels
feature-request Requested Features

Comments


tdihp commented Apr 18, 2021

What happened:

Applications can allocate too many threads, exhausting the node's PID space; kubelet/containerd then hit EAGAIN when trying to create a new thread with pthread_create. We observe PLEG failures and the node going NotReady because of a single offending application.

What you expected to happen:

Add pod-pid-limits (kubelet's --pod-max-pids) as an option for custom node configuration. Configuring a smaller per-pod value should protect node readiness from any single runaway pod.
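As a rough illustration of why a per-pod cap helps (not from the original report): a per-pod PID budget can be derived from the kernel's overall PID ceiling and the node's max pods. The helper below is a hypothetical sketch; the function name, the reserve value, and the example numbers are all assumptions for illustration.

```python
def pod_max_pids(kernel_pid_max, max_pods, system_reserve=2048):
    """Hypothetical helper: split the kernel PID space evenly across pods,
    after reserving headroom for kubelet, containerd, and OS daemons."""
    usable = kernel_pid_max - system_reserve
    if usable <= 0:
        raise ValueError('system reserve exceeds kernel pid_max')
    return usable // max_pods

# e.g. with the common default pid_max of 32768 and 110 pods per node:
print(pod_max_pids(32768, 110))  # 279
```

With a cap like this in place, a thread-bursting pod hits its own limit instead of starving kubelet and the container runtime of PIDs.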

How to reproduce it (as minimally and precisely as possible):

Start a test Python pod in two steps and wait roughly 6 minutes for the node to go NotReady:

kubectl run -it --rm --restart=Never --image=python:3-slim python -- python
import time
import threading

def thread_burst(n=1000, t=600):
    """Start up to n threads, each sleeping for t seconds."""
    threads = []
    for i in range(n):
        thread = threading.Thread(target=time.sleep, args=(t,))
        try:
            thread.start()
        except Exception as e:
            # EAGAIN from pthread_create surfaces here once the PID
            # space (or a pids cgroup limit) is exhausted.
            print('got exception when starting thread: %s' % e)
            break
        threads.append(thread)
    return threads

def main():
    threads = []
    while True:
        print('bursting threads')
        threads.extend(thread_burst(n=1000))
        time.sleep(10)

main()

Environment:

  • Kubernetes version (use kubectl version): 1.19.7
@ghost ghost added the triage label Apr 18, 2021

ghost commented Apr 18, 2021

Hi tdihp, AKS bot here 👋
Thank you for posting on the AKS Repo, I'll do my best to get a kind human from the AKS team to assist you.

I might be just a bot, but I'm told my suggestions are normally quite good, as such:

  1. If this case is urgent, please open a Support Request so that our 24/7 support team may help you faster.
  2. Please abide by the AKS repo Guidelines and Code of Conduct.
  3. If you're having an issue, could it be described on the AKS Troubleshooting guides or AKS Diagnostics?
  4. Make sure you're subscribed to the AKS Release Notes to keep up to date with all that's new on AKS.
  5. Make sure there isn't a duplicate of this issue already reported. If there is, feel free to close this one and '+1' the existing issue.
  6. If you have a question, do take a look at our AKS FAQ. We place the most common ones there!


tdihp commented Apr 18, 2021

related: #323


ghost commented Apr 20, 2021

Triage required from @Azure/aks-pm


ghost commented Apr 25, 2021

Action required from @Azure/aks-pm

@ghost ghost added the Needs Attention 👋 Issues needs attention/assignee/owner label Apr 25, 2021

ghost commented May 10, 2021

Issue needing attention of @Azure/aks-leads

2 similar comments

ghost commented May 26, 2021

Issue needing attention of @Azure/aks-leads


ghost commented Jun 10, 2021

Issue needing attention of @Azure/aks-leads

@justindavies
Contributor

A PR has been raised to enable this in the Azure CLI for the next release, and the documentation will also be updated. I'll keep this open until it has been completed.

@ghost ghost removed action-required Needs Attention 👋 Issues needs attention/assignee/owner labels Jun 17, 2021
@justindavies justindavies added the feature-request Requested Features label Jun 17, 2021
@ghost ghost removed the triage label Jun 17, 2021

zacharias33 commented Aug 30, 2021

Hi, I am experiencing a similar issue on Kubernetes 1.19.11. The PID space of a random node gets exhausted, and my only workaround for now is to restart that node. Are there any updates on this feature?


thunter1000 commented Sep 27, 2021

Hi, this is also impacting us on Kubernetes 1.19.11. When the PID space is exhausted it takes down our calico-node pod, which impacts everything on the node. Restarting the node does seem to resolve the issue. Has this change been released yet?

@ghost ghost added the action-required label Mar 26, 2022

ghost commented Mar 31, 2022

Action required from @Azure/aks-pm

@ghost ghost added the Needs Attention 👋 Issues needs attention/assignee/owner label Mar 31, 2022

ghost commented Apr 15, 2022

Issue needing attention of @Azure/aks-leads


tdihp commented Apr 19, 2022

For users affected by this, I'd suggest identifying the culprit application causing the exhaustion.
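One quick way to find the offender is to count threads per process from /proc on the affected node. The sketch below is an illustration, not part of the original comment; it assumes a Linux node with procfs mounted and uses the Threads: field from /proc/<pid>/status.

```python
import os
import re

def thread_count(status_text):
    """Parse the Threads: field from a /proc/<pid>/status blob."""
    m = re.search(r'^Threads:\s+(\d+)', status_text, re.MULTILINE)
    return int(m.group(1)) if m else 0

def top_thread_users(limit=5):
    """List (threads, name, pid) for the heaviest thread users, highest first."""
    counts = []
    for pid in filter(str.isdigit, os.listdir('/proc')):
        try:
            with open('/proc/%s/status' % pid) as f:
                text = f.read()
        except OSError:
            continue  # process exited while we were scanning
        name = re.search(r'^Name:\s+(\S+)', text, re.MULTILINE)
        counts.append((thread_count(text), name.group(1) if name else '?', pid))
    return sorted(counts, reverse=True)[:limit]
```

Running top_thread_users() on the node (e.g. from a privileged debug pod with the host's /proc) should put a thread-bursting workload like the repro above at the top of the list.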

@ghost ghost removed action-required Needs Attention 👋 Issues needs attention/assignee/owner labels Apr 19, 2022

tdihp commented Apr 19, 2022

Closing, as custom node configuration is now GA with podMaxPids.
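For anyone landing here later: a minimal kubelet config sketch for AKS custom node configuration (field name per the AKS custom node configuration docs; the value 1000 is illustrative, size it for your workloads):

```json
{
  "podMaxPids": 1000
}
```

This is passed at node pool creation, e.g. with the documented --kubelet-config flag on az aks nodepool add; verify the exact flag and schema against the current AKS docs.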

@tdihp tdihp closed this as completed Apr 19, 2022
@ghost ghost locked as resolved and limited conversation to collaborators May 19, 2022