Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GC hanging on API client cleanup #1037

Closed
fabianvf opened this issue Dec 20, 2019 · 19 comments · Fixed by #1073
Closed

GC hanging on API client cleanup #1037

fabianvf opened this issue Dec 20, 2019 · 19 comments · Fixed by #1073
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@fabianvf
Copy link
Contributor

What happened (please include outputs or screenshots):
Python hangs during garbage collection of the API client, in the __del__ method.

What you expected to happen:
Garbage collection succeeds and the process exits

How to reproduce it (as minimally and precisely as possible):

import kubernetes

client = kubernetes.config.new_client_from_config()
thread = kubernetes.client.ApisApi(client).get_api_versions(async_req=True)
del client

Anything else we need to know?:
The hang doesn't actually occur until python exits. IE, with the following script:

import kubernetes

client = kubernetes.config.new_client_from_config()
thread = kubernetes.client.ApisApi(client).get_api_versions(async_req=True)
del client
print("done")

The output is as follows (I ctrl-ced twice after the process hung):

done



^CException ignored in: <bound method ApiClient.__del__ of <kubernetes.client.api_client.ApiClient object at 0x7f8f9ba67c50>>
Traceback (most recent call last):
  File "/home/fabian/Envs/test/lib/python3.6/site-packages/kubernetes/client/api_client.py", line 81, in __del__
    self._pool.join()
  File "/usr/lib64/python3.6/multiprocessing/pool.py", line 546, in join
    self._worker_handler.join()
  File "/usr/lib64/python3.6/threading.py", line 1056, in join
    self._wait_for_tstate_lock()
  File "/usr/lib64/python3.6/threading.py", line 1072, in _wait_for_tstate_lock
    elif lock.acquire(block, timeout):
KeyboardInterrupt
^CError in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/usr/lib64/python3.6/multiprocessing/util.py", line 262, in _run_finalizers
    finalizer()
  File "/usr/lib64/python3.6/multiprocessing/util.py", line 186, in __call__
    res = self._callback(*self._args, **self._kwargs)
  File "/usr/lib64/python3.6/multiprocessing/pool.py", line 582, in _terminate_pool
    worker_handler.join()
  File "/usr/lib64/python3.6/threading.py", line 1056, in join
    self._wait_for_tstate_lock()
  File "/usr/lib64/python3.6/threading.py", line 1072, in _wait_for_tstate_lock
    elif lock.acquire(block, timeout):
KeyboardInterrupt

Environment:

  • Kubernetes version (kubectl version): Server Version: version.Info{Major:"1", Minor:"14+", GitVersion:"v1.14.6+2e5ed54", GitCommit:"2e5ed54", GitTreeState:"clean", BuildDate:"2019-10-10T22:04:13Z", GoVersion:"go1.12.8", Compiler:"gc", Platform:"linux/amd64"}

  • OS (e.g., MacOS 10.13.6): Fedora 27 (Workstation Edition)

  • Python version (python --version): 2.7.15 and 3.6.6

  • Python client version (pip list | grep kubernetes): 10.0.1

@fabianvf fabianvf added the kind/bug Categorizes issue or PR as related to a bug. label Dec 20, 2019
@fabianvf
Copy link
Contributor Author

adding self._pool.terminate() before this line solved the issue for me, though I'm not familiar enough with the multiprocessing library to know if that's a good practice. Also it looks like this would likely need to be changed in the upstream generator?

@vinzent
Copy link

vinzent commented Dec 23, 2019

reading the different terminate() docs on https://docs.python.org/3.6/library/multiprocessing.html doesn't sound like calling terminate() is the thing we should do.

tried to get more debug output by extending your reproducer:


import multiprocessing, logging
logger = multiprocessing.log_to_stderr()
logger.setLevel(multiprocessing.SUBDEBUG)

import kubernetes

client = kubernetes.config.new_client_from_config()
thread = kubernetes.client.ApisApi(client).get_api_versions(async_req=True)
del client

but now the reproduces doesn't reproduce anymore.

@vinzent
Copy link

vinzent commented Dec 23, 2019

also having bit of sleep like:

import kubernetes
import time

client = kubernetes.config.new_client_from_config()
thread = kubernetes.client.ApisApi(client).get_api_versions(async_req=True)
time.sleep(0.01)
del client

makes it exit clean.

@palnabarun
Copy link
Member

but now the reproduces doesn't reproduce anymore.

@vinzent I'm still able to reproduce the bug with the debug logging.

However, adding the sleep before delete works.

I think this needs to be investigated more.

@fabianvf
Copy link
Contributor Author

fabianvf commented Jan 2, 2020

I think it is possible for this to happen even if the request is not async, though I'm still trying to figure out how to reproduce it

@vinzent
Copy link

vinzent commented Jan 6, 2020

trying to run:

response = kubernetes.client.ApisApi(client).get_api_versions(_request_timeout=0.5)

it also blocks.

if i set timeout >= 1 second it retries 3 times and ends successfully.

at least according docs its possible to specify as float https://urllib3.readthedocs.io/en/latest/user-guide.html#using-timeouts

but I don't think this is the same issue. backtrace: https://gist.github.com/vinzent/151ef384ded68d86881a8a948bc9e410

@roycaihw
Copy link
Member

roycaihw commented Jan 6, 2020

/cc @roycaihw

@roycaihw
Copy link
Member

roycaihw commented Jan 6, 2020

I wonder if this is reproducible with https://github.com/OpenAPITools/openapi-generator/tree/master/samples/client/petstore/python/petstore_api. It looks like you can test a client against the server https://petstore.swagger.io/

@vinzent
Copy link

vinzent commented Jan 7, 2020

@roycaihw at least for a quick test I wasn't able to reproduce it with the petstore python client

import petstore_api

api_client = petstore_api.ApiClient()
api_instance = petstore_api.AnotherFakeApi(api_client)
body = petstore_api.Client() # Client | client model

api_response = api_instance.call_123_test_special_tags(body, async_req=True)
del api_client
print("done")

works for me with Python 3.7.4 (Fedora 31) and urllib 1.25.7 (vs. python 3.6 (RH UBI 8) and urllib 1.24.x)

@vinzent
Copy link

vinzent commented Jan 7, 2020

@roycaihw with kubernetes python client I can also reproduce with Python 3.7.4 (Fedora 31) and latest kubernetes python 10.0.1 (the ansible operator uses 8.0.2).

@fabianvf
Copy link
Contributor Author

fabianvf commented Jan 7, 2020

@vinzent does your version of urllib3 match for the petstore test and the kubernetes test?

@fabianvf
Copy link
Contributor Author

fabianvf commented Jan 22, 2020

It looks like it may be a swagger-codegen* bug: swagger-api/swagger-codegen#9991

First reported here, but it looks like the issue is in swagger-codegen: https://bugs.python.org/issue39360

edit: Not a CPython bug, just swagger-codegen

@gaborbernat
Copy link

@fabianvf it's not a CPython bug, it's a bug within swagger-api/swagger-codegen. Calling pool terminate within the garbage collector is not supported.

@fabianvf
Copy link
Contributor Author

@gaborbernat my apologies, I misread the issue on bugs.python.org. Updated

@vinzent
Copy link

vinzent commented Jan 23, 2020

I think openapi-generator is used, not swagger-codegen.

https://github.com/OpenAPITools/openapi-generator/blob/master/modules/openapi-generator/src/main/resources/python/api_client.mustache#L78

so probably we need to open up a issue there aswell?

@fabianvf
Copy link
Contributor Author

fabianvf commented Jan 23, 2020

@vinzent Good catch, I saw the familiar client code and didn't even think about it. I'll clone the issue and PR

edit: OpenAPITools/openapi-generator#5094

@fabianvf
Copy link
Contributor Author

The generator PR has merged, what's the process for regenerating the client with those changes (ideally back a few versions as well)? I'm happy to create a manual PR and backports if that's necessary

@gaborbernat
Copy link

Before regeneration they need to release not? Did that happen?🤔

@micw523
Copy link
Contributor

micw523 commented Jan 29, 2020

You might want to check out this to use a specific commit of the OpenAPI generator.

fabianvf added a commit to fabianvf/python that referenced this issue Feb 6, 2020
This removes the __del__ function from the generated Python client,
and replaces it with a cleanup function. When a ThreadPool is created,
the cleanup function is registered with the atexit module.

This PR also allows the client to be used as a context manager, which
will automatically clean up after itself rather than having to wait til
process exit.

This fixes kubernetes-client#1037, where the API client could hang indefinitely at
garbage collection.
fabianvf added a commit to fabianvf/python that referenced this issue Feb 6, 2020
This removes the __del__ function from the generated Python client,
and replaces it with a cleanup function. When a ThreadPool is created,
the cleanup function is registered with the atexit module.

This PR also allows the client to be used as a context manager, which
will automatically clean up after itself rather than having to wait til
process exit.

This fixes issue kubernetes-client#1037, where the API client could hang indefinitely at
garbage collection.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug Categorizes issue or PR as related to a bug.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants