Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

typing.py: builtin LRU caches worsen leaks that exist in other code #98253

Closed
wjakob opened this issue Oct 13, 2022 · 19 comments
Closed

typing.py: builtin LRU caches worsen leaks that exist in other code #98253

wjakob opened this issue Oct 13, 2022 · 19 comments
Labels
performance Performance or resource usage topic-typing type-feature A feature request or enhancement

Comments

@wjakob
Copy link
Contributor

wjakob commented Oct 13, 2022

Bug report

I would like to report a refleak issue involving typing.py. The issue is that it internally uses LRU caches to cache certain type-related lookups, and these caches are not cleaned up when the Python interpreter shuts down. This causes leaks that impede software development and debugging of refleaks in general.

This specific part of typing.py has already once been identified as a source of refleaks by @gvanrossum (context: https://bugs.python.org/issue28649).

The following provides a small reproducer via a trivial package (https://github.com/wjakob/typing_repro) that exposes a class named A using nanobind. Why nanobind? It is extremely paranoid about any leaks involving bound types, functions, and instances, and prints warning messages to tell the user about this after the interpreter has shut down (it performs checks following finalization using Py_AtExit()).

preparation:

$ pip install git+https://github.com/wjakob/typing_repro

Reproducer:

from typing_repro import A

import pandas
import typing

def test(t: typing.Optional[A] = None):
    print(t)

Running this yields

nanobind: leaked 1 types!                                                                                         
 - leaked type "A"                                                                                                
nanobind: leaked 2 functions!
 - leaked function "add"
 - leaked function "__init__"
nanobind: this is likely caused by a reference counting issue in the binding code.

Note the import of pandas, which serves the role of a bigger package that uses the typing module and thereby populates the LRU caches. torch (PyTorch) or tensorflow also cause the issue, as does markupsafe, others likely affected as well.

EDIT: The problem that is common to all of these packages is that they leak some of their own types. For example, by Py_INCREFing references to heap types within extension modules. Because these types use typing.py and thereby reference the LRU caches (which are never cleaned up), it causes a flurry of refleaks that cascade into other packages.

Removing the test() function or removing the type annotation fixes the issue. The problem is that declaration causes cache entries to be created that are never cleaned up, even when the interpreter finalizes.

There is another way to avoid the issue: at the bottom of the script, insert

for f in typing._cleanups:
    f()

which clears the LRU caches in typing.py. Poof, errors gone. This leads me to suggest the following simple fix, to be added at the end of typing.py:

def _cleanup_handler():
    for f in _cleanups:
        f()

import atexit as _atexit
_atexit.register(_cleanup_handler)

This will clear the caches and ensure that interpreter finalization can avoid those type annotation-related leaks.

Your environment

  • CPython versions tested on: 3.8.10 and 3.10.;7
  • Operating system and architecture: Linux and macOS
@wjakob wjakob added the type-bug An unexpected behavior, bug, or error label Oct 13, 2022
@AlexWaygood AlexWaygood added performance Performance or resource usage topic-typing labels Oct 13, 2022
@JelleZijlstra
Copy link
Member

I'm not overly familiar with memory management around the shutdown sequence, but shouldn't these caches get collected automatically when the module object is deallocated and its dict is cleared? Or should every function that uses @lru_cache register an atexit handler to clear the cache?

@gvanrossum
Copy link
Member

I honestly have no idea, but if this really was a problem I'd also expect to have heard of this before. Possibly nanobind's check runs before the typing module is cleaned up? Note that the OP appears to be the author of nanobind.

@wjakob
Copy link
Contributor Author

wjakob commented Oct 14, 2022

What is potentially confusing is that there are two different kinds of atexit() at play in Python.

  1. nanobind's exit handler is installed via Py_AtExit(). This is a list of up to 32 function pointers that are run after the interpreter is already fully shut down. Most Python API calls including garbage collection are no longer possible at this point.

    Because nanobind maintains some internal data structures for every bound type outside of Python, it knows if its types have all been fully garbage collected. This has been invaluable in tracking down bugs in bindings, and the library is therefore intentionally very noisy upon noticing leaks.

  2. Then, there is the Python atexit module that can be registered from within Python. Those run while the interpreter is still alive. The hack/workaround I suggested above uses that principle to clear the LRU caches in typing.py.

In principle, I agree with @JelleZijlstra's sentiment: the references should be known to Python's cyclic GC, which can then clean things up (including LRU caches related to type annotations). But somehow this is not happening.

@wjakob
Copy link
Contributor Author

wjakob commented Oct 14, 2022

There really does seem to be something broken specifically with the LRU cache and garbage collection.

Python's codebase contains several places where code must manually clean the caches to avoid leaks when running the tests in refleak-hunting mode.

https://github.com/python/cpython/blob/main/Lib/test/libregrtest/utils.py#L211
https://github.com/python/cpython/blob/main/Lib/test/test_typing.py#L74
https://github.com/python/cpython/blob/main/Lib/test/test_typing.py#L5584

Some more context on these changes is given in the 2016 bugtracker issue: https://bugs.python.org/issue28649.

@rhettinger
Copy link
Contributor

@wjakob Do your observations persist when using the pure Python version of the lru_cache instead of the C version?

The C version fully participates in GC and should behave like any other container (i.e. we don't normally clear every list, dict, and set prior to shutdown). If the C version is suspected to be buggy, here are some leads that we can follow:

https://mail.python.org/archives/list/python-dev@python.org/thread/C4ILXGPKBJQYUN5YDMTJOEOX7RHOD4S3/
https://bugs.python.org/issue42536
https://bugs.python.org/issue35780

@wjakob
Copy link
Contributor Author

wjakob commented Oct 19, 2022

@rhettinger: I am not very familiar with the implementation of the LRU cache but tried the following -- please let me know if that's wrong. Specifically, I went into functools.py and commented out the lines

try:
    from _functools import _lru_cache_wrapper
except ImportError:
    pass

under the assumption that this is what it takes for the cache to switch over to the pure Python version. I observe that it then uses the Python version of the _lru_cache_wrapper function defined in the same file. However, this did not fix the leak:

nanobind: leaked 1 types!
 - leaked type "typing_repro.typing_repro_ext.A"
nanobind: leaked 2 functions!
 - leaked function "add"
 - leaked function "__init__"
nanobind: this is likely caused by a reference counting issue in the binding code.

The issue should be very easy to reproduce with the 5-LOC script in the first post. Does it happen on your end?

@JelleZijlstra
Copy link
Member

JelleZijlstra commented Oct 19, 2022

I can reproduce the nanobind message with your sample script, but the leak apparently no longer happens when I comment out import pandas. pandas is a big pile of code, so perhaps whatever leak nanobind detects is actually in pandas?

@wjakob
Copy link
Contributor Author

wjakob commented Oct 19, 2022

It happens with other big packages as well (for example, try PyTorch or Tensorflow). What I can see is that the LRU caches are filled with lots of data when pandas is imported, and that seems to play a role. I could see how pandas maybe has some reference counting issues internally, but I don't see how that would cause my own types to no longer be deallocated by typing.py.

The caches are definitely implicated in some form -- adding the following code to the reproducer fixes the issue for example:

import atexit

def cleanup():
    import typing
    for cleanup_func in typing._cleanups:
        cleanup_func()

atexit.register(cleanup)

@JelleZijlstra
Copy link
Member

I tried to narrow it down by removing parts of pandas's __init__.py, and found that the reported leak goes away when the pandas._libs.json module is no longer imported. That's a Cython module with no apparent relevance to types; I think its code is here.

@wjakob
Copy link
Contributor Author

wjakob commented Oct 20, 2022

Ok, I think I found a smoking gun.

Here is another, much smaller extension, that also produces a type leak:

from typing_repro import A

import markupsafe
import typing

def test(t: typing.Optional[A] = None):
    print(t)

produces the dreaded

nanobind: leaked 1 types!
 - leaked type "typing_repro.typing_repro_ext.A"
nanobind: leaked 2 functions!
 - leaked function "__init__"
 - leaked function "add"
nanobind: this is likely caused by a reference counting issue in the binding code.

The markupsafe extension contains a pure Python portion defining a markupsafe.Markup class, which includes type annotations that presumably trigger the LRU cache. But there is also a native C extension markupsafe._speedups, which is tiny, self-contained, and doesn't even use Cython (phew!).

The problem can be tied down to a function that is called as part of the module initialization:

static PyObject* markup;

static int
init_constants(void)
{
	PyObject *module;

	/* import markup type so that we can mark the return value */
	module = PyImport_ImportModule("markupsafe");
	if (!module)
		return 0;
	markup = PyObject_GetAttrString(module, "Markup");
	Py_DECREF(module);

	return 1;
}

/* removed lots of stuff ... */

PyMODINIT_FUNC
PyInit__speedups(void)
{
	if (!init_constants())
		return NULL;

	return PyModule_Create(&module_definition);
}

What's the problem? This extension module internally stashes a reference to the Markup class so that it can be used for internal purposes later on. It also implicitly increases the reference count via the PyObject_GetAttrString method. Adding Py_DECREF(markup); causes the warning to go away. So this is at the end of the day a simple reference leak within the markupsafe package.

HOWEVER: I don't think it is reasonable that this also causes other heap types to leak all across the Python interpreter. And it is specifically typing and the builtin LRU cache that are to blame for this.

@wjakob
Copy link
Contributor Author

wjakob commented Oct 20, 2022

I see are potential workarounds. First, the LRU cache used by typing could potentially be modified so that it holds weak references that don't cause leaks across the program. However, I am not sure how practical that is.

The second one is the band-aid I suggested. The following could be added somewhere in typing.py and would take care of the leak at shutdown time.

def _cleanup_handler():
    for f in _cleanups:
        f()

import atexit as _atexit
_atexit.register(_cleanup_handler)

@JelleZijlstra
Copy link
Member

Good catch! So the leak chain is markupsafe.Markup -> one of its method has an Optional[str] annotation -> Optional[str] resolves to an instance of typing._UnionGenericAlias -> typing._UnionGenericAlias.__getitem__ holds on to its LRU cache -> that cache contains a reference to typing_repro.A.

I suppose we could put in the cleanup you recommend, but wouldn't it be better to fix the third-party packages that have this bug?

@wjakob
Copy link
Contributor Author

wjakob commented Oct 21, 2022

I suppose we could put in the cleanup you recommend, but wouldn't it be better to fix the third-party packages that have this bug?

It would not fix the issue in general – as my tests above showed (try, e.g. replacing markupsafe by pandas, tensorflow, pytorch, etc.) this pattern of extension modules holding a reference to a type is very common. It's a benign kind of leak that does not bother anyone else.

What the LRU cache does is to spread this benign leak into a web of leaks involving anything else that also uses typing. That, I think is the gist of the issue.

@wjakob
Copy link
Contributor Author

wjakob commented Oct 21, 2022

Here is a proposed change to typing.py that fixes this behavior, using yet another approach. The idea is to break the reference chain at the typing._UnionGenericAlias.__getitem__ -> lru_cache segment, by using the function as a key into a global dict storing all of the used LRU caches.

--- typing.py   2022-10-21 08:36:47.000000000 +0200
+++ typing.py   2022-10-21 08:36:54.000000000 +0200
@@ -293,6 +293,7 @@
 
 
 _cleanups = []
+_caches = { }
 
 
 def _tp_cache(func=None, /, *, typed=False):
@@ -300,13 +301,15 @@
     original function for non-hashable arguments.
     """
     def decorator(func):
-        cached = functools.lru_cache(typed=typed)(func)
-        _cleanups.append(cached.cache_clear)
+        cache = functools.lru_cache(typed=typed)(func)
+        _caches[func] = cache 
+        _cleanups.append(cache.cache_clear)
+        del cache
 
         @functools.wraps(func)
         def inner(*args, **kwds):
             try:
-                return cached(*args, **kwds)
+                return _caches[func](*args, **kwds)
             except TypeError:
                 pass  # All real errors (not unhashable args) are raised below.
             return func(*args, **kwds)

I tested this change, and it fixes the refleak issue on my end.

@Fidget-Spinner
Copy link
Member

Just to reiterate so that I don't get lost in the weeds here:

  1. typing.py's LRU cache itself doesn't leak references. It just holds onto other things which may leak references or is held by said other things.
  2. Those other leaky things may hold a circular reference to typing.py's LRU cache through indirect means that are naturally found in typing.py.
  3. So it's proposed to break the reference cycle on typing.py's end, instead of getting the leaky extension to do it.

Is this correct?

@wjakob
Copy link
Contributor Author

wjakob commented Oct 25, 2022

Yes, that summarizes things well. I would add that typing is an elementary part of the Python ecosystem, which includes many compiled extension with individually benign type leaks that now cause issues through the interaction with typing. Expecting them to be fixed instead would seem to me like an impossible game of whack-a-mole.

@Fidget-Spinner Fidget-Spinner changed the title typing.py: builtin LRU caches leak references typing.py: builtin LRU caches worsen leaks that exist in other code Oct 25, 2022
@Fidget-Spinner Fidget-Spinner added type-feature A feature request or enhancement and removed type-bug An unexpected behavior, bug, or error labels Oct 29, 2022
wjakob added a commit to wjakob/cpython that referenced this issue Oct 31, 2022
wjakob added a commit to wjakob/cpython that referenced this issue Oct 31, 2022
@JelleZijlstra
Copy link
Member

Note that the repro no longer works at the moment because nanobind 0.0.8 contains a workaround for this leak. I proposed wjakob/typing_repro#3 to "fix" it. In the meantime, to use the repro, first install nanobind 0.0.7 and the package's other build requirements, then install typing_repro as in the OP above but with --no-build-isolation.

AlexWaygood pushed a commit that referenced this issue Nov 30, 2022
@AlexWaygood
Copy link
Member

Thanks for the PR and for being patient, @wjakob

carljm added a commit to carljm/cpython that referenced this issue Dec 1, 2022
* main: (112 commits)
  pythongh-99894: Ensure the local names don't collide with the test file in traceback suggestion error checking (python#99895)
  pythongh-99612: Fix PyUnicode_DecodeUTF8Stateful() for ASCII-only data (pythonGH-99613)
  Doc: Add summary line to isolation_level & autocommit sqlite3.connect params (python#99917)
  pythonGH-98906 ```re``` module: ```search() vs. match()``` section should mention ```fullmatch()``` (pythonGH-98916)
  pythongh-89189: More compact range iterator (pythonGH-27986)
  bpo-47220: Document the optional callback parameter of weakref.WeakMethod (pythonGH-25491)
  pythonGH-99905: Fix output of misses in summarize_stats.py execution counts (pythonGH-99906)
  pythongh-99845: PEP 670: Convert PyObject macros to functions (python#99850)
  pythongh-99845: Use size_t type in __sizeof__() methods (python#99846)
  pythonGH-99877)
  Fix typo in exception message in `multiprocessing.pool` (python#99900)
  pythongh-87092: move all localsplus preparation into separate function called from assembler stage (pythonGH-99869)
  pythongh-99891: Fix infinite recursion in the tokenizer when showing warnings (pythonGH-99893)
  pythongh-99824: Document that sqlite3.connect implicitly open a transaction if autocommit=False (python#99825)
  pythonGH-81057: remove static state from suggestions.c (python#99411)
  Improve zip64 limit error message (python#95892)
  pythongh-98253: Break potential reference cycles in external code worsened by typing.py lru_cache (python#98591)
  pythongh-99127: Allow some features of syslog to the main interpreter only (pythongh-99128)
  pythongh-82836: fix private network check (python#97733)
  Docs: improve accuracy of socketserver reference (python#24767)
  ...
@wjakob
Copy link
Contributor Author

wjakob commented Dec 2, 2022

Thanks for accepting my patch :-)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Performance or resource usage topic-typing type-feature A feature request or enhancement
Projects
None yet
Development

No branches or pull requests

6 participants