typing.py: builtin LRU caches worsen leaks that exist in other code #98253

wjakob · 2022-10-13T22:18:52Z

Bug report

I would like to report a refleak issue involving typing.py. The issue is that it internally uses LRU caches to cache certain type-related lookups, and these caches are not cleaned up when the Python interpreter shuts down. This causes leaks that impede software development and debugging of refleaks in general.

This specific part of typing.py has already once been identified as a source of refleaks by @gvanrossum (context: https://bugs.python.org/issue28649).

The following provides a small reproducer via a trivial package (https://github.com/wjakob/typing_repro) that exposes a class named A using nanobind. Why nanobind? It is extremely paranoid about any leaks involving bound types, functions, and instances, and prints warning messages to tell the user about this after the interpreter has shut down (it performs checks following finalization using Py_AtExit()).

preparation:

$ pip install git+https://github.com/wjakob/typing_repro

Reproducer:

from typing_repro import A

import pandas
import typing

def test(t: typing.Optional[A] = None):
    print(t)

Running this yields

nanobind: leaked 1 types!                                                                                         
 - leaked type "A"                                                                                                
nanobind: leaked 2 functions!
 - leaked function "add"
 - leaked function "__init__"
nanobind: this is likely caused by a reference counting issue in the binding code.

Note the import of pandas, which serves the role of a bigger package that uses the typing module and thereby populates the LRU caches. torch (PyTorch) or tensorflow also cause the issue, as does markupsafe, others likely affected as well.

EDIT: The problem that is common to all of these packages is that they leak some of their own types. For example, by Py_INCREFing references to heap types within extension modules. Because these types use typing.py and thereby reference the LRU caches (which are never cleaned up), it causes a flurry of refleaks that cascade into other packages.

Removing the test() function or removing the type annotation fixes the issue. The problem is that declaration causes cache entries to be created that are never cleaned up, even when the interpreter finalizes.

There is another way to avoid the issue: at the bottom of the script, insert

for f in typing._cleanups:
    f()

which clears the LRU caches in typing.py. Poof, errors gone. This leads me to suggest the following simple fix, to be added at the end of typing.py:

def _cleanup_handler():
    for f in _cleanups:
        f()

import atexit as _atexit
_atexit.register(_cleanup_handler)

This will clear the caches and ensure that interpreter finalization can avoid those type annotation-related leaks.

Your environment

CPython versions tested on: 3.8.10 and 3.10.;7
Operating system and architecture: Linux and macOS

PR: gh-98253: Break potential reference cycles in external code worsened by typing.py lru_cache #98591

The text was updated successfully, but these errors were encountered:

JelleZijlstra · 2022-10-14T01:57:44Z

I'm not overly familiar with memory management around the shutdown sequence, but shouldn't these caches get collected automatically when the module object is deallocated and its dict is cleared? Or should every function that uses @lru_cache register an atexit handler to clear the cache?

gvanrossum · 2022-10-14T03:29:57Z

I honestly have no idea, but if this really was a problem I'd also expect to have heard of this before. Possibly nanobind's check runs before the typing module is cleaned up? Note that the OP appears to be the author of nanobind.

wjakob · 2022-10-14T07:28:19Z

What is potentially confusing is that there are two different kinds of atexit() at play in Python.

nanobind's exit handler is installed via Py_AtExit(). This is a list of up to 32 function pointers that are run after the interpreter is already fully shut down. Most Python API calls including garbage collection are no longer possible at this point.

Because nanobind maintains some internal data structures for every bound type outside of Python, it knows if its types have all been fully garbage collected. This has been invaluable in tracking down bugs in bindings, and the library is therefore intentionally very noisy upon noticing leaks.
Then, there is the Python atexit module that can be registered from within Python. Those run while the interpreter is still alive. The hack/workaround I suggested above uses that principle to clear the LRU caches in typing.py.

In principle, I agree with @JelleZijlstra's sentiment: the references should be known to Python's cyclic GC, which can then clean things up (including LRU caches related to type annotations). But somehow this is not happening.

wjakob · 2022-10-14T07:38:04Z

There really does seem to be something broken specifically with the LRU cache and garbage collection.

Python's codebase contains several places where code must manually clean the caches to avoid leaks when running the tests in refleak-hunting mode.

https://github.com/python/cpython/blob/main/Lib/test/libregrtest/utils.py#L211
https://github.com/python/cpython/blob/main/Lib/test/test_typing.py#L74
https://github.com/python/cpython/blob/main/Lib/test/test_typing.py#L5584

Some more context on these changes is given in the 2016 bugtracker issue: https://bugs.python.org/issue28649.

rhettinger · 2022-10-15T22:14:12Z

@wjakob Do your observations persist when using the pure Python version of the lru_cache instead of the C version?

The C version fully participates in GC and should behave like any other container (i.e. we don't normally clear every list, dict, and set prior to shutdown). If the C version is suspected to be buggy, here are some leads that we can follow:

https://mail.python.org/archives/list/python-dev@python.org/thread/C4ILXGPKBJQYUN5YDMTJOEOX7RHOD4S3/
https://bugs.python.org/issue42536
https://bugs.python.org/issue35780

wjakob · 2022-10-19T22:25:28Z

@rhettinger: I am not very familiar with the implementation of the LRU cache but tried the following -- please let me know if that's wrong. Specifically, I went into functools.py and commented out the lines

try:
    from _functools import _lru_cache_wrapper
except ImportError:
    pass

under the assumption that this is what it takes for the cache to switch over to the pure Python version. I observe that it then uses the Python version of the _lru_cache_wrapper function defined in the same file. However, this did not fix the leak:

nanobind: leaked 1 types!
 - leaked type "typing_repro.typing_repro_ext.A"
nanobind: leaked 2 functions!
 - leaked function "add"
 - leaked function "__init__"
nanobind: this is likely caused by a reference counting issue in the binding code.

The issue should be very easy to reproduce with the 5-LOC script in the first post. Does it happen on your end?

JelleZijlstra · 2022-10-19T22:37:56Z

I can reproduce the nanobind message with your sample script, but the leak apparently no longer happens when I comment out import pandas. pandas is a big pile of code, so perhaps whatever leak nanobind detects is actually in pandas?

wjakob · 2022-10-19T22:52:12Z

It happens with other big packages as well (for example, try PyTorch or Tensorflow). What I can see is that the LRU caches are filled with lots of data when pandas is imported, and that seems to play a role. I could see how pandas maybe has some reference counting issues internally, but I don't see how that would cause my own types to no longer be deallocated by typing.py.

The caches are definitely implicated in some form -- adding the following code to the reproducer fixes the issue for example:

import atexit

def cleanup():
    import typing
    for cleanup_func in typing._cleanups:
        cleanup_func()

atexit.register(cleanup)

JelleZijlstra · 2022-10-19T23:28:46Z

I tried to narrow it down by removing parts of pandas's __init__.py, and found that the reported leak goes away when the pandas._libs.json module is no longer imported. That's a Cython module with no apparent relevance to types; I think its code is here.

wjakob · 2022-10-20T11:24:59Z

Ok, I think I found a smoking gun.

Here is another, much smaller extension, that also produces a type leak:

from typing_repro import A

import markupsafe
import typing

def test(t: typing.Optional[A] = None):
    print(t)

produces the dreaded

nanobind: leaked 1 types!
 - leaked type "typing_repro.typing_repro_ext.A"
nanobind: leaked 2 functions!
 - leaked function "__init__"
 - leaked function "add"
nanobind: this is likely caused by a reference counting issue in the binding code.

The markupsafe extension contains a pure Python portion defining a markupsafe.Markup class, which includes type annotations that presumably trigger the LRU cache. But there is also a native C extension markupsafe._speedups, which is tiny, self-contained, and doesn't even use Cython (phew!).

The problem can be tied down to a function that is called as part of the module initialization:

static PyObject* markup;

static int
init_constants(void)
{
	PyObject *module;

	/* import markup type so that we can mark the return value */
	module = PyImport_ImportModule("markupsafe");
	if (!module)
		return 0;
	markup = PyObject_GetAttrString(module, "Markup");
	Py_DECREF(module);

	return 1;
}

/* removed lots of stuff ... */

PyMODINIT_FUNC
PyInit__speedups(void)
{
	if (!init_constants())
		return NULL;

	return PyModule_Create(&module_definition);
}

What's the problem? This extension module internally stashes a reference to the Markup class so that it can be used for internal purposes later on. It also implicitly increases the reference count via the PyObject_GetAttrString method. Adding Py_DECREF(markup); causes the warning to go away. So this is at the end of the day a simple reference leak within the markupsafe package.

HOWEVER: I don't think it is reasonable that this also causes other heap types to leak all across the Python interpreter. And it is specifically typing and the builtin LRU cache that are to blame for this.

wjakob · 2022-10-20T14:37:25Z

I see are potential workarounds. First, the LRU cache used by typing could potentially be modified so that it holds weak references that don't cause leaks across the program. However, I am not sure how practical that is.

The second one is the band-aid I suggested. The following could be added somewhere in typing.py and would take care of the leak at shutdown time.

def _cleanup_handler():
    for f in _cleanups:
        f()

import atexit as _atexit
_atexit.register(_cleanup_handler)

JelleZijlstra · 2022-10-21T00:03:30Z

Good catch! So the leak chain is markupsafe.Markup -> one of its method has an Optional[str] annotation -> Optional[str] resolves to an instance of typing._UnionGenericAlias -> typing._UnionGenericAlias.__getitem__ holds on to its LRU cache -> that cache contains a reference to typing_repro.A.

I suppose we could put in the cleanup you recommend, but wouldn't it be better to fix the third-party packages that have this bug?

wjakob · 2022-10-21T06:23:42Z

I suppose we could put in the cleanup you recommend, but wouldn't it be better to fix the third-party packages that have this bug?

It would not fix the issue in general – as my tests above showed (try, e.g. replacing markupsafe by pandas, tensorflow, pytorch, etc.) this pattern of extension modules holding a reference to a type is very common. It's a benign kind of leak that does not bother anyone else.

What the LRU cache does is to spread this benign leak into a web of leaks involving anything else that also uses typing. That, I think is the gist of the issue.

wjakob · 2022-10-21T06:39:30Z

Here is a proposed change to typing.py that fixes this behavior, using yet another approach. The idea is to break the reference chain at the typing._UnionGenericAlias.__getitem__ -> lru_cache segment, by using the function as a key into a global dict storing all of the used LRU caches.

--- typing.py   2022-10-21 08:36:47.000000000 +0200
+++ typing.py   2022-10-21 08:36:54.000000000 +0200
@@ -293,6 +293,7 @@
 
 
 _cleanups = []
+_caches = { }
 
 
 def _tp_cache(func=None, /, *, typed=False):
@@ -300,13 +301,15 @@
     original function for non-hashable arguments.
     """
     def decorator(func):
-        cached = functools.lru_cache(typed=typed)(func)
-        _cleanups.append(cached.cache_clear)
+        cache = functools.lru_cache(typed=typed)(func)
+        _caches[func] = cache 
+        _cleanups.append(cache.cache_clear)
+        del cache
 
         @functools.wraps(func)
         def inner(*args, **kwds):
             try:
-                return cached(*args, **kwds)
+                return _caches[func](*args, **kwds)
             except TypeError:
                 pass  # All real errors (not unhashable args) are raised below.
             return func(*args, **kwds)

I tested this change, and it fixes the refleak issue on my end.

Fidget-Spinner · 2022-10-25T12:29:48Z

Just to reiterate so that I don't get lost in the weeds here:

typing.py's LRU cache itself doesn't leak references. It just holds onto other things which may leak references or is held by said other things.
Those other leaky things may hold a circular reference to typing.py's LRU cache through indirect means that are naturally found in typing.py.
So it's proposed to break the reference cycle on typing.py's end, instead of getting the leaky extension to do it.

Is this correct?

wjakob · 2022-10-25T13:28:55Z

Yes, that summarizes things well. I would add that typing is an elementary part of the Python ecosystem, which includes many compiled extension with individually benign type leaks that now cause issues through the interaction with typing. Expecting them to be fixed instead would seem to me like an impossible game of whack-a-mole.

…py lru_cache (python#98253)

JelleZijlstra · 2022-11-05T16:47:08Z

Note that the repro no longer works at the moment because nanobind 0.0.8 contains a workaround for this leak. I proposed wjakob/typing_repro#3 to "fix" it. In the meantime, to use the repro, first install nanobind 0.0.7 and the package's other build requirements, then install typing_repro as in the OP above but with --no-build-isolation.

…by typing.py lru_cache (#98591)

AlexWaygood · 2022-11-30T08:34:45Z

Thanks for the PR and for being patient, @wjakob

* main: (112 commits) pythongh-99894: Ensure the local names don't collide with the test file in traceback suggestion error checking (python#99895) pythongh-99612: Fix PyUnicode_DecodeUTF8Stateful() for ASCII-only data (pythonGH-99613) Doc: Add summary line to isolation_level & autocommit sqlite3.connect params (python#99917) pythonGH-98906 ```re``` module: ```search() vs. match()``` section should mention ```fullmatch()``` (pythonGH-98916) pythongh-89189: More compact range iterator (pythonGH-27986) bpo-47220: Document the optional callback parameter of weakref.WeakMethod (pythonGH-25491) pythonGH-99905: Fix output of misses in summarize_stats.py execution counts (pythonGH-99906) pythongh-99845: PEP 670: Convert PyObject macros to functions (python#99850) pythongh-99845: Use size_t type in __sizeof__() methods (python#99846) pythonGH-99877) Fix typo in exception message in `multiprocessing.pool` (python#99900) pythongh-87092: move all localsplus preparation into separate function called from assembler stage (pythonGH-99869) pythongh-99891: Fix infinite recursion in the tokenizer when showing warnings (pythonGH-99893) pythongh-99824: Document that sqlite3.connect implicitly open a transaction if autocommit=False (python#99825) pythonGH-81057: remove static state from suggestions.c (python#99411) Improve zip64 limit error message (python#95892) pythongh-98253: Break potential reference cycles in external code worsened by typing.py lru_cache (python#98591) pythongh-99127: Allow some features of syslog to the main interpreter only (pythongh-99128) pythongh-82836: fix private network check (python#97733) Docs: improve accuracy of socketserver reference (python#24767) ...

wjakob · 2022-12-02T13:35:47Z

Thanks for accepting my patch :-)

wjakob added the type-bug An unexpected behavior, bug, or error label Oct 13, 2022

This was referenced Oct 13, 2022

[BUG]: Default member function argument causes leak wjakob/nanobind#69

Closed

leak report wjakob/nanobind#19

Closed

AlexWaygood added performance Performance or resource usage topic-typing labels Oct 13, 2022

JelleZijlstra mentioned this issue Oct 21, 2022

BUG: Reference leak in pandas/_libs/src/ujson/python/objToJSON.c pandas-dev/pandas#49222

Closed

3 tasks

wjakob added a commit to wjakob/cpython that referenced this issue Oct 24, 2022

fix refleak issue in typing.py (python#98253)

5bb22b0

wjakob added a commit to wjakob/cpython that referenced this issue Oct 24, 2022

fix refleak issue in typing.py (python#98253)

0ac2917

bedevere-bot mentioned this issue Oct 24, 2022

gh-98253: Break potential reference cycles in external code worsened by typing.py lru_cache #98591

Merged

wjakob added a commit to wjakob/cpython that referenced this issue Oct 24, 2022

fix refleak issue in typing.py (python#98253)

7291ac0

Fidget-Spinner changed the title ~~typing.py: builtin LRU caches leak references~~ typing.py: builtin LRU caches worsen leaks that exist in other code Oct 25, 2022

Fidget-Spinner added type-feature A feature request or enhancement and removed type-bug An unexpected behavior, bug, or error labels Oct 29, 2022

wjakob added a commit to wjakob/cpython that referenced this issue Oct 31, 2022

break potential reference cycles in external code worsened by typing.…

1ce6651

…py lru_cache (python#98253)

wjakob added a commit to wjakob/cpython that referenced this issue Oct 31, 2022

break potential reference cycles in external code worsened by typing.…

2fed985

…py lru_cache (python#98253)

AlexWaygood pushed a commit that referenced this issue Nov 30, 2022

gh-98253: Break potential reference cycles in external code worsened …

c314198

…by typing.py lru_cache (#98591)

AlexWaygood closed this as completed Nov 30, 2022

ncullen93 mentioned this issue May 26, 2024

potential bug: reference leak when torch is loaded and a function errors ANTsX/ANTsPy#666

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

typing.py: builtin LRU caches worsen leaks that exist in other code #98253

typing.py: builtin LRU caches worsen leaks that exist in other code #98253

wjakob commented Oct 13, 2022 •

edited by bedevere-bot

Loading

JelleZijlstra commented Oct 14, 2022

gvanrossum commented Oct 14, 2022

wjakob commented Oct 14, 2022 •

edited

Loading

wjakob commented Oct 14, 2022 •

edited

Loading

rhettinger commented Oct 15, 2022

wjakob commented Oct 19, 2022

JelleZijlstra commented Oct 19, 2022 •

edited

Loading

wjakob commented Oct 19, 2022

JelleZijlstra commented Oct 19, 2022

wjakob commented Oct 20, 2022

wjakob commented Oct 20, 2022 •

edited

Loading

JelleZijlstra commented Oct 21, 2022

wjakob commented Oct 21, 2022

wjakob commented Oct 21, 2022 •

edited

Loading

Fidget-Spinner commented Oct 25, 2022

wjakob commented Oct 25, 2022

JelleZijlstra commented Nov 5, 2022

AlexWaygood commented Nov 30, 2022

wjakob commented Dec 2, 2022

typing.py: builtin LRU caches worsen leaks that exist in other code #98253

typing.py: builtin LRU caches worsen leaks that exist in other code #98253

Comments

wjakob commented Oct 13, 2022 • edited by bedevere-bot Loading

Bug report

Your environment

JelleZijlstra commented Oct 14, 2022

gvanrossum commented Oct 14, 2022

wjakob commented Oct 14, 2022 • edited Loading

wjakob commented Oct 14, 2022 • edited Loading

rhettinger commented Oct 15, 2022

wjakob commented Oct 19, 2022

JelleZijlstra commented Oct 19, 2022 • edited Loading

wjakob commented Oct 19, 2022

JelleZijlstra commented Oct 19, 2022

wjakob commented Oct 20, 2022

wjakob commented Oct 20, 2022 • edited Loading

JelleZijlstra commented Oct 21, 2022

wjakob commented Oct 21, 2022

wjakob commented Oct 21, 2022 • edited Loading

Fidget-Spinner commented Oct 25, 2022

wjakob commented Oct 25, 2022

JelleZijlstra commented Nov 5, 2022

AlexWaygood commented Nov 30, 2022

wjakob commented Dec 2, 2022

wjakob commented Oct 13, 2022 •

edited by bedevere-bot

Loading

wjakob commented Oct 14, 2022 •

edited

Loading

wjakob commented Oct 14, 2022 •

edited

Loading

JelleZijlstra commented Oct 19, 2022 •

edited

Loading

wjakob commented Oct 20, 2022 •

edited

Loading

wjakob commented Oct 21, 2022 •

edited

Loading