typing.py: builtin LRU caches worsen leaks that exist in other code #98253
Comments
I'm not overly familiar with memory management around the shutdown sequence, but shouldn't these caches get collected automatically when the module object is deallocated and its dict is cleared? Or should every function that uses |
I honestly have no idea, but if this really was a problem I'd also expect to have heard of this before. Possibly nanobind's check runs before the typing module is cleaned up? Note that the OP appears to be the author of nanobind. |
What is potentially confusing is that there are two different kinds of
In principle, I agree with @JelleZijlstra's sentiment: the references should be known to Python's cyclic GC, which can then clean things up (including LRU caches related to type annotations). But somehow this is not happening. |
There really does seem to be something broken specifically with the LRU cache and garbage collection. Python's codebase contains several places where code must manually clean the caches to avoid leaks when running the tests in refleak-hunting mode. https://github.com/python/cpython/blob/main/Lib/test/libregrtest/utils.py#L211 Some more context on these changes is given in the 2016 bugtracker issue: https://bugs.python.org/issue28649. |
@wjakob Do your observations persist when using the pure Python version of the lru_cache instead of the C version? The C version fully participates in GC and should behave like any other container (i.e. we don't normally clear every list, dict, and set prior to shutdown). If the C version is suspected to be buggy, here are some leads that we can follow: https://mail.python.org/archives/list/python-dev@python.org/thread/C4ILXGPKBJQYUN5YDMTJOEOX7RHOD4S3/ |
@rhettinger: I am not very familiar with the implementation of the LRU cache but tried the following -- please let me know if that's wrong. Specifically, I went into
under the assumption that this is what it takes for the cache to switch over to the pure Python version. I observe that it then uses the Python version of the
The issue should be very easy to reproduce with the 5-LOC script in the first post. Does it happen on your end? |
I can reproduce the nanobind message with your sample script, but the leak apparently no longer happens when I comment out |
It happens with other big packages as well (for example, try PyTorch or Tensorflow). What I can see is that the LRU caches are filled with lots of data when The caches are definitely implicated in some form -- adding the following code to the reproducer fixes the issue for example:
|
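For readers following along, here is a minimal sketch of what clearing the caches from user code can look like. It assumes the private `typing._cleanups` list (the same list that appears in the diff further down this thread), which is an implementation detail and may change between Python versions. The helper names `clear_typing_caches` and `cached_type_survives` are made up for illustration and are not part of any library.

```python
import gc
import typing
import weakref


def clear_typing_caches() -> None:
    # Assumption: typing._cleanups is a private list of cache_clear callables.
    for cleanup in list(getattr(typing, "_cleanups", [])):
        cleanup()


def cached_type_survives(clear_first: bool) -> bool:
    """Report whether a throwaway class is still alive after we drop it."""

    class Local:
        pass

    ref = weakref.ref(Local)
    typing.Optional[Local]  # subscripting populates typing's LRU cache
    del Local
    if clear_first:
        clear_typing_caches()
    gc.collect()  # classes sit in reference cycles, so run the cyclic GC
    return ref() is not None
```

Comparing `cached_type_survives(False)` with `cached_type_survives(True)` shows whether the class stays reachable only through the caches.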
I tried to narrow it down by removing parts of pandas's |
Ok, I think I found a smoking gun. Here is another, much smaller extension, that also produces a type leak:
from typing_repro import A
import markupsafe
import typing
def test(t: typing.Optional[A] = None):
    print(t)
produces the dreaded
The The problem can be tied down to a function that is called as part of the module initialization:
static PyObject* markup;
static int
init_constants(void)
{
    PyObject *module;
    /* import markup type so that we can mark the return value */
    module = PyImport_ImportModule("markupsafe");
    if (!module)
        return 0;
    markup = PyObject_GetAttrString(module, "Markup");
    Py_DECREF(module);
    return 1;
}
/* removed lots of stuff ... */
PyMODINIT_FUNC
PyInit__speedups(void)
{
    if (!init_constants())
        return NULL;
    return PyModule_Create(&module_definition);
}
What's the problem? This extension module internally stashes a reference to the HOWEVER: I don't think it is reasonable that this also causes other heap types to leak all across the Python interpreter. And it is specifically |
I see two potential workarounds. First, the LRU cache used by The second one is the band-aid I suggested. The following could be added somewhere in
|
Good catch! So the leak chain is I suppose we could put in the cleanup you recommend, but wouldn't it be better to fix the third-party packages that have this bug? |
It would not fix the issue in general – as my tests above showed (try, e.g. replacing What the LRU cache does is to spread this benign leak into a web of leaks involving anything else that also uses |
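To make the "web of leaks" point concrete, here is a small pure-Python sketch. It is only an analogy under stated assumptions: `_immortal` stands in for a reference that an extension module never releases, `make_optional` stands in for a cached lookup like the ones in typing.py, and the `factory` attribute is a hypothetical stand-in for however the leaked type reaches the cache. None of these names come from the packages discussed here.

```python
import functools
import gc
import weakref

_immortal = []  # stand-in for a reference that is never released


def bystander_survives(leak_first_type: bool) -> bool:
    """Check whether an unrelated class is kept alive via a shared cache."""

    @functools.lru_cache(maxsize=None)
    def make_optional(tp):
        # The cache stores tp both as a key and inside the result.
        return ("Optional", tp)

    class First:
        pass

    class Bystander:
        pass

    First.factory = make_optional  # First can reach the shared cache
    make_optional(First)
    make_optional(Bystander)
    ref = weakref.ref(Bystander)
    if leak_first_type:
        _immortal.append(First)  # the one "benign" leak
    del First, Bystander, make_optional
    gc.collect()
    return ref() is not None
```

When only `First` is leaked, `Bystander` stays alive too, because the chain runs from the leaked type through the cache to every other cached entry.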
Here is a proposed change to
--- typing.py 2022-10-21 08:36:47.000000000 +0200
+++ typing.py 2022-10-21 08:36:54.000000000 +0200
@@ -293,6 +293,7 @@
 _cleanups = []
+_caches = { }
 def _tp_cache(func=None, /, *, typed=False):
@@ -300,13 +301,15 @@
     original function for non-hashable arguments.
     """
     def decorator(func):
-        cached = functools.lru_cache(typed=typed)(func)
-        _cleanups.append(cached.cache_clear)
+        cache = functools.lru_cache(typed=typed)(func)
+        _caches[func] = cache
+        _cleanups.append(cache.cache_clear)
+        del cache
         @functools.wraps(func)
         def inner(*args, **kwds):
             try:
-                return cached(*args, **kwds)
+                return _caches[func](*args, **kwds)
             except TypeError:
                 pass  # All real errors (not unhashable args) are raised below.
             return func(*args, **kwds)
I tested this change, and it fixes the refleak issue on my end. |
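The idea behind the diff can be sketched outside of typing.py. In the closure variant the wrapper itself owns the cache, so anything that keeps the wrapper alive keeps every cached value alive. In the registry variant the wrapper only looks the cache up in a module-level dict, so emptying that dict releases the cached values even if the wrapper survives. The decorator names and `survives_after_teardown` are illustrative assumptions, and `_caches.clear()` is only a stand-in for the module's globals being torn down at shutdown.

```python
import functools
import gc
import weakref

_caches = {}  # module-level registry, mirroring the name used in the diff


def cache_in_closure(func):
    cached = functools.lru_cache(maxsize=None)(func)

    @functools.wraps(func)
    def inner(*args):
        return cached(*args)  # the wrapper's closure owns the cache

    return inner


def cache_in_registry(func):
    _caches[func] = functools.lru_cache(maxsize=None)(func)

    @functools.wraps(func)
    def inner(*args):
        return _caches[func](*args)  # looked up via module globals

    return inner


class Payload:
    pass


def survives_after_teardown(decorator) -> bool:
    @decorator
    def identity(obj):
        return obj

    payload = Payload()
    ref = weakref.ref(payload)
    identity(payload)  # payload is now stored in the cache
    del payload
    _caches.clear()  # stand-in for clearing the module dict at shutdown
    gc.collect()
    # `identity` is still alive here, as if a leaked object referenced it.
    return ref() is not None
```

This is a simplified model: the real decorator also falls back to the uncached function for unhashable arguments, which is omitted here.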
Just to reiterate so that I don't get lost in the weeds here:
Is this correct? |
Yes, that summarizes things well. I would add that |
Note that the repro no longer works at the moment because nanobind 0.0.8 contains a workaround for this leak. I proposed wjakob/typing_repro#3 to "fix" it. In the meantime, to use the repro, first install nanobind 0.0.7 and the package's other build requirements, then install |
…by typing.py lru_cache (#98591)
Thanks for the PR and for being patient, @wjakob |
Thanks for accepting my patch :-) |
Bug report
I would like to report a refleak issue involving typing.py. The issue is that it internally uses LRU caches to cache certain type-related lookups, and these caches are not cleaned up when the Python interpreter shuts down. This causes leaks that impede software development and debugging of refleaks in general.
This specific part of typing.py has already once been identified as a source of refleaks by @gvanrossum (context: https://bugs.python.org/issue28649).
The following provides a small reproducer via a trivial package (https://github.com/wjakob/typing_repro) that exposes a class named A using nanobind. Why nanobind? It is extremely paranoid about any leaks involving bound types, functions, and instances, and prints warning messages to tell the user about this after the interpreter has shut down (it performs checks following finalization using Py_AtExit()).
preparation:
Reproducer:
Running this yields
Note the import of pandas, which serves the role of a bigger package that uses the typing module and thereby populates the LRU caches. torch (PyTorch) or tensorflow also cause the issue, as does markupsafe, and others are likely affected as well.
EDIT: The problem that is common to all of these packages is that they leak some of their own types, for example by Py_INCREFing references to heap types within extension modules. Because these types use typing.py and thereby reference the LRU caches (which are never cleaned up), it causes a flurry of refleaks that cascade into other packages.
Removing the test() function or removing the type annotation fixes the issue. The problem is that the declaration causes cache entries to be created that are never cleaned up, even when the interpreter finalizes.
There is another way to avoid the issue: at the bottom of the script, insert
which clears the LRU caches in typing.py. Poof, errors gone. This leads me to suggest the following simple fix, to be added at the end of typing.py:
This will clear the caches and ensure that interpreter finalization can avoid those type annotation-related leaks.
Your environment