Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

C implementation of @implementer leaks objects #216

Closed
richvdh opened this issue Sep 27, 2020 · 41 comments · Fixed by #217
Closed

C implementation of @implementer leaks objects #216

richvdh opened this issue Sep 27, 2020 · 41 comments · Fixed by #217

Comments

@richvdh
Copy link

richvdh commented Sep 27, 2020

It appears that the C implementation of @implementer has a reference leak. This leads to a gradual increase in GC time, in a long-running application.

The following test script:

#!/usr/bin/env python

import gc
from zope.interface.declarations import implementer
from zope.interface.interface import Interface

for i in range(20000):
    @implementer(Interface)
    class TestClass:
        pass

    if i % 1000 == 0:
        gc.collect()
        n = len(gc.get_objects())
        print("After %i cycles: %i objects" % (i, n))

produces the following output:

After 0 cycles: 8199 objects
After 1000 cycles: 45199 objects
After 2000 cycles: 82199 objects
After 3000 cycles: 119199 objects
After 4000 cycles: 156199 objects
After 5000 cycles: 193199 objects
After 6000 cycles: 230199 objects
After 7000 cycles: 267199 objects
After 8000 cycles: 304199 objects
After 9000 cycles: 341199 objects
After 10000 cycles: 378199 objects
After 11000 cycles: 415199 objects
After 12000 cycles: 452199 objects
After 13000 cycles: 489199 objects
After 14000 cycles: 526199 objects
After 15000 cycles: 563199 objects
After 16000 cycles: 600199 objects
After 17000 cycles: 637199 objects
After 18000 cycles: 674199 objects
After 19000 cycles: 711199 objects

In other words, each invocation of @implementer leads to new objects which cannot be garbage-collected.

Disabling the C implementations (by moving aside the _zope_interface_coptimizations.cpython-36m-x86_64-linux-gnu.so file) resolves the problem, as does downgrading to zope.interface 4.7.2.

@richvdh
Copy link
Author

richvdh commented Sep 27, 2020

git bisect suggests the problem was introduced in ddf670a.

@jamadden
Copy link
Member

Thanks for the report. I’ll look into it tomorrow unless someone sees the problem first.

@richvdh
Copy link
Author

richvdh commented Sep 27, 2020

This patch seems to make the problem go away. I'm not really familiar enough with either zope.interfaces or the Python C API to say for certain if it is correct (or complete):

diff --git a/src/zope/interface/_zope_interface_coptimizations.c b/src/zope/interface/_zope_interface_coptimizations.c
index 346e7f2..0764bde 100644
--- a/src/zope/interface/_zope_interface_coptimizations.c
+++ b/src/zope/interface/_zope_interface_coptimizations.c
@@ -359,6 +359,7 @@ Spec_traverse(Spec* self, visitproc visit, void* arg)
 static int
 Spec_clear(Spec* self)
 {
+    Py_CLEAR(self->_bases);
     Py_CLEAR(self->_implied);
     Py_CLEAR(self->_dependents);
     Py_CLEAR(self->_v_attrs);

Suffice it to say that SpecificationBase._bases seems to confuse the garbage collector.

@jamadden
Copy link
Member

Thanks, that does indeed look like the problem. Spec->_bases was added in ab0466e (#155), but was simply overlooked in both Spec_traverse and Spec_clear.

@richvdh I'm curious. My group has been running applications with uptimes measured in weeks and hasn't noticed this issue; they heavily use zope.interface/zope.component/zope.site and Pyramid. Would you say your application is a typical user of zope.interface, or is it somewhat unusual in the amount of dynamic specifications it creates?

jamadden added a commit that referenced this issue Sep 28, 2020
…bases

Previously they did not, leading to a reference leak of a tuple.

Fixes #216
@jensens
Copy link
Member

jensens commented Sep 28, 2020

I have a Plone 5.2.2 site using the latest zope.interface, running with some unexpected memory behaviour, growing slowly over several days from initial 400-500MB to 3GB (my limit when it restarts the process). There are that many possible factors what's wrong, it is difficult to tell if this could be the reason. Also, I need to enhance my memory-leak debugging skills...

But when fixed I can update the service immediately using a branch and report back (needs some days).

@richvdh
Copy link
Author

richvdh commented Sep 28, 2020

@jamadden good stuff, seems like an easy fix.

Our application is based on top of Twisted, which uses zope.interface internally. I couldn't really say exactly what it is doing that is causing the problem here; a couple of things came to mind:

Again, I'm not sure if those are the actual problem here, because I struggled to reproduce the problem outside our application. But they do look like dynamic behaviour that Zope might not expect, but that all Twisted-based applications are likely to hit.

@richvdh
Copy link
Author

richvdh commented Sep 28, 2020

@jensens An interesting artifact of our experince was that, because the retained objects were tiny (1- or 2-tuples), the increase in memory usage was barely perceptible. We (or rather, our users) noticed it via increased GC time despite a relatively flat memory consumption.

@d-maurer
Copy link
Contributor

d-maurer commented Sep 28, 2020 via email

@jamadden
Copy link
Member

Thanks everyone, this all makes sense. Given a scenario like this:

# Static interface definition, static class definition
class IFoo(Interface):
    pass
class Bar(object):
    pass

# Something done per-request/connection/etc
bar = Bar()
alsoProvides(bar, IFoo)

The bar object now has a __provides__ attribute that is-a SpecificationBase (the problematic object). Such usage is not at all uncommon.

The contents of the leaking SpecificationBase.__bases__ will be (IFoo, <implementedBy Bar>) — two static objects. The only thing that was newly allocated is the tuple itself. So memory growth would be quite small (56 bytes on my system), while the total number of objects (tuples) grows per-request/whatever.

It's possible our application is subject to this but we've tuned GC to run rarely, accepting the occasional large pause. So we may not have noticed a slow growth in something that was already large. It's also possible our monitoring or metrics alerting just weren't capturing this, I haven't double-checked [/cc @sajones @cutz].

I think #217 should be the fix. If reviewed and accepted I can immediately release the patched version to PyPI.

@jensens
Copy link
Member

jensens commented Sep 28, 2020

In Plone we use z3c.form and other mechanisms to dynamically assign (even per request) interfaces. Also our tuples might be heavier than usual. So, this could be the reason for my observed memory leaks. I started a Plone buildout coredev test-run with the fixes. If this goes green I would consider the patch at least w/o side effects. My next check will be using the branch in my live site with the memory leaking problems.

@d-maurer
Copy link
Contributor

d-maurer commented Sep 28, 2020 via email

@jensens
Copy link
Member

jensens commented Sep 28, 2020

In addition, alsoProvides is applied to any request
(to support the "browserlayer" functionality).

Indeed, and this together with lineage subsites (additional own browserlayers) and plone.subrequest (Mosaic). That is a lot alsoProvides per request.

@jamadden
Copy link
Member

Indeed, each intermediate alsoProvides() replaces the __provides__ attribute, it doesn't augment it. Thus each individual call to alsoProvides would leak a tuple. Likewise for noLongerProvides or anything else built ondirectlyProvides.

@jensens
Copy link
Member

jensens commented Sep 29, 2020

@jamadden
Copy link
Member

jamadden commented Sep 29, 2020

I'm not sure what that report is telling us, the stack trace seems to have left out the actual error. But guessing from <weakref at 0x7fa19db030b0; to 'InterfaceClass' at 0x7fa19e39a9e0 (IBlank)>, something was relying on a weak ref still existing that no longer does. That makes a certain amount of sense as we are now clearing a reference to whatever was in __bases__ (of previous specifications) where before they were allowed to leak. Is it a KeyError?

Note that setting __bases__ did not change in this PR. All that changed was the behaviour when an object was deallocated. Setting _bases, and managing the refcounts for that, is handled entirely by CPython (we expose the field of the struct as a getset member, and CPython essentially generates the @property code to handle all conversion and refcounting).

@jensens
Copy link
Member

jensens commented Sep 30, 2020

Either I'm not sure what it tells me. Maybe a wrong implementation on Plone side, may be test specific. This is the the pain code, where zope.interface is used in a very dynamic way. But strange, it only appears in Python 3.7?

@d-maurer
Copy link
Contributor

d-maurer commented Sep 30, 2020 via email

@jensens
Copy link
Member

jensens commented Sep 30, 2020

Update:

  • retry tests on Jenkins now: fails
  • local testing, on Python 3.7.8 running plone.dexterity tests only: works

Jenkins uses 3.7.6, so I installed this version on my machine.

  • local testing, on Python 3.7.6 running plone.dexterity tests only: works

Maybe the test are failing only in combination with other tests, as Jenkins on Python 3.7.6 does:

  • local testing all tests, on Python 3.7.6 with all tests: fails!
  • local testing all tests, on Python 3.7.8 with all tests: fails!
  • local testing all tests, on Python 3.8.6 with all tests: works!

So I have still no clue why, but it is a problem on Python 3.7 only running the tests in combination.

@richvdh
Copy link
Author

richvdh commented Sep 30, 2020

Forgive my naivety, but this sounds like a problem specific to plone which is being uncovered by this fix in zope.interface. Shouldn't it be tracked in an issue on that repository?

@jensens
Copy link
Member

jensens commented Sep 30, 2020

@richvdh I dont think this is Plone specific. Before the fix all worked w/o any problems. Other projects can be affected as well.

@jensens
Copy link
Member

jensens commented Sep 30, 2020

Full traceback

Error in test test_dynamic_schema_refreshed_on_modify_schema_policy (plone.dexterity.tests.test_fti.TestFTIEvents)
Traceback (most recent call last):
  File "/srv/python3.7/lib/python3.7/unittest/case.py", line 59, in testPartExecutor
    yield
  File "/srv/python3.7/lib/python3.7/unittest/case.py", line 628, in run
    testMethod()
  File "/home/jenkins/.buildout/eggs/cp37m/plone.dexterity-2.9.8-py3.7.egg/plone/dexterity/tests/test_fti.py", line 750, in test_dynamic_schema_refreshed_on_modify_schema_policy
    DexterityFTIModificationDescription('schema_policy', '')
  File "/home/jenkins/.buildout/eggs/cp37m/plone.dexterity-2.9.8-py3.7.egg/plone/dexterity/fti.py", line 561, in ftiModified
    syncSchema(model.schema, schema, overwrite=True, sync_bases=sync_bases)
  File "/home/jenkins/.buildout/eggs/cp37m/plone.supermodel-1.6.2-py3.7.egg/plone/supermodel/utils.py", line 338, in syncSchema
    dest.__bases__ = tuple(bases)
  File "/home/jenkins/workspace/pull-request-6.0-3.7@2/src/zope.interface/src/zope/interface/interface.py", line 402, in __setBases
    b.unsubscribe(self)
  File "/home/jenkins/workspace/pull-request-6.0-3.7@2/src/zope.interface/src/zope/interface/interface.py", line 389, in unsubscribe
    n = self._dependents[dependent]
  File "/home/jenkins/shiningpanda/jobs/69a88cbb/virtualenvs/0dd3bfc1/lib/python3.7/weakref.py", line 396, in __getitem__
    return self.data[ref(key)]
KeyError: <weakref at 0x7fa19db030b0; to 'InterfaceClass' at 0x7fa19e39a9e0 (IBlank)>

self._dependents is a WeakKeyDictionary. It assumes on unsubscribe that the key exists. For some reason its gone on Python 3.7.

@mgedmin
Copy link
Member

mgedmin commented Sep 30, 2020

Perhaps adding a gc.collect() in the right place would make the test fail more reliably (and perhaps also on other Python versions)?

@richvdh
Copy link
Author

richvdh commented Sep 30, 2020

@jensens fair enough. that does look odd.

@jamadden
Copy link
Member

What's really interesting is that it's self that seems to have been collected!

    def unsubscribe(self, dependent):
        try:
            n = self._dependents[dependent] # Line 389
        except TypeError:
            raise KeyError(dependent)

    def __setBases(self, bases):
        # Remove ourselves as a dependent of our old bases
        for b in self.__bases__:
            b.unsubscribe(self) # Line 402

So b.unsubscribe(self) turns into b._dependents[self], which raises a KeyError.

Nominally, self could not have been collected while we're running methods of self and passing self as an argument...

@jamadden
Copy link
Member

This is complicated by the fact that many functions in the test_fti module create an interface called IBlank.

Interfaces are hashed based on just their name and module. So every one of these local IBlank interfaces will hash the same way, and be treated the same for purposes of _dependents.

If one of these IBlank was involved in a cycle that delayed its collection (and removal from _dependents) then later when GC ran, an active IBlank could be removed from _dependents. When we attempt to change the bases and remove IBlank, because the dead IBlank has just been collected and removed from _dependents, we would get this KeyError.

It's highly dependent on when GC runs, and assumes the existence of a cycle and that some base object persists between test functions so that GC can clear out the living entry. I'm working on a simple reproducer, but I suspect that the problem could be solved by either (a) adding gc.collect() to tearDown() or (b) using a different name for IBlank in some of these functions.

Why is this issue showing up only now? Before the PR, the contents of __bases__ were not subject to garbage collection, so the leak would prevent their weak refs from being cleaned up.

@jamadden
Copy link
Member

Ok, this demonstrates it pretty easily:

import gc
from zope.interface import Interface

class IRoot(Interface):
    pass

class IBlank(Interface):
    pass

IBlank.__bases__ = (IRoot,)
# IRoot now has IBlank in its _dependents
other_ref = IBlank # Keep alive
class IBlank(Interface): # Replace the original IBlank; this will hash identically
    pass

# Register the new IBlank as a dependent of IRoot.
IBlank.__bases__ = (IRoot,)

# Really, there should be two
print(list(IRoot._dependents.keys())) # -> [IBlank]

# Delete the original IBlank, which should remove it
# from _dependents. It takes a gc.collect() to do this,
# so there's a cycle somewhere.
del other_ref
gc.collect()
print(list(IRoot._dependents.keys())) # -> []

IBlank.__bases__ = (IRoot,) # KeyError
$ python foo.py
[<InterfaceClass __main__.IBlank>]
[]
Traceback (most recent call last):
  File "/tmp/foo.py", line 29, in <module>
    IBlank.__bases__ = (IRoot,)
  File "//src/zope/interface/interface.py", line 402, in __setBases
    b.unsubscribe(self)
  File "//src/zope/interface/interface.py", line 389, in unsubscribe
    n = self._dependents[dependent]
  File "//lib/python3.8/weakref.py", line 383, in __getitem__
    return self.data[ref(key)]
KeyError: <weakref at 0x10ce7ed10; to 'InterfaceClass' at 0x10ce71dd0 (IBlank)>

@jensens
Copy link
Member

jensens commented Sep 30, 2020

Well, since we hit a very specific edge case, I can grep through our code and rename those occurrences. This is probably in tests only. I would prefer to have it it fixed, but at least it needs a line in the changelog and some lines in the documentation.

@jamadden
Copy link
Member

I recall running into this exact problem before, with older versions of zope.interface, so it's not new; it just took a brief hiatus in zope.interface 5 because of this bug. (It was more frequent on PyPy because of its non-refcount GC. I seem to recall adding fixes for that to some public zopefoundation repositories before...) It's for that reason that I learned to be careful about using local interface classes and especially how they're named 😄

@jamadden
Copy link
Member

Given that we can't change equality/sorting/hashing of these objects without major BWC and persistence problems, the only way to fix this with minimum compatibility issues that I've been able to think of is to make _dependents identity based, e.g., WeakValueDictionary of self._dependents[id(dependent)] = dependent.

That has its own BWC concerns though. All of subscribe(), unsubscribe() and dependents are public methods or attributes. They're not documented as part of the interface, nor do they have docstrings, but given how handy they are, I'm sure they must be used outside this project, so changing the semantics of subscribe/unsubscribe away from hashed based could break…something…through the inverse of what's going on here. More likely, changing dependents from {dependent: number of times subscribed} to {id: dependent} would break the way anything finds actual dependent objects. We could use a @property to provide the old interface, but it would be disconnected from the backing data store, so direct changes to it wouldn't be reflected in the actual subscribers.

(I don't fully understand why dependents tracks the number of times something has subscribed. It must be intended to support nested calls, or subscribing the same object from multiple places. But it's not clear to me why that would actually happen. That complicates using a WeakValueDictionary, and we'd probably have to get into a custom datastructure to make it work.)

@jensens
Copy link
Member

jensens commented Sep 30, 2020

I renamed the interfaces to have unique names in plone/plone.dexterity#135 and tests are passing.

@jamadden
Copy link
Member

5.1.1 is on PyPI (it may be awhile before binary wheels get built as Travis is busy.)

@cutz
Copy link

cutz commented Sep 30, 2020

It's possible our application is subject to this but we've tuned GC to run rarely, accepting the occasional large pause. So we may not have noticed a slow growth in something that was already large. It's also possible our monitoring or metrics alerting just weren't capturing this, I haven't double-checked [/cc @sajones @cutz].

@jamadden @sajones I'm late to the party but I wanted to shed some light on why we may not have noticed this. Our application has had large organic usage growth this year and when we couple that with the large zodb caches we run, we've been attributing any memory growth towards caches filling up. Combine that with what is likely slower growth in our, relatively speaking, large memory footprint application and I think it just went unnoticed. That said, we had been starting to key in on some hot processes that were showing growth which didn't make sense to attribute to ZODB caches given our database sizes. We just hadn't tracked down any details yet.

Looking forward to seeing what sort of impact this has. Thanks All.

@d-maurer
Copy link
Contributor

d-maurer commented Sep 30, 2020 via email

@jamadden
Copy link
Member

Sort of? But zope.interface is hardly unique in this. Using weakrefs with objects that define non-identity hash functions can easily lead to this problem in many places in an application. Here's an example from the standard library:

>>> from uuid import UUID
>>> from weakref import WeakKeyDictionary
>>> UUID('{12345678-1234-5678-1234-567812345678}')
UUID('12345678-1234-5678-1234-567812345678')
>>> uuid1 = UUID('{12345678-1234-5678-1234-567812345678}')
>>> uuid2 = UUID('{12345678-1234-5678-1234-567812345678}')
>>> uuid1 == uuid2
True
>>> hash(uuid1) == hash(uuid2)
True
>>> d = WeakKeyDictionary()
>>> d[uuid1] = 42
>>> d[uuid2]
42
>>> del uuid1
>>> d[uuid2]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "//python3.8/weakref.py", line 383, in __getitem__
    return self.data[ref(key)]
KeyError: <weakref at 0x10ecca470; to 'UUID' at 0x10ec6f0f0>

I outlined some possible solutions and their drawbacks above. As rarely as this is actually a problem, I'm not sure it's worth the costs though.

It's worth noting that more than just hashing depends on interface names. Pickling is done by name too. No matter where in a module, or how many times, you define IFoo, when you pickle and unpickle it, you're going to get the module global named IFoo:

>>> import pickle
>>> from zope.interface import Interface
>>> class IFoo(Interface):
...     pass
...
>>> s = pickle.dumps(IFoo)
>>> IFoo = 42
>>> pickle.loads(s)
42

@d-maurer
Copy link
Contributor

d-maurer commented Sep 30, 2020 via email

@jamadden
Copy link
Member

That's correct and expected. Only __name__ and __module__ are considered in comparisons.

One reason for interfaces to be comparable is so that they can be used as BTree keys — identity comparisons don't work in that case — in fact, they must be totally orderable, independent of process or current definition (because definitions change; methods and attributes are added or removed, but the comparison must remain valid).

mister-roboto pushed a commit to plone/buildout.coredev that referenced this issue Sep 30, 2020
Branch: refs/heads/master
Date: 2020-09-30T16:45:33+02:00
Author: Jens W. Klein (jensens) <jk@kleinundpartner.at>
Commit: plone/plone.dexterity@03da1ef

fix to use with fixed zope.interface

see zopefoundation/zope.interface#216 (comment)

Files changed:
A news/135.bugfix
M plone/dexterity/tests/test_fti.py
Repository: plone.dexterity

Branch: refs/heads/master
Date: 2020-09-30T21:05:28+02:00
Author: Jens W. Klein (jensens) <jk@kleinundpartner.at>
Commit: plone/plone.dexterity@b9385b7

Merge pull request #135 from plone/fix-tests-ziface

fix to use with fixed zope.interface

Files changed:
A news/135.bugfix
M plone/dexterity/tests/test_fti.py
@d-maurer
Copy link
Contributor

d-maurer commented Sep 30, 2020 via email

@jamadden
Copy link
Member

I think a big part of the reasoning goes to persistence. The only sane way to pickle an interface is as a global object — just like that's the only sane way to pickle a class. That's the only way that an instance of a class that @implementer(ISomething) and some other arbitrary instance that alsoProvides(instance, ISomething) can mean the same thing when unpickled. Once you understand that, then it's only logical that the name and module are the things considered in comparisons. After all, that's all that actually matters to persistence.

This comparison behaviour has been the same all the way back to at least 3.4.0.

That early version didn't correctly define __eq__ and __hash__, so it was inconsistent. Interfaces weren't totally ordered, and could compare differently depending on which operators you used (or worse, which internal operators a data structure used). It wasn't long before someone realised this, and 10 years ago the present definition of equality was added.

Because this has been the behaviour for more than ten years, I suspect that this is all handled correctly by packages doing dynamic interfaces (IIRC, the "menu" support works be creating dynamic interfaces and making them global objects so they can be unpickled; it understands that names matter). It's only rare corner cases like the one in plone.dexterity — which only showed up because of a quirk of the garbage collector on a particular machine — where it becomes a problem.

@d-maurer
Copy link
Contributor

d-maurer commented Sep 30, 2020 via email

@jamadden
Copy link
Member

jamadden commented Sep 30, 2020

I believe that the correct perspective is that conceptually, an interface is a global object. By definition, there is only one global object at a time that can be referred to by __module__.__name__. Methods are irrelevant because there is only one definition. They may change from one invocation of the program to the next, but again, that doesn't matter. The stable identifier is the global identifier. (Anything else gets into something like "content based addressing" or "content-addressed storage": Very cool, but also very different than Python.)

Trying to define two global objects with the same module and name and use them at the same time as if they were distinct is conceptually quite wrong in this framework. Just as you can't define two class A objects and fully expect for them to work correctly at the same time (e.g., the two-argument super(A, self) is going to be broken; un/pickling will produce the most recently defined, not both).

At any rate, I feel like I've veered very far off what this issue was about and so I should stop annoying those getting notifications for this issue with unrelated chatter.

@d-maurer
Copy link
Contributor

d-maurer commented Oct 1, 2020 via email

netbsd-srcmastr referenced this issue in NetBSD/pkgsrc Oct 18, 2020
5.1.2 (2020-10-01)
==================

- Make sure to call each invariant only once when validating invariants.
  Previously, invariants could be called multiple times because when an
  invariant is defined in an interface, it's found by in all interfaces
  inheriting from that interface.  See `pull request 215
  <https://github.com/zopefoundation/zope.interface/pull/215/>`_.

5.1.1 (2020-09-30)
==================

- Fix the method definitions of ``IAdapterRegistry.subscribe``,
  ``subscriptions`` and ``subscribers``. Previously, they all were
  defined to accept a ``name`` keyword argument, but subscribers have
  no names and the implementation of that interface did not accept
  that argument. See `issue 208
  <https://github.com/zopefoundation/zope.interface/issues/208>`_.

- Fix a potential reference leak in the C optimizations. Previously,
  applications that dynamically created unique ``Specification``
  objects (e.g., used ``@implementer`` on dynamic classes) could
  notice a growth of small objects over time leading to increased
  garbage collection times. See `issue 216
  <https://github.com/zopefoundation/zope.interface/issues/216>`_.

  .. caution::

     This leak could prevent interfaces used as the bases of
     other interfaces from being garbage collected. Those interfaces
     will now be collected.

     One way in which this would manifest was that ``weakref.ref``
     objects (and things built upon them, like
     ``Weak[Key|Value]Dictionary``) would continue to have access to
     the original object even if there were no other visible
     references to Python and the original object *should* have been
     collected. This could be especially problematic for the
     ``WeakKeyDictionary`` when combined with dynamic or local
     (created in the scope of a function) interfaces, since interfaces
     are hashed based just on their name and module name. See the
     linked issue for an example of a resulting ``KeyError``.

     Note that such potential errors are not new, they are just once
     again a possibility.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
6 participants