👌 IMPROVE: Garbage collect on process termination #4767
Conversation
EDIT: if you change […]

To debug, after […]:

```python
from pympler import summary, muppy, refbrowser
import pprint

# Snapshot every object the interpreter is currently tracking
all_objects = muppy.get_objects()
sum1 = summary.summarize(all_objects)
summary.print_(sum1)

# Count suspiciously large dicts still alive
dicts = [o for o in all_objects if hasattr(o, "__class__") and isinstance(o, dict)]
print("Large dicts:", len([d for d in dicts if len(d) > 1000]))

# Find any SleepCalculation instances that have not been freed
from aiida_sleep.sleep_job import SleepCalculation

calcs = [o for o in all_objects if hasattr(o, "__class__") and isinstance(o, SleepCalculation)]
print("SleepCalculations:", len(calcs))
print(calcs[0])
print()

# Walk the reference tree to see what is keeping the first instance alive
cb = refbrowser.ConsoleBrowser(calcs[0], maxdepth=14)
tree = cb.get_tree()
cb.print_tree(tree)

print("\nSleepCalculation attributes:")
pprint.pprint(calcs[0].__dict__)
print("\nRmqSubscriber attributes:")
pprint.pprint(calcs[0]._communicator._communicator._communicator._message_subscriber.__dict__)
print("\nRmqTaskSubscriber attributes:")
pprint.pprint(calcs[0]._communicator._communicator._communicator._default_task_queue._subscriber.__dict__)
print("\nRmqTaskPublisher attributes:")
pprint.pprint(calcs[0]._communicator._communicator._communicator._default_task_queue._publisher.__dict__)
```

Then run […] You can see that the […] So the […]
|
Codecov Report
```diff
@@            Coverage Diff             @@
##           develop    #4767      +/-   ##
===========================================
- Coverage    79.37%   79.36%    -0.01%
===========================================
  Files          485      485
  Lines        36154    36160        +6
===========================================
+ Hits         28694    28695        +1
- Misses        7460     7465        +5
```
Flags with carried forward coverage won't be shown. Click here to find out more.
Continue to review full report at Codecov.
|
with `asyncio.sleep(1)` then: […]
|
@ltalirz what do you think? |
Thanks @chrisjsewell for figuring this out! Do we know which tasks are the culprit here, i.e. which tasks are preventing automatic garbage collection without the […]? Of course it would be best to figure this out.. on the other hand, we could even schedule a periodic garbage collection call using […] |
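(As a side note, a minimal sketch of what such a periodic collection could look like on an asyncio event loop — the `periodic_gc` helper and its interval are hypothetical, not something this PR adds:)

```python
import asyncio
import gc

async def periodic_gc(interval: float = 60.0) -> None:
    """Hypothetical helper: run a full collection every `interval` seconds."""
    while True:
        await asyncio.sleep(interval)
        n = gc.collect()
        print(f"gc: collected {n} unreachable objects")

# A daemon runner could schedule this once at startup, e.g.:
# asyncio.get_event_loop().create_task(periodic_gc())
```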
Hmm, I think that's what Python does anyway. Of course we then don't have control over when it happens. I haven't followed the context here - but at first glance, it seems that if […] Of course the question then becomes how we test that this remains the case - but maybe the sleeping and collecting should go into the test code, not the production code. |
The problem here is likely with cyclic references, which aren't garbage-collected automatically. |
They should be.. Python memory management has two components: reference counting, which frees an object as soon as the last reference to it goes away, and the cyclic garbage collector, which runs automatically[1] to free groups of objects that only reference each other.

[1] Unless someone called `gc.disable()` |
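(A self-contained illustration of the difference between the two mechanisms — the `Node` class is hypothetical, purely for demonstration:)

```python
import gc

class Node:
    """Hypothetical class, just to build a reference cycle."""
    def __init__(self):
        self.partner = None

a, b = Node(), Node()
a.partner, b.partner = b, a  # a <-> b: a cycle refcounting alone can't free

del a, b                     # refcounts never reach zero, so the pair lingers
print(gc.collect())          # the cyclic collector frees them (prints > 0)
```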
Sorry, you are right. I even once looked into the default parameters for this... Of course, it can still be useful to run garbage collection after a process is completed, to avoid the corresponding objects staying in memory until the next process starts (and, ideally, we would just get rid of the cyclic references if possible). |
Nice, just looked at these parameters right now 😄
Yeah, if we know when the cyclic references should turn stale it makes sense to run manually. But that wouldn't fix any memory leaks, just improve usage in general. I think the real question here is what is keeping the objects in memory that makes the explicit `gc.collect()` necessary. |
Oh, we should also check if any of the objects in cycles have `__del__` methods. Found that via another answer on the page you linked. EDIT: Hmm, since PEP 442 I think this should be less of a problem. |
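(A stdlib-only sketch for checking this, assuming nothing beyond the `gc` module:)

```python
import gc

# Ask the collector to keep anything it cannot free in gc.garbage
gc.set_debug(gc.DEBUG_UNCOLLECTABLE)
gc.collect()

for obj in gc.garbage:
    print(type(obj), repr(obj)[:80])

# Since PEP 442 (Python 3.4), cycles containing objects with __del__ are
# collectable, so on modern interpreters gc.garbage should normally be empty.
gc.set_debug(0)
```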
@greschd are you sure about this? Nothing in https://docs.python.org/3/library/gc.html#gc.set_threshold suggests that it runs on a timer, just that it will trigger more easily with a different threshold. |
Good point, it's not a timer per se, it's a function of how many objects are allocated / deallocated:

> In order to decide when to run, the collector keeps track of the number of object allocations and deallocations since the last collection. When the number of allocations minus the number of deallocations exceeds *threshold0*, collection starts.

from https://docs.python.org/3/library/gc.html#gc.set_threshold. So if all we do is sleep, I guess that wouldn't cause a GC run - but "normal" operation should. The "allocations - deallocations" count is a measure of how the total number of objects grows -- you would expect that to keep growing if there's a memory leak due to cyclic references, right? |
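(Those counters can be inspected directly; a small demonstration using only stdlib `gc` calls:)

```python
import gc

print(gc.get_threshold())  # defaults to (700, 10, 10)
print(gc.get_count())      # per-generation allocations-minus-deallocations

# Allocating enough tracked objects pushes the gen-0 count past threshold0,
# which triggers an automatic generation-0 collection:
junk = [[] for _ in range(1000)]
print(gc.get_count())      # gen-0 counter was reset by the automatic run
```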
But yeah, I'm sure we can come up with a scenario where the number of objects is large, then GC runs (but they're still alive), then the number of objects drops and the GC doesn't run again for a long time. Still, I think the main thing to figure out here is who keeps the process objects alive, which makes the explicit `gc.collect()` necessary. Ideally, we could get rid of the cyclic references here, because ref-counting is much better behaved. |
If everything has finished running then I wouldn't expect the GC to ever run again, because there would be nothing to trigger it.
see #4767 (comment), it's because the broadcast subscriber has not yet been removed
See aiidateam/plumpy#205 (which is the last cyclic reference not addressed from #4603), but that's quite an "aggressive" change, which I'm worried could have side-effects, so I don't want to rush it through. That is also only for the actual Process; there may well be other things in memory. |
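(For reference, the usual pattern for breaking such a cycle is to hold one direction of the reference weakly — a generic sketch with a hypothetical `Process`/callback pair, not the actual plumpy change:)

```python
import weakref

class Process:
    """Hypothetical stand-in for a process holding a broadcast callback."""

    def __init__(self):
        weak_self = weakref.ref(self)  # hold `self` weakly inside the callback

        def on_broadcast(msg):
            proc = weak_self()
            if proc is not None:
                proc.handle(msg)

        self._callback = on_broadcast  # Process -> callback, but no cycle back

    def handle(self, msg):
        print("got", msg)

p = Process()
p._callback("hello")  # works while the process is alive
ref = weakref.ref(p)
del p
print(ref())          # None: freed by refcounting alone, no gc.collect() needed
```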
Right, but that is a "testing" scenario. In a production daemon it should eventually run, no? So to test for the memory leak we can just run […] We can also put in a […]

On a fresh […]:

```
In [1]: import gc

In [2]: %timeit gc.collect()
100 loops, best of 3: 8.76 ms per loop
```

On a fresh […]:

```
In [1]: import gc

In [2]: %timeit gc.collect()
47.7 ms ± 980 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
```
|
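(For comparison, `gc.collect()` accepts a generation argument, so a partial sweep of only the youngest generation is much cheaper — a sketch; absolute timings will vary with interpreter state:)

```python
import gc
import timeit

# gc.collect() defaults to a full (generation 2) sweep over every tracked
# object; gc.collect(0) only sweeps the youngest generation.
full = timeit.timeit(lambda: gc.collect(), number=10) / 10
gen0 = timeit.timeit(lambda: gc.collect(0), number=10) / 10
print(f"full: {full * 1e3:.2f} ms per call, gen-0: {gen0 * 1e3:.2f} ms per call")
```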
not until you submit more processes
yeh fair, well it's triggered once a process completes. To note, I'm certainly not suggesting that this is the complete solution, but I think it could complement other efforts to try to reduce peaks in memory usage. |
Yeah, I think that's sensible. Looking at https://github.com/aiidateam/plumpy/blob/develop/plumpy/process_comms.py, the […]
👍 |
see my lovely new diagram 😉 #4766; essentially all processes on daemon runners are re-created and run with […] |
Nice 👍 So when is […]? |
cheers! |
Partially addresses #4603
After completion of:

```
aiida-sleep calc -n 1 -t 1 -p 500000 -o 500000 --submit
```

(on https://github.com/chrisjsewell/aiida-integration-tests):

without: […]

with: […]

so it definitely makes a difference, but see below for more debugging
EDIT: ooo actually, if you change `asyncio.sleep(0)` to `asyncio.sleep(1)`:

[…]

and the process is gone 🎉
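(A guess at why the longer sleep matters, as a self-contained sketch rather than the actual aiida-core code: `asyncio.sleep(0)` yields to the event loop exactly once, while `asyncio.sleep(1)` gives scheduled callbacks — e.g. a subscriber tear-down — time to run before the collection:)

```python
import asyncio
import gc

async def main():
    # Pretend some tear-down (e.g. removing a broadcast subscriber) has been
    # scheduled on the loop but has not run yet:
    loop = asyncio.get_running_loop()
    loop.call_later(0.5, lambda: print("subscriber removed"))

    await asyncio.sleep(0)  # yields exactly once: the callback has NOT fired
    gc.collect()            # a collection here can still see the process alive

    await asyncio.sleep(1)  # gives the loop real time: the callback fires
    gc.collect()            # now the last reference is gone before collecting

asyncio.run(main())
```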