-
-
Notifications
You must be signed in to change notification settings - Fork 721
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix regression in test_weakref_cache #6033
Conversation
distributed/tests/test_spill.py
Outdated
# Surprisingly, even on CPython this is needed to ensure that the object is garbage | ||
# collected, even if there are no obvious circular references going on | ||
gc.collect() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This does concern me a bit. I don't think the GC should be required here.
I can reproduce this on py3.10. Maybe there is a CPython bug? Below is what objgraph tells me
import weakref
ref = weakref.ref(x)
del x
import objgraph
obj = ref()
if obj:
objgraph.show_backrefs(obj)
cc @graingert in case something rings a bell
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My mistake, at this point, x is supposed to be in memory since the spill was not, yet, triggered. This only happens a few lines below.
I cannot reproduce this any longer
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I ran into similar issues in dask-ms a while back and found I could never rely on the del x
. If it's an issue of "encouraging" the decref/garbage collector, you might try allocating x
inside an inner function like so:
# Run this test twice:
# - x is smaller than target and is evicted by y;
# - x is individually larger than target and it never touches fast
def _inner():
x = cls(size)
buf["x"] = x
if size < 100:
buf["y"] = cls(60) # spill x
assert "x" in buf.slow
# Test that we update the weakref cache on setitem
assert (buf["x"] is x) == expect_cached
# Do not use id_x = id(x), as in CPython id's are C memory addresses and are reused
# by PyMalloc when you descope objects, so a brand new object might end up having
# the same id as a deleted one
id_x = x.id
del x
return id_x
id_x = _inner()
This test has been failing for the first time on 2022-03-17, i.e. before #6029 https://dask.org/distributed/test_report.html I suggest to no merge this since the gc.collect is not helping and may even confuse us more than it should |
the test passes for me locally with the gc disabled: distributed/distributed/tests/test_spill.py Lines 327 to 367 in 4da3b8c
|
shutting down the profile threads before running test_weakref_cache seems to fix it for me (locally): distributed/distributed/profile.py Lines 278 to 341 in a896a01
distributed/distributed/tests/test_spill.py Line 347 in a896a01
edit: Seems to have fixed it on CI too https://github.com/graingert/distributed/actions/runs/2081825815 |
commenting out the distributed/distributed/profile.py Lines 297 to 302 in 5f2ffab
https://github.com/graingert/distributed/runs/5806180032?check_suite_focus=true#step:11:5355 |
I've commented the various tests that call gc.collect() to reflect your latest findings |
I don't think this is correct - I think a sleep will be just as good as a |
Turning off profile threads during testing seems reasonable to me.
…On Mon, Apr 4, 2022 at 6:06 AM Thomas Grainger ***@***.***> wrote:
I've commented the various tests that call gc.collect() to reflect your
latest findings
I don't think this is correct - I think a sleep will be just as good as a
gc.collect()
—
Reply to this email directly, view it on GitHub
<#6033 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AACKZTBTI5RQI6RGFBBARMLVDLEL5ANCNFSM5SE6KFKQ>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
Having a whole extra thread, running at all times during production and which tampers with references, which is just not there during testing - this seems quite dangerous to me. |
These profile threads aren't actually supposed to be running, they're sticking around because some other test is leaking threads |
I was not talking about test_weakref_cache in particular, but instead in general about all those tests that look like del future
gc.collect() I've reworked the PR - let's see if it's stable or not |
the I added: --- a/distributed/utils.py
+++ b/distributed/utils.py
@@ -459,7 +459,10 @@ class LoopRunner:
finally:
done_evt.set()
- thread = threading.Thread(target=run_loop, name="IO loop")
+ thread = threading.Thread(
+ target=run_loop,
+ name=f"IO loop for {os.environ.get('PYTEST_CURRENT_TEST')}",
+ )
thread.daemon = True
thread.start()
which lets me see where the loops are staying alive from:
|
@graingert that list of threads is an extremely useful thing to have permanently - mind opening a separate PR to get it in, as well as proper cleanup for LocalCluster? I think this PR should continue on the direction of resiliency to the profiler regardless of your cleanup. Also, even after the cleanup, all tests that have a cluster running will still have the issue. |
while _watch_running: | ||
sleep(0.0001) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not entirely sure how I feel about this. It is not thread safe so while this function is returning another profile thread might again sample the frames. At the same time, if a profile thread is actually running, this thing will burn CPU hard.
For this specific use case, that's likely not a problem but I'm wondering if a sleep(0.1)
would not have an equal power without us putting this kind of instrumentation in.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, this is OK only as long as its usage is limited to unit tests
I don't like this watcher but I don't want to block this PR any longer. I opened #6075 to discuss the future of the profiler in our test suite |
@graingert is that a different type of failure, or what you would have expected this to fix? I haven't looked at this PR at all, so I'm not sure. |
The new failure has nothing to do with test_weakref_cache; the log is showing that the problem is that profile_lock has remained acquired by some other test before it. |
Fix regression introduced in #6029:
https://github.com/dask/distributed/runs/5768355142?check_suite_focus=true