
Add a lock to distributed.profile for better concurrency control #6421

Merged: 6 commits into dask:main on May 25, 2022

Conversation

@hendrikmakait (Member) commented May 23, 2022:

Adds a Lock to distributed.profile to enable better concurrency control. In particular, it allows running garbage collection without a profiling thread holding references to objects, which is necessary for #6250.

  • Tests added / passed
  • Passes pre-commit run --all-files
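
To make the intent concrete, here is a minimal, self-contained sketch of the pattern this lock enables (hypothetical names; not the actual distributed.profile code): the sampling thread only touches frame objects while it holds the lock, so external code that acquires the lock knows the profiler is not keeping any frames alive while gc.collect() runs.

    import gc
    import sys
    import threading
    import time

    lock = threading.Lock()  # stand-in for the lock added to distributed.profile

    def sample(thread_id, stop):
        # Hypothetical profiling loop: frames are only referenced under the lock.
        while not stop.is_set():
            with lock:
                frame = sys._current_frames().get(thread_id)
                # ... walk/inspect `frame` here ...
                del frame  # drop the reference before releasing the lock
            time.sleep(0.02)

    def collect_without_profiler_references():
        # External code: while this lock is held, the sampler holds no frames,
        # so gc.collect() can reclaim objects those frames would otherwise pin.
        with lock:
            gc.collect()

    stop = threading.Event()
    t = threading.Thread(
        target=sample, args=(threading.main_thread().ident, stop), daemon=True
    )
    t.start()
    collect_without_profiler_references()
    stop.set()
    t.join()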

@@ -314,16 +317,13 @@ def traverse(state, start, stop, height):
}


_watch_running: set[int] = set()
@hendrikmakait (Member, Author):

Removed in favor of the lock to avoid inconsistencies.

distributed/profile.py (outdated, resolved):
    if time() > last + cycle:
        recent = create()
        with lock:
Member:

Should we move this down to just above the try: block? My sense is that we're really trying to lock around the creation and deletion of the frame object.

Member:

I believe that for what we're looking for, we need to hold the lock for the entire context in which the frame exists, i.e., if external code holds the lock, there is no frame.

In the context of GC, the frame is the offender that holds references to some objects we want to ensure are collected.

@hendrikmakait (Member, Author) commented May 23, 2022:

I think it's a trade-off between more accurate timestamps and less locking.

My thought was that in case we have some thread hogging the lock for a bit (e.g., a particularly expensive run of garbage collection), this would ensure that timestamps are only created once we have actually acquired the lock. However, in practice, this should be an edge case.

At the same time, the reduction in lock time that we gain from moving the lock should be marginal, so I think I'd prefer more accurate timestamps. What's your take on this?
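
As a loose illustration of this trade-off (simplified, hypothetical names; not the actual distributed/profile.py code), the question is whether the timestamp recorded alongside each profile sample is taken before or after the lock is acquired:

    import threading
    from time import time

    lock = threading.Lock()
    log = []

    def append_sample_lock_first(sample):
        # "More accurate timestamps": the time is read only once the lock has
        # actually been acquired, so a thread hogging the lock (e.g. a long
        # gc.collect()) cannot make the recorded time stale.
        with lock:
            log.append((time(), sample))

    def append_sample_timestamp_first(sample):
        # "Less locking": the timestamp is taken outside the critical section,
        # so the lock is held more briefly, but the recorded time may predate
        # the append by however long the lock was contended.
        ts = time()
        with lock:
            log.append((ts, sample))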

@graingert (Member) commented May 24, 2022:

> My sense is that we're really trying to lock around the creation and deletion of the frame object.

The issue I found in #6033 (comment) was not that the profiler was keeping hold of the frame object; it appeared as if the GIL was released while the thread owned the frame from the main thread.

            try:
                # thread_id is for some <Thread(IO loop, started daemon 140502607329024)>
                frame = sys._current_frames()[thread_id]
                # ~~~~~~GIL released here~~~~^ ??
            except KeyError:
                return

Member:

  1           0 RESUME                   0

  2           2 NOP

  4           4 LOAD_GLOBAL              0 (sys)
             16 LOAD_METHOD              1 (_current_frames)
             38 PRECALL                  0
             42 CALL                     0
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ profile thread owns all frames
             52 LOAD_FAST                0 (thread_id)
             54 BINARY_SUBSCR
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
             64 STORE_FAST               1 (frame)
             66 LOAD_CONST               0 (None)
             68 RETURN_VALUE
        >>   70 PUSH_EXC_INFO

  6          72 LOAD_GLOBAL              4 (KeyError)
             84 CHECK_EXC_MATCH
             86 POP_JUMP_FORWARD_IF_FALSE     4 (to 96)
             88 POP_TOP

  7          90 POP_EXCEPT
             92 LOAD_CONST               0 (None)
             94 RETURN_VALUE

  6     >>   96 RERAISE                  0
        >>   98 COPY                     3
            100 POP_EXCEPT
            102 RERAISE                  1
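
For reference, a disassembly like the one above can be reproduced with the standard library's dis module; the function name below is a hypothetical reconstruction of the snippet, and the opcodes shown (RESUME, PRECALL) suggest CPython 3.11:

    import dis
    import sys

    def grab_frame(thread_id):
        try:
            # sys._current_frames() builds a dict holding frames for *all*
            # threads; only the subsequent subscript narrows it to one of them.
            frame = sys._current_frames()[thread_id]
        except KeyError:
            return

    dis.dis(grab_frame)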

@hendrikmakait (Member, Author):

See #6364 (comment) for the motivation behind this. We essentially want a way to ensure that the profiler does not hold references to frames while gc.collect() runs. I wasn't aware of #6033 before, so I'm not sure whether this is related to the particular issue you found there.

@github-actions bot (Contributor) commented May 23, 2022:

Unit Test Results

      15 files  +15           15 suites  +15        6h 15m 19s ⏱️ +6h 15m 19s
   2 811 tests  +2 811     2 730 ✔️ +2 730     79 💤 +79     2 ❌ +2
  20 840 runs   +20 840   19 908 ✔️ +19 908   930 💤 +930   2 ❌ +2

For more details on these failures, see this check.

Results for commit 1974e26. ± Comparison against base commit d84485b.

♻️ This comment has been updated with latest results.

@hendrikmakait (Member, Author):

gpuCI failure is unrelated, see #6429

@mrocklin (Member):

This seems like an improvement to me. Merging.

@mrocklin merged commit dea9ef2 into dask:main on May 25, 2022
@mrocklin (Member) commented Oct 11, 2022 via email
