
Profiling of slow coroutines #7446

Merged · 6 commits · May 31, 2023

Conversation

kozlovsky
Contributor

@kozlovsky commented May 30, 2023

This PR makes a number of changes to the slow coroutine detection mechanism:

  1. It adds a duration to each line of the main stack trace, so it becomes possible to see how far into each function call we already are, e.g. `function started 1.268 seconds ago`.
  2. It limits the number of displayed stack frames to only the relevant function calls, so the last line of the traceback shows the actual function causing the slowness (and not some arbitrary nested function that happens to be executing at the moment of the periodic snapshot).
  3. It adds a `@profile` decorator that logs statistics for the function, allowing a developer to find the reason for its slowness, and appends a tip to the traceback suggesting the decorator and mentioning the function it should be applied to.
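The decorator described in point 3 could look roughly like the following sketch. This is not Tribler's actual implementation; the `top` parameter and the internals are illustrative assumptions, built on the standard `cProfile`/`pstats` modules:

```python
import cProfile
import functools
import io
import logging
import pstats

logger = logging.getLogger(__name__)

def profile(top=10):
    """Sketch of a profiling decorator: runs the wrapped function under
    cProfile and logs its slowest internal calls (names are illustrative)."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            profiler = cProfile.Profile()
            profiler.enable()
            try:
                return func(*args, **kwargs)
            finally:
                profiler.disable()
                stream = io.StringIO()
                stats = pstats.Stats(profiler, stream=stream)
                # slowest internal calls first, limited to `top` rows
                stats.sort_stats('cumulative').print_stats(top)
                logger.info("Profile of %s:\n%s", func.__name__, stream.getvalue())
        return wrapper
    return decorator
```

Applied to a function such as `take_step`, as the traceback tip suggests, a decorator along these lines would log the slowest internal calls each time the function runs.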

This is how the current traceback looks after the PR changes:

[tribler-core PID:23024] 2023-05-30 17:16:41,890 - ERROR <watching_thread:91> tribler.core.utilities.slow_coro_detection._report_freeze(): Slow coroutine is occupying the loop for 0.618 seconds already: <Task pending name='GigaChannelCommunity:take step' coro=<interval_runner() running at C:\dev\py-ipv8\ipv8\taskmanager.py:18> cb=[TaskManager.register_task.<locals>.done_cb() at C:\dev\py-ipv8\ipv8\taskmanager.py:128]>
Traceback (most recent call last):
  File "C:\dev\tribler\src\tribler\core\start_core.py", line 205, in run_core (function started 27.965 seconds ago)
    exit_code = run_tribler_core_session(api_port, api_key, state_dir, gui_test_mode=parsed_args.gui_test_mode)
  File "C:\dev\tribler\src\tribler\core\start_core.py", line 170, in run_tribler_core_session (function started 27.924 seconds ago)
    exit_code = loop.run_until_complete(core_session(config, components=list(components_gen(config))))
  File "C:\Python39\lib\asyncio\base_events.py", line 629, in run_until_complete (function started 27.912 seconds ago)
    self.run_forever()
  File "C:\Python39\lib\asyncio\base_events.py", line 596, in run_forever (function started 27.912 seconds ago)
    self._run_once()
  File "C:\Python39\lib\asyncio\base_events.py", line 1890, in _run_once (function started 0.618 seconds ago)
    handle._run()
  File "C:\dev\tribler\src\tribler\core\utilities\slow_coro_detection\patch.py", line 37, in patched_handle_run (function started 0.618 seconds ago)
    _original_handle_run(self)
  File "C:\Python39\lib\asyncio\events.py", line 80, in _run (function started 0.618 seconds ago)
    self._context.run(self._callback, *self._args)
  File "C:\dev\py-ipv8\ipv8\taskmanager.py", line 18, in interval_runner (function started 0.618 seconds ago)
    await task(*args)
  File "C:\dev\py-ipv8\ipv8\util.py", line 35, in call_async (function started 0.618 seconds ago)
    return func(*args, **kwargs)
  File "C:\dev\tribler\src\tribler\core\components\ipv8\discovery_booster.py", line 84, in take_step (function started 0.618 seconds ago)
    self.walker.take_step()

Tip: by applying the `@profile()` decorator to the `take_step` function, you can obtain statistics for its internal calls and see the reason for slowness

You can see the duration in each line, a much shorter stack, and a tip at the end.
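A minimal sketch of how per-frame durations like `function started 1.268 seconds ago` can be produced: a profile hook records when each frame is entered, and a watchdog thread (which would obtain the main thread's frame via `sys._current_frames()`) formats the stack with the elapsed times. All names here are illustrative assumptions, not Tribler's actual implementation:

```python
import sys
import time

# Sketch: map id(frame) -> timestamp of when the frame was entered.
# Simplification: frame ids can be reused after a frame is freed.
_frame_start_times = {}

def _tracer(frame, event, arg):
    if event == 'call':
        _frame_start_times[id(frame)] = time.time()
    elif event == 'return':
        _frame_start_times.pop(id(frame), None)

def format_stack_with_durations(frame):
    # A watchdog thread would obtain `frame` for the main thread via
    # sys._current_frames()[main_thread_id].
    now = time.time()
    stack = []
    while frame is not None:
        stack.append(frame)
        frame = frame.f_back
    lines = ["Traceback (most recent call last):"]
    for f in reversed(stack):  # outermost first, like a normal traceback
        started = _frame_start_times.get(id(f))
        suffix = f" (function started {now - started:.3f} seconds ago)" if started else ""
        lines.append(f'  File "{f.f_code.co_filename}", line {f.f_lineno}, '
                     f'in {f.f_code.co_name}{suffix}')
    return "\n".join(lines)

sys.setprofile(_tracer)  # install the hook on the current thread
```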

Additionally, if function execution continues for too long, the periodic logging of the current freeze now includes a limited stack (2 frames only):
[tribler-core PID:23024] 2023-05-30 17:16:42,540 - ERROR <watching_thread:96> tribler.core.utilities.slow_coro_detection._report_freeze(): Still executing <Task pending name='GigaChannelCommunity:take step' coro=<interval_runner() running at C:\dev\py-ipv8\ipv8\taskmanager.py:18> cb=[TaskManager.register_task.<locals>.done_cb() at C:\dev\py-ipv8\ipv8\taskmanager.py:128]>
Traceback (most recent call last):
  File "C:\dev\py-ipv8\ipv8\util.py", line 35, in call_async (function started 1.268 seconds ago)
    return func(*args, **kwargs)
  File "C:\dev\tribler\src\tribler\core\components\ipv8\discovery_booster.py", line 84, in take_step (function started 1.268 seconds ago)
    self.walker.take_step()

[tribler-core PID:23024] 2023-05-30 17:16:43,070 - ERROR <watching_thread:96> tribler.core.utilities.slow_coro_detection._report_freeze(): Still executing <Task pending name='GigaChannelCommunity:take step' coro=<interval_runner() running at C:\dev\py-ipv8\ipv8\taskmanager.py:18> cb=[TaskManager.register_task.<locals>.done_cb() at C:\dev\py-ipv8\ipv8\taskmanager.py:128]>
Traceback (most recent call last):
  File "C:\dev\py-ipv8\ipv8\util.py", line 35, in call_async (function started 1.798 seconds ago)
    return func(*args, **kwargs)
  File "C:\dev\tribler\src\tribler\core\components\ipv8\discovery_booster.py", line 84, in take_step (function started 1.798 seconds ago)
    self.walker.take_step()
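
Truncating the periodic report to the innermost frames, as in the logs above, can be sketched with the standard `traceback` module (the function name is an assumption):

```python
import sys
import traceback

def format_limited_stack(frame, limit=2):
    # Sketch: keep only the `limit` innermost frames, so the periodic
    # freeze report shows just the code that is actually stuck.
    summary = traceback.extract_stack(frame)  # ordered outermost -> innermost
    lines = traceback.format_list(summary[-limit:])
    return "Traceback (most recent call last):\n" + "".join(lines).rstrip()
```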

Internally, both the main thread stack tracing and the function profiler use the same mechanism for tracking execution, so they are written to be compatible. When the profiler decorator starts profiling a specific function, it temporarily turns off the main thread stack tracing and turns it back on after the profiling is done.
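Since CPython allows only one profile hook per thread (`sys.setprofile`), the hand-off described above could be sketched as follows; the class and method names are illustrative assumptions, not Tribler's actual API:

```python
import cProfile
import sys

class TracingSwitch:
    # Sketch: a single profile hook is allowed per thread, so profiling
    # a call must suspend the stack-tracing hook for its duration.
    def __init__(self, tracer):
        self.tracer = tracer

    def start_tracing(self):
        sys.setprofile(self.tracer)

    def stop_tracing(self):
        sys.setprofile(None)

    def profile_call(self, func, *args, **kwargs):
        self.stop_tracing()  # suspend stack tracing...
        profiler = cProfile.Profile()
        try:
            return profiler.runcall(func, *args, **kwargs)
        finally:
            self.start_tracing()  # ...and resume it once profiling is done
```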

The statistics generated by the profiler go through important cosmetic updates before they are printed to the logs. First, some functions in Tribler are auto-generated, and their filenames actually contain the entire source code of the function, which ruins the profiler output. The cleanup fixes the names of such functions so they can be displayed in a table alongside other functions. Second, internal calls that are very fast are removed from the resulting statistics, so only functions that affect the execution time are shown; the function that causes the most slowdown is shown first.
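Filtering out very fast internal calls and ordering by the largest slowdown can be sketched with `pstats`. This reads the long-stable `Stats.stats` dict; the threshold and function name are assumptions:

```python
import cProfile
import pstats
import time

def significant_entries(stats, min_seconds=0.01):
    # Sketch: drop internal calls faster than `min_seconds` and order the
    # rest by cumulative time, slowest first. Stats.stats maps
    # (filename, lineno, funcname) ->
    # (call_count, primitive_calls, total_time, cumulative_time, callers).
    rows = []
    for (filename, lineno, funcname), (cc, nc, tt, ct, callers) in stats.stats.items():
        if ct >= min_seconds:
            rows.append((ct, funcname, filename, lineno))
    return sorted(rows, reverse=True)

# demo: profile a call that sleeps, then keep only the significant entries
_profiler = cProfile.Profile()
_profiler.runcall(time.sleep, 0.05)
demo_rows = significant_entries(pstats.Stats(_profiler), min_seconds=0.01)
```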

@kozlovsky requested review from a team and xoriole and removed request for a team May 30, 2023 14:45
@kozlovsky requested a review from drew2a May 30, 2023 16:26
@kozlovsky changed the title from "[WIP] Profiling of slow coroutines" to "Profiling of slow coroutines" May 30, 2023
@kozlovsky changed the base branch from main to release/7.13 May 30, 2023 16:49
@kozlovsky merged commit e9aea86 into Tribler:release/7.13 May 31, 2023
@kozlovsky deleted the slow_coroutine_profiling branch May 31, 2023 10:13