Fine performance metrics for execute, gather_dep, etc. #7586
Conversation
Unit Test Results. See the test report for an extended history of previous test failures; this is useful for diagnosing flaky tests. 26 files ±0, 26 suites ±0, 12h 31m 51s ⏱️ +15m 10s. For more details on these failures, see this check. Results for commit ad45e4a. ± Comparison against base commit 700f14a. This pull request removes 3 and adds 33 tests. Note that renamed tests count towards both.
♻️ This comment has been updated with latest results.
Force-pushed from 892cb11 to 0f74274 (compare)
So far I've only had a very high-level look. I have a couple of concerns:
- We're wrapping many things in context managers. I understand that this is context-aware, but I wonder if we cannot (sometimes) use decorators instead (see the sketch below). This would feel less invasive and would be easier to remove or modify later on.
- This is added in many places at the same time. It is not only instrumenting parts of the worker but even parts of our protocol. While I couldn't see any hot for loops, this all makes me worry a bit about overhead, both in terms of memory overhead when retaining the metrics and in terms of plain runtime overhead, e.g. when there are multiple callbacks registered.
- This feels a bit like we're introducing our own instrumentation/tracing framework, which makes me wonder if we shouldn't instead look into existing frameworks for this kind of thing. cc @hendrikmakait, you spent some time with OTel tracing. Do you think we could move from this `ContextMeter` to an OTel Span, or are there fundamental differences? I do not want to suggest making this switch immediately, but I would like us to be aware of this; the proposed callback-based API feels a bit special. (I actually think what we're doing here is a mixture between tracing and collecting metrics, so this might not map perfectly.)
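For illustration only, here is a minimal sketch of what a decorator-based alternative could look like. It assumes the `context_meter.meter` context manager used elsewhere in this PR; the `metered` helper itself and the import path are assumptions, not part of the changeset:

```python
import functools

from distributed.metrics import context_meter  # assumed import path


def metered(label: str):
    """Hypothetical decorator: meter a whole function, instead of wrapping
    its body in a `with context_meter.meter(label):` block and re-indenting it.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            with context_meter.meter(label):
                return func(*args, **kwargs)
        return wrapper
    return decorator


@metered("deserialize")
def merge_and_deserialize_sketch(header, frames, deserializers=None):
    ...  # same body as before, with no extra indentation level
```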
DigestMetric.match(stimulus_id="s5", name="gather-dep-failed-seconds"),
DigestMetric.match(stimulus_id="s6", name="execute-other-seconds"),
DigestMetric.match(name="compute-duration", stimulus_id="s6"),
From the POV of this test, I don't care that these digests are emitted here. They make the test more brittle and harder to understand.
Can we somehow keep the test as it was before?
I think I found a way to remove DigestMetric altogether 🥳
distributed/worker.py (outdated)

assert ts.annotations is not None
executor = ts.annotations.get("executor", "default")
with context_meter.add_callback(metrics_callback):
This wraps almost the entire method, so the indentation creates a very large diff. Can this be introduced in a less invasive way, e.g. using a decorator?
Done.
distributed/protocol/serialize.py (outdated)

merged_frames.append(merged)

return deserialize(header, merged_frames, deserializers=deserializers)
with context_meter.meter("deserialize"):
also an example where I would consider a decorator less invasive
Done.
I think the main difference is around the data we collect and the overhead generated by collecting or deriving the relevant data. From what I understand, we are exclusively interested in collecting performance metrics; at the moment, these seem to be the latency of some operations. With tracing, we'd collect more information, essentially the nested call traces of our operations. This would include information about their latency, but also all sorts of other attributes, and creating metrics from that data requires additional work. There is the SpanMetricsProcessor, which allows us to generate Rate, Errors, Duration metrics from our tracing data. We may also need some additional thought to tie those metrics to the relevant levels defined in this PR by the callbacks; once again, there seem to be some ways to do so. This PR seems to try to generate an API that combines that with tracing.

The runtime overhead I could measure in a small experiment was around […]

FWIW, I've been working on a small worker plugin that hooks into our […]
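For comparison, here is a rough sketch of what span-based instrumentation could look like with the stock OpenTelemetry Python API. The span names, attributes, and the `execute_traced` wrapper are made up for illustration and are not a proposal for distributed's actual code:

```python
from opentelemetry import trace

tracer = trace.get_tracer("distributed.worker")  # illustrative tracer name


def execute_traced(run_task, key: str, prefix: str):
    # One span per execute() call; its duration is recorded automatically,
    # and RED-style metrics could be derived from spans downstream
    # (e.g. via a spanmetrics processor in a collector).
    with tracer.start_as_current_span("worker.execute") as span:
        span.set_attribute("task.key", key)
        span.set_attribute("task.prefix", prefix)
        # Nested spans give the per-activity breakdown that this PR models
        # with context_meter.meter(...) and callbacks.
        with tracer.start_as_current_span("worker.execute.deserialize"):
            ...  # deserialize the run_spec
        with tracer.start_as_current_span("worker.execute.thread"):
            return run_task()  # run user code
```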
Force-pushed from b230744 to 6d1dd32 (compare)
The end-to-end runtime for a trivial execute() call is 640µs for sync tasks and 230µs for async ones.
This reverts commit f6d868ad00dac5b7e6b437e4934446b9b442ea24.
I feel the current state of this instrumentation is sufficiently well encapsulated that I am not concerned. I could easily see us replacing this `Meter` and `DelayedMetricsLedger` again with something else that is rather close to a `Span`. The actual in-code instrumentation is fairly minimal, which is what I wanted to see.

While I like the `Span` object of #7619, this PR is more advanced, and I don't think further investment in #7619 would significantly improve this instrumentation system.

I believe that, besides nomenclature, the most notable difference between the two PRs is where ownership of certain responsibilities is attached. This PR ties the instrumentation more closely to the `BaseWorker`, while the other PR tries harder to isolate it. Both approaches come with pros and cons. There are also a couple of other differences that I consider mostly syntactic sugar. For instance, I like `Span.flat`, which is a really nice API, but I doubt that we actually need it; if we do, we could likely offer something similar in this architecture.

I suggest moving forward with this PR and focusing our time on exposing these metrics as fast as possible.
distributed/worker_state_machine.py (outdated)

"start": ev.user_code_start,
"stop": ev.user_code_stop,
Note: this has a non-trivial and subtle side effect (a positive one, I think).
The `startstops` information is used to set `TaskGroup` and therefore `TaskPrefix` duration estimates/measurements, i.e. by no longer counting disk, etc., the measurements will more accurately reflect the tasks' actual compute time.
However, this in turn also impacts occupancy calculations and therefore even impacts scheduling.
I believe overall this makes scheduling decisions more robust, and I guess we will not even see this matter in practice. However, the connection is not necessarily apparent and I wanted to highlight it.
We never counted disk etc. In main this is already using the in-thread measure; I just changed the naming to be less ambiguous. There was a moment where I accidentally set this to the end-to-end runtime and a couple of tests blew up.
distributed/worker_memory.py (outdated)

# Work around bug with Tornado PeriodicCallback, which does not properly
# insulate contextvars
async def _() -> None:
    with context_meter.add_callback(metrics_callback):
        # Measure delta between the measures from the SpillBuffer and the total
        # end-to-end duration of _spill
        await self._spill(worker, memory)

await asyncio.create_task(_(), name="memory-monitor-spill")
Can you elaborate? The `create_task` thing looks a bit odd, and I would likely refactor this if I stumbled over it. Which part of this construction is required to work around the bug? Do you need a "clean" context in `_`, or what is this construction doing?
`create_task` automatically duplicates and insulates the current context; `PeriodicCallback` doesn't (official support for async callbacks there is very new, FWIW). Without this workaround, I saw metrics clearly emitted by `Worker.execute` show up here.
I've updated the comment to clarify.
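A self-contained illustration of the asyncio behaviour being relied on here (not distributed code): `asyncio.create_task` runs its coroutine in a copy of the current `contextvars` context, so a `ContextVar` set inside the task does not leak back into the caller, while directly awaiting the same coroutine shares the caller's context.

```python
import asyncio
import contextvars

var = contextvars.ContextVar("var", default="outer")


async def set_inner() -> None:
    # Stands in for context_meter.add_callback(...) registering a callback
    # in a context variable.
    var.set("inner")


async def main() -> None:
    await set_inner()          # runs in the caller's context...
    print(var.get())           # -> "inner": the change leaked out

    var.set("outer")
    await asyncio.create_task(set_inner())  # runs in a *copy* of the context
    print(var.get())           # -> "outer": the change stayed inside the task


asyncio.run(main())
```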
def _finalize_metrics(
    self,
    stim: StateMachineEvent,
    ledger: DelayedMetricsLedger,
) -> None:
    activity: tuple[str, ...]
    coarse_time: str | Literal[False]
Note: this is a big switch statement. I personally would likely implement this as a method on the various events; attaching ownership of this to the event moves the implementation slightly closer to the `TracedEvent` of #7619.
However, I don't have a strong preference and don't mind having this as a method here. I think there is a case for both; I just wanted to pass along this observation.
I know, I thought about it, but felt that putting it in methods would be
- too dispersive: you'd need to go through all of those classes to see the fine differences between their implementations;
- it would require a back pointer to the `WorkerStateMachine` in the parameters, which, while technically sound, is not a pattern we've used anywhere else;
- we would need to implement a dummy in `RetryBusyWorkerEvent`, just in virtue of it being the exit event of an async instruction - or alternatively we'd need to insert an extra level of parent classes, `class EventAtEndOfAsyncInstruction(StateMachineEvent)`, just for this.

(A rough sketch of the two shapes is below.)
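A hypothetical sketch of the two shapes being discussed; the simplified event classes and field names below are stand-ins for illustration, not the actual `worker_state_machine` definitions:

```python
from dataclasses import dataclass


@dataclass
class ExecuteSuccessEventStub:
    prefix: str


@dataclass
class GatherDepFailureEventStub:
    worker: str


# Shape used in this PR: one dispatch point in the state machine.
def finalize_activity(stim) -> tuple[str, ...]:
    match stim:
        case ExecuteSuccessEventStub(prefix=prefix):
            return ("execute", prefix)
        case GatherDepFailureEventStub(worker=worker):
            return ("gather-dep", "failed", worker)
        case _:
            return ("other",)


# Alternative shape (closer to #7619): each event owns its own logic,
# at the cost of spreading it across many classes.
@dataclass
class ExecuteSuccessEventAlt:
    prefix: str

    def finalize_activity(self) -> tuple[str, ...]:
        return ("execute", self.prefix)
```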
@fjetter all code review comments have been addressed; this is ready for final review and merge.
Very excited about seeing this in action!
`Worker.gather_dep` #7217

This PR introduces finely-grained performance measures in every call to `execute`, `gather_dep`, `get_data`, and `memory_monitor`.

In `Worker.state.stimulus_log`, every `GatherDep*Event` and `Execute*Event` now includes a `metrics` attribute, with per-event elements for:
- `GatherDep<Success,Busy,NetworkFailure,Failure>Event`
- `ExecuteSuccessEvent`, `ExecuteFailureEvent`, `RescheduleEvent`
All metrics are broken down by task prefix. The time metrics include:
- thread CPU time (`time.thread_time` - user metrics)
- thread non-CPU time (`time.time` - `time.thread_time` - user metrics)
- other time: the difference between the elapsed time as seen from the WorkerStateMachine (from the emission of the GatherDep or Execute Instruction to the final Event) and the sum of all other time metrics. It highlights, for example, a slow event loop or a large overhead in `loop.run_in_executor`.

The above metrics are aggregated to per-worker totals (or, in the case of execute, per-task-prefix totals) and posted to `Worker.digests`, `Worker.digests_total`, and `Worker.digests_max`.

In case of unsuccessful execution or transfer, time metrics are aggregated into a single, separate event, so that one can appreciate how much time is being wasted in failed attempts.
Equivalent digests are also produced for other resource-intensive activities:
- Memory Monitor
- `get_data`

Spill-specific activity is also recorded under `Worker.data.cumulative_metrics` and separately posted to Prometheus. Besides convenience, this allows separating (de)serialization and (de)compression caused by network transfers from those caused by spilling/unspilling.
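A hedged sketch of how one might peek at these aggregates on a live cluster; the exact key layout of `Worker.digests_total` is treated as an implementation detail here, and the workload is arbitrary:

```python
import dask.array as da
from distributed import Client

client = Client()  # assumes a cluster running workers with this instrumentation

# Run an arbitrary workload so that execute/gather_dep metrics get populated.
da.random.random((2000, 2000), chunks=(500, 500)).sum().compute()


def dump_fine_metrics(dask_worker):
    # digests_total holds cumulative per-activity totals; keys are converted
    # to strings here only to keep the return value easy to serialize.
    return {str(k): v for k, v in dask_worker.digests_total.items()}


for addr, metrics in client.run(dump_fine_metrics).items():
    print(addr)
    for key, value in sorted(metrics.items()):
        print(f"    {key}: {value}")
```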
Out of scope
Demo
https://gist.github.com/crusaderky/a97f870c51260e63a1c14c20b762f666