-
-
Notifications
You must be signed in to change notification settings - Fork 2.1k
Track DB scheduling delay per-request #2775
Conversation
It turns out that the only thing we use the __dict__ of LoggingContext for is `request`, and given we create lots of LoggingContexts and then copy them every time we do a db transaction or log line, using the __dict__ seems a bit redundant. Let's try to optimise things by making the request attribute explicit.
... to reduce the amount of floating-point foo we do.
- to pull the timing stuff up out of _new_transaction. Sadly this means that we lose the timing info on TXN END but we'll have to live with that for now.
For each request, track the amount of time spent waiting for a db connection. This entails adding it to the LoggingContext and we may as well add metrics for it while we are passing.
5fe44bc
to
39b2998
Compare
self._txn_perf_counters.update( | ||
desc, txn_start_time_ms, txn_end_time_ms, | ||
) | ||
sql_txn_timer.inc_by(txn_duration, desc) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Won't this mean that any transactions created by calling runWithConnection directly won't be measured?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
well, yes, but they aren't currently measured as part of db_txn_duration
.
Looks like the only thing that uses runWithConnection
other than the background updates is the event_fetch
code, which is not specific to any http requests. I'd like for time spent waiting for events to be fetched to be tracked somehow, but tracking transaction duration here wouldn't help.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They may not be picked up by db_txn_duration
, but they are picked up by sql_txn_timer
. Losing metrics for fetching event txn duration doesn't sound ideal.
# seconds spent waiting for a db connection, when processing this request | ||
# | ||
# it's a counter rather than a distribution, because the count would always | ||
# be the same as that of all the other distributions. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is a distribution here.
Honestly I think it makes things clearer to always include the count, as otherwise it loos a bit odd to do: rate(metric_one:total)/rate(another_metric:count)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is a distribution here.
er, oops.
Honestly I think it makes things clearer to always include the count, as otherwise it loos a bit odd to do: rate(metric_one:total)/rate(another_metric:count)
It seems really silly to me to maintain six identical copies of the same counter here. That's a lot of pointless objects, hash lookups, and integer increments. IMHO what we ought to be doing is rate(synapse_http_server_response_db_sched_duration)/rate(synapse_http_server_response_count)
, which feels much more intuitive, but will take a bit of work to get there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems really silly to me to maintain six identical copies of the same counter here. That's a lot of pointless objects, hash lookups, and integer increments.
I would be surprised if they're not completely dwarfed by transaction overhead.
IMHO what we ought to be doing is
rate(synapse_http_server_response_db_sched_duration)/rate(synapse_http_server_response_count)
, which feels much more intuitive, but will take a bit of work to get there.
Possibly, but having things consistent seems more intuitive than having a couple that don't fit.
like what the comment says
I'm going to replace this with more specific PRs |
This PR consists of a series of patches which could stand as separate PRs, but since they build on one another it's easier to lump them together.
The net effect is to track the amount of time spent waiting for a db connection for each request. This entails adding it to the LoggingContext and we may as well add metrics for it while we are passing.