Track DB scheduling delay per-request #2775

richvdh · 2018-01-12T11:38:26Z

This PR consists of a series of patches which could stand as separate PRs, but since they build on one another it's easier to lump them together.

The net effect is to track the amount of time spent waiting for a db connection for each request. This entails adding it to the LoggingContext and we may as well add metrics for it while we are passing.

It turns out that the only thing we use the `__dict__` of `LoggingContext` for is `request`, and given that we create lots of `LoggingContext`s and then copy them every time we do a db transaction or log line, using the `__dict__` seems a bit redundant. Let's try to optimise things by making the `request` attribute explicit.
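As a rough illustration of that change (a minimal sketch, not the actual Synapse code: `SketchLoggingContext`, `copy_to` and the use of `__slots__` are assumptions made for the example), declaring the attribute up front turns the copy into a single assignment rather than a walk over the instance dictionary:

```python
import logging


class SketchLoggingContext(object):
    """Hypothetical, simplified context with its attributes declared up front."""

    # With __slots__, instances carry no per-instance __dict__ at all.
    __slots__ = ["name", "request"]

    def __init__(self, name=None, request=None):
        self.name = name
        self.request = request

    def copy_to(self, record):
        """Copy the one field the log formatter needs onto a log record."""
        record.request = self.request


ctx = SketchLoggingContext(name="GET-123", request="GET-123")
record = logging.LogRecord(
    "synapse.sketch", logging.INFO, "sketch.py", 1, "hello", None, None
)
ctx.copy_to(record)
print(record.request)  # → GET-123
```

Whether `__slots__` is worth using on top of the explicit attribute is a separate question; the sketch only shows that nothing else needs to be copied.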

Track db txn time in millisecs, to reduce the amount of floating-point foo we do.

Reshuffle store.runInteraction, to pull the timing stuff up out of _new_transaction. Sadly this means that we lose the timing info on TXN END, but we'll have to live with that for now.

For each request, track the amount of time spent waiting for a db connection. This entails adding it to the LoggingContext and we may as well add metrics for it while we are passing.
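To make "time spent waiting for a db connection" concrete, here is a minimal sketch under stated assumptions: a single-worker `ThreadPoolExecutor` stands in for the connection pool, and `RequestStats`, `clock_ms` and this `run_with_connection` are hypothetical names, not the Synapse implementations. The scheduling delay is the gap between queueing the work and a worker picking it up, kept as integer milliseconds:

```python
import time
from concurrent.futures import ThreadPoolExecutor


def clock_ms():
    """Monotonic time as an integer number of milliseconds."""
    return int(time.monotonic() * 1000)


class RequestStats(object):
    """Hypothetical per-request accumulator (stand-in for the logging context)."""

    def __init__(self):
        self.db_sched_duration_ms = 0
        self.db_txn_duration_ms = 0
        self.db_txn_count = 0


def run_with_connection(pool, stats, func):
    """Run func on a pool worker, recording how long it waited in the queue."""
    queued_at = clock_ms()

    def inner():
        started_at = clock_ms()  # a worker ("connection") is now available
        result = func()
        return result, started_at, clock_ms()

    # Blocking on the future keeps the sketch short; Synapse itself is
    # asynchronous (Twisted), so this is an assumption of the example only.
    result, started_at, ended_at = pool.submit(inner).result()
    stats.db_sched_duration_ms += started_at - queued_at
    stats.db_txn_duration_ms += ended_at - started_at
    stats.db_txn_count += 1
    return result


pool = ThreadPoolExecutor(max_workers=1)  # one "connection"
stats = RequestStats()
run_with_connection(pool, stats, lambda: "row")
print(stats.db_txn_count, stats.db_sched_duration_ms, stats.db_txn_duration_ms)
pool.shutdown()
```

With an idle pool the scheduling delay is close to zero; it only grows when every worker is busy, which is the case this metric is meant to expose.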

erikjohnston · 2018-01-12T11:55:56Z

synapse/storage/_base.py

+                    self._txn_perf_counters.update(
+                        desc, txn_start_time_ms, txn_end_time_ms,
+                    )
+                    sql_txn_timer.inc_by(txn_duration, desc)


Won't this mean that any transactions created by calling runWithConnection directly won't be measured?

richvdh · 2018-01-12

Well, yes, but they aren't currently measured as part of db_txn_duration.

Looks like the only thing that uses runWithConnection, other than the background updates, is the event_fetch code, which is not specific to any HTTP request. I'd like the time spent waiting for events to be fetched to be tracked somehow, but tracking transaction duration here wouldn't help.

erikjohnston · 2018-01-12

They may not be picked up by db_txn_duration, but they are picked up by sql_txn_timer. Losing the transaction-duration metrics for event fetching doesn't sound ideal.
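The concern in this thread can be sketched as follows. This is an illustration only: `SketchStore` and its snake_case methods are hypothetical and merely mirror the shape of the discussion. If the timing lives in the `runInteraction`-style wrapper, anything that calls the `runWithConnection`-style method directly bypasses it:

```python
import time


class SketchStore(object):
    """Hypothetical store; names mirror the discussion, not the real code."""

    def __init__(self):
        self.txn_timer_ms = {}  # description -> total transaction time in ms

    def run_with_connection(self, func, *args):
        # No timing here: callers that use this directly are not measured.
        conn = object()  # stand-in for a real database connection
        return func(conn, *args)

    def run_interaction(self, desc, func, *args):
        def timed(conn, *inner_args):
            start_ms = int(time.monotonic() * 1000)
            try:
                return func(conn, *inner_args)
            finally:
                end_ms = int(time.monotonic() * 1000)
                self.txn_timer_ms[desc] = (
                    self.txn_timer_ms.get(desc, 0) + end_ms - start_ms
                )

        return self.run_with_connection(timed, *args)


store = SketchStore()
store.run_interaction("get_user", lambda conn, user_id: user_id, "@a:example.com")
store.run_with_connection(lambda conn: "fetched events")  # not recorded
print(sorted(store.txn_timer_ms))  # → ['get_user']
```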

erikjohnston · 2018-01-12T11:57:52Z

synapse/http/server.py

+# seconds spent waiting for a db connection, when processing this request
+#
+# it's a counter rather than a distribution, because the count would always
+# be the same as that of all the other distributions.


It is a distribution here.

Honestly I think it makes things clearer to always include the count, as otherwise it looks a bit odd to do: `rate(metric_one:total)/rate(another_metric:count)`

richvdh · 2018-01-12

> It is a distribution here.

er, oops.

> Honestly I think it makes things clearer to always include the count, as otherwise it looks a bit odd to do: `rate(metric_one:total)/rate(another_metric:count)`

It seems really silly to me to maintain six identical copies of the same counter here. That's a lot of pointless objects, hash lookups, and integer increments. IMHO what we ought to be doing is `rate(synapse_http_server_response_db_sched_duration)/rate(synapse_http_server_response_count)`, which feels much more intuitive, but will take a bit of work to get there.

erikjohnston · 2018-01-12

> It seems really silly to me to maintain six identical copies of the same counter here. That's a lot of pointless objects, hash lookups, and integer increments.

I would be surprised if they're not completely dwarfed by transaction overhead.

> IMHO what we ought to be doing is `rate(synapse_http_server_response_db_sched_duration)/rate(synapse_http_server_response_count)`, which feels much more intuitive, but will take a bit of work to get there.

Possibly, but having things consistent seems more intuitive than having a couple that don't fit.
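For readers following the counter-versus-distribution argument, a minimal sketch may help. The `Counter` and `Distribution` classes below are hypothetical toys rather than the Synapse metrics classes, and plain totals stand in for the `rate(...)` expressions. A distribution keeps a count alongside its total, and since every per-request metric is observed once per response, each of those counts equals the response count:

```python
class Counter(object):
    """Hypothetical counter metric: a single running total."""

    def __init__(self):
        self.total = 0

    def inc_by(self, amount):
        self.total += amount


class Distribution(object):
    """Hypothetical distribution metric: a running total plus an observation count."""

    def __init__(self):
        self.count = 0
        self.total = 0

    def inc_by(self, amount):
        self.count += 1
        self.total += amount


response_count = Counter()
response_time = Distribution()
response_db_sched_duration = Counter()

# Made-up (response time, scheduling delay) pairs in ms, for illustration only.
for time_ms, sched_ms in [(120, 5), (80, 0), (200, 25)]:
    response_count.inc_by(1)
    response_time.inc_by(time_ms)
    response_db_sched_duration.inc_by(sched_ms)

# The distribution's own count duplicates the response counter.
assert response_time.count == response_count.total

# Mean scheduling delay per response: total delay over the shared count.
mean_sched_ms = response_db_sched_duration.total / response_count.total
print(mean_sched_ms)  # → 10.0
```

The trade-off debated above is then whether the duplicated `count` is worth keeping so that each metric's average can be computed from its own pair of series.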

richvdh · 2018-01-16T16:37:12Z

I'm going to replace this with more specific PRs

richvdh added 4 commits January 12, 2018 11:42

Track db txn time in millisecs

411350f

... to reduce the amount of floating-point foo we do.

Reshuffle store.runInteraction

ca8c4cd

- to pull the timing stuff up out of _new_transaction. Sadly this means that we lose the timing info on TXN END but we'll have to live with that for now.

Track DB scheduling delay per-request

39b2998

For each request, track the amount of time spent waiting for a db connection. This entails adding it to the LoggingContext and we may as well add metrics for it while we are passing.

richvdh force-pushed the rav/better_metrics branch from 5fe44bc to 39b2998 Compare January 12, 2018 11:42

richvdh assigned erikjohnston Jan 12, 2018

erikjohnston reviewed Jan 12, 2018

erikjohnston assigned richvdh and unassigned erikjohnston Jan 12, 2018

Actually make response_db_sched_duration a counter

06caba6

like what the comment says

richvdh assigned erikjohnston and unassigned richvdh Jan 12, 2018

erikjohnston assigned richvdh and unassigned erikjohnston Jan 12, 2018

This was referenced Jan 12, 2018

Reorganise request and block metrics #2779

Closed

Optimise LoggingContext creation and copying #2792

Merged

richvdh closed this Jan 16, 2018

This was referenced Jan 16, 2018

Track db txn time in millisecs #2793

Merged

rework runInteraction in terms of runConnection #2794

Merged

richvdh deleted the rav/better_metrics branch August 2, 2018 11:21
