This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Track DB scheduling delay per-request #2775

Closed · wants to merge 5 commits

Conversation

richvdh (Member) commented Jan 12, 2018

This PR consists of a series of patches which could stand as separate PRs, but since they build on one another it's easier to lump them together.

The net effect is to track the amount of time spent waiting for a db connection for each request. This entails adding it to the LoggingContext and we may as well add metrics for it while we are passing.
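The idea described above can be sketched in plain Python. This is a hypothetical illustration, not Synapse's real API: the names `RequestContext`, `SimplePool`, and `run_with_connection` are invented for the example. The point is simply that the time between asking the pool for a connection and receiving one is accumulated on the request's context.

```python
# Hypothetical sketch of the idea in this PR (illustrative names, not
# Synapse's real API): time how long each request waits for a db
# connection, and accumulate that delay on the request's logging
# context so it can be reported per-request.

import time

class RequestContext:
    def __init__(self, request_id):
        self.request_id = request_id
        self.db_sched_duration = 0.0  # seconds spent waiting for a connection

class SimplePool:
    """A trivial stand-in for a connection pool."""
    def __init__(self, conn):
        self._conn = conn

    def acquire(self):
        return self._conn  # a real pool may block here while busy

    def release(self, conn):
        pass

def run_with_connection(pool, context, func):
    start = time.monotonic()
    conn = pool.acquire()
    # everything between requesting and receiving the connection counts
    # as db scheduling delay for this request
    context.db_sched_duration += time.monotonic() - start
    try:
        return func(conn)
    finally:
        pool.release(conn)
```

Under this sketch, the accumulated `db_sched_duration` would be logged with the request, or fed into a metric when the response finishes.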

Commit messages:

- It turns out that the only thing we use the `__dict__` of LoggingContext for is `request`, and given we create lots of LoggingContexts and then copy them every time we do a db transaction or log line, using the `__dict__` seems a bit redundant. Let's try to optimise things by making the `request` attribute explicit.
- ... to reduce the amount of floating-point foo we do.
- ... to pull the timing stuff up out of `_new_transaction`. Sadly this means that we lose the timing info on TXN END, but we'll have to live with that for now.
- For each request, track the amount of time spent waiting for a db connection. This entails adding it to the LoggingContext, and we may as well add metrics for it while we are passing.
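The `__dict__`-to-explicit-attribute optimisation mentioned in the first commit can be illustrated with `__slots__`. This is a hypothetical sketch, not Synapse's actual `LoggingContext` class; the attribute names besides `request` are invented for the example.

```python
# Hypothetical sketch (not Synapse's actual class) of the optimisation
# described in the first commit: if `request` is the only attribute
# stored in __dict__, declaring attributes via __slots__ avoids
# allocating a per-instance dict, which adds up when contexts are
# created and copied on every db transaction and log line.

class LoggingContext:
    __slots__ = ["request", "db_sched_duration_ms"]

    def __init__(self, request=None):
        self.request = request          # explicit attribute, no __dict__
        self.db_sched_duration_ms = 0   # ms spent waiting for a db connection

    def copy_to(self, other):
        # copying is a fixed set of attribute assignments, not a dict copy
        other.request = self.request
        other.db_sched_duration_ms = self.db_sched_duration_ms
```

Because the class declares `__slots__` and nothing else, instances carry no `__dict__` at all, so both creation and copying avoid a dict allocation.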
The code under review (from the diff):

```python
self._txn_perf_counters.update(
    desc, txn_start_time_ms, txn_end_time_ms,
)
sql_txn_timer.inc_by(txn_duration, desc)
```
Member:

Won't this mean that any transactions created by calling runWithConnection directly won't be measured?

Member Author (richvdh):

well, yes, but they aren't currently measured as part of db_txn_duration.

Looks like the only thing that uses runWithConnection other than the background updates is the event_fetch code, which is not specific to any http requests. I'd like for time spent waiting for events to be fetched to be tracked somehow, but tracking transaction duration here wouldn't help.

Member:

They may not be picked up by db_txn_duration, but they are picked up by sql_txn_timer. Losing metrics for fetching event txn duration doesn't sound ideal.

The code under review (from the diff):

```python
# seconds spent waiting for a db connection, when processing this request
#
# it's a counter rather than a distribution, because the count would always
# be the same as that of all the other distributions.
```
Member:

It is a distribution here.

Honestly I think it makes things clearer to always include the count, as otherwise it looks a bit odd to do: rate(metric_one:total)/rate(another_metric:count)

Member Author (richvdh):

> It is a distribution here.

er, oops.

> Honestly I think it makes things clearer to always include the count, as otherwise it looks a bit odd to do: rate(metric_one:total)/rate(another_metric:count)

It seems really silly to me to maintain six identical copies of the same counter here. That's a lot of pointless objects, hash lookups, and integer increments. IMHO what we ought to be doing is rate(synapse_http_server_response_db_sched_duration)/rate(synapse_http_server_response_count), which feels much more intuitive, but will take a bit of work to get there.
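The counter-vs-distribution trade-off being debated here can be sketched in plain Python. These are illustrative classes, not Synapse's actual metrics API: a "distribution" carries a running total *and* a count, so its mean is self-contained, while a plain counter carries only the total and must be divided by some other metric's count.

```python
# Illustrative sketch (not Synapse's metrics classes) of the trade-off
# under discussion: a distribution tracks total and count, so
# mean = total / count is self-contained; a plain counter tracks only
# the total, and its rate has to be divided by another metric's count
# (e.g. the per-request response count), as richvdh proposes.

class Counter:
    def __init__(self):
        self.total = 0.0

    def inc_by(self, value):
        self.total += value

class Distribution(Counter):
    def __init__(self):
        super().__init__()
        self.count = 0

    def inc_by(self, value):
        super().inc_by(value)
        self.count += 1

db_sched = Counter()        # seconds spent waiting for a db connection
responses = Distribution()  # one increment per completed response

for delay in [0.01, 0.02, 0.03]:
    db_sched.inc_by(delay)
    responses.inc_by(1)

# the rate(total)/rate(count) division from the discussion, in miniature
mean_delay = db_sched.total / responses.count
```

Keeping `db_sched` as a bare counter avoids duplicating a count that every distribution on the same request stream would share, at the cost of the cross-metric division above.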

Member:

> It seems really silly to me to maintain six identical copies of the same counter here. That's a lot of pointless objects, hash lookups, and integer increments.

I would be surprised if they're not completely dwarfed by transaction overhead.

> IMHO what we ought to be doing is rate(synapse_http_server_response_db_sched_duration)/rate(synapse_http_server_response_count), which feels much more intuitive, but will take a bit of work to get there.

Possibly, but having things consistent seems more intuitive than having a couple that don't fit.

@erikjohnston erikjohnston assigned richvdh and unassigned erikjohnston Jan 12, 2018
richvdh (Member Author) commented Jan 16, 2018

I'm going to replace this with more specific PRs.

@richvdh richvdh closed this Jan 16, 2018
@richvdh richvdh deleted the rav/better_metrics branch August 2, 2018 11:21