Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Share some metrics between the Prometheus exporter and the phone home stats #13671

Merged
merged 13 commits into from
Sep 5, 2022

Conversation

babolivier
Copy link
Contributor

@babolivier babolivier commented Aug 30, 2022

This PR adds a way to share some data between the prometheus exporter and the phone home stats (and wherever else we might want to share these stats with in the future, eg with modules). For now this is only used to expose the daily active users count, but can be expanded to cover other metrics.

The new test doesn't pass yet so I need to figure that out, but I've verified that the change works manually so I don't think it should block review. edit: fixed now

@babolivier babolivier added the Z-Time-Tracked Element employees should track their time spent on this issue/PR. label Aug 30, 2022
@babolivier babolivier requested a review from a team as a code owner August 30, 2022 17:18
Comment on lines 51 to 53
async def update(self) -> None:
"""Updates the shared usage metrics."""
await self.update_daily_active_users()
Copy link
Contributor Author

@babolivier babolivier Aug 30, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels very boilerplatey currently but the idea is that we can add to this method if we add new metrics to this class.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would be tempted to separate the stages of:

  • collecting the metrics (c.f. other comment; return a SharedUsageMetrics dataclass)
  • updating the gauges, given a SharedUsageMetrics object

One function per metric does feel a bit too boilerplatey to me.

as an idea

async def _collect(self) -> SharedUsageMetrics:
    return SharedUsageMetrics(
		daily_active_users=await self._store.count_daily_users()
    )

async def _update_gauges(self, metrics: SharedUsageMetrics) -> None:
    current_dau_gauge.set(metrics.daily_active_users)

Just my opinion, but I like that a bit better

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So we would generate a new object each time we collect metrics? That sounds fairly inefficient to me.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've implemented this idea in a slightly different way from your proposal, lmk what you think.

Copy link
Contributor

@reivilibre reivilibre Sep 1, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So we would generate a new object each time we collect metrics? That sounds fairly inefficient to me.

I think it's not a big cost for the clarity it gives. Remember this is Python where basically everything, even integers, are objects, so one more object isn't going to be much harm (especially if it uses slots).
The object is being removed in such a way that should be friendly with Python's refcounting so I don't think there's much of a garbage collection cost here either.

I would prefer the way I proposed as each metric is more self-contained in the code; the current proposal has the metric being updated differently in two branches and stored in an intermediate variable; it feels clunkier.

Even supposing objects were more inefficient, this is infrequently run that I don't think it's worth premature optimisation, better to simplify the code to reduce footgun potential I think

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I disagree. Even though, as you mention, the gain in optimising this is pretty negligible (for now at least, there's always the possibility that something else starts using it, such as modules), to me creating a new object everytime something requests it is a bad code smell. Plus I think the code would be more confusing that way, because then we'd be updating the prometheus gauge every 5 minutes by reading the count from the database and then when the phone home stats ask for them we'd do the whole dance again. It feels both more logical and more efficient to me to have the CommonUsageMetricsManager keep a copy of the metrics in memory, keep that updated, and serve it whenever necessary.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you substantiate why you think it's a code smell? Creating an object or struct to hold associated values is such an accepted pattern elsewhere that I'm surprised it's objectionable. (In fact, mutating the existing objects that you've already handed out is usually considered an awful crime, to the point that some disciplines insist on immutability in many places.)

I think the worse code smell here is not keeping the logic for one metric in one place, e.g. if you have metrics A B and C then the code locality looks like:

1st proposal:

self.stats = Stats(
    a=count_a(),
    b=count_b(),
    c=count_c(),
)

2nd proposal:

a = count_a()
b = count_b()
c = count_c()
if ..:
    self.stats = Stats(a=a, b=b, c=c)
else:
    self.stats.a = a
    self.stats.b = b
    self.stats.c = c

If we annotate the two proposals' lines with which metric they concern, you can see my point here that the locality of behaviour is improved in the first suggestion (which I believe is better for its readability):

1st proposal:

self.stats = Stats(
    a=count_a(), # A
    b=count_b(), # B
    c=count_c(), # C
)

2nd proposal:

a = count_a() # A
b = count_b() # B
c = count_c() # C
if ..:
    self.stats = Stats(a=a, b=b, c=c) # A, B, C
else:
    self.stats.a = a # A
    self.stats.b = b # B
    self.stats.c = c # C

I also think the branch here is just more clutter than it's worth. Pessimistically, this is now twice as much to go wrong and twice as much to test.

the gain in optimising this is pretty negligible (for now at least, there's always the possibility that something else starts using it, such as modules)

I will note that I don't think the gain is pretty negligible — I think it's highly dubious; we create objects all the time in Synapse and in Python generally, whether they're integers, tuples, or instances of classes.

Of course, if we really want to talk about such tiny optimisations, then we might instead consider whether introducing an if here is bad because branching is inefficient, or we might consider that increasing the size of the code in the function is bad because you're going to increase the size of this function's code, therefore making caches ineffective.
I don't think any of these are seriously relevant here, but I do think that code clarity and keeping it easier to maintain is a big bonus generally.

Plus I think the code would be more confusing that way, because then we'd be updating the prometheus gauge every 5 minutes by reading the count from the database and then when the phone home stats ask for them we'd do the whole dance again.

I think you misunderstand my statement here (hopefully having spelled out the proposals above clarifies it?). You don't have to recalculate each and every time we request it, just the style of handing back the record can be different; just always create a new record when updating the metrics rather than mutating the previous one.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you misunderstand my statement here (hopefully having spelled out the proposals above clarifies it?). You don't have to recalculate each and every time we request it, just the style of handing back the record can be different; just always create a new record when updating the metrics rather than mutating the previous one.

Ah, I was misunderstanding it indeed. You initial proposal of _update_gauges taking metrics as an argument somehow made me assume something else than the CommonUsageMetricsManager would be handing over a metrics object to use to update the gauges, which made me super confused as to where it's supposed to be coming from. I think I see what you have in mind now, and it does make sense.

@babolivier babolivier changed the title Expose daily active users to prometheus metrics Share some metrics between the Prometheus exporter and the phone home stats Aug 30, 2022
Copy link
Contributor

@reivilibre reivilibre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

overall idea seems good, just minor thoughts

self._store = hs.get_datastores().main
self._clock = hs.get_clock()

self.daily_active_users = -1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not a fan of having each field have a sentinel value; it gives the impression that we may end up reporting -1 (although I see that you update upon initialisation). IMO we shouldn't report anything rather than report negative values. I can see this being potentially troublesome for the phone-home stats when we sum DAUs.

I think I would be tempted to restructure this a bit:

  • rename to SharedUsageMetricsManager or equivalent (not sure what the conventional term is in Synapse ... is it Controller?)
  • make SharedUsageMetrics a plain dataclass/attrs object, with daily_active_users as an int field
  • put a latest(): Optional[SharedUsageMetrics] method (or field maybe?) on SharedUsageMetricsManager

(N.B. I was going to suggest that the metrics be lazily computed, either on-phone-home or on-prometheus-scrape, but I realise we don't want to block the prometheus scrape endpoint on Twisted's reactor in case everything is going awry)

This sort of design prevents having individual metrics having sentinel values like -1, which may escape.

Comment on lines 51 to 53
async def update(self) -> None:
"""Updates the shared usage metrics."""
await self.update_daily_active_users()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would be tempted to separate the stages of:

  • collecting the metrics (c.f. other comment; return a SharedUsageMetrics dataclass)
  • updating the gauges, given a SharedUsageMetrics object

One function per metric does feel a bit too boilerplatey to me.

as an idea

async def _collect(self) -> SharedUsageMetrics:
    return SharedUsageMetrics(
		daily_active_users=await self._store.count_daily_users()
    )

async def _update_gauges(self, metrics: SharedUsageMetrics) -> None:
    current_dau_gauge.set(metrics.daily_active_users)

Just my opinion, but I like that a bit better

synapse/server.py Outdated Show resolved Hide resolved
@babolivier babolivier enabled auto-merge (squash) September 5, 2022 09:41
@babolivier babolivier merged commit 898fef2 into develop Sep 5, 2022
@babolivier babolivier deleted the babolivier/dau_prom branch September 5, 2022 10:26
Fizzadar added a commit to beeper/synapse-legacy-fork that referenced this pull request Sep 15, 2022
Synapse 1.67.0 (2022-09-13)
===========================

This release removes using the deprecated direct TCP replication configuration
for workers. Server admins should use Redis instead. See the [upgrade
notes](https://matrix-org.github.io/synapse/v1.67/upgrade.html#upgrading-to-v1670).

The minimum version of `poetry` supported for managing source checkouts is now
1.2.0.

**Notice:** from the next major release (1.68.0) installing Synapse from a source
checkout will require a recent Rust compiler. Those using packages or
`pip install matrix-synapse` will not be affected. See the [upgrade
notes](https://matrix-org.github.io/synapse/v1.67/upgrade.html#upgrading-to-v1670).

**Notice:** from the next major release (1.68.0), running Synapse with a SQLite
database will require SQLite version 3.27.0 or higher. (The [current minimum
 version is SQLite 3.22.0](https://github.com/matrix-org/synapse/blob/release-v1.67/synapse/storage/engines/sqlite.py#L69-L78).)
See [matrix-org#12983](matrix-org#12983) and the [upgrade notes](https://matrix-org.github.io/synapse/v1.67/upgrade.html#upgrading-to-v1670) for more details.

No significant changes since 1.67.0rc1.

Synapse 1.67.0rc1 (2022-09-06)
==============================

Features
--------

- Support setting the registration shared secret in a file, via a new `registration_shared_secret_path` configuration option. ([\matrix-org#13614](matrix-org#13614))
- Change the default startup behaviour so that any missing "additional" configuration files (signing key, etc) are generated automatically. ([\matrix-org#13615](matrix-org#13615))
- Improve performance of sending messages in rooms with thousands of local users. ([\matrix-org#13634](matrix-org#13634))

Bugfixes
--------

- Fix a bug introduced in Synapse 1.13 where the [List Rooms admin API](https://matrix-org.github.io/synapse/develop/admin_api/rooms.html#list-room-api) would return integers instead of booleans for the `federatable` and `public` fields when using a Sqlite database. ([\matrix-org#13509](matrix-org#13509))
- Fix bug that user cannot `/forget` rooms after the last member has left the room. ([\matrix-org#13546](matrix-org#13546))
- Faster Room Joins: fix `/make_knock` blocking indefinitely when the room in question is a partial-stated room. ([\matrix-org#13583](matrix-org#13583))
- Fix loading the current stream position behind the actual position. ([\matrix-org#13585](matrix-org#13585))
- Fix a longstanding bug in `register_new_matrix_user` which meant it was always necessary to explicitly give a server URL. ([\matrix-org#13616](matrix-org#13616))
- Fix the running of [MSC1763](matrix-org/matrix-spec-proposals#1763) retention purge_jobs in deployments with background jobs running on a worker by forcing them back onto the main worker. Contributed by Brad @ Beeper. ([\matrix-org#13632](matrix-org#13632))
- Fix a long-standing bug that downloaded media for URL previews was not deleted while database background updates were running. ([\matrix-org#13657](matrix-org#13657))
- Fix [MSC3030](matrix-org/matrix-spec-proposals#3030) `/timestamp_to_event` endpoint to return the correct next event when the events have the same timestamp. ([\matrix-org#13658](matrix-org#13658))
- Fix bug where we wedge media plugins if clients disconnect early. Introduced in v1.22.0. ([\matrix-org#13660](matrix-org#13660))
- Fix a long-standing bug which meant that keys for unwhitelisted servers were not returned by `/_matrix/key/v2/query`. ([\matrix-org#13683](matrix-org#13683))
- Fix a bug introduced in Synapse v1.20.0 that would cause the unstable unread counts from [MSC2654](matrix-org/matrix-spec-proposals#2654) to be calculated even if the feature is disabled. ([\matrix-org#13694](matrix-org#13694))

Updates to the Docker image
---------------------------

- Update docker image to use a stable version of poetry. ([\matrix-org#13688](matrix-org#13688))

Improved Documentation
----------------------

- Improve the description of the ["chain cover index"](https://matrix-org.github.io/synapse/latest/auth_chain_difference_algorithm.html) used internally by Synapse. ([\matrix-org#13602](matrix-org#13602))
- Document how ["monthly active users"](https://matrix-org.github.io/synapse/latest/usage/administration/monthly_active_users.html) is calculated and used. ([\matrix-org#13617](matrix-org#13617))
- Improve documentation around user registration. ([\matrix-org#13640](matrix-org#13640))
- Remove documentation of legacy `frontend_proxy` worker app. ([\matrix-org#13645](matrix-org#13645))
- Clarify documentation that HTTP replication traffic can be protected with a shared secret. ([\matrix-org#13656](matrix-org#13656))
- Remove unintentional colons from [config manual](https://matrix-org.github.io/synapse/latest/usage/configuration/config_documentation.html) headers. ([\matrix-org#13665](matrix-org#13665))
- Update docs to make enabling metrics more clear. ([\matrix-org#13678](matrix-org#13678))
- Clarify `(room_id, event_id)` global uniqueness and how we should scope our database schemas. ([\matrix-org#13701](matrix-org#13701))

Deprecations and Removals
-------------------------

- Drop support for calling `/_matrix/client/v3/rooms/{roomId}/invite` without an `id_access_token`, which was not permitted by the spec. Contributed by @Vetchu. ([\matrix-org#13241](matrix-org#13241))
- Remove redundant `_get_joined_users_from_context` cache. Contributed by Nick @ Beeper (@Fizzadar). ([\matrix-org#13569](matrix-org#13569))
- Remove the ability to use direct TCP replication with workers. Direct TCP replication was deprecated in Synapse v1.18.0. Workers now require using Redis. ([\matrix-org#13647](matrix-org#13647))
- Remove support for unstable [private read receipts](matrix-org/matrix-spec-proposals#2285). ([\matrix-org#13653](matrix-org#13653), [\matrix-org#13692](matrix-org#13692))

Internal Changes
----------------

- Extend the release script to wait for GitHub Actions to finish and to be usable as a guide for the whole process. ([\matrix-org#13483](matrix-org#13483))
- Add experimental configuration option to allow disabling legacy Prometheus metric names. ([\matrix-org#13540](matrix-org#13540))
- Cache user IDs instead of profiles to reduce cache memory usage. Contributed by Nick @ Beeper (@Fizzadar). ([\matrix-org#13573](matrix-org#13573), [\matrix-org#13600](matrix-org#13600))
- Optimize how Synapse calculates domains to fetch from during backfill. ([\matrix-org#13575](matrix-org#13575))
- Comment about a better future where we can get the state diff between two events. ([\matrix-org#13586](matrix-org#13586))
- Instrument `_check_sigs_and_hash_and_fetch` to trace time spent in child concurrent calls for understandable traces in Jaeger. ([\matrix-org#13588](matrix-org#13588))
- Improve performance of `@cachedList`. ([\matrix-org#13591](matrix-org#13591))
- Minor speed up of fetching large numbers of push rules. ([\matrix-org#13592](matrix-org#13592))
- Optimise push action fetching queries. Contributed by Nick @ Beeper (@Fizzadar). ([\matrix-org#13597](matrix-org#13597))
- Rename `event_map` to `unpersisted_events` when computing the auth differences. ([\matrix-org#13603](matrix-org#13603))
- Refactor `get_users_in_room(room_id)` mis-use with dedicated `get_current_hosts_in_room(room_id)` function. ([\matrix-org#13605](matrix-org#13605))
- Use dedicated `get_local_users_in_room(room_id)` function to find local users when calculating `join_authorised_via_users_server` of a `/make_join` request. ([\matrix-org#13606](matrix-org#13606))
- Refactor `get_users_in_room(room_id)` mis-use to lookup single local user with dedicated `check_local_user_in_room(...)` function. ([\matrix-org#13608](matrix-org#13608))
- Drop unused column `application_services_state.last_txn`. ([\matrix-org#13627](matrix-org#13627))
- Improve readability of Complement CI logs by printing failure results last. ([\matrix-org#13639](matrix-org#13639))
- Generalise the `@cancellable` annotation so it can be used on functions other than just servlet methods. ([\matrix-org#13662](matrix-org#13662))
- Introduce a `CommonUsageMetrics` class to share some usage metrics between the Prometheus exporter and the phone home stats. ([\matrix-org#13671](matrix-org#13671))
- Add some logging to help track down matrix-org#13444. ([\matrix-org#13679](matrix-org#13679))
- Update poetry lock file for v1.2.0. ([\matrix-org#13689](matrix-org#13689))
- Add cache to `is_partial_state_room`. ([\matrix-org#13693](matrix-org#13693))
- Update the Grafana dashboard that is included with Synapse in the `contrib` directory. ([\matrix-org#13697](matrix-org#13697))
- Only run trial CI on all python versions on non-PRs. ([\matrix-org#13698](matrix-org#13698))
- Fix typechecking with latest types-jsonschema. ([\matrix-org#13712](matrix-org#13712))
- Reduce number of CI checks we run for PRs. ([\matrix-org#13713](matrix-org#13713))

# -----BEGIN PGP SIGNATURE-----
#
# iQFEBAABCgAuFiEEBTGR3/RnAzBGUif3pULk7RsPrAkFAmMgR2QQHGVyaWtAbWF0
# cml4Lm9yZwAKCRClQuTtGw+sCfG7B/94PwW1ChsaI8hkz/3e+93PEl/mNJ6YFaEB
# 5pP4Dh/0dipP/iKbpgNuj5xz/JFnIi8D49A8sKNnku3jk0/8AZHgqDiBgOkrN76z
# Y3awo5Q9ag4xww/105V3bhdnX1NrX8Avf6F2jchDv6/9q8wQHGBPg6DMgfZ/m/BL
# SB4dypbbNpgLykuwtWxx6YMUYH+trsXJOn/MoAqld3QcZsqkDR25wXCt9+Dr+6AT
# dPd/czi8kV8ruU59tf2K5HB7XKzBW9S3Qb3dJJmGOTTJ7ccUkN/XuTwqnII950Mo
# bSlMXjY2hqk8rKUNhGZpi9bqUkwNhMgOkZl9A0Y1XtsXx6yjy0T/
# =zSGi
# -----END PGP SIGNATURE-----
# gpg: Signature made Tue Sep 13 10:03:32 2022 BST
# gpg:                using RSA key 053191DFF4670330465227F7A542E4ED1B0FAC09
# gpg:                issuer "erik@matrix.org"
# gpg: Can't check signature: No public key

# Conflicts:
#	synapse/config/experimental.py
#	synapse/push/bulk_push_rule_evaluator.py
#	synapse/storage/databases/main/event_push_actions.py
#	synapse/util/caches/descriptors.py
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Z-Time-Tracked Element employees should track their time spent on this issue/PR.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants