Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Strip number suffix from instance name to consolidate services that traces are spread over #13729

Conversation

MadLittleMods
Copy link
Contributor

@MadLittleMods MadLittleMods commented Sep 6, 2022

Strip number suffix from instance name to consolidate services that traces are spread over.

The problem with many services is that it makes it hard to find which service has the trace you want, see jaegertracing/jaeger-ui#985

Previously, we split traces out into services based on their instance name like matrix.org client_reader-1, etc but there are many worker instances of the same client_reader so there is a lot to click through.

With this PR, all of the traces are just collected under the worker type like client_reader, event_persister 😇

Note: A Synapse worker instance name is an opaque string with the number convention only being our own thing for the matrix.org deployment. But seems pretty sensible to group things this way.

Before After
A bunch of `matrix.org` services split by instance name `matrix.org` services split by worker type

Dev notes

import re
STRIP_INSTANCE_NUMBER_SUFFIX_REGEX = re.compile(r"[_-]?\d+$")
re.sub(STRIP_INSTANCE_NUMBER_SUFFIX_REGEX, "", "matrix.org client_reader-8")
# Result: "matrix.org client_reader"

Pull Request Checklist

  • Pull request is based on the develop branch
  • Pull request includes a changelog file. The entry should:
    • Be a short description of your change which makes sense to users. "Fixed a bug that prevented receiving messages from other servers." instead of "Moved X method from EventStore to EventWorkerStore.".
    • Use markdown where necessary, mostly for code blocks.
    • End with either a period (.) or an exclamation mark (!).
    • Start with a capital letter.
    • Feel free to credit yourself, by adding a sentence "Contributed by @github_username." or "Contributed by [Your Name]." to the end of the entry.
  • Pull request includes a sign off
  • Code style is correct
    (run the linters)

@@ -443,7 +443,7 @@ def init_tracer(hs: "HomeServer") -> None:

config = JaegerConfig(
config=hs.config.tracing.jaeger_config,
service_name=f"{hs.config.server.server_name} {hs.get_instance_name()}",
Copy link
Contributor Author

@MadLittleMods MadLittleMods Sep 6, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did we get any benefit from splitting up the traces by instance or instance type before?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only reason I've been holding off from doing this is that we lose the "colouring" of the traces, i.e. it becomes much harder to tell when processing moves to a different process. Not sure if there's another way of achieving the same effect?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm afraid I would second Erik's sort-of objection here; although having to find the right service is a minor pain, I really find the colouring useful because it helps to spot the points where the processing moves over replication.

Perhaps something else to consider: I wonder if e.g. Grafana's interface for viewing traces is better for finding traces regardless of service?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could also do something like change the service name to the worker type? I.e. the instance name without the index. That way it'd be easier to find stuff (as you don't have to figure out the instance), and for the most part instances don't talk to the same type?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could also do something like change the service name to the worker type?

This was going to be my suggestion!

We could also add a tag which is the entire instance name if you did want to filter by that. I think the only time that is useful is if you want to see what happens when an instance is under load or something, but...I don't think this is the right interface for that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only reason I've been holding off from doing this is that we lose the "colouring" of the traces, i.e. it becomes much harder to tell when processing moves to a different process.

I really find the colouring useful because it helps to spot the points where the processing moves over replication.

Ahhh, makes sense. I have been mostly looking at the RoomMessageListRestServlet which is all turqoise in one process.

For reference, an example that spans multiple processes is RoomStateEventRestServlet

just note that 'the instance name without the index' is not a generic Synapse concept; it's our matrix.org Ansible stuff that has this convention. To Synapse, the instance name is opaque and just so happens to include a number.,,

Any tips for how to accomplish because this is what I also saw looking at this yesterday and just went with - Synapse?

We could strip number suffixes from the opaque worker name. Can we consider that a safe-enough convention?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wouldn't be too offended if we took an educated guess and stripped "[_-]?[0-9]+$" off the end of the worker name. I think it's mainly just us/developers that would be using Jaeger anyway, so I don't consider it too harmful.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could make it configurable?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've updated to the sensible default of stripping off the number suffix. Our matrix.org worker naming convention seems sane and we can avoid the config.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could also add a tag which is the entire instance name if you did want to filter by that. I think the only time that is useful is if you want to see what happens when an instance is under load or something, but...I don't think this is the right interface for that.

BTW +1 to including the real instance name as a tag. I've found it useful to filter out a specific process that was struggling before now. It also gives you a place to look for the right logs when viewing a trace.

Split that out to #13761

@MadLittleMods MadLittleMods added the T-Task Refactoring, removal, replacement, enabling or disabling functionality, other engineering tasks. label Sep 6, 2022
@MadLittleMods MadLittleMods marked this pull request as ready for review September 6, 2022 22:28
@MadLittleMods MadLittleMods requested a review from a team as a code owner September 6, 2022 22:28
@MadLittleMods MadLittleMods changed the title Stop splitting traces across many services (by instance name) Strip number suffix from instance name to consolidate services that traces are spread over Sep 8, 2022
MadLittleMods added a commit that referenced this pull request Sep 8, 2022
@MadLittleMods MadLittleMods mentioned this pull request Sep 8, 2022
4 tasks
@clokep clokep removed their request for review September 9, 2022 11:20
@MadLittleMods MadLittleMods merged commit f694bb7 into develop Sep 9, 2022
@MadLittleMods MadLittleMods deleted the madlittlemods/do-not-separate-tracing-for-synapse-across-many-services branch September 9, 2022 16:30
@MadLittleMods
Copy link
Contributor Author

Thanks for the review @erikjohnston, @clokep, and @reivilibre 🕊

MadLittleMods added a commit that referenced this pull request Sep 9, 2022
We tag the Synapse instance name so that it's an easy jumping off point into the logs. Can also be used to filter for an instance that is under load.

As suggested by @clokep and @reivilibre in,

 - #13729 (comment)
 - #13729 (comment)
realtyem pushed a commit to realtyem/synapse-unraid that referenced this pull request Sep 10, 2022
We tag the Synapse instance name so that it's an easy jumping off point into the logs. Can also be used to filter for an instance that is under load.

As suggested by @clokep and @reivilibre in,

 - matrix-org/synapse#13729 (comment)
 - matrix-org/synapse#13729 (comment)
Fizzadar added a commit to beeper/synapse-legacy-fork that referenced this pull request Sep 30, 2022
Synapse 1.68.0 (2022-09-27)
===========================

Please note that Synapse will now refuse to start if configured to use a version of SQLite older than 3.27.

In addition, please note that installing Synapse from a source checkout now requires a recent Rust compiler.
Those using packages will not be affected. On most platforms, installing with `pip install matrix-synapse` will not be affected.
See the [upgrade notes](https://matrix-org.github.io/synapse/v1.68/upgrade.html#upgrading-to-v1680).

Bugfixes
--------

- Fix packaging to include `Cargo.lock` in `sdist`. ([\matrix-org#13909](matrix-org#13909))

Synapse 1.68.0rc2 (2022-09-23)
==============================

Bugfixes
--------

- Fix building from packaged sdist. Broken in v1.68.0rc1. ([\matrix-org#13866](matrix-org#13866))

Internal Changes
----------------

- Fix the release script not publishing binary wheels. ([\matrix-org#13850](matrix-org#13850))
- Lower minimum supported rustc version to 1.58.1. ([\matrix-org#13857](matrix-org#13857))
- Lock Rust dependencies' versions. ([\matrix-org#13858](matrix-org#13858))

Synapse 1.68.0rc1 (2022-09-20)
==============================

Features
--------

- Keep track of when we fail to process a pulled event over federation so we can intelligently back off in the future. ([\matrix-org#13589](matrix-org#13589), [\matrix-org#13814](matrix-org#13814))
- Add an [admin API endpoint to fetch messages within a particular window of time](https://matrix-org.github.io/synapse/v1.68/admin_api/rooms.html#room-messages-api). ([\matrix-org#13672](matrix-org#13672))
- Add an [admin API endpoint to find a user based on their external ID in an auth provider](https://matrix-org.github.io/synapse/v1.68/admin_api/user_admin_api.html#find-a-user-based-on-their-id-in-an-auth-provider). ([\matrix-org#13810](matrix-org#13810))
- Cancel the processing of key query requests when they time out. ([\matrix-org#13680](matrix-org#13680))
- Improve validation of request bodies for the following client-server API endpoints: [`/account/3pid/msisdn/requestToken`](https://spec.matrix.org/v1.3/client-server-api/#post_matrixclientv3account3pidmsisdnrequesttoken), [`/org.matrix.msc3720/account_status`](https://github.com/matrix-org/matrix-spec-proposals/blob/babolivier/user_status/proposals/3720-account-status.md#post-_matrixclientv1account_status), [`/account/3pid/add`](https://spec.matrix.org/v1.3/client-server-api/#post_matrixclientv3account3pidadd), [`/account/3pid/bind`](https://spec.matrix.org/v1.3/client-server-api/#post_matrixclientv3account3pidbind), [`/account/3pid/delete`](https://spec.matrix.org/v1.3/client-server-api/#post_matrixclientv3account3piddelete) and [`/account/3pid/unbind`](https://spec.matrix.org/v1.3/client-server-api/#post_matrixclientv3account3pidunbind). ([\matrix-org#13687](matrix-org#13687), [\matrix-org#13736](matrix-org#13736))
- Document the timestamp when a user accepts the consent, if [consent tracking](https://matrix-org.github.io/synapse/latest/consent_tracking.html) is used. ([\matrix-org#13741](matrix-org#13741))
- Add a `listeners[x].request_id_header` configuration option to specify which request header to extract and use as the request ID in order to correlate requests from a reverse proxy. ([\matrix-org#13801](matrix-org#13801))

Bugfixes
--------

- Fix a bug introduced in Synapse 1.41.0 where the `/hierarchy` API returned non-standard information (a `room_id` field under each entry in `children_state`). ([\matrix-org#13506](matrix-org#13506))
- Fix a long-standing bug where previously rejected events could end up in room state because they pass auth checks given the current state of the room. ([\matrix-org#13723](matrix-org#13723))
- Fix a long-standing bug where Synapse fails to start if a signing key file contains an empty line. ([\matrix-org#13738](matrix-org#13738))
- Fix a long-standing bug where Synapse would fail to handle malformed user IDs or room aliases gracefully in certain cases. ([\matrix-org#13746](matrix-org#13746))
- Fix a long-standing bug where device lists would remain cached when remote users left and rejoined the last room shared with the local homeserver. ([\matrix-org#13749](matrix-org#13749), [\matrix-org#13826](matrix-org#13826))
- Fix a long-standing bug that could cause stale caches in some rare cases on the first startup of Synapse with replication. ([\matrix-org#13766](matrix-org#13766))
- Fix a long-standing spec compliance bug where Synapse would accept a trailing slash on the end of `/get_missing_events` federation requests. ([\matrix-org#13789](matrix-org#13789))
- Delete associated data from `event_failed_pull_attempts`, `insertion_events`, `insertion_event_extremities`, `insertion_event_extremities`, `insertion_event_extremities` when purging the room. ([\matrix-org#13825](matrix-org#13825))

Improved Documentation
----------------------

- Note that `libpq` is required on ARM-based Macs. ([\matrix-org#13480](matrix-org#13480))
- Fix a mistake in the config manual introduced in Synapse 1.22.0: the `event_cache_size` _is_ scaled by `caches.global_factor`. ([\matrix-org#13726](matrix-org#13726))
- Fix a typo in the documentation for the login ratelimiting configuration. ([\matrix-org#13727](matrix-org#13727))
- Define Synapse's compatability policy for SQLite versions. ([\matrix-org#13728](matrix-org#13728))
- Add docs for the common fix of deleting the `matrix_synapse.egg-info/` directory for fixing Python dependency problems. ([\matrix-org#13785](matrix-org#13785))
- Update request log format documentation to mention the format used when the authenticated user is controlling another user. ([\matrix-org#13794](matrix-org#13794))

Deprecations and Removals
-------------------------

- Synapse will now refuse to start if configured to use SQLite < 3.27. ([\matrix-org#13760](matrix-org#13760))
- Don't include redundant `prev_state` in new events. Contributed by Denis Kariakin (@dakariakin). ([\matrix-org#13791](matrix-org#13791))

Internal Changes
----------------

- Add a stub Rust crate. ([\matrix-org#12595](matrix-org#12595), [\matrix-org#13734](matrix-org#13734), [\matrix-org#13735](matrix-org#13735), [\matrix-org#13743](matrix-org#13743), [\matrix-org#13763](matrix-org#13763), [\matrix-org#13769](matrix-org#13769), [\matrix-org#13778](matrix-org#13778))
- Bump the minimum dependency of `matrix_common` to 1.3.0 to make use of the `MXCUri` class. Use `MXCUri` to simplify media retention test code. ([\matrix-org#13162](matrix-org#13162))
- Add and populate the `event_stream_ordering` column on the `receipts` table for future optimisation of push action processing. Contributed by Nick @ Beeper (@Fizzadar). ([\matrix-org#13703](matrix-org#13703))
- Rename the `EventFormatVersions` enum values so that they line up with room version numbers. ([\matrix-org#13706](matrix-org#13706))
- Update trial old deps CI to use Poetry 1.2.0. ([\matrix-org#13707](matrix-org#13707), [\matrix-org#13725](matrix-org#13725))
- Add experimental configuration option to allow disabling legacy Prometheus metric names. ([\matrix-org#13714](matrix-org#13714), [\matrix-org#13717](matrix-org#13717), [\matrix-org#13718](matrix-org#13718))
- Fix typechecking with latest types-jsonschema. ([\matrix-org#13724](matrix-org#13724))
- Strip number suffix from instance name to consolidate services that traces are spread over. ([\matrix-org#13729](matrix-org#13729))
- Instrument `get_metadata_for_events` for understandable traces in Jaeger. ([\matrix-org#13730](matrix-org#13730))
- Remove old queries to join room memberships to current state events. Contributed by Nick @ Beeper (@Fizzadar). ([\matrix-org#13745](matrix-org#13745))
- Avoid raising an error due to malformed user IDs in `get_current_hosts_in_room`. Malformed user IDs cannot currently join a room, so this error would not be hit. ([\matrix-org#13748](matrix-org#13748))
- Update the docstrings for `get_users_in_room` and `get_current_hosts_in_room` to explain the impact of partial state. ([\matrix-org#13750](matrix-org#13750))
- Use an additional database query when persisting receipts. ([\matrix-org#13752](matrix-org#13752))
- Preparatory work for storing thread IDs for notifications and receipts. ([\matrix-org#13753](matrix-org#13753))
- Re-type hint some collections as read-only. ([\matrix-org#13754](matrix-org#13754))
- Remove unused Prometheus recording rules from `synapse-v2.rules` and add comments describing where the rest are used. ([\matrix-org#13756](matrix-org#13756))
- Add a check for editable installs if the Rust library needs rebuilding. ([\matrix-org#13759](matrix-org#13759))
- Tag traces with the instance name to be able to easily jump into the right logs and filter traces by instance. ([\matrix-org#13761](matrix-org#13761))
- Concurrently fetch room push actions when calculating badge counts. Contributed by Nick @ Beeper (@Fizzadar). ([\matrix-org#13765](matrix-org#13765))
- Update the script which makes full schema dumps. ([\matrix-org#13770](matrix-org#13770))
- Deduplicate `is_server_notices_room`. ([\matrix-org#13780](matrix-org#13780))
- Simplify the dependency DAG in the tests workflow. ([\matrix-org#13784](matrix-org#13784))
- Remove an old, incorrect migration file. ([\matrix-org#13788](matrix-org#13788))
- Remove unused method in `synapse.api.auth.Auth`. ([\matrix-org#13795](matrix-org#13795))
- Fix a memory leak when running the unit tests. ([\matrix-org#13798](matrix-org#13798))
- Use partial indices on SQLite. ([\matrix-org#13802](matrix-org#13802))
- Check that portdb generates the same postgres schema as that in the source tree. ([\matrix-org#13808](matrix-org#13808))
- Fix Docker build when Rust .so has been built locally first. ([\matrix-org#13811](matrix-org#13811))
- Complement: Initialise the Postgres database directly inside the target image instead of the base Postgres image to fix building using Buildah. ([\matrix-org#13819](matrix-org#13819))
- Support providing an index predicate clause when doing upserts. ([\matrix-org#13822](matrix-org#13822))
- Minor speedups to linting in CI. ([\matrix-org#13827](matrix-org#13827))

# -----BEGIN PGP SIGNATURE-----
#
# iQIzBAABCAAdFiEE1508oLYUKainYFJakD7OEIo53t0FAmMy4FoACgkQkD7OEIo5
# 3t009g/6A4S26H6NG4GM44JD9+OB25fO59m9UAWWLrePmOKsBaGXVp86scPq9epI
# vQbr4Czi8WEqCJlMRxIWLXv7BL3TLXnLF1vC0wSE6YiJqrPU9vMZ0UYxWNErl8Sr
# eFBpuHXDlfppQUXs903iNmXVbdTpXCVjdTEwaZmgU1/FKydgU0o90PQseb/hnegX
# hcJrepL6xhcs37fP2zdlixissLQ85WE10x6h7FX+SkCEHGvkiKrqSvXsS4ZN0Scn
# vGCy3GD69j/ZpRu7RczdDwwzCRerg6r7spokRK/b6pzz9sXmCyY8SrrUyEEdzveN
# uEEe4A8vmR3v0sR4Ao2cGZ7zy/jq6WyrWjmfOd+hfYD9tXx/g8190RFkRQrrkYVU
# jTNdD6Zom0rtENEgHuFQ8joD96MNsaq4dvDefYYpcXDOh3YZnA/fiwfmG9XZRq6u
# B42RZEtUZ3sjZ3VdRb3AvhPTTrY4kiwEqVBFSUTBKEAKXQtdrsiqv2QPvQTSC5BJ
# YFXMwj32E7Zi3mTYjl60yggtBAxYt49KsHL8qAq75t6A38HpnYEyEW1R3S6VDmtp
# rIR5ktxyzoGvxpJ4YPUdC4A2hbaOztwPvGE3iEDiPgUdkfb6m8x3MhaWhShid4Df
# v2BQu+SFIl1wXPLxjlX2rmkQMhUGD2RGYmkUgYfoWnjdTtQV10w=
# =4eYj
# -----END PGP SIGNATURE-----
# gpg: Signature made Tue Sep 27 12:36:58 2022 BST
# gpg:                using RSA key D79D3CA0B61429A8A760525A903ECE108A39DEDD
# gpg: Can't check signature: No public key

# Conflicts:
#	docker/Dockerfile
#	poetry.lock
#	synapse/push/bulk_push_rule_evaluator.py
#	synapse/push/push_tools.py
#	synapse/storage/databases/main/event_push_actions.py
#	synapse/storage/databases/main/events_worker.py
#	tests/replication/slave/storage/test_events.py
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
T-Task Refactoring, removal, replacement, enabling or disabling functionality, other engineering tasks.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants