Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Strip number suffix from instance name to consolidate services that traces are spread over #13729

Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions changelog.d/13729.misc
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
Strip number suffix from instance name to consolidate services that traces are spread over.
MadLittleMods marked this conversation as resolved.
Show resolved Hide resolved
13 changes: 12 additions & 1 deletion synapse/logging/opentracing.py
Original file line number Diff line number Diff line change
Expand Up @@ -203,6 +203,9 @@ def set_fates(clotho, lachesis, atropos, father="Zues", mother="Themis"):

# Helper class

# Matches the number suffix in an instance name like "matrix.org client_reader-8"
STRIP_INSTANCE_NUMBER_SUFFIX_REGEX = re.compile(r"[_-]?\d+$")


class _DummyTagNames:
"""wrapper of opentracings tags. We need to have them if we
Expand Down Expand Up @@ -441,9 +444,17 @@ def init_tracer(hs: "HomeServer") -> None:

from jaeger_client.metrics.prometheus import PrometheusMetricsFactory

# Instance names are opaque strings but by stripping off the number suffix,
# we can get something that looks like a "worker type", e.g.
# "client_reader-1" -> "client_reader" so we don't spread the traces across
# so many services.
instance_name_by_type = re.sub(
STRIP_INSTANCE_NUMBER_SUFFIX_REGEX, "", hs.get_instance_name()
)

config = JaegerConfig(
config=hs.config.tracing.jaeger_config,
service_name=f"{hs.config.server.server_name} {hs.get_instance_name()}",
Copy link
Contributor Author

@MadLittleMods MadLittleMods Sep 6, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did we get any benefit from splitting up the traces by instance or instance type before?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only reason I've been holding off from doing this is that we lose the "colouring" of the traces, i.e. it becomes much harder to tell when processing moves to a different process. Not sure if there's another way of achieving the same effect?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm afraid I would second Erik's sort-of objection here; although having to find the right service is a minor pain, I really find the colouring useful because it helps to spot the points where the processing moves over replication.

Perhaps something else to consider: I wonder if e.g. Grafana's interface for viewing traces is better for finding traces regardless of service?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could also do something like change the service name to the worker type? I.e. the instance name without the index. That way it'd be easier to find stuff (as you don't have to figure out the instance), and for the most part instances don't talk to the same type?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could also do something like change the service name to the worker type?

This was going to be my suggestion!

We could also add a tag which is the entire instance name if you did want to filter by that. I think the only time that is useful is if you want to see what happens when an instance is under load or something, but...I don't think this is the right interface for that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only reason I've been holding off from doing this is that we lose the "colouring" of the traces, i.e. it becomes much harder to tell when processing moves to a different process.

I really find the colouring useful because it helps to spot the points where the processing moves over replication.

Ahhh, makes sense. I have been mostly looking at the RoomMessageListRestServlet which is all turqoise in one process.

For reference, an example that spans multiple processes is RoomStateEventRestServlet

just note that 'the instance name without the index' is not a generic Synapse concept; it's our matrix.org Ansible stuff that has this convention. To Synapse, the instance name is opaque and just so happens to include a number.,,

Any tips for how to accomplish because this is what I also saw looking at this yesterday and just went with - Synapse?

We could strip number suffixes from the opaque worker name. Can we consider that a safe-enough convention?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wouldn't be too offended if we took an educated guess and stripped "[_-]?[0-9]+$" off the end of the worker name. I think it's mainly just us/developers that would be using Jaeger anyway, so I don't consider it too harmful.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could make it configurable?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've updated to the sensible default of stripping off the number suffix. Our matrix.org worker naming convention seems sane and we can avoid the config.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could also add a tag which is the entire instance name if you did want to filter by that. I think the only time that is useful is if you want to see what happens when an instance is under load or something, but...I don't think this is the right interface for that.

BTW +1 to including the real instance name as a tag. I've found it useful to filter out a specific process that was struggling before now. It also gives you a place to look for the right logs when viewing a trace.

Split that out to #13761

service_name=f"{hs.config.server.server_name} {instance_name_by_type}",
scope_manager=LogContextScopeManager(),
metrics_factory=PrometheusMetricsFactory(),
)
Expand Down