Use a non-global metrics registry in Teleport #50913

Merged: 3 commits, Jan 10, 2025

@hugoShaka hugoShaka commented Jan 9, 2025

This PR adds a new non-global per-process metrics registry in Teleport.

Using the global registry and global metrics causes conflicts in tests as we are starting multiple Teleport processes and/or other non-teleport processes (tbot, the operator, ...).

Having a new per-process metrics registry will allow Teleport services to register metrics scoped to their Teleport process. This will reduce the conflicts happening in tests.

To ensure backward compatibility, the Teleport metrics server serves both the process-scoped registry and the global registry.

Required for the autoupdate controller metrics PR.

@hugoShaka

I didn't want to add a metrics RFD, but it would be good to start using the process registry instead of the global one for the next features we build/metrics we add.

@hugoShaka hugoShaka added the no-changelog, backport/branch/v15, backport/branch/v16, and backport/branch/v17 labels Jan 9, 2025

@codingllama codingllama left a comment

Looks good to me, but I'll let the experts approve first.

// and the global registry (used by some Teleport services and many dependencies).
gatherers := prometheus.Gatherers{
	prometheus.DefaultGatherer,
	process.metricsRegistry,
Contributor

If conflicting metrics are registered I assume they'll be dropped, but unaffected metrics will keep working. Do you know if that's correct?

@hugoShaka hugoShaka Jan 9, 2025

> If conflicting metrics are registered I assume they'll be dropped

Currently, registration conflicts in the global registry can cause:

  • hard failure / error returned
  • panics
  • silent failure (metric does not get registered and we don't know about it)

Adding a local registry will not change the failure modes when a conflict happens within a single registry. However, we are adding a new failure mode: metrics conflicting between the local and the global registry. In this case, the global registry will prevail (I did this for backward-compatibility reasons, as everything is using the global registry today).

As we start using the local registry more, we might create such hard-to-detect conflicts. The situation is not strictly worse than today (we already have some racy metric registration with silent failure going on 😬). To ensure no conflicts happen, we can prefix new metrics by wrapping the registry when passing it to the service.

I think we would benefit from a metrics guideline: setting the Teleport component as the metric subsystem would reduce the probability of conflicts.

Contributor

Thanks, very informative.

@hugoShaka hugoShaka force-pushed the hugo/teleport-use-non-global-metrics-registry branch from 853ac49 to 7be0540 Compare January 9, 2025 20:28
@hugoShaka hugoShaka force-pushed the hugo/teleport-use-non-global-metrics-registry branch from d153fda to fa31b4a Compare January 9, 2025 22:46
lib/service/service.go
Comment on lines +3447 to +3448
// As we move more things to the local registry, especially in other tools like tbot, we will have less
// conflicts in tests.
Contributor

Is it advantageous to move all of our current global metrics to the local registry? If so what kind of migration strategy should we have to eliminate global metrics?

@hugoShaka hugoShaka Jan 10, 2025

> Is it advantageous to move all of our current global metrics to the local registry?

I think so because we will:

  • stop having conflicts between tbot, teleport and other programs when running in the same test
  • start having accurate metrics when running multiple teleport components together (e.g. in tests, or embedded tbot)
  • stop picking up random metrics declared by dependencies we have

> If so what kind of migration strategy should we have to eliminate global metrics?

I haven't thought about this yet, but by supporting both we can take our time with the transition. I'd like to get a few new metrics using the local registry before choosing a recommended pattern. Once we know how we want metrics to be declared and collected, we can write a short metrics RFD and start passing the local registry to the different services.

Migrating might not be trivial because we are heavily relying on package-scoped metrics and global registries. We will need to:

  • propagate the registerer from the main process to every service registering metrics
  • start putting metrics in structs instead of a package-scoped var
  • get rid of the sync.Once and other hacks currently in place to avoid double registration

I think tbot is a very good starting point because of its limited scope, the conflicts caused by embeddedtbot, and the conflicts it causes in integration tests.

Co-authored-by: rosstimothy <39066650+rosstimothy@users.noreply.github.com>
@hugoShaka hugoShaka enabled auto-merge January 10, 2025 15:38
@hugoShaka hugoShaka added this pull request to the merge queue Jan 10, 2025
Merged via the queue into master with commit 5b5bab9 Jan 10, 2025
41 checks passed
@hugoShaka hugoShaka deleted the hugo/teleport-use-non-global-metrics-registry branch January 10, 2025 16:03
@public-teleport-github-review-bot

@hugoShaka See the table below for backport results.

Branch Result
branch/v15 Failed
branch/v16 Failed
branch/v17 Failed
