Consider using OpenTelemetry for metrics instead of Prometheus #305

Open
grantr opened this issue Jan 24, 2019 · 21 comments
Labels: help wanted, kind/feature, lifecycle/frozen, priority/backlog

@grantr (Contributor) commented Jan 24, 2019

Using the Prometheus library to collect metrics mostly works fine, but it has some limitations: #258 wants to change the way metrics are aggregated, and #297 wants to add additional handlers to the manager's HTTP endpoint.

Maybe this is a far-out idea, but I wonder whether, at this early stage, it would be worth switching from the Prometheus client to OpenCensus for measurement. TL;DR: OpenCensus is a collection of libraries in multiple languages that facilitates the measurement and aggregation of metrics in-process and is agnostic to the export format used. It doesn't replace Prometheus the service, it just replaces Prometheus the Go library. OpenCensus can export to Prometheus servers, so this would be strictly an in-process change.

The OpenCensus Go library is similar to the Prometheus client, but separates the collection of metrics from their aggregation and export. This theoretically allows libraries to be instrumented without dictating how users will aggregate metrics (solving #258) and export metrics (solving #297), though default solutions can be provided for both (likely the same as today's default bucketing and Prometheus HTTP exporter).

Here's an example from knative/pkg of defining measures and views (aggregations): https://github.com/knative/pkg/blob/53b1235c2a85e1309825bc467b3bd54243c879e6/controller/stats_reporter.go. The view is defined separately from the measure, so library users can define their own views over library-defined measures.
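
For the sake of illustration, here is a minimal sketch of that split using the OpenCensus Go library; the measure/view names and bucket bounds are made up for this example, not taken from controller-runtime or knative/pkg:

```go
package main

import (
	"context"
	"log"

	"go.opencensus.io/stats"
	"go.opencensus.io/stats/view"
	"go.opencensus.io/tag"
)

// The library defines the measure (what gets recorded)...
var (
	reconcileLatencyMs = stats.Float64(
		"reconcile_latency", "Latency of a single reconcile", stats.UnitMilliseconds)
	controllerKey, _ = tag.NewKey("controller")
)

// ...while the user (or a provided default) defines the view (how it is aggregated).
var reconcileLatencyView = &view.View{
	Name:        "reconcile_latency_ms",
	Measure:     reconcileLatencyMs,
	Description: "Distribution of reconcile latency",
	Aggregation: view.Distribution(10, 100, 1000, 10000), // user-chosen buckets
	TagKeys:     []tag.Key{controllerKey},
}

func main() {
	if err := view.Register(reconcileLatencyView); err != nil {
		log.Fatal(err)
	}
	// Recording happens against the measure; the registered views decide aggregation.
	ctx, _ := tag.New(context.Background(), tag.Insert(controllerKey, "my-controller"))
	stats.Record(ctx, reconcileLatencyMs.M(42))
}
```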

And here's an example of exporting metrics to either Stackdriver or Prometheus: https://github.com/knative/pkg/blob/225d11cc1a40c0549701fb037d0eba48ee87dfe4/metrics/exporter.go. The user of the library can export views in whatever format they wish, independent of the measures and views that are defined.
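
A similarly hedged sketch of the export side, assuming the OpenCensus contrib Prometheus exporter (a Stackdriver exporter would be wired up the same way, just with a different package):

```go
package main

import (
	"log"
	"net/http"

	"contrib.go.opencensus.io/exporter/prometheus"
	"go.opencensus.io/stats/view"
)

func main() {
	// The exporter is chosen independently of how measures and views were defined.
	pe, err := prometheus.NewExporter(prometheus.Options{Namespace: "controller_runtime"})
	if err != nil {
		log.Fatal(err)
	}
	view.RegisterExporter(pe)

	// The exporter doubles as an http.Handler serving the familiar /metrics endpoint.
	http.Handle("/metrics", pe)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```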

It additionally has support for exporting traces, which IMO would be a useful debugging tool and a good use for the context arguments in the client interface (mentioned in #265). Threading the trace id into that context would give the controller author a nice overview of the entire reconcile, with spans for each request, cached or not.
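
To make that concrete, here is a rough sketch of what a per-reconcile span could look like with OpenCensus tracing; the span names and the Reconcile signature are illustrative, not an existing controller-runtime API:

```go
package reconciler

import (
	"context"

	"go.opencensus.io/trace"
)

// Reconcile wraps the whole loop in a span; any client call that receives
// this ctx can start child spans, whether the request hits the cache or not.
func Reconcile(ctx context.Context, key string) error {
	ctx, span := trace.StartSpan(ctx, "Reconcile")
	defer span.End()
	span.AddAttributes(trace.StringAttribute("key", key))

	// e.g. a Get against the cache or the API server could do:
	//   ctx, child := trace.StartSpan(ctx, "client.Get")
	//   defer child.End()
	return nil
}
```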

@DirectXMan12 (Contributor):

/kind feature

I do kind-of like this idea. Want to put together a PoC?

@k8s-ci-robot added the kind/feature label Jan 29, 2019
@DirectXMan12 (Contributor):

/priority backlog

@k8s-ci-robot added the priority/backlog label Jan 29, 2019
@DirectXMan12 (Contributor):

/good-first-issue

@k8s-ci-robot added the good first issue and help wanted labels Mar 6, 2019
@grantr (Contributor, Author) commented Mar 6, 2019

Working on this now!
/assign

@fejta-bot:

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Jun 4, 2019
@DirectXMan12 (Contributor):

/remove-lifecycle stale

@k8s-ci-robot removed the lifecycle/stale label Jun 5, 2019
@fejta-bot:

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Sep 3, 2019
@DirectXMan12 (Contributor):

/lifecycle frozen

@k8s-ci-robot added the lifecycle/frozen label and removed the lifecycle/stale label Sep 9, 2019
@negz (Contributor) commented Oct 1, 2019

I was just exploring controller-runtime's observability stance, noticed the kubebuilder book mentioned Prometheus metrics, thought "oh man, I wish they used OpenCensus instead", came here to thumbs-up this issue, and then found that past me had already done that. ;)

Throwing in my two cents as a controller-tools user who has instrumented systems in the past with both Prometheus's SDK and OpenCensus's: I vastly prefer the latter, for basically all the reasons @grantr outlines in this issue.

@DirectXMan12 (Contributor):

I'm going to add

/help

here. I'm thinking we might just want to push ahead with this. At this point, it's probably worth trying to make use of OpenTelemetry (the merged version of OpenCensus & OpenTracing), since that's the future of these efforts (https://github.com/open-telemetry/opentelemetry-go).

If anybody's willing to draw up a rough design of how things'd look and put forward a prototype, I'd be happy to review.
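
For reference, a minimal sketch of the same per-reconcile span using opentelemetry-go; note this is hedged guesswork against the tracing API as it later stabilized, since the library was still pre-1.0 when this was written:

```go
package reconciler

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
)

func Reconcile(ctx context.Context, key string) error {
	// otel.Tracer returns a tracer from the globally registered provider;
	// which exporter backs it (Prometheus, OTLP, stdout, ...) is configured elsewhere.
	ctx, span := otel.Tracer("sigs.k8s.io/controller-runtime").Start(ctx, "Reconcile")
	defer span.End()
	span.SetAttributes(attribute.String("key", key))

	// ... reconcile work, passing ctx down so child spans attach here ...
	return nil
}
```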

DirectXMan12 pushed a commit that referenced this issue Jan 31, 2020
@vincepri changed the title from "Consider using OpenCensus for metrics instead of Prometheus" to "Consider using OpenCensus/OpenTelemetry for metrics instead of Prometheus", added this to the v0.6.0 milestone, and removed the lifecycle/frozen label Feb 20, 2020
@vincepri (Member):

@grantr are you still interested in working on this?

/lifecycle frozen

@k8s-ci-robot added the lifecycle/frozen label Feb 20, 2020
@grantr (Contributor, Author) commented Feb 20, 2020

I think it's still valuable, but I don't have bandwidth to work on it. #368 is the final state of my attempt.

@hasheddan (Contributor):

I would love to help out here and will get started on implementation :)

/assign

@ncdc (Contributor) commented Sep 24, 2020

@hasheddan is this something you started on?

@vincepri changed the title from "Consider using OpenCensus/OpenTelemetry for metrics instead of Prometheus" to "Consider using OpenTelemetry for metrics instead of Prometheus" Sep 24, 2020
@hasheddan (Contributor):

@ncdc I unfortunately have not gotten around to it yet, but would still love to help out. Don't want to block anyone else who is already getting started though 👍

@ncdc (Contributor) commented Sep 24, 2020

No worries, just checking so folks don't step on your toes! Thanks for the update.

@vincepri (Member) commented Oct 8, 2020

@bboreham This group is interested in getting OpenTracing in Controller Runtime

cc @DirectXMan12

@bboreham (Contributor) commented Oct 9, 2020

@vincepri I presume you meant OpenTelemetry.
I opened a PR to make it easier to see what I've done: #1211

@evankanderson (Contributor):

A few comments from experience in Knative:

If you're just starting to add instrumentation, try to use the same data in the context for (structured) logging, tracing, and metrics alike. This has three benefits:

  • It reduces the amount of boilerplate you need to record the information, and increases the chance that you'll get your labels right.
  • It reduces the amount of "stuff" you have to carry around in the Context, which is a linked list.
  • It provides a consistent interface for observability (e.g. Record(ctx, measurement), Info(ctx, "message"), Start(ctx, "spanname")); a sketch of what such a facade could look like follows at the end of this comment.

Design your Resource schema ahead of time. For many applications, the application acts as a single tenant and only a single Resource is needed. Kubernetes controllers are not the typical application; I'd suggest defaulting to a separate Resource for each Kubernetes Resource, on the theory that k8s Resources probably represent an "entity which produces telemetry". In some cases, for short-lived resources (for example, a TaskRun in Tekton), it may make sense to associate the resource with its parent (i.e. the Task in Tekton).

Note that using Metrics and Traces in a multi-resource way requires passing provider options to NewAccumulator/NewTracerProvider... you probably want to cache these (on the context?), rather than creating a new one every time.
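
As an illustration of that consistent interface, here is a purely hypothetical facade (not an existing Knative or controller-runtime API) where the logger rides on the context alongside the OpenCensus tags and spans:

```go
// Hypothetical sketch: a thin facade where the logger, trace span, and metric
// tags all ride on the context, so every signal shares the same labels.
package obs

import (
	"context"

	"github.com/go-logr/logr"
	"go.opencensus.io/stats"
	"go.opencensus.io/trace"
)

type loggerKey struct{}

// WithLogger stashes a logger next to whatever tags/spans OpenCensus already keeps on ctx.
func WithLogger(ctx context.Context, log logr.Logger) context.Context {
	return context.WithValue(ctx, loggerKey{}, log)
}

// Info logs through whatever logger the context carries.
func Info(ctx context.Context, msg string, keysAndValues ...interface{}) {
	if log, ok := ctx.Value(loggerKey{}).(logr.Logger); ok {
		log.Info(msg, keysAndValues...)
	}
}

// Record forwards to OpenCensus; tags already on ctx become metric labels.
func Record(ctx context.Context, ms ...stats.Measurement) {
	stats.Record(ctx, ms...)
}

// Start opens a span; the returned ctx carries it so child spans nest correctly.
func Start(ctx context.Context, name string) (context.Context, *trace.Span) {
	return trace.StartSpan(ctx, name)
}
```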

@DirectXMan12 (Contributor):

Some of the work in that regard has already been done for us by @dashpole's work on the apiserver tracing KEP, so we should consider that as well.

@alvaroaleman (Member):

Removing the good first issue label, as it is not quite clear whether we want this and, if so, how it would look.
/remove-good-first-issue
