Consider using OpenTelemetry for metrics instead of Prometheus #305

Open
grantr opened this issue Jan 24, 2019 · 21 comments
Labels: help wanted, kind/feature, lifecycle/frozen, priority/backlog

@grantr (Contributor) commented Jan 24, 2019

Using the Prometheus library to collect metrics mostly works fine, but it has some limitations: #258 wants to change the way metrics are aggregated, and #297 wants to add additional handlers to the manager's HTTP endpoint.

Maybe this is a far-out idea, but I wonder whether, at this early stage, it would be worth switching from the Prometheus client to OpenCensus for measurement. TL;DR: OpenCensus is a collection of libraries in multiple languages that facilitates the measurement and aggregation of metrics in-process and is agnostic to the export format used. It doesn't replace Prometheus the service, it just replaces Prometheus the Go library. OpenCensus can export to Prometheus servers, so this would be strictly an in-process change.

The OpenCensus Go library is similar to the Prometheus client, but separates the collection of metrics from their aggregation and export. This theoretically allows libraries to be instrumented without dictating how users will aggregate metrics (solving #258) and export metrics (solving #297), though default solutions can be provided for both (likely the same as today's default bucketing and Prometheus HTTP exporter).

Here's an example from knative/pkg of defining measures and views (aggregations): https://github.com/knative/pkg/blob/53b1235c2a85e1309825bc467b3bd54243c879e6/controller/stats_reporter.go. The view is defined separately from the measure, so library users can define their own views over library-defined measures.
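
For the sake of illustration, here is a minimal sketch of that split using the OpenCensus Go library; the measure/view names and bucket bounds are made up for this example, not taken from controller-runtime or knative/pkg:

```go
package main

import (
	"context"
	"log"

	"go.opencensus.io/stats"
	"go.opencensus.io/stats/view"
	"go.opencensus.io/tag"
)

// The library defines the measure (what gets recorded)...
var (
	reconcileLatencyMs = stats.Float64(
		"reconcile_latency", "Latency of a single reconcile", stats.UnitMilliseconds)
	controllerKey, _ = tag.NewKey("controller")
)

// ...while the user (or a provided default) defines the view (how it is aggregated).
var reconcileLatencyView = &view.View{
	Name:        "reconcile_latency_ms",
	Measure:     reconcileLatencyMs,
	Description: "Distribution of reconcile latency",
	Aggregation: view.Distribution(10, 100, 1000, 10000), // user-chosen buckets
	TagKeys:     []tag.Key{controllerKey},
}

func main() {
	if err := view.Register(reconcileLatencyView); err != nil {
		log.Fatal(err)
	}
	// Recording happens against the measure; the registered views decide aggregation.
	ctx, _ := tag.New(context.Background(), tag.Insert(controllerKey, "my-controller"))
	stats.Record(ctx, reconcileLatencyMs.M(42))
}
```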

And here's an example of exporting metrics to either Stackdriver or Prometheus: https://github.com/knative/pkg/blob/225d11cc1a40c0549701fb037d0eba48ee87dfe4/metrics/exporter.go. The user of the library can export views in whatever format they wish, independent of the measures and views that are defined.
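
A similarly hedged sketch of the export side, assuming the OpenCensus contrib Prometheus exporter (a Stackdriver exporter would be wired up the same way, just with a different package):

```go
package main

import (
	"log"
	"net/http"

	"contrib.go.opencensus.io/exporter/prometheus"
	"go.opencensus.io/stats/view"
)

func main() {
	// The exporter is chosen independently of how measures and views were defined.
	pe, err := prometheus.NewExporter(prometheus.Options{Namespace: "controller_runtime"})
	if err != nil {
		log.Fatal(err)
	}
	view.RegisterExporter(pe)

	// The exporter doubles as an http.Handler serving the familiar /metrics endpoint.
	http.Handle("/metrics", pe)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```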

It additionally has support for exporting traces, which IMO would be a useful debugging tool and a good use for the context arguments in the client interface (mentioned in #265). Threading the trace id into that context would give the controller author a nice overview of the entire reconcile, with spans for each request, cached or not.
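
To make that concrete, here is a rough sketch of what a per-reconcile span could look like with OpenCensus tracing; the span names and the Reconcile signature are illustrative, not an existing controller-runtime API:

```go
package reconciler

import (
	"context"

	"go.opencensus.io/trace"
)

// Reconcile wraps the whole loop in a span; any client call that receives
// this ctx can start child spans, whether the request hits the cache or not.
func Reconcile(ctx context.Context, key string) error {
	ctx, span := trace.StartSpan(ctx, "Reconcile")
	defer span.End()
	span.AddAttributes(trace.StringAttribute("key", key))

	// e.g. a Get against the cache or the API server could do:
	//   ctx, child := trace.StartSpan(ctx, "client.Get")
	//   defer child.End()
	return nil
}
```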

@DirectXMan12 (Contributor):

/kind feature

I do kind-of like this idea. Want to put together a PoC?

@k8s-ci-robot added the kind/feature label Jan 29, 2019
@DirectXMan12 (Contributor):

/priority backlog

@k8s-ci-robot added the priority/backlog label Jan 29, 2019
@DirectXMan12 (Contributor):

/good-first-issue

@k8s-ci-robot added the good first issue and help wanted labels Mar 6, 2019
@grantr (Contributor, Author) commented Mar 6, 2019

Working on this now!
/assign

@fejta-bot:

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Jun 4, 2019
@DirectXMan12 (Contributor):

/remove-lifecycle stale

@k8s-ci-robot removed the lifecycle/stale label Jun 5, 2019
@fejta-bot:

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Sep 3, 2019
@DirectXMan12 (Contributor):

/lifecycle frozen

@k8s-ci-robot added the lifecycle/frozen label and removed the lifecycle/stale label Sep 9, 2019
@negz (Contributor) commented Oct 1, 2019

I was just exploring controller-runtime's observability stance, noticed the kubebuilder book mentioned Prometheus metrics, thought "oh man, I wish they used OpenCensus instead", came here to thumbs-up this issue, and then found that past me had already done that. ;)

Throwing in my two cents as a controller-tools user who has instrumented systems in the past with both Prometheus's SDK and OpenCensus's: I vastly prefer the latter, for basically all the reasons @grantr outlines in this issue.

@DirectXMan12 (Contributor):

I'm going to add

/help

here. I'm thinking we might just want to push ahead with this. At this point, it's probably worth trying to make use of OpenTelemetry (the merged version of OpenCensus & OpenTracing), since that's the future of these efforts (https://github.com/open-telemetry/opentelemetry-go).

If anybody's willing to draw up a rough design of how things'd look and put forward a prototype, I'd be happy to review.
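
For reference, a minimal sketch of the same per-reconcile span using opentelemetry-go; note this is hedged guesswork against the tracing API as it later stabilized, since the library was still pre-1.0 when this was written:

```go
package reconciler

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
)

func Reconcile(ctx context.Context, key string) error {
	// otel.Tracer returns a tracer from the globally registered provider;
	// which exporter backs it (Prometheus, OTLP, stdout, ...) is configured elsewhere.
	ctx, span := otel.Tracer("sigs.k8s.io/controller-runtime").Start(ctx, "Reconcile")
	defer span.End()
	span.SetAttributes(attribute.String("key", key))

	// ... reconcile work, passing ctx down so child spans attach here ...
	return nil
}
```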

DirectXMan12 pushed a commit that referenced this issue Jan 31, 2020
@vincepri changed the title from "Consider using OpenCensus for metrics instead of Prometheus" to "Consider using OpenCensus/OpenTelemetry for metrics instead of Prometheus", added this to the v0.6.0 milestone, and removed the lifecycle/frozen label Feb 20, 2020
@vincepri (Member):

@grantr are you still interested in working on this?

/lifecycle frozen

@k8s-ci-robot added the lifecycle/frozen label Feb 20, 2020
@grantr (Contributor, Author) commented Feb 20, 2020

I think it's still valuable, but I don't have bandwidth to work on it. #368 is the final state of my attempt.

@hasheddan (Contributor):

I would love to help out here and will get started on implementation :)

/assign

@ncdc (Contributor) commented Sep 24, 2020

@hasheddan is this something you started on?

@vincepri changed the title from "Consider using OpenCensus/OpenTelemetry for metrics instead of Prometheus" to "Consider using OpenTelemetry for metrics instead of Prometheus" Sep 24, 2020
@hasheddan (Contributor):

@ncdc I unfortunately have not gotten around to it yet, but would still love to help out. Don't want to block anyone else who is already getting started though 👍

@ncdc (Contributor) commented Sep 24, 2020

No worries, just checking so folks don't step on your toes! Thanks for the update.

@vincepri (Member) commented Oct 8, 2020

@bboreham This group is interested in getting OpenTracing in Controller Runtime

cc @DirectXMan12

@bboreham (Contributor) commented Oct 9, 2020

@vincepri I presume you meant OpenTelemetry.
I opened a PR to make it easier to see what I've done: #1211

@evankanderson (Contributor):

A few comments from experience in Knative:

If you're just starting to add instrumentation, try to use the same data in the context for (structured) logging, tracing, and metrics alike. This has three benefits:

  • It reduces the amount of boilerplate you need to record the information, and increases the chance that you'll get your labels right.
  • It reduces the amount of "stuff" you have to carry around in the Context, which is a linked list.
  • It provides a consistent interface for observability (e.g. Record(ctx, measurement), Info(ctx, "message"), Start(ctx, "spanname")); a sketch of what such a facade could look like follows at the end of this comment.

Design your Resource schema ahead of time. For many applications, the application acts as a single tenant and only a single Resource is needed. Kubernetes controllers are not the typical application; I'd suggest defaulting to a separate Resource for each Kubernetes Resource, on the theory that k8s Resources probably represent an "entity which produces telemetry". In some cases, for short-lived resources (for example, a TaskRun in Tekton), it may make sense to associate the resource with its parent (i.e. the Task in Tekton).

Note that using Metrics and Traces in a multi-resource way requires passing provider options to NewAccumulator/NewTracerProvider... you probably want to cache these (on the context?), rather than creating a new one every time.
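
As an illustration of that consistent interface, here is a purely hypothetical facade (not an existing Knative or controller-runtime API) where the logger rides on the context alongside the OpenCensus tags and spans:

```go
// Hypothetical sketch: a thin facade where the logger, trace span, and metric
// tags all ride on the context, so every signal shares the same labels.
package obs

import (
	"context"

	"github.com/go-logr/logr"
	"go.opencensus.io/stats"
	"go.opencensus.io/trace"
)

type loggerKey struct{}

// WithLogger stashes a logger next to whatever tags/spans OpenCensus already keeps on ctx.
func WithLogger(ctx context.Context, log logr.Logger) context.Context {
	return context.WithValue(ctx, loggerKey{}, log)
}

// Info logs through whatever logger the context carries.
func Info(ctx context.Context, msg string, keysAndValues ...interface{}) {
	if log, ok := ctx.Value(loggerKey{}).(logr.Logger); ok {
		log.Info(msg, keysAndValues...)
	}
}

// Record forwards to OpenCensus; tags already on ctx become metric labels.
func Record(ctx context.Context, ms ...stats.Measurement) {
	stats.Record(ctx, ms...)
}

// Start opens a span; the returned ctx carries it so child spans nest correctly.
func Start(ctx context.Context, name string) (context.Context, *trace.Span) {
	return trace.StartSpan(ctx, name)
}
```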

@DirectXMan12 (Contributor):

Some of the work in that regard has already been done for us by @dashpole's work on the apiserver tracing KEP, so we should consider that as well.

@alvaroaleman (Member):

Removing the good first issue label, as it is not quite clear whether we want this and, if so, how it would look.
/remove-good-first-issue
