Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prometheus Exemplars Support #448

Conversation

EvgeniaMartynova-thebeat
Copy link

@EvgeniaMartynova-thebeat EvgeniaMartynova-thebeat commented Dec 21, 2021

Which problem is this PR solving?

Enabling Prometheus Exemplars. This PR is a copy of the POC made by @ioannissavvaidis.

Short description of the changes

  • Added support for Prometheus Exemplars

  • Updated Observarbility.md documentation

  • Enabled Prometheus Exemplars for request latency and counter metric in

    • ../client/http/http.go
    • ../client/amqp/v2/amqp.go
    • ../client/es/elasticsearch.go
    • ../client/grpc/grpc.go
    • ../client/redis/redis.go
    • ../client/sns/v2/publisher.go
    • ../client/sql/publisher.go
    • ../client/sns/v2/publisher.go
    • ../component/amqp
    • ../component/grpc
    • ../component/kafka
    • ..component/http/middleware
    • ../component/sqs
    • ../client/kafka/v2
  • Fixed a way how we record "success" metric. Mark metric entry as "success" if err == nil before we marked it as success if err != nil

siavashs
siavashs previously approved these changes Dec 21, 2021
Copy link

@mantzas mantzas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR handles only HTTP but we need to handle all possible cases here:

  • gRPC
  • Kafka
  • AMQP
  • etc
    So this PR should include all of them

client/http/http.go Outdated Show resolved Hide resolved

traceID, ok := wct.ExtractTraceID(req.Context())
if ok {
durationHistogram.(prometheus.ExemplarObserver).ObserveWithExemplar(
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a way to type assert safely which is:
v, ok := x.(T) // ok will be either true or false

@tpaschalis
Copy link

Hello @EvgeniaMartynova-thebeat and thanks for opening your first PR for Patron! 🎉 We really appreciate you taking the time to improve our shared tooling!

I have no idea what Prometheus Exemplars are all about (admittedly, I've got some reading to do 😄) but I wanted to clarify a couple of points.

  • Is this change backwards-compatible, or do we have to add a feature flag, an option, or something so that it can be enabled optionally?
  • Is this something that is specific for our Grafana-based observability stack or something that works generally?

@siavashs
Copy link

Hi @tpaschalis Exemplars are part of OpenMetrics Spec. They are an optional feature and as specified in the spec, the ingestors may ignore them so they are backwards compatible.
The observability tools (Prometheus, Thanos, Grafana) optionally support Exemplars https://grafana.com/docs/grafana/latest/basics/exemplars/

…pc, sql, redis, etc; put success label as err == nil for all of the clients
@EvgeniaMartynova-thebeat
Copy link
Author

Hello @EvgeniaMartynova-thebeat and thanks for opening your first PR for Patron! 🎉 We really appreciate you taking the time to improve our shared tooling!

I have no idea what Prometheus Exemplars are all about (admittedly, I've got some reading to do 😄) but I wanted to clarify a couple of points.

  • Is this change backwards-compatible, or do we have to add a feature flag, an option, or something so that it can be enabled optionally?
  • Is this something that is specific for our Grafana-based observability stack or something that works generally?

Thank you for your quick review, @tpaschalis
Just an addition to what @siavashs said: the changes are backward-compatible when by enabling exemplars a record will be added after # for a metric entry like:

safety_perception_dynamodb_latency_seconds_bucket{method="GetPassengerProperties",le="0.005"} 2 # {traceID="4314cf66020d9262"} 2.4e-07 1.6367215682210166e+09

It is now part of Open Metrics format: https://grafana.com/blog/2021/03/31/intro-to-exemplars-which-enable-grafana-tempos-distributed-tracing-at-massive-scale/

@EvgeniaMartynova-thebeat
Copy link
Author

  • ../client/http/http.go
  • ../client/amqp/v2/amqp.go
  • ../client/es/elasticsearch.go
  • ../client/grpc/grpc.go
  • ../client/redis/redis.go
  • ../client/sns/v2/publisher.go
  • ../client/sql/publisher.go
  • ../client/sns/v2/publisher.go

Added exemplars for the following clients:
../client/http/http.go
../client/amqp/v2/amqp.go
../client/es/elasticsearch.go
../client/grpc/grpc.go
../client/redis/redis.go
../client/sns/v2/publisher.go
../client/sql/publisher.go
../client/sns/v2/publisher.go

Since we use a counter vector in Kafka client we cannot use exemplars there.

Copy link

@mantzas mantzas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Components are also missing (Kafka, AWS SQS, etc.). We have only the clients covered.

@@ -4,6 +4,7 @@ package grpc

import (
"context"
"github.com/uber/jaeger-client-go"
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that the imports are not following the standard layout. check with go fmt or goimports which can be setup in any IDE.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thank you, @mantzas, installed goimports and ran on the changed files

durationHistogram := rpcDurationMetrics.WithLabelValues(unary, target, method, rpcStatus.Code().String())
if sctx, ok := span.Context().(jaeger.SpanContext); ok {
durationHistogram.(prometheus.ExemplarObserver).ObserveWithExemplar(
invokeDuration.Seconds(), prometheus.Labels{"traceID": sctx.TraceID().String()},
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"traceID" seems to be a good candidate as a constant in the trace package.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

created a constant as suggested

run goimports
added support of Prometheus exemplars to components http middleware and gRPC observability
@EvgeniaMartynova-thebeat
Copy link
Author

EvgeniaMartynova-thebeat commented Dec 31, 2021

Components are also missing (Kafka, AWS SQS, etc.). We have only the clients covered.

introduced Prometheus Exemplars also in
component/http/middleware.go
component/grpc/observability.go

For the other components it seems a server side latency is not measured, do you think it make sense to introduce such metric in this PR for the other components?

@mantzas
Copy link

mantzas commented Dec 31, 2021

other components it

We might not measure latency but we measure message count over time which can be augmented with exemplars in order to give us more metadata. Is this not the case or a good use case? WDYT?

@EvgeniaMartynova-thebeat
Copy link
Author

other components it

We might not measure latency but we measure message count over time which can be augmented with exemplars in order to give us more metadata. Is this not the case or a good use case? WDYT?

Good point! Just found out that a counter is implementing ExemplarAdder interface, I will try to use it in order to have exemplars for counters.

@EvgeniaMartynova-thebeat
Copy link
Author

other components it

We might not measure latency but we measure message count over time which can be augmented with exemplars in order to give us more metadata. Is this not the case or a good use case? WDYT?

Good point! Just found out that a counter is implementing ExemplarAdder interface, I will try to use it in order to have exemplars for counters.

added support of Prometheus Exemplars for request/message CounterVec metric for
component/amqp
component/grpc
component/kafka
component/http/middleware
component/sqs
client/kafka/v2 - partially, because otherwise it would require a change in a signature of exported methods

Copy link

@mantzas mantzas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should only have an empty line in the imports that separate the system from the other packages.

Looking at this and all the other example I am wondering if we should introduce some abstraction to get rid of this repetitive code. Example on the counter: If we introduce a new counter in the trace package that abstracts the prometheus counter we could implemented methods to increase, decrease, add that also take care of the exemplar business.
We have the following option to move into this direction if it makes sense. Make it part of this PR or create a new ticket in Patron to cleanup all of this in a separate PR.

"strconv"
"time"

"github.com/uber/jaeger-client-go"

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

empty line should be removed

@@ -13,12 +12,14 @@ import (
"strings"
"time"

"github.com/uber/jaeger-client-go"

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

empty line

"time"

"github.com/uber/jaeger-client-go"

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same

"io"
"net/http"
"strconv"
"time"

"github.com/uber/jaeger-client-go"

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same

"github.com/beatlabs/patron/trace"
"github.com/opentracing/opentracing-go"
"github.com/uber/jaeger-client-go"

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

empty line

"strconv"
"time"

"github.com/uber/jaeger-client-go"

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same

@EvgeniaMartynova-thebeat
Copy link
Author

We should only have an empty line in the imports that separate the system from the other packages.

Looking at this and all the other example I am wondering if we should introduce some abstraction to get rid of this repetitive code. Example on the counter: If we introduce a new counter in the trace package that abstracts the prometheus counter we could implemented methods to increase, decrease, add that also take care of the exemplar business. We have the following option to move into this direction if it makes sense. Make it part of this PR or create a new ticket in Patron to cleanup all of this in a separate PR.

I also don't like this code duplication, I will do what you are suggesting in this PR.

@EvgeniaMartynova-thebeat
Copy link
Author

We should only have an empty line in the imports that separate the system from the other packages.
Looking at this and all the other example I am wondering if we should introduce some abstraction to get rid of this repetitive code. Example on the counter: If we introduce a new counter in the trace package that abstracts the prometheus counter we could implemented methods to increase, decrease, add that also take care of the exemplar business. We have the following option to move into this direction if it makes sense. Make it part of this PR or create a new ticket in Patron to cleanup all of this in a separate PR.

I also don't like this code duplication, I will do what you are suggesting in this PR.

Used a decorator pattern to move all repetitive Exemplars related code to trace package. Calling the decorator methods to add/inc/observe metrics.

trace/metric.go Show resolved Hide resolved
trace/metric.go Outdated Show resolved Hide resolved
trace/metric_test.go Outdated Show resolved Hide resolved
mantzas
mantzas previously approved these changes Jan 10, 2022
Copy link

@siavashs siavashs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work @EvgeniaMartynova-thebeat
I just added a minor comment regarding documentation.

Comment on lines 14 to 16
[Prometheus Exemplars](https://grafana.com/docs/grafana/latest/basics/exemplars/)
it became possible to move seamlessly between metrics, logs, and traces.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this can be rephrased as any OpenTracing compatible tracing system (Jaeger and Tempo so far only AFAIK) can work with Prometheus Exemplars, not just Tempo.
The second part regarding seamless integration between metrics, logs and traces is a Grafana specific feature.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rephrased according to the comment, removed the second part.

mantzas
mantzas previously approved these changes Jan 11, 2022
@mantzas mantzas requested a review from siavashs January 11, 2022 08:00
siavashs
siavashs previously approved these changes Jan 11, 2022
Copy link

@siavashs siavashs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎖️

trace/metric.go Outdated
spanFromCtx := opentracing.SpanFromContext(ctx)
if spanFromCtx != nil {
if sctx, ok := spanFromCtx.Context().(jaeger.SpanContext); ok {
c.Counter.(prometheus.ExemplarAdder).AddWithExemplar(
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I'm missing the point about why do we need to make type assertions. If Counter is a prometheus.ExemplarAdder, then why can't we make this field a prometheus.ExemplarAdder,?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A Counter is a prometheus.Counter interface.
Theoretically there could be a bespoke counter implementing prometheus.Counter interface but not prometheus.ExemplarAdder interface, that is why we need a type assertion here.
Or am I missing your point here?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But if we talk about evolutions and that the Counter may be of another type later on, then this code will panic (example). How are we going to handle such a case?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks a lot, @teivah, indeed you caught a potential panic, fixed, I applied the same change for Histogram struct in the same file.

if tt.expectedPanic {
defer func() {
if r := recover(); r == nil {
t.Errorf("Add method did not panic.")
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
t.Errorf("Add method did not panic.")
t.Error("Add method did not panic.")

if tt.expectedPanic {
defer func() {
if r := recover(); r == nil {
t.Errorf("Add method did not panic.")
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
t.Errorf("Add method did not panic.")
t.Error("Add method did not panic.")

}

pb := &dto.Metric{}
m.Write(pb)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error isn't handled. I suggest that we ignore it explicitly at least:

Suggested change
m.Write(pb)
_ = m.Write(pb)

@teivah
Copy link

teivah commented Jan 13, 2022

And also, CI is failing.

@teivah teivah requested a review from mantzas January 20, 2022 08:47
@mantzas mantzas requested a review from siavashs January 20, 2022 09:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants