-
Notifications
You must be signed in to change notification settings - Fork 42
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add metrics for outbound API calls #1360
Conversation
cmd/dev/app/rule_type/rule_type.go
Outdated
@@ -130,6 +131,7 @@ func testCmdRun(cmd *cobra.Command, _ []string) error { | |||
}, | |||
db.ProviderAccessToken{}, | |||
token, | |||
provtelemetry.NewNoopMetrics(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't love the fact that I added a new parameter to a call that's used so widely. Maybe the provider builders could default to noop metrics and we could have an WithMetrics
option that could be added where needed? The "typed" provider builders like NewRest or NewGithub would then require an explicit metric to be passed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I actually really like this idea. Having a default NoopMetrics
implementation and adding the implementation with a With
function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 to a functional options pattern.
Does this also mean that the rule type commands actually make calls to providers, or is that just an artifact of how we construct them?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the rule type command is a developer-facing command. It's instantiating the rule types in the same manner as the engine normally does and then the rule types are making the provider calls.
cmd/dev/app/rule_type/rule_type.go
Outdated
@@ -130,6 +131,7 @@ func testCmdRun(cmd *cobra.Command, _ []string) error { | |||
}, | |||
db.ProviderAccessToken{}, | |||
token, | |||
provtelemetry.NewNoopMetrics(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I actually really like this idea. Having a default NoopMetrics
implementation and adding the implementation with a With
function.
cmd/server/app/serve.go
Outdated
|
||
s, err := controlplane.NewServer(store, evt, serverMetrics, cfg, vldtr) | ||
s, err := controlplane.NewServer(store, evt, serverMetrics, providerMetrics, cfg, vldtr) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you could use the options pattern (WithMetrics
or something of the sort) in the server, engine and reconcilers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where does the actual instrumentation happen? Is there some magic function that does that? cause I don't see that happening 😕
} | ||
|
||
func (irt *instrumentedRoundTripper) RoundTrip(r *http.Request) (*http.Response, error) { | ||
startTime := time.Now() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
every time a provider whose http client transport is wrapped using NewDurationRoundTripper
makes an HTTP request, this method is called. All it does is to call the wrapped round trip and mark the time before and after and shove it into the histogram.
I added two inline comments, do they help? Happy to walk you through the code, too. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should be aware of:
open-telemetry/opentelemetry-go-contrib#3769
Which is working on generic upstream metrics for http clients. OTOH, it's not clear exactly when that will get merged, so I think this is a reasonable approach in the meantime.
labels := []attribute.KeyValue{ | ||
attribute.String("http_method", r.Method), | ||
attribute.Int("http_status_code", resp.StatusCode), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you want to include r.URL.Host
as well? I think we'll probably see different latencies from different hosts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, you're putting it in the metric name instead. That works.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
actually, the metric name contains the provider type, not the host. I think your suggestion makes sense.
histogram, ok := m.httpProviderHistograms[providerType] | ||
if !ok { | ||
var err error | ||
histogram, err = m.createProviderHistogram(providerType) | ||
if err != nil { | ||
return nil, err | ||
} | ||
m.httpProviderHistograms[providerType] = histogram | ||
} | ||
return histogram, nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to lock this map access? I'm guessing we can get away without it because we're only calling NewDurationRoundTripper
at startup?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh, not yet, but soon we'll need to (PR #1356)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(fixed)
|
||
func (m *httpClientMetrics) getHistogramForProvider(providerType db.ProviderType) (metric.Int64Histogram, error) { | ||
histogram, ok := m.httpProviderHistograms[providerType] | ||
if !ok { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can probably be
if !ok { | |
if histogram == nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
accepted while making use of xsync.Map
cmd/dev/app/rule_type/rule_type.go
Outdated
@@ -130,6 +131,7 @@ func testCmdRun(cmd *cobra.Command, _ []string) error { | |||
}, | |||
db.ProviderAccessToken{}, | |||
token, | |||
provtelemetry.NewNoopMetrics(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 to a functional options pattern.
Does this also mean that the rule type commands actually make calls to providers, or is that just an artifact of how we construct them?
Initially I tried to use https://pkg.go.dev/go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp#NewTransport but that seems to just create a new span. I was never able to get any useful metrics out of the endpoint even though I verified I was actually using the instrumented transport. We're using `otelhttp` already for instrumenting handlers where it works fine, but I wasn't able to get the client side working. So I added a custom transport that records the duration of the request along with the method and the status code. The resulting historam looks like this: ``` mediator_github_http_roundtrip_duration_milliseconds_bucket{http_method="GET",http_status_code="200",otel_scope_name="providers",otel_scope_version="",le="0"} 0 mediator_github_http_roundtrip_duration_milliseconds_bucket{http_method="GET",http_status_code="200",otel_scope_name="providers",otel_scope_version="",le="5"} 0 mediator_github_http_roundtrip_duration_milliseconds_bucket{http_method="GET",http_status_code="200",otel_scope_name="providers",otel_scope_version="",le="10"} 0 mediator_github_http_roundtrip_duration_milliseconds_bucket{http_method="GET",http_status_code="200",otel_scope_name="providers",otel_scope_version="",le="25"} 0 mediator_github_http_roundtrip_duration_milliseconds_bucket{http_method="GET",http_status_code="200",otel_scope_name="providers",otel_scope_version="",le="50"} 0 mediator_github_http_roundtrip_duration_milliseconds_bucket{http_method="GET",http_status_code="200",otel_scope_name="providers",otel_scope_version="",le="75"} 0 mediator_github_http_roundtrip_duration_milliseconds_bucket{http_method="GET",http_status_code="200",otel_scope_name="providers",otel_scope_version="",le="100"} 0 mediator_github_http_roundtrip_duration_milliseconds_bucket{http_method="GET",http_status_code="200",otel_scope_name="providers",otel_scope_version="",le="250"} 4 mediator_github_http_roundtrip_duration_milliseconds_bucket{http_method="GET",http_status_code="200",otel_scope_name="providers",otel_scope_version="",le="500"} 7 mediator_github_http_roundtrip_duration_milliseconds_bucket{http_method="GET",http_status_code="200",otel_scope_name="providers",otel_scope_version="",le="750"} 7 mediator_github_http_roundtrip_duration_milliseconds_bucket{http_method="GET",http_status_code="200",otel_scope_name="providers",otel_scope_version="",le="1000"} 7 mediator_github_http_roundtrip_duration_milliseconds_bucket{http_method="GET",http_status_code="200",otel_scope_name="providers",otel_scope_version="",le="2500"} 7 mediator_github_http_roundtrip_duration_milliseconds_bucket{http_method="GET",http_status_code="200",otel_scope_name="providers",otel_scope_version="",le="5000"} 7 mediator_github_http_roundtrip_duration_milliseconds_bucket{http_method="GET",http_status_code="200",otel_scope_name="providers",otel_scope_version="",le="7500"} 7 mediator_github_http_roundtrip_duration_milliseconds_bucket{http_method="GET",http_status_code="200",otel_scope_name="providers",otel_scope_version="",le="10000"} 7 mediator_github_http_roundtrip_duration_milliseconds_bucket{http_method="GET",http_status_code="200",otel_scope_name="providers",otel_scope_version="",le="+Inf"} 7 mediator_github_http_roundtrip_duration_milliseconds_sum{http_method="GET",http_status_code="200",otel_scope_name="providers",otel_scope_version=""} 1709 mediator_github_http_roundtrip_duration_milliseconds_count{http_method="GET",http_status_code="200",otel_scope_name="providers",otel_scope_version=""} 7 ```
dd98ee4
to
4748092
Compare
actually, wait up with re-review, I will also use the functional pattern in the server, executor and reconciler constructors. |
done |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems fine for now, I don't want to spend too much time noodling how to make this super-clean given our constraints.
pbOpts := []providers.ProviderBuilderOption{ | ||
providers.WithProviderMetrics(s.provMt), | ||
} | ||
provBuilder, err := providers.GetProviderBuilder(ctx, prov, dbRepo.ProjectID, s.store, s.cryptoEngine, pbOpts...) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You don't need to serialize into a slice here (that's the whole point of varargs):
pbOpts := []providers.ProviderBuilderOption{ | |
providers.WithProviderMetrics(s.provMt), | |
} | |
provBuilder, err := providers.GetProviderBuilder(ctx, prov, dbRepo.ProjectID, s.store, s.cryptoEngine, pbOpts...) | |
provBuilder, err := providers.GetProviderBuilder( | |
ctx, prov, dbRepo.ProjectID, s.store, s.cryptoEngine, providers.WithProviderMetrics(s.provMt)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this was just a way to not make the line super long (and breaking the line looked visually weird, too). But you're right, I'll reformat the line
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You know what, I'll send a follow-up to both suggestions, I want to see the metrics in staging ASAP. Hope it's fine.
func (m *httpClientMetrics) NewDurationRoundTripper( | ||
wrapped http.RoundTripper, | ||
providerType db.ProviderType, | ||
) (http.RoundTripper, error) { | ||
histogram := m.getHistogramForProvider(providerType) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you want to be really tricky, you can leverage the possibly of a nil
client as:
func (m *httpClientMetrics) NewDurationRoundTripper( | |
wrapped http.RoundTripper, | |
providerType db.ProviderType, | |
) (http.RoundTripper, error) { | |
histogram := m.getHistogramForProvider(providerType) | |
func (m *httpClientMetrics) NewDurationRoundTripper( | |
wrapped http.RoundTripper, | |
providerType db.ProviderType, | |
) (http.RoundTripper, error) { | |
if m == nil { | |
return NewNoopMetrics().NewDurationRoundTripper(wrapped, providerType) | |
} | |
histogram := m.getHistogramForProvider(providerType) |
Initially I tried to use
https://pkg.go.dev/go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp#NewTransport
but that seems to just create a new span. I was never able to get any
useful metrics out of the endpoint even though I verified I was actually
using the instrumented transport. We're using
otelhttp
already forinstrumenting handlers where it works fine, but I wasn't able to get the
client side working.
So I added a custom transport that records the duration of the request
along with the method and the status code.
The resulting historam looks like this: