Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add metrics for outbound API calls #1360

Merged
merged 5 commits into from
Nov 2, 2023

Conversation

jhrozek
Copy link
Contributor

@jhrozek jhrozek commented Oct 31, 2023

Initially I tried to use
https://pkg.go.dev/go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp#NewTransport
but that seems to just create a new span. I was never able to get any
useful metrics out of the endpoint even though I verified I was actually
using the instrumented transport. We're using otelhttp already for
instrumenting handlers where it works fine, but I wasn't able to get the
client side working.

So I added a custom transport that records the duration of the request
along with the method and the status code.

The resulting historam looks like this:

mediator_github_http_roundtrip_duration_milliseconds_bucket{http_method="GET",http_status_code="200",otel_scope_name="providers",otel_scope_version="",le="0"}
0
mediator_github_http_roundtrip_duration_milliseconds_bucket{http_method="GET",http_status_code="200",otel_scope_name="providers",otel_scope_version="",le="5"}
0
mediator_github_http_roundtrip_duration_milliseconds_bucket{http_method="GET",http_status_code="200",otel_scope_name="providers",otel_scope_version="",le="10"}
0
mediator_github_http_roundtrip_duration_milliseconds_bucket{http_method="GET",http_status_code="200",otel_scope_name="providers",otel_scope_version="",le="25"}
0
mediator_github_http_roundtrip_duration_milliseconds_bucket{http_method="GET",http_status_code="200",otel_scope_name="providers",otel_scope_version="",le="50"}
0
mediator_github_http_roundtrip_duration_milliseconds_bucket{http_method="GET",http_status_code="200",otel_scope_name="providers",otel_scope_version="",le="75"}
0
mediator_github_http_roundtrip_duration_milliseconds_bucket{http_method="GET",http_status_code="200",otel_scope_name="providers",otel_scope_version="",le="100"}
0
mediator_github_http_roundtrip_duration_milliseconds_bucket{http_method="GET",http_status_code="200",otel_scope_name="providers",otel_scope_version="",le="250"}
4
mediator_github_http_roundtrip_duration_milliseconds_bucket{http_method="GET",http_status_code="200",otel_scope_name="providers",otel_scope_version="",le="500"}
7
mediator_github_http_roundtrip_duration_milliseconds_bucket{http_method="GET",http_status_code="200",otel_scope_name="providers",otel_scope_version="",le="750"}
7
mediator_github_http_roundtrip_duration_milliseconds_bucket{http_method="GET",http_status_code="200",otel_scope_name="providers",otel_scope_version="",le="1000"}
7
mediator_github_http_roundtrip_duration_milliseconds_bucket{http_method="GET",http_status_code="200",otel_scope_name="providers",otel_scope_version="",le="2500"}
7
mediator_github_http_roundtrip_duration_milliseconds_bucket{http_method="GET",http_status_code="200",otel_scope_name="providers",otel_scope_version="",le="5000"}
7
mediator_github_http_roundtrip_duration_milliseconds_bucket{http_method="GET",http_status_code="200",otel_scope_name="providers",otel_scope_version="",le="7500"}
7
mediator_github_http_roundtrip_duration_milliseconds_bucket{http_method="GET",http_status_code="200",otel_scope_name="providers",otel_scope_version="",le="10000"}
7
mediator_github_http_roundtrip_duration_milliseconds_bucket{http_method="GET",http_status_code="200",otel_scope_name="providers",otel_scope_version="",le="+Inf"}
7
mediator_github_http_roundtrip_duration_milliseconds_sum{http_method="GET",http_status_code="200",otel_scope_name="providers",otel_scope_version=""}
1709
mediator_github_http_roundtrip_duration_milliseconds_count{http_method="GET",http_status_code="200",otel_scope_name="providers",otel_scope_version=""}
7

@@ -130,6 +131,7 @@ func testCmdRun(cmd *cobra.Command, _ []string) error {
},
db.ProviderAccessToken{},
token,
provtelemetry.NewNoopMetrics(),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't love the fact that I added a new parameter to a call that's used so widely. Maybe the provider builders could default to noop metrics and we could have an WithMetrics option that could be added where needed? The "typed" provider builders like NewRest or NewGithub would then require an explicit metric to be passed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually really like this idea. Having a default NoopMetrics implementation and adding the implementation with a With function.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to a functional options pattern.

Does this also mean that the rule type commands actually make calls to providers, or is that just an artifact of how we construct them?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the rule type command is a developer-facing command. It's instantiating the rule types in the same manner as the engine normally does and then the rule types are making the provider calls.

@@ -130,6 +131,7 @@ func testCmdRun(cmd *cobra.Command, _ []string) error {
},
db.ProviderAccessToken{},
token,
provtelemetry.NewNoopMetrics(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually really like this idea. Having a default NoopMetrics implementation and adding the implementation with a With function.


s, err := controlplane.NewServer(store, evt, serverMetrics, cfg, vldtr)
s, err := controlplane.NewServer(store, evt, serverMetrics, providerMetrics, cfg, vldtr)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you could use the options pattern (WithMetrics or something of the sort) in the server, engine and reconcilers.

Copy link
Contributor

@JAORMX JAORMX left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where does the actual instrumentation happen? Is there some magic function that does that? cause I don't see that happening 😕

}

func (irt *instrumentedRoundTripper) RoundTrip(r *http.Request) (*http.Response, error) {
startTime := time.Now()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

every time a provider whose http client transport is wrapped using NewDurationRoundTripper makes an HTTP request, this method is called. All it does is to call the wrapped round trip and mark the time before and after and shove it into the histogram.

@jhrozek
Copy link
Contributor Author

jhrozek commented Nov 1, 2023

Where does the actual instrumentation happen? Is there some magic function that does that? cause I don't see that happening 😕

I added two inline comments, do they help? Happy to walk you through the code, too.

Copy link
Member

@evankanderson evankanderson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should be aware of:

open-telemetry/opentelemetry-go-contrib#3769

Which is working on generic upstream metrics for http clients. OTOH, it's not clear exactly when that will get merged, so I think this is a reasonable approach in the meantime.

Comment on lines 43 to 48
labels := []attribute.KeyValue{
attribute.String("http_method", r.Method),
attribute.Int("http_status_code", resp.StatusCode),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you want to include r.URL.Host as well? I think we'll probably see different latencies from different hosts.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, you're putting it in the metric name instead. That works.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually, the metric name contains the provider type, not the host. I think your suggestion makes sense.

internal/providers/telemetry/telemetry.go Show resolved Hide resolved
Comment on lines 90 to 99
histogram, ok := m.httpProviderHistograms[providerType]
if !ok {
var err error
histogram, err = m.createProviderHistogram(providerType)
if err != nil {
return nil, err
}
m.httpProviderHistograms[providerType] = histogram
}
return histogram, nil
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to lock this map access? I'm guessing we can get away without it because we're only calling NewDurationRoundTripper at startup?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, not yet, but soon we'll need to (PR #1356)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(fixed)


func (m *httpClientMetrics) getHistogramForProvider(providerType db.ProviderType) (metric.Int64Histogram, error) {
histogram, ok := m.httpProviderHistograms[providerType]
if !ok {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can probably be

Suggested change
if !ok {
if histogram == nil {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

accepted while making use of xsync.Map

@@ -130,6 +131,7 @@ func testCmdRun(cmd *cobra.Command, _ []string) error {
},
db.ProviderAccessToken{},
token,
provtelemetry.NewNoopMetrics(),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to a functional options pattern.

Does this also mean that the rule type commands actually make calls to providers, or is that just an artifact of how we construct them?

internal/providers/github/github.go Show resolved Hide resolved
Initially I tried to use
https://pkg.go.dev/go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp#NewTransport
but that seems to just create a new span. I was never able to get any
useful metrics out of the endpoint even though I verified I was actually
using the instrumented transport. We're using `otelhttp` already for
instrumenting handlers where it works fine, but I wasn't able to get the
client side working.

So I added a custom transport that records the duration of the request
along with the method and the status code.

The resulting historam looks like this:
```
mediator_github_http_roundtrip_duration_milliseconds_bucket{http_method="GET",http_status_code="200",otel_scope_name="providers",otel_scope_version="",le="0"}
0
mediator_github_http_roundtrip_duration_milliseconds_bucket{http_method="GET",http_status_code="200",otel_scope_name="providers",otel_scope_version="",le="5"}
0
mediator_github_http_roundtrip_duration_milliseconds_bucket{http_method="GET",http_status_code="200",otel_scope_name="providers",otel_scope_version="",le="10"}
0
mediator_github_http_roundtrip_duration_milliseconds_bucket{http_method="GET",http_status_code="200",otel_scope_name="providers",otel_scope_version="",le="25"}
0
mediator_github_http_roundtrip_duration_milliseconds_bucket{http_method="GET",http_status_code="200",otel_scope_name="providers",otel_scope_version="",le="50"}
0
mediator_github_http_roundtrip_duration_milliseconds_bucket{http_method="GET",http_status_code="200",otel_scope_name="providers",otel_scope_version="",le="75"}
0
mediator_github_http_roundtrip_duration_milliseconds_bucket{http_method="GET",http_status_code="200",otel_scope_name="providers",otel_scope_version="",le="100"}
0
mediator_github_http_roundtrip_duration_milliseconds_bucket{http_method="GET",http_status_code="200",otel_scope_name="providers",otel_scope_version="",le="250"}
4
mediator_github_http_roundtrip_duration_milliseconds_bucket{http_method="GET",http_status_code="200",otel_scope_name="providers",otel_scope_version="",le="500"}
7
mediator_github_http_roundtrip_duration_milliseconds_bucket{http_method="GET",http_status_code="200",otel_scope_name="providers",otel_scope_version="",le="750"}
7
mediator_github_http_roundtrip_duration_milliseconds_bucket{http_method="GET",http_status_code="200",otel_scope_name="providers",otel_scope_version="",le="1000"}
7
mediator_github_http_roundtrip_duration_milliseconds_bucket{http_method="GET",http_status_code="200",otel_scope_name="providers",otel_scope_version="",le="2500"}
7
mediator_github_http_roundtrip_duration_milliseconds_bucket{http_method="GET",http_status_code="200",otel_scope_name="providers",otel_scope_version="",le="5000"}
7
mediator_github_http_roundtrip_duration_milliseconds_bucket{http_method="GET",http_status_code="200",otel_scope_name="providers",otel_scope_version="",le="7500"}
7
mediator_github_http_roundtrip_duration_milliseconds_bucket{http_method="GET",http_status_code="200",otel_scope_name="providers",otel_scope_version="",le="10000"}
7
mediator_github_http_roundtrip_duration_milliseconds_bucket{http_method="GET",http_status_code="200",otel_scope_name="providers",otel_scope_version="",le="+Inf"}
7
mediator_github_http_roundtrip_duration_milliseconds_sum{http_method="GET",http_status_code="200",otel_scope_name="providers",otel_scope_version=""}
1709
mediator_github_http_roundtrip_duration_milliseconds_count{http_method="GET",http_status_code="200",otel_scope_name="providers",otel_scope_version=""}
7
```
@jhrozek jhrozek force-pushed the outbound_api_metrics branch from dd98ee4 to 4748092 Compare November 1, 2023 20:46
@jhrozek
Copy link
Contributor Author

jhrozek commented Nov 1, 2023

actually, wait up with re-review, I will also use the functional pattern in the server, executor and reconciler constructors.

@jhrozek
Copy link
Contributor Author

jhrozek commented Nov 1, 2023

actually, wait up with re-review, I will also use the functional pattern in the server, executor and reconciler constructors.

done

Copy link
Member

@evankanderson evankanderson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems fine for now, I don't want to spend too much time noodling how to make this super-clean given our constraints.

Comment on lines +401 to +404
pbOpts := []providers.ProviderBuilderOption{
providers.WithProviderMetrics(s.provMt),
}
provBuilder, err := providers.GetProviderBuilder(ctx, prov, dbRepo.ProjectID, s.store, s.cryptoEngine, pbOpts...)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't need to serialize into a slice here (that's the whole point of varargs):

Suggested change
pbOpts := []providers.ProviderBuilderOption{
providers.WithProviderMetrics(s.provMt),
}
provBuilder, err := providers.GetProviderBuilder(ctx, prov, dbRepo.ProjectID, s.store, s.cryptoEngine, pbOpts...)
provBuilder, err := providers.GetProviderBuilder(
ctx, prov, dbRepo.ProjectID, s.store, s.cryptoEngine, providers.WithProviderMetrics(s.provMt))

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this was just a way to not make the line super long (and breaking the line looked visually weird, too). But you're right, I'll reformat the line

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You know what, I'll send a follow-up to both suggestions, I want to see the metrics in staging ASAP. Hope it's fine.

Comment on lines +110 to +114
func (m *httpClientMetrics) NewDurationRoundTripper(
wrapped http.RoundTripper,
providerType db.ProviderType,
) (http.RoundTripper, error) {
histogram := m.getHistogramForProvider(providerType)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you want to be really tricky, you can leverage the possibly of a nil client as:

Suggested change
func (m *httpClientMetrics) NewDurationRoundTripper(
wrapped http.RoundTripper,
providerType db.ProviderType,
) (http.RoundTripper, error) {
histogram := m.getHistogramForProvider(providerType)
func (m *httpClientMetrics) NewDurationRoundTripper(
wrapped http.RoundTripper,
providerType db.ProviderType,
) (http.RoundTripper, error) {
if m == nil {
return NewNoopMetrics().NewDurationRoundTripper(wrapped, providerType)
}
histogram := m.getHistogramForProvider(providerType)

@jhrozek jhrozek merged commit 81ba458 into mindersec:main Nov 2, 2023
12 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants