-
Notifications
You must be signed in to change notification settings - Fork 566
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
otelgrpc high metric cardinality from rpc.server.duration #3071
Comments
Missed https://github.com/open-telemetry/opentelemetry-specification/blob/56b71774c7ea694e0e492c2ae1120e6df59e5dd6/specification/metrics/semantic_conventions/rpc-metrics.md#attributes which seems to suggest this is working as expected.
|
Correct, this does look to be something required by the specification. It might be worth opening an issue in the specification repository if you think this shouldn't be a required attribute. |
How to remove this metric? It is exposing 7000 metrics in the exporter. It should be disabled as default. |
After |
You can use a |
Could you share an example how to drop certain attributes from the aggregation? |
The linked documentation has an example that should help as a basis: You'll need to update your match criteria to specifically match what you want though. |
I've done with the following example (thanks @MrAlias ): func allowedAttr(v ...string) attribute.Filter {
m := make(map[string]struct{}, len(v))
for _, s := range v {
m[s] = struct{}{}
}
return func(kv attribute.KeyValue) bool {
_, ok := m[string(kv.Key)]
return ok
}
}
func setGlobalMeterProvider() error {
exporter, err := prometheus.New(
prometheus.WithRegisterer(client_prom.DefaultRegisterer),
)
if err != nil {
return err
}
global.SetMeterProvider(metric.NewMeterProvider(
metric.WithReader(exporter),
metric.WithView(
metric.NewView(
metric.Instrument{Scope: instrumentation.Scope{Name: "go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"}},
metric.Stream{
AttributeFilter: allowedAttr(
"http.method",
"http.status_code",
"http.target",
),
},
),
),
))
return nil
} |
Another issue :( A new version of This function writes the whole request URI (including the query string) into |
Also, the view's override doesn't help too much bc the internal aggregation is still happening with the following attributes that impact memory consumption and there is no option to prevent it:
|
Can you expand on this? I'm not sure how an attribute filter wouldn't help prevent this. The filter wraps the aggregation so attributes filtered out should never reach the underlying aggregation storage. |
I think that the problem is here https://github.com/open-telemetry/opentelemetry-go/blob/main/sdk/metric/internal/filter.go#L67 The filter uses the original attributes set as a key in the func (f *filter[N]) Aggregate(measurement N, attr attribute.Set) {
// TODO (#3006): drop stale attributes from seen. // <-- not yet implemented
f.Lock()
defer f.Unlock()
fAttr, ok := f.seen[attr]
fmt.Println(len(f.seen)) // <-- seen is growing with a random attribute set
if !ok {
fAttr, _ = attr.Filter(f.filter)
f.seen[attr] = fAttr // <-- memory leak
}
f.aggregator.Aggregate(measurement, fAttr)
} See open-telemetry/opentelemetry-go#3006 this is my test case: import (
...
"math/rand"
...
)
func TestRandomFilterAggregate(t *testing.T) {
f := NewFilter[int64](&testStableAggregator[int64]{}, testAttributeFilter)
for i := 0; i < 1000; i++ {
set := attribute.NewSet(
attribute.Int("foo", rand.Int()),
)
f.Aggregate(1, set)
}
} Another problem is in func TestRandomFilterAggregate(t *testing.T) {
f := NewFilter[int64](&testStableAggregator[int64]{}, testAttributeFilter)
for i := 0; i < 1000; i++ {
attr := attribute.Int("foo", rand.Int()),
set1 := attribute.NewSet(attr)
f.Aggregate(1, set1)
set2 := attribute.NewSet(attr)
f.Aggregate(1, set2)
}
} That is my proposal open-telemetry/opentelemetry-go#3695 |
I just want to be able to disable 100% of grpc and http metrics from ever being created, to not have any kind of impact on my application. Any way to do it? I don't want to have to do bench tests because of these third-party metrics... |
|
@MrAlias I think I encounter the same memory issue. So have this open-telemetry/opentelemetry-go#3695 fixed it and does the current release includes it? Thank you :) Ah, sorry I just saw this https://github.com/open-telemetry/opentelemetry-go/milestone/40 looks like we are very close to releasing a fix on this. Thank you very much and I will give a test again when 0.37 is out 👍 |
Hey, I'm seeing this issue in services, both for However, based on reading the specification I see two, possibly, incorrect things.
@MrAlias I can submit issue in specification repo for 2, but I think 1 should be fixed by the go implementation. |
Hello all, we're running a collector right now and we're seeing this issue cause a massive memory leak when converting the metric to prometheus. This stems from the new usage of the meter provider here in the collector. IMO this is a massive issue because it means someone could cause OOMs/dropped data by flooding the collector with bad rpcs. Would it be possible to make the inclusion of the Users of the collector are unable to implement a custom view as has been recommended on this thread, because the code is already built with this. Given other users are experiencing this issue, it felt prudent to upstream this request to the source. |
I experienced this when running Here is a graph showing the memory usage of the collector pod before / after implementing the filter from #3071 (comment) in productcatalogservice (fix pushed at 17h25): IMO, cc @pellared as it seems like you are working on a fix to this issue in #3730. |
Since upgrading to v1.12.0/0.37.0/0.6.0 & specifically #2700 we've seen an explosion in the cardinality of our metrics.
opentelemetry-go-contrib/instrumentation/google.golang.org/grpc/otelgrpc/interceptor.go
Line 346 in ad04d9c
spanInfo
:opentelemetry-go-contrib/instrumentation/google.golang.org/grpc/otelgrpc/interceptor.go
Line 331 in ad04d9c
opentelemetry-go-contrib/instrumentation/google.golang.org/grpc/otelgrpc/interceptor.go
Lines 476 to 479 in ad04d9c
Is the intention that these types of high cardinality issues are handled in userland, by us using a view (or blocking at the ingestion layer) to strip labels from certain metrics after we find them problematic? On one hand I can understand that being acceptable, on the other, this feels like pushing a problem downstream to consumers that will suddenly get hit by a metric that is most likely going to grow very wide.
CC @fatsheep9146
The text was updated successfully, but these errors were encountered: