Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

otelgrpc high metric cardinality from rpc.server.duration #3071

Closed
rcrowe opened this issue Dec 8, 2022 · 18 comments · Fixed by #4322
Closed

otelgrpc high metric cardinality from rpc.server.duration #3071

rcrowe opened this issue Dec 8, 2022 · 18 comments · Fixed by #4322
Milestone

Comments

@rcrowe
Copy link

rcrowe commented Dec 8, 2022

Since upgrading to v1.12.0/0.37.0/0.6.0 & specifically #2700 we've seen an explosion in the cardinality of our metrics.

  1. Metric recorded:
    cfg.rpcServerDuration.Record(ctx, int64(elapsedTime), attr...)
  2. Attributes are taken from spanInfo:
    name, attr := spanInfo(info.FullMethod, peerFromCtx(ctx))
  3. Peer attributes added:
    return []attribute.KeyValue{
    semconv.NetPeerIPKey.String(host),
    semconv.NetPeerPortKey.String(port),
    }

Is the intention that these types of high cardinality issues are handled in userland, by us using a view (or blocking at the ingestion layer) to strip labels from certain metrics after we find them problematic? On one hand I can understand that being acceptable, on the other, this feels like pushing a problem downstream to consumers that will suddenly get hit by a metric that is most likely going to grow very wide.

CC @fatsheep9146

@rcrowe
Copy link
Author

rcrowe commented Dec 8, 2022

Missed https://github.com/open-telemetry/opentelemetry-specification/blob/56b71774c7ea694e0e492c2ae1120e6df59e5dd6/specification/metrics/semantic_conventions/rpc-metrics.md#attributes which seems to suggest this is working as expected.

To avoid high cardinality, implementations should prefer the most stable of net.peer.name or 
net.sock.peer.addr, depending on expected deployment profile. For many cloud applications, 
this is likely net.peer.name as names can be recycled even across re-instantiation of a server 
with a different ip.

@MrAlias
Copy link
Contributor

MrAlias commented Dec 8, 2022

Missed https://github.com/open-telemetry/opentelemetry-specification/blob/56b71774c7ea694e0e492c2ae1120e6df59e5dd6/specification/metrics/semantic_conventions/rpc-metrics.md#attributes which seems to suggest this is working as expected.

To avoid high cardinality, implementations should prefer the most stable of net.peer.name or 
net.sock.peer.addr, depending on expected deployment profile. For many cloud applications, 
this is likely net.peer.name as names can be recycled even across re-instantiation of a server 
with a different ip.

Correct, this does look to be something required by the specification. It might be worth opening an issue in the specification repository if you think this shouldn't be a required attribute.

@lucasoares
Copy link

lucasoares commented Jan 31, 2023

How to remove this metric? It is exposing 7000 metrics in the exporter. It should be disabled as default.

@regeda
Copy link

regeda commented Feb 3, 2023

After v0.38.0 release, I see the same issue with otelhttp package. We avoid this upgrade bc it blows up the memory consumption drastically.

@MrAlias
Copy link
Contributor

MrAlias commented Feb 3, 2023

You can use a View to drop the aggregation entirely or filter the attributes causing your cardinality explosion.

@regeda
Copy link

regeda commented Feb 3, 2023

You can use a View to drop the aggregation entirely or filter the attributes causing your cardinality explosion.

Could you share an example how to drop certain attributes from the aggregation?

@MrAlias
Copy link
Contributor

MrAlias commented Feb 3, 2023

You can use a View to drop the aggregation entirely or filter the attributes causing your cardinality explosion.

Could you share an example how to drop certain attributes from the aggregation?

The linked documentation has an example that should help as a basis:

20230203_085416

You'll need to update your match criteria to specifically match what you want though.

@regeda
Copy link

regeda commented Feb 6, 2023

I've done with the following example (thanks @MrAlias ):

func allowedAttr(v ...string) attribute.Filter {
	m := make(map[string]struct{}, len(v))
	for _, s := range v {
		m[s] = struct{}{}
	}
	return func(kv attribute.KeyValue) bool {
		_, ok := m[string(kv.Key)]
		return ok
	}
}

func setGlobalMeterProvider() error {
	exporter, err := prometheus.New(
		prometheus.WithRegisterer(client_prom.DefaultRegisterer),
	)
	if err != nil {
		return err
	}
	global.SetMeterProvider(metric.NewMeterProvider(
		metric.WithReader(exporter),
		metric.WithView(
			metric.NewView(
				metric.Instrument{Scope: instrumentation.Scope{Name: "go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"}},
				metric.Stream{
					AttributeFilter: allowedAttr(
						"http.method",
						"http.status_code",
						"http.target",
					),
				},
			),
		),
	))
	return nil
}

@regeda
Copy link

regeda commented Feb 7, 2023

Another issue :(

A new version of otelhttp/v0.38.0 uses go.opentelemetry.io/otel/semconv/v1.17.0/httpconv.ServerRequest for attributes generating https://github.com/open-telemetry/opentelemetry-go-contrib/blob/main/instrumentation/net/http/otelhttp/handler.go#L218.

This function writes the whole request URI (including the query string) into http.target. It affects the cardinality of the metric and the server allocates a lot of memory if the query string is constantly random.

@regeda
Copy link

regeda commented Feb 7, 2023

Also, the view's override doesn't help too much bc the internal aggregation is still happening with the following attributes that impact memory consumption and there is no option to prevent it:

  • peer port
  • user agent
  • request uri

@MrAlias
Copy link
Contributor

MrAlias commented Feb 7, 2023

Also, the view's override doesn't help too much bc the internal aggregation is still happening with the following attributes that impact memory consumption and there is no option to prevent it:

  • peer port
  • user agent
  • request uri

Can you expand on this? I'm not sure how an attribute filter wouldn't help prevent this. The filter wraps the aggregation so attributes filtered out should never reach the underlying aggregation storage.

@regeda
Copy link

regeda commented Feb 8, 2023

Also, the view's override doesn't help too much bc the internal aggregation is still happening with the following attributes that impact memory consumption and there is no option to prevent it:

  • peer port
  • user agent
  • request uri

Can you expand on this? I'm not sure how an attribute filter wouldn't help prevent this. The filter wraps the aggregation so attributes filtered out should never reach the underlying aggregation storage.

I think that the problem is here https://github.com/open-telemetry/opentelemetry-go/blob/main/sdk/metric/internal/filter.go#L67

The filter uses the original attributes set as a key in the seen map.

func (f *filter[N]) Aggregate(measurement N, attr attribute.Set) {
	// TODO (#3006): drop stale attributes from seen. // <-- not yet implemented
	f.Lock()
	defer f.Unlock()
	fAttr, ok := f.seen[attr]
        fmt.Println(len(f.seen)) // <-- seen is growing with a random attribute set
	if !ok {
		fAttr, _ = attr.Filter(f.filter)
		f.seen[attr] = fAttr // <-- memory leak
	}
	f.aggregator.Aggregate(measurement, fAttr)
}

See open-telemetry/opentelemetry-go#3006

this is my test case:

import (
    ...
    "math/rand"
    ...
)

func TestRandomFilterAggregate(t *testing.T) {
	f := NewFilter[int64](&testStableAggregator[int64]{}, testAttributeFilter)
	for i := 0; i < 1000; i++ {
		set := attribute.NewSet(
			attribute.Int("foo", rand.Int()),
		)
		f.Aggregate(1, set)
	}
}

Another problem is in seen map[attribute.Set]attribute.Set bc attribute.Set is not comparable type then the following test will consume 2x memory even if the attribute set is the same:

func TestRandomFilterAggregate(t *testing.T) {
	f := NewFilter[int64](&testStableAggregator[int64]{}, testAttributeFilter)
	for i := 0; i < 1000; i++ {
		attr := attribute.Int("foo", rand.Int()),
		set1 := attribute.NewSet(attr)
		f.Aggregate(1, set1)
		set2 := attribute.NewSet(attr)
		f.Aggregate(1, set2)
	}
}

That is my proposal open-telemetry/opentelemetry-go#3695

@lucasoares
Copy link

I just want to be able to disable 100% of grpc and http metrics from ever being created, to not have any kind of impact on my application.

Any way to do it? I don't want to have to do bench tests because of these third-party metrics...

@MrAlias
Copy link
Contributor

MrAlias commented Feb 8, 2023

I just want to be able to disable 100% of grpc and http metrics from ever being created, to not have any kind of impact on my application.

Any way to do it? I don't want to have to do bench tests because of these third-party metrics...

#3071 (comment)

@liufuyang
Copy link
Contributor

liufuyang commented Feb 20, 2023

@MrAlias I think I encounter the same memory issue. So have this open-telemetry/opentelemetry-go#3695 fixed it and does the current release includes it? Thank you :)

Ah, sorry I just saw this https://github.com/open-telemetry/opentelemetry-go/milestone/40 looks like we are very close to releasing a fix on this.

Thank you very much and I will give a test again when 0.37 is out 👍

@ericwenn
Copy link

Hey, I'm seeing this issue in services, both for otelhttp and otelgrpc.
Filtering attributes with views and using the unreleased fix for memory consumptions in this issue alleviates the immediate issues.

However, based on reading the specification I see two, possibly, incorrect things.

  1. For HTTP (otelhttp) some attributes are required only on the client while optional on server. Specifically net.peer.port and http.client_id are not required for servers, but is collected on servers here.

  2. For gRPC (otelgrpc) the specification makes no distinction on client vs server attributes, but using http metrics as ground truth the net.peer.port attribute should not be collected for servers in gRPC either.

@MrAlias I can submit issue in specification repo for 2, but I think 1 should be fixed by the go implementation.
Am I missing something?

@jaronoff97
Copy link
Contributor

Hello all, we're running a collector right now and we're seeing this issue cause a massive memory leak when converting the metric to prometheus. This stems from the new usage of the meter provider here in the collector.

IMO this is a massive issue because it means someone could cause OOMs/dropped data by flooding the collector with bad rpcs. Would it be possible to make the inclusion of the net.sock.peer.addr, net.sock.peer.port, etc. optional? The inclusion of these attributes are useful for traces, but cause massive cardinality explosion in metrics. Because the collector also is exporting Prometheus Histograms, the cardinality we observe is |net.sock.peer.addr| x |net.sock.peer.port| x |buckets| which results in millions of metrics created.

Users of the collector are unable to implement a custom view as has been recommended on this thread, because the code is already built with this. Given other users are experiencing this issue, it felt prudent to upstream this request to the source.

@mackjmr
Copy link
Member

mackjmr commented Sep 12, 2023

I experienced this when running opentelemetry-demo.

Here is a graph showing the memory usage of the collector pod before / after implementing the filter from #3071 (comment) in productcatalogservice (fix pushed at 17h25):

IMO, net.sock.peer.port should be disabled by default seeing the cardinality / significant memory usage in the collector it leads to when enabled.

cc @pellared as it seems like you are working on a fix to this issue in #3730.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

9 participants