
Specify cardinality limiting and attribute filtering interaction #3803

Closed
wants to merge 3 commits

Conversation

@MrAlias (Contributor) commented Jan 4, 2024

Fixes #3798

Clarify that the specification allows implementations to choose the approach that works best for them when implementing cardinality limiting while attributes are being filtered.

cc @jsuereth @jack-berg

@MrAlias added the spec:metrics label Jan 4, 2024
@MrAlias requested review from a team January 4, 2024 17:14
Comment on lines +780 to +785
#### Interaction with filtering

It is left unspecified if attribute filtering from a user provided view needs
to be applied before or after applying the cardinality limit. While this can
lead to inconsistent telemetry across implementations, this will only happen in
an error scenario where a cardinality limit is used.
A Contributor commented:

I thought through this --
I couldn't find which aspect of "attribute filtering" and views you are concerned about, and I'd like to understand it.

The key phrase in #2960, to me, is:

Regardless of aggregation temporality, the SDK MUST ensure that every metric event is reflected in exactly one Aggregator, which is either an Aggregator associated with the correct attribute set or an aggregator associated with the overflow attribute set.

The line you're referring to, I think, is:

Implementations MAY accept additional attribute filtering functionality for this parameter.

In my mental model for this, the view is responsible for choosing which aggregators accumulate which metric event. Custom filtering might change the logic used by a View to choose aggregators, but doesn't change the View's role.

The relationship between views and cardinality limits as defined (i.e., either the correct aggregator or the overflow aggregator is selected) is, to me, decoupled from attribute filtering as performed by views.

There is, on purpose (IMO), a lot left unspecified about how cardinality limits are to be applied, because it is very implementation-dependent.

Would you be equally happy with a more general declaration, such as:

Metric cardinality limits are meant as a self-protection feature. Implementations are free to use any definition that works for their design, subject to two constraints: (1) each metric event is aggregated by exactly one aggregator, and (2) when overflow is happening, the `otel.metric.overflow` attribute appears on at least one timeseries.

What inconsistency are you considering? Does the inconsistency somehow not happen without attribute filtering?

Does this happen because attribute filters, as described here, are really being used as measurement processors? I think it's time for OTel to add a measurement processor specification with a clear relationship to cardinality limits. Measurement processors MUST be applied before aggregation so that they are an effective and correct control against cardinality limits.
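
For illustration, a measurement processor of the kind proposed here could take a shape like the following. This is a hypothetical sketch in Go; no such interface exists in the specification or any SDK today, and every name in it is invented:

```go
package main

import "fmt"

// Measurement is a simplified metric event (hypothetical type).
type Measurement struct {
	Value float64
	Attrs map[string]string
}

// MeasurementProcessor transforms or drops a measurement before it
// reaches aggregation. Running before aggregation is what would make
// it an effective control against cardinality limits.
type MeasurementProcessor interface {
	// Process returns the measurement to aggregate and whether to keep it.
	Process(m Measurement) (Measurement, bool)
}

// dropQuery is an example processor that removes a high-cardinality
// attribute before any aggregator (and any cardinality limit) sees it.
type dropQuery struct{}

func (dropQuery) Process(m Measurement) (Measurement, bool) {
	delete(m.Attrs, "query")
	return m, true
}

func main() {
	m := Measurement{Value: 1, Attrs: map[string]string{"path": "/", "query": "user=bob"}}
	out, keep := dropQuery{}.Process(m)
	fmt.Println(keep, out.Attrs) // true map[path:/]
}
```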

@MrAlias (Contributor, Author) commented Jan 8, 2024

What inconsistency are you considering? Does the inconsistency somehow not happen without attribute filtering?

I have laid out an example of the inconsistency in #3798:

For example, if measurements for the following attributes are made:

  1. {path: "/", code: 200, query: ""}
  2. {path: "/", code: 400, query: "user=bob"}
  3. {path: "/admin", code: 200, query: ""}

Suppose an attribute filter is applied so that only the path attribute is retained, and a cardinality limit of 3 is set. If the filtering is applied prior to checking the cardinality limit, the following attributes will be kept on the output metric streams:

  • {path: "/"}
  • {path: "/admin"}

However, if the cardinality limit is applied prior to filtering, the following attributes will be kept on the output metric streams:

  • {path: "/"}
  • {otel.metric.overflow: true}
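
To make the divergence concrete, here is a runnable sketch of the two orderings. This is illustrative Go, not SDK code, and it assumes the common convention that one slot under the limit is reserved for the overflow series:

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// attrs is a simplified attribute set.
type attrs map[string]string

// id renders an attribute set as a stable series identity.
func id(a attrs) string {
	keys := make([]string, 0, len(a))
	for k := range a {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	parts := make([]string, len(keys))
	for i, k := range keys {
		parts[i] = k + "=" + a[k]
	}
	return "{" + strings.Join(parts, ", ") + "}"
}

// filterPath mimics a view filter that retains only the path attribute.
func filterPath(a attrs) attrs { return attrs{"path": a["path"]} }

// outputSeries returns the distinct output series identities, with one
// slot under limit reserved for the overflow series. filterFirst picks
// whether the view filter runs before or after the limit check.
func outputSeries(ms []attrs, limit int, filterFirst bool) []string {
	kept := map[string]attrs{}
	overflow := false
	for _, m := range ms {
		a := m
		if filterFirst {
			a = filterPath(a) // filter, then count against the limit
		}
		k := id(a)
		if _, ok := kept[k]; ok {
			continue
		}
		if len(kept) < limit-1 {
			kept[k] = a
		} else {
			overflow = true // limit hit: route to the overflow series
		}
	}
	out := map[string]bool{}
	for _, a := range kept {
		if !filterFirst {
			a = filterPath(a) // filtering after limiting merges survivors
		}
		out[id(a)] = true
	}
	if overflow {
		out["{otel.metric.overflow=true}"] = true
	}
	names := make([]string, 0, len(out))
	for n := range out {
		names = append(names, n)
	}
	sort.Strings(names)
	return names
}

func main() {
	ms := []attrs{
		{"path": "/", "code": "200", "query": ""},
		{"path": "/", "code": "400", "query": "user=bob"},
		{"path": "/admin", "code": "200", "query": ""},
	}
	fmt.Println("filter then limit:", outputSeries(ms, 3, true))
	// filter then limit: [{path=/admin} {path=/}]
	fmt.Println("limit then filter:", outputSeries(ms, 3, false))
	// limit then filter: [{otel.metric.overflow=true} {path=/}]
}
```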

@MrAlias (Contributor, Author) commented:

Would you be equally happy with a more general declaration, such as:

Metric cardinality limits are meant as a self-protection feature. Implementations are free to use any definition that works for their design, subject to two constraints: (1) each metric event is aggregated by exactly one aggregator, and (2) when overflow is happening, the `otel.metric.overflow` attribute appears on at least one timeseries.

I think this could work. I am worried that it would be too vague for readers, though; they may not understand that it relates to attribute filtering.

A Member commented:

Hmm... I'm not sure I agree that it should be left unspecified. The view attribute filter mechanism is meant to be a user control for cardinality, so I think it would be surprising to users if its interaction with this other cardinality-limiting control were ambiguous.

I imagine the workflow being something like:

  • Run the app with default cardinality limits and views.
  • Exceed the cardinality limit and notice via the otel.metric.overflow series and application logs.
  • Add an attribute filter to remove the problematic attributes.
  • Expect to see that the otel.metric.overflow series is gone.

If the cardinality limit applies before view attribute filtering, then view attribute filtering isn't an effective tool to prevent overflow. With a cardinality limit of 2000, I could end up seeing only 1000 series exported and still see the otel.metric.overflow series. This would be quite surprising.

Note that @dashpole brought this up in #2960, where cardinality limits were originally added:

Being able to correct cardinality problems via Views seems essential, and while this is a step forward, this feels like a big omission. Are views necessarily "late in the pipeline"?

I believe the conclusion (based on comment, comment) was to imply that view attribute filtering should happen before cardinality limits are applied.
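
For reference, the user control in question looks like this in the Go SDK. This is a sketch: the instrument name and attribute key are illustrative, and the view API shown is `go.opentelemetry.io/otel/sdk/metric` as of the time of this discussion:

```go
package main

import (
	"context"

	"go.opentelemetry.io/otel/attribute"
	sdkmetric "go.opentelemetry.io/otel/sdk/metric"
)

func main() {
	// A view that keeps only the path attribute on the matched
	// instrument; whether this filtering runs before or after the
	// cardinality limit is exactly what this PR is trying to pin down.
	view := sdkmetric.NewView(
		sdkmetric.Instrument{Name: "http.server.request.count"},
		sdkmetric.Stream{AttributeFilter: attribute.NewAllowKeysFilter("path")},
	)
	mp := sdkmetric.NewMeterProvider(sdkmetric.WithView(view))
	defer func() { _ = mp.Shutdown(context.Background()) }()
}
```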

@MrAlias (Contributor, Author) commented Jan 8, 2024

If the cardinality limit applies before view attribute filtering, then view attribute filtering isn't an effective tool to prevent overflow. With a cardinality limit of 2000, I could end up seeing only 1000 series exported and still see the otel.metric.overflow series. This would be quite surprising.

As @jsuereth stated, this was discussed in the issue this PR resolves (#3798):

[...] regarding cardinality limits we're talking about error scenarios and worst-case behavior. We already have a lot of inconsistencies in how failures are handled due to runtime limitations. We try to be consistent, but when it comes to extraordinary/error scenarios, I think some inconsistencies between SDKs are ok.

@MrAlias (Contributor, Author) commented:

If the cardinality limit applies before view attribute filtering, then view attribute filtering isn't an effective tool to prevent overflow.

Agreed, but I don't think it was an effective tool to begin with. Many implementations do not filter on measurement, so filtering is not preventing memory overflows there.

I also don't think we can say those implementations are wrong to not do this. Filtering on measurement is a poorly scaling computation pattern.

I think the only thing that can be said about attribute filtering is that it is an effective tool to prevent overflowing the data sent to a backend. That is, in many implementations it is an effective cost-reduction tool, not a memory-limiting tool.

A Contributor commented:

I think the most compelling argument I heard in our discussion today was around user expectations. If users think that attribute filtering would lead to memory limiting, then we have an issue and an argument to specify semantics here to match user expectations.

However, if possible, I'd prefer to allow room for experimentation in implementations. I can see three designs today:

  1. A PubSub/RingBuffer-style hot path where measurements are pushed into a queue and some aggregation is responsible for pulling them off (OpenCensus did this; I believe C# works like this in some fashion).
  2. "Fast" access to local memory storage from the hot path, where data is directly written to memory (possibly via "fast" atomics, etc.). Java has this design, and it tries to quickly reduce contention between threads recording values via two-stage locking (one lock to reach a particular timeseries, another to write data into that timeseries). We migrated the Java SDK to this style after benchmarking vs. OpenCensus.
  3. A two-stage aggregation process where known storage is allocated ahead of time and metric writes are flushed quickly, and then a follow-up aggregation collects the first set of aggregations/timeseries, performing filtering / further aggregation as needed (@jmacd's design).

In all of these designs we're trying to trade off "slowdown in the hot path" against "total memory consumption". I think there's still room for us to explore here, so I don't want too rigid a specification.
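
As a rough illustration of the first design above (illustrative Go, not any SDK's actual code): the hot path only enqueues, and a single consumer owns the aggregation state:

```go
package main

import (
	"fmt"
	"sync"
)

type event struct {
	series string
	value  float64
}

func main() {
	// Hot path: producers push measurements into a bounded queue and
	// return immediately; contention is limited to the channel.
	queue := make(chan event, 1024)

	sums := map[string]float64{}
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		// Cold path: a single consumer owns the aggregation state, so
		// no per-series locking is needed.
		for e := range queue {
			sums[e.series] += e.value
		}
	}()

	queue <- event{series: `{path="/"}`, value: 1}
	queue <- event{series: `{path="/"}`, value: 1}
	queue <- event{series: `{path="/admin"}`, value: 1}
	close(queue)
	wg.Wait()
	fmt.Println(sums) // map[{path="/"}:2 {path="/admin"}:1]
}
```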

Can we guarantee that:

  • Cardinality limits => memory out-of-bounds protections?

Are we sure that:

  • Users assume attribute filters will also limit memory consumption for timeseries?

A Member commented:

Users assume attribute filters will also limit memory consumption for timeseries?

I think so, see open-telemetry/opentelemetry-java-instrumentation#10119 for an example

github-actions bot commented:

This PR was marked stale due to lack of activity. It will be closed in 7 days.

github-actions bot added the Stale label Jan 17, 2024
@MrAlias removed the Stale label Jan 17, 2024
github-actions bot commented:

This PR was marked stale due to lack of activity. It will be closed in 7 days.

@MrAlias (Contributor, Author) commented Jan 30, 2024

SIG meeting follow-up on this:

  • Myself, @jsuereth, @jmacd, and @jack-berg would like to find a time to discuss this synchronously, develop a plan of action, and report back to the spec SIG (via meeting or PR).

I plan to touch base with the people listed via Slack in a few hours to find a time that works.

github-actions bot commented Feb 7, 2024

This PR was marked stale due to lack of activity. It will be closed in 7 days.

github-actions bot commented:

Closed as inactive. Feel free to reopen if this PR is still being worked on.

github-actions bot closed this Feb 14, 2024
@trask (Member) commented Feb 14, 2024

/unstale /reopen 🤞

I'm interested in this from the perspective of #3785, where I think it's important to apply the attributes "filter" advisory before applying the cardinality limit.

Labels: spec:metrics (Related to the specification/metrics directory), Stale

Successfully merging this pull request may close this issue: How is the cardinality limit applied when attributes are being filtered?

8 participants