Specify cardinality limiting and attribute filtering interaction #3803
Conversation
#### Interaction with filtering

It is left unspecified if attribute filtering from a user provided view needs to be applied before or after applying the cardinality limit. While this can lead to inconsistent telemetry across implementations, this will only happen in an error scenario where a cardinality limit is used.
I thought through this --
I couldn't find which aspect of "attribute filtering" and views you are concerned about, and I'd like to understand it.
The key phrase in #2960, to me, is:
Regardless of aggregation temporality, the SDK MUST ensure that every metric event is reflected in exactly one Aggregator, which is either an Aggregator associated with the correct attribute set or an aggregator associated with the overflow attribute set.
The line you're referring to, I think, is:
Implementations MAY accept additional attribute filtering functionality for this parameter.
In my mental model for this, the view is responsible for choosing which aggregators accumulate which metric event. Custom filtering might change the logic used by a View to choose aggregators, but doesn't change the View's role.
The relationship between views and cardinality limits as defined (i.e., either the correct aggregator or the overflow aggregator is selected) is, to me, decoupled from attribute filtering as performed by views.
There is, on purpose (IMO), a lot left unspecified about how cardinality limits are to be applied, because it is very implementation dependent.
Would you be equally happy with a more general declaration, such as:
Metric cardinality limits are meant as a self-protection feature. Implementations are free to use any definition that works for their design, subject to the two constraints: (1) each metric event is aggregated by exactly one aggregator, and (2) when overflow is happening the `otel.metric.overflow` attribute appears on at least one timeseries.
What inconsistency are you considering? Does the inconsistency somehow not happen without attribute filtering?
Does this happen because attribute filters, as described here, are really being used as measurement processors? I think it's time for OTel to add a measurement processor specification with a clear relationship to cardinality limits. Measurement processors MUST be applied before aggregation so that they are an effective/correct control against cardinality limits.
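To make that concrete, here is a rough Go sketch of what such a hook could look like; the `MeasurementProcessor` name and shape are my invention for illustration, not an existing OTel API:

```go
package metric

import "go.opentelemetry.io/otel/attribute"

// MeasurementProcessor is a hypothetical hook that runs on every
// measurement before any aggregator sees it. Because it runs
// pre-aggregation, any attributes it drops are guaranteed to count
// against (and can relieve) the cardinality limit.
type MeasurementProcessor interface {
	// Process may rewrite the attribute set of a measurement before
	// an aggregator is chosen for it.
	Process(attrs attribute.Set) attribute.Set
}
```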
What inconsistency are you considering? Does the inconsistency somehow not happen without attribute filtering?
I have laid out an example of the inconsistency in #3798:
For example, if measurements for the following attributes are made:
{path: "/", code: 200, query: ""}
{path: "/", code: 400, query: "user=bob"}
{path: "/admin", code: 200, query: ""}
If an attribute filter is applied so that only the `path` attribute is retained and a cardinality limit of 3 is set, and the filtering is applied prior to checking the cardinality limit, the following attributes will be kept on the output metric streams:
{path: "/"}
{path: "/admin"}
However, if the cardinality limit is applied prior to filtering the following attributes will be kept on the output metric streams:
{path: "/"}
{otel.metric.overflow: true}
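To make the two orderings concrete, the following is a minimal, self-contained Go sketch (not any SDK's real API; attribute sets are modeled as plain strings, and the limit reserves one slot for the overflow series) that reproduces both outputs above:

```go
package main

import "fmt"

// attrs models an attribute set as its rendered string form.
type attrs string

// filterToPath simulates a view attribute filter that retains only the
// path attribute. The mapping is hard-coded from the example above.
var filterToPath = map[attrs]attrs{
	`{path: "/", code: 200, query: ""}`:         `{path: "/"}`,
	`{path: "/", code: 400, query: "user=bob"}`: `{path: "/"}`,
	`{path: "/admin", code: 200, query: ""}`:    `{path: "/admin"}`,
}

// aggregate applies a cardinality limit: once limit-1 distinct attribute
// sets exist, any new set is folded into the overflow series.
func aggregate(in []attrs, limit int) map[attrs]int {
	out := map[attrs]int{}
	for _, a := range in {
		if _, ok := out[a]; !ok && len(out) >= limit-1 {
			a = `{otel.metric.overflow: true}`
		}
		out[a]++
	}
	return out
}

func main() {
	measurements := []attrs{
		`{path: "/", code: 200, query: ""}`,
		`{path: "/", code: 400, query: "user=bob"}`,
		`{path: "/admin", code: 200, query: ""}`,
	}

	// Filter first, then limit: both distinct path values fit.
	// Prints: map[{path: "/"}:2 {path: "/admin"}:1]
	filtered := make([]attrs, len(measurements))
	for i, m := range measurements {
		filtered[i] = filterToPath[m]
	}
	fmt.Println(aggregate(filtered, 3))

	// Limit first, then filter: the third raw set overflows before the
	// filter sees it. Prints: map[{otel.metric.overflow: true}:1 {path: "/"}:2]
	merged := map[attrs]int{}
	for a, n := range aggregate(measurements, 3) {
		if f, ok := filterToPath[a]; ok {
			a = f
		}
		merged[a] += n
	}
	fmt.Println(merged)
}
```

Running it prints two series and no overflow for filter-then-limit, and `{path: "/"}` plus the overflow series for limit-then-filter, matching the example.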
Would you be equally happy with a more general declaration, such as:
Metric cardinality limits are meant as a self-protection feature. Implementations are free to use any definition that works for their design, subject to the two constraints: (1) each metric event is aggregated by exactly one aggregator, and (2) when overflow is happening the `otel.metric.overflow` attribute appears on at least one timeseries.
I think this could work. I am worried that it would be too vague for readers, though; they may not understand that it relates to attribute filtering.
Hmm, I'm not sure I agree that it should be left unspecified. The view attribute filter mechanism is meant to be a user control for cardinality, so I think it would be surprising to users if its interaction with this other cardinality-limiting control were ambiguous.
I imagine the workflow being something like:
- Run the app with default cardinality limits and views.
- Exceed the cardinality limit and notice via `otel.metric.overflow` and application logs.
- Add an attribute filter to remove problematic attributes.
- Expect to see that the `otel.metric.overflow` series is gone.
If the cardinality limit applies before view attribute filtering, then view attribute filtering isn't an effective tool to prevent overflow. With a cardinality limit of 2000, I could end up seeing only 1000 series exported and still see the `otel.metric.overflow` series. This would be quite surprising.
Note that @dashpole brought this up in #2960, where cardinality limits were originally added:
Being able to correct cardinality problems via Views seems essential, and while this is a step forward, this feels like a big omission. Are views necessarily "late in the pipeline"?
I believe the conclusion (based on comment, comment) was to imply that view attribute filtering should happen before cardinality limits are applied.
If the cardinality limit applies before view attribute filtering, then view attribute filtering isn't an effective tool to prevent overflow. With a cardinality limit of 2000, I could end up seeing only 1000 series exported and still see the `otel.metric.overflow` series. This would be quite surprising.
As @jsuereth stated in the issue where this was discussed, which this PR is resolving (#3798):
[...] regarding cardinality limits we're talking about error scenarios and worst-case behavior. We already have a lot of inconsistencies in how failures are handled due to runtime limitations. We try to be consistent, but when it comes to extraordinary/error scenarios, I think some inconsistencies between SDKs are ok.
If the cardinality limit applies before view attribute filtering, then view attribute filtering isn't an effective tool to prevent overflow.
Agreed, but I don't think it was an effective tool to begin with. Many implementations do not filter on measurement, and filtering is therefore not preventing memory overflows there.
I also don't think we can say those implementations are wrong to not do this. Filtering on measurement is a poorly scaling computation pattern.
I think the only thing that can be said about attribute filtering is that it is an effective tool to prevent overflow of data being sent to a backend. Meaning it is an effective cost-reduction tool, not a memory-limiting tool, in many implementations.
I think the most compelling argument I heard in our discussion today was around user expectations. If users think that attribute filtering would lead to memory limiting, then we have an issue and an argument to specify semantics here to match user expectations.
However, if possible, I'd prefer to allow room for experimentation in implementation. I can see three implementations today:
- A PubSub / RingBuffer style hot path where measurements are pushed into a queue and some aggregation component is responsible for pulling them off (OpenCensus did this, and I believe C# works like this in some fashion).
- "Fast" access to local memory storage from the hot path, where data is directly written to memory (possibly via "fast" atomics, etc.). Java has this design; it tries to quickly reduce contention between threads recording values via two-stage locking (one lock to reach a particular timeseries, another to write data into that timeseries). We migrated the Java SDK to this style after benchmarking vs. OpenCensus. (See the sketch after this list.)
- A two-stage aggregation process where known storage is allocated ahead of time and metric writes are flushed quickly, and then a follow-up aggregation collects the first set of aggregations/timeseries, performing filtering / further aggregation as needed (@jmacd's design).
In all of these designs we're trying to trade off "slowdown in the hotpath" for "total memory consumption". I think there's still room for us to explore here, so I don't want too rigid of a specification.
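As one illustration of the trade-off, here is a minimal Go sketch of the second design above; the types and the overflow key encoding are hypothetical, not the Java or Go SDK's actual internals:

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// series holds one timeseries' running sum.
type series struct{ sum atomic.Int64 }

// storage sketches the "fast local memory" design with two-stage locking.
type storage struct {
	mu     sync.RWMutex
	series map[string]*series
	limit  int
}

func newStorage(limit int) *storage {
	return &storage{series: map[string]*series{}, limit: limit}
}

func (s *storage) record(key string, v int64) {
	// Stage one: locate the series, taking the cheap read lock first.
	s.mu.RLock()
	ts, ok := s.series[key]
	s.mu.RUnlock()
	if !ok {
		s.mu.Lock()
		if ts, ok = s.series[key]; !ok {
			// Cardinality limit enforced at series creation: once
			// limit-1 series exist, fold into the overflow series.
			if len(s.series) >= s.limit-1 {
				key = `otel.metric.overflow=true`
			}
			if ts, ok = s.series[key]; !ok {
				ts = &series{}
				s.series[key] = ts
			}
		}
		s.mu.Unlock()
	}
	// Stage two: contention-free write to the located series.
	ts.sum.Add(v)
}

func main() {
	s := newStorage(3)
	for _, k := range []string{"a", "b", "c", "d"} {
		s.record(k, 1)
	}
	fmt.Println(len(s.series)) // 3: "a", "b", and the overflow series
}
```

The point of the two stages is that the map lock is held only long enough to locate or create a series; the hot-path write itself is a single atomic add.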
Can we guarantee that:
- Cardinality limits => memory out-of-bounds protections?
Are we sure that:
- Users assume Attribute filters will also limit memory consumption for timeseries?
Users assume Attribute filters will also limit memory consumption for timeseries?
I think so; see open-telemetry/opentelemetry-java-instrumentation#10119 for an example.
SIG meeting follow-up on this: I plan to touch base with the people listed via Slack in a few hours to find a time that works.
This PR was marked stale due to lack of activity. It will be closed in 7 days.
Closed as inactive. Feel free to reopen if this PR is still being worked on.
/unstale /reopen 🤞 I'm interested in this from the perspective of #3785, where I think it's important to apply the attributes "filter" advisory before applying the cardinality limit.
Fixes #3798
Clarify that the specification allows implementations to choose the approach that works best for them when implementing cardinality limiting while attributes are being filtered.
cc @jsuereth @jack-berg