Ability to forget and reclaim MetricPoint (Delta) #2360

cijothomas · 2021-09-17T00:01:16Z

Each Metric keeps track of its MetricPoints. These points are never removed, even if they receive no new updates.
For delta aggregation:
We should reclaim any unused MetricPoints when we hit the MetricPoint limit.
Because the current implementation pre-allocates MetricPoints, this won't save any memory. But we'll be able to accommodate more MetricPoints by reclaiming unused ones. This would mean the MetricPoint limit is the "limit of active points", rather than the "limit of points seen since the beginning of the process".
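
To make the "limit of active points" idea concrete, here is a minimal, single-threaded sketch. It is not the SDK's implementation: `DeltaPointPool` and its members are hypothetical names, and it assumes a point counts as unused when it received no update during a whole collection cycle. It also ignores the hot-path synchronization that a real implementation needs.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical model of one delta metric stream; not an OpenTelemetry SDK type.
sealed class DeltaPointPool
{
    private readonly long[] sums;       // pre-allocated point storage (never resized)
    private readonly bool[] updated;    // was the slot touched since the last Collect?
    private readonly Dictionary<string, int> index = new Dictionary<string, int>();
    private readonly Stack<int> free = new Stack<int>();

    public DeltaPointPool(int capacity)
    {
        sums = new long[capacity];
        updated = new bool[capacity];
        for (int i = capacity - 1; i >= 0; i--)
        {
            free.Push(i);
        }
    }

    // Returns false when every slot is held by a tag set that is still tracked.
    public bool TryAdd(string tags, long value)
    {
        if (!index.TryGetValue(tags, out int slot))
        {
            if (free.Count == 0)
            {
                return false;
            }

            slot = free.Pop();
            index[tags] = slot;
        }

        sums[slot] += value;
        updated[slot] = true;
        return true;
    }

    // Exports the deltas, then reclaims slots that saw no update in this cycle
    // (assumption: "unused" means idle for one full collection cycle).
    public Dictionary<string, long> Collect()
    {
        var export = new Dictionary<string, long>();
        var stale = new List<string>();
        foreach (KeyValuePair<string, int> entry in index)
        {
            int slot = entry.Value;
            if (updated[slot])
            {
                export[entry.Key] = sums[slot];
                sums[slot] = 0;
                updated[slot] = false;
            }
            else
            {
                stale.Add(entry.Key);
            }
        }

        foreach (string tags in stale)
        {
            free.Push(index[tags]);
            index.Remove(tags);
        }

        return export;
    }
}
```

With a capacity of 2, a third distinct tag set is rejected while the first two are tracked. Once a collection cycle passes with no updates to them, their slots return to the free list and the third tag set is accepted. Memory use stays constant throughout, which matches the point above about pre-allocation.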

For cumulative aggregation:
It's possible to "forget" an unused MetricPoint. However, if the forgotten point is reported again, we won't be able to calculate the cumulative value since process start. This would result in a reset (i.e., the start time will be reset).
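
The reset described above can be illustrated with a small sketch. `CumulativeStream` is a hypothetical type written for this illustration, not an SDK API; it only shows why a forgotten point comes back with a new start time and a sum that restarts from zero.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical model of a cumulative stream; not an OpenTelemetry SDK type.
sealed class CumulativeStream
{
    private readonly Dictionary<string, (DateTimeOffset Start, long Sum)> points =
        new Dictionary<string, (DateTimeOffset Start, long Sum)>();

    public void Add(string tags, long value, DateTimeOffset now)
    {
        // A tag set that is unknown (never seen, or forgotten) starts a new series at "now".
        (DateTimeOffset Start, long Sum) point =
            points.TryGetValue(tags, out var existing) ? existing : (now, 0L);
        points[tags] = (point.Start, point.Sum + value);
    }

    // Dropping the point also drops its start time and running sum.
    public void Forget(string tags) => points.Remove(tags);

    public (DateTimeOffset Start, long Sum) Read(string tags) => points[tags];
}
```

If a point accumulates 5 and then 3, it reads as a sum of 8 with its original start time. After `Forget`, a new measurement of 2 reads as a sum of 2 with a later start time, so a backend sees a counter reset rather than a value of 10.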

EDIT by utpilla:

This issue has been updated to track MetricPoint reclaim for Delta aggregation temporality only.

cijothomas · 2021-11-17T18:02:46Z

This issue is slightly related to #2524, but more complex.
A few attempts were made to achieve this; see PRs #2466, #2600, and #2598.

Some observations:

To account for the possibility of a MetricPoint being reclaimed by someone else, the hot path has to use some synchronization mechanism. This has significantly reduced throughput in all of the above PRs.
At the root, the issue is "atomically look up which MetricPoint should be used for the given tags, and do the update, without affecting performance".
As the PR attempts above showed, plain locks, or even ReaderWriterLockSlim, cannot be used as is without sacrificing performance.

This requires more investigation and time to address. For the 1.2 release, #2524 and #2358 will be the focus, and this issue will be re-investigated post 1.2.

reyang · 2021-11-17T20:15:16Z

Throwing out a couple of ideas here:

The synchronization (lock) approach: break down the lock granularity. For example, instead of having a single R/W lock for each point, have a per-CPU-core/group lock, then decide whether there is a need to escalate to a bigger lock.
The non-synchronization approach (even better): leverage page faults or segment access from the memory manager to prevent stale writes (a stale write should result in a recoverable error plus a retry).

tgrieger-sf · 2023-01-09T19:53:31Z

We are currently hitting an issue where we happen to have a metric with an ever increasing amount of unique tags which causes us to hit the max metric points per stream very quickly (even after increasing the max to 1,000,000). Is there any planned work around this or anything we can do to mitigate the issue until something permanent is in place?

alanwest · 2023-01-09T22:30:41Z

Cardinality issues can be tricky. Depending on your scenario, the ability for the SDK to reclaim metric points may or may not help resolve your issue.

If there are known keys that have a large or infinite set of values, one solution may be to use the View API to filter to a specific set of keys, thereby excluding the problematic ones.

We have an example here:

opentelemetry-dotnet/docs/metrics/customizing-the-sdk/Program.cs, lines 42 to 43 in 9c05eea:

    // For the instrument "MyCounterCustomTags", aggregate with only the keys "tag1", "tag2".
    .AddView(instrumentName: "MyCounterCustomTags", new MetricStreamConfiguration() { TagKeys = new string[] { "tag1", "tag2" } })

reyang · 2023-01-09T23:05:57Z

> We are currently hitting an issue where we happen to have a metric with an ever increasing amount of unique tags which causes us to hit the max metric points per stream very quickly (even after increasing the max to 1,000,000). Is there any planned work around this or anything we can do to mitigate the issue until something permanent is in place?

@tgrieger-sf could you share more about the scenario? (e.g. why do you need to put "an ever-increasing amount of unique tags" as metrics dimensions)

tgrieger-sf · 2023-01-09T23:14:53Z

@alanwest Unfortunately, for our scenario, we need all of the tags that are coming through (we're the ones creating the data).

@reyang In our scenario, we are using an observable gauge to record the log consumption of various tables in our SQL instances. We don't directly control these tables, and thousands of them get created and dropped in any given hour, so we rack up many unique tags in no time.

reyang · 2023-01-10T00:09:39Z

@tgrieger-sf if these tables are created/dropped quickly, how do you use them as a metrics dimension? (e.g. is the goal to get the unique count of tables?)

tgrieger-sf · 2023-01-10T00:42:38Z

@reyang Stats are collected every few minutes. Tables live long enough to be caught during the aggregation in our service. The goal isn't to get a unique count of the tables, but rather to determine which parts of the application (by looking at the tables) are generating the most log. We can't get any less granular than this, unfortunately.

reyang · 2023-01-10T01:03:08Z

@tgrieger-sf I wonder if "which parts of the application" has a limited/finite number of parts that can be derived from the ever-increasing names. If yes, could this be used as the actual dimension?

tgrieger-sf · 2023-01-10T01:05:11Z

@reyang You're not wrong, but the problem is that this isn't defined yet (it's being worked on, but it isn't a priority and won't be ready anytime soon).

reyang · 2023-01-10T01:08:37Z

@tgrieger-sf based on the discussion, I feel that the "right" approach seems to be using exemplars to sample the table names, and avoid using the table names as a dimension.

tgrieger-sf · 2023-01-10T01:16:39Z

@reyang I appreciate the conversation and I'll look into exemplars. Thanks!

idoublefirstt · 2023-03-17T20:49:24Z

@reyang Any update on this? I'm an internal customer. We are trying to migrate to OpenTelemetry to emit SLI/SLO data which contains ArmId as part of the metrics dimensions. Resources get created and deleted daily. It would be great to be able to reclaim metric points for deleted ArmIds to save some memory. Our service is quite large. With the current design, it seems like we'd need to restart our monitoring service once a week to clean up memory usage.

reyang · 2023-03-18T00:59:31Z

@idoublefirstt this is a top item we plan to address before the next release https://github.com/open-telemetry/opentelemetry-dotnet/milestone/36. The actual time might vary depending on the prioritization and bandwidth.

yoziv · 2023-03-18T11:36:13Z

I would also like to strengthen the case for this. In another internal group, we have a large security-related service that badly needs this ability, since we also need to support some kind of GUID for our purposes.

cijothomas · 2023-03-20T16:57:57Z

Please follow this issue to be updated about this. Though it's treated as high priority (no doubt!), it's not currently in the 1.5 or 1.6 milestone, and it is unlikely to land in 1.5. I'll propose making this part of the 1.6 milestone (which will be released later this year).

utpilla · 2023-10-17T03:57:37Z

@tgrieger-sf @idoublefirstt @yoziv and to anyone else who has expressed interest in this feature, please try out the 1.7.0-alpha.1 version of OpenTelemetry SDK which offers this feature. I'd really appreciate any feedback.

You could check PR #4486 to know more about this feature.

utpilla · 2023-12-09T00:37:33Z

> @tgrieger-sf @idoublefirstt @yoziv and to anyone else who has expressed interest in this feature, please try out the 1.7.0-alpha.1 version of OpenTelemetry SDK which offers this feature. I'd really appreciate any feedback.
>
> You could check PR #4486 to know more about this feature.

There is an update in the Metrics SDK behavior starting with the latest stable 1.7.0 release. Instead of making this the default behavior of the SDK, we have offered it as an experimental, opt-in-only feature. You can opt in by setting OTEL_DOTNET_EXPERIMENTAL_METRICS_RECLAIM_UNUSED_METRIC_POINTS to true, either using an environment variable or through IConfiguration. Check #5052 for more details.
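
As a rough illustration of the opt-in, the sketch below sets the variable in-process before building the provider and selects delta temporality, since reclaim applies to delta aggregation only. It assumes the OpenTelemetry and OpenTelemetry.Exporter.Console packages (1.7.0 or later); the meter name is made up for the example, and setting the variable from code rather than in the hosting environment is a convenience for the demo, not a recommendation from this thread.

```csharp
using System;
using System.Diagnostics.Metrics;
using OpenTelemetry;
using OpenTelemetry.Metrics;

// Opt in before the MeterProvider is built, so the setting is visible when the SDK
// reads its configuration. In production this would normally be set in the environment.
Environment.SetEnvironmentVariable(
    "OTEL_DOTNET_EXPERIMENTAL_METRICS_RECLAIM_UNUSED_METRIC_POINTS", "true");

// "MyCompany.MyProduct" is a placeholder meter name for this example.
using var meter = new Meter("MyCompany.MyProduct");

using var meterProvider = Sdk.CreateMeterProviderBuilder()
    .AddMeter(meter.Name)
    .AddConsoleExporter((exporterOptions, readerOptions) =>
    {
        // Reclaim is tracked for delta aggregation temporality only.
        readerOptions.TemporalityPreference = MetricReaderTemporalityPreference.Delta;
    })
    .Build();

var counter = meter.CreateCounter<long>("requests");
counter.Add(1, new KeyValuePair<string, object?>("armId", Guid.NewGuid().ToString()));
```

See #5052 for the authoritative description of how the setting is read, including the IConfiguration route.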

CodeBlanch · 2024-03-13T20:10:51Z

I'm going to close this because we have the feature work done. I opened #5443 for future work to make it stable.

cijothomas added the enhancement and metrics labels Sep 17, 2021

reyang added the priority:p1 label Sep 17, 2021

alanwest self-assigned this Oct 4, 2021

cijothomas mentioned this issue Oct 6, 2021: Handle exception from observable instruments #2457 (Merged)

alanwest mentioned this issue Oct 7, 2021: Reuse unused MetricPoints #2466 (Closed)

cijothomas added the release:required-for-ga label Oct 26, 2021

cijothomas removed the release:required-for-ga label Nov 16, 2021

cijothomas added this to the 1.2.0 milestone Nov 16, 2021

cijothomas removed this from the 1.2.0 milestone Nov 17, 2021

cijothomas changed the title ~~Ability to forget MetricPoint~~ Ability to forget and reclaim MetricPoint Nov 17, 2021

cijothomas mentioned this issue Nov 17, 2021: Allow ability to override max Metric streams and MetricPoints per stream #2635 (Merged)

alanwest mentioned this issue Nov 18, 2021: [WIP] Reuse unused MetricPoints - a different attempt #2598 (Closed)

utpilla mentioned this issue May 13, 2023: Update AggregatorStore to reclaim unused MetricPoints for Delta aggregation temporality #4486 (Merged)

danelson mentioned this issue Jun 1, 2023: Clarity around connector/spanmetrics dimensions_cache_size open-telemetry/opentelemetry-collector-contrib#23004 (Closed)

cijothomas mentioned this issue Jul 24, 2023: Get the current MetricPoint count #4670 (Closed)

utpilla added this to the 1.7.0 milestone Sep 12, 2023

utpilla self-assigned this Oct 17, 2023

utpilla changed the title ~~Ability to forget and reclaim MetricPoint~~ Ability to forget and reclaim MetricPoint (Delta) Oct 17, 2023

utpilla mentioned this issue Nov 28, 2023: Remove threshold for MetricPoint reclaim #5087 (Merged)

utpilla mentioned this issue Dec 9, 2023: Make MetricPoint reclaim an opt-in experimental feature #5052 (Merged)

vishweshbankwar modified the milestones: 1.7.0, 1.9.0 Mar 12, 2024

CodeBlanch mentioned this issue Mar 13, 2024: Promote MetricPoint reclaim feature for delta aggregation from experimental to stable #5443 (Closed)

CodeBlanch closed this as completed Mar 13, 2024

CodeBlanch removed this from the 1.9.0 milestone Mar 13, 2024