Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ability to forget and reclaim MetricPoint (Delta) #2360

Closed
cijothomas opened this issue Sep 17, 2021 · 19 comments
Closed

Ability to forget and reclaim MetricPoint (Delta) #2360

cijothomas opened this issue Sep 17, 2021 · 19 comments
Assignees
Labels
enhancement New feature or request metrics Metrics signal related

Comments

@cijothomas
Copy link
Member

cijothomas commented Sep 17, 2021

Each Metric keeps track of its MetricPoints. This points are never removed, even if there are no new updates to it.
For delta aggregation:
We should reclaim any un-used MetricPoints, when we hit Metric Point limits.
Due to pre-allocation of MetricPoints in current implementation, there won't be any memory savings due to this. But we'll be able to accomodate more MetricPoints, by reclaiming unused points. This would mean the MetricPoint limit is the "limit of active points", rather than "limit of points seen since the beginning of process".

For cumulative aggregation:
Its possible to "forget" an un-used Metricpoint. However, if the forgotten point is reported again, we'll not be able to calculate cumulative since the process start. This would result in a reset (i.e start time will be reset).

EDIT by utpilla:

This issue has been updated to track MetricPoint reclaim for Delta aggregation temporality only.

@cijothomas
Copy link
Member Author

This issue is slightly related to #2524, but more complex.
A few attempts were done to achieve this, see PRs: #2466
#2600 #2598

Some observations:

  1. To account for the possibility of a MetricPoint being reclaimed by someone else, the hot path has to do some synchronization mechanism. This has significantly affected throughput, in all the above PRs.
  2. At the root, the issue is "atomically lookup which MetricPoint should be used for given tags, and do the update, without affecting performance".
  3. As proved by the PR attempts above, plain locks/ or even ReadWriteLockSlim cannot be used as is, without sacrificing performance.

This requires more investigation and time to address. For 1.2 release, #2524 and #2358 will be focused, and this issue will be re-investigated post 1.2

@cijothomas cijothomas removed this from the 1.2.0 milestone Nov 17, 2021
@reyang
Copy link
Member

reyang commented Nov 17, 2021

Throwing couple ideas here:

  1. the synchronization (lock) approach - break down the lock granularity - e.g. instead of having a single R/W lock for each point, having a per CPU core/group lock then decide if there is a need to escalate to a bigger lock.
  2. non synchronization approach (even better) - leverage the page faults or segment access from memory manager to prevent stale writes (and the stale writes should result in a recoverable error + retry).

@cijothomas cijothomas changed the title Ability to forget MetricPoint Ability to forget and reclaim MetricPoint Nov 17, 2021
@tgrieger-sf
Copy link

We are currently hitting an issue where we happen to have a metric with an ever increasing amount of unique tags which causes us to hit the max metric points per stream very quickly (even after increasing the max to 1,000,000). Is there any planned work around this or anything we can do to mitigate the issue until something permanent is in place?

@alanwest
Copy link
Member

alanwest commented Jan 9, 2023

Cardinality issues can be tricky. Depending on your scenario, the ability for the SDK to reclaim metric points may or may not help resolve your issue.

If there are known keys that have a large or infinite set of values one solution may be to use the View API to filter to a specific set of keys thereby excluding problematic ones.

We have an example here:

// For the instrument "MyCounterCustomTags", aggregate with only the keys "tag1", "tag2".
.AddView(instrumentName: "MyCounterCustomTags", new MetricStreamConfiguration() { TagKeys = new string[] { "tag1", "tag2" } })

@reyang
Copy link
Member

reyang commented Jan 9, 2023

We are currently hitting an issue where we happen to have a metric with an ever increasing amount of unique tags which causes us to hit the max metric points per stream very quickly (even after increasing the max to 1,000,000). Is there any planned work around this or anything we can do to mitigate the issue until something permanent is in place?

@tgrieger-sf could you share more about the scenario? (e.g. why do you need to put "an ever-increasing amount of unique tags" as metrics dimensions)

@tgrieger-sf
Copy link

@alanwest Unfortunately, for our scenario, we need all of the tags that are coming through (we're the ones creating the data).

@reyang In our scenario, we are using an observable gauge to log the log consumption of various tables in our SQL instances. We don't directly control these tables and they are thousands that get created and dropped in any given hour so we rack up many unique tags in no time.

@reyang
Copy link
Member

reyang commented Jan 10, 2023

@tgrieger-sf if these tables are created/dropped quickly, how do you use them as metrics dimension? (e.g. is the goal to get the unique count of tables?)

@tgrieger-sf
Copy link

@reyang Stats are collected every few minutes. Table live for long enough to be caught during our aggregation in our service. The goal isn't to get a unique count of the tables but more to determine which parts of the application (by looking at the tables) are generating the most log. We can't get any less granular than this unfortunately.

@reyang
Copy link
Member

reyang commented Jan 10, 2023

@tgrieger-sf I wonder if "which parts of the application" have a limited/finite number of parts that can be derived from the ever-increasing names, if yes, could this be used as the actual dimension?

@tgrieger-sf
Copy link

@reyang You're not wrong but the problem is that's not defined (it's being worked on but isn't a priority and won't be ready anytime soon).

@reyang
Copy link
Member

reyang commented Jan 10, 2023

@tgrieger-sf based on the discussion, I feel that the "right" approach seems to be using exemplars to sample the table names, and avoid using the table names as a dimension.

@tgrieger-sf
Copy link

@reyang I appreciate the conversation and I'll look into exemplars. Thanks!

@idoublefirstt
Copy link

@reyang Any update on this? I'm an internal customer. We are trying to migrate to OpenTelemetry to emit SLI/SLO data which contains ArmId as part of metrics dimensions. Resources get created and deleted daily. It would be great to able to reclaim metrics points on deleted armIds to save some memory consumption. Our service is a quite large. With current design, it seems like we'd need to restart our monitoring service once a week to cleanup memory usage.

@reyang
Copy link
Member

reyang commented Mar 18, 2023

@idoublefirstt this is a top item we plan to address before the next release https://github.com/open-telemetry/opentelemetry-dotnet/milestone/36. The actual time might vary depending on the prioritization and bandwidth.

@yoziv
Copy link

yoziv commented Mar 18, 2023

I would like to also strengthen the case for that , in another internal group we have a big security related service which needs this ability greatly as we also need to support some kind of guid for our purpose and this feature is greatly needed

@cijothomas
Copy link
Member Author

Please follow this issue to be updated about this. Though its treated as high-priority, its not really in the 1.5 or 1.6 milestone. Its still high priority (no doubt!), but it is unlikely to land in 1.5 milestone. I'll propose to make this part of 1.6 milestone (which released later this year)

@utpilla
Copy link
Contributor

utpilla commented Oct 17, 2023

@tgrieger-sf @idoublefirstt @yoziv and to anyone else who has expressed interest in this feature, please try out the 1.7.0-alpha.1 version of OpenTelemetry SDK which offers this feature. I'd really appreciate any feedback.

You could check PR #4486 to know more about this feature.

@utpilla utpilla self-assigned this Oct 17, 2023
@utpilla utpilla changed the title Ability to forget and reclaim MetricPoint Ability to forget and reclaim MetricPoint (Delta) Oct 17, 2023
@utpilla
Copy link
Contributor

utpilla commented Dec 9, 2023

@tgrieger-sf @idoublefirstt @yoziv and to anyone else who has expressed interest in this feature, please try out the 1.7.0-alpha.1 version of OpenTelemetry SDK which offers this feature. I'd really appreciate any feedback.

You could check PR #4486 to know more about this feature.

There is an update in the Metrics SDK behavior starting the latest stable 1.7.0 release. Instead of making this the default behavior of the SDK, we have offered it as an experimental opt-in only feature. You can opt-in to enable this behavior by setting OTEL_DOTNET_EXPERIMENTAL_METRICS_RECLAIM_UNUSED_METRIC_POINTS to true either using an environment variable or through IConfiguration. Check #5052 for more details.

@CodeBlanch
Copy link
Member

I'm going to close this because we have the feature work done. I opened #5443 for future work to make it stable.

@CodeBlanch CodeBlanch removed this from the 1.9.0 milestone Mar 13, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request metrics Metrics signal related
Projects
None yet
Development

No branches or pull requests

9 participants