Update AggregatorStore to reclaim unused MetricPoints for Delta aggregation temporality #4486
Stress Tests
For max MetricPoints = 1000, there is a drop of ~4.6% in the total number of loops executed in around 5 minutes:
main branch: Loops: 7,809,957,791, Loops/Second: 23,745,565, CPU Cycles/Loop: 878, RunwayTime (Seconds): 315
Replace Program.cs with the code below:
using System.Diagnostics.Metrics;
using System.Runtime.CompilerServices;
using OpenTelemetry.Metrics;

namespace OpenTelemetry.Tests.Stress;

public partial class Program
{
    private const int ArraySize = 10;
    private static readonly Meter TestMeter = new(Utils.GetCurrentMethodName());
    private static readonly Counter<long> TestCounter = TestMeter.CreateCounter<long>("TestCounter");
    private static readonly string[] DimensionValues = new string[ArraySize];
    private static readonly ThreadLocal<Random> ThreadLocalRandom = new(() => new Random());

    public static void Main()
    {
        for (int i = 0; i < ArraySize; i++)
        {
            DimensionValues[i] = $"DimValue{i}";
        }

        using var exporter = new CustomExporter();
        using var metricReader = new PeriodicExportingMetricReader(exporter, exportIntervalMilliseconds: 10)
        {
            TemporalityPreference = MetricReaderTemporalityPreference.Delta,
        };

        using var meterProvider = Sdk.CreateMeterProviderBuilder()
            .AddMeter(TestMeter.Name)
            .AddReader(metricReader)
            .Build();

        Stress(prometheusPort: 9464);
    }

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    protected static void Run()
    {
        var random = ThreadLocalRandom.Value;
        TestCounter.Add(
            100,
            new("DimName1", DimensionValues[random.Next(0, ArraySize)]),
            new("DimName2", DimensionValues[random.Next(0, ArraySize)]),
            new("DimName3", DimensionValues[random.Next(0, ArraySize)]));
    }

    private class CustomExporter : BaseExporter<Metric>
    {
        public long Sum = 0;
        public string DimensionKey;
        public object DimensionValue;

        public CustomExporter()
        {
        }

        public override ExportResult Export(in Batch<Metric> batch)
        {
            foreach (var metric in batch)
            {
                foreach (ref readonly var metricPoint in metric.GetMetricPoints())
                {
                    foreach (var tag in metricPoint.Tags)
                    {
                        this.DimensionKey = tag.Key;
                        this.DimensionValue = tag.Value;
                    }

                    if (metric.MetricType.IsSum())
                    {
                        this.Sum += metricPoint.GetSumLong();
                    }
                }
            }

            return ExportResult.Success;
        }
    }
}
This PR was marked stale due to lack of activity and will be closed in 7 days. Commenting or Pushing will instruct the bot to automatically remove the label. This bot runs once per day.
Closed as inactive. Feel free to reopen if this PR is still being worked on.
Codecov Report
@@ Coverage Diff @@
## main #4486 +/- ##
==========================================
- Coverage 83.19% 82.93% -0.27%
==========================================
Files 293 294 +1
Lines 11984 12193 +209
==========================================
+ Hits 9970 10112 +142
- Misses 2014 2081 +67
@utpilla - Could you please define what "limit of active points" means here? It is very specific to how the reclaiming is proposed here.
lock (this.tagsToMetricPointIndexDictionaryDelta!)
{
    LookupData? dictionaryValue;
    if (lookupData.SortedTags != Tags.EmptyTags)
nit to consider... this might read a little better with a property: if (!lookupData.GivenEqualsSorted).
If this were the case, I think EmptyTags could be moved to the LookupData class and made private.
internal class LookupData
{
    private static readonly Tags EmptyTags = new(Array.Empty<KeyValuePair<string, object?>>());

    public LookupData(int index) { ... }
    public LookupData(int index, Tags tags) { ... }
    public LookupData(int index, Tags sorted, Tags given) { ... }

    // References the private EmptyTags field, since it now lives in LookupData.
    public bool GivenEqualsSorted => SortedTags == EmptyTags;
}
Co-authored-by: Alan West <3676547+alanwest@users.noreply.github.com>
Co-authored-by: Vishwesh Bankwar <vishweshbankwar@users.noreply.github.com>
…atorStore-To-Reclaim-MetricPoints
The MetricPoint reclaim behavior makes some fundamental changes to how the SDK exports metrics. It might surprise some users. For example, @vishweshbankwar pointed out a specific case where the SDK has exported a particular MetricPoint (k1,v1) at time T1. After a while, this MetricPoint gets reclaimed. Now if their application produces a totally new set of unique metric points and consistently updates them, they might run into a situation where there is no MetricPoint available when a thread later tries to update the MetricPoint (k1,v1) at time T2 and the measurement gets dropped. When consuming the metrics, their dashboard would show them that no measurements were recorded for the MetricPoint (k1,v1) at time T2, which is not correct as it was actually dropped. I would be in favor of the user opting in for the reclaim behavior so that they are aware of what to expect instead of simply making it the default behavior for Delta aggregation temporality. I also spoke to @alanwest who brought up a good point to check how the Java SIG sets the default behavior and assess if it would make sense to be consistent with them. I think for now it's okay to have this behavior (even for |
I'm curious about what would be the behavioral difference from the client side between the below 2 cases:
It seems that Delta temporality is designed to unburden the client from keeping high-cardinality state?
I don't feel we should be too worried about this case. Folks using DELTA aggregation temporality should expect:
+1 on making this opt-in.
LGTM
I spoke with @utpilla and I agree we can ship this PR as-is for 1.7.0-alpha release. Making this opt-in vs. opt-out will be a follow up discussion to resolve prior to 1.7.0 stable.
@vishweshbankwar I think between you and me, we've reviewed this PR most closely. If you're ready to give it a 👍, I'll merge it.
// We never increment the ReferenceCount for MetricPoint with no tags (index == 0) and the MetricPoint for overflow attribute,
// but we always decrement it (in the Update methods). This should be fine.
// ReferenceCount doesn't matter for MetricPoint with no tags and overflow attribute as they are never reclaimed.
internal int ReferenceCount;
@alanwest @vishweshbankwar This is going to add to our memory consumption (4 bytes x MaxMetricPoints). Below there is another field, AggregationType aggType, which I'm guessing is also taking up 4 bytes. I don't think either of those two things really needs the full 4 bytes. What if we did some bit manipulation on a single int field to serve both needs with better memory utilization?
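One way to picture the suggestion above is the sketch below, which keeps a small type code in the low 8 bits of one int and the reference count in the remaining 24 bits. This is only an illustration under assumptions: the PackedState type, its member names, and the 8/24 bit split are hypothetical and are not taken from the PR or the SDK.

```csharp
using System.Threading;

// Hypothetical illustration; not the SDK's MetricPoint layout.
// Low 8 bits: aggregation type code. High 24 bits: signed reference count.
public struct PackedState
{
    private const int TypeBits = 8;
    private const int TypeMask = (1 << TypeBits) - 1;
    private const int CountUnit = 1 << TypeBits; // adding this bumps the count by one

    private int packed;

    public PackedState(byte aggregationType)
    {
        this.packed = aggregationType;
    }

    // The mask strips the count bits and leaves only the type code.
    public int AggregationType => Volatile.Read(ref this.packed) & TypeMask;

    // Arithmetic right shift keeps the sign, so a count that dips below zero
    // (decrement without a matching increment) still reads back correctly.
    public int ReferenceCount => Volatile.Read(ref this.packed) >> TypeBits;

    public void Acquire() => Interlocked.Add(ref this.packed, CountUnit);

    public void Release() => Interlocked.Add(ref this.packed, -CountUnit);
}
```

Because this is a mutable struct, it would have to be updated in place (as a field or array element accessed by reference), never through a copy. The trade-off is extra masking and shifting on every read in exchange for saving one int per MetricPoint.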
Hi @utpilla @vishweshbankwar , I'd like to better understand the risk associated with enabling reclaim. Does the above mean that any reclaimed MetricPoint's key-value set will be dropped moving forward? So if (k1,v1) is reclaimed and then some time later a metric with (k1,v1) is emitted, that metric will be dropped instead of a new metric point being created? EDIT: Sorry, I missed the "they might run into a situation where there is no MetricPoint available" earlier, now that case makes sense. If I understand correctly, the quoted case applies only when the max limit for metric points is reached and nothing can be reclaimed to make room for (k1,v1) again. |
Addresses #2360
This PR shows a possible approach to reclaim MetricPoints for Delta aggregation temporality.
Current behavior:
For any metric, once the SDK has encountered a given number (default: 2000) of unique dimension combinations, it drops any new measurement with a new dimension combination.
With this PR, the AggregatorStore is updated to start reclaiming unused MetricPoints once it has encountered a particular number of unique dimension combinations. This number is set to 75% of the max metric points allowed. For the default case, this means the AggregatorStore does not begin reclaiming MetricPoints until it has seen 1500 (= 75% * 2000) unique dimension combinations. Once it hits 1500, it changes its behavior to begin reclaiming unused MetricPoints. I have set this threshold because there is some cost associated with reclaiming MetricPoints, and we should avoid making every user pay for it.
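The threshold arithmetic can be sketched as follows. The ReclaimThreshold class and its methods are hypothetical names used only for illustration, and rounding down for limits that are not divisible by 4 is an assumption; the description only states the 75% figure and the 2000 to 1500 example.

```csharp
using System;

// Hypothetical helper for illustration; not part of the OpenTelemetry SDK.
public static class ReclaimThreshold
{
    // Number of unique dimension combinations seen before reclaiming starts.
    public static int Compute(int maxMetricPoints)
    {
        if (maxMetricPoints <= 0)
        {
            throw new ArgumentOutOfRangeException(nameof(maxMetricPoints));
        }

        // 75% via integer math (rounds down); long avoids overflow for large limits.
        return (int)((long)maxMetricPoints * 3 / 4);
    }

    // True once the store has seen enough unique combinations to start reclaiming.
    public static bool ShouldReclaim(int uniqueCombinationsSeen, int maxMetricPoints)
        => uniqueCombinationsSeen >= Compute(maxMetricPoints);
}
```

With the default limit, ReclaimThreshold.Compute(2000) gives 1500, matching the figure in the description; for the stress-test setting of 1000 it gives 750.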
What changes when AggregatorStore starts to reclaim unused MetricPoints?
Changes
Func named lookupAggregatorStore to allow for different behavior for Delta and Cumulative aggregation
Queue<int> named availableMetricPoints - This holds the indices for reclaimed MetricPoints. In the AggregatorStore ctor, it's initialized with the remaining 25% of the indices after the threshold.
ReferenceCount - This is used by the Snapshot method to determine if a MetricPoint is in use by an Update thread. Update threads increment the ReferenceCount of the MetricPoint before the update and decrement it after the update.
LookupData - This is the lookup dictionary value type for Delta aggregation. This helps Update threads determine if the MetricPoint that they are about to update has already been reclaimed by the Snapshot thread or by some other Update thread for a different set of dimension combinations.
DroppedCount - This is used by the unit tests to verify that no measurements were dropped when they test the MetricPoint reclaim feature.
CHANGELOG.md files updated for non-trivial changes
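The interplay of the free-index queue, the reference count, and the dropped counter can be modeled with the simplified sketch below. It is not the PR's implementation: it uses one lock for everything, puts every index in the queue from the start (rather than only the last 25%), reclaims only during Snapshot, and uses a string as a stand-in for a dimension combination. ReclaimSketch, Slot, BeginUpdate, EndUpdate, and Touched are hypothetical names; only availableMetricPoints, ReferenceCount, and DroppedCount echo names from the description.

```csharp
using System.Collections.Generic;

// Simplified, lock-based model for illustration; not the SDK's AggregatorStore.
public sealed class ReclaimSketch
{
    private sealed class Slot
    {
        public string? Key;        // dimension combination mapped to this slot
        public int ReferenceCount; // in-flight updates pinning this slot
        public long Sum;           // delta accumulated since the last snapshot
        public bool Touched;       // updated since the last snapshot
    }

    private readonly object sync = new();
    private readonly Slot[] slots;
    private readonly Dictionary<string, int> keyToIndex = new();
    private readonly Queue<int> availableMetricPoints = new();

    public ReclaimSketch(int maxMetricPoints)
    {
        this.slots = new Slot[maxMetricPoints];
        for (int i = 0; i < maxMetricPoints; i++)
        {
            this.slots[i] = new Slot();
            this.availableMetricPoints.Enqueue(i);
        }
    }

    public long DroppedCount { get; private set; }

    // Finds or claims a slot and pins it. Returns -1 if the measurement is dropped.
    public int BeginUpdate(string key)
    {
        lock (this.sync)
        {
            if (!this.keyToIndex.TryGetValue(key, out int index))
            {
                if (this.availableMetricPoints.Count == 0)
                {
                    this.DroppedCount++;
                    return -1;
                }

                index = this.availableMetricPoints.Dequeue();
                this.slots[index].Key = key;
                this.keyToIndex[key] = index;
            }

            this.slots[index].ReferenceCount++;
            return index;
        }
    }

    // Records the measurement and unpins the slot.
    public void EndUpdate(int index, long value)
    {
        lock (this.sync)
        {
            Slot slot = this.slots[index];
            slot.Sum += value;
            slot.Touched = true;
            slot.ReferenceCount--;
        }
    }

    // Exports deltas for touched slots; reclaims slots that are idle and unpinned.
    public Dictionary<string, long> Snapshot()
    {
        var exported = new Dictionary<string, long>();
        lock (this.sync)
        {
            for (int i = 0; i < this.slots.Length; i++)
            {
                Slot slot = this.slots[i];
                if (slot.Key is null)
                {
                    continue;
                }

                if (slot.Touched)
                {
                    exported[slot.Key] = slot.Sum;
                    slot.Sum = 0;
                    slot.Touched = false;
                }
                else if (slot.ReferenceCount == 0)
                {
                    this.keyToIndex.Remove(slot.Key);
                    slot.Key = null;
                    this.availableMetricPoints.Enqueue(i);
                }
            }
        }

        return exported;
    }
}
```

In this model, a slot that was not updated between two snapshots and is not pinned returns to the queue, so a later, different key can claim it. A key that arrives while every slot is either active or pinned is counted in DroppedCount, which mirrors the dropped-measurement scenario raised in the review discussion.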