
[APR-208] fix: allow pushing multiple data points along within a single metric #217

Merged
merged 5 commits into main from tobz/metrics-data-model-multiple-data-points
Aug 30, 2024

Conversation

tobz
Member

@tobz tobz commented Aug 26, 2024

Context

In #216, we documented the suboptimal behavior of the aggregate transform when compared to the Datadog Agent. Specifically, the current behavior of the aggregate transform leads to additional metric payloads being sent, and thus more network bandwidth consumed, when compared to the Datadog Agent.

This is suboptimal as metrics traffic could jump by a large amount -- 20 to 40% -- for an identical workload, which is an unacceptable difference, even in this experimental stage.

Solution

This PR introduces a large body of work to effectively bake in the concept of a metric being able to hold multiple timestamp/value pairs in a single Metric, commonly referred to as "data points" in the Datadog Agent and other popular metrics protocols such as OTLP.

In making this change, we can more efficiently shuttle multiple data points from source to transform/destination, and also allow destinations to avoid having to implement their own costly/complex aggregation logic to efficiently forward these metrics.

Most of the work centers around the addition of a new value container, MetricValues, which lives alongside MetricValue and handles the hard work of ensuring a homogeneous set of values, holding their timestamps, merging values in by timestamp, and providing all the ancillary operations needed to effectively build and utilize MetricValues.
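As a rough illustration of the bookkeeping described above, here's a minimal sketch of a multi-data-point container in Rust. The name mirrors MetricValues, but the fields and merge semantics are simplified assumptions (plain f64 values, counter-style summing on timestamp collision), not the actual implementation:

```rust
// Hypothetical sketch of a multi-data-point value container, loosely modeled
// on the MetricValues type described above. Names and semantics are
// illustrative only.
#[derive(Debug, PartialEq)]
pub struct MetricValues {
    // (timestamp, value) pairs, kept sorted by timestamp.
    values: Vec<(u64, f64)>,
}

impl MetricValues {
    pub fn new() -> Self {
        Self { values: Vec::new() }
    }

    // Merge a value in by timestamp: values sharing a timestamp are combined
    // (summed here, counter-style), otherwise a new data point is inserted in
    // timestamp order.
    pub fn merge_value(&mut self, timestamp: u64, value: f64) {
        match self.values.binary_search_by_key(&timestamp, |&(ts, _)| ts) {
            Ok(idx) => self.values[idx].1 += value,
            Err(idx) => self.values.insert(idx, (timestamp, value)),
        }
    }

    pub fn len(&self) -> usize {
        self.values.len()
    }
}

fn main() {
    let mut values = MetricValues::new();
    values.merge_value(100, 1.0);
    values.merge_value(100, 2.0); // merged into the existing data point
    values.merge_value(160, 5.0); // new data point

    assert_eq!(values.len(), 2);
    assert_eq!(values.values, vec![(100, 3.0), (160, 5.0)]);
}
```

A single Metric carrying a container like this can represent several flush intervals' worth of aggregated points in one event, which is what lets destinations forward multiple data points without re-aggregating.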

Fixes #216.

@tobz tobz added the type/bug Bug fixes. label Aug 26, 2024
@tobz tobz requested review from a team as code owners August 26, 2024 15:01
@github-actions github-actions bot added area/core Core functionality, event model, etc. area/io General I/O and networking. area/components Sources, transforms, and destinations. source/dogstatsd DogStatsD source. transform/aggregate Aggregate transform. destination/datadog-metrics Datadog Metrics destination. destination/prometheus Prometheus Scrape destination. labels Aug 26, 2024
@tobz tobz changed the title fix: properly fix: allow pushing multiple data points along within a single metric Aug 26, 2024

pr-commenter bot commented Aug 26, 2024

Regression Detector (DogStatsD)

Regression Detector Results

Run ID: f9d8b06b-add6-4080-af9b-7ccec9295291

Baseline: 7.55.2
Comparison: 7.55.3

Performance changes are noted in the perf column of each table:

  • ✅ = significantly better comparison variant performance
  • ❌ = significantly worse comparison variant performance
  • ➖ = no significant change in performance

No significant changes in experiment optimization goals

Confidence level: 90.00%
Effect size tolerance: |Δ mean %| ≥ 5.00%

There were no significant changes in experiment optimization goals at this confidence level and effect size tolerance.

Fine details of change detection per experiment

| perf | experiment | goal | Δ mean % | Δ mean % CI |
|------|------------|------|----------|-------------|
| ➖ | dsd_uds_512kb_3k_contexts | ingress throughput | +0.07 | [+0.00, +0.13] |
| ➖ | dsd_uds_1mb_50k_contexts_memlimit | ingress throughput | +0.03 | [+0.00, +0.06] |
| ➖ | dsd_uds_100mb_3k_contexts | ingress throughput | +0.02 | [+0.01, +0.03] |
| ➖ | dsd_uds_1mb_50k_contexts | ingress throughput | +0.02 | [-0.01, +0.05] |
| ➖ | dsd_uds_100mb_250k_contexts | ingress throughput | +0.00 | [-0.05, +0.06] |
| ➖ | dsd_uds_500mb_3k_contexts | ingress throughput | +0.00 | [-0.00, +0.01] |
| ➖ | dsd_uds_10mb_3k_contexts | ingress throughput | -0.00 | [-0.04, +0.04] |
| ➖ | dsd_uds_1mb_3k_contexts | ingress throughput | -0.05 | [-0.09, -0.01] |
| ➖ | dsd_uds_100mb_3k_contexts_distributions_only | memory utilization | -0.45 | [-0.69, -0.22] |

Explanation

A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".

For each experiment, we decide whether a change in performance is a "regression" -- a change worth investigating further -- if all of the following criteria are true:

  1. Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.

  2. Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.

  3. Its configuration does not mark it "erratic".
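The three criteria above can be sketched as a small predicate. The thresholds are the 5.00% effect size tolerance and the requirement that the CI exclude zero quoted above; the function name and signature are made up for illustration:

```rust
// Illustrative sketch of the regression-flagging rule described above: a
// change is flagged only when the effect is large enough, its confidence
// interval excludes zero, and the experiment is not marked erratic.
// (Hypothetical helper, not part of the actual detector.)
fn is_flagged(delta_mean_pct: f64, ci: (f64, f64), erratic: bool) -> bool {
    let big_enough = delta_mean_pct.abs() >= 5.0; // criterion 1
    let ci_excludes_zero = ci.0 > 0.0 || ci.1 < 0.0; // criterion 2
    big_enough && ci_excludes_zero && !erratic // criterion 3
}

fn main() {
    // dsd_uds_1mb_3k_contexts: -0.05 [-0.09, -0.01]. The CI excludes zero,
    // but the effect is far below the 5% tolerance, so it is not flagged.
    assert!(!is_flagged(-0.05, (-0.09, -0.01), false));

    // A +6.52 [+5.78, +7.25] change clears both thresholds.
    assert!(is_flagged(6.52, (5.78, 7.25), false));

    // An erratic experiment is never flagged, regardless of effect size.
    assert!(!is_flagged(6.52, (5.78, 7.25), true));
}
```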

@tobz tobz force-pushed the tobz/metrics-data-model-multiple-data-points branch from c251996 to 71ea255 on August 26, 2024 15:18
@github-actions github-actions bot added area/ci CI/CD, automated testing, etc. and removed source/dogstatsd DogStatsD source. labels Aug 26, 2024

pr-commenter bot commented Aug 26, 2024

Regression Detector (Saluki)

Regression Detector Results

Run ID: 892b2b8b-4aea-4fd0-b7ef-1d9a6d128208

Baseline: a5bdd38
Comparison: 86dd5d8


Significant changes in experiment optimization goals

Confidence level: 90.00%
Effect size tolerance: |Δ mean %| ≥ 5.00%

| perf | experiment | goal | Δ mean % | Δ mean % CI |
|------|------------|------|----------|-------------|
| ✅ | dsd_uds_100mb_250k_contexts | ingress throughput | +6.52 | [+5.78, +7.25] |
| ✅ | dsd_uds_100mb_3k_contexts_distributions_only | memory utilization | -7.86 | [-7.99, -7.72] |

Fine details of change detection per experiment

| perf | experiment | goal | Δ mean % | Δ mean % CI |
|------|------------|------|----------|-------------|
| ✅ | dsd_uds_100mb_250k_contexts | ingress throughput | +6.52 | [+5.78, +7.25] |
| ➖ | dsd_uds_1mb_50k_contexts_memlimit | ingress throughput | +2.13 | [-1.11, +5.37] |
| ➖ | dsd_uds_500mb_3k_contexts | ingress throughput | +0.66 | [+0.55, +0.76] |
| ➖ | dsd_uds_512kb_3k_contexts | ingress throughput | +0.07 | [+0.00, +0.14] |
| ➖ | dsd_uds_1mb_3k_contexts | ingress throughput | +0.07 | [+0.02, +0.12] |
| ➖ | dsd_uds_50mb_10k_contexts_no_inlining_no_allocs | ingress throughput | +0.01 | [-0.04, +0.05] |
| ➖ | dsd_uds_10mb_3k_contexts | ingress throughput | +0.00 | [-0.05, +0.06] |
| ➖ | dsd_uds_50mb_10k_contexts_no_inlining | ingress throughput | +0.00 | [-0.00, +0.00] |
| ➖ | dsd_uds_1mb_50k_contexts | ingress throughput | -0.00 | [-0.00, +0.00] |
| ➖ | dsd_uds_100mb_3k_contexts | ingress throughput | -0.00 | [-0.01, +0.00] |
| ✅ | dsd_uds_100mb_3k_contexts_distributions_only | memory utilization | -7.86 | [-7.99, -7.72] |



pr-commenter bot commented Aug 26, 2024

Regression Detector Links

Experiment Result Links

| experiment | link(s) |
|------------|---------|
| dsd_uds_100mb_250k_contexts | [Profiling (ADP)] [Profiling (DSD)] [SMP Dashboard] |
| dsd_uds_100mb_3k_contexts | [Profiling (ADP)] [Profiling (DSD)] [SMP Dashboard] |
| dsd_uds_100mb_3k_contexts_distributions_only | [Profiling (ADP)] [Profiling (DSD)] [SMP Dashboard] |
| dsd_uds_10mb_3k_contexts | [Profiling (ADP)] [Profiling (DSD)] [SMP Dashboard] |
| dsd_uds_1mb_3k_contexts | [Profiling (ADP)] [Profiling (DSD)] [SMP Dashboard] |
| dsd_uds_1mb_50k_contexts | [Profiling (ADP)] [Profiling (DSD)] [SMP Dashboard] |
| dsd_uds_1mb_50k_contexts_memlimit | [Profiling (ADP)] [Profiling (DSD)] [SMP Dashboard] |
| dsd_uds_500mb_3k_contexts | [Profiling (ADP)] [Profiling (DSD)] [SMP Dashboard] |
| dsd_uds_512kb_3k_contexts | [Profiling (ADP)] [Profiling (DSD)] [SMP Dashboard] |
| dsd_uds_50mb_10k_contexts_no_inlining (ADP only) | [Profiling (ADP)] [SMP Dashboard] |
| dsd_uds_50mb_10k_contexts_no_inlining_no_allocs (ADP only) | [Profiling (ADP)] [SMP Dashboard] |

@tobz tobz force-pushed the tobz/metrics-data-model-multiple-data-points branch from b9a08a4 to 554a55d on August 26, 2024 19:48
Contributor

@lukesteensen lukesteensen left a comment


Made a high-level pass at least, but will review more in-depth after some discussion.

Comment on lines +289 to +295

```rust
let metric = if self.forward_timestamped_metrics {
    // If we're configured to forward timestamped metrics immediately, then we need to
    // try to handle any timestamped values in this metric. If we get back `Some(...)`,
    // it's either the original metric because no values had timestamps _or_ it's a
    // modified version of the metric after all timestamped values were split out and
    // directly forwarded.
    match handle_forward_timestamped_metric(metric, &mut event_buffer) {
```
Contributor


This is making me wonder if it'd be better to split out timestamped metric samples earlier (i.e. at the source) and send them through a different pipeline that doesn't include the aggregate transform. That'd be a much more direct mapping of the agent's current no-aggregation "pipeline". We could even differentiate more strongly between "samples" (single data point, no timestamp, to be aggregated) and "metrics" (potentially multiple points, timestamped, aggregated) at the topology level and have different components that work on each (e.g. aggregate is samples -> metrics). Just a thought.

Member Author


That's not a bad idea.

I did try to spend some time during the crafting of MetricValues to figure out how to separate those two concerns but found it difficult, because single DSD packets can be multi-value, and it's not great to be emitting five events from the DSD source instead of just one, since they all have an identical timestamp (or lack thereof, really).

Like if we just sketched it out right now, based on your thoughts, we could end up with something like...

```rust
pub type RawSamples = Vec<MetricValue>;
pub type AggregatedSamples = Vec<(u64, MetricValue)>;

pub enum Event {
    RawMetric(Context, RawSamples, MetricMetadata),
    AggregatedMetric(Context, AggregatedSamples, MetricMetadata),
    // ...
}
```

This doesn't strike me as the worst thing, although I don't love the idea of every component now needing to consider metrics separately -- raw vs aggregated. Then again, I suppose this PR is still making them do that, just by calling things like MetricValues::set_timestamp explicitly to reify the maybe-timestamped-maybe-not samples into definitely-timestamped samples. 🤔

Member Author


The other bit, which the Datadog Agent gets around, is how it deconstructs what we hold in the MetricValue enum: the Agent keeps all the non-common data -- rate interval, etc. -- on the metric sample type, and then just keeps the raw samples in a vector.

Our choice here of having MetricValue definitely paints us into a corner, to an extent, since it makes it harder to have homogeneous value containers.

Comment on lines 273 to 276

```rust
pub struct MetricValues {
    kind: MetricKind,
    values: SmallVec<[(u64, MetricValue); 2]>,
}
```
Contributor


Similar to my previous comment, my biggest question here is whether we want to have these two axes mixed within a single collection (timestamped vs. not, and potentially values of different kinds). It seems like we're allowing a lot of possibilities to be represented, and I'm not sure how many of them are really valid/useful. If we can split these possibilities out into more structured forms, the code that handles them can likely get quite a bit simpler.

Member Author


Yeah, I think to pull it out a little bit, there's a bunch of very real scenarios in terms of what we see coming into the DSD source:

  • single metric value w/o timestamp (arguably the most common scenario)
  • multiple metric values w/o timestamp (I don't have the data on how common this is, but I'd guess much much less common than single metric value)
  • single metric value w/ timestamp (probably becoming more common... maybe more common than multiple-values-no-timestamp?)
  • multiple metric values w/ timestamp (technically possible, but not likely since having a timestamp implies aggregation, which implies a single value)

So our most common scenario is handling metric values without a timestamp, with a very real possibility of handling multiple of them. After that, it's a single timestamped value.

After that, there's what we need to handle coming out of the aggregator phase, which is single or multiple values with timestamps: they always have a timestamp, and it's just as likely to have one timestamped value as two (having two is just a function of the flush interval, and it will likely only ever be a max of two... but I'm not sure it's safe to use that as an invariant, or that it buys us anything to do so).

Kind of stream of consciousness spewing here... but yeah, I do think this design is only as complex as it needs to be to support the relevant scenarios with a single data type, but related to your comment above, I agree that it's not the most optimal design and we could definitely think about trying to bifurcate some of this a little more.

I'll give some more thought to that approach.

@tobz tobz force-pushed the tobz/metrics-data-model-multiple-data-points branch from 554a55d to 86dd5d8 on August 28, 2024 20:04
Contributor

@lukesteensen lukesteensen left a comment


One note for future discussion, but happy to see this merged as a definite improvement.

Comment on lines +10 to +11

```rust
struct TimestampedValue<T> {
    timestamp: Option<NonZeroU64>,
```
Contributor


It feels a bit odd that TimestampedValue might not have a timestamp. Obviously we can get values without timestamps from the DSD protocol, but I don't think we have any use cases right now where a timestamp never gets assigned (and I'm not sure we will?). Is there ever any meaningful variation in how those timestamps eventually get assigned? If not, it seems simplest to remove this variability from the data model and simply assign the timestamp when we receive the metric. If we're using timestamped vs not as an indicator of whether something should be aggregated or not, it seems like we could mark that more explicitly somehow (or maybe we don't need to? How important is it that those points are explicitly not aggregated vs running through the same logic and simply not getting aggregated because they're already independent? I.e. is it an optimization or a semantic difference?)
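For what it's worth, the Option<NonZeroU64> shape quoted in the diff is at least cheap to carry around: Rust's niche optimization makes Option<NonZeroU64> the same size as a bare u64, so "maybe timestamped" costs no extra space over "definitely timestamped". A small self-contained sketch (the value field and its type are assumptions beyond the two quoted lines):

```rust
use std::mem::size_of;
use std::num::NonZeroU64;

// Illustrative version of the TimestampedValue shape quoted in the diff;
// everything beyond the `timestamp` field is assumed for the example.
struct TimestampedValue<T> {
    timestamp: Option<NonZeroU64>,
    value: T,
}

fn main() {
    // The niche optimization: NonZeroU64 can never be zero, so Option uses
    // the zero bit pattern to encode None, and no discriminant is needed.
    assert_eq!(size_of::<Option<NonZeroU64>>(), size_of::<u64>());

    let v = TimestampedValue {
        timestamp: NonZeroU64::new(1_724_689_200), // Some(...): nonzero input
        value: 42.0_f64,
    };
    assert!(v.timestamp.is_some());
    assert_eq!(v.value, 42.0);
}
```

Whether the Option should survive past the source, as the comment asks, is a separate semantic question from this representation cost.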

@tobz tobz merged commit 409e44e into main Aug 30, 2024
14 checks passed
@tobz tobz deleted the tobz/metrics-data-model-multiple-data-points branch August 30, 2024 17:39
@tobz tobz changed the title fix: allow pushing multiple data points along within a single metric [APR-208] fix: allow pushing multiple data points along within a single metric Aug 30, 2024