[APR-208] fix: allow pushing multiple data points along within a single metric #217
Conversation
Regression Detector (DogStatsD)

Regression Detector Results

Run ID: f9d8b06b-add6-4080-af9b-7ccec9295291
Baseline: 7.55.2

Performance changes are noted in the perf column of each table:

No significant changes in experiment optimization goals

Confidence level: 90.00%

There were no significant changes in experiment optimization goals at this confidence level and effect size tolerance.
| perf | experiment | goal | Δ mean % | Δ mean % CI | links |
|---|---|---|---|---|---|
| ➖ | dsd_uds_512kb_3k_contexts | ingress throughput | +0.07 | [+0.00, +0.13] | |
| ➖ | dsd_uds_1mb_50k_contexts_memlimit | ingress throughput | +0.03 | [+0.00, +0.06] | |
| ➖ | dsd_uds_100mb_3k_contexts | ingress throughput | +0.02 | [+0.01, +0.03] | |
| ➖ | dsd_uds_1mb_50k_contexts | ingress throughput | +0.02 | [-0.01, +0.05] | |
| ➖ | dsd_uds_100mb_250k_contexts | ingress throughput | +0.00 | [-0.05, +0.06] | |
| ➖ | dsd_uds_500mb_3k_contexts | ingress throughput | +0.00 | [-0.00, +0.01] | |
| ➖ | dsd_uds_10mb_3k_contexts | ingress throughput | -0.00 | [-0.04, +0.04] | |
| ➖ | dsd_uds_1mb_3k_contexts | ingress throughput | -0.05 | [-0.09, -0.01] | |
| ➖ | dsd_uds_100mb_3k_contexts_distributions_only | memory utilization | -0.45 | [-0.69, -0.22] | |
Explanation
A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".
For each experiment, we decide whether a change in performance is a "regression" -- a change worth investigating further -- if all of the following criteria are true:
- Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.
- Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.
- Its configuration does not mark it "erratic".
c251996 to 71ea255
Regression Detector (Saluki)

Regression Detector Results

Run ID: 892b2b8b-4aea-4fd0-b7ef-1d9a6d128208
Baseline: a5bdd38

Performance changes are noted in the perf column of each table:

Significant changes in experiment optimization goals

Confidence level: 90.00%
| perf | experiment | goal | Δ mean % | Δ mean % CI | links |
|---|---|---|---|---|---|
| ✅ | dsd_uds_100mb_250k_contexts | ingress throughput | +6.52 | [+5.78, +7.25] | |
| ➖ | dsd_uds_1mb_50k_contexts_memlimit | ingress throughput | +2.13 | [-1.11, +5.37] | |
| ➖ | dsd_uds_500mb_3k_contexts | ingress throughput | +0.66 | [+0.55, +0.76] | |
| ➖ | dsd_uds_512kb_3k_contexts | ingress throughput | +0.07 | [+0.00, +0.14] | |
| ➖ | dsd_uds_1mb_3k_contexts | ingress throughput | +0.07 | [+0.02, +0.12] | |
| ➖ | dsd_uds_50mb_10k_contexts_no_inlining_no_allocs | ingress throughput | +0.01 | [-0.04, +0.05] | |
| ➖ | dsd_uds_10mb_3k_contexts | ingress throughput | +0.00 | [-0.05, +0.06] | |
| ➖ | dsd_uds_50mb_10k_contexts_no_inlining | ingress throughput | +0.00 | [-0.00, +0.00] | |
| ➖ | dsd_uds_1mb_50k_contexts | ingress throughput | -0.00 | [-0.00, +0.00] | |
| ➖ | dsd_uds_100mb_3k_contexts | ingress throughput | -0.00 | [-0.01, +0.00] | |
| ✅ | dsd_uds_100mb_3k_contexts_distributions_only | memory utilization | -7.86 | [-7.99, -7.72] | |
b9a08a4 to 554a55d
Made a high-level pass at least, but will review more in-depth after some discussion.
let metric = if self.forward_timestamped_metrics {
    // If we're configured to forward timestamped metrics immediately, then we need to
    // try to handle any timestamped values in this metric. If we get back `Some(...)`,
    // it's either the original metric because no values had timestamps _or_ it's a
    // modified version of the metric after all timestamped values were split out and
    // directly forwarded.
    match handle_forward_timestamped_metric(metric, &mut event_buffer) {
This is making me wonder if it'd be better to split out timestamped metric samples earlier (i.e. at the source) and send them through a different pipeline that doesn't include the aggregate transform. That'd be a much more direct mapping of the agent's current no-aggregation "pipeline". We could even differentiate more strongly between "samples" (single data point, no timestamp, to be aggregated) and "metrics" (potentially multiple points, timestamped, aggregated) at the topology level and have different components that work on each (e.g. aggregate is samples -> metrics). Just a thought.
That's not a bad idea.
I did try to spend some time during the crafting of MetricValues to figure out how to separate those two concerns but found it difficult, because single DSD packets can be multi-value, and it's not great to be emitting five events from the DSD source instead of just one, since they all have an identical timestamp (or lack thereof, really).
Like if we just sketched it out right now, based on your thoughts, we could end up with something like...
pub type RawSamples = Vec<MetricValue>;
pub type AggregatedSamples = Vec<(u64, MetricValue)>;
pub enum Event {
RawMetric(Context, RawSamples, MetricMetadata),
AggregatedMetric(Context, AggregatedSamples, MetricMetadata),
...
}
This doesn't strike me as the worst thing, although I don't love the idea of every component now needing to consider metrics separately -- raw vs. aggregated -- although I suppose this PR is still making them do that, just by calling things like MetricValues::set_timestamp explicitly to reify the maybe-timestamped-maybe-not samples into definitely-timestamped samples. 🤔
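As a tiny illustration of that reification step, here is a hypothetical sketch (not the actual Saluki types; the choice to leave client-supplied timestamps untouched is an assumption for the sketch, not confirmed from the code):

```rust
use std::num::NonZeroU64;

// Hypothetical container: each value may or may not carry a client timestamp.
struct Values {
    values: Vec<(Option<NonZeroU64>, f64)>,
}

impl Values {
    // Reify maybe-timestamped samples into definitely-timestamped ones:
    // values without a timestamp receive the given one, while values that
    // already carry a client-supplied timestamp are left as-is.
    fn set_timestamp(&mut self, ts: NonZeroU64) {
        for (slot, _) in self.values.iter_mut() {
            slot.get_or_insert(ts);
        }
    }
}
```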
The other bit, which the Datadog Agent gets around by deconstructing what we have with the MetricValue enum, is having all the non-common data -- rate interval, etc. -- on the metric sample type, and then just having the raw samples in a vector. Our choice here of having MetricValue definitely paints us into a corner, to an extent, since it makes it harder to have homogeneous value containers.
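That agent-style layout could be sketched roughly like this (all names here are hypothetical, drawn from neither the Agent nor the Saluki codebase):

```rust
// Hypothetical sketch: non-common data (e.g. a rate interval) lives on the
// sample type itself, so the raw values can stay a plain, homogeneous vector.
#[derive(Debug, Clone, Copy, PartialEq)]
enum SampleKind {
    Count,
    Gauge,
    Rate { interval_secs: u64 },
}

#[derive(Debug)]
struct MetricSample {
    kind: SampleKind,
    // Raw, not-yet-timestamped values from a single (possibly multi-value) packet.
    values: Vec<f64>,
}

impl MetricSample {
    // All values in one sample share the same kind, keeping the container homogeneous.
    fn new(kind: SampleKind, values: Vec<f64>) -> Self {
        Self { kind, values }
    }
}
```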
lib/saluki-event/src/metric/data.rs (Outdated)
pub struct MetricValues {
    kind: MetricKind,
    values: SmallVec<[(u64, MetricValue); 2]>,
}
Similar to my previous comment, my biggest question here is whether we want to have these two axes mixed within a single collection (timestamped vs. not, and potentially values with different kinds). It kind of seems like we're allowing a lot of possibilities to be represented, and I'm not sure how many of them are really valid/useful. If we can split these possibilities out into more structured forms, the code that handles them can likely get quite a bit simpler.
Yeah, I think to pull it out a little bit, there's a bunch of very real scenarios in terms of what we see coming into the DSD source:
- single metric value w/o timestamp (arguably the most common scenario)
- multiple metric values w/o timestamp (I don't have the data on how common this is, but I'd guess much much less common than single metric value)
- single metric value w/ timestamp (probably becoming more common.. maybe more common than multiple-value-no-timestamp?)
- multiple metric values w/ timestamp (technically possible, but not likely since having a timestamp implies aggregation, which implies a single value)
So our most common scenario is handling metric values without a timestamp, with a very real possibility of handling multiple of them. After that, it's a single timestamped value.
After that, then there's what we need to handle coming out of the aggregator phase, which is single or multiple values with timestamps: they always have a timestamp, and it's just as likely to have one timestamped value as it is to have two (the two is just a factor of the flush interval, and will likely only ever be a max of two... but I'm not sure it's safe to use that as an invariant or if it buys us anything by doing so).
Kind of stream of consciousness spewing here... but yeah, I do think this design is only as complex as it needs to be to support the relevant scenarios with a single data type, but related to your comment above, I agree that it's not the most optimal design and we could definitely think about trying to bifurcate some of this a little more.
I'll give some more thought to that approach.
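For illustration, the four ingest shapes enumerated above could be classified with a sketch like the following (hypothetical names; plain std types stand in for the SmallVec-backed storage):

```rust
// Hypothetical labels for the four shapes a metric can arrive in at the DSD source.
#[derive(Debug, PartialEq)]
enum IngestShape {
    SingleUntimestamped, // arguably the most common case
    MultiUntimestamped,  // multi-value packet, no timestamp
    SingleTimestamped,   // pre-aggregated, client-supplied timestamp
    MultiTimestamped,    // technically possible, but unlikely in practice
}

// Classify a slice of (optional timestamp, value) pairs into one of the shapes.
fn classify(values: &[(Option<u64>, f64)]) -> IngestShape {
    let timestamped = values.iter().any(|(ts, _)| ts.is_some());
    match (values.len(), timestamped) {
        (1, false) => IngestShape::SingleUntimestamped,
        (_, false) => IngestShape::MultiUntimestamped,
        (1, true) => IngestShape::SingleTimestamped,
        (_, true) => IngestShape::MultiTimestamped,
    }
}
```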
554a55d to 86dd5d8
One note for future discussion, but happy to see this merged as a definite improvement.
struct TimestampedValue<T> {
    timestamp: Option<NonZeroU64>,
It feels a bit odd that TimestampedValue might not have a timestamp. Obviously we can get values without timestamps from the DSD protocol, but I don't think we have any use cases right now where a timestamp never gets assigned (and I'm not sure we will?). Is there ever any meaningful variation in how those timestamps eventually get assigned? If not, it seems simplest to remove this variability from the data model and simply assign the timestamp when we receive the metric. If we're using timestamped vs. not as an indicator of whether something should be aggregated, it seems like we could mark that more explicitly somehow (or maybe we don't need to? How important is it that those points are explicitly not aggregated, vs. running through the same logic and simply not getting aggregated because they're already independent? I.e. is it an optimization or a semantic difference?)
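A rough sketch of that "assign the timestamp at receive time" suggestion (hypothetical names, not the actual Saluki types):

```rust
use std::num::NonZeroU64;

// Hypothetical: once a value enters the pipeline, it always has a timestamp,
// so downstream code never needs to handle the Option case.
struct TimestampedValue {
    timestamp: NonZeroU64,
    value: f64,
}

// Prefer the client-provided timestamp if the packet carried one; otherwise
// stamp the value with the receive time.
fn on_receive(value: f64, packet_ts: Option<NonZeroU64>, now: NonZeroU64) -> TimestampedValue {
    TimestampedValue {
        timestamp: packet_ts.unwrap_or(now),
        value,
    }
}
```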
Context
In #216, we documented the suboptimal behavior of the aggregate transform when compared to the Datadog Agent. Specifically, the current behavior of the aggregate transform leads to additional metric payloads being sent, and thus more network bandwidth consumed, when compared to the Datadog Agent.
This is suboptimal as metrics traffic could jump by a large amount -- 20 to 40% -- for an identical workload, which is an unacceptable difference, even in this experimental stage.
Solution
This PR introduces a large body of work to effectively bake in the concept of a metric being able to hold multiple timestamp/value pairs in a single Metric, commonly referred to as "data points" in the Datadog Agent and other popular metrics protocols such as OTLP. In making this change, we can more efficiently shuttle multiple data points from source to transform/destination, and also allow destinations to avoid having to implement their own costly/complex aggregation logic to efficiently forward these metrics.

Most of the work centers around the addition of a new value container, MetricValues, which lives alongside MetricValue and handles the hard work of ensuring a homogeneous set of values, holding their timestamps, merging in values based on timestamp, and all ancillary operations needed to effectively build and utilize MetricValues.

Fixes #216.