Fix the reference time rounding on Azure Metrics #37365
Pinging @elastic/obs-ds-hosted-services (Team:obs-ds-hosted-services) |
//
// See "Round outer limits" and "Round inner limits" tests in
// the metric_registry_test.go for more information.
referenceTime := time.Now().UTC().Round(time.Second)
to what extent do we care about duplicate collections vs skipped collections?
i.e. does it make sense to widen the window here to allow for drift in the collection period, maybe up to 5s?
Yeah, the jitter I added with 72c5b69 allows the collection time to drift a little to compensate for fluctuations.
I am currently using a 1-second jitter, but we can go with 2-5s, I guess.
// or more contributor license agreements. Licensed under the Elastic License;
// you may not use this file except in compliance with the Elastic License.

package azure
is this a clean copy-paste?
I copied and pasted the license header from another file; let me check if I picked a bad one 👀
Hey @tommyers-elastic, with 72c5b69 I switched from truncating/rounding to use a jitter during timestamp comparison: The pros should be:
In my tests, this version works as well as the previous one. There's an image available with this change and info about how to run this version on a local stack.
Here's how the difference looks in the debug logs:
$ cat metricbeat.log.ndjson | grep "MetricRegistry" | jq -r '[.namespace, .time_grain, .needs_update, .reference_time, .last_collection_at//"na", .time_grain_start_time//"na", .distance//"na", .elapsed//"na", .jitter//"na"] | @tsv' | grep Microsoft.Compute/virtualMachines
Microsoft.Compute/virtualMachines PT5M true 2023-12-12T13:45:03.982Z 2023-12-12T13:40:03.985Z 2023-12-12T13:40:03.982Z 3.251ms 4m59.996749s 1s
Microsoft.Compute/virtualMachines PT5M true 2023-12-12T13:45:03.982Z 2023-12-12T13:40:03.985Z 2023-12-12T13:40:03.982Z 3.251ms 4m59.996749s 1s
Microsoft.Compute/virtualMachines PT5M true 2023-12-12T13:45:03.982Z 2023-12-12T13:40:03.985Z 2023-12-12T13:40:03.982Z 3.251ms 4m59.996749s 1s
Microsoft.Compute/virtualMachines PT5M true 2023-12-12T13:45:03.982Z 2023-12-12T13:40:03.985Z 2023-12-12T13:40:03.982Z 3.251ms 4m59.996749s 1s
Microsoft.Compute/virtualMachines PT5M true 2023-12-12T13:45:03.982Z 2023-12-12T13:40:03.985Z 2023-12-12T13:40:03.982Z 3.251ms 4m59.996749s 1s
Microsoft.Compute/virtualMachines PT5M true 2023-12-12T13:45:03.982Z 2023-12-12T13:40:03.985Z 2023-12-12T13:40:03.982Z 3.251ms 4m59.996749s 1s
Microsoft.Compute/virtualMachines PT5M true 2023-12-12T13:45:03.982Z 2023-12-12T13:40:03.985Z 2023-12-12T13:40:03.982Z 3.251ms 4m59.996749s 1s
Microsoft.Compute/virtualMachines PT5M true 2023-12-12T13:50:03.980Z 2023-12-12T13:45:03.982Z 2023-12-12T13:45:03.980Z 2.171ms 4m59.997829s 1s
Microsoft.Compute/virtualMachines PT5M true 2023-12-12T13:50:03.980Z 2023-12-12T13:45:03.982Z 2023-12-12T13:45:03.980Z 2.171ms 4m59.997829s 1s
Microsoft.Compute/virtualMachines PT5M true 2023-12-12T13:50:03.980Z 2023-12-12T13:45:03.982Z 2023-12-12T13:45:03.980Z 2.171ms 4m59.997829s 1s
Microsoft.Compute/virtualMachines PT5M true 2023-12-12T13:50:03.980Z 2023-12-12T13:45:03.982Z 2023-12-12T13:45:03.980Z 2.171ms 4m59.997829s 1s
Microsoft.Compute/virtualMachines PT5M true 2023-12-12T13:50:03.980Z 2023-12-12T13:45:03.982Z 2023-12-12T13:45:03.980Z 2.171ms 4m59.997829s 1s
Microsoft.Compute/virtualMachines PT5M true 2023-12-12T13:50:03.980Z 2023-12-12T13:45:03.982Z 2023-12-12T13:45:03.980Z 2.171ms 4m59.997829s 1s
Microsoft.Compute/virtualMachines PT5M true 2023-12-12T13:50:03.980Z 2023-12-12T13:45:03.982Z 2023-12-12T13:45:03.980Z 2.171ms 4m59.997829s 1s
Microsoft.Compute/virtualMachines PT5M true 2023-12-12T13:50:03.980Z 2023-12-12T13:45:03.982Z 2023-12-12T13:45:03.980Z 2.171ms 4m59.997829s 1s
Microsoft.Compute/virtualMachines PT5M true 2023-12-12T13:50:03.980Z 2023-12-12T13:45:03.982Z 2023-12-12T13:45:03.980Z 2.171ms 4m59.997829s 1s
Microsoft.Compute/virtualMachines PT5M true 2023-12-12T13:50:03.980Z 2023-12-12T13:45:03.982Z 2023-12-12T13:45:03.980Z 2.171ms 4m59.997829s 1s
Microsoft.Compute/virtualMachines PT5M true 2023-12-12T13:50:03.980Z 2023-12-12T13:45:03.982Z 2023-12-12T13:45:03.980Z 2.171ms 4m59.997829s 1s
Microsoft.Compute/virtualMachines PT5M true 2023-12-12T13:50:03.980Z 2023-12-12T13:45:03.982Z 2023-12-12T13:45:03.980Z 2.171ms 4m59.997829s 1s
Microsoft.Compute/virtualMachines PT5M true 2023-12-12T13:50:03.980Z 2023-12-12T13:45:03.982Z 2023-12-12T13:45:03.980Z 2.171ms 4m59.997829s 1s
Microsoft.Compute/virtualMachines PT5M true 2023-12-12T13:50:03.980Z 2023-12-12T13:45:03.982Z 2023-12-12T13:45:03.980Z 2.171ms 4m59.997829s 1s
Microsoft.Compute/virtualMachines PT5M true 2023-12-12T13:55:03.977Z 2023-12-12T13:50:03.980Z 2023-12-12T13:50:03.977Z 3.283ms 4m59.996717s 1s
Microsoft.Compute/virtualMachines PT5M true 2023-12-12T13:55:03.977Z 2023-12-12T13:50:03.980Z 2023-12-12T13:50:03.977Z 3.283ms 4m59.996717s 1s
Microsoft.Compute/virtualMachines PT5M true 2023-12-12T13:55:03.977Z 2023-12-12T13:50:03.980Z 2023-12-12T13:50:03.977Z 3.283ms 4m59.996717s 1s
Microsoft.Compute/virtualMachines PT5M true 2023-12-12T13:55:03.977Z 2023-12-12T13:50:03.980Z 2023-12-12T13:50:03.977Z 3.283ms 4m59.996717s 1s
Microsoft.Compute/virtualMachines PT5M true 2023-12-12T13:55:03.977Z 2023-12-12T13:50:03.980Z 2023-12-12T13:50:03.977Z 3.283ms 4m59.996717s 1s
Microsoft.Compute/virtualMachines PT5M true 2023-12-12T13:55:03.977Z 2023-12-12T13:50:03.980Z 2023-12-12T13:50:03.977Z 3.283ms 4m59.996717s 1s
Microsoft.Compute/virtualMachines PT5M true 2023-12-12T13:55:03.977Z 2023-12-12T13:50:03.980Z 2023-12-12T13:50:03.977Z 3.283ms 4m59.996717s 1s
Microsoft.Compute/virtualMachines PT5M true 2023-12-12T13:55:03.977Z 2023-12-12T13:50:03.980Z 2023-12-12T13:50:03.977Z 3.283ms 4m59.996717s 1s
Microsoft.Compute/virtualMachines PT5M true 2023-12-12T13:55:03.977Z 2023-12-12T13:50:03.980Z 2023-12-12T13:50:03.977Z 3.283ms 4m59.996717s 1s
Microsoft.Compute/virtualMachines PT5M true 2023-12-12T13:55:03.977Z 2023-12-12T13:50:03.980Z 2023-12-12T13:50:03.977Z 3.283ms 4m59.996717s 1s
Microsoft.Compute/virtualMachines PT5M true 2023-12-12T13:55:03.977Z 2023-12-12T13:50:03.980Z 2023-12-12T13:50:03.977Z 3.283ms 4m59.996717s 1s
Microsoft.Compute/virtualMachines PT5M true 2023-12-12T13:55:03.977Z 2023-12-12T13:50:03.980Z 2023-12-12T13:50:03.977Z 3.283ms 4m59.996717s 1s
Microsoft.Compute/virtualMachines PT5M true 2023-12-12T13:55:03.977Z 2023-12-12T13:50:03.980Z 2023-12-12T13:50:03.977Z 3.283ms 4m59.996717s 1s
Microsoft.Compute/virtualMachines PT5M true 2023-12-12T13:55:03.977Z 2023-12-12T13:50:03.980Z 2023-12-12T13:50:03.977Z 3.283ms 4m59.996717s 1s
Microsoft.Compute/virtualMachines PT5M true 2023-12-12T14:00:03.974Z 2023-12-12T13:55:03.977Z 2023-12-12T13:55:03.974Z 2.489ms 4m59.997511s 1s
Microsoft.Compute/virtualMachines PT5M true 2023-12-12T14:00:03.974Z 2023-12-12T13:55:03.977Z 2023-12-12T13:55:03.974Z 2.489ms 4m59.997511s 1s
Microsoft.Compute/virtualMachines PT5M true 2023-12-12T14:00:03.974Z 2023-12-12T13:55:03.977Z 2023-12-12T13:55:03.974Z 2.489ms 4m59.997511s 1s
Microsoft.Compute/virtualMachines PT5M true 2023-12-12T14:00:03.974Z 2023-12-12T13:55:03.977Z 2023-12-12T13:55:03.974Z 2.489ms 4m59.997511s 1s
Microsoft.Compute/virtualMachines PT5M true 2023-12-12T14:00:03.974Z 2023-12-12T13:55:03.977Z 2023-12-12T13:55:03.974Z 2.489ms 4m59.997511s 1s
Microsoft.Compute/virtualMachines PT5M true 2023-12-12T14:00:03.974Z 2023-12-12T13:55:03.977Z 2023-12-12T13:55:03.974Z 2.489ms 4m59.997511s 1s
Microsoft.Compute/virtualMachines PT5M true 2023-12-12T14:00:03.974Z 2023-12-12T13:55:03.977Z 2023-12-12T13:55:03.974Z 2.489ms 4m59.997511s 1s
Microsoft.Compute/virtualMachines PT5M true 2023-12-12T14:00:03.974Z 2023-12-12T13:55:03.977Z 2023-12-12T13:55:03.974Z 2.489ms 4m59.997511s 1s
Microsoft.Compute/virtualMachines PT5M true 2023-12-12T14:00:03.974Z 2023-12-12T13:55:03.977Z 2023-12-12T13:55:03.974Z 2.489ms 4m59.997511s 1s
Microsoft.Compute/virtualMachines PT5M true 2023-12-12T14:00:03.974Z 2023-12-12T13:55:03.977Z 2023-12-12T13:55:03.974Z 2.489ms 4m59.997511s 1s
Microsoft.Compute/virtualMachines PT5M true 2023-12-12T14:00:03.974Z 2023-12-12T13:55:03.977Z 2023-12-12T13:55:03.974Z 2.489ms 4m59.997511s 1s
Microsoft.Compute/virtualMachines PT5M true 2023-12-12T14:00:03.974Z 2023-12-12T13:55:03.977Z 2023-12-12T13:55:03.974Z 2.489ms 4m59.997511s 1s
Microsoft.Compute/virtualMachines PT5M true 2023-12-12T14:00:03.974Z 2023-12-12T13:55:03.977Z 2023-12-12T13:55:03.974Z 2.489ms 4m59.997511s 1s
Microsoft.Compute/virtualMachines PT5M true 2023-12-12T14:00:03.974Z 2023-12-12T13:55:03.977Z 2023-12-12T13:55:03.974Z 2.489ms 4m59.997511s 1s
Microsoft.Compute/virtualMachines PT5M true 2023-12-12T14:05:03.977Z 2023-12-12T14:00:03.974Z 2023-12-12T14:00:03.977Z -2.857ms 5m0.002857s 1s
Microsoft.Compute/virtualMachines PT5M true 2023-12-12T14:05:03.977Z 2023-12-12T14:00:03.974Z 2023-12-12T14:00:03.977Z -2.857ms 5m0.002857s 1s
Microsoft.Compute/virtualMachines PT5M true 2023-12-12T14:05:03.977Z 2023-12-12T14:00:03.974Z 2023-12-12T14:00:03.977Z -2.857ms 5m0.002857s 1s
Microsoft.Compute/virtualMachines PT5M true 2023-12-12T14:05:03.977Z 2023-12-12T14:00:03.974Z 2023-12-12T14:00:03.977Z -2.857ms 5m0.002857s 1s
Microsoft.Compute/virtualMachines PT5M true 2023-12-12T14:05:03.977Z 2023-12-12T14:00:03.974Z 2023-12-12T14:00:03.977Z -2.857ms 5m0.002857s 1s
Microsoft.Compute/virtualMachines PT5M true 2023-12-12T14:05:03.977Z 2023-12-12T14:00:03.974Z 2023-12-12T14:00:03.977Z -2.857ms 5m0.002857s 1s
Microsoft.Compute/virtualMachines PT5M true 2023-12-12T14:05:03.977Z 2023-12-12T14:00:03.974Z 2023-12-12T14:00:03.977Z -2.857ms 5m0.002857s 1s
Microsoft.Compute/virtualMachines PT5M true 2023-12-12T14:05:03.977Z 2023-12-12T14:00:03.974Z 2023-12-12T14:00:03.977Z -2.857ms 5m0.002857s 1s
Microsoft.Compute/virtualMachines PT5M true 2023-12-12T14:05:03.977Z 2023-12-12T14:00:03.974Z 2023-12-12T14:00:03.977Z -2.857ms 5m0.002857s 1s
Microsoft.Compute/virtualMachines PT5M true 2023-12-12T14:05:03.977Z 2023-12-12T14:00:03.974Z 2023-12-12T14:00:03.977Z -2.857ms 5m0.002857s 1s
Microsoft.Compute/virtualMachines PT5M true 2023-12-12T14:05:03.977Z 2023-12-12T14:00:03.974Z 2023-12-12T14:00:03.977Z -2.857ms 5m0.002857s 1s
Microsoft.Compute/virtualMachines PT5M true 2023-12-12T14:05:03.977Z 2023-12-12T14:00:03.974Z 2023-12-12T14:00:03.977Z -2.857ms 5m0.002857s 1s
Microsoft.Compute/virtualMachines PT5M true 2023-12-12T14:05:03.977Z 2023-12-12T14:00:03.974Z 2023-12-12T14:00:03.977Z -2.857ms 5m0.002857s 1s
Microsoft.Compute/virtualMachines PT5M true 2023-12-12T14:05:03.977Z 2023-12-12T14:00:03.974Z 2023-12-12T14:00:03.977Z -2.857ms 5m0.002857s 1s
Microsoft.Compute/virtualMachines PT5M true 2023-12-12T14:10:03.975Z 2023-12-12T14:05:03.977Z 2023-12-12T14:05:03.975Z 1.953ms 4m59.998047s 1s
Microsoft.Compute/virtualMachines PT5M true 2023-12-12T14:10:03.975Z 2023-12-12T14:05:03.977Z 2023-12-12T14:05:03.975Z 1.953ms 4m59.998047s 1s
Microsoft.Compute/virtualMachines PT5M true 2023-12-12T14:10:03.975Z 2023-12-12T14:05:03.977Z 2023-12-12T14:05:03.975Z 1.953ms 4m59.998047s 1s
Microsoft.Compute/virtualMachines PT5M true 2023-12-12T14:10:03.975Z 2023-12-12T14:05:03.977Z 2023-12-12T14:05:03.975Z 1.953ms 4m59.998047s 1s
Microsoft.Compute/virtualMachines PT5M true 2023-12-12T14:10:03.975Z 2023-12-12T14:05:03.977Z 2023-12-12T14:05:03.975Z 1.953ms 4m59.998047s 1s
Microsoft.Compute/virtualMachines PT5M true 2023-12-12T14:10:03.975Z 2023-12-12T14:05:03.977Z 2023-12-12T14:05:03.975Z 1.953ms 4m59.998047s 1s
Microsoft.Compute/virtualMachines PT5M true 2023-12-12T14:10:03.975Z 2023-12-12T14:05:03.977Z 2023-12-12T14:05:03.975Z 1.953ms 4m59.998047s 1s
Microsoft.Compute/virtualMachines PT5M true 2023-12-12T14:10:03.975Z 2023-12-12T14:05:03.977Z 2023-12-12T14:05:03.975Z 1.953ms 4m59.998047s 1s
Microsoft.Compute/virtualMachines PT5M true 2023-12-12T14:10:03.975Z 2023-12-12T14:05:03.977Z 2023-12-12T14:05:03.975Z 1.953ms 4m59.998047s 1s
Microsoft.Compute/virtualMachines PT5M true 2023-12-12T14:10:03.975Z 2023-12-12T14:05:03.977Z 2023-12-12T14:05:03.975Z 1.953ms 4m59.998047s 1s
Microsoft.Compute/virtualMachines PT5M true 2023-12-12T14:10:03.975Z 2023-12-12T14:05:03.977Z 2023-12-12T14:05:03.975Z 1.953ms 4m59.998047s 1s
Microsoft.Compute/virtualMachines PT5M true 2023-12-12T14:10:03.975Z 2023-12-12T14:05:03.977Z 2023-12-12T14:05:03.975Z 1.953ms 4m59.998047s 1s
Microsoft.Compute/virtualMachines PT5M true 2023-12-12T14:10:03.975Z 2023-12-12T14:05:03.977Z 2023-12-12T14:05:03.975Z 1.953ms 4m59.998047s 1s
Microsoft.Compute/virtualMachines PT5M true 2023-12-12T14:10:03.975Z 2023-12-12T14:05:03.977Z 2023-12-12T14:05:03.975Z 1.953ms 4m59.998047s 1s
Microsoft.Compute/virtualMachines PT5M true 2023-12-12T14:15:03.976Z 2023-12-12T14:10:03.975Z 2023-12-12T14:10:03.976Z -922µs 5m0.000922s 1s
Microsoft.Compute/virtualMachines PT5M true 2023-12-12T14:15:03.976Z 2023-12-12T14:10:03.975Z 2023-12-12T14:10:03.976Z -922µs 5m0.000922s 1s
Microsoft.Compute/virtualMachines PT5M true 2023-12-12T14:15:03.976Z 2023-12-12T14:10:03.975Z 2023-12-12T14:10:03.976Z -922µs 5m0.000922s 1s
Microsoft.Compute/virtualMachines PT5M true 2023-12-12T14:15:03.976Z 2023-12-12T14:10:03.975Z 2023-12-12T14:10:03.976Z -922µs 5m0.000922s 1s
Microsoft.Compute/virtualMachines PT5M true 2023-12-12T14:15:03.976Z 2023-12-12T14:10:03.975Z 2023-12-12T14:10:03.976Z -922µs 5m0.000922s 1s
Microsoft.Compute/virtualMachines PT5M true 2023-12-12T14:15:03.976Z 2023-12-12T14:10:03.975Z 2023-12-12T14:10:03.976Z -922µs 5m0.000922s 1s
Microsoft.Compute/virtualMachines PT5M true 2023-12-12T14:15:03.976Z 2023-12-12T14:10:03.975Z 2023-12-12T14:10:03.976Z -922µs 5m0.000922s 1s
Microsoft.Compute/virtualMachines PT5M true 2023-12-12T14:15:03.976Z 2023-12-12T14:10:03.975Z 2023-12-12T14:10:03.976Z -922µs 5m0.000922s 1s
Microsoft.Compute/virtualMachines PT5M true 2023-12-12T14:15:03.976Z 2023-12-12T14:10:03.975Z 2023-12-12T14:10:03.976Z -922µs 5m0.000922s 1s
Microsoft.Compute/virtualMachines PT5M true 2023-12-12T14:15:03.976Z 2023-12-12T14:10:03.975Z 2023-12-12T14:10:03.976Z -922µs 5m0.000922s 1s
Microsoft.Compute/virtualMachines PT5M true 2023-12-12T14:15:03.976Z 2023-12-12T14:10:03.975Z 2023-12-12T14:10:03.976Z -922µs 5m0.000922s 1s
Microsoft.Compute/virtualMachines PT5M true 2023-12-12T14:15:03.976Z 2023-12-12T14:10:03.975Z 2023-12-12T14:10:03.976Z -922µs 5m0.000922s 1s
Microsoft.Compute/virtualMachines PT5M true 2023-12-12T14:15:03.976Z 2023-12-12T14:10:03.975Z 2023-12-12T14:10:03.976Z -922µs 5m0.000922s 1s
Microsoft.Compute/virtualMachines PT5M true 2023-12-12T14:15:03.976Z 2023-12-12T14:10:03.975Z 2023-12-12T14:10:03.976Z -922µs 5m0.000922s 1s
Microsoft.Compute/virtualMachines PT5M true 2023-12-12T14:20:03.976Z 2023-12-12T14:15:03.976Z 2023-12-12T14:15:03.976Z -412µs 5m0.000412s 1s
Microsoft.Compute/virtualMachines PT5M true 2023-12-12T14:20:03.976Z 2023-12-12T14:15:03.976Z 2023-12-12T14:15:03.976Z -412µs 5m0.000412s 1s
Microsoft.Compute/virtualMachines PT5M true 2023-12-12T14:20:03.976Z 2023-12-12T14:15:03.976Z 2023-12-12T14:15:03.976Z -412µs 5m0.000412s 1s
Microsoft.Compute/virtualMachines PT5M true 2023-12-12T14:20:03.976Z 2023-12-12T14:15:03.976Z 2023-12-12T14:15:03.976Z -412µs 5m0.000412s 1s
Microsoft.Compute/virtualMachines PT5M true 2023-12-12T14:20:03.976Z 2023-12-12T14:15:03.976Z 2023-12-12T14:15:03.976Z -412µs 5m0.000412s 1s
Microsoft.Compute/virtualMachines PT5M true 2023-12-12T14:20:03.976Z 2023-12-12T14:15:03.976Z 2023-12-12T14:15:03.976Z -412µs 5m0.000412s 1s
Microsoft.Compute/virtualMachines PT5M true 2023-12-12T14:20:03.976Z 2023-12-12T14:15:03.976Z 2023-12-12T14:15:03.976Z -412µs 5m0.000412s 1s
Microsoft.Compute/virtualMachines PT5M true 2023-12-12T14:20:03.976Z 2023-12-12T14:15:03.976Z 2023-12-12T14:15:03.976Z -412µs 5m0.000412s 1s
Microsoft.Compute/virtualMachines PT5M true 2023-12-12T14:20:03.976Z 2023-12-12T14:15:03.976Z 2023-12-12T14:15:03.976Z -412µs 5m0.000412s 1s
Microsoft.Compute/virtualMachines PT5M true 2023-12-12T14:20:03.976Z 2023-12-12T14:15:03.976Z 2023-12-12T14:15:03.976Z -412µs 5m0.000412s 1s
Microsoft.Compute/virtualMachines PT5M true 2023-12-12T14:20:03.976Z 2023-12-12T14:15:03.976Z 2023-12-12T14:15:03.976Z -412µs 5m0.000412s 1s
Microsoft.Compute/virtualMachines PT5M true 2023-12-12T14:20:03.976Z 2023-12-12T14:15:03.976Z 2023-12-12T14:15:03.976Z -412µs 5m0.000412s 1s
Microsoft.Compute/virtualMachines PT5M true 2023-12-12T14:20:03.976Z 2023-12-12T14:15:03.976Z 2023-12-12T14:15:03.976Z -412µs 5m0.000412s 1s
Microsoft.Compute/virtualMachines PT5M true 2023-12-12T14:20:03.976Z 2023-12-12T14:15:03.976Z 2023-12-12T14:15:03.976Z -412µs 5m0.000412s 1s
awesome - thanks @zmoog!
Are these gaps a behavior specific to the Metricbeat implementation, or is it inherent to collecting metrics with a time grain equal to the collection interval (for example, collecting a PT1M metric using a 60-second collection interval)?

To answer this question, I set up an OTel Collector and tried to figure out how to collect Azure metrics using the azuremonitorreceiver. After running the Azure Monitor Receiver for a while, I compared the metrics with the data on Azure Portal and Metricbeat (with the changes in this PR):

The gaps also appear on the Azure Monitor Receiver when time grain and collection interval have the same duration. The changes in this PR try to address this problem, avoiding the gaps.

Check zmoog/public-notes#67 (comment) to learn more.
72c5b69
to
fabacd6
Compare
9f266b7
to
76c879a
Compare
Hey @elastic/elastic-agent-data-plane, it seems you are now the only owner of I wanted to include this fix in the last 8.12 BC. Please let me know if we need to update the CODEOWNERS or if I need to ask you for a review. |
@zmoog it looks like the entry for /x-pack/metricbeat/module/azure was never there, only for filebeat. So, it defaults to our team. Please update the CODEOWNERS. I'm approving this PR, so you can merge it.
Yep, indeed. Previously, I'll open a PR to add an entry to CODEOWNERS for each module we own. @rdner, thanks for approving the PR in the meantime. |
During a testing session on 8.11.2, we noticed some skipped collections on one of the testing agents. Debug information revealed the metricset skipped some collections due to a 1-second difference between the reference time in the current collection and the reference time in the previous collection, making the collection period 1 second shorter (299s instead of 300s).

Collection skips may happen due to reference time rounding. For example, the timestamp 2023-12-08T10:58:32.999Z may become 2023-12-08T10:58:32.000Z due to the truncation.

As of today, this problem is happening on one agent only, but the problem is real, and we should replace the truncate(1s) with a round(1s) to eliminate fluctuations.
Not just equal, I want to check the value is the expected one.
Instead of truncating or rounding `referenceTime` to a value, I am opting to keep the `referenceTime` value intact and using a jitter when comparing it with the last collected time.

Pros:

- avoid having the thresholds we have with truncating or rounding, where a 1ms difference can flip the final result to the next or previous second.
- using a jitter gives us more flexibility (we can make it configurable)
- keeping the `referenceTime` value intact helps with troubleshooting
76c879a
to
cbf7167
Compare
### What

Change the `MetricRegistry.NeedsUpdate()` method to decide whether to collect the metrics by comparing the time since the last collection with the time grain.

If the time since the last collection < time grain duration, then the metricset skips the collection.

For example, given the following scenarios:

#### Scenario A: collect PT1M metrics every 60s

- time grain: PT1M (one minute, or 60s)
- collection interval: 60s

In this case, the time since the last collection is never shorter than the time grain, so the metricset fetches metric values on every collection.

#### Scenario B: collect PT5M metrics every 60s

- time grain: PT5M (five minutes, or 300s)
- collection interval: 60s

In this case, the time since the last collection is shorter (60s, 120s, 180s, 240s) than the time grain for four collections. The metricset fetches metric values every five collections.

#### The jitter

During our tests, we noticed the collection scheduling had some variations, causing the time since the last collection to be shorter than expected by a few milliseconds.

To compensate for these scheduling fluctuations, the function also adds a short jitter duration (1 second) to avoid false positives.

### Why

During a testing session on 8.11.2, we [noticed](#37204 (comment)) one out of four agents skipped some metrics collections.

The debug logs revealed the metricset skipped collections due to a 1-second difference between the reference time for the current and previous collections (299s instead of 300s).

![CleanShot 2023-12-08 at 20 13 19](https://github.com/elastic/beats/assets/25941/dc3d5040-c89b-47d2-a86a-124eb838ca36)

The 1-second difference may happen due to an inaccurate rounding in the reference time. For example, suppose the following two events occur:

1. Metricbeat calls `Fetch()` on the metricset a few milliseconds earlier than in the previous collection.
2. The timestamp is 2023-12-08T10:58:32.999Z.

In this case, the reference time becomes 2023-12-08T10:58:32.000Z due to the truncation.

This problem happened to one test agent. However, if it happens to one agent, it can happen to others.

### Extended Structured Logging

We also added new fields to the debug structured logs:

```shell
$ cat metricbeat.log.ndjson | grep "MetricRegistry" | head -n 1 | jq
{
  "log.level": "debug",
  "@timestamp": "2024-01-05T15:03:12.235+0100",
  "log.logger": "azure monitor client",
  "log.origin": {
    "function": "github.com/elastic/beats/v7/x-pack/metricbeat/module/azure.(*MetricRegistry).NeedsUpdate",
    "file.name": "azure/metric_registry.go",
    "file.line": 80
  },
  "message": "MetricRegistry: Metric needs an update",
  "service.name": "metricbeat",
  "needs_update": true,
  "reference_time": "2024-01-05T14:03:07.197Z",
  "last_collection_time": "2024-01-05T14:02:07.199Z",
  "time_since_last_collection_seconds": 66.035681,
  "time_grain": "PT1M",
  "time_grain_duration_seconds": 60,
  "resource_id": "/subscriptions/123/resourceGroups/crest-test-lens-migration/providers/Microsoft.Compute/virtualMachines/rajvi-test-vm",
  "namespace": "Microsoft.Compute/virtualMachines",
  "aggregation": "Total",
  "names": "Network In,Network Out,Disk Read Bytes,Disk Write Bytes,Network In Total,Network Out Total",
  "ecs.version": "1.6.0"
}
```

Here's an example using `jq`:

```shell
$ cat metricbeat.log.ndjson | grep "MetricRegistry" | jq -r '[.namespace, .aggregation, .needs_update, .reference_time, .last_collection_time//"na", .time_since_last_collection_seconds//"na", .time_grain_duration_seconds//"na", .time_grain] | @tsv' | grep Microsoft.Compute/virtualMachines
.aggregation aggregation .needs_update .reference_time .last_collection_time time_since_last_collection_seconds .time_grain_duration_seconds .time_grain
Microsoft.Compute/virtualMachines Average true 2024-01-05T14:16:07.193Z 2024-01-05T14:15:07.193Z 60.999661 60 PT1M
Microsoft.Compute/virtualMachines Total true 2024-01-05T14:16:07.193Z 2024-01-05T14:15:07.193Z 61.795341 60 PT1M
Microsoft.Compute/virtualMachines Average true 2024-01-05T14:16:07.193Z 2024-01-05T14:15:07.193Z 62.080088 60 PT1M
Microsoft.Compute/virtualMachines Total true 2024-01-05T14:16:07.193Z 2024-01-05T14:15:07.193Z 64.929579 60 PT1M
Microsoft.Compute/virtualMachines Average true 2024-01-05T14:16:07.193Z 2024-01-05T14:15:07.193Z 65.632209 60 PT1M
Microsoft.Compute/virtualMachines Total true 2024-01-05T14:16:07.193Z 2024-01-05T14:15:07.193Z 67.832918 60 PT1M
Microsoft.Compute/virtualMachines Average true 2024-01-05T14:16:07.193Z 2024-01-05T14:15:07.193Z 68.576239 60 PT1M
Microsoft.Compute/virtualMachines Total true 2024-01-05T14:16:07.193Z 2024-01-05T14:15:07.193Z 69.927988 60 PT1M
Microsoft.Compute/virtualMachines Total true 2024-01-05T14:16:07.193Z 2024-01-05T14:15:07.193Z 70.351148 60 PT1M
Microsoft.Compute/virtualMachines Average true 2024-01-05T14:16:07.193Z 2024-01-05T14:15:07.193Z 70.872058 60 PT1M
Microsoft.Compute/virtualMachines Average true 2024-01-05T14:16:07.193Z 2024-01-05T14:15:07.193Z 72.47401 60 PT1M
Microsoft.Compute/virtualMachines Total true 2024-01-05T14:16:07.193Z 2024-01-05T14:15:07.193Z 72.971242 60 PT1M
Microsoft.Compute/virtualMachines Average true 2024-01-05T14:16:07.193Z 2024-01-05T14:15:07.193Z 73.143605 60 PT1M
Microsoft.Compute/virtualMachines Total true 2024-01-05T14:16:07.193Z 2024-01-05T14:15:07.193Z 74.831489 60 PT1M
```

(cherry picked from commit 824dd04)
…ics (#37557)

* Fix the reference time rounding on Azure Metrics (#37365)

  (cherry picked from commit 824dd04)

* Remove extra changelog entry

Co-authored-by: Maurizio Branca <maurizio.branca@elastic.co>
Proposed commit message
What
Change the `MetricRegistry.NeedsUpdate()` method to decide whether to collect the metrics by comparing the time since the last collection with the time grain.

If the time since the last collection is shorter than the time grain duration, the metricset skips the collection.
For example, consider the following two scenarios:

Scenario A: collect PT1M metrics every 60s

- time grain: PT1M (one minute, or 60s)
- collection interval: 60s

In this case, the time since the last collection is never shorter than the time grain, so the metricset fetches metric values on every collection.

Scenario B: collect PT5M metrics every 60s

- time grain: PT5M (five minutes, or 300s)
- collection interval: 60s

In this case, the time since the last collection is shorter than the time grain (60s, 120s, 180s, 240s) for four consecutive collections, so the metricset fetches metric values on every fifth collection.
The jitter
During our tests, we noticed the collection scheduling had some variations, causing the time since the last collection to be shorter than expected by a few milliseconds. To compensate for these fluctuations, the function also allows a short jitter duration (1 second), so a collection that starts slightly early is not skipped by mistake.
Why
During a testing session on 8.11.2, we noticed one out of four agents skipped some metrics collections.
The debug logs revealed the metricset skipped collections due to a 1-second difference between the reference time for the current and previous collections (299s instead of 300s).
The 1-second difference can be caused by inaccurate rounding of the reference time.

For example, suppose the following two events occur:

1. Metricbeat calls `Fetch()` on the metricset a few milliseconds earlier than in the previous collection.
2. The timestamp is 2023-12-08T10:58:32.999Z.

In this case, the reference time becomes 2023-12-08T10:58:32.000Z due to the truncation.
This problem happened to one test agent. However, if it happens to one agent, it can happen to others.
Extended Structured Logging
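We also added new fields to the debug structured logs. The entries logged by `MetricRegistry.NeedsUpdate` now carry fields such as `needs_update`, `reference_time`, `last_collection_time`, `time_since_last_collection_seconds`, `time_grain`, and `time_grain_duration_seconds`.

These fields can be extracted with `jq`, for example to print one tab-separated row per log entry.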
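As an alternative to `jq`, the same fields can be read programmatically. The sketch below decodes one log line with Go's `encoding/json`; the `registryEntry` type and `parseEntry` function are hypothetical, and the sample line is a shortened version of the debug log entry from this PR.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// sampleLine is a shortened copy of one MetricRegistry debug log entry.
const sampleLine = `{"log.level":"debug","message":"MetricRegistry: Metric needs an update","needs_update":true,"reference_time":"2024-01-05T14:03:07.197Z","last_collection_time":"2024-01-05T14:02:07.199Z","time_since_last_collection_seconds":66.035681,"time_grain":"PT1M","time_grain_duration_seconds":60,"namespace":"Microsoft.Compute/virtualMachines","aggregation":"Total"}`

// registryEntry (hypothetical) maps the structured log fields of interest.
type registryEntry struct {
	Namespace          string  `json:"namespace"`
	Aggregation        string  `json:"aggregation"`
	NeedsUpdate        bool    `json:"needs_update"`
	ReferenceTime      string  `json:"reference_time"`
	LastCollectionTime string  `json:"last_collection_time"`
	SinceLastSeconds   float64 `json:"time_since_last_collection_seconds"`
	TimeGrain          string  `json:"time_grain"`
	TimeGrainSeconds   float64 `json:"time_grain_duration_seconds"`
}

// parseEntry decodes a single NDJSON log line; unknown fields are ignored.
func parseEntry(line string) (registryEntry, error) {
	var e registryEntry
	err := json.Unmarshal([]byte(line), &e)
	return e, err
}

func main() {
	e, err := parseEntry(sampleLine)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	// One tab-separated row, similar in spirit to the jq output.
	fmt.Printf("%s\t%s\t%t\t%s\t%s\t%v\t%v\t%s\n",
		e.Namespace, e.Aggregation, e.NeedsUpdate, e.ReferenceTime,
		e.LastCollectionTime, e.SinceLastSeconds, e.TimeGrainSeconds, e.TimeGrain)
}
```

Reading a whole `metricbeat.log.ndjson` file would mean calling `parseEntry` once per line and skipping lines whose message does not start with `MetricRegistry`.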
Checklist
- I have made corresponding changes to the documentation
- I have made corresponding change to the default configuration files
- `CHANGELOG.next.asciidoc` or `CHANGELOG-developer.next.asciidoc`.

Author's Checklist
How to test this PR locally
Related issues
Use cases
Screenshots
Logs