-
Notifications
You must be signed in to change notification settings - Fork 4.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[8.11](backport #37365) Fix the reference time rounding on Azure Metrics #37568
Merged
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
### What Change the `MetricRegistry.NeedsUpdate()` method to decide whether to collect the metrics by comparing the collection interval with the time grain. If the time since the last collection < time grain duration, then the metrics skip the collection. For example, given the following scenario: #### Scenario A: collect PT1M metrics every 60s - time grain: PT1M (one minute, or 60s) - collection interval: 60s In this case, the time since the last collection is never shorter than the time grain, so the metricset fetch metric values on every collection. #### Scenario B: collect PT15M metrics every 60s - time grain: PT5M (five minutes, or 300s) - collection interval: 60s In this case, the time since the last collection is shorter (60s, 120s, 180s, 240s) than the time grain for four collections. The metricset fetch metric values every five collections. #### The jitter During our tests, we noticed the collection scheduling had some variations, causing the time since the last collection to be shorter than expected by a few milliseconds. To compensate for these scheduling fluctuations, the function also adds a short jitter duration (1 second) to avoid false positives due to small fluctuations in collection scheduling. ### Why During a testing session on 8.11.2, we [noticed](#37204 (comment)) one out of four agents skipped some metrics collections. The debug logs revealed the metricset skipped collections due to a 1-second difference between the reference time for the current and previous collections (299s instead of 300s). ![CleanShot 2023-12-08 at 20 13 19](https://github.com/elastic/beats/assets/25941/dc3d5040-c89b-47d2-a86a-124eb838ca36) The 1-second difference may happen due to an inaccurate rounding in the reference time. For example, suppose the following two events occur: 1. Metricbeat calls `Fetch()` on the metricset a few milliseconds earlier than in the previous collection. 2. The timestamp is 2023-12-08T10:58:32.999Z. In this case, the reference time becomes 2023-12-08T10:58:32.000Z due to the truncation. This problem happened to one test agent. However, if it happens to one agent, it can happen to others. ### Extended Structured Logging We also added new fields to the debug structured logs: ```shell $ cat metricbeat.log.ndjson | grep "MetricRegistry" | head -n 1 | jq { "log.level": "debug", "@timestamp": "2024-01-05T15:03:12.235+0100", "log.logger": "azure monitor client", "log.origin": { "function": "github.com/elastic/beats/v7/x-pack/metricbeat/module/azure.(*MetricRegistry).NeedsUpdate", "file.name": "azure/metric_registry.go", "file.line": 80 }, "message": "MetricRegistry: Metric needs an update", "service.name": "metricbeat", "needs_update": true, "reference_time": "2024-01-05T14:03:07.197Z", "last_collection_time": "2024-01-05T14:02:07.199Z", "time_since_last_collection_seconds": 66.035681, "time_grain": "PT1M", "time_grain_duration_seconds": 60, "resource_id": "/subscriptions/123/resourceGroups/crest-test-lens-migration/providers/Microsoft.Compute/virtualMachines/rajvi-test-vm", "namespace": "Microsoft.Compute/virtualMachines", "aggregation": "Total", "names": "Network In,Network Out,Disk Read Bytes,Disk Write Bytes,Network In Total,Network Out Total", "ecs.version": "1.6.0" } ``` Here's an example using `jq`: ```shell $ cat metricbeat.log.ndjson | grep "MetricRegistry" | jq -r '[.namespace, .aggregation, .needs_update, .reference_time, .last_collection_time//"na", .time_since_last_collection_seconds//"na", .time_grain_duration_seconds//"na", .time_grain] | @TSV' | grep Microsoft.Compute/virtualMachines .aggregation aggregation .needs_update .reference_time .last_collection_time time_since_last_collection_seconds .time_grain_duration_seconds .time_grain Microsoft.Compute/virtualMachines Average true 2024-01-05T14:16:07.193Z 2024-01-05T14:15:07.193Z 60.999661 60 PT1M Microsoft.Compute/virtualMachines Total true 2024-01-05T14:16:07.193Z 2024-01-05T14:15:07.193Z 61.795341 60 PT1M Microsoft.Compute/virtualMachines Average true 2024-01-05T14:16:07.193Z 2024-01-05T14:15:07.193Z 62.080088 60 PT1M Microsoft.Compute/virtualMachines Total true 2024-01-05T14:16:07.193Z 2024-01-05T14:15:07.193Z 64.929579 60 PT1M Microsoft.Compute/virtualMachines Average true 2024-01-05T14:16:07.193Z 2024-01-05T14:15:07.193Z 65.632209 60 PT1M Microsoft.Compute/virtualMachines Total true 2024-01-05T14:16:07.193Z 2024-01-05T14:15:07.193Z 67.832918 60 PT1M Microsoft.Compute/virtualMachines Average true 2024-01-05T14:16:07.193Z 2024-01-05T14:15:07.193Z 68.576239 60 PT1M Microsoft.Compute/virtualMachines Total true 2024-01-05T14:16:07.193Z 2024-01-05T14:15:07.193Z 69.927988 60 PT1M Microsoft.Compute/virtualMachines Total true 2024-01-05T14:16:07.193Z 2024-01-05T14:15:07.193Z 70.351148 60 PT1M Microsoft.Compute/virtualMachines Average true 2024-01-05T14:16:07.193Z 2024-01-05T14:15:07.193Z 70.872058 60 PT1M Microsoft.Compute/virtualMachines Average true 2024-01-05T14:16:07.193Z 2024-01-05T14:15:07.193Z 72.47401 60 PT1M Microsoft.Compute/virtualMachines Total true 2024-01-05T14:16:07.193Z 2024-01-05T14:15:07.193Z 72.971242 60 PT1M Microsoft.Compute/virtualMachines Average true 2024-01-05T14:16:07.193Z 2024-01-05T14:15:07.193Z 73.143605 60 PT1M Microsoft.Compute/virtualMachines Total true 2024-01-05T14:16:07.193Z 2024-01-05T14:15:07.193Z 74.831489 60 PT1M ``` (cherry picked from commit 824dd04)
botelastic
bot
added
the
needs_team
Indicates that the issue/PR needs a Team:* label
label
Jan 5, 2024
This pull request doesn't have a |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
This is an automatic backport of pull request #37365 done by Mergify.
Mergify commands and options
More conditions and actions can be found in the documentation.
You can also trigger Mergify actions by commenting on this pull request:
@Mergifyio refresh
will re-evaluate the rules@Mergifyio rebase
will rebase this PR on its base branch@Mergifyio update
will merge the base branch into this PR@Mergifyio backport <destination>
will backport this PR on<destination>
branchAdditionally, on Mergify dashboard you can:
Finally, you can contact us on https://mergify.com