Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[8.11](backport #37365) Fix the reference time rounding on Azure Metrics #37568

Merged
merged 1 commit into from
Jan 6, 2024

Conversation

mergify[bot]
Copy link
Contributor

@mergify mergify bot commented Jan 5, 2024

This is an automatic backport of pull request #37365 done by Mergify.


Mergify commands and options

More conditions and actions can be found in the documentation.

You can also trigger Mergify actions by commenting on this pull request:

  • @Mergifyio refresh will re-evaluate the rules
  • @Mergifyio rebase will rebase this PR on its base branch
  • @Mergifyio update will merge the base branch into this PR
  • @Mergifyio backport <destination> will backport this PR on <destination> branch

Additionally, on Mergify dashboard you can:

  • look at your merge queues
  • generate the Mergify configuration with the config editor.

Finally, you can contact us on https://mergify.com

### What

Change the `MetricRegistry.NeedsUpdate()` method to decide whether to collect the metrics by comparing the collection interval with the time grain.

If the time since the last collection < time grain duration, then the metrics skip the collection.

For example, given the following scenario:

#### Scenario A: collect PT1M metrics every 60s

- time grain: PT1M (one minute, or 60s)
- collection interval: 60s

In this case, the time since the last collection is never shorter than the time grain, so the metricset fetch metric values on every collection.

#### Scenario B: collect PT15M metrics every 60s

- time grain: PT5M (five minutes, or 300s)
- collection interval: 60s

In this case, the time since the last collection is shorter (60s, 120s, 180s, 240s) than the time grain for four collections. The metricset fetch metric values every five collections.

#### The jitter

During our tests, we noticed the collection scheduling had some variations, causing the time since the last collection to be shorter than expected by a few milliseconds. To compensate for these scheduling fluctuations, the function also adds a short jitter duration (1 second) to avoid false positives due to small fluctuations in collection scheduling.

### Why

During a testing session on 8.11.2, we [noticed](#37204 (comment)) one out of four agents skipped some metrics collections.

The debug logs revealed the metricset skipped collections due to a 1-second difference between the reference time for the current and previous collections (299s instead of 300s).

![CleanShot 2023-12-08 at 20 13 19](https://github.com/elastic/beats/assets/25941/dc3d5040-c89b-47d2-a86a-124eb838ca36)

The 1-second difference may happen due to an inaccurate rounding in the reference time.

For example, suppose the following two events occur:

1. Metricbeat calls `Fetch()` on the metricset a few milliseconds earlier than in the previous collection.
2. The timestamp is 2023-12-08T10:58:32.999Z.

In this case, the reference time becomes 2023-12-08T10:58:32.000Z due to the truncation.

This problem happened to one test agent. However, if it happens to one agent, it can happen to others.

### Extended Structured Logging

We also added new fields to the debug structured logs:

```shell
$ cat metricbeat.log.ndjson | grep "MetricRegistry" | head -n 1 | jq
{
  "log.level": "debug",
  "@timestamp": "2024-01-05T15:03:12.235+0100",
  "log.logger": "azure monitor client",
  "log.origin": {
    "function": "github.com/elastic/beats/v7/x-pack/metricbeat/module/azure.(*MetricRegistry).NeedsUpdate",
    "file.name": "azure/metric_registry.go",
    "file.line": 80
  },
  "message": "MetricRegistry: Metric needs an update",
  "service.name": "metricbeat",
  "needs_update": true,
  "reference_time": "2024-01-05T14:03:07.197Z",
  "last_collection_time": "2024-01-05T14:02:07.199Z",
  "time_since_last_collection_seconds": 66.035681,
  "time_grain": "PT1M",
  "time_grain_duration_seconds": 60,
  "resource_id": "/subscriptions/123/resourceGroups/crest-test-lens-migration/providers/Microsoft.Compute/virtualMachines/rajvi-test-vm",
  "namespace": "Microsoft.Compute/virtualMachines",
  "aggregation": "Total",
  "names": "Network In,Network Out,Disk Read Bytes,Disk Write Bytes,Network In Total,Network Out Total",
  "ecs.version": "1.6.0"
}
```

Here's an example using `jq`:

```shell
$ cat metricbeat.log.ndjson | grep "MetricRegistry" | jq -r  '[.namespace, .aggregation, .needs_update, .reference_time, .last_collection_time//"na", .time_since_last_collection_seconds//"na", .time_grain_duration_seconds//"na", .time_grain] | @TSV' | grep Microsoft.Compute/virtualMachines

.aggregation                            aggregation   .needs_update   .reference_time                 .last_collection_time           time_since_last_collection_seconds .time_grain_duration_seconds .time_grain
Microsoft.Compute/virtualMachines       Average       true            2024-01-05T14:16:07.193Z        2024-01-05T14:15:07.193Z        60.999661                          60                           PT1M
Microsoft.Compute/virtualMachines       Total         true            2024-01-05T14:16:07.193Z        2024-01-05T14:15:07.193Z        61.795341                          60                           PT1M
Microsoft.Compute/virtualMachines       Average       true            2024-01-05T14:16:07.193Z        2024-01-05T14:15:07.193Z        62.080088                          60                           PT1M
Microsoft.Compute/virtualMachines       Total         true            2024-01-05T14:16:07.193Z        2024-01-05T14:15:07.193Z        64.929579                          60                           PT1M
Microsoft.Compute/virtualMachines       Average       true            2024-01-05T14:16:07.193Z        2024-01-05T14:15:07.193Z        65.632209                          60                           PT1M
Microsoft.Compute/virtualMachines       Total         true            2024-01-05T14:16:07.193Z        2024-01-05T14:15:07.193Z        67.832918                          60                           PT1M
Microsoft.Compute/virtualMachines       Average       true            2024-01-05T14:16:07.193Z        2024-01-05T14:15:07.193Z        68.576239                          60                           PT1M
Microsoft.Compute/virtualMachines       Total         true            2024-01-05T14:16:07.193Z        2024-01-05T14:15:07.193Z        69.927988                          60                           PT1M
Microsoft.Compute/virtualMachines       Total         true            2024-01-05T14:16:07.193Z        2024-01-05T14:15:07.193Z        70.351148                          60                           PT1M
Microsoft.Compute/virtualMachines       Average       true            2024-01-05T14:16:07.193Z        2024-01-05T14:15:07.193Z        70.872058                          60                           PT1M
Microsoft.Compute/virtualMachines       Average       true            2024-01-05T14:16:07.193Z        2024-01-05T14:15:07.193Z        72.47401                           60                           PT1M
Microsoft.Compute/virtualMachines       Total         true            2024-01-05T14:16:07.193Z        2024-01-05T14:15:07.193Z        72.971242                          60                           PT1M
Microsoft.Compute/virtualMachines       Average       true            2024-01-05T14:16:07.193Z        2024-01-05T14:15:07.193Z        73.143605                          60                           PT1M
Microsoft.Compute/virtualMachines       Total         true            2024-01-05T14:16:07.193Z        2024-01-05T14:15:07.193Z        74.831489                          60                           PT1M
```

(cherry picked from commit 824dd04)
@mergify mergify bot requested a review from a team as a code owner January 5, 2024 23:42
@mergify mergify bot added the backport label Jan 5, 2024
@mergify mergify bot assigned zmoog Jan 5, 2024
@botelastic botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label Jan 5, 2024
@botelastic
Copy link

botelastic bot commented Jan 5, 2024

This pull request doesn't have a Team:<team> label.

@elasticmachine
Copy link
Collaborator

💚 Build Succeeded

the below badges are clickable and redirect to their specific view in the CI or DOCS
Pipeline View Test View Changes Artifacts preview preview

Expand to view the summary

Build stats

  • Start Time: 2024-01-05T23:42:34.344+0000

  • Duration: 53 min 56 sec

Test stats 🧪

Test Results
Failed 0
Passed 1482
Skipped 96
Total 1578

💚 Flaky test report

Tests succeeded.

🤖 GitHub comments

Expand to view the GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate the packages and run the E2E tests.

  • /beats-tester : Run the installation tests with beats-tester.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

@zmoog zmoog merged commit 6133710 into 8.11 Jan 6, 2024
22 checks passed
@zmoog zmoog deleted the mergify/bp/8.11/pr-37365 branch January 6, 2024 00:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport needs_team Indicates that the issue/PR needs a Team:* label
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants