Collect monitoring for the overflow metrics for all aggregations #10330

Merged
merged 9 commits into from
Feb 23, 2023

Conversation

@lahsivjar lahsivjar commented Feb 22, 2023

Motivation/summary

Collect monitoring for the overflow metrics for all aggregations. The major changes in this PR are:

  1. Metric updates move to the publish method, so the metrics are updated periodically, once per metric interval (set to 1m).
  2. The overflow metrics now represent an estimate of the total number of unique transaction groups that overflowed, rather than the total number of overflow events.
  3. The metric txmetrics.overflowed is renamed to txmetrics.overflowed.total.
  4. Three new metrics, txmetrics.overflowed.services, txmetrics.overflowed.per_service_txn_groups, and txmetrics.overflowed.txn_groups, record the reason for the overflow.

Checklist

How to test these changes

Create overflow as per these steps and check if the metrics are published.

Related issues

Fixes #10207

mergify bot commented Feb 22, 2023

This pull request does not have a backport label. Could you fix it @lahsivjar? 🙏
To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-7.x is the label to automatically backport to the 7.x branch.
  • backport-7./d is the label to automatically backport to the 7./d branch. /d is the digit

NOTE: backport-skip has been added to this pull request.

@mergify mergify bot added the backport-skip Skip notification from the automated backport with mergify label Feb 22, 2023
apmmachine commented Feb 22, 2023

💚 Build Succeeded


Build stats

  • Start Time: 2023-02-23T11:42:40.738+0000

  • Duration: 9 min 46 sec

🤖 GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate and publish the docker images.

  • /test windows : Build & tests on Windows.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

apmmachine commented Feb 22, 2023

📚 Go benchmark report

Diff with the main branch

goos: linux
goarch: amd64
pkg: github.com/elastic/apm-server/internal/agentcfg
cpu: 12th Gen Intel(R) Core(TM) i5-12500
                                  │ build/main/bench.out │              bench.out               │
                                  │        sec/op        │    sec/op      vs base               │
FetchAndAdd/FetchFromCache-12               41.36n ± ∞ ¹    46.12n ± ∞ ¹  +11.51% (p=0.008 n=5)
geomean                                     63.13n          68.55n         +8.59%
¹ need >= 6 samples for confidence interval at level 0.95

                                  │ build/main/bench.out │              bench.out              │
                                  │         B/op         │    B/op      vs base                │
geomean                                                ³                +0.00%               ³
¹ need >= 6 samples for confidence interval at level 0.95
² all samples are equal
³ summaries must be >0 to compute geomean

                                  │ build/main/bench.out │              bench.out              │
                                  │      allocs/op       │  allocs/op   vs base                │
geomean                                                ³                +0.00%               ³
¹ need >= 6 samples for confidence interval at level 0.95
² all samples are equal
³ summaries must be >0 to compute geomean

pkg: github.com/elastic/apm-server/internal/beater/request
                                             │ build/main/bench.out │              bench.out              │
                                             │        sec/op        │    sec/op     vs base               │
ContextResetContentEncoding/empty-12                   122.8n ± ∞ ¹   136.2n ± ∞ ¹  +10.91% (p=0.008 n=5)
ContextResetContentEncoding/uncompressed-12            151.7n ± ∞ ¹   162.1n ± ∞ ¹   +6.86% (p=0.008 n=5)
geomean                                                846.6n         915.8n         +8.17%
¹ need >= 6 samples for confidence interval at level 0.95

                                             │ build/main/bench.out │               bench.out               │
                                             │         B/op         │     B/op       vs base                │
geomean                                                           ³                  +0.00%               ³
¹ need >= 6 samples for confidence interval at level 0.95
² all samples are equal
³ summaries must be >0 to compute geomean

                                             │ build/main/bench.out │              bench.out              │
                                             │      allocs/op       │  allocs/op   vs base                │
geomean                                                           ³                +0.00%               ³
¹ need >= 6 samples for confidence interval at level 0.95
² all samples are equal
³ summaries must be >0 to compute geomean

pkg: github.com/elastic/apm-server/internal/publish
             │ build/main/bench.out │          bench.out           │
             │        sec/op        │   sec/op     vs base         │
¹ need >= 6 samples for confidence interval at level 0.95

             │ build/main/bench.out │           bench.out            │
             │         B/op         │     B/op       vs base         │
¹ need >= 6 samples for confidence interval at level 0.95

             │ build/main/bench.out │           bench.out           │
             │      allocs/op       │  allocs/op    vs base         │
¹ need >= 6 samples for confidence interval at level 0.95

pkg: github.com/elastic/apm-server/x-pack/apm-server/aggregation/spanmetrics
                 │ build/main/bench.out │           bench.out           │
                 │        sec/op        │    sec/op     vs base         │
¹ need >= 6 samples for confidence interval at level 0.95

                 │ build/main/bench.out │            bench.out             │
                 │         B/op         │     B/op       vs base           │
¹ need >= 6 samples for confidence interval at level 0.95
² all samples are equal

                 │ build/main/bench.out │           bench.out            │
                 │      allocs/op       │  allocs/op   vs base           │
¹ need >= 6 samples for confidence interval at level 0.95
² all samples are equal

pkg: github.com/elastic/apm-server/x-pack/apm-server/aggregation/txmetrics
                        │ build/main/bench.out │           bench.out           │
                        │        sec/op        │    sec/op     vs base         │
¹ need >= 6 samples for confidence interval at level 0.95

                        │ build/main/bench.out │           bench.out            │
                        │         B/op         │    B/op      vs base           │
¹ need >= 6 samples for confidence interval at level 0.95
² all samples are equal

                        │ build/main/bench.out │           bench.out            │
                        │      allocs/op       │  allocs/op   vs base           │
¹ need >= 6 samples for confidence interval at level 0.95
² all samples are equal

pkg: github.com/elastic/apm-server/x-pack/apm-server/sampling
               │ build/main/bench.out │             bench.out              │
               │        sec/op        │    sec/op     vs base              │
geomean                  663.9n         618.5n        -6.84%
¹ need >= 6 samples for confidence interval at level 0.95

               │ build/main/bench.out │               bench.out               │
               │         B/op         │     B/op       vs base                │
geomean                             ³                  -0.51%               ³
¹ need >= 6 samples for confidence interval at level 0.95
² all samples are equal
³ summaries must be >0 to compute geomean

               │ build/main/bench.out │              bench.out              │
               │      allocs/op       │  allocs/op   vs base                │
geomean                             ³                +0.00%               ³
¹ need >= 6 samples for confidence interval at level 0.95
² all samples are equal
³ summaries must be >0 to compute geomean

pkg: github.com/elastic/apm-server/x-pack/apm-server/sampling/eventstorage
                                            │ build/main/bench.out │               bench.out               │
                                            │        sec/op        │    sec/op      vs base                │
WriteTransaction/json_codec-12                        4.548µ ± ∞ ¹   12.872µ ± ∞ ¹  +183.03% (p=0.008 n=5)
WriteTransaction/json_codec_big_tx-12                 5.716µ ± ∞ ¹   14.025µ ± ∞ ¹  +145.36% (p=0.008 n=5)
ReadEvents/json_codec/0_events-12                     330.3n ± ∞ ¹    347.6n ± ∞ ¹    +5.24% (p=0.008 n=5)
ReadEvents/json_codec_big_tx/0_events-12              328.0n ± ∞ ¹    347.8n ± ∞ ¹    +6.04% (p=0.016 n=5)
ReadEvents/json_codec_big_tx/399_events-12            3.836m ± ∞ ¹    3.764m ± ∞ ¹    -1.88% (p=0.008 n=5)
ReadEvents/nop_codec/0_events-12                      308.9n ± ∞ ¹    346.5n ± ∞ ¹   +12.17% (p=0.032 n=5)
ReadEvents/nop_codec_big_tx/0_events-12               323.2n ± ∞ ¹    340.8n ± ∞ ¹    +5.45% (p=0.032 n=5)
ReadEvents/nop_codec_big_tx/100_events-12             132.2µ ± ∞ ¹    156.4µ ± ∞ ¹   +18.24% (p=0.008 n=5)
IsTraceSampled/sampled-12                             72.30n ± ∞ ¹    76.81n ± ∞ ¹    +6.24% (p=0.008 n=5)
IsTraceSampled/unknown-12                             383.0n ± ∞ ¹    423.7n ± ∞ ¹   +10.63% (p=0.008 n=5)
geomean                                               29.07µ          31.94µ          +9.88%
¹ need >= 6 samples for confidence interval at level 0.95

                                            │ build/main/bench.out │               bench.out                │
                                            │         B/op         │      B/op       vs base                │
ReadEvents/json_codec_big_tx/10_events-12            100.9Ki ± ∞ ¹    101.0Ki ± ∞ ¹  +0.05% (p=0.032 n=5)
ReadEvents/nop_codec/399_events-12                   876.7Ki ± ∞ ¹    879.3Ki ± ∞ ¹  +0.29% (p=0.032 n=5)
ReadEvents/nop_codec/1000_events-12                  2.084Mi ± ∞ ¹    2.078Mi ± ∞ ¹  -0.32% (p=0.024 n=5)
ReadEvents/nop_codec_big_tx/1_events-12              3.195Ki ± ∞ ¹    3.191Ki ± ∞ ¹  -0.12% (p=0.008 n=5)
ReadEvents/nop_codec_big_tx/1000_events-12           2.075Mi ± ∞ ¹    2.089Mi ± ∞ ¹  +0.65% (p=0.048 n=5)
geomean                                              31.36Ki          31.37Ki        +0.01%
¹ need >= 6 samples for confidence interval at level 0.95
² all samples are equal

                                            │ build/main/bench.out │              bench.out               │
                                            │      allocs/op       │  allocs/op    vs base                │
geomean                                                144.7          144.7        +0.00%
¹ need >= 6 samples for confidence interval at level 0.95
² all samples are equal

report generated with https://pkg.go.dev/golang.org/x/perf/cmd/benchstat

Comment on lines 230 to 239
overflowCount := int64(svcEntry.otherCardinalityEstimator.Estimate())
if isMetricsPeriod {
	activeGroups = int64(current.entries)
	if svc == overflowBucketName {
		servicesOverflow += overflowCount
	} else if svcEntry.entries >= a.config.MaxTransactionGroupsPerService {
		perSvcTxnGroupsOverflow += overflowCount
	} else {
		txnGroupsOverflow += overflowCount
	}
}
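
The excerpt above attributes each service's overflow estimate to one of three reasons. A minimal, self-contained sketch of that branching follows. The serviceEntry type, the classifyOverflow function, the "_other" bucket name, and the sample numbers are all assumptions made for illustration; they are not taken from the PR.

```go
package main

import "fmt"

// overflowBucketName is an assumed value for the catch-all service bucket;
// the excerpt references the constant but does not show its value.
const overflowBucketName = "_other"

// serviceEntry is a hypothetical, simplified per-service entry.
type serviceEntry struct {
	entries       int   // active transaction groups for this service
	overflowCount int64 // estimated unique groups that overflowed
}

// overflowCounts holds one counter per overflow reason.
type overflowCounts struct {
	services            int64 // max services limit reached
	perServiceTxnGroups int64 // per-service transaction group limit reached
	txnGroups           int64 // overall transaction group limit reached
}

// classifyOverflow mirrors the if/else chain in the excerpt: the overflow
// service bucket counts towards the services limit, a service that is at
// its own limit counts towards the per-service limit, and anything else is
// attributed to the overall limit.
func classifyOverflow(svcs map[string]serviceEntry, maxGroupsPerService int) overflowCounts {
	var c overflowCounts
	for name, e := range svcs {
		switch {
		case name == overflowBucketName:
			c.services += e.overflowCount
		case e.entries >= maxGroupsPerService:
			c.perServiceTxnGroups += e.overflowCount
		default:
			c.txnGroups += e.overflowCount
		}
	}
	return c
}

func main() {
	// Made-up sample data, not values from the PR's tests.
	c := classifyOverflow(map[string]serviceEntry{
		overflowBucketName: {entries: 1, overflowCount: 5},
		"checkout":         {entries: 10, overflowCount: 3},
		"cart":             {entries: 4, overflowCount: 2},
	}, 10)
	fmt.Printf("%+v\n", c)
}
```

Each service contributes to exactly one reason, so the three counters never double count a service's overflow estimate.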

[For reviewers] Collecting the metrics in the publish method was driven by the following factors:

  1. We want the total number of unique transaction groups that overflowed due to the max services limit, the per-service max transaction limit, and the max transaction limit. This requires the estimate from the HyperLogLog sketch. The value is still an estimate, because of hash collisions and the use of HyperLogLog, but the metrics will be better aligned with the published metricset.
  2. A HyperLogLog++ sketch with sparse representation requires a write lock even to estimate cardinality.
  3. IIUC, CollectMonitoring is part of a pull-based monitoring architecture, and taking a full lock on the whole struct there didn't sound like a good idea.
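
The trade-off described in the list above can be sketched as follows: the estimate is taken inside the periodic publish step, which already holds the lock, and the pull-based collector only reads an atomic snapshot. This is an illustrative sketch, not the PR's code. The estimator here is an exact set standing in for a HyperLogLog++ sketch, and every type and method name is hypothetical. It assumes Go 1.19 or later for atomic.Int64.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// estimator is a hypothetical stand-in for a cardinality sketch. It counts
// exactly with a set; a real HyperLogLog++ sketch would return an estimate
// and may mutate internal state while estimating.
type estimator struct{ seen map[string]struct{} }

func (e *estimator) Insert(key string) {
	if e.seen == nil {
		e.seen = make(map[string]struct{})
	}
	e.seen[key] = struct{}{}
}

func (e *estimator) Estimate() uint64 { return uint64(len(e.seen)) }

// aggregator is a hypothetical, heavily simplified aggregator.
type aggregator struct {
	mu       sync.Mutex
	overflow estimator

	// overflowedTotal is the snapshot exposed to monitoring.
	overflowedTotal atomic.Int64
}

// record adds an overflowed transaction group key under the lock.
func (a *aggregator) record(key string) {
	a.mu.Lock()
	defer a.mu.Unlock()
	a.overflow.Insert(key)
}

// publish stands for the periodic publish step. It already needs the lock,
// so the estimate is taken here and stored as a snapshot.
func (a *aggregator) publish() {
	a.mu.Lock()
	defer a.mu.Unlock()
	a.overflowedTotal.Store(int64(a.overflow.Estimate()))
}

// collectMonitoring stands for the pull-based path. It never takes the
// lock; it only reads the last published snapshot.
func (a *aggregator) collectMonitoring() int64 {
	return a.overflowedTotal.Load()
}

func main() {
	a := &aggregator{}
	a.record("GET /a")
	a.record("GET /b")
	a.record("GET /a") // duplicate key, counted once
	fmt.Println("before publish:", a.collectMonitoring())
	a.publish()
	fmt.Println("after publish:", a.collectMonitoring())
}
```

The cost of this design is staleness: the monitoring value can lag by up to one publish interval, which matches the PR description's note that the metrics are updated once per metric interval.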

@carsonip carsonip added backport-8.7 Automated backport with mergify and removed backport-skip Skip notification from the automated backport with mergify labels Feb 22, 2023
@lahsivjar lahsivjar requested a review from a team February 22, 2023 05:51
@lahsivjar lahsivjar marked this pull request as ready for review February 23, 2023 07:25
uniqueTxnCount: 60,
uniqueServices: 20,
expectedActiveGroups: 40,
expectedPerSvcTxnLimitOverflow: 2,

why is this 0 now? (same for the above example)

@lahsivjar lahsivjar Feb 23, 2023


I changed the meaning of the variables. Previously, expectedPerSvcTxnLimitOverflow represented the total number of events recorded in the overflow bucket of each service. That included overflow due to the max transaction limit as well as overflow due to the per-service transaction limit (if the max transaction limit is breached, we still record the value in the overflow bucket of the specific service). In this PR, the variables represent the total overflow due to a specific limit breach. The old and new values translate as follows:

expectedPerSvcTxnLimitOverflow * min(uniqueServices, maxServices) = expectedOverflowReasonPerSvcTxnGrps + expectedOverflowReasonTxnGrps
expectedOtherSvcTxnLimitOverflow = expectedOverflowReasonSvc
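
As a quick arithmetic check of the first relation, the sketch below computes the left-hand side, which is the value the two new per-reason expectations should add up to. The function name and the input numbers are made up for illustration; in particular, the maxServices values are assumptions, not taken from the PR's test table.

```go
package main

import "fmt"

// minInt returns the smaller of two ints (written out to avoid depending
// on the min builtin from newer Go versions).
func minInt(a, b int) int {
	if a < b {
		return a
	}
	return b
}

// expectedNewOverflowSum returns the value that
// expectedOverflowReasonPerSvcTxnGrps + expectedOverflowReasonTxnGrps
// should equal, given the old per-service expectation. The old value was
// per service, so it is multiplied by the number of services that actually
// get their own bucket.
func expectedNewOverflowSum(oldPerSvcOverflow, uniqueServices, maxServices int) int {
	return oldPerSvcOverflow * minInt(uniqueServices, maxServices)
}

func main() {
	// Hypothetical inputs: an old per-service overflow of 2 across 20
	// unique services, first below and then above the services limit.
	fmt.Println(expectedNewOverflowSum(2, 20, 30))
	fmt.Println(expectedNewOverflowSum(2, 20, 10))
}
```

How the sum splits between the two reasons depends on which limit each service hit first, which the relation alone does not determine.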

@lahsivjar lahsivjar enabled auto-merge (squash) February 23, 2023 11:44
@lahsivjar lahsivjar merged commit 55e78ab into elastic:main Feb 23, 2023
mergify bot pushed a commit that referenced this pull request Feb 23, 2023
lahsivjar added a commit that referenced this pull request Feb 23, 2023
) (#10342)

(cherry picked from commit 55e78ab)

Co-authored-by: Vishal Raj <vishal.raj@elastic.co>
@lahsivjar lahsivjar deleted the monitoring_agg_10207 branch February 27, 2023 02:45
Labels
backport-8.7 Automated backport with mergify test-plan
Development

Successfully merging this pull request may close these issues.

Add monitoring data for metrics aggregation
4 participants