Collect monitoring for the overflow metrics for all aggregations #10330

lahsivjar · 2023-02-22T03:44:18Z

Motivation/summary

Collect monitoring for the overflow metrics for all aggregations. Some major changes in the PR are:

The PR moves the metric updates to the publish method, which means the metrics are updated periodically based on the metric interval (set to 1m).
The overflow metrics collected represent an estimate of the total number of unique transaction groups that overflowed rather than the total number of overflows.
Changes the metric name from txmetrics.overflowed to txmetrics.overflowed.total.
Introduces 3 more metrics: txmetrics.overflowed.services, txmetrics.overflowed.per_service_txn_groups and txmetrics.overflowed.txn_groups to represent the overflow reason.

Checklist

Update CHANGELOG.asciidoc
~~- [ ] Update package changelog.yml (only if changes to apmpackage have been made)~~
~~- [ ] Documentation has been updated~~

How to test these changes

Create overflow as per these steps and check if the metrics are published.

Related issues

Fixes #10207

mergify · 2023-02-22T03:44:52Z

This pull request does not have a backport label. Could you fix it @lahsivjar? 🙏
To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

backport-7.x is the label to automatically backport to the 7.x branch.
backport-7./d is the label to automatically backport to the 7./d branch. /d is the digit

NOTE: backport-skip has been added to this pull request.

apmmachine · 2023-02-22T03:46:04Z

💚 Build Succeeded

Build stats

Start Time: 2023-02-23T11:42:40.738+0000
Duration: 9 min 46 sec

🤖 GitHub comments

To re-run your PR in the CI, just comment with:

/test : Re-trigger the build.
/package : Generate and publish the docker images.
/test windows : Build & tests on Windows.
run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

apmmachine · 2023-02-22T03:56:11Z

📚 Go benchmark report

Diff with the main branch

goos: linux
goarch: amd64
pkg: github.com/elastic/apm-server/internal/agentcfg
cpu: 12th Gen Intel(R) Core(TM) i5-12500
                                  │ build/main/bench.out │              bench.out               │
                                  │        sec/op        │    sec/op      vs base               │
FetchAndAdd/FetchFromCache-12               41.36n ± ∞ ¹    46.12n ± ∞ ¹  +11.51% (p=0.008 n=5)
geomean                                     63.13n          68.55n         +8.59%
¹ need >= 6 samples for confidence interval at level 0.95

                                  │ build/main/bench.out │              bench.out              │
                                  │         B/op         │    B/op      vs base                │
geomean                                                ³                +0.00%               ³
¹ need >= 6 samples for confidence interval at level 0.95
² all samples are equal
³ summaries must be >0 to compute geomean

                                  │ build/main/bench.out │              bench.out              │
                                  │      allocs/op       │  allocs/op   vs base                │
geomean                                                ³                +0.00%               ³
¹ need >= 6 samples for confidence interval at level 0.95
² all samples are equal
³ summaries must be >0 to compute geomean

pkg: github.com/elastic/apm-server/internal/beater/request
                                             │ build/main/bench.out │              bench.out              │
                                             │        sec/op        │    sec/op     vs base               │
ContextResetContentEncoding/empty-12                   122.8n ± ∞ ¹   136.2n ± ∞ ¹  +10.91% (p=0.008 n=5)
ContextResetContentEncoding/uncompressed-12            151.7n ± ∞ ¹   162.1n ± ∞ ¹   +6.86% (p=0.008 n=5)
geomean                                                846.6n         915.8n         +8.17%
¹ need >= 6 samples for confidence interval at level 0.95

                                             │ build/main/bench.out │               bench.out               │
                                             │         B/op         │     B/op       vs base                │
geomean                                                           ³                  +0.00%               ³
¹ need >= 6 samples for confidence interval at level 0.95
² all samples are equal
³ summaries must be >0 to compute geomean

                                             │ build/main/bench.out │              bench.out              │
                                             │      allocs/op       │  allocs/op   vs base                │
geomean                                                           ³                +0.00%               ³
¹ need >= 6 samples for confidence interval at level 0.95
² all samples are equal
³ summaries must be >0 to compute geomean

pkg: github.com/elastic/apm-server/internal/publish
             │ build/main/bench.out │          bench.out           │
             │        sec/op        │   sec/op     vs base         │
¹ need >= 6 samples for confidence interval at level 0.95

             │ build/main/bench.out │           bench.out            │
             │         B/op         │     B/op       vs base         │
¹ need >= 6 samples for confidence interval at level 0.95

             │ build/main/bench.out │           bench.out           │
             │      allocs/op       │  allocs/op    vs base         │
¹ need >= 6 samples for confidence interval at level 0.95

pkg: github.com/elastic/apm-server/x-pack/apm-server/aggregation/spanmetrics
                 │ build/main/bench.out │           bench.out           │
                 │        sec/op        │    sec/op     vs base         │
¹ need >= 6 samples for confidence interval at level 0.95

                 │ build/main/bench.out │            bench.out             │
                 │         B/op         │     B/op       vs base           │
¹ need >= 6 samples for confidence interval at level 0.95
² all samples are equal

                 │ build/main/bench.out │           bench.out            │
                 │      allocs/op       │  allocs/op   vs base           │
¹ need >= 6 samples for confidence interval at level 0.95
² all samples are equal

pkg: github.com/elastic/apm-server/x-pack/apm-server/aggregation/txmetrics
                        │ build/main/bench.out │           bench.out           │
                        │        sec/op        │    sec/op     vs base         │
¹ need >= 6 samples for confidence interval at level 0.95

                        │ build/main/bench.out │           bench.out            │
                        │         B/op         │    B/op      vs base           │
¹ need >= 6 samples for confidence interval at level 0.95
² all samples are equal

                        │ build/main/bench.out │           bench.out            │
                        │      allocs/op       │  allocs/op   vs base           │
¹ need >= 6 samples for confidence interval at level 0.95
² all samples are equal

pkg: github.com/elastic/apm-server/x-pack/apm-server/sampling
               │ build/main/bench.out │             bench.out              │
               │        sec/op        │    sec/op     vs base              │
geomean                  663.9n         618.5n        -6.84%
¹ need >= 6 samples for confidence interval at level 0.95

               │ build/main/bench.out │               bench.out               │
               │         B/op         │     B/op       vs base                │
geomean                             ³                  -0.51%               ³
¹ need >= 6 samples for confidence interval at level 0.95
² all samples are equal
³ summaries must be >0 to compute geomean

               │ build/main/bench.out │              bench.out              │
               │      allocs/op       │  allocs/op   vs base                │
geomean                             ³                +0.00%               ³
¹ need >= 6 samples for confidence interval at level 0.95
² all samples are equal
³ summaries must be >0 to compute geomean

pkg: github.com/elastic/apm-server/x-pack/apm-server/sampling/eventstorage
                                            │ build/main/bench.out │               bench.out               │
                                            │        sec/op        │    sec/op      vs base                │
WriteTransaction/json_codec-12                        4.548µ ± ∞ ¹   12.872µ ± ∞ ¹  +183.03% (p=0.008 n=5)
WriteTransaction/json_codec_big_tx-12                 5.716µ ± ∞ ¹   14.025µ ± ∞ ¹  +145.36% (p=0.008 n=5)
ReadEvents/json_codec/0_events-12                     330.3n ± ∞ ¹    347.6n ± ∞ ¹    +5.24% (p=0.008 n=5)
ReadEvents/json_codec_big_tx/0_events-12              328.0n ± ∞ ¹    347.8n ± ∞ ¹    +6.04% (p=0.016 n=5)
ReadEvents/json_codec_big_tx/399_events-12            3.836m ± ∞ ¹    3.764m ± ∞ ¹    -1.88% (p=0.008 n=5)
ReadEvents/nop_codec/0_events-12                      308.9n ± ∞ ¹    346.5n ± ∞ ¹   +12.17% (p=0.032 n=5)
ReadEvents/nop_codec_big_tx/0_events-12               323.2n ± ∞ ¹    340.8n ± ∞ ¹    +5.45% (p=0.032 n=5)
ReadEvents/nop_codec_big_tx/100_events-12             132.2µ ± ∞ ¹    156.4µ ± ∞ ¹   +18.24% (p=0.008 n=5)
IsTraceSampled/sampled-12                             72.30n ± ∞ ¹    76.81n ± ∞ ¹    +6.24% (p=0.008 n=5)
IsTraceSampled/unknown-12                             383.0n ± ∞ ¹    423.7n ± ∞ ¹   +10.63% (p=0.008 n=5)
geomean                                               29.07µ          31.94µ          +9.88%
¹ need >= 6 samples for confidence interval at level 0.95

                                            │ build/main/bench.out │               bench.out                │
                                            │         B/op         │      B/op       vs base                │
ReadEvents/json_codec_big_tx/10_events-12            100.9Ki ± ∞ ¹    101.0Ki ± ∞ ¹  +0.05% (p=0.032 n=5)
ReadEvents/nop_codec/399_events-12                   876.7Ki ± ∞ ¹    879.3Ki ± ∞ ¹  +0.29% (p=0.032 n=5)
ReadEvents/nop_codec/1000_events-12                  2.084Mi ± ∞ ¹    2.078Mi ± ∞ ¹  -0.32% (p=0.024 n=5)
ReadEvents/nop_codec_big_tx/1_events-12              3.195Ki ± ∞ ¹    3.191Ki ± ∞ ¹  -0.12% (p=0.008 n=5)
ReadEvents/nop_codec_big_tx/1000_events-12           2.075Mi ± ∞ ¹    2.089Mi ± ∞ ¹  +0.65% (p=0.048 n=5)
geomean                                              31.36Ki          31.37Ki        +0.01%
¹ need >= 6 samples for confidence interval at level 0.95
² all samples are equal

                                            │ build/main/bench.out │              bench.out               │
                                            │      allocs/op       │  allocs/op    vs base                │
geomean                                                144.7          144.7        +0.00%
¹ need >= 6 samples for confidence interval at level 0.95
² all samples are equal

report generated with https://pkg.go.dev/golang.org/x/perf/cmd/benchstat

lahsivjar · 2023-02-22T04:05:03Z

x-pack/apm-server/aggregation/txmetrics/aggregator.go

+			overflowCount := int64(svcEntry.otherCardinalityEstimator.Estimate())
+			if isMetricsPeriod {
+				activeGroups = int64(current.entries)
+				if svc == overflowBucketName {
+					servicesOverflow += overflowCount
+				} else if svcEntry.entries >= a.config.MaxTransactionGroupsPerService {
+					perSvcTxnGroupsOverflow += overflowCount
+				} else {
+					txnGroupsOverflow += overflowCount
+				}
+			}


[For reviewers] Collecting the metrics in the publish method was driven by the following factors:

The goal is to collect the total number of unique transaction groups that overflowed due to the max services limit, the per-service max transaction limit, and the max transaction limit. This requires the estimate from the HyperLogLog sketch. The value is still an estimate, because of hash collisions and HyperLogLog usage, but the metrics are better aligned with the published metricset.

A HyperLogLog++ sketch with sparse representation requires a write lock even to estimate cardinality.

IIUC, CollectMonitoring is part of a pull-based monitoring architecture, and doing a full lock on the whole struct didn't sound like a good idea.
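
The reasoning above can be sketched in Go as follows. This is an illustration under stated assumptions, not the code in this PR: `cardinalitySketch` is an exact set standing in for a HyperLogLog++ sketch, all names are hypothetical, and whether the real aggregator overwrites or accumulates the published value is not shown here. The point is that the estimate is taken inside publish, which already holds the lock, while the pull-based reader only performs an atomic load.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// cardinalitySketch is a stand-in for a HyperLogLog++ sketch. An exact set
// keeps the example dependency-free; Estimate is treated as needing
// exclusive access, as described in the comment above.
type cardinalitySketch struct{ seen map[string]struct{} }

func (s *cardinalitySketch) Insert(key string) {
	if s.seen == nil {
		s.seen = make(map[string]struct{})
	}
	s.seen[key] = struct{}{}
}

func (s *cardinalitySketch) Estimate() uint64 { return uint64(len(s.seen)) }

// aggregator is a hypothetical, heavily reduced aggregator.
type aggregator struct {
	mu         sync.Mutex
	overflow   cardinalitySketch
	overflowed int64 // read and written atomically
}

// recordOverflow notes a transaction group that landed in the overflow bucket.
func (a *aggregator) recordOverflow(txnGroup string) {
	a.mu.Lock()
	defer a.mu.Unlock()
	a.overflow.Insert(txnGroup)
}

// publish runs periodically. It takes the lock anyway, so estimating the
// cardinality here adds no extra lock acquisition on the monitoring path.
func (a *aggregator) publish() {
	a.mu.Lock()
	est := int64(a.overflow.Estimate())
	a.mu.Unlock()
	atomic.StoreInt64(&a.overflowed, est)
}

// collect models a pull-based monitoring read: no mutex, only an atomic load.
func (a *aggregator) collect() int64 {
	return atomic.LoadInt64(&a.overflowed)
}

func main() {
	a := &aggregator{}
	a.recordOverflow("GET /a")
	a.recordOverflow("GET /b")
	a.recordOverflow("GET /a") // duplicate group, counted once
	fmt.Println(a.collect())   // value before the first publish
	a.publish()
	fmt.Println(a.collect()) // value after publish
}
```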

simitt · 2023-02-23T08:10:46Z

x-pack/apm-server/aggregation/txmetrics/aggregator_test.go

-			uniqueTxnCount:                   60,
-			uniqueServices:                   20,
-			expectedActiveGroups:             40,
-			expectedPerSvcTxnLimitOverflow:   2,


why is this 0 now? (same for the above example)

I changed the meaning behind the variables. Previously, expectedPerSvcTxnLimitOverflow represented the total number of events recorded in the overflow bucket of each service. This included overflow due to the max transaction limit as well as overflow due to the per-service transaction limit (if the max transaction limit is breached, we still record the value in the overflow bucket of the specific service). In this PR, I updated the variables to represent the total overflow due to a specific limit breach. The older and newer values can be translated as:

expectedPerSvcTxnLimitOverflow * min(uniqueServices, maxServices) = expectedOverflowReasonPerSvcTxnGrps + expectedOverflowReasonTxnGrps and
expectedOtherSvcTxnLimitOverflow = expectedOverflowReasonSvc
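
To make the translation concrete, here is a small Go sketch that checks both relations for one set of test expectations. The struct and function names are hypothetical and the numbers in `main` are made up for illustration; they are not taken from the test file.

```go
package main

import "fmt"

// oldExpectation holds the pre-PR test fields named in the comment above.
type oldExpectation struct {
	perSvcTxnLimitOverflow   int
	otherSvcTxnLimitOverflow int
}

// newExpectation holds the post-PR test fields named in the comment above.
type newExpectation struct {
	overflowReasonPerSvcTxnGrps int
	overflowReasonTxnGrps       int
	overflowReasonSvc           int
}

func minInt(a, b int) int {
	if a < b {
		return a
	}
	return b
}

// consistent reports whether old and new expectations satisfy both relations.
func consistent(o oldExpectation, n newExpectation, uniqueServices, maxServices int) bool {
	lhs := o.perSvcTxnLimitOverflow * minInt(uniqueServices, maxServices)
	rhs := n.overflowReasonPerSvcTxnGrps + n.overflowReasonTxnGrps
	return lhs == rhs && o.otherSvcTxnLimitOverflow == n.overflowReasonSvc
}

func main() {
	// Made-up values: 2 overflowed groups per service across min(20, 10) = 10
	// services gives 20, split here as 5 per-service plus 15 global overflows.
	o := oldExpectation{perSvcTxnLimitOverflow: 2, otherSvcTxnLimitOverflow: 4}
	n := newExpectation{overflowReasonPerSvcTxnGrps: 5, overflowReasonTxnGrps: 15, overflowReasonSvc: 4}
	fmt.Println(consistent(o, n, 20, 10))
}
```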

lahsivjar force-pushed the monitoring_agg_10207 branch from 5f75b31 to 9b89ea4 Compare February 22, 2023 03:44

mergify bot added the backport-skip Skip notification from the automated backport with mergify label Feb 22, 2023

carsonip added backport-8.7 Automated backport with mergify and removed backport-skip Skip notification from the automated backport with mergify labels Feb 22, 2023

lahsivjar requested a review from a team February 22, 2023 05:51

lahsivjar added 6 commits February 23, 2023 15:08

Record overflow metrics for txn metrics aggregation

96ede1f

Minor refactor

0f65949

Minor refactor of txmetrics metrics

bee7be5

Add monitoring metrics for service txn aggregation

18e7c97

Add monitoring metrics for span aggregation

5530e84

Add monitoring metrics for svc summary aggregation

58b7303

lahsivjar force-pushed the monitoring_agg_10207 branch from e92d728 to 58b7303 Compare February 23, 2023 07:21

Add changelog

f463a72

lahsivjar marked this pull request as ready for review February 23, 2023 07:25

lahsivjar added the test-plan label Feb 23, 2023

simitt reviewed Feb 23, 2023

simitt approved these changes Feb 23, 2023

lahsivjar added 2 commits February 23, 2023 19:40

Update changelog

986b07f

Merge branch 'main' into monitoring_agg_10207

fc14710

lahsivjar enabled auto-merge (squash) February 23, 2023 11:44

lahsivjar merged commit 55e78ab into elastic:main Feb 23, 2023

mergify bot pushed a commit that referenced this pull request Feb 23, 2023

Collect monitoring for the overflow metrics for all aggregations (#10330) (cherry picked from commit 55e78ab)

ed4f940

mergify bot mentioned this pull request Feb 23, 2023

[8.7] Collect monitoring for the overflow metrics for all aggregations (backport #10330) #10342

Merged

lahsivjar added a commit that referenced this pull request Feb 23, 2023

Collect monitoring for the overflow metrics for all aggregations (#10330) (#10342) (cherry picked from commit 55e78ab) Co-authored-by: Vishal Raj <vishal.raj@elastic.co>

ad112d1

lahsivjar deleted the monitoring_agg_10207 branch February 27, 2023 02:45

miltonhultgren mentioned this pull request Mar 9, 2023

Update mappings to align with changes to APM Server metrics elastic/kibana#153048

Closed

miltonhultgren mentioned this pull request Mar 10, 2023

[beat] Update mappings for APM Server aggregation metrics elastic/integrations#5509

Merged
