Collect monitoring for the overflow metrics for all aggregations #10330
Conversation
Force-pushed from 5f75b31 to 9b89ea4 (Compare)
This pull request does not have a backport label. Could you fix it @lahsivjar? 🙏
NOTE: 📚 Go benchmark report. Diff with the report generated with https://pkg.go.dev/golang.org/x/perf/cmd/benchstat
```go
// Estimate how many unique transaction groups landed in this
// service's overflow bucket (HyperLogLog++ cardinality estimate).
overflowCount := int64(svcEntry.otherCardinalityEstimator.Estimate())
if isMetricsPeriod {
	// Total number of active transaction groups in the current interval.
	activeGroups = int64(current.entries)
	// Attribute the overflow to the limit that caused it.
	if svc == overflowBucketName {
		// Entries recorded under the global overflow service,
		// i.e. the max services limit was breached.
		servicesOverflow += overflowCount
	} else if svcEntry.entries >= a.config.MaxTransactionGroupsPerService {
		// The per-service transaction groups limit was breached.
		perSvcTxnGroupsOverflow += overflowCount
	} else {
		// Otherwise the global max transaction groups limit was breached.
		txnGroupsOverflow += overflowCount
	}
}
```
[For reviewers] Collecting the metrics in the publish method was driven by the following factors:
- We want the total number of unique transaction groups that overflowed due to the max services limit, the per-service max transaction groups limit, and the max transaction groups limit. This requires the estimate from the HyperLogLog sketch. While the value is still an estimate (due to hash collisions and HyperLogLog itself), the metrics will be better aligned with the published metricset.
- A HyperLogLog++ sketch with sparse representation requires a write lock even to estimate cardinality.
- IIUC, CollectMonitoring is part of a pull-based monitoring architecture, and taking a full lock on the whole struct didn't sound like a good idea (see the sketch below).
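To make the trade-off concrete, here is a minimal sketch of the pattern the comment describes, with hypothetical names (this is not the code from the PR): the counters are computed in publish while the aggregator's lock is already held, and stored in atomics so a pull-based CollectMonitoring can read them without locking the struct or touching the HyperLogLog++ sketches.

```go
package txmetrics

import "sync/atomic"

// overflowStats holds the most recently published overflow counters.
// Names are illustrative, not the PR's actual fields.
type overflowStats struct {
	servicesOverflow        atomic.Int64
	perSvcTxnGroupsOverflow atomic.Int64
	txnGroupsOverflow       atomic.Int64
}

// storeFromPublish is called from publish, while the aggregator's
// lock (and therefore safe access to the HLL sketches) is held.
func (s *overflowStats) storeFromPublish(services, perSvc, txn int64) {
	s.servicesOverflow.Store(services)
	s.perSvcTxnGroupsOverflow.Store(perSvc)
	s.txnGroupsOverflow.Store(txn)
}

// load can be called from CollectMonitoring at any time without
// taking the aggregator lock.
func (s *overflowStats) load() (services, perSvc, txn int64) {
	return s.servicesOverflow.Load(),
		s.perSvcTxnGroupsOverflow.Load(),
		s.txnGroupsOverflow.Load()
}
```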
Force-pushed from e92d728 to 58b7303 (Compare)
```go
uniqueTxnCount:                 60,
uniqueServices:                 20,
expectedActiveGroups:           40,
expectedPerSvcTxnLimitOverflow: 2,
```
why is this 0 now? (same for the above example)
I changed the meaning of the variables. Previously, expectedPerSvcTxnLimitOverflow represented the total number of events recorded in each service's overflow bucket. This included overflow due to the max transaction groups limit as well as the per-service transaction groups limit (since, if the max transaction groups limit is breached, we still record the value in the overflow bucket of the specific service). In this PR, I updated the variables to represent the total overflow due to a specific limit breach. The old and new values translate as:

`expectedPerSvcTxnLimitOverflow * min(uniqueServices, maxServices) = expectedOverflowReasonPerSvcTxnGrps + expectedOverflowReasonTxnGrps`

and

`expectedOtherSvcTxnLimitOverflow = expectedOverflowReasonSvc`
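As a concrete illustration using the test case above (uniqueTxnCount: 60, uniqueServices: 20, expectedActiveGroups: 40), and assuming a hypothetical maxServices of 10: the old expectedPerSvcTxnLimitOverflow of 2 translates to 2 * min(20, 10) = 20 overflowed groups in the new scheme, i.e. expectedOverflowReasonPerSvcTxnGrps + expectedOverflowReasonTxnGrps = 20, which is consistent with uniqueTxnCount (60) minus expectedActiveGroups (40).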
(cherry picked from commit 55e78ab)
Motivation/summary
Collect monitoring for the overflow metrics for all aggregations. Some major changes in the PR are:
- Collect metrics in the `publish` method; this means the metrics are updated periodically based on the metrics interval (set to `1m`).
- Renamed `txmetrics.overflowed` to `txmetrics.overflowed.total`.
- Added `txmetrics.overflowed.services`, `txmetrics.overflowed.per_service_txn_groups` and `txmetrics.overflowed.txn_groups` to represent the overflow reason (illustrated below).
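For illustration only, the renamed counters might surface in a monitoring snapshot along these lines; the metric names come from this PR, but the values and the flat-map shape are invented for demonstration:

```go
package main

import "fmt"

func main() {
	// Illustrative snapshot of the renamed overflow metrics;
	// the values are made up.
	snapshot := map[string]int64{
		"txmetrics.overflowed.total":                  22,
		"txmetrics.overflowed.services":               0,
		"txmetrics.overflowed.per_service_txn_groups": 20,
		"txmetrics.overflowed.txn_groups":             2,
	}
	for name, v := range snapshot {
		fmt.Printf("%s=%d\n", name, v)
	}
}
```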
Checklist
- [ ] Update package changelog.yml (only if changes to `apmpackage` have been made)
- [ ] Documentation has been updated

How to test these changes
Create overflow as per these steps and check if the metrics are published.
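One way to check, assuming the server's HTTP monitoring endpoint is enabled (`http.enabled: true`, default port 5066, an assumption about the deployment); a minimal Go check could fetch `/stats` and look for the new counters:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"strings"
)

func main() {
	// Assumes http.enabled: true in the server config, exposing
	// the stats endpoint on the default monitoring port.
	resp, err := http.Get("http://localhost:5066/stats")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		panic(err)
	}
	// Look for the new overflow counters in the payload.
	if strings.Contains(string(body), "overflowed") {
		fmt.Println("overflow metrics present")
	}
}
```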
Related issues
Fixes #10207