Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ES-1292925] Fix metrics with reusable counter resets #107

Merged
merged 3 commits into from
Nov 25, 2024
Merged

Conversation

jnyi
Copy link
Collaborator

@jnyi jnyi commented Nov 21, 2024

This PR tries to fix when a reusable counter metrics resets:

image (1)

  • Added a number of unit tested and make sure they pass when doing counter functions.

  • Also unit tested if not doing counter functions the original time series is returned

Pending integration tests, sent for early review and feedbacks

Integration tests work as expected:

Screenshot 2024-11-25 at 9 40 56 AM
  • I added CHANGELOG entry for this change.
  • Change is not relevant to the end user.

Changes

Verification

Signed-off-by: Yi Jin <yi.jin@databricks.com>
@jnyi jnyi requested review from hczhu-db and yuchen-db November 21, 2024 23:23
pkg/dedup/merge_iter.go Show resolved Hide resolved
Comment on lines +40 to +45
var it adjustableSeriesIterator
if m.isCounter {
it = &counterErrAdjustSeriesIterator{Iterator: r.Iterator(nil)}
} else {
it = &noopAdjustableSeriesIterator{Iterator: r.Iterator(nil)}
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are applying counter adjustments to the raw data samples before deduplication. This is more complicated than applying the adjustments to the single time series after deduplication. The adjustments will intervene with quorum-based deduplication logic. I'm concerned it may introduce other edge cases.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add another test case such that one raw time series has a large gap with a reset, but the other two have complete data?

replica 0: [[1000, 10], [10000, 8], [11000, 10]]

replica 1: [[1000, 10], [2000, 0], [3000, 1], [4000, 2], [5000, 3], [6000, 4], [7000, 5], [8000, 6], [9000, 7],  [10000, 8], [11000, 10]
replica 2: [[1000, 10], [2000, 0], [3000, 1], [4000, 2], [5000, 3], [6000, 4], [7000, 5], [8000, 6], [9000, 7],  [10000, 8], [11000, 10]

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added new test cases, the reason i've done it this way is to reuse counterErrAdjustSeriesIterator which you need to call adjustAtValue() somewhere, passing a merged time series to original newDedupSeries doesn't work

// feed the merged series into dedup series which apply counter adjustment
return NewMergedSeries(s.lset, repl, s.f)
}
if s.deduplicationFunc == AlgorithmChain {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fyi @yuchen-db, I've also added the prometheus implementation, but it doesn't work for a number of unit tests if you wonder

@jnyi jnyi requested a review from hczhu-db November 22, 2024 18:06
@jnyi jnyi changed the title [ES-1292925] Fix metrics when reusable counter resets [ES-1292925] Fix metrics with reusable counter resets Nov 22, 2024
Signed-off-by: Yi Jin <yi.jin@databricks.com>
@jnyi jnyi force-pushed the time-series-gaps branch 3 times, most recently from 0b0fcfe to 7c98625 Compare November 22, 2024 18:43
@@ -210,18 +210,20 @@ func newQuerierWithOpts(

partialResponseStrategy := storepb.PartialResponseStrategy_ABORT
if opts.GroupReplicaPartialResponseStrategy {
level.Debug(logger).Log("msg", "Enabled group-replica partial response strategy in newQuerierInternal")
level.Info(logger).Log("msg", "Enabled group-replica partial response strategy in newQuerierInternal")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will be very chatty. I intentionally changed it to Debug previously.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Every query will print this log.

Copy link
Collaborator Author

@jnyi jnyi Nov 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

found i actually doesn't need this log line, it is logged here already:

level.Info(logger).Log("msg", "databricks querier features", "opts", fmt.Sprintf("%+v", opts))

partialResponseStrategy = storepb.PartialResponseStrategy_GROUP_REPLICA
} else if partialResponse {
partialResponseStrategy = storepb.PartialResponseStrategy_WARN
}
level.Info(logger).Log("msg", "Deduplication algorithm applied", "func", opts.DeduplicationFunc)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Every query will print this log.

@@ -210,18 +210,20 @@ func newQuerierWithOpts(

partialResponseStrategy := storepb.PartialResponseStrategy_ABORT
if opts.GroupReplicaPartialResponseStrategy {
level.Debug(logger).Log("msg", "Enabled group-replica partial response strategy in newQuerierInternal")
level.Info(logger).Log("msg", "Enabled group-replica partial response strategy in newQuerierInternal")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Every query will print this log.

Signed-off-by: Yi Jin <yi.jin@databricks.com>
@jnyi jnyi merged commit 67f5336 into db_main Nov 25, 2024
14 checks passed
@jnyi jnyi deleted the time-series-gaps branch November 25, 2024 17:41
@jnyi jnyi mentioned this pull request Dec 2, 2024
2 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants