Fixed HA Tracker jitter causing unnecessary CAS operations #1861
Nice bug!
pkg/distributor/ha_tracker_test.go
Outdated
if r.GetReplica() != replica {
	err = fmt.Errorf("replicas did not match: %s != %s", r.GetReplica(), replica)
	continue outer
}
if timestamp.Time(r.GetReceivedAt()).Equal(expected) {
} else if !timestamp.Time(r.GetReceivedAt()).Equal(expected) {
Tiny detail, but using 'else if' when the previous 'if' body ends with continue, break or return is not very nice. The previous version was clearer.
Right! Fixed
if r.GetReplica() != replica {
	err = fmt.Errorf("replicas did not match: %s != %s", r.GetReplica(), replica)
	continue outer
}
if timestamp.Time(r.GetReceivedAt()).Equal(expected) {
if !timestamp.Time(r.GetReceivedAt()).Equal(expected) {
Nice catch! :)
LGTM overall, the actual change is pretty minor but good catch!
Were the whitespace/newline and mtime changes intentional?
})
assert.NoError(t, err)

// Write the first time.
mtime.NowForce(start)
was there something that necessitated this change?
It avoids flaky tests.
})
assert.NoError(t, err)

// Write the first time.
mtime.NowForce(startTime)
same here
It avoids flaky tests.
@gouthamve Do we want this in the |
Signed-off-by: Marco Pracucci <marco@pracucci.com>
47e140c to 9e37d00
What this PR does:
Today @pstibrany and I spent a few hours debugging an unexpected pattern in the number of CAS operations done by the distributors when the HA tracker is enabled.
We found out that PR #1748, which introduced the update timeout jitter, also introduced a time window (as long as the jitter) during which every request does a CAS, but the CAS operation itself doesn't update the replica's updated timestamp. The CAS is triggered if
now - receivedAt >= updateTimeout
but the CAS function is then a no-op if
now - receivedAt < updateTimeout + jitter
While adding tests, I also realized that checkReplicaTimestamp() was broken (tests were actually failing), so I've fixed it.
Which issue(s) this PR fixes:
No issue
Checklist
CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]