Emits a counter value for every unique view of the hashring #5672
Conversation
) *ring {
	hashring := &ring{
	ring := &ring{
linter was complaining about a naming collision with the package
common/membership/hashring.go (Outdated)
}
	self, err := r.peerProvider.WhoAmI()
	if err != nil {
		r.logger.Error("Observed a problem self", tag.Error(err))
I suggest changing the message to be a bit more meaningful
Maybe "Observed a problem getting self hash ID" ?
🤦 yes, I dunno what I was thinking, I think I got distracted halfway through writing it
common/membership/hashring.go (Outdated)
// trimming here to 100 distinct values, since that should be enough
// the reason for this trimming is that collision is pretty unlikely, and this metric
// has only value in emitting that it is different from other hosts, so keeping the
// hash space small here
I don't understand why we want to do this - I understand it's not an issue to do, but I do not understand what the advantage of doing it is?
honestly, there's no really good reason, and I maybe should make it greater than 100 or remove it entirely. I did it initially because I was confused (I was thinking of creating series and therefore wanted to limit the cardinality).
I realised afterwards that I didn't need to do that, but left it just to avoid any overflows in the metric system, since I wanted to keep negative integers separate.
Open to changing it or ideas, I don't feel too strongly to be honest.
Maybe add something to the comment at least, so when I read it in a few months I'll not be confused again.
"Truncating the hash in case the metric system does not work well with too big numbers"?
I don't feel strongly either, but if there is no explanation we'll never be able to change this, because "they must have had a reason to do this".
I guess just having this thread helps 😄
Let me fix that commit, it is confusing, you're right
since we don't have an integration test to validate emitted metrics, did you test locally and check the emitted metrics?
@@ -275,6 +282,41 @@ func (r *ring) ring() *hashring.HashRing {
	return r.value.Load().(*hashring.HashRing)
}

func (r *ring) emitHashIdentifier() float64 {
	members, err := r.peerProvider.GetMembers(r.service)
does GetMembers return the full list including self? If not, we should add it
I don't know to be honest, but also I think a view of the hashring that excludes the current machine is a valid view (if it's drained, for example), and you still want to ensure it's consistent across both drained and undrained zones.
I would not want to mutate the membership list to add this, because it would make a drain case look like a split when, in this case, it isn't one.
If each host sees a list of members that doesn't include themselves then the hashed value will never match. Am I missing something?
Maybe I'm missing something?
If I'm on host 10.0.0.1 and I have a view of the hashring equivalent to the hashed version of
10.0.0.2
10.0.0.3
then this should be the same as the view for 10.0.0.2 and 10.0.0.3 respectively? The fact that it's not actually on the active list is not a problem?
the only reason I fetched self was to add it as a dimension for the gauge, to prevent interleaving of the metric
I deployed and tested it, yes
Merged ff95842 into cadence-workflow:master
Context and problem
The consistent hashring is an eventually consistent way to see a service's members for routing among shards (actual strong consistency for membership is controlled by the database shard locking).
However, it can end up being very eventually consistent, particularly if the membership-resolving system is serving stale data for some reason. This causes problems because, particularly in the case of history, it breaks inter-service communication and routing.
Solution
A dumb way to determine if there's an inconsistency, therefore, is to just hash the complete membership set and emit it from each host, like a fingerprint. In the healthy case, this arbitrary identifier value quickly converges across hosts. If the hosts' views are inconsistent, this appears as gauge values which remain persistently different, indicating that some manual operation must be taken.
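As a rough sketch of the fingerprint idea (this is not the code merged in this PR: the HostInfo type, the hashMembershipView helper, and the FNV hash are simplified assumptions; only the trim-to-100-values step mirrors the diff discussed above):

```go
// Sketch: reduce a membership view to a small, arbitrary fingerprint and
// treat it as the gauge value. Any two hosts with the same view produce
// the same value; differing views almost always produce different values.
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// HostInfo is a stand-in for the membership provider's host record.
type HostInfo struct {
	Identity string
}

func hashMembershipView(members []HostInfo) float64 {
	ids := make([]string, 0, len(members))
	for _, m := range members {
		ids = append(ids, m.Identity)
	}
	// Sort so the fingerprint does not depend on the order in which the
	// membership provider happens to return hosts.
	sort.Strings(ids)

	h := fnv.New32a()
	for _, id := range ids {
		h.Write([]byte(id))
		h.Write([]byte{0}) // separator to avoid ambiguous concatenations
	}
	// Trim to a small space of distinct values, as discussed in the thread
	// above; a collision is considered unlikely enough, since the metric only
	// needs to differ between hosts whose views differ.
	return float64(h.Sum32() % 100)
}

func main() {
	viewA := []HostInfo{{"10.0.0.2"}, {"10.0.0.3"}}
	viewB := []HostInfo{{"10.0.0.3"}, {"10.0.0.2"}} // same members, different order
	fmt.Println(hashMembershipView(viewA) == hashMembershipView(viewB)) // true
}
```

Sorting the identities before hashing is what keeps the fingerprint stable across hosts that see the same set of members in different orders.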
Querying the data
Assuming M3, the trick is to look for differences between the identifier values across hosts; the value itself is just a hash and is arbitrary. A query that selects the difference against the upper bound therefore shows a split as an extended period of non -1 values.
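The query itself isn't reproduced here. Purely as an illustration of the shape of such a query, a PromQL-style formulation (which M3 can serve) might look like the following; the metric name hashring_view_identifier and the use of a plain max/min spread, where 0 rather than -1 means the views agree, are assumptions for this sketch, not what the PR actually emits:

```
# Spread between the highest and lowest fingerprint reported per service.
# A persistent non-zero spread means at least two hosts disagree on the
# membership view, i.e. a split.
max by (service) (hashring_view_identifier)
  - min by (service) (hashring_view_identifier)
```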
Callouts
This does emit metrics at container-level granularity, which may mean somewhat higher cardinality for large users.