
Emits a counter value for every unique view of the hashring #5672

Conversation

@davidporter-id-au (Member) commented Feb 18, 2024:

Context and problem

The consistent hashring is an eventually consistent view of the service's members, used to route requests amongst shards. It is only eventually consistent because actual strong consistency for membership is enforced by database shard locking.

However, it can be very eventually consistent, particularly if the membership-resolving system is serving stale data for some reason. This causes problems, especially for the history service, because it breaks inter-service communication and routing.

Solution

A simple way to detect an inconsistency, therefore, is to hash the complete membership set and emit it from each host, like a fingerprint. In the healthy case, this identifier value quickly converges across hosts. If their views are inconsistent, it shows up as gauge values that remain persistently different, indicating that some manual operation needs to be taken.
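For illustration, a minimal sketch of the fingerprint idea, assuming the membership list is sorted and hashed with FNV (the function name and hash choice here are illustrative, not necessarily what this PR implements):

```go
package membership

import (
	"hash/fnv"
	"io"
	"sort"
)

// membershipFingerprint reduces a host's complete view of the membership set
// to a single number that can be emitted as a gauge. Hosts that agree on the
// membership emit the same value; a value that stays different on some hosts
// points to a split view of the ring.
func membershipFingerprint(members []string) float64 {
	sorted := make([]string, len(members))
	copy(sorted, members)
	sort.Strings(sorted) // only the set matters, not the order it was observed in

	h := fnv.New32a()
	for _, m := range sorted {
		io.WriteString(h, m)
		io.WriteString(h, "\n") // separator so adjacent member names can't merge
	}
	return float64(h.Sum32())
}
```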

Querying the data

Assuming M3, the trick is to look for differences between the identifier values. The value itself is just an arbitrary hash, so select the difference between the upper and lower bounds across hosts:

max = fetch service:cadence* operation:hashring region:region1 service:cadence-history | maxSeries;
min = fetch service:cadence* operation:hashring region:region1 service:cadence-history | minSeries;

-- select where they're different
max | != (min)

-- and emit -1 for the default case where the upper value and lower value are all the same
| transformNull -1

Extended periods of values other than -1 indicate a split.

Callouts

This does emit metrics at container-level granularity, which may mean somewhat higher cardinality for large users.

@coveralls commented Feb 18, 2024:

Pull Request Test Coverage Report for Build 018dc38a-e905-4f65-b0c9-581902e67c80

Details

  • 32 of 49 (65.31%) changed or added relevant lines in 4 files are covered.
  • 109 unchanged lines in 17 files lost coverage.
  • Overall coverage decreased (-0.03%) to 62.719%

| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
|---|---|---|---|
| cmd/server/cadence/server.go | 0 | 2 | 0.0% |
| common/metrics/tags.go | 0 | 6 | 0.0% |
| common/membership/hashring.go | 30 | 39 | 76.92% |

| Files with Coverage Reduction | New Missed Lines | % |
|---|---|---|
| common/persistence/sql/sqlplugin/mysql/db.go | 2 | 83.33% |
| common/persistence/sql/sqlplugin/mysql/task.go | 2 | 73.68% |
| common/task/fifo_task_scheduler.go | 2 | 84.54% |
| common/task/parallel_task_processor.go | 2 | 93.06% |
| common/task/weighted_round_robin_task_scheduler.go | 2 | 88.06% |
| common/util.go | 2 | 91.69% |
| service/history/execution/mutable_state_util.go | 2 | 37.63% |
| service/history/shard/context.go | 2 | 66.52% |
| service/history/task/transfer_standby_task_executor.go | 2 | 86.6% |
| service/matching/taskListManager.go | 2 | 80.46% |

| Totals | Coverage Status |
|---|---|
| Change from base Build 018db582-cf6a-4726-8ce9-adba3d0a8053 | -0.03% |
| Covered Lines | 92614 |
| Relevant Lines | 147666 |

💛 - Coveralls

) *ring {
-	hashring := &ring{
+	ring := &ring{
Member (Author) commented:

linter was complaining about a naming collision with the package
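For readers unfamiliar with that lint warning: a local variable named after an imported package shadows the package for the rest of the function. A hedged illustration (the import path and helper names are examples, not necessarily what this file uses):

```go
package example

import "github.com/serialx/hashring" // example import path for a hashring package

// Naming the local `hashring` shadows the imported package, which linters flag:
func buildShadowed(nodes []string) *hashring.HashRing {
	hashring := hashring.New(nodes) // from here on, `hashring` is the variable, not the package
	return hashring
}

// Renaming the local, as the diff above does with `ring`, avoids the collision:
func build(nodes []string) *hashring.HashRing {
	ring := hashring.New(nodes)
	return ring
}
```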

}
self, err := r.peerProvider.WhoAmI()
if err != nil {
r.logger.Error("Observed a problem self", tag.Error(err))
Member commented:

I suggest changing the message to be a bit more meaningful.
Maybe "Observed a problem getting self hash ID"?

@davidporter-id-au (Member, Author) commented Feb 19, 2024:

🤦 yes, I dunno what I was thinking, I think I got distracted halfway through writing it

// trimming here to 100 distinct values, since that should be enough
// the reason for this trimming is that collision is pretty unlikely, and this metric
// has only value in emitting that it is different from other hosts, so keeping the
// hash space small here
@jakobht (Member) commented Feb 19, 2024:

I don't understand why we want to do this - I understand it's not an issue to do, but I do not understand what the advantage of doing it is?

@davidporter-id-au (Member, Author) commented Feb 19, 2024:

honestly, there's no really good reason, and I maybe should make it greater than 100 or remove it entirely. I did it initially because I was confused (I was thinking of creating series and therefore wanted to limit the cardinality).

I realised afterwards that I didn't need to do that, but left it just to avoid any overflows in the metric system, since I wanted to keep negative integers separate.

Open to changing it or ideas, I don't feel too strongly to be honest.

Member commented:

Maybe add something to the comment at least, so when I read it in a few months I'll not be confused again.

"Truncating the hash in case the metric system does not work well with too big numbers"?

I don't feel strongly either, but if there is no explanation we'll never be able to change this, because "they must have had a reason to do this".

I guess just having this thread helps 😄

Member (Author) commented:

Let me fix that commit, it is confusing, you're right
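A hedged sketch of what the clarified truncation could look like, with the reasoning folded into the comment (the modulus of 100 comes from the quoted code comment; the helper name is illustrative):

```go
// trimHashValue keeps the emitted identifier in a small, non-negative range.
// Collisions are acceptable: the gauge only has to show whether hosts disagree,
// and a small positive value avoids any concerns about very large numbers in
// the metrics pipeline while leaving negative values (such as the -1 used in
// queries) free to mean "no data".
func trimHashValue(hash uint32) float64 {
	return float64(hash % 100)
}
```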

@taylanisikdemir (Member) left a comment:

since we don't have an integration test to validate emitted metrics, did you test locally and check the emitted metrics?

@@ -275,6 +282,41 @@ func (r *ring) ring() *hashring.HashRing {
return r.value.Load().(*hashring.HashRing)
}

func (r *ring) emitHashIdentifier() float64 {
members, err := r.peerProvider.GetMembers(r.service)
Member commented:

does GetMembers return the full list, including self? If not, we should add it

Member (Author) commented:

I don't know to be honest, but also I think a view of the hashring that excludes the current machine is a valid view (if it's drained, for example), and you still want to ensure it's consistent across both drained and undrained zones.

I would not want to mutate the membership list to add this, because it would make a drain case look like a split when it isn't one.

Member commented:

If each host sees a list of members that doesn't include themselves then the hashed value will never match. Am I missing something?

Member (Author) commented:

Maybe I'm missing something?

If I'm on host 10.0.0.1 and I have a view of the hashring equivalent to the hashed version of

10.0.0.2
10.0.0.3

then, this should be the same as the view for 10.0.0.2 and 3 respectively? The fact that it's not actually on the active list is not a problem?

Member (Author) commented:

the only reason I fetched self was to add it as a dimension for the gauge, to prevent interleaving of the metric
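A hedged sketch of that wiring, assuming a tally scope on the ring and that GetMembers/WhoAmI expose an address per host; the metric name, tag name, and helpers (membershipFingerprint, trimHashValue) are the illustrative ones sketched above, not this PR's exact identifiers:

```go
// emitHashIdentifier hashes this host's view of the membership and emits it as
// a gauge tagged with the host's own identity, so that values from different
// hosts land in separate series rather than interleaving in one.
func (r *ring) emitHashIdentifier() {
	members, err := r.peerProvider.GetMembers(r.service)
	if err != nil {
		r.logger.Error("failed to get members for hashring identifier", tag.Error(err))
		return
	}
	self, err := r.peerProvider.WhoAmI()
	if err != nil {
		r.logger.Error("failed to get self for hashring identifier", tag.Error(err))
		return
	}
	addresses := make([]string, 0, len(members))
	for _, m := range members {
		addresses = append(addresses, m.GetAddress()) // assumes members expose an address
	}
	value := trimHashValue(uint32(membershipFingerprint(addresses)))

	// assumes the ring holds a tally.Scope in r.scope
	r.scope.Tagged(map[string]string{"host": self.GetAddress()}).
		Gauge("hashring_view_identifier").
		Update(value)
}
```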

@davidporter-id-au (Member, Author) commented:

> since we don't have an integration test to validate emitted metrics, did you test locally and check the emitted metrics?

I deployed and tested it, yes

@davidporter-id-au enabled auto-merge (squash) February 19, 2024 22:34
@davidporter-id-au merged commit ff95842 into cadence-workflow:master Feb 19, 2024
15 of 16 checks passed