Refactor PeerProvider & hashring interaction #6296

Merged
merged 2 commits on Sep 27, 2024

Conversation

Member

@dkrotx dkrotx commented Sep 20, 2024

What changed?
After changes:

  • hashring never misses updates (no "channel is full" situation possible)
  • subscribers of MultiringResolver (history, matching) will always get
    a sane ChangedEvent which reflects the REAL changes (see details below)

Previously there were problems:

  • hashring subscribed to PeerProvider (ringpop/uns) with a non-buffered channel,
    which led to failed writes every time the ring was doing something
    other than reading the channel (happened 60% of the time based on error logs).
    Switched to calling handlers instead, which implement a schedule-update
    approach with a cap=1 channel (see signalSelf; a sketch of the pattern
    follows this section). This approach never skips updates.
  • PeerProvider supplies a ChangedEvent to the ring, but in reality we do
    not use it - we refresh everything from scratch. This makes it very
    misleading to even rely on the ChangedEvent. Basically, we might be
    triggered by some event (host "c" appeared), but during refresh() we
    realise there are more changes (host "a" removed, host "c" added as
    well, etc.), and we would notify our Subscribers with absolutely
    irrelevant data.
  • Because of a race condition in Stop() (we held the subscribers-list lock while we
    could be notifying subscribers at the same moment, and we were waiting for
    refreshRingWorker to exit) we sometimes had issues with a 1m delay, which you could
    observe even in a local setup as ^C being too slow. We need to be careful with lock scope.
  • The same misleading behaviour took place in other methods like
    emitHashIdentifier. It retrieved the list of members from PeerProvider
    independently, which could lead to emitting a hash of a different
    state than the members we just retrieved in refresh().
  • Some tests were working "by mistake": like
    TestFailedLookupWillAskProvider and
    TestRefreshUpdatesRingOnlyWhenRingHasChanged.

All in all, methods are now more synchronised and called more predictably
(compareMembers no longer makes a new map), and notifying subscribers
is inseparable from ring::refresh(), as it should be.
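
A minimal sketch of the resulting pattern, assuming simplified types (the real code lives in common/membership/hashring.go; the handler name handleUpdate is illustrative):

```go
// Sketch only: the PeerProvider invokes a handler, and the handler coalesces
// notifications into a cap=1 channel, so the refresh loop can never miss
// the fact that an update happened.
package membership

// ChangedEvent stands in for the package's real ChangedEvent type here.
type ChangedEvent struct{}

type ring struct {
	refreshChan chan struct{} // cap=1: "at least one refresh is pending"
	shutdownCh  chan struct{}
}

// handleUpdate is the kind of handler the hashring registers via PeerProvider.Subscribe.
func (r *ring) handleUpdate(event ChangedEvent) {
	r.signalSelf()
}

// signalSelf schedules a refresh without ever blocking the caller.
// If a refresh is already pending, the new signal is coalesced into it.
func (r *ring) signalSelf() {
	select {
	case r.refreshChan <- struct{}{}:
	default: // already scheduled; the pending signal covers this update too
	}
}

func (r *ring) refreshRingWorker() {
	for {
		select {
		case <-r.shutdownCh:
			return
		case <-r.refreshChan:
			// refresh everything from scratch and notify subscribers with a
			// ChangedEvent computed from the actual diff of members.
		}
	}
}
```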

Why?
We need to fix the "channel is full" situation, and not just work around it, but fix it with refactoring.
The reason: there will be another diff which fixes interaction with MultiringResolver's subscribers
(mainly, they should care about the pace and delays, not the very-deep-internal ring).

How did you test it?
Unit-tests

Potential risks

Release notes
If your code implements a custom PeerProvider (for instance, UNS at Uber),
you need to change the interaction from channels to calling functions (handlers).
Just do the same small change as I did in common/peerprovider/ringpopprovider/provider.go (a hypothetical sketch is below).
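
A hypothetical before/after for such a custom provider (myProvider and its fields are made-up names; only the Subscribe signature change mirrors this PR):

```go
// Hypothetical custom PeerProvider adapting to the new Subscribe signature.
// Everything here except the Subscribe signature itself is illustrative.
package myprovider

import "sync"

// ChangedEvent stands in for membership.ChangedEvent.
type ChangedEvent struct{}

type myProvider struct {
	mu          sync.RWMutex
	subscribers map[string]func(ChangedEvent) // was: map[string]chan<- *ChangedEvent
}

// Before: Subscribe(name string, notifyChannel chan<- *ChangedEvent) error
// After:
func (p *myProvider) Subscribe(name string, handler func(ChangedEvent)) error {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.subscribers[name] = handler
	return nil
}

// On a membership change, call the handlers instead of doing a
// non-blocking channel send that could be silently dropped.
func (p *myProvider) notify(event ChangedEvent) {
	p.mu.RLock()
	defer p.mu.RUnlock()
	for _, handler := range p.subscribers {
		handler(event)
	}
}
```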

Documentation Changes


codecov bot commented Sep 20, 2024

Codecov Report

Attention: Patch coverage is 98.46154% with 1 line in your changes missing coverage. Please review.

Project coverage is 73.27%. Comparing base (0295738) to head (8d31c74).
Report is 5 commits behind head on master.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| common/membership/hashring.go | 98.14% | 0 Missing and 1 partial ⚠️ |
Additional details and impacted files
| Files with missing lines | Coverage Δ |
| --- | --- |
| common/membership/resolver.go | 80.72% <100.00%> (+0.47%) ⬆️ |
| common/peerprovider/ringpopprovider/provider.go | 64.58% <100.00%> (+8.72%) ⬆️ |
| common/membership/hashring.go | 90.95% <98.14%> (+3.73%) ⬆️ |

... and 7 files with indirect coverage changes


Continue to review full report in Codecov by Sentry.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6594452...8d31c74. Read the comment docs.

@@ -59,14 +59,14 @@ type PeerProvider interface {
 	GetMembers(service string) ([]HostInfo, error)
 	WhoAmI() (HostInfo, error)
 	SelfEvict() error
-	Subscribe(name string, notifyChannel chan<- *ChangedEvent) error
+	Subscribe(name string, handler func(ChangedEvent)) error
Member


this is probably a better API, yeah

@@ -99,7 +99,7 @@ func newHashring(
 	service: service,
 	peerProvider: provider,
 	shutdownCh: make(chan struct{}),
-	refreshChan: make(chan *ChangedEvent),
+	refreshChan: make(chan struct{}, 1),
Member


I know the log you're referring to in the description, but I'm slightly nervous changing this: I'd want to ensure that it doesn't suddenly cause some upstream process to block way more.

Have we tested it in practice? I'm guessing the result would be to drop a lot more events, or block a lot more?

Given that we don't really care about the changes, only the whole state, would a debounce (on an event, waiting for 100 ms and then only processing the last event and dropping all the rest, for example) make more sense?
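
A rough sketch of that debounce idea, assuming a generic event channel (100 ms is just the example value above; this is not the actual implementation):

```go
// Sketch of a trailing-edge debounce: after the first event arrives, wait
// for `wait`, keep only the latest event seen in that window, drop the rest.
package debounce

import "time"

func Debounce[E any](events <-chan E, wait time.Duration, process func(E)) {
	for first := range events {
		last := first
		timer := time.NewTimer(wait)
	drain:
		for {
			select {
			case e, ok := <-events:
				if !ok {
					break drain
				}
				last = e
			case <-timer.C:
				break drain
			}
		}
		process(last)
	}
}
```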

Member


oh, I think there's already a debounce (line 255)

I'm not sure I understand why we should try to make this blocking then? Not opposed to the change, but apart from the log noise, I'm not sure if it presents a problem?

Member Author


Let me explain this, since it is a really tricky part. What we currently have in hashring.go is an unbuffered channel. Writes from peerProvider (ringpop/uns) already happen in a non-blocking way (via select). That means a write only succeeds if the receiver is reading the channel at that very moment. If it is doing anything else (not ready yet, or inside refresh(), which of course takes some time), the write fails. That's why we see a huge % of writes failing: another membership change happens during refresh(), during an update, or while notifying subscribers.

I made this playground to illustrate the issue: https://go.dev/play/p/2Os6fkarH8W
Having a channel of size 1 guarantees we never miss the fact that we've been notified (note that PeerProvider now calls an update function).

What happened before (with the unbuffered channel):

  1. PeerProvider writes a ChangedEvent.
  2. We receive the ChangedEvent and call refresh() etc.
  3. While we are in refresh(), PeerProvider gets another event and tries to write it to the channel. Since we're not reading the channel right now (we are in refresh()), it won't be able to notify us, and will log "channel is full".
  4. We finish refresh(), but there is nothing in the channel, so we will wait until the next refresh() or defaultRefreshInterval=10s to capture the update.

What happens now (with buffer=1):

  1. PeerProvider calls the handler.
  2. The handler adds an event to the channel to schedule refresh().
  3. We read refreshChan; the channel is now empty again; we call refresh() etc.
  4. While we are in refresh(), PeerProvider gets another event and calls the handler again. The handler adds a notification to refreshChan.
  5. refresh() and the rest finish, and we are back to reading refreshChan. There is a message to read, which means we didn't lose the fact that we've been notified. (A condensed illustration follows below.)
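
A condensed illustration of the difference, similar in spirit to the linked playground (not the same code):

```go
// Non-blocking sends: with an unbuffered channel the notification is lost
// whenever the reader is busy; with a cap=1 channel the fact that at least
// one notification arrived is kept until the reader gets back to the channel.
package main

import "fmt"

func notify(ch chan struct{}) bool {
	select {
	case ch <- struct{}{}:
		return true // the signal is delivered or parked in the buffer
	default:
		return false // dropped: this is the "channel is full" case
	}
}

func main() {
	unbuffered := make(chan struct{})
	buffered := make(chan struct{}, 1)

	// Nobody is reading right now (the ring is busy inside refresh()).
	fmt.Println(notify(unbuffered)) // false: the event is lost
	fmt.Println(notify(buffered))   // true: parked until the ring reads it
	fmt.Println(notify(buffered))   // false, but harmless: one pending signal
	//                                 already guarantees a refresh will run
}
```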

Member Author

@dkrotx dkrotx Sep 26, 2024


why we should try to make this blocking then?

I don't quite understand what you mean by that?

Member


my comment was wrong, I follow your point after discussing: that we risk losing the last event if they're sequenced one after the other and this keeps diverting to the default path. 🙏 for the explanation

@dkrotx dkrotx merged commit 9807d5d into cadence-workflow:master Sep 27, 2024
20 checks passed
Shaddoll added a commit to Shaddoll/cadence that referenced this pull request Oct 2, 2024
Shaddoll added a commit to Shaddoll/cadence that referenced this pull request Oct 15, 2024
3vilhamster added a commit that referenced this pull request Oct 15, 2024