Refactor PeerProvider & hashring interaction #6296
Conversation
@@ -59,14 +59,14 @@ type PeerProvider interface {
 	GetMembers(service string) ([]HostInfo, error)
 	WhoAmI() (HostInfo, error)
 	SelfEvict() error
-	Subscribe(name string, notifyChannel chan<- *ChangedEvent) error
+	Subscribe(name string, handler func(ChangedEvent)) error
this is probably a better API yeah,
@@ -99,7 +99,7 @@ func newHashring(
 	service: service,
 	peerProvider: provider,
 	shutdownCh: make(chan struct{}),
-	refreshChan: make(chan *ChangedEvent),
+	refreshChan: make(chan struct{}, 1),
I know the log you're referring to in the description, but I'm slightly nervous about changing this: I'd want to ensure that it doesn't suddenly cause some upstream process to block way more.
Have we tested it in practice? I'm guessing the result would be to drop a lot more events, or block a lot more?
Given that we don't really care about the changes, only the whole state, would a debounce (on an event, waiting for 100ms and then only processing the last event and dropping all the rest, for example) make more sense?
oh, I think there's already a debounce (line 255)
I'm not sure I understand why we should try to make this blocking then? Not opposed to the change, but apart from the log noise, I'm not sure if it presents a problem?
Let me explain this, since it is a really tricky part. What we currently have in hashring.go is an unbuffered channel. Writes from peerProvider (ringpop/uns) are already happening in a non-blocking way (via select). It means a write will only succeed if the receiver is reading the channel right now. If it is doing something else (not ready yet, or it is in refresh(), which of course takes some time), the write will fail. That's why we see a huge % of writes failing - another membership change happens during refresh() or update or while notifying subscribers.
I made this playground to illustrate the issue: https://go.dev/play/p/2Os6fkarH8W
Having a channel of size 1 guarantees we never miss the fact of being notified (notice, PeerProvider now calls an update function).
What happened before (with a non-buffered chan):
- PeerProvider writes a ChangedEvent
- We receive the ChangedEvent and call refresh() etc.
- While we are in refresh(), PeerProvider gets another event and tries to write it to the channel. Since we're not reading the channel right now (we are in refresh()), it won't be able to notify us, and will log "channel is full"
- We finish refresh(), but we don't have anything in the channel, so we will wait until the next refresh() or defaultRefreshInterval=10s to capture the update.
What happens now (with buffer=1):
- PeerProvider calls the handler
- The handler adds an event to the channel to trigger refresh()
- We read refreshChan; the channel is now empty again; we call refresh() etc.
- While we are in refresh(), PeerProvider gets another event and calls the handler again. The handler adds a notification to refreshChan.
- refresh() and the rest finish, and we are back to reading refreshChan. There is a message to read, which means we didn't lose the fact that we've been notified.
"why we should try to make this blocking then?"
I don't quite understand what you mean by that?
my comment was wrong, I follow your point after discussing: we risk losing the last event if they're sequenced one after the other and this keeps diverting to the default path. 🙏 for the explanation
This reverts commit 9807d5d.
What changed?
After changes:
- hashring never misses updates (no "channel is full" situation possible)
- subscribers of MultiringResolver (history, matching) will always get a sane ChangedEvent which reflects the REAL changes (see details below)
Previously there were problems:
- hashring subscribed to PeerProvider (ringpop/uns) with a non-buffered channel, which led to failures to write every time ring was doing something else than reading the channel (happened 60% of the time based on error-logs). Switched to calling handlers instead, which implement a schedule-update approach with a channel of cap=1 (see signalSelf). This approach never skips updates.
- PeerProvider supplies a ChangedEvent to ring, but in reality we do not use it - we refresh everything from scratch. This makes it very misleading to even rely on the ChangedEvent. Basically, we might be triggered by some event (host "c" appeared), but during refresh() we realise there are more changes (host "a" removed, host "c" added as well, etc.), and we notify our Subscribers with absolutely irrelevant data.
- During Stop() (we held the subscribers-list locked while we could be notifying subscribers at the same moment, and we were waiting for refreshRingWorker to exit) we sometimes had issues with a 1m delay, which you could observe even in a local setup with ^C being too slow. We need to be careful with lock-scope.
- The same misleading behaviour took place in other methods like emitHashIdentifier. It retrieved the list of members from PeerProvider independently, which could lead to emitting a hash of a different state than the members we just retrieved in refresh().
- Some tests were working "by mistake": like TestFailedLookupWillAskProvider and TestRefreshUpdatesRingOnlyWhenRingHasChanged.
All in all, now methods are more synchronised and called more expectedly (compareMembers should not make a new map), and notifying subscribers is inseparable from ring::refresh(), like it should be.
Why?
We need to fix the "channel is full" situation, and not just work around it, but fix it with refactoring.
The reason: there will be another diff which fixes interaction with MultiringResolver's subscribers (mainly, they should care about the pace and delays, not the very-deep-internal ring).
How did you test it?
Unit-tests
Potential risks
Release notes
If your code implements a custom PeerProvider (for instance, UNS at Uber),
you need to change the interaction from channels to calling functions (handlers).
Just make the same small change as I did in
common/peerprovider/ringpopprovider/provider.go
Documentation Changes