Refactor PeerProvider & hashring interaction #6296

Merged
merged 2 commits on Sep 27, 2024

Conversation

Member

@dkrotx dkrotx commented Sep 20, 2024

What changed?
After changes:

  • hashring never misses updates (no "channel is full" situation possible)
  • subscribers of MultiringResolver (history, matching) will always get
    a sane ChangedEvent which reflects the REAL changes (see details below)

Previously there were problems:

  • hashring subscribed to PeerProvider (ringpop/uns) with a non-buffered channel,
    which led to failed writes every time the ring was doing something
    other than reading the channel (happened 60% of the time based on error logs).
    Switched to calling handlers instead, which implement a schedule-update
    approach with a cap=1 channel (see signalSelf; a sketch of the pattern
    follows this section). This approach never skips updates.
  • PeerProvider supplies a ChangedEvent to the ring, but in reality we do
    not use it - we refresh everything from scratch. This makes it very
    misleading to even rely on the ChangedEvent. Basically, we might be
    triggered by some event (host "c" appeared), but during refresh() we
    realise there are more changes (host "a" removed, host "c" added as
    well, etc.), and we would notify our Subscribers with absolutely
    irrelevant data.
  • Because of a race condition in Stop() (we held the subscribers-list lock while we
    could be notifying subscribers at the same moment, and we were waiting for
    refreshRingWorker to exit) we sometimes had issues with a 1m delay, which you could
    observe even in a local setup as ^C being too slow. We need to be careful with lock scope.
  • The same misleading behaviour took place in other methods like
    emitHashIdentifier. It retrieved the list of members from PeerProvider
    independently, which could lead to emitting a hash of a different
    state than the members we just retrieved in refresh().
  • Some tests were working "by mistake": like
    TestFailedLookupWillAskProvider and
    TestRefreshUpdatesRingOnlyWhenRingHasChanged.

All in all, methods are now more synchronised and called more predictably
(compareMembers no longer makes a new map), and notifying subscribers
is inseparable from ring::refresh(), as it should be.
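
A minimal sketch of the resulting pattern, assuming simplified types (the real code lives in common/membership/hashring.go; the handler name handleUpdate is illustrative):

```go
// Sketch only: the PeerProvider invokes a handler, and the handler coalesces
// notifications into a cap=1 channel, so the refresh loop can never miss
// the fact that an update happened.
package membership

// ChangedEvent stands in for the package's real ChangedEvent type here.
type ChangedEvent struct{}

type ring struct {
	refreshChan chan struct{} // cap=1: "at least one refresh is pending"
	shutdownCh  chan struct{}
}

// handleUpdate is the kind of handler the hashring registers via PeerProvider.Subscribe.
func (r *ring) handleUpdate(event ChangedEvent) {
	r.signalSelf()
}

// signalSelf schedules a refresh without ever blocking the caller.
// If a refresh is already pending, the new signal is coalesced into it.
func (r *ring) signalSelf() {
	select {
	case r.refreshChan <- struct{}{}:
	default: // already scheduled; the pending signal covers this update too
	}
}

func (r *ring) refreshRingWorker() {
	for {
		select {
		case <-r.shutdownCh:
			return
		case <-r.refreshChan:
			// refresh everything from scratch and notify subscribers with a
			// ChangedEvent computed from the actual diff of members.
		}
	}
}
```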

Why?
We need to fix the "channel is full" situation, and not just work around it, but fix it with refactoring.
The reason: there will be another diff which fixes interaction with MultiringResolver's subscribers
(mainly, they should care about the pace and delays, not the very-deep-internal ring).

How did you test it?
Unit-tests

Potential risks

Release notes
If your code implements a custom PeerProvider (for instance, UNS at Uber),
you need to change the interaction from channels to calling functions (handlers).
Just do the same small change as I did in common/peerprovider/ringpopprovider/provider.go (a hypothetical sketch is below).
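
A hypothetical before/after for such a custom provider (myProvider and its fields are made-up names; only the Subscribe signature change mirrors this PR):

```go
// Hypothetical custom PeerProvider adapting to the new Subscribe signature.
// Everything here except the Subscribe signature itself is illustrative.
package myprovider

import "sync"

// ChangedEvent stands in for membership.ChangedEvent.
type ChangedEvent struct{}

type myProvider struct {
	mu          sync.RWMutex
	subscribers map[string]func(ChangedEvent) // was: map[string]chan<- *ChangedEvent
}

// Before: Subscribe(name string, notifyChannel chan<- *ChangedEvent) error
// After:
func (p *myProvider) Subscribe(name string, handler func(ChangedEvent)) error {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.subscribers[name] = handler
	return nil
}

// On a membership change, call the handlers instead of doing a
// non-blocking channel send that could be silently dropped.
func (p *myProvider) notify(event ChangedEvent) {
	p.mu.RLock()
	defer p.mu.RUnlock()
	for _, handler := range p.subscribers {
		handler(event)
	}
}
```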

Documentation Changes


codecov bot commented Sep 20, 2024

Codecov Report

Attention: Patch coverage is 98.46154% with 1 line in your changes missing coverage. Please review.

Project coverage is 73.27%. Comparing base (0295738) to head (8d31c74).
Report is 5 commits behind head on master.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| common/membership/hashring.go | 98.14% | 0 Missing and 1 partial ⚠️ |
Additional details and impacted files
| Files with missing lines | Coverage Δ |
| --- | --- |
| common/membership/resolver.go | 80.72% <100.00%> (+0.47%) ⬆️ |
| common/peerprovider/ringpopprovider/provider.go | 64.58% <100.00%> (+8.72%) ⬆️ |
| common/membership/hashring.go | 90.95% <98.14%> (+3.73%) ⬆️ |

... and 7 files with indirect coverage changes


Continue to review full report in Codecov by Sentry.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6594452...8d31c74. Read the comment docs.

@@ -59,14 +59,14 @@ type PeerProvider interface {
 	GetMembers(service string) ([]HostInfo, error)
 	WhoAmI() (HostInfo, error)
 	SelfEvict() error
-	Subscribe(name string, notifyChannel chan<- *ChangedEvent) error
+	Subscribe(name string, handler func(ChangedEvent)) error
Member


this is probably a better API, yeah

@@ -99,7 +99,7 @@ func newHashring(
 	service: service,
 	peerProvider: provider,
 	shutdownCh: make(chan struct{}),
-	refreshChan: make(chan *ChangedEvent),
+	refreshChan: make(chan struct{}, 1),
Member


I know the log you're referring to in the description, but I'm slightly nervous changing this: I'd want to ensure that it doesn't suddenly cause some upstream process to block way more.

Have we tested it in practice? I'm guessing the result would be to drop a lot more events, or block a lot more?

Given that we don't really care about the changes, only the whole state, would a debounce (on an event, waiting for 100 ms and then only processing the last event and dropping all the rest, for example) make more sense?
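
A rough sketch of that debounce idea, assuming a generic event channel (100 ms is just the example value above; this is not the actual implementation):

```go
// Sketch of a trailing-edge debounce: after the first event arrives, wait
// for `wait`, keep only the latest event seen in that window, drop the rest.
package debounce

import "time"

func Debounce[E any](events <-chan E, wait time.Duration, process func(E)) {
	for first := range events {
		last := first
		timer := time.NewTimer(wait)
	drain:
		for {
			select {
			case e, ok := <-events:
				if !ok {
					break drain
				}
				last = e
			case <-timer.C:
				break drain
			}
		}
		process(last)
	}
}
```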

Member


oh, I think there's already a debounce (line 255)

I'm not sure I understand why we should try to make this blocking then? Not opposed to the change, but apart from the log noise, I'm not sure if it presents a problem?

Member Author


Let me explain this, since it is a really tricky part. What we currently have in hashring.go is an unbuffered channel. Writes from peerProvider (ringpop/uns) already happen in a non-blocking way (via select). That means a write only succeeds if the receiver is reading the channel at that very moment. If it is doing anything else (not ready yet, or inside refresh(), which of course takes some time), the write fails. That's why we see a huge % of writes failing: another membership change happens during refresh(), during an update, or while notifying subscribers.

I made this playground to illustrate the issue: https://go.dev/play/p/2Os6fkarH8W
Having a channel of size 1 guarantees we never miss the fact that we've been notified (note that PeerProvider now calls an update function).

What happened before (with the unbuffered channel):

  1. PeerProvider writes a ChangedEvent.
  2. We receive the ChangedEvent and call refresh() etc.
  3. While we are in refresh(), PeerProvider gets another event and tries to write it to the channel. Since we're not reading the channel right now (we are in refresh()), it won't be able to notify us, and will log "channel is full".
  4. We finish refresh(), but there is nothing in the channel, so we will wait until the next refresh() or defaultRefreshInterval=10s to capture the update.

What happens now (with buffer=1):

  1. PeerProvider calls the handler.
  2. The handler adds an event to the channel to schedule refresh().
  3. We read refreshChan; the channel is now empty again; we call refresh() etc.
  4. While we are in refresh(), PeerProvider gets another event and calls the handler again. The handler adds a notification to refreshChan.
  5. refresh() and the rest finish, and we are back to reading refreshChan. There is a message to read, which means we didn't lose the fact that we've been notified. (A condensed illustration follows below.)
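
A condensed illustration of the difference, similar in spirit to the linked playground (not the same code):

```go
// Non-blocking sends: with an unbuffered channel the notification is lost
// whenever the reader is busy; with a cap=1 channel the fact that at least
// one notification arrived is kept until the reader gets back to the channel.
package main

import "fmt"

func notify(ch chan struct{}) bool {
	select {
	case ch <- struct{}{}:
		return true // the signal is delivered or parked in the buffer
	default:
		return false // dropped: this is the "channel is full" case
	}
}

func main() {
	unbuffered := make(chan struct{})
	buffered := make(chan struct{}, 1)

	// Nobody is reading right now (the ring is busy inside refresh()).
	fmt.Println(notify(unbuffered)) // false: the event is lost
	fmt.Println(notify(buffered))   // true: parked until the ring reads it
	fmt.Println(notify(buffered))   // false, but harmless: one pending signal
	//                                 already guarantees a refresh will run
}
```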

Member Author

@dkrotx dkrotx Sep 26, 2024


why we should try to make this blocking then?

I don't quite understand what you mean by that?

Member


my comment was wrong, I follow your point after discussing: that we risk losing the last event if they're sequenced one after the other and this keeps diverting to the default path. 🙏 for the explanation

@dkrotx dkrotx merged commit 9807d5d into cadence-workflow:master Sep 27, 2024
20 checks passed
Shaddoll added a commit to Shaddoll/cadence that referenced this pull request Oct 2, 2024
Shaddoll added a commit to Shaddoll/cadence that referenced this pull request Oct 15, 2024
3vilhamster added a commit that referenced this pull request Oct 15, 2024