Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add begin.WithReselect option #1107

Closed
wants to merge 7 commits into from

Conversation

edwarnicke
Copy link
Member

Signed-off-by: Ed Warnicke hagbard@gmail.com

Description

Issue link

How Has This Been Tested?

  • Added unit testing to cover
  • Tested manually
  • Tested by integration testing
  • Have not tested

Types of changes

  • Bug fix
  • New functionallity
  • Documentation
  • Refactoring
  • CI

Signed-off-by: Ed Warnicke <hagbard@gmail.com>
@edwarnicke
Copy link
Member Author

@Bolodya1997 First: an apology. You were correct that we needed to do something about 'outside requests' and how to handle them. You'd thought of it in terms of 'diffing' things. I thought about it in c076175 as 'merging'. Either way, something needed to be done. We've got a place holder in c076175 to hash out the details (if any) beyond what's there.

I pulled forward the chain/nsmgr/heal_test.go tests. Currently all are passing except for TestNSMGR_HealForwarder. I believe this is because the interpose chain element gets stuck on trying to reach the same old Forwarder, even if its gone. I haven't fully gotten to the bottom of it, and that might be a good place for you to start @Bolodya1997

@edwarnicke
Copy link
Member Author

@Bolodya1997 Our issue appears to be that we are timing out the context here:

if err != nil {
logger.Errorf("failed to request cross NSE %v err: %v", crossNSEURL, err)
return true

@edwarnicke
Copy link
Member Author

@Bolodya1997 Figured out two things contributing to the TestNSMGR_HealForwarder failure:

  1. We are using in sandbox.DialOptions() we are using

grpc.WithBlock(),

grpc.WaitForReady(true),

If we don't use grpc.WithBlock() and grpc.WaitForReady() ... not only do the tests pass (with the note below) but the tests also run 5-6x faster.

Why are we using these options when they slow things down a lot?

  1. We are looking for unique connections on heal... but the heal is instead cleanly healing the existing connection:
    require.Eventually(t, checkSecondRequestsReceived(counter.UniqueRequests), timeout, tick)
    require.Equal(t, 2, counter.UniqueRequests())
    closes := counter.UniqueCloses()

require.Equal(t, 2, counter.UniqueRequests())
require.Equal(t, closes+1, counter.UniqueCloses())

If we change those to checking for Requests() and Closes() with the correct expectations... that helps.

Signed-off-by: Ed Warnicke <hagbard@gmail.com>
Signed-off-by: Ed Warnicke <hagbard@gmail.com>
Signed-off-by: Ed Warnicke <hagbard@gmail.com>
Signed-off-by: Ed Warnicke <hagbard@gmail.com>
TestNSMGR_HealForwarder is still failing.  It is probably failing
because the interpose chain element does not reselect Forwarders
if one does not become available.

Signed-off-by: Ed Warnicke <hagbard@gmail.com>
… grpc.WithBlock() from sandbox.DialOptions()

Signed-off-by: Ed Warnicke <hagbard@gmail.com>
@edwarnicke
Copy link
Member Author

@glazychev-art Moving the conversation back here about the fact heal as currently written retries only once.

Let's start by agreeing that 'heal tries exactly once' is wrong, and we just need to figure out the most elegant way of getting heal to retry until it succeeds (or is Closed from the outside).

I tend to think of this in terms of events, in the current implementation:

  1. heal receives a 'DOWN' event
  2. heal fires a begin.FromContext(ctx).Request(begin.WithReselect()) event which
    a. grabs the executor
    b. Calls Close down the line
    c. Fixes up the Request.Connection to reselect
    d. Calls Request down the line
    e. Releases the executor

Important thing to remember: #2b will close down the monitor eventLoop... so no new events will be processed.

If #2d returns an error, we silently fail (not good).

The reason I initially chose to have the begin.FromContext(ctx).Request(begin.WithReselect()) incorporate both the Close and heal Request in an atomic (to the executor) operation. I'm very open to rethinking how we handle that event.

The naive first thought would be to loop over resending the begin.FromContext(ctx).Request(begin.WithReselect()) event... but the question then becomes "What stops the loop?". It can't be the eventLoop context, as Close cancels that. You also have to think through leaving room for the possiblity of a Close from the outside, which should also terminate the heal.

@glazychev-art Thoughts?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant