Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

remove RX as the primary pipe for gRPC communications #8281

Merged
merged 25 commits into from
Sep 2, 2022

Conversation

mgravell
Copy link
Contributor

@mgravell mgravell commented Apr 4, 2022

Issue describing the changes in this PR

resolves #8280; performance locally is a 5x+ speedup in throughput

the main crux of the new flow is described in src/WebJobs.Script.Grpc/Eventing/GrpcEventExtensions.cs, but in short: we establish a per-worker Channel<InboundGrpcEvent> and Channel<OutboundGrpcEvent> to handle the backlog; instead of publishing to RX, we publish to the writer of the relevant queue. Separately, we have an async worker deque data from the queues, and process accordingly. This is much more direct, avoids a lot of RX machinery, and creates isolation between workers.

Pull request checklist

  • My changes do not require documentation changes
    • Otherwise: Documentation issue linked to PR
  • My changes should not be added to the release notes for the next release
    • Otherwise: I've added my notes to release_notes.md
  • My changes do not need to be backported to a previous version
    • Otherwise: Backport tracked by issue/PR #issue_or_pr
  • I have added all required tests (Unit tests, E2E tests)

Additional information

Additional PR information

mgravell and others added 3 commits April 4, 2022 14:47
remove RX as the primary pipe for gRPC communications
1: GrpcWorkerChannelTests span up duplicate channels
2: suppress write failures (async void)
… TestFunctionRpcService and GrpcWorkerChannelTests to match
@fabiocav
Copy link
Member

@mgravell apologies for the delay on this. We'll get eyes on this ASAP. Would it be possible to have this rebased to deal with the conflicts?

Thanks!

@mgravell
Copy link
Contributor Author

mgravell commented Apr 30, 2022 via email

@pragnagopa
Copy link
Member

Tagging @safihamid @paulbatum - fyi

@fabiocav / @brettsam - would be good to get test E2E on private stamp before merging this PR

# Conflicts:
#	src/WebJobs.Script.Grpc/Channel/GrpcWorkerChannel.cs
#	test/WebJobs.Script.Tests/Workers/Rpc/GrpcWorkerChannelTests.cs
#	test/WebJobs.Script.Tests/Workers/Rpc/TestFunctionRpcService.cs
@mgravell
Copy link
Contributor Author

mgravell commented May 6, 2022

@fabiocav I believe that's merged; it looks like some cheese moved around the topic of "function load collections", which made the tests merge a bit messy, but: should work

src/WebJobs.Script.Grpc/Channel/GrpcWorkerChannel.cs Outdated Show resolved Hide resolved
src/WebJobs.Script.Grpc/Channel/GrpcWorkerChannel.cs Outdated Show resolved Hide resolved
src/WebJobs.Script.Grpc/Channel/GrpcWorkerChannel.cs Outdated Show resolved Hide resolved
// advertise that we support multiple streams, and hint at a number; with this flag, we allow
// clients to connect multiple back-hauls *with the same workerid*, and rely on the internal
// plumbing to make sure we don't process everything N times
initRequest.Capabilities.Add(RpcWorkerConstants.MultiStream, "10"); // TODO: make this configurable
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we make the number ("10") a readonly constant for now?

Is the TODO for this PR or future work? If future work, can we create a github issue for this and mention it here?

src/WebJobs.Script.Grpc/Channel/GrpcWorkerChannel.cs Outdated Show resolved Hide resolved
src/WebJobs.Script.Grpc/Eventing/IGrpcEventManager.cs Outdated Show resolved Hide resolved
Copy link
Member

@brettsam brettsam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I only have very general questions here. I think we should get this run on a private stamp to make sure we have no cold start regressions.

I also want to take one more pass through to think through things like restarts and failures -- this is tricky!

@liliankasem
Copy link
Member

@mgravell anything we can do to help you get this PR merged?

@fabiocav
Copy link
Member

@mgravell indeed, this is high priority on our side, so whatever we can do to help, please let us know.

@NickCraver
Copy link
Member

I looked at this earlier on a call - it looks like the main pending item is to test regressions in cold start and not code tidy here as the blocker, or am I missing a thread where that's already in play? Happy to help on the code side.

@liliankasem
Copy link
Member

liliankasem commented Jun 16, 2022

I looked at this earlier on a call - it looks like the main pending item is to test regressions in cold start and not code tidy here as the blocker, or am I missing a thread where that's already in play? Happy to help on the code side.

I think it's a mix of both, would be great to get PR feedback addressed but not a huge blocker. @safihamid and @pragnagopa can you help with cold start testing?

@safihamid
Copy link
Contributor

I looked at this earlier on a call - it looks like the main pending item is to test regressions in cold start and not code tidy here as the blocker, or am I missing a thread where that's already in play? Happy to help on the code side.

I think it's a mix of both, would be great to get PR feedback addressed but not a huge blocker. @safihamid and @pragnagopa can you help with cold start testing?

Ok I will get cold start comparison numbers for this by end of next week. I hope most PR feedbacks are addressed by then.

mgravell and others added 5 commits June 20, 2022 14:19
Co-authored-by: Lilian Kasem <likasem@microsoft.com>
Co-authored-by: Lilian Kasem <likasem@microsoft.com>
INCOMPLETE - still 2 new failing tests

# Conflicts:
#	src/WebJobs.Script.Grpc/Channel/GrpcWorkerChannel.cs
#	test/WebJobs.Script.Tests/Workers/Rpc/GrpcWorkerChannelTests.cs
@mgravell
Copy link
Contributor Author

Merge is in progress; fighting SendInvocationRequest_ValidateTraceContext and SendInvocationRequest_ValidateTraceContext_SessionId currently

@safihamid
Copy link
Contributor

No noticeable cold start changes with this PR in my private stamp as long as we are using the right language worker during placeholder mode. i.e. use Node placeholder for Node app, ...

@NickCraver
Copy link
Member

Most failures here are due to more merges since the PR - 1 left I couldn't get through tonight:

[xUnit.net 00:03:48.32]     Microsoft.Azure.WebJobs.Script.Tests.WorkerConcurrencyManagerEndToEndTests.WorkerStatus_NewWorkerAdded [FAIL]
  Failed Microsoft.Azure.WebJobs.Script.Tests.WorkerConcurrencyManagerEndToEndTests.WorkerStatus_NewWorkerAdded [2 m]
  Error Message:
   System.ApplicationException : Condition not reached within timeout.
  Stack Trace:
     at Microsoft.Azure.WebJobs.Script.Tests.TestHelpers.Await(Func`1 condition, Int32 timeout, Int32 pollingInterval, Boolean throwWhenDebugging, Func`1 userMessageCallback) in /_/test/WebJobs.Script.Tests.Shared/TestHelpers.cs:line 115
   at Microsoft.Azure.WebJobs.Script.Tests.WorkerConcurrencyManagerEndToEndTests.WorkerStatus_NewWorkerAdded() in /_/test/WebJobs.Script.Tests.Integration/Rpc/WorkerConcurrencyManagerEndToEndTests.cs:line 28
--- End of stack trace from previous location ---
fail: Host.General[515]
      A host error has occurred during startup operation 'fe3a9a1d-e32b-4243-9d9f-2e2d3bcbbfe3'.
      System.ArgumentNullException: Value cannot be null. (Parameter 'provider')
         at Microsoft.Extensions.DependencyInjection.ServiceProviderServiceExtensions.GetService[T](IServiceProvider provider)
         at Microsoft.Azure.WebJobs.Script.ScriptHostBuilderExtensions.<>c__DisplayClass6_0.<AddScriptHost>b__2(HostBuilderContext context, IConfigurationBuilder configBuilder) in /_/src/WebJobs.Script/ScriptHostBuilderExtensions.cs:line 114
         at Microsoft.Extensions.Hosting.HostBuilder.BuildAppConfiguration()
         at Microsoft.Extensions.Hosting.HostBuilder.Build()
         at Microsoft.Azure.WebJobs.Script.WebHost.DefaultScriptHostBuilder.BuildHost(Boolean skipHostStartup, Boolean skipHostConfigurationParsing) in /_/src/WebJobs.Script.WebHost/DefaultScriptHostBuilder.cs:line 59
         at Microsoft.Azure.WebJobs.Script.WebHost.WebJobsScriptHostService.BuildHost(Boolean skipHostStartup, Boolean skipHostJsonConfiguration) in /_/src/WebJobs.Script.WebHost/WebJobsScriptHostService.cs:line 581
         at Microsoft.Azure.WebJobs.Script.WebHost.WebJobsScriptHostService.UnsynchronizedStartHostAsync(ScriptHostStartupOperation activeOperation, Int32 attemptCount, JobHostStartupMode startupMode) in /_/src/WebJobs.Script.WebHost/WebJobsScriptHostService.cs:line 276

I'm not setup for end-to-end on laptop so can't debug readily but something is amiss there with services - will poke more tomorrow.

NickCraver and others added 2 commits July 26, 2022 22:28
… new system

Here we're simulating latency in the outbound pipe was it was intended to when everything was posting through the script manager through RX. Also makes the latency non-static and more test friendly.
Copy link
Member

@liliankasem liliankasem left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just noticed there's a TODO here, should probably resolve before merge! (+ any other TODOs I may have missed). If not addressed, you should create an issue for any items we're not handling in this PR

src/WebJobs.Script.Grpc/Channel/GrpcWorkerChannel.cs Outdated Show resolved Hide resolved
@NickCraver
Copy link
Member

Gentle reminder we should try and get this in before more conflicts pile up against the old script manager approach.

@brettsam
Copy link
Member

brettsam commented Sep 2, 2022

Let's get it in -- no objections from me. Although I do see conflicts again...

@NickCraver
Copy link
Member

@brettsam merged once more from dev and addressed one test race (existing issue but hey, improvements!)

@fabiocav
Copy link
Member

fabiocav commented Sep 2, 2022

Thanks, @NickCraver.

Good to have this merged once conflicts are resolved.

@NickCraver
Copy link
Member

@fabiocav I can't see merge conflicts local or PR UX, any idea what it's conflicting with?

@fabiocav
Copy link
Member

fabiocav commented Sep 2, 2022

@NickCraver odd, I don't see the usual list, just the conflict warning in GH.

If you can merge locally, go for it!

@NickCraver NickCraver merged commit c112509 into dev Sep 2, 2022
@NickCraver NickCraver deleted the mgravell/remove-rx branch September 2, 2022 19:12
@NickCraver
Copy link
Member

I beat the GitHub UI!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Proposal: remove RX from the primary gRPC message pipe
8 participants