Nexus #1466

Merged
merged 8 commits into from
Jun 19, 2024

Conversation

bergundy
Member

@bergundy bergundy commented May 10, 2024

What was changed

EDIT: Merged #1473 and #1475 into this PR; it now includes the entire Nexus implementation for the SDK.

Added the temporalnexus package and implemented the handler side for Nexus, including registering and dispatching Nexus Operations.

Tests only pass against the server's main branch, so this PR should not be merged until the server is released.
A future PR will complete the Nexus work by allowing Nexus Operations to be invoked from a workflow.

See the proposal for more information.

Also now memoizing worker.Start() to return consistent errors to callers and avoid rerunning the function unnecessarily.

Merge Checklist:

  • Release Server
  • Depend on tagged Nexus SDK

@bergundy bergundy requested a review from a team as a code owner May 10, 2024 21:48
Member

@cretz cretz left a comment

Looks great. Mostly minimal stuff.

Comment on lines +3 to +5
go 1.21

toolchain go1.21.1
Member

Curious, is this just an artifact of your tooling or was this change required?

Member Author

I just ran go get ... and go mod tidy but it may have been required due to the nexus sdk using slog.

Contributor

Just FYI, we should remove this before merging to main; it seems to mess up our CI, which tries to test multiple versions of Go.

go.mod (outdated, resolved)
internal/common/metrics/constants.go (outdated, resolved)
internal/internal_nexus_task_handler.go
// Associate the NexusOperationContext with the context.Context used to invoke operations.
ctx := context.WithValue(context.Background(), nexusOperationContextKey, nctx)

timeoutStr := header.Get(nexus.HeaderRequestTimeout)
Member

Hrmm, I might have expected server to handle timeout and send cancellation. If a handler chooses not to respect timeout, what happens? If it is also handled server side, I think it's best to not also do it here except maybe with some considerable leeway to ensure server's cancellation logic is the one always processed.

Member Author

We don't have sticky execution so there's not a way for the server to send cancelation. The server propagates this from the client request and also has its own context deadline.

I think it's good to cancel work that we know can't complete in time and have the server propagate this timeout.
As for whether the context deadline in the SDK should be shorter/same/longer than the one tracked on the server, fair point, but maybe shorter here is better so the SDK doesn't get a false sense of completion and the metrics we emit can be more accurate.

Member

@cretz cretz May 13, 2024

But what happens in racy cases where SDK and server hit at the same time? If server-side happens first does that look the exact same as if the client-side one hit first and reported this failure back? It's important to have one system be the arbiter of true timeout errors. If you want a just-in-case for the other system, no prob, just make it long enough to never be first, but having two separate systems that race each other to report timeout failure can result in racy inconsistencies.

Member Author

Keep in mind that clients also set deadlines on the request context.
I think giving the most up-to-date and accurate timeout to all of the processes involved in handling this request is preferable. That's how gRPC does it and this is essentially the SDK handling RPCs.

Also note that on context deadline errors we don't respond to the server, we just drop the task so I'm not as concerned with the racy inconsistency you're talking about.

internal/internal_nexus_task_poller.go

// Start the worker.
func (w *nexusWorker) Start() error {
err := verifyNamespaceExist(w.workflowService, w.executionParameters.MetricsHandler, w.executionParameters.Namespace, w.worker.logger)
Member

Unrelated to this PR specifically, but I sure wouldn't mind if this moved up to the aggregate worker instead of being repeated in each worker.

@@ -953,6 +994,14 @@ func (aw *AggregatedWorker) Start() error {
}
proto.Merge(aw.capabilities, capabilities)

return aw.memoizedStart()
Member

I think this whole method should go inside memoized start. No need to repeat stuff above for each call.

Member Author

Works for me, this was something that @Quinn-With-Two-Ns requested, so just confirming he's also okay with that.

Contributor

Users may retry starting their worker if it failed, which is why I requested that we don't memoize it. I think the likelihood is low, but it's very little effort to not memoize it, so why not just avoid the breaking change?

Member Author

I changed it to what @cretz suggested, I'm open to changing back.
I don't have a strong opinion here but slightly prefer memoizing the entire thing because it's easier to reason about.

internal/nexus_operations.go
temporalnexus/operation.go (outdated, resolved)
@bergundy bergundy force-pushed the bergundy/nexus-handler branch from 4085dfa to d097b3a Compare May 13, 2024 17:47
@cretz
Member

cretz commented May 16, 2024

Probably obvious, but let's not merge this until there's a server that works with it

@bergundy bergundy changed the base branch from master to nexus June 19, 2024 23:36
@bergundy bergundy force-pushed the bergundy/nexus-handler branch from 660c124 to 571b49a Compare June 19, 2024 23:38
@bergundy bergundy changed the title from "Nexus Handler" to "Nexus" Jun 19, 2024
@bergundy bergundy merged commit 75fcd25 into temporalio:nexus Jun 19, 2024
3 of 11 checks passed
@bergundy bergundy deleted the bergundy/nexus-handler branch June 19, 2024 23:40
@bergundy
Member Author

Rebased and merged into the nexus branch.
I'll issue a separate PR to merge nexus into main once a server supporting Nexus is released.

bergundy added a commit that referenced this pull request Jul 19, 2024
* Nexus Handler
* Execute nexus operation from a workflow
* Add test environment support for Nexus Operations
@bergundy bergundy mentioned this pull request Jul 19, 2024
bergundy added a commit that referenced this pull request Jul 22, 2024
## What was changed

- Added the `temporalnexus` package and implemented the handler side for Nexus, including registering and dispatching Nexus Operations.
- Added the ability to execute Nexus Operations from a workflow.
- Added basic support for running Nexus Operations in the test environment.
- Added memoizing to `worker.Start()` to return consistent errors to callers and avoid rerunning the function unnecessarily.
- Updated the integration test's dev server to run CLI `0.14.0-nexus.0` which includes server `1.25.0-rc.0`.

See the [proposal](https://github.com/temporalio/proposals/blob/b72c49b0c2278e916265b00a49638006f8fce469/nexus/sdk-go.md) for more information.

Most of this code has been reviewed already in #1466, #1473, and #1475, which are all squashed in the first commit.