Nexus #1466
Conversation
Looks great. Mostly minimal stuff.
```
go 1.21

toolchain go1.21.1
```
Curious, is this just an artifact of your tooling or was this change required?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just ran `go get ...` and `go mod tidy`, but it may have been required due to the nexus SDK using `slog`.
Just FYI, we should remove this before merging to main; it seems to mess up our CI when trying to test multiple versions of Go.
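For context, the diff above amounts to a `go.mod` with a pinned toolchain directive; a minimal sketch (the module path is hypothetical, versions are the ones from the diff):

```
module example.com/mymodule // hypothetical module path

go 1.21

// Added automatically by `go get` / `go mod tidy` on Go 1.21+;
// candidate for removal before merge if it interferes with CI
// that tests multiple Go versions.
toolchain go1.21.1
```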
```go
// Associate the NexusOperationContext with the context.Context used to invoke operations.
ctx := context.WithValue(context.Background(), nexusOperationContextKey, nctx)

timeoutStr := header.Get(nexus.HeaderRequestTimeout)
```
Hrmm, I might have expected server to handle timeout and send cancellation. If a handler chooses not to respect timeout, what happens? If it is also handled server side, I think it's best to not also do it here except maybe with some considerable leeway to ensure server's cancellation logic is the one always processed.
We don't have sticky execution so there's not a way for the server to send cancelation. The server propagates this from the client request and also has its own context deadline.
I think it's good to cancel work that we know can't complete in time and have the server propagate this timeout.
As for whether the context deadline in the SDK should be shorter/same/longer than the one tracked on the server, fair point, but maybe shorter here is better so the SDK doesn't get a false sense of completion and the metrics we emit can be more accurate.
But what happens in racy cases where SDK and server hit at the same time? If server-side happens first does that look the exact same as if the client-side one hit first and reported this failure back? It's important to have one system be the arbiter of true timeout errors. If you want a just-in-case for the other system, no prob, just make it long enough to never be first, but having two separate systems that race each other to report timeout failure can result in racy inconsistencies.
Keep in mind that clients also set deadlines on the request context.
I think giving the most up-to-date and accurate timeout to all of the processes involved in handling this request is preferable. That's how gRPC does it and this is essentially the SDK handling RPCs.
Also note that on context deadline errors we don't respond to the server, we just drop the task so I'm not as concerned with the racy inconsistency you're talking about.
```go
// Start the worker.
func (w *nexusWorker) Start() error {
	err := verifyNamespaceExist(w.workflowService, w.executionParameters.MetricsHandler, w.executionParameters.Namespace, w.worker.logger)
```
Unrelated to this PR specifically, but I sure wouldn't mind if this moved up to the aggregate worker instead of being in each worker type.
internal/internal_worker.go
```
@@ -953,6 +994,14 @@ func (aw *AggregatedWorker) Start() error {
	}
	proto.Merge(aw.capabilities, capabilities)

	return aw.memoizedStart()
```
I think this whole method should go inside memoized start. No need to repeat stuff above for each call.
Works for me, this was something that @Quinn-With-Two-Ns requested, so just confirming he's also okay with that.
Users may retry starting their worker if it failed; that is why I requested we don't memoize it. I think the likelihood is low, but it's very little effort to not memoize it, so why not just avoid the breaking change?
I changed it to what @cretz suggested, I'm open to changing back.
I don't have a strong opinion here but slightly prefer memoizing the entire thing because it's easier to reason about.
Probably obvious, but let's not merge this until there's a server that works with it.
Rebased and merged into the …
## What was changed

- Added the `temporalnexus` package and implemented the handler side for Nexus, including registering and dispatching Nexus Operations.
- Added the ability to execute Nexus Operations from a workflow.
- Added basic support for running Nexus Operations in the test environment.
- Added memoizing to `worker.Start()` to return consistent errors to callers and avoid rerunning the function unnecessarily.
- Updated the integration test's dev server to run CLI `0.14.0-nexus.0`, which includes server `1.25.0-rc.0`.

See the [proposal](https://github.com/temporalio/proposals/blob/b72c49b0c2278e916265b00a49638006f8fce469/nexus/sdk-go.md) for more information.

Most of this code has been reviewed already in #1466, #1473, and #1475, which are all squashed in the first commit.
What was changed

EDIT: Merged #1473 and #1475 into this PR; it now includes the entire Nexus implementation for the SDK.

- Added the `temporalnexus` package and implemented the handler side for Nexus, including registering and dispatching Nexus Operations.
- Tests only pass with server `main`, so this PR should not be merged until the server is released.
- A future PR will complete the Nexus work, allowing invoking Nexus Operations from a workflow.
- Also now memoizing `worker.Start()` to return consistent errors to callers and avoid rerunning the function unnecessarily.

See the proposal for more information.

Merge Checklist: