
Make client Dial context controllable by callers #1416

Merged
5 commits merged into temporalio:master from lminaudier:dial-context on Mar 12, 2024

Conversation

lminaudier (Contributor)

What was changed

This PR exposes new DialContext and NewClientFromExistingWithContext APIs that take a context as input and propagate it down the stack to all synchronous server calls, so timeouts and cancellation can be controlled by the callers.

The PR also replaces the grpc.Dial call with grpc.DialContext and propagates the user-defined context down to the gRPC dial call. Doing so means that when users pass the grpc.WithBlock() dial option, they are in control of the dial timeout.
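
For illustration, a minimal sketch of how a caller might bound client creation with the new API; the address and the 30s value are placeholders, not recommendations:

package main

import (
	"context"
	"log"
	"time"

	"go.temporal.io/sdk/client"
)

func main() {
	// Bound the synchronous part of client creation (including the eager
	// get-system-info call discussed below) to 30s; cancelation of ctx
	// propagates to those server calls as well.
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	c, err := client.DialContext(ctx, client.Options{
		HostPort: "temporal.example.com:7233", // placeholder address
	})
	if err != nil {
		log.Fatalf("dial failed: %v", err)
	}
	defer c.Close()
}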

Why?

When clients use Dial or NewClientFromExisting APIs to start, they connect to the server and try to load capabilities synchronously. However, clients have no control over the timeouts and cancellation of such requests. They can't extend timeouts to accommodate slow networks or appropriately cancel those calls.

Checklist

  1. Closes

  2. How was this tested:

I expect no behavior change. Timeouts are kept the same unless you use the new APIs and define proper context deadlines.

  3. Any docs updates needed?

Godoc updates might be enough?


I noticed that the gRPC client uses the Dial API

return grpc.Dial(params.HostPort, opts...)

and not the DialContext one. However, it seems that users can set arbitrary grpc.DialOption values via the ConnectionOptions

sdk-go/internal/client.go

Lines 517 to 525 in efabf46

// Advanced dial options for gRPC connections. These are applied after the internal default dial options are
// applied. Therefore any dial options here may override internal ones.
//
// For gRPC interceptors, internal interceptors such as error handling, metrics, and retrying are done via
// grpc.WithChainUnaryInterceptor. Therefore to add inner interceptors that are wrapped by those, a
// grpc.WithChainUnaryInterceptor can be added as an option here. To add a single outer interceptor, a
// grpc.WithUnaryInterceptor option can be added since grpc.WithUnaryInterceptor is prepended to chains set with
// grpc.WithChainUnaryInterceptor.
DialOptions []grpc.DialOption

which includes the grpc.WithBlock() one. When that's set, it's probably better to use DialContext with a context that has a proper timeout set. I am not sure which direction maintainers would like to go: propagate a context and switch over to DialContext, or stay with Dial and discourage usage of the grpc.WithBlock() option, as recommended upstream.
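
For concreteness, a rough sketch of the combination under discussion, assuming the public client options mirror the internal field shown above; the address is a placeholder:

import (
	"google.golang.org/grpc"

	"go.temporal.io/sdk/client"
)

// blockingDialOptions builds client options that force a blocking dial.
// Sketch only; placeholder address.
func blockingDialOptions() client.Options {
	return client.Options{
		HostPort: "temporal.example.com:7233",
		ConnectionOptions: client.ConnectionOptions{
			// grpc.WithBlock() makes the dial wait until the connection is
			// established; unless the dial is context-aware, the caller
			// cannot bound how long that wait takes.
			DialOptions: []grpc.DialOption{grpc.WithBlock()},
		},
	}
}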

@cretz (Member) left a comment:

@Quinn-With-Two-Ns - besides the comment I made, I think this is a good thing to do. Thoughts?

Comment on lines 1166 to 1168
// We set a default timeout if none is set in the context to stay
// backward compatible with the old behavior where the timeout was
// hardcoded to 5s.
Member:
Hrmm, I am a bit concerned about removing the 5s timeout just because the context has a deadline. This call should really complete quickly. Just because I have an outer context governing lots of things I may be doing, and have set a timeout of, let's say, 10m, doesn't mean that I should have to wait 10m for this specific call to fail. I think we should add the timeout on top of the given context no matter what (i.e. a min of the user deadline or 5s, still giving the user cancelability).

IIRC the use case bringing about the PR was for human interaction in a gRPC interceptor, but I am not sure it's a fair default that we should allow this call to start taking as long as the outer context's deadline. Is there any way you can solve your use case without removing this timeout for everyone that just happens to have a deadline on their context?
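
A sketch of that min-of-deadlines behavior: context.WithTimeout on the caller's context already produces exactly it, since a derived deadline can never extend the parent's. The svc handle and function name here are assumptions for illustration:

import (
	"context"
	"time"

	"go.temporal.io/api/workflowservice/v1"
)

// getSystemInfoCapped issues the capabilities call with a deadline of
// min(caller's deadline, now+5s); cancellation of ctx still propagates.
// Sketch only: svc stands in for whatever service handle the client holds.
func getSystemInfoCapped(ctx context.Context, svc workflowservice.WorkflowServiceClient) (*workflowservice.GetSystemInfoResponse, error) {
	// context.WithTimeout never extends a parent deadline, so this is
	// exactly the min-of-deadlines behavior described above.
	ctx, cancel := context.WithTimeout(ctx, 5*time.Second)
	defer cancel()
	return svc.GetSystemInfo(ctx, &workflowservice.GetSystemInfoRequest{})
}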

Contributor:

Yeah this is not a backwards compatible change in its current form

Contributor Author:

> IIRC the use case bringing about the PR was for human interaction in a gRPC interceptor, but I am not sure it's a fair default that we should allow this call to start taking as long as the outer context's deadline. Is there any way you can solve your use case without removing this timeout for everyone that just happens to have a deadline on their context?

Not anymore. Human interaction in an interceptor is worked around by performing the interaction outside the scope of the gRPC request. What I am trying to address here is slow/spotty networks.

Some people operate very far from Temporal servers (basically the other side of the globe); some are working with ISPs with slow DNS resolution times, from bad wifi connections in airports and hotels, over slow VPNs, in transit on trains over a spotty network, or on slow 3G. Usually those are people on call trying to respond to incidents and wanting to run some automations in Temporal. Hitting a 5s timeout when you combine a couple of those situations with a bit of network retries is possible.

A couple of years ago, there was wifi network congestion in some offices for weeks, also impacting latencies, and people used to extend their tctl --context_timeout flag, which I believe controlled the overall RPC timeout before loading server capabilities became a thing. That kind of flexibility was nice, but we kinda lost it.

Sadly, we don't instrument developers' laptops; I don't have metrics giving the P99 dial latency for all our clients, nor traces to pinpoint where time is spent. This is based on a couple of user reports that, despite retries, still hit the timeout.

The 5s timeout works 99% of the time, but maybe 30s is a more appropriate default in our situation. Since you don't control the environments clients run in, having something customizable makes sense to me. Otherwise, your only options when facing timeouts are to work around the problem by ssh'ing into a node closer to the server, fiddling with auth, etc., or to try to find a better network.

If we take that stance, whether we cap the overall dial timeout or every single RPC timeout triggered by the client dial call is a tradeoff: expose a simple, imperfect configuration option, or complex low-level ones that compound each time you add a new RPC call in the middle of the dial call.

@cretz (Member), Mar 8, 2024:

Sounds like you may need the ability to control the get-system-info timeout more directly instead of changing everyone's default of 5s to their outer context. Many use large contexts with large deadlines but definitely don't have an initial connection/bootstrapping timeout expectation of that (though if this were a newly designed SDK we could have that expectation, but this is existing behavior). Maybe we can have a client option for this.

In general if any call like this takes longer than 5s you're going to have a subpar Temporal experience. We have task timeouts and such on pollers and Temporal in general is not built with the expectation that very simple calls to the server can take many seconds. You're going to hit task timeouts and delays everywhere with an over-5s latency.

Contributor:

Yeah it seems like what we need is the option for users to explicitly set the timeout of get-system-info.
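
A hypothetical shape for such a knob; the field name below is invented for illustration and is not necessarily what the SDK settled on:

type ConnectionOptions struct {
	// ... existing fields such as DialOptions elided ...

	// GetSystemInfoTimeout would bound the eager capabilities call made
	// during client creation, with zero keeping the backward-compatible
	// 5s default. (Hypothetical field name.)
	GetSystemInfoTimeout time.Duration
}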

@lminaudier (Contributor Author), Mar 8, 2024:

> Maybe we can have a client option for this.

> Yeah it seems like what we need is the option for users to explicitly set the timeout of get-system-info.

Sure thing, I can revert back to that!

> In general if any call like this takes longer than 5s you're going to have a subpar Temporal experience. We have task timeouts and such on pollers and Temporal in general is not built with the expectation that very simple calls to the server can take many seconds. You're going to hit task timeouts and delays everywhere with an over-5s latency.

For sure! The issue is purely for CLI tools triggering workflows asynchronously from engineers' laptops, not for workers processing tasks. Workers are in the cloud doing their job in a timely manner.

 // Get capabilities, lazily fetching from server if not already obtained.
-func (wc *WorkflowClient) loadCapabilities() (*workflowservice.GetSystemInfoResponse_Capabilities, error) {
+func (wc *WorkflowClient) loadCapabilities(ctx context.Context, opts ...loadCapabilitiesOption) (*workflowservice.GetSystemInfoResponse_Capabilities, error) {
Member:

I think you're overcomplicating a non-exported function with this new option concept. Just pass in the timeout as a param, change the global val name to defaultGetSystemInfoTimeout, and use the default if the passed-in one is 0.
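
A minimal sketch of that suggestion as a standalone helper, using only the standard library (context, time); the function name is illustrative, not the SDK's actual code:

const defaultGetSystemInfoTimeout = 5 * time.Second

// contextForGetSystemInfo derives the context for the capabilities call:
// a zero timeout means the backward-compatible 5s default, and a non-zero
// value lets callers on slow networks extend it.
func contextForGetSystemInfo(ctx context.Context, timeout time.Duration) (context.Context, context.CancelFunc) {
	if timeout == 0 {
		timeout = defaultGetSystemInfoTimeout
	}
	return context.WithTimeout(ctx, timeout)
}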

@cretz (Member) left a comment:

This looks good to me. Will let @Quinn-With-Two-Ns weigh in.

@lminaudier lminaudier marked this pull request as ready for review March 11, 2024 14:19
@lminaudier lminaudier requested a review from a team as a code owner March 11, 2024 14:19
@lminaudier (Contributor Author) commented Mar 11, 2024:

Not 100% sure why CI complains. Locally, for the check job, I get:

𝝺 pwd
.../internal/cmd/build

𝝺 go run . check
go: downloading honnef.co/go/tools v0.4.6
go: downloading golang.org/x/tools v0.16.0
2024/03/11 15:27:48 Running /opt/homebrew/bin/go in /Users/loic.minaudier/go/src/github.com/lminaudier/temporal-go-sdk with args [vet ./...]
2024/03/11 15:27:49 Installing github.com/kisielk/errcheck
2024/03/11 15:27:49 Running /Users/loic.minaudier/go/bin/errcheck in /Users/loic.minaudier/go/src/github.com/lminaudier/temporal-go-sdk with args [./...]
2024/03/11 15:27:50 Installing honnef.co/go/tools/cmd/staticcheck
2024/03/11 15:27:51 Running /Users/loic.minaudier/go/bin/staticcheck in /Users/loic.minaudier/go/src/github.com/lminaudier/temporal-go-sdk with args [./...]
2024/03/11 15:27:57 Running /opt/homebrew/bin/go in /Users/loic.minaudier/go/src/github.com/lminaudier/temporal-go-sdk with args [run ./internal/cmd/tools/copyright/licensegen.go --verifyOnly]

@cretz (Member) commented Mar 11, 2024:

Can you merge to latest master and run the check again? Another DialClient-based test was added to that file.

@lminaudier (Contributor Author):

Very interesting CI setup (I mean merging latest main before running checks).

Just rebased and fixed the test file.

@cretz cretz merged commit 9c42221 into temporalio:master Mar 12, 2024
11 checks passed
@cretz (Member) commented Mar 12, 2024:

Merged, thanks!

@lminaudier lminaudier deleted the dial-context branch March 12, 2024 22:06