
Make client Dial context controllable by callers #1416

Merged
5 commits merged into temporalio:master from lminaudier:dial-context on Mar 12, 2024

Conversation

lminaudier (Contributor)

What was changed

This PR exposes new DialContext and NewClientFromExistingWithContext APIs that take a context as input and propagate it down the stack to all synchronous server calls, so timeouts and cancellation can be controlled by the callers.

The PR also replaces the grpc.Dial call with grpc.DialContext and propagates the user-defined context down to the gRPC dial call. Doing so means that when users pass the grpc.WithBlock() dial option, they are in control of the dial timeout.
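
For illustration, a minimal sketch of how a caller might bound client creation with the new API; the address and the 30s value are placeholders, not recommendations:

package main

import (
	"context"
	"log"
	"time"

	"go.temporal.io/sdk/client"
)

func main() {
	// Bound the synchronous part of client creation (including the eager
	// get-system-info call discussed below) to 30s; cancelation of ctx
	// propagates to those server calls as well.
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	c, err := client.DialContext(ctx, client.Options{
		HostPort: "temporal.example.com:7233", // placeholder address
	})
	if err != nil {
		log.Fatalf("dial failed: %v", err)
	}
	defer c.Close()
}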

Why?

When clients use Dial or NewClientFromExisting APIs to start, they connect to the server and try to load capabilities synchronously. However, clients have no control over the timeouts and cancellation of such requests. They can't extend timeouts to accommodate slow networks or appropriately cancel those calls.

Checklist

  1. Closes

  2. How was this tested:

I expect no behavior change. Timeouts are kept the same unless you use the new APIs and define proper context deadlines.

  3. Any docs updates needed?

Godoc updates might be enough?


I noticed that the gRPC client uses the Dial API

return grpc.Dial(params.HostPort, opts...)

and not the DialContext one. However, it seems that users can set arbitrary grpc.DialOption values via the ConnectionOptions

sdk-go/internal/client.go

Lines 517 to 525 in efabf46

// Advanced dial options for gRPC connections. These are applied after the internal default dial options are
// applied. Therefore any dial options here may override internal ones.
//
// For gRPC interceptors, internal interceptors such as error handling, metrics, and retrying are done via
// grpc.WithChainUnaryInterceptor. Therefore to add inner interceptors that are wrapped by those, a
// grpc.WithChainUnaryInterceptor can be added as an option here. To add a single outer interceptor, a
// grpc.WithUnaryInterceptor option can be added since grpc.WithUnaryInterceptor is prepended to chains set with
// grpc.WithChainUnaryInterceptor.
DialOptions []grpc.DialOption

which includes the grpc.WithBlock() one. When that's set, it's probably better to use DialContext with a context that has a proper timeout set. I am not sure which direction maintainers would like to go: propagate a context and switch over to DialContext, or stay with Dial and discourage usage of the grpc.WithBlock() option, as recommended upstream.
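
For concreteness, a rough sketch of the combination under discussion, assuming the public client options mirror the internal field shown above; the address is a placeholder:

import (
	"google.golang.org/grpc"

	"go.temporal.io/sdk/client"
)

// blockingDialOptions builds client options that force a blocking dial.
// Sketch only; placeholder address.
func blockingDialOptions() client.Options {
	return client.Options{
		HostPort: "temporal.example.com:7233",
		ConnectionOptions: client.ConnectionOptions{
			// grpc.WithBlock() makes the dial wait until the connection is
			// established; unless the dial is context-aware, the caller
			// cannot bound how long that wait takes.
			DialOptions: []grpc.DialOption{grpc.WithBlock()},
		},
	}
}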

@cretz (Member) left a comment:

@Quinn-With-Two-Ns - besides the comment I made, I think this is a good thing to do. Thoughts?

Comment on lines 1166 to 1168
// We set a default timeout if none is set in the context to stay
// backward compatible with the old behavior where the timeout was
// hardcoded to 5s.
Member:
Hrmm, I am a bit concerned about removing the 5s timeout just because the context has a deadline. This call should really complete quickly. Just because I have an outer context governing lots of things I may be doing, and have set a timeout of, let's say, 10m, doesn't mean that I should have to wait 10m for this specific call to fail. I think we should add the timeout on top of the given context no matter what (i.e. a min of the user deadline or 5s, still giving the user cancelability).

IIRC the use case bringing about the PR was for human interaction in a gRPC interceptor, but I am not sure it's a fair default that we should allow this call to start taking as long as the outer context's deadline. Is there any way you can solve your use case without removing this timeout for everyone that just happens to have a deadline on their context?
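
A sketch of that min-of-deadlines behavior: context.WithTimeout on the caller's context already produces exactly it, since a derived deadline can never extend the parent's. The svc handle and function name here are assumptions for illustration:

import (
	"context"
	"time"

	"go.temporal.io/api/workflowservice/v1"
)

// getSystemInfoCapped issues the capabilities call with a deadline of
// min(caller's deadline, now+5s); cancellation of ctx still propagates.
// Sketch only: svc stands in for whatever service handle the client holds.
func getSystemInfoCapped(ctx context.Context, svc workflowservice.WorkflowServiceClient) (*workflowservice.GetSystemInfoResponse, error) {
	// context.WithTimeout never extends a parent deadline, so this is
	// exactly the min-of-deadlines behavior described above.
	ctx, cancel := context.WithTimeout(ctx, 5*time.Second)
	defer cancel()
	return svc.GetSystemInfo(ctx, &workflowservice.GetSystemInfoRequest{})
}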

Contributor:

Yeah this is not a backwards compatible change in its current form

Contributor Author:

> IIRC the use case bringing about the PR was for human interaction in a gRPC interceptor, but I am not sure it's a fair default that we should allow this call to start taking as long as the outer context's deadline. Is there any way you can solve your use case without removing this timeout for everyone that just happens to have a deadline on their context?

Not anymore. Human interaction in an interceptor is worked around by performing the interaction outside the scope of the gRPC request. What I am trying to address here is slow/spotty networks.

Some people operate very far from Temporal servers (basically the other side of the globe); some are working with ISPs with slow DNS resolution times, from bad wifi connections in airports and hotels, over slow VPNs, in transit on trains over a spotty network, or on slow 3G. Usually those are people on call trying to respond to incidents and wanting to run some automations in Temporal. Hitting a 5s timeout when you combine a couple of those situations with a bit of network retries is possible.

A couple of years ago, there was wifi network congestion in some offices for weeks, also impacting latencies, and people used to extend their tctl --context_timeout flag, which I believe controlled the overall RPC timeout before loading server capabilities became a thing. That kind of flexibility was nice, but we kinda lost it.

Sadly, we don't instrument developers' laptops; I don't have metrics giving the P99 dial latency for all our clients, nor traces to pinpoint where time is spent. This is based on a couple of user reports that, despite retries, still hit the timeout.

The 5s timeout works 99% of the time, but maybe 30s is a more appropriate default in our situation. Since you don't control the environments clients run in, having something customizable makes sense to me. Otherwise, your only options when facing timeouts are to work around the problem by ssh'ing into a node closer to the server, fiddling with auth, etc., or to try to find a better network.

If we take that stance, whether we cap the overall dial timeout or every single RPC timeout triggered by the client dial call is a tradeoff: expose a simple, imperfect configuration option, or complex low-level ones that compound each time you add a new RPC call in the middle of the dial call.

@cretz (Member), Mar 8, 2024:

Sounds like you may need the ability to control the get-system-info timeout more directly instead of changing everyone's default of 5s to their outer context. Many use large contexts with large deadlines but definitely don't have an initial connection/bootstrapping timeout expectation of that (though if this were a newly designed SDK we could have that expectation, but this is existing behavior). Maybe we can have a client option for this.

In general if any call like this takes longer than 5s you're going to have a subpar Temporal experience. We have task timeouts and such on pollers and Temporal in general is not built with the expectation that very simple calls to the server can take many seconds. You're going to hit task timeouts and delays everywhere with an over-5s latency.

Contributor:

Yeah it seems like what we need is the option for users to explicitly set the timeout of get-system-info.
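
A hypothetical shape for such a knob; the field name below is invented for illustration and is not necessarily what the SDK settled on:

type ConnectionOptions struct {
	// ... existing fields such as DialOptions elided ...

	// GetSystemInfoTimeout would bound the eager capabilities call made
	// during client creation, with zero keeping the backward-compatible
	// 5s default. (Hypothetical field name.)
	GetSystemInfoTimeout time.Duration
}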

@lminaudier (Contributor Author), Mar 8, 2024:

> Maybe we can have a client option for this.

> Yeah it seems like what we need is the option for users to explicitly set the timeout of get-system-info.

Sure thing, I can revert back to that!

> In general if any call like this takes longer than 5s you're going to have a subpar Temporal experience. We have task timeouts and such on pollers and Temporal in general is not built with the expectation that very simple calls to the server can take many seconds. You're going to hit task timeouts and delays everywhere with an over-5s latency.

For sure! The issue is purely for CLI tools triggering workflows asynchronously from engineers' laptops, not for workers processing tasks. Workers are in the cloud doing their job in a timely manner.

 // Get capabilities, lazily fetching from server if not already obtained.
-func (wc *WorkflowClient) loadCapabilities() (*workflowservice.GetSystemInfoResponse_Capabilities, error) {
+func (wc *WorkflowClient) loadCapabilities(ctx context.Context, opts ...loadCapabilitiesOption) (*workflowservice.GetSystemInfoResponse_Capabilities, error) {
Member:

I think you're overcomplicating a non-exported function with this new option concept. Just pass in the timeout as a param, change the global val name to defaultGetSystemInfoTimeout, and use the default if the passed-in one is 0.
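
A minimal sketch of that suggestion as a standalone helper, using only the standard library (context, time); the function name is illustrative, not the SDK's actual code:

const defaultGetSystemInfoTimeout = 5 * time.Second

// contextForGetSystemInfo derives the context for the capabilities call:
// a zero timeout means the backward-compatible 5s default, and a non-zero
// value lets callers on slow networks extend it.
func contextForGetSystemInfo(ctx context.Context, timeout time.Duration) (context.Context, context.CancelFunc) {
	if timeout == 0 {
		timeout = defaultGetSystemInfoTimeout
	}
	return context.WithTimeout(ctx, timeout)
}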

@cretz (Member) left a comment:

This looks good to me. Will let @Quinn-With-Two-Ns weigh in.

@lminaudier lminaudier marked this pull request as ready for review March 11, 2024 14:19
@lminaudier lminaudier requested a review from a team as a code owner March 11, 2024 14:19
@lminaudier (Contributor Author) commented Mar 11, 2024:

Not 100% sure why CI complains. Locally, for the check job, I get:

𝝺 pwd
.../internal/cmd/build

𝝺 go run . check
go: downloading honnef.co/go/tools v0.4.6
go: downloading golang.org/x/tools v0.16.0
2024/03/11 15:27:48 Running /opt/homebrew/bin/go in /Users/loic.minaudier/go/src/github.com/lminaudier/temporal-go-sdk with args [vet ./...]
2024/03/11 15:27:49 Installing github.com/kisielk/errcheck
2024/03/11 15:27:49 Running /Users/loic.minaudier/go/bin/errcheck in /Users/loic.minaudier/go/src/github.com/lminaudier/temporal-go-sdk with args [./...]
2024/03/11 15:27:50 Installing honnef.co/go/tools/cmd/staticcheck
2024/03/11 15:27:51 Running /Users/loic.minaudier/go/bin/staticcheck in /Users/loic.minaudier/go/src/github.com/lminaudier/temporal-go-sdk with args [./...]
2024/03/11 15:27:57 Running /opt/homebrew/bin/go in /Users/loic.minaudier/go/src/github.com/lminaudier/temporal-go-sdk with args [run ./internal/cmd/tools/copyright/licensegen.go --verifyOnly]

@cretz (Member) commented Mar 11, 2024:

Can you merge to latest master and run the check again? Another DialClient-based test was added to that file.

@lminaudier (Contributor Author):

Very interesting CI setup (I mean merging latest main before running checks).

Just rebased and fixed the test file.

@cretz cretz merged commit 9c42221 into temporalio:master Mar 12, 2024
11 checks passed
@cretz (Member) commented Mar 12, 2024:

Merged, thanks!

@lminaudier lminaudier deleted the dial-context branch March 12, 2024 22:06