Make client Dial context controlable by callers #1416
Conversation
@Quinn-With-Two-Ns - besides the comment I made, I think this is a good thing to do. Thoughts?
internal/internal_workflow_client.go
Outdated
// We set a default timeout if none is set in the context to stay
// backward compatible with the old behavior where the timeout was
// hardcoded to 5s.
Hrmm, I am a bit concerned about removing the 5s timeout just because the context has a deadline. This call should really complete quickly. Just because I have an outer context governing lots of things I may be doing, with a timeout of, say, 10m, doesn't mean I should have to wait 10m for this specific call to fail. I think we should add the timeout on top of the given context no matter what (i.e. the min of the user deadline and 5s, still giving the user cancelability).
IIRC the use case bringing about the PR was for human interaction in a gRPC interceptor, but I am not sure it's a fair default that we should allow this call to start taking as long as the outer context's deadline. Is there any way you can solve your use case without removing this timeout for everyone that just happens to have a deadline on their context?
Yeah this is not a backwards compatible change in its current form
IIRC the use case bringing about the PR was for human interaction in a gRPC interceptor, but I am not sure it's a fair default that we should allow this call to start taking as long as the outer context's deadline. Is there any way you can solve your use case without removing this timeout for everyone that just happens to have a deadline on their context?
Not anymore. Human interaction in the interceptor is worked around by performing the interaction outside the scope of the gRPC request. What I am trying to address here is slow/spotty networks.
Some people operate very far from Temporal servers (basically the other side of the globe), some are working with ISPs with slow DNS resolution times, working from bad wifi connections in airports, hotels, ... working over slow VPNs, working in transit from trains over a spotty network, relying on slow 3G networks, ... Usually those are people on call trying to respond to incidents and willing to run some automations in Temporal, for instance. Hitting a 5s timeout when you combine a couple of those situations and a bit of network retries is possible.
A couple of years ago, there was wifi network congestion in some offices for weeks, also impacting latencies, and people used to extend their tctl --context_timeout flag, which I believe controlled the overall RPC timeout before loading server capabilities became a thing. That kind of flexibility was nice but we kinda lost it.
Sadly, we don't instrument developers' laptops; I don't have metrics giving us the P99 dial latency for all our clients, nor traces to pinpoint where time is spent. This is based on a couple of user reports that despite retries still hit the timeout.
The 5s timeout works 99% of the time, but maybe 30s is a more appropriate default timeout in our situation. As you don't control the environments clients run in, having something customizable makes sense to me. Otherwise, your only options when facing timeouts are to work around the problem by ssh'ing into a node closer to the server, fiddling with auth, ... or trying to find a better network.
If we take that stance, whether we cap the overall dial timeout or every single RPC timeout triggered by the client dial call is a tradeoff: expose a simple imperfect configuration option, or complex low-level ones that compound each time a new RPC call is added in the middle of the dial call.
Sounds like you may need the ability to control the get-system-info timeout more directly instead of changing everyone's default of 5s to their outer context. Many use large contexts with large deadlines but definitely don't have an initial connection/bootstrapping timeout expectation of that (though if this were a newly designed SDK we could have that expectation, but this is existing behavior). Maybe we can have a client option for this.
In general if any call like this takes longer than 5s you're going to have a subpar Temporal experience. We have task timeouts and such on pollers and Temporal in general is not built with the expectation that very simple calls to the server can take many seconds. You're going to hit task timeouts and delays everywhere with an over-5s latency.
Yeah it seems like what we need is the option for users to explicitly set the timeout of get-system-info.
Maybe we can have a client option for this.
Yeah it seems like what we need is the option for users to explicitly set the timeout of get-system-info.
Sure thing, I can revert back to that!
In general if any call like this takes longer than 5s you're going to have a subpar Temporal experience. We have task timeouts and such on pollers and Temporal in general is not built with the expectation that very simple calls to the server can take many seconds. You're going to hit task timeouts and delays everywhere with an over-5s latency.
For sure! The issue is purely for CLI tools triggering workflows asynchronously from engineers laptops, not for workers processing tasks. Workers are in the cloud doing their job in a timely manner.
internal/internal_workflow_client.go
Outdated
// Get capabilities, lazily fetching from server if not already obtained.
-func (wc *WorkflowClient) loadCapabilities() (*workflowservice.GetSystemInfoResponse_Capabilities, error) {
+func (wc *WorkflowClient) loadCapabilities(ctx context.Context, opts ...loadCapabilitiesOption) (*workflowservice.GetSystemInfoResponse_Capabilities, error) {
I think you're overcomplicating a non-exported function with this new option concept. Just pass in the timeout as a param, change the global val name to defaultGetSystemInfoTimeout, and use the default if the passed-in one is 0.
This looks good to me. Will let @Quinn-With-Two-Ns weigh in.
Not 100% sure why CI complains. Locally for the check job I get
Can you merge to latest master and run the
Force-pushed from 3971a98 to 36f9b3f
Very interesting CI setup (I mean merging latest main before running checks). Just rebased and fixed the test file.
When clients use `Dial` or `NewClientFromExisting` APIs to start, they connect to the server and try to load capabilities synchronously. That being said, they have no control over the timeouts and cancellation of such requests. They can't extend timeouts to accommodate slow networks or appropriately cancel those calls. This commit exposes new `DialContext` and `NewClientFromExistingWithContext` APIs that take a context as input and propagates it down the stack until all synchronous server calls so timeout and cancellation can be controlled by the callers.
Force-pushed from 36f9b3f to 6430896
Merged, thanks!
What was changed
This PR exposes new DialContext and NewClientFromExistingWithContext APIs that take a context as input and propagate it down the stack to all synchronous server calls, so timeout and cancellation can be controlled by the callers.
The PR also replaces the grpc.Dial call with grpc.DialContext and propagates the user-defined context down to the gRPC dial call properly. Doing so means that when users use the grpc.WithBlock() dial option, they will be in control of the dial timeout.
Why?
When clients use Dial or NewClientFromExisting APIs to start, they connect to the server and try to load capabilities synchronously. However, clients have no control over the timeouts and cancellation of such requests. They can't extend timeouts to accommodate slow networks or appropriately cancel those calls.
Checklist
Closes
How was this tested:
I expect no behavior change. Timeouts are kept the same unless you use the new APIs and define proper context deadlines.
Godoc updates might be enough?
I noticed that the gRPC client uses the Dial API
sdk-go/internal/grpc_dialer.go
Line 144 in efabf46
and not the DialContext one. However, it seems that users have the possibility to set arbitrary grpc.DialOption values via the ConnectionOptions
sdk-go/internal/client.go
Lines 517 to 525 in efabf46
which includes the grpc.WithBlock() one. When that's set, it's probably better to use DialContext with a context having a proper timeout value set. I am not sure which direction maintainers would like to go: propagate a context and switch over to DialContext, or stay with Dial and not promote usage of the grpc.WithBlock() option, as recommended upstream.