Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace health checking with server-capability getter and disable retries on internal errors #706

Merged
merged 7 commits into from
Mar 1, 2022

Conversation

cretz
Copy link
Member

@cretz cretz commented Jan 24, 2022

What was changed

  • Remove health checking
  • On every connection, call WorkflowService.GetServerInfo to eagerly obtain the capabilities of the server (or ignore if call is not there, it was added in New GetSystemInfo RPC temporal#2309)
  • If the capabilities say we don't need to retry internal errors, don't

Why?

We need to know from the server what it supports. Specifically in this case, we need to stop retrying internal errors if the server says it differentiates between them.

Checklist

  1. Closes Implement server capabilities and replace health check #701

⚠️ WARNING - Breaking API Change

This makes the following backwards incompatible changes:

  • Removes client.ConnectionOptions.DisableHealthCheck, client.ConnectionOptions.HealthCheckAttemptTimeout, and client.ConnectionOptions.HealthCheckTimeout. This means users of those options will have to change code on upgrade.
    • Since on-new-client health checking (default performed, but opt-out via DisableHealthCheck) is replaced by obtaining these capabilities, we removed the configuration options.
    • We now offer no option to disable eager connection to the server. This was chosen as the better option to allowing callers to disable our ability to obtain server capabilities.
      • This will require reworking for anyone using DisableHealthCheck (which includes our own server), to not create the client until the server is running.
      • Arguably one should not use a client that cannot connect to a server, and connecting later instead of fail-fast makes troubleshooting tougher.
    • It was decided to remove the options instead of deprecate them since they wouldn't be deprecated, but rather no longer used.
  • Removes health checking from NewNamespaceClient.

The removal of health checking and the removal of eager-connect opt-out was discussed internally with @bergundy, @alexshtin, and @yiminc. We can of course take other approaches if there are concerns here.

@mfateev and @samarabbas - Please review.

client/client.go Outdated Show resolved Hide resolved
@cretz
Copy link
Member Author

cretz commented Jan 24, 2022

Note, I am having CI failures I am struggling to replicate locally. Pardon the repeated forced pushes.

@cretz
Copy link
Member Author

cretz commented Jan 25, 2022

After discussion here, we can't reasonably require a namespace to be ready for cross-namespace clients. We are going to have to remove namespace from this call probably. I will update with more info after discussion.

@cretz
Copy link
Member Author

cretz commented Jan 27, 2022

I will be waiting for after next server release and after next SDK release before coming back to this

…info

# Conflicts:
#	internal/grpc_dialer_test.go
@cretz
Copy link
Member Author

cretz commented Feb 25, 2022

This has now been updated to remove namespace from the call and is ready for review/inclusion.


// HealthCheckTimeout defines how long client should be sending health check requests to the server before concluding
// that it is unavailable. Defaults to 10s, once this timeout is reached error will be propagated to the client.
HealthCheckTimeout time.Duration
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with not exposing GetSystemInfoTimeout here until it's requested.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Implement server capabilities and replace health check
3 participants