
Provide custom keepalive ping handler #2166

Closed
gyuho opened this issue Jun 20, 2018 · 5 comments
Labels
Type: Feature New features or improvements in behavior

Comments

gyuho (Contributor) commented Jun 20, 2018

What version of gRPC are you using?

v1.12.2

What version of Go are you using (go version)?

Go 1.10.3

What operating system (Linux, Windows, …) and version?

Linux

What did you do?

Use keepalive.

What did you expect to see?

Customizable keepalive ping handler.

What did you see instead?

Internal to gRPC.


Currently, the server can only configure keepalive durations:

// ServerParameters is used to set keepalive and max-age parameters on the server-side.
type ServerParameters struct {
// MaxConnectionIdle is a duration for the amount of time after which an idle connection would be closed by sending a GoAway.
// Idleness duration is defined since the most recent time the number of outstanding RPCs became zero or the connection establishment.
MaxConnectionIdle time.Duration // The current default value is infinity.

And its keepalive handler is internal to gRPC:

func (t *http2Server) keepalive() {
p := &ping{}
var pingSent bool
maxIdle := time.NewTimer(t.kp.MaxConnectionIdle)
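For context, those durations are the only keepalive knobs a server can set today, passed via grpc.KeepaliveParams; a minimal sketch with illustrative values (not recommendations):

import (
  "time"

  "google.golang.org/grpc"
  "google.golang.org/grpc/keepalive"
)

// newServer configures keepalive durations; the ping handling itself stays
// inside the HTTP/2 transport and is not pluggable.
func newServer() *grpc.Server {
  return grpc.NewServer(grpc.KeepaliveParams(keepalive.ServerParameters{
    MaxConnectionIdle: 5 * time.Minute,  // send GoAway after this much idleness
    Time:              2 * time.Hour,    // ping an inactive client after this long
    Timeout:           20 * time.Second, // close if the ping is not acked in time
  }))
}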

etcd uses this keepalive ping to detect an unavailable server or client for stream RPCs (e.g. the Watch API).

etcd is typically clustered with 3 to 5 nodes. A client watch stream may fail over to other nodes when a node becomes unavailable. However, once a stream is established to a node that then becomes network-partitioned, the watch client gets stuck with the partitioned node.
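For reference, this is roughly how a client enables those transport-level pings today; the values are illustrative, not etcd's actual settings, and they only detect a dead transport, not a node that has lost its leader:

import (
  "time"

  "google.golang.org/grpc"
  "google.golang.org/grpc/keepalive"
)

// dialWithKeepalive is a hypothetical helper: ping the server every 10s even
// when the watch stream is idle, and drop the connection if no ack arrives
// within 3s.
func dialWithKeepalive(target string) (*grpc.ClientConn, error) {
  return grpc.Dial(target,
    grpc.WithInsecure(),
    grpc.WithKeepaliveParams(keepalive.ClientParameters{
      Time:                10 * time.Second,
      Timeout:             3 * time.Second,
      PermitWithoutStream: true,
    }))
}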

Related issues are:

etcd's current workaround is to specify WithRequireLeader in the client context to pass metadata to the server interceptor, but this is optional in etcd. Another workaround is to add a stream progress request (etcd-io/etcd#9869) to check whether the node is up to date.

It would be very useful if etcd could configure the server-side keepalive ping handler to do the following (a hypothetical sketch of such a hook follows the list):

  1. receive the client ping
  2. the receiving node checks whether it has an active leader
  3. if the node has lost the leader, close the client connection
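
To make the request concrete, here is a purely hypothetical sketch of what such a hook could look like; nothing like PingHandler exists in gRPC today, and the field name and shape are invented for illustration:

type ServerParameters struct {
  // ... existing duration fields ...

  // PingHandler (hypothetical) would be invoked by the transport for every
  // keepalive ping received. Returning a non-nil error would close the
  // connection, letting the application veto "liveness".
  PingHandler func() error
}

// etcd could then plug in something like (hasLeader is an etcd-side stand-in):
//   kp.PingHandler = func() error {
//       if !hasLeader() {
//           return errors.New("no leader; close the connection")
//       }
//       return nil
//   }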

Do we have any plan to expose keepalive ping handler in gRPC?

/cc @jpbetz @xiang90

dfawley (Member) commented Jun 21, 2018

@gyuho, no, there are no current plans to do this. Is there a reason you want to perform this check when you get keepalive pings as opposed to alternatives (e.g. polling periodically)? It may be possible to do something similar using a custom handshaker on the server that kills the connection when certain conditions arise, or by making an RPC initiated by the client that takes the place of the keepalive pings. I'm not sure I fully understand your design, however, so this may or may not help.
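
On the "RPC initiated by the client" alternative: one option that already exists in grpc-go is the standard health checking service, which the client can poll instead of relying on transport pings; a rough sketch (the empty service name and wiring are illustrative):

import (
  "context"

  "google.golang.org/grpc"
  "google.golang.org/grpc/health"
  healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

// Server side: flip the serving status when the node loses its leader.
func registerHealth(s *grpc.Server) *health.Server {
  hs := health.NewServer()
  healthpb.RegisterHealthServer(s, hs)
  hs.SetServingStatus("", healthpb.HealthCheckResponse_SERVING)
  return hs // later: hs.SetServingStatus("", healthpb.HealthCheckResponse_NOT_SERVING)
}

// Client side: poll periodically; an error or NOT_SERVING means "try another node".
func isServing(ctx context.Context, cc *grpc.ClientConn) bool {
  resp, err := healthpb.NewHealthClient(cc).Check(ctx, &healthpb.HealthCheckRequest{})
  return err == nil && resp.Status == healthpb.HealthCheckResponse_SERVING
}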

gyuho (Contributor, Author) commented Jun 22, 2018

@dfawley I wasn't clear enough.

Our main use case for keepalive pings is to detect an unresponsive server from the client side for long-running stream RPCs. Keepalive detects disconnects at the transport layer, but there is no way to extend it to cover application-layer conditions (as mentioned above, a partitioned node will still answer transport-level keepalive pings, so the connection is marked active and the client stream stays stuck with the partitioned node).

Is there a reason you want to perform this check when you get keepalive pings as opposed to alternatives (e.g. polling periodically)?

We want to use keepalive since it's light (an 8-byte HTTP/2 ping) and built into the gRPC server. We have a similar use case in the gRPC server interceptor layer, where every RPC goes through a leader check to see whether there's an active leader. I would imagine we could do the same thing for a keepalive handler (the existing interceptor check is roughly the shape sketched below).
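
The interceptor-level leader check referred to here is roughly of this shape; hasLeader is a stand-in for etcd's actual check, not real etcd or gRPC API:

import (
  "context"

  "google.golang.org/grpc"
  "google.golang.org/grpc/codes"
  "google.golang.org/grpc/status"
)

// leaderCheckUnary rejects every RPC while the node has no leader.
func leaderCheckUnary(hasLeader func() bool) grpc.UnaryServerInterceptor {
  return func(ctx context.Context, req interface{}, info *grpc.UnaryServerInfo,
    handler grpc.UnaryHandler) (interface{}, error) {
    if !hasLeader() {
      return nil, status.Error(codes.Unavailable, "no leader")
    }
    return handler(ctx, req)
  }
}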

We haven't tried https://godoc.org/google.golang.org/grpc/credentials#TransportCredentials, but it seems like this is meant for the connection handshake?

Thanks!

dfawley (Member) commented Jun 22, 2018

@gyuho,

If you were to kill the connection, how would the client know which node to reconnect to?

Could you perform this check in a server interceptor and kill the watch stream (instead of the connection itself) when the partitioning happens? Note that your streaming interceptor has full control of the stream and can terminate it early with an error at any time -- not just at the start of the RPC or when traffic happens. E.g. something like this might work:

func interceptor(...) error {
  errChan := make(chan error, 1)
  go func() {
    errChan <- handler(srv, ss)
    close(errChan)
  }()
  select {
  case <-partitioned: // written to by something monitoring for partitioning
    return partitionedErr
  case err := <-errChan:
    return err
  }
}
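
For reference, the elided signature above would be the standard grpc.StreamServerInterceptor form, and the interceptor is registered as a server option; newWatchServer is a hypothetical name, and partitioned/partitionedErr are assumed to come from etcd-side monitoring:

import "google.golang.org/grpc"

func newWatchServer() *grpc.Server {
  // interceptor has the standard signature:
  //   func(srv interface{}, ss grpc.ServerStream, info *grpc.StreamServerInfo,
  //        handler grpc.StreamHandler) error
  return grpc.NewServer(grpc.StreamInterceptor(interceptor))
}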

We haven't tried godoc.org/google.golang.org/grpc/credentials#TransportCredentials but seems like this is for connection handshake?

TransportCredentials is intended for auth, but it could also be used to do anything desired at the connection level. That would be a bit of an abuse, though, so probably not a great idea. Instead, if you really want connection-level control, the Listener you give to gRPC could be wrapped in something that wraps the Conns it hands out to do checks like this (a rough sketch follows).
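
A minimal sketch of that wrapping idea, assuming the application supplies a shouldClose-style check (all names here are illustrative, not gRPC API):

import (
  "net"
  "time"
)

// checkedListener hands out connections that a small watchdog goroutine will
// force-close once an application-level condition (e.g. "lost the leader") trips.
type checkedListener struct {
  net.Listener
  shouldClose func() bool // supplied by the application
}

func (l *checkedListener) Accept() (net.Conn, error) {
  c, err := l.Listener.Accept()
  if err != nil {
    return nil, err
  }
  go func() {
    // Crude polling; a real version would also stop when the conn closes normally.
    for !l.shouldClose() {
      time.Sleep(time.Second)
    }
    c.Close() // gRPC sees the broken transport and the client reconnects elsewhere
  }()
  return c, nil
}

// Usage: grpcServer.Serve(&checkedListener{Listener: lis, shouldClose: noLeader})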

gyuho (Contributor, Author) commented Jun 22, 2018

@dfawley

If you were to kill the connection, how would the client know which node to reconnect to?

Once disconnected, the etcd clientv3 balancer's picker policy will either round-robin or switch to other nodes based on the latest node health status to retry the requests. We just need some indication that this node is not responding.

Could you perform this check in a server interceptor and kill the watch stream (instead of the connection itself) when the partitioning happens?

Yes, and we support that as an option (users can specify require-leader metadata in their contexts).

Instead, if you really wanted connection-level control, the Listener you give to gRPC could be wrapped in something that wraps the Conns it hands out that do checks like this.

I see. We will look into this.

I was just curious whether the gRPC team has any plan to support a custom keepalive handler. If you recommend the server interceptor pattern, we are happy with that for now.

Thanks for the response! Please feel free to close this if there's no plan to support a custom keepalive handler in the near future. We will revisit when we find other use cases.

dfawley (Member) commented Jun 22, 2018

Yes, I think the interceptor pattern would be recommended for this. Keepalives should only be necessary for our internal implementation, so I would like to avoid exporting a hook for them unless there is a strong enough need to justify it (i.e. there's no other/better way to accomplish the same thing). Thanks for the request.

dfawley closed this as completed Jun 22, 2018
lock bot locked as resolved and limited conversation to collaborators Dec 19, 2018