
Disable gRPC keepalive dial option on default #238

Closed
wnxn opened this issue Nov 9, 2019 · 9 comments · Fixed by #239
wnxn commented Nov 9, 2019

What happened:

  1. When I use csi-sanity to test my CSI driver over UDS, the gRPC transport is closed during the DeleteSnapshot test case and the sanity test reports that test case as failed.
------------------------------
• Failure [80.008 seconds]
DeleteSnapshot [Controller Server]
/Users/wangxin/mygo/src/github.com/kubernetes-csi/csi-test/pkg/sanity/tests.go:44
  should return appropriate values (no optional values added) [It]
  /Users/wangxin/mygo/src/github.com/kubernetes-csi/csi-test/pkg/sanity/controller.go:1909

  Unexpected error:
      <*status.statusError | 0xc00001e2d0>: {
          Code: 14,
          Message: "transport is closing",
          Details: nil,
          XXX_NoUnkeyedLiteral: {},
          XXX_unrecognized: nil,
          XXX_sizecache: 0,
      }
      rpc error: code = Unavailable desc = transport is closing
  occurred

  /Users/wangxin/mygo/src/github.com/kubernetes-csi/csi-test/pkg/sanity/controller.go:1927
------------------------------
  2. After the problem shown above occurs, the subsequent test cases fail with TransientFailure.
  Unexpected error:
      <*status.statusError | 0xc00001e820>: {
          Code: 14,
          Message: "all SubConns are in TransientFailure, latest connection error: <nil>",
          Details: nil,
          XXX_NoUnkeyedLiteral: {},
          XXX_unrecognized: nil,
          XXX_sizecache: 0,
      }
      rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: <nil>
  occurred

What you expected to happen:

For the first problem, the test should not report that the transport is closing.
For the second problem, the transport-closing problem should not affect subsequent test cases.

How to reproduce it (as minimally and precisely as possible):

  1. Add time.Sleep(100*time.Second) in ControllerDeleteSnapshot, just before the existing call shown below (see the sketch after this list).

    s.snapshots.Delete(i)

  2. To save time, we could use FIt to run a focused test case.

    It("should return appropriate values (no optional values added)", func() {

  3. Run ./hack/e2e.sh
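
For reference, a minimal sketch of the change from step 1, assuming the mock driver's snapshot-deletion handler calls s.snapshots.Delete(i) roughly as quoted above (the surrounding lines are an assumption, not a verbatim excerpt):

    // Simulate a storage backend that needs ~100 seconds before the snapshot
    // can actually be removed, so the client's keepalive pings pile up.
    time.Sleep(100 * time.Second) // added for the reproducer

    s.snapshots.Delete(i)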

Output:

...
gRPCCall: {"Method":"/csi.v1.Controller/DeleteVolume","Request":{"volume_id":"14"},"Response":{},"Error":""}
STEP: reusing connection to CSI driver at /tmp/e2e-csi-sanity.sock
STEP: creating mount and staging directories
STEP: creating a volume
STEP: creating a snapshot
STEP: creating a volume from source snapshot
STEP: cleaning up deleting the volume created from snapshot
STEP: cleaning up deleting the snapshot

------------------------------
• Failure [79.925 seconds]
Controller Service [Controller Server]
/root/mygo/src/github.com/kubernetes-csi/csi-test/v3/pkg/sanity/tests.go:44
  CreateVolume
  /root/mygo/src/github.com/kubernetes-csi/csi-test/v3/pkg/sanity/controller.go:390
    should create volume from an existing source snapshot [It]
    /root/mygo/src/github.com/kubernetes-csi/csi-test/v3/pkg/sanity/controller.go:722

    Unexpected error:
        <*status.statusError | 0xc000357540>: {
            Code: 14,
            Message: "transport is closing",
            Details: nil,
            XXX_NoUnkeyedLiteral: {},
            XXX_unrecognized: nil,
            XXX_sizecache: 0,
        }
        rpc error: code = Unavailable desc = transport is closing
    occurred

    /root/mygo/src/github.com/kubernetes-csi/csi-test/v3/pkg/sanity/controller.go:762
------------------------------
STEP: reusing connection to CSI driver at /tmp/e2e-csi-sanity.sock
STEP: creating mount and staging directories
• Failure in Spec Setup (BeforeEach) [0.001 seconds]
Controller Service [Controller Server]
/root/mygo/src/github.com/kubernetes-csi/csi-test/v3/pkg/sanity/tests.go:44
  CreateVolume
  /root/mygo/src/github.com/kubernetes-csi/csi-test/v3/pkg/sanity/controller.go:390
    should fail when the volume source snapshot is not found [BeforeEach]
    /root/mygo/src/github.com/kubernetes-csi/csi-test/v3/pkg/sanity/controller.go:770

    Unexpected error:
        <*status.statusError | 0xc000357680>: {
            Code: 14,
            Message: "all SubConns are in TransientFailure, latest connection error: <nil>",
            Details: nil,
            XXX_NoUnkeyedLiteral: {},
            XXX_unrecognized: nil,
            XXX_sizecache: 0,
        }
        rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: <nil>
    occurred

    /root/mygo/src/github.com/kubernetes-csi/csi-test/v3/pkg/sanity/controller.go:82
------------------------------

Anything else we need to know?:

After my experiments, I believe it is caused by two things.

  1. My CSI driver's DeleteSnapshot RPC call takes a long time (about 100 seconds) in some particular circumstances. In my opinion, that is still reasonable in an automated test; for example, the storage system must ensure that the leasing info is ready before deleting a snapshot.

  2. gRPC keepalive mechanism.

  • gRPC server side logs
ERROR: 2019/11/09 07:11:46 transport: Got too many pings from the client, closing the connection.
ERROR: 2019/11/09 07:11:46 transport: loopyWriter.run returning. Err: transport: Connection closing
WARNING: 2019/11/09 07:11:46 transport: http2Server.HandleStreams failed to read frame: read unix //root/csi.sock->@: use of closed network connection
  • gRPC client side logs
SSSSSSSSSSSSSSSSSSSSSSSSSSSSSWARNING: 2019/11/09 07:10:26 Adjusting keepalive ping interval to minimum period of 10s
INFO: 2019/11/09 07:10:26 parsed scheme: ""
INFO: 2019/11/09 07:10:26 scheme "" not registered, fallback to default scheme
INFO: 2019/11/09 07:10:26 ccResolverWrapper: sending new addresses to cc: [{/root/csi.sock 0  <nil>}]
INFO: 2019/11/09 07:10:26 ClientConn switching balancer to "pick_first"
INFO: 2019/11/09 07:10:26 pickfirstBalancer: HandleSubConnStateChange: 0xc0003d2010, CONNECTING
INFO: 2019/11/09 07:10:26 pickfirstBalancer: HandleSubConnStateChange: 0xc0003d2010, READY
INFO: 2019/11/09 07:11:46 Client received GoAway with http2.ErrCodeEnhanceYourCalm.
INFO: 2019/11/09 07:11:46 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
INFO: 2019/11/09 07:11:46 pickfirstBalancer: HandleSubConnStateChange: 0xc0003d2010, TRANSIENT_FAILURE
STEP: connecting to CSI driver
STEP: creating mount and staging directories
STEP: creating a volume
STEP: creating a snapshot
STEP: cleaning up deleting the snapshot

In my opinion:
For the first problem, I tried removing the keepalive dial option and the transport no longer closes.

    dialOptions = append(dialOptions,
        grpc.WithKeepaliveParams(keepalive.ClientParameters{PermitWithoutStream: true}))

So we could add a flag that adds keepalive to the dial options and disable it by default.
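
A rough sketch of what that could look like (the enableKeepalive flag is a hypothetical option, not existing csi-test API):

    package main

    import (
        "google.golang.org/grpc"
        "google.golang.org/grpc/keepalive"
    )

    // dialOptions builds the gRPC dial options. Keepalive stays off by default
    // and is only added when the caller explicitly opts in via the hypothetical
    // enableKeepalive flag.
    func dialOptions(enableKeepalive bool) []grpc.DialOption {
        opts := []grpc.DialOption{grpc.WithInsecure()}
        if enableKeepalive {
            opts = append(opts, grpc.WithKeepaliveParams(
                keepalive.ClientParameters{PermitWithoutStream: true}))
        }
        return opts
    }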

For the second problem, we can update the gRPC dependency to the latest released version:
grpc/grpc-go#2669

Environment:

  • OS (e.g: cat /etc/os-release): Ubuntu 16.04
  • csi-test: latest version of master branch

wnxn commented Nov 9, 2019

/assign @pohly


pohly commented Nov 11, 2019

Thanks for the detailed analysis, that helped a lot. The keepalive feature was enabled for testing over TCP because we have some tests in a custom test suite in the PMEM-CSI driver where we intentionally reboot the target system, and gRPC did not detect that without keepalive.

As the comment says:

// This is necessary when connecting via TCP and does not hurt
// when using Unix domain sockets. It ensures that gRPC detects a dead connection
// in a timely manner.

That seemed to work fine for Unix domain sockets as well, but as you pointed out here, that isn't quite the case. Let's change this so that keepalive is not enabled for UDS, okay?


pohly commented Nov 11, 2019

/assign


pohly commented Nov 11, 2019

Let's change this so that keepalive is not enabled for UDS, okay?

Hmm, but then it still fails for TCP when the driver is slow.


wnxn commented Nov 11, 2019

I did not test it over TCP. Did you reproduce it over TCP?


wnxn commented Nov 11, 2019

I wrote a small program to test this issue and just reproduced it over TCP.
https://github.com/wnxn/gostudy/blob/master/pkg/grpc/csi-client/main.go

  • When the driver is slow, it still fails.
# ./main -endpoint "tcp://localhost:8000"
E1111 17:05:28.727709   12508 main.go:46] rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp: address tcp://localhost:8000: too many colons in address"
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x814ac1]

goroutine 1 [running]:
main.DeleteSnapshot(0x9d4040, 0xc0000100a0)
	/root/mygo/src/github.com/wnxn/gostudy/pkg/grpc/csi-client/main.go:49 +0x1e1
main.main()
	/root/mygo/src/github.com/wnxn/gostudy/pkg/grpc/csi-client/main.go:28 +0x239
  • After I remove the keepalive dial option, it succeeds.
# ./main -endpoint "localhost:8000"
I1111 17:14:36.125973   23093 main.go:63] 
I1111 17:14:47.833281   23093 main.go:71] 


pohly commented Nov 11, 2019

I did not test it over TCP. Did you reproduce it over TCP?

Yes. I had to enhance the mock driver first so that it also allows TCP, but then your reproducer also fails in that mode.

Re-reading https://godoc.org/google.golang.org/grpc/keepalive#ClientParameters I found a comment (again) about matching it with the server options. I think when I read that the first time, I interpreted it as "timeout values must match" and assumed that the defaults should be fine on the client, as we don't change those on the server. My second reading made me wonder whether it means that keepalive must be enabled explicitly on the server. But I just tried that and it doesn't fix the issue.
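
For reference, enabling keepalive enforcement explicitly on the server side looks roughly like this (a sketch using the standard grpc-go server options, not csi-test code; as said above, it did not fix the issue here):

    package main

    import (
        "time"

        "google.golang.org/grpc"
        "google.golang.org/grpc/keepalive"
    )

    // newServer builds a gRPC server that tolerates frequent client keepalive
    // pings instead of answering them with ENHANCE_YOUR_CALM ("too many pings").
    func newServer() *grpc.Server {
        return grpc.NewServer(
            grpc.KeepaliveEnforcementPolicy(keepalive.EnforcementPolicy{
                MinTime:             10 * time.Second, // allow pings as often as every 10s
                PermitWithoutStream: true,             // even when no RPC is active
            }),
        )
    }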

At this point I am tempted to just turn the canonical client/server example from gRPC into a reproducer and report this upstream.... done: grpc/grpc-go#3171

As a hot-fix we could make keepalive optional, but I'd prefer to fully understand the problem first.


wnxn commented Nov 11, 2019

Thanks for reporting this upstream. I agree with you that we should fully understand the problem first.
It seems urgent to release v3.0.0. Shall I fix it by making keepalive optional now?


pohly commented Nov 11, 2019

I can do that after #233 gets merged. Otherwise we'll just have merge conflicts.

pohly added a commit to pohly/csi-test that referenced this issue Nov 12, 2019
The gRPC defaults for a client do not work for a gRPC server that also
uses the defaults (grpc/grpc-go#3171) which
caused connection drops when a CSI driver is slow to respond and the
client starts sending keepalive pings too quickly
(kubernetes-csi#238).

The sanity package no longer uses gRPC keepalive. Instead, custom test
suites can enable that via some new configuration options if needed
and when it is certain that the CSI driver under test can handle it.
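
For illustration, an opt-in from a custom test suite might then look roughly like this (the DialOptions field name is an assumption; see the merged PR for the actual option):

    package main

    import (
        "google.golang.org/grpc"
        "google.golang.org/grpc/keepalive"

        "github.com/kubernetes-csi/csi-test/v3/pkg/sanity"
    )

    // Hypothetical usage: a custom test suite that knows its driver can handle
    // keepalive pings turns them back on via extra dial options.
    var config = sanity.Config{
        Address: "/tmp/e2e-csi-sanity.sock",
        DialOptions: []grpc.DialOption{
            grpc.WithKeepaliveParams(keepalive.ClientParameters{PermitWithoutStream: true}),
        },
    }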
pohly mentioned this issue Nov 12, 2019