
Disable gRPC keepalive dial option on default #238

Closed
wnxn opened this issue Nov 9, 2019 · 9 comments · Fixed by #239
wnxn commented Nov 9, 2019

What happened:

  1. When I use csi-sanity to test my CSI driver over UDS, the gRPC transport is closed during the DeleteSnapshot test case and the sanity test reports that test case as failed.
------------------------------
• Failure [80.008 seconds]
DeleteSnapshot [Controller Server]
/Users/wangxin/mygo/src/github.com/kubernetes-csi/csi-test/pkg/sanity/tests.go:44
  should return appropriate values (no optional values added) [It]
  /Users/wangxin/mygo/src/github.com/kubernetes-csi/csi-test/pkg/sanity/controller.go:1909

  Unexpected error:
      <*status.statusError | 0xc00001e2d0>: {
          Code: 14,
          Message: "transport is closing",
          Details: nil,
          XXX_NoUnkeyedLiteral: {},
          XXX_unrecognized: nil,
          XXX_sizecache: 0,
      }
      rpc error: code = Unavailable desc = transport is closing
  occurred

  /Users/wangxin/mygo/src/github.com/kubernetes-csi/csi-test/pkg/sanity/controller.go:1927
------------------------------
  2. After the problem shown above occurs, the subsequent test cases fail with TransientFailure.
  Unexpected error:
      <*status.statusError | 0xc00001e820>: {
          Code: 14,
          Message: "all SubConns are in TransientFailure, latest connection error: <nil>",
          Details: nil,
          XXX_NoUnkeyedLiteral: {},
          XXX_unrecognized: nil,
          XXX_sizecache: 0,
      }
      rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: <nil>
  occurred

What you expected to happen:

For the first problem, the test should not report that the transport is closing.
For the second problem, the transport-closing problem should not affect subsequent test cases.

How to reproduce it (as minimally and precisely as possible):

  1. Add time.Sleep(100*time.Second) in ControllerDeleteSnapshot, just before the existing call shown below (see the sketch after this list).

    s.snapshots.Delete(i)

  2. To save time, we could use FIt to run a focused test case.

    It("should return appropriate values (no optional values added)", func() {

  3. Run ./hack/e2e.sh
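
For reference, a minimal sketch of the change from step 1, assuming the mock driver's snapshot-deletion handler calls s.snapshots.Delete(i) roughly as quoted above (the surrounding lines are an assumption, not a verbatim excerpt):

    // Simulate a storage backend that needs ~100 seconds before the snapshot
    // can actually be removed, so the client's keepalive pings pile up.
    time.Sleep(100 * time.Second) // added for the reproducer

    s.snapshots.Delete(i)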

Output:

...
gRPCCall: {"Method":"/csi.v1.Controller/DeleteVolume","Request":{"volume_id":"14"},"Response":{},"Error":""}
STEP: reusing connection to CSI driver at /tmp/e2e-csi-sanity.sock
STEP: creating mount and staging directories
STEP: creating a volume
STEP: creating a snapshot
STEP: creating a volume from source snapshot
STEP: cleaning up deleting the volume created from snapshot
STEP: cleaning up deleting the snapshot

------------------------------
• Failure [79.925 seconds]
Controller Service [Controller Server]
/root/mygo/src/github.com/kubernetes-csi/csi-test/v3/pkg/sanity/tests.go:44
  CreateVolume
  /root/mygo/src/github.com/kubernetes-csi/csi-test/v3/pkg/sanity/controller.go:390
    should create volume from an existing source snapshot [It]
    /root/mygo/src/github.com/kubernetes-csi/csi-test/v3/pkg/sanity/controller.go:722

    Unexpected error:
        <*status.statusError | 0xc000357540>: {
            Code: 14,
            Message: "transport is closing",
            Details: nil,
            XXX_NoUnkeyedLiteral: {},
            XXX_unrecognized: nil,
            XXX_sizecache: 0,
        }
        rpc error: code = Unavailable desc = transport is closing
    occurred

    /root/mygo/src/github.com/kubernetes-csi/csi-test/v3/pkg/sanity/controller.go:762
------------------------------
STEP: reusing connection to CSI driver at /tmp/e2e-csi-sanity.sock
STEP: creating mount and staging directories
• Failure in Spec Setup (BeforeEach) [0.001 seconds]
Controller Service [Controller Server]
/root/mygo/src/github.com/kubernetes-csi/csi-test/v3/pkg/sanity/tests.go:44
  CreateVolume
  /root/mygo/src/github.com/kubernetes-csi/csi-test/v3/pkg/sanity/controller.go:390
    should fail when the volume source snapshot is not found [BeforeEach]
    /root/mygo/src/github.com/kubernetes-csi/csi-test/v3/pkg/sanity/controller.go:770

    Unexpected error:
        <*status.statusError | 0xc000357680>: {
            Code: 14,
            Message: "all SubConns are in TransientFailure, latest connection error: <nil>",
            Details: nil,
            XXX_NoUnkeyedLiteral: {},
            XXX_unrecognized: nil,
            XXX_sizecache: 0,
        }
        rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: <nil>
    occurred

    /root/mygo/src/github.com/kubernetes-csi/csi-test/v3/pkg/sanity/controller.go:82
------------------------------

Anything else we need to know?:

After my experiments, I believe it is caused by two things.

  1. My CSI driver's DeleteSnapshot RPC call takes a long time (about 100 seconds) in some particular circumstances. In my opinion, that is still reasonable in an automated test; for example, the storage system must ensure that the leasing info is ready before deleting a snapshot.

  2. gRPC keepalive mechanism.

  • gRPC server side logs
ERROR: 2019/11/09 07:11:46 transport: Got too many pings from the client, closing the connection.
ERROR: 2019/11/09 07:11:46 transport: loopyWriter.run returning. Err: transport: Connection closing
WARNING: 2019/11/09 07:11:46 transport: http2Server.HandleStreams failed to read frame: read unix //root/csi.sock->@: use of closed network connection
  • gRPC client side logs
SSSSSSSSSSSSSSSSSSSSSSSSSSSSSWARNING: 2019/11/09 07:10:26 Adjusting keepalive ping interval to minimum period of 10s
INFO: 2019/11/09 07:10:26 parsed scheme: ""
INFO: 2019/11/09 07:10:26 scheme "" not registered, fallback to default scheme
INFO: 2019/11/09 07:10:26 ccResolverWrapper: sending new addresses to cc: [{/root/csi.sock 0  <nil>}]
INFO: 2019/11/09 07:10:26 ClientConn switching balancer to "pick_first"
INFO: 2019/11/09 07:10:26 pickfirstBalancer: HandleSubConnStateChange: 0xc0003d2010, CONNECTING
INFO: 2019/11/09 07:10:26 pickfirstBalancer: HandleSubConnStateChange: 0xc0003d2010, READY
INFO: 2019/11/09 07:11:46 Client received GoAway with http2.ErrCodeEnhanceYourCalm.
INFO: 2019/11/09 07:11:46 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
INFO: 2019/11/09 07:11:46 pickfirstBalancer: HandleSubConnStateChange: 0xc0003d2010, TRANSIENT_FAILURE
STEP: connecting to CSI driver
STEP: creating mount and staging directories
STEP: creating a volume
STEP: creating a snapshot
STEP: cleaning up deleting the snapshot

In my opinion:
For the first problem, I tried removing the keepalive dial option and the transport no longer closes.

    dialOptions = append(dialOptions,
        grpc.WithKeepaliveParams(keepalive.ClientParameters{PermitWithoutStream: true}))

So we could add a flag that adds keepalive to the dial options and disable it by default.
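
A rough sketch of what that could look like (the enableKeepalive flag is a hypothetical option, not existing csi-test API):

    package main

    import (
        "google.golang.org/grpc"
        "google.golang.org/grpc/keepalive"
    )

    // dialOptions builds the gRPC dial options. Keepalive stays off by default
    // and is only added when the caller explicitly opts in via the hypothetical
    // enableKeepalive flag.
    func dialOptions(enableKeepalive bool) []grpc.DialOption {
        opts := []grpc.DialOption{grpc.WithInsecure()}
        if enableKeepalive {
            opts = append(opts, grpc.WithKeepaliveParams(
                keepalive.ClientParameters{PermitWithoutStream: true}))
        }
        return opts
    }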

For the second problem, we can update the gRPC dependency to the latest released version:
grpc/grpc-go#2669

Environment:

  • OS (e.g: cat /etc/os-release): Ubuntu 16.04
  • csi-test: latest version of master branch

wnxn commented Nov 9, 2019

/assign @pohly


pohly commented Nov 11, 2019

Thanks for the detailed analysis, that helped a lot. The keepalive feature was enabled for testing over TCP because we have some tests in a custom test suite in the PMEM-CSI driver where we intentionally reboot the target system, and gRPC did not detect that without keepalive.

As the comment says:

// This is necessary when connecting via TCP and does not hurt
// when using Unix domain sockets. It ensures that gRPC detects a dead connection
// in a timely manner.

That seemed to work fine for Unix domain sockets as well, but as you pointed out here, that isn't quite the case. Let's change this so that keepalive is not enabled for UDS, okay?


pohly commented Nov 11, 2019

/assign


pohly commented Nov 11, 2019

Let's change this so that keepalive is not enabled for UDS, okay?

Hmm, but then it still fails for TCP when the driver is slow.


wnxn commented Nov 11, 2019

I did not test it over TCP. Did you reproduce it over TCP?


wnxn commented Nov 11, 2019

I wrote a small program to test this issue and just reproduced it over TCP.
https://github.com/wnxn/gostudy/blob/master/pkg/grpc/csi-client/main.go

  • When the driver is slow, it still fails.
# ./main -endpoint "tcp://localhost:8000"
E1111 17:05:28.727709   12508 main.go:46] rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp: address tcp://localhost:8000: too many colons in address"
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x814ac1]

goroutine 1 [running]:
main.DeleteSnapshot(0x9d4040, 0xc0000100a0)
	/root/mygo/src/github.com/wnxn/gostudy/pkg/grpc/csi-client/main.go:49 +0x1e1
main.main()
	/root/mygo/src/github.com/wnxn/gostudy/pkg/grpc/csi-client/main.go:28 +0x239
  • After I remove the keepalive dial option, it succeeds.
# ./main -endpoint "localhost:8000"
I1111 17:14:36.125973   23093 main.go:63] 
I1111 17:14:47.833281   23093 main.go:71] 


pohly commented Nov 11, 2019

I did not test it over TCP. Did you reproduce it over TCP?

Yes. I had to enhance the mock driver first so that it also allows TCP, but then your reproducer also fails in that mode.

Re-reading https://godoc.org/google.golang.org/grpc/keepalive#ClientParameters I found a comment (again) about matching it with the server options. I think when I read that the first time, I interpreted it as "timeout values must match" and assumed that the defaults should be fine on the client, as we don't change those on the server. My second reading made me wonder whether it means that keepalive must be enabled explicitly on the server. But I just tried that and it doesn't fix the issue.
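
For reference, enabling keepalive enforcement explicitly on the server side looks roughly like this (a sketch using the standard grpc-go server options, not csi-test code; as said above, it did not fix the issue here):

    package main

    import (
        "time"

        "google.golang.org/grpc"
        "google.golang.org/grpc/keepalive"
    )

    // newServer builds a gRPC server that tolerates frequent client keepalive
    // pings instead of answering them with ENHANCE_YOUR_CALM ("too many pings").
    func newServer() *grpc.Server {
        return grpc.NewServer(
            grpc.KeepaliveEnforcementPolicy(keepalive.EnforcementPolicy{
                MinTime:             10 * time.Second, // allow pings as often as every 10s
                PermitWithoutStream: true,             // even when no RPC is active
            }),
        )
    }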

At this point I am tempted to just turn the canonical client/server example from gRPC into a reproducer and report this upstream.... done: grpc/grpc-go#3171

As a hot-fix we could make keepalive optional, but I'd prefer to fully understand the problem first.


wnxn commented Nov 11, 2019

Thanks for reporting this upstream. I agree with you that we should fully understand the problem first.
It seems urgent to release v3.0.0. Shall I fix it by making keepalive optional now?


pohly commented Nov 11, 2019

I can do that after #233 gets merged. Otherwise we'll just have merge conflicts.

pohly added a commit to pohly/csi-test that referenced this issue Nov 12, 2019
The gRPC defaults for a client do not work for a gRPC server that also
uses the defaults (grpc/grpc-go#3171) which
caused connection drops when a CSI driver is slow to respond and the
client starts sending keepalive pings too quickly
(kubernetes-csi#238).

The sanity package no longer uses gRPC keepalive. Instead, custom test
suites can enable that via some new configuration options if needed
and when it is certain that the CSI driver under test can handle it.
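
For illustration, an opt-in from a custom test suite might then look roughly like this (the DialOptions field name is an assumption; see the merged PR for the actual option):

    package main

    import (
        "google.golang.org/grpc"
        "google.golang.org/grpc/keepalive"

        "github.com/kubernetes-csi/csi-test/v3/pkg/sanity"
    )

    // Hypothetical usage: a custom test suite that knows its driver can handle
    // keepalive pings turns them back on via extra dial options.
    var config = sanity.Config{
        Address: "/tmp/e2e-csi-sanity.sock",
        DialOptions: []grpc.DialOption{
            grpc.WithKeepaliveParams(keepalive.ClientParameters{PermitWithoutStream: true}),
        },
    }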
pohly mentioned this issue Nov 12, 2019