
*: use gRPC server GracefulStop #7743

Merged

merged 6 commits into master from shutdown-grpc-server on Apr 18, 2017
Conversation

gyuho
Contributor

@gyuho gyuho commented Apr 14, 2017

Example output

2017-04-14 09:20:53.599484 I | etcdserver/api: enabled capabilities for version 3.2
^C2017-04-14 09:20:55.974829 N | pkg/osutil: received interrupt signal, shutting down...
2017-04-14 09:20:55.974857 I | etcdserver: skipped leadership transfer for single member cluster
2017-04-14 09:20:55.974875 W | embed: gracefully stopping gRPC server
2017-04-14 09:20:55.974883 W | embed: gracefully stopped gRPC server
2017-04-14 09:20:55.975058 W | embed: server stopped with "grpc: the server has been stopped"
2017-04-14 09:20:55.975482 I | etcdserver/api/v3rpc: grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp 127.0.0.1:2379: getsockopt: connection refused"; Reconnecting to {127.0.0.1:2379 <nil>}
2017-04-14 09:20:55.975503 I | etcdserver/api/v3rpc: grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp 127.0.0.1:2379: getsockopt: connection refused"; Reconnecting to {127.0.0.1:2379 <nil>}
2017-04-14 09:20:55.975523 I | etcdserver/api/v3rpc: grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp 127.0.0.1:2379: getsockopt: connection refused"; Reconnecting to {127.0.0.1:2379 <nil>}
2017-04-14 09:20:55.975540 I | etcdserver/api/v3rpc: grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp 127.0.0.1:2379: getsockopt: connection refused"; Reconnecting to {127.0.0.1:2379 <nil>}
2017-04-14 09:20:55.975559 I | etcdserver/api/v3rpc: grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp 127.0.0.1:2379: getsockopt: connection refused"; Reconnecting to {127.0.0.1:2379 <nil>}
2017-04-14 09:20:55.975598 N | embed: serving insecure client requests on 127.0.0.1:2379, this is strongly discouraged!
2017-04-14 09:20:55.975628 W | embed: server stopped with "accept tcp 127.0.0.1:2379: use of closed network connection"
2017-04-14 09:20:55.975667 W | embed: server stopped with "mux: listener closed"
2017-04-14 09:20:55.975712 I | etcdserver/api/v3rpc: grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp 127.0.0.1:2379: getsockopt: connection refused"; Reconnecting to {127.0.0.1:2379 <nil>}
2017-04-14 09:20:55.975729 I | etcdserver/api/v3rpc: grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp 127.0.0.1:2379: getsockopt: connection refused"; Reconnecting to {127.0.0.1:2379 <nil>}
2017-04-14 09:20:55.975743 I | etcdserver/api/v3rpc: grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp 127.0.0.1:2379: getsockopt: connection refused"; Reconnecting to {127.0.0.1:2379 <nil>}

Fix #7322.
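For reference, a minimal sketch of the grpc-go behavior this change relies on (not the PR's actual code; the listener setup and register callback are placeholders): GracefulStop stops accepting new connections and RPCs and blocks until all pending RPCs finish, which is what lets the storage backend be closed safely afterward.

package sketch

import (
	"net"

	"google.golang.org/grpc"
)

// serveThenDrain is illustrative only; register would install etcd's v3 RPC services.
func serveThenDrain(lis net.Listener, register func(*grpc.Server), stopc <-chan struct{}) {
	gs := grpc.NewServer()
	register(gs)

	donec := make(chan struct{})
	go func() {
		defer close(donec)
		gs.Serve(lis) // returns once GracefulStop has drained pending RPCs
	}()

	<-stopc
	gs.GracefulStop() // unlike Stop, waits for in-flight RPCs to complete
	<-donec
}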

@heyitsanthony
Contributor

This should fix the inflight op crashes, right? Can there be a test?

@gyuho
Contributor Author

gyuho commented Apr 14, 2017

@heyitsanthony Yes, I will try to verify that this fixes that problem by adding tests or reproducing it.

@gyuho gyuho force-pushed the shutdown-grpc-server branch 3 times, most recently from 32d93e3 to 6686634 on April 14, 2017 18:12
@gyuho
Contributor Author

gyuho commented Apr 14, 2017

@heyitsanthony Test added. Confirmed that it fixes the issue (use reqN := 500 in the test and comment out the graceful-stop part; it will then panic in boltdb). PTAL.
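Roughly, the shape of the regression test described above; this is a sketch, with newTestCluster as a hypothetical setup helper (toGRPC and the protobuf types are the ones used elsewhere in this PR), not the test that was actually added.

func TestInflightPutsSurviveGracefulStop(t *testing.T) {
	clus, cli := newTestCluster(t) // hypothetical helper: one-member cluster plus client
	defer clus.Terminate(t)
	kvc := toGRPC(cli).KV

	reqN := 500
	donec := make(chan struct{})
	go func() {
		defer close(donec)
		for i := 0; i < reqN; i++ {
			// errors are expected once the server starts draining
			kvc.Put(context.Background(), &pb.PutRequest{Key: []byte("foo"), Value: []byte("bar")})
		}
	}()

	// stop gracefully while requests are still in flight; replacing this
	// with a hard stop is how the boltdb panic was reproduced
	clus.Members[0].grpcServer.GracefulStop()
	<-donec
}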

@gyuho gyuho removed the WIP label Apr 14, 2017
embed/serve.go Outdated
@@ -52,11 +52,12 @@ type serveCtx struct {

userHandlers map[string]http.Handler
serviceRegister func(*grpc.Server)
stopGRPCc chan func()
Contributor

grpcServers []*grpc.Server

embed/serve.go Outdated
@@ -74,6 +75,12 @@ func (sctx *serveCtx) serve(s *etcdserver.EtcdServer, tlscfg *tls.Config, handle

if sctx.insecure {
gs := v3rpc.Server(s, nil)
sctx.stopGRPCc <- func() {
Contributor

sctx.grpcServers = append(sctx.grpcServers, gs)

embed/serve.go Outdated
@@ -103,6 +110,12 @@ func (sctx *serveCtx) serve(s *etcdserver.EtcdServer, tlscfg *tls.Config, handle

if sctx.secure {
gs := v3rpc.Server(s, tlscfg)
sctx.stopGRPCc <- func() {
Contributor

sctx.grpcServers = append(sctx.grpcServers, gs)

@@ -61,6 +61,12 @@ type ServerConfig struct {
ClientCertAuthEnabled bool

AuthToken string

// OnShutdown gracefully stops gRPC server on shutdown.
Contributor

this should be a generic thing instead of talking about grpc, etc

// OnShutdown is called immediately before releasing etcd server resources.

@@ -75,3 +76,38 @@ func TestV3MaintenanceDefragmentInflightRange(t *testing.T) {

<-donec
}

Contributor

there were some changes to the mvcc code so TestV3MaintenanceHashInflight would appear to work. Namely, TestStoreHashAfterForceCommit and the stopc logic in Hash should probably be removed.

kvc := toGRPC(cli).KV

if _, err := kvc.Put(context.Background(), &pb.PutRequest{Key: []byte("foo"), Value: []byte("bar")}); err != nil {
panic(err)
Contributor

t.Fatal(err)

embed/etcd.go Outdated
@@ -137,6 +139,11 @@ func StartEtcd(inCfg *Config) (e *Etcd, err error) {
if err = e.serve(); err != nil {
return
}
e.Server.Cfg.OnShutdown = func() {
Contributor

OnShutdown = func() {
    for _, sctx := range e.sctxs {
        for _, gs := range sctx.grpcServers {
            gs.GracefulStop()
        }
    }
}

Contributor Author

I think this does not sync with the serve routine? We populate sctx.grpcServers in (e *Etcd) serve(), which calls (sctx *serveCtx) serve to create the *grpc.Server. But (e *Etcd) serve() returns after launching (sctx *serveCtx) serve in goroutines.

Contributor Author

@gyuho gyuho Apr 15, 2017

nvm, we somehow need a way to sync that slice anyway.
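One possible way to synchronize that slice across the serve goroutines, sketched with a mutex; the diff below ends up exposing a grpcServerC channel instead, so this is only illustrative.

package sketch

import (
	"sync"

	"google.golang.org/grpc"
)

type serveCtx struct {
	mu          sync.Mutex
	grpcServers []*grpc.Server
	// other fields elided
}

// addGRPCServer is called from each serve goroutine as it creates a server.
func (sctx *serveCtx) addGRPCServer(gs *grpc.Server) {
	sctx.mu.Lock()
	defer sctx.mu.Unlock()
	sctx.grpcServers = append(sctx.grpcServers, gs)
}

// stopGRPCServers is called once at shutdown.
func (sctx *serveCtx) stopGRPCServers() {
	sctx.mu.Lock()
	defer sctx.mu.Unlock()
	for _, gs := range sctx.grpcServers {
		gs.GracefulStop() // blocks until pending RPCs finish
	}
}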

embed/etcd.go Outdated
@@ -343,6 +350,10 @@ func (e *Etcd) serve() (err error) {
}

func (e *Etcd) errHandler(err error) {
if transport.IsClosedConnError(err) || err == grpc.ErrServerStopped {
Contributor

why is this necessary? shouldn't stopc be closed before calling etcdserver.Stop?

Contributor Author

// in embed/etcd.go
func StartEtcd(inCfg *Config) (e *Etcd, err error) {
	if err = inCfg.Validate(); err != nil {
		return nil, err
	}
	e = &Etcd{cfg: *inCfg, stopc: make(chan struct{})}
	cfg := &e.cfg
	defer func() {
		if e != nil && err != nil {
			e.Close()
			e = nil
		}
	}()

We close stopc by calling e.Close() here, but that deferred Close only runs on error; StartEtcd returns with a nil error in our use case, so it is not called.

// stop accepting new connections, RPCs,
// and blocks until all pending RPCs are finished
if s.Cfg != nil && s.Cfg.OnShutdown != nil {
s.Cfg.OnShutdown()
Contributor

I think it's possible to avoid having this in Cfg entirely-- this function could be called prior to calling HardStop/Stop; similar to how the listeners are closed in embed.Etcd.Close() before calling Server.Stop()

Contributor Author

Ok let me re-organize the code.

@gyuho gyuho force-pushed the shutdown-grpc-server branch 3 times, most recently from f62b98b to 6779d6d on April 15, 2017 01:22
@gyuho gyuho added the WIP label Apr 15, 2017
@gyuho
Contributor Author

gyuho commented Apr 15, 2017

I think we still need GracefulStop in etcdserver because embed.Etcd.Close won't be triggered unless there's an error at the beginning of serving. And etcdserver.EtcdServer.Stop is the handler registered for OS interrupt signals.

// etcdmain/etcd.go

// startEtcd runs StartEtcd in addition to hooks needed for standalone etcd.
func startEtcd(cfg *embed.Config) (<-chan struct{}, <-chan error, error) {
	if cfg.Metrics == "extensive" {
		grpc_prometheus.EnableHandlingTimeHistogram()
	}

	e, err := embed.StartEtcd(cfg)
	if err != nil {
		return nil, nil, err
	}
	osutil.RegisterInterruptHandler(e.Server.Stop)

@heyitsanthony
Contributor

why not have osutil.RegisterInterruptHandler(e.Close)?
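Sketched out, the suggestion amounts to a one-line change in startEtcd (assuming the surrounding code stays as quoted above):

	e, err := embed.StartEtcd(cfg)
	if err != nil {
		return nil, nil, err
	}
	// e.Close drains the gRPC servers and closes the listeners,
	// rather than only stopping the raw etcd server
	osutil.RegisterInterruptHandler(e.Close)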

@codecov-io

codecov-io commented Apr 15, 2017

Codecov Report

❗ No coverage uploaded for pull request base (master@0d52598). Click here to learn what that means.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff            @@
##             master    #7743   +/-   ##
=========================================
  Coverage          ?   75.73%           
=========================================
  Files             ?      331           
  Lines             ?    26058           
  Branches          ?        0           
=========================================
  Hits              ?    19735           
  Misses            ?     4899           
  Partials          ?     1424
Impacted Files Coverage Δ
mvcc/kvstore.go 87.89% <ø> (ø)
embed/etcd.go 67.89% <100%> (ø)
embed/serve.go 74.33% <100%> (ø)
etcdmain/etcd.go 45.49% <100%> (ø)
integration/cluster.go 85.51% <100%> (ø)
mvcc/backend/batch_tx.go 78.57% <100%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0d52598...5000d29. Read the comment docs.

embed/etcd.go Outdated
@@ -147,6 +161,7 @@ func (e *Etcd) Config() Config {

func (e *Etcd) Close() {
e.closeOnce.Do(func() { close(e.stopc) })
e.OnShutdown()
Contributor

can the function be inlined here instead of needing a separate OnShutdown field?
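A sketch of the inlined version being asked about, using the grpcServerC channel that shows up in the diff just below; not necessarily the code that was finally merged.

func (e *Etcd) Close() {
	e.closeOnce.Do(func() { close(e.stopc) })

	// drain gRPC servers before stopping the etcd server; GracefulStop
	// stops accepting new RPCs and blocks until pending ones finish
	for _, sctx := range e.sctxs {
		for gs := range sctx.grpcServerC {
			gs.GracefulStop()
		}
	}
	// ... then close client/peer listeners and stop e.Server as before
}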

embed/etcd.go Outdated
// RPCs, and blocks until all pending RPCs are finished
for _, sctx := range e.sctxs {
for gs := range sctx.grpcServerC {
plog.Warning("gracefully stopping gRPC server")
Contributor

don't warn / print anything? this should be part of the normal shutdown process

}
// t.tx.DB()==nil if 'CommitAndStop' calls 'batchTx.commit(true)',
// which initializes *bolt.Tx.db and *bolt.Tx.meta as nil; panics t.tx.Size().
// Server must make sure 'batchTx.commit(false)' does not follow
Contributor

This probably shouldn't mention the etcd server or gRPC. The contract is independent of all that-- don't have any operations inflight when closing the backend.

mvc := toGRPC(cli).Maintenance
mvc.Defragment(context.Background(), &pb.DefragmentRequest{})
// simulate 'embed.Etcd.Close()' with '*grpc.Server.GracefulStop'
clus.Members[0].grpcServer.GracefulStop()
Contributor

clus.Members[0].Stop()

@@ -518,20 +518,6 @@ func newTestKeyBytes(rev revision, tombstone bool) []byte {
return bytes
}

// TestStoreHashAfterForceCommit ensures that later Hash call to
// closed backend with ForceCommit does not panic.
func TestStoreHashAfterForceCommit(t *testing.T) {
Contributor

also remove the select in mvcc.store.Hash, which was faking this

@gyuho gyuho added the WIP label Apr 17, 2017
Signed-off-by: Gyu-Ho Lee <gyuhox@gmail.com>
Fix etcd-io#7322.

Signed-off-by: Gyu-Ho Lee <gyuhox@gmail.com>
- Test etcd-io#7322.
- Remove test case added in etcd-io#6662.

Signed-off-by: Gyu-Ho Lee <gyuhox@gmail.com>
This reverts commit 994e8e4.

Since etcdserver now gracefully shuts down the gRPC server, revert etcd-io#6662.

Signed-off-by: Gyu-Ho Lee <gyuhox@gmail.com>
Revert change in etcd-io@33acbb6.

Signed-off-by: Gyu-Ho Lee <gyuhox@gmail.com>
@FingerLiu

@gyuho is there a workaround in earlier versions?

Successfully merging this pull request may close these issues.

boltdb panic while removing member from cluster