Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[3.4] embed: fix nil pointer dereference when stopServer #16195

Merged
merged 2 commits into from
Jul 11, 2023

Conversation

fuweid
Copy link
Member

@fuweid fuweid commented Jul 7, 2023

Since v3.4.25, ETCD server introduces http-only urls flag to expose gRPC-only endpoints. When user enables this feature, the stopServer will panic during terminating. If the server is leader, it won't have chance to transfer the leadership.

Jul 07 14:43:04  etcd[11502]: received terminated signal, shutting down...
Jul 07 14:43:04  etcd[11502]: WARNING: 2023/07/07 14:43:04 grpc: addrConn.createTransport failed to connect to {0.0.0.0:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 0.0.0.0:2379: connect: connection refused". Reconnecting...Jul 07 14:43:04  etcd[11502]: WARNING: 2023/07/07 14:43:04 grpc: addrConn.createTransport failed to connect to {0.0.0.0:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 0.0.0.0:2379: connect: connection refused". Reconnecting...
Jul 07 14:43:04  etcd[11502]: panic: runtime error: invalid memory address or nil pointer dereference                                                                                                                                                                           Jul 07 14:43:04  etcd[11502]: [signal SIGSEGV: segmentation violation code=0x1 addr=0x130 pc=0x9ccd45]
Jul 07 14:43:04  etcd[11502]: goroutine 225 [running]:
Jul 07 14:43:04  etcd[11502]: google.golang.org/grpc.(*Server).Stop(0x0)
Jul 07 14:43:04  etcd[11502]:         /home/fuwei/go/pkg/mod/google.golang.org/grpc@v1.26.0/server.go:1390 +0x45
Jul 07 14:43:04  etcd[11502]: go.etcd.io/etcd/embed.stopServers.func1()
Jul 07 14:43:04  etcd[11502]:         /home/fuwei/go/src/go.etcd.io/etcd/embed/etcd.go:431 +0x3c
Jul 07 14:43:04  etcd[11502]: go.etcd.io/etcd/embed.stopServers({0x115a558, 0xc000278b70}, 0xc00024f248)
Jul 07 14:43:04  etcd[11502]:         /home/fuwei/go/src/go.etcd.io/etcd/embed/etcd.go:438 +0x7d
Jul 07 14:43:04  etcd[11502]: go.etcd.io/etcd/embed.(*Etcd).Close(0xc0004d6600)
Jul 07 14:43:04  etcd[11502]:         /home/fuwei/go/src/go.etcd.io/etcd/embed/etcd.go:392 +0x835
Jul 07 14:43:04  etcd[11502]: go.etcd.io/etcd/pkg/osutil.HandleInterrupts.func1()
Jul 07 14:43:04  etcd[11502]:         /home/fuwei/go/src/go.etcd.io/etcd/pkg/osutil/interrupt_unix.go:70 +0x284
Jul 07 14:43:04  etcd[11502]: created by go.etcd.io/etcd/pkg/osutil.HandleInterrupts
Jul 07 14:43:04  etcd[11502]:         /home/fuwei/go/src/go.etcd.io/etcd/pkg/osutil/interrupt_unix.go:53 +0xce
Jul 07 14:43:04  systemd[1]: etcd.service: Main process exited, code=exited, status=2/INVALIDARGUMENT

Please read https://github.com/etcd-io/etcd/blob/main/CONTRIBUTING.md#contribution-flow.

@fuweid fuweid requested a review from ahrtr July 7, 2023 06:51
embed/etcd.go Outdated Show resolved Hide resolved
@ahrtr ahrtr mentioned this pull request Jul 7, 2023
5 tasks
@ahrtr
Copy link
Member

ahrtr commented Jul 7, 2023

Could you please also take a look at the failed test case TestWatchDelayForEvent/SeparateHttpNoTLS?

@fuweid
Copy link
Member Author

fuweid commented Jul 7, 2023

Could you please also take a look at the failed test case TestWatchDelayForEvent/SeparateHttpNoTLS?

2023-07-07T06:58:47.3638136Z === RUN   TestWatchDelayForEvent/SeparateHttpNoTLS
2023-07-07T06:58:47.8329448Z     watch_delay_test.go:178: Got watch response, since last: 222.604µs
2023-07-07T06:58:47.9341932Z     watch_delay_test.go:178: Got watch response, since last: 101.561766ms
2023-07-07T06:58:48.0539685Z     watch_delay_test.go:178: Got watch response, since last: 119.57196ms
2023-07-07T06:58:48.2278973Z     watch_delay_test.go:178: Got watch response, since last: 173.907852ms
2023-07-07T06:58:48.3629701Z     watch_delay_test.go:178: Got watch response, since last: 134.82321ms
2023-07-07T06:58:48.4749811Z     watch_delay_test.go:178: Got watch response, since last: 112.078238ms
2023-07-07T06:58:48.6125484Z     watch_delay_test.go:178: Got watch response, since last: 137.385153ms
2023-07-07T06:58:48.7315046Z     watch_delay_test.go:178: Got watch response, since last: 118.827948ms
2023-07-07T06:58:48.8435664Z     watch_delay_test.go:227: Generating read load around 350.1 MB/s
2023-07-07T06:58:48.8635390Z     watch_delay_test.go:178: Got watch response, since last: 132.028265ms
2023-07-07T06:58:48.9833257Z     watch_delay_test.go:178: Got watch response, since last: 119.635162ms
2023-07-07T06:58:49.0902674Z     watch_delay_test.go:178: Got watch response, since last: 106.73155ms
2023-07-07T06:58:49.2181111Z     watch_delay_test.go:178: Got watch response, since last: 124.43814ms
2023-07-07T06:58:49.3379306Z     watch_delay_test.go:178: Got watch response, since last: 123.091818ms
2023-07-07T06:58:49.4613127Z     watch_delay_test.go:178: Got watch response, since last: 106.734851ms
2023-07-07T06:58:49.5869947Z     watch_delay_test.go:178: Got watch response, since last: 125.734762ms
2023-07-07T06:58:49.7258288Z     watch_delay_test.go:178: Got watch response, since last: 138.780676ms
2023-07-07T06:58:49.8371572Z     watch_delay_test.go:227: Generating read load around 395.1 MB/s
2023-07-07T06:58:49.8605727Z     watch_delay_test.go:178: Got watch response, since last: 134.278101ms
2023-07-07T06:58:49.9725185Z     watch_delay_test.go:178: Got watch response, since last: 111.832433ms
2023-07-07T06:58:50.0908311Z     watch_delay_test.go:178: Got watch response, since last: 118.201339ms
2023-07-07T06:58:50.2008287Z     watch_delay_test.go:178: Got watch response, since last: 108.402978ms
2023-07-07T06:58:50.3131929Z     watch_delay_test.go:178: Got watch response, since last: 112.307141ms
2023-07-07T06:58:50.4433255Z     watch_delay_test.go:178: Got watch response, since last: 130.010132ms
2023-07-07T06:58:50.5759509Z     watch_delay_test.go:178: Got watch response, since last: 132.657675ms
2023-07-07T06:58:50.7077837Z     watch_delay_test.go:178: Got watch response, since last: 131.75646ms
2023-07-07T06:58:50.8254817Z     watch_delay_test.go:178: Got watch response, since last: 117.518227ms
2023-07-07T06:58:50.8324378Z     watch_delay_test.go:227: Generating read load around 445.1 MB/s
2023-07-07T06:58:50.9390185Z     watch_delay_test.go:178: Got watch response, since last: 113.287358ms
2023-07-07T06:58:51.0960666Z     watch_delay_test.go:178: Got watch response, since last: 157.068776ms
2023-07-07T06:58:51.2257770Z     watch_delay_test.go:178: Got watch response, since last: 129.079717ms
2023-07-07T06:58:51.3674246Z     watch_delay_test.go:178: Got watch response, since last: 140.955711ms
2023-07-07T06:58:51.4950230Z     watch_delay_test.go:178: Got watch response, since last: 127.618893ms
2023-07-07T06:58:51.7128057Z     watch_delay_test.go:176: Unexpected watch response delayed over allowed threshold 100ms, delay: 117.646569ms
2023-07-07T06:58:51.8343210Z     watch_delay_test.go:227: Generating read load around 405.1 MB/s
2023-07-07T06:58:51.8630326Z     watch_delay_test.go:178: Got watch response, since last: 149.523552ms
2023-07-07T06:58:52.0262137Z     watch_delay_test.go:178: Got watch response, since last: 163.150175ms
2023-07-07T06:58:52.1565605Z     watch_delay_test.go:178: Got watch response, since last: 130.298037ms
2023-07-07T06:58:52.2827062Z     watch_delay_test.go:178: Got watch response, since last: 125.477758ms
2023-07-07T06:58:52.4227591Z     watch_delay_test.go:178: Got watch response, since last: 140.032696ms
2023-07-07T06:58:52.5751175Z     watch_delay_test.go:178: Got watch response, since last: 151.970792ms
2023-07-07T06:58:52.7428694Z     watch_delay_test.go:178: Got watch response, since last: 167.401045ms
2023-07-07T06:58:52.8331980Z     watch_delay_test.go:227: Generating read load around 425.1 MB/s

It seems related to the test environment. I don't have idea about why it slows down yet, since we don't enable EXPECT_DEBUG for the test.

Copy link
Member

@ahrtr ahrtr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thx @fuweid

Good catch!

@fuweid
Copy link
Member Author

fuweid commented Jul 7, 2023

@ahrtr let me check it again. I think I need to add one-more if-condition for gracefulstop.

Since v3.4.25, ETCD server introduces http-only urls flag to expose
gRPC-only endpoints. When user enables this feature, the stopServer will
panic during terminating. If the server is leader, it won't have chance
to transfer the leadership.

```
Jul 07 14:43:04  etcd[11502]: received terminated signal, shutting down...
Jul 07 14:43:04  etcd[11502]: WARNING: 2023/07/07 14:43:04 grpc: addrConn.createTransport failed to connect to {0.0.0.0:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 0.0.0.0:2379: connect: connection refused". Reconnecting...Jul 07 14:43:04  etcd[11502]: WARNING: 2023/07/07 14:43:04 grpc: addrConn.createTransport failed to connect to {0.0.0.0:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 0.0.0.0:2379: connect: connection refused". Reconnecting...
Jul 07 14:43:04  etcd[11502]: panic: runtime error: invalid memory address or nil pointer dereference                                                                                                                                                                           Jul 07 14:43:04  etcd[11502]: [signal SIGSEGV: segmentation violation code=0x1 addr=0x130 pc=0x9ccd45]
Jul 07 14:43:04  etcd[11502]: goroutine 225 [running]:
Jul 07 14:43:04  etcd[11502]: google.golang.org/grpc.(*Server).Stop(0x0)
Jul 07 14:43:04  etcd[11502]:         /home/fuwei/go/pkg/mod/google.golang.org/grpc@v1.26.0/server.go:1390 +0x45
Jul 07 14:43:04  etcd[11502]: go.etcd.io/etcd/embed.stopServers.func1()
Jul 07 14:43:04  etcd[11502]:         /home/fuwei/go/src/go.etcd.io/etcd/embed/etcd.go:431 +0x3c
Jul 07 14:43:04  etcd[11502]: go.etcd.io/etcd/embed.stopServers({0x115a558, 0xc000278b70}, 0xc00024f248)
Jul 07 14:43:04  etcd[11502]:         /home/fuwei/go/src/go.etcd.io/etcd/embed/etcd.go:438 +0x7d
Jul 07 14:43:04  etcd[11502]: go.etcd.io/etcd/embed.(*Etcd).Close(0xc0004d6600)
Jul 07 14:43:04  etcd[11502]:         /home/fuwei/go/src/go.etcd.io/etcd/embed/etcd.go:392 +0x835
Jul 07 14:43:04  etcd[11502]: go.etcd.io/etcd/pkg/osutil.HandleInterrupts.func1()
Jul 07 14:43:04  etcd[11502]:         /home/fuwei/go/src/go.etcd.io/etcd/pkg/osutil/interrupt_unix.go:70 +0x284
Jul 07 14:43:04  etcd[11502]: created by go.etcd.io/etcd/pkg/osutil.HandleInterrupts
Jul 07 14:43:04  etcd[11502]:         /home/fuwei/go/src/go.etcd.io/etcd/pkg/osutil/interrupt_unix.go:53 +0xce
Jul 07 14:43:04  systemd[1]: etcd.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
```

Signed-off-by: Wei Fu <fuweid89@gmail.com>
Signed-off-by: Wei Fu <fuweid89@gmail.com>
embed/etcd.go Show resolved Hide resolved
@ahrtr
Copy link
Member

ahrtr commented Jul 7, 2023

@ahrtr
Copy link
Member

ahrtr commented Jul 10, 2023

3.4.27 is blocked on this PR. PTAL

@chaochn47
Copy link
Member

chaochn47 commented Jul 10, 2023

Nice catch!

I think I need to add one-more if-condition for gracefulstop.

Is it still needed? @fuweid

I think we can follow up use SIGTERM by default in a separate CR to capture similar issues.

@fuweid
Copy link
Member Author

fuweid commented Jul 11, 2023

Is it still needed?

No. I already aligned the code with main branch.

I think we can follow up use SIGTERM by default in a separate CR to capture similar issues.

Sure. We already use SIGTERM in main branch. I can file PR for v3.4 as follow-up.

@ahrtr
Copy link
Member

ahrtr commented Jul 11, 2023

@fuweid can you please also add a changelog for 3.4?

@fuweid
Copy link
Member Author

fuweid commented Jul 11, 2023

@ahrtr Added it, please review it. Thanks

@serathius serathius merged commit f836291 into etcd-io:release-3.4 Jul 11, 2023
12 checks passed
@fuweid fuweid deleted the fix-panic-for-http-only branch July 11, 2023 07:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants