Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

net/http: http2 via Forward Proxy block on getClientConn because of deadlock #49377

Closed
zzkcode opened this issue Nov 5, 2021 · 5 comments
Closed
Labels
FrozenDueToAge WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided.

Comments

@zzkcode
Copy link
Contributor

zzkcode commented Nov 5, 2021

What version of Go are you using (go version)?

~ go version
go version go1.15.3 linux/amd64

Does this issue reproduce with the latest release?

Haven't try it yet, since we are unable to reproduce this issue now. But it almost happens one time on PRD per day.

And maybe related to this all: update golang.org/x/net to pull in CL 353390 · golang/go@7109323 (github.com) which release since go.1.16.9 and go.1.17.2.

What operating system and processor architecture are you using (go env)?

Production Server, no go environment.

[root@ip-X-X-X-X /]# uname -a
Linux ip-X-X-X-X.us-west-2.compute.internal 3.10.0-957.el7.x86_64 #1 SMP Thu Nov 8 23:39:32 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

What did you do?

Send requests using custom http client through a Forward Proxy which is a Squid process running on another instance.

// http client
client := &http.Client{Transport:  &http.Transport{
			DialContext:           (&defaultDialer).DialContext,
			DisableKeepAlives:     false,
			DisableCompression:    false,
			ForceAttemptHTTP2:     true,
			MaxIdleConns:          5000,
			MaxIdleConnsPerHost:   1000,
			IdleConnTimeout:       0 * time.Second,
			ExpectContinueTimeout: 1 * time.Second,
			TLSClientConfig:       &tls.Config{InsecureSkipVerify: true},
}

// dialer
defaultDialer = net.Dialer{
		Timeout:   30 * time.Second,
		KeepAlive: 30 * time.Second,
		DualStack: true,
}

What did you expect to see?

Send requests and receive responses without any problems.

What did you see instead?

A brief summary: our application hang and OOM soon.

We used go pprof and dump goroutines and all goroutines' stacks, and had found where the application hang and finally OOM.

By goroutines graph, we found all goroutines try to acquire http2clientConnPool's mu, pick one of them as an example:

goroutine 173132507 [semacquire, 2 minutes]:
sync.runtime_SemacquireMutex(0xc0061edf5c, 0xc00de5b200, 0x1)
        /usr/local/go/src/runtime/sema.go:71 +0x47
sync.(*Mutex).lockSlow(0xc0061edf58)
        /usr/local/go/src/sync/mutex.go:138 +0x105
sync.(*Mutex).Lock(...)
        /usr/local/go/src/sync/mutex.go:81
net/http.(*http2clientConnPool).getClientConn(0xc0061edf50, 0xc00e8fb000, 0xc00de5b2a0, 0x1f, 0x0, 0xc0100bf308, 0x7767c0, 0xc004b972c0)
        /usr/local/go/src/net/http/h2_bundle.go:763 +0x2f5
net/http.http2noDialClientConnPool.GetClientConn(0xc0061edf50, 0xc00e8fb000, 0xc00de5b2a0, 0x1f, 0xc00de5b2a0, 0x1f, 0x8)
        /usr/local/go/src/net/http/h2_bundle.go:953 +0x4e
net/http.(*http2Transport).RoundTripOpt(0xc004b972c0, 0xc00e8fb000, 0xfc4800, 0xc0061edf80, 0xc0100bf300, 0x5)
        /usr/local/go/src/net/http/h2_bundle.go:6940 +0x106
net/http.(*http2Transport).RoundTrip(...)
        /usr/local/go/src/net/http/h2_bundle.go:6908
net/http.http2noDialH2RoundTripper.RoundTrip(0xc004b972c0, 0xc00e8fb000, 0x12c0e80, 0xc004b972c0, 0xc009664d80)
        /usr/local/go/src/net/http/h2_bundle.go:9108 +0x3e
net/http.(*Transport).roundTrip(0xc002c56140, 0xc00e8fac00, 0x30, 0x7f821cc584b8, 0x150)
        /usr/local/go/src/net/http/transport.go:527 +0xd8c
net/http.(*Transport).RoundTrip(0xc002c56140, 0xc00e8fac00, 0xc002c56140, 0x0, 0x0)
        /usr/local/go/src/net/http/roundtrip.go:17 +0x35
net/http.send(0xc00e8fac00, 0x12bf860, 0xc002c56140, 0x0, 0x0, 0x0, 0xc002a6b390, 0x203005, 0x1, 0x0)
        /usr/local/go/src/net/http/client.go:252 +0x453
net/http.(*Client).send(0xc00033c030, 0xc00e8fac00, 0x0, 0x0, 0x0, 0xc002a6b390, 0x0, 0x1, 0x203005)
        /usr/local/go/src/net/http/client.go:176 +0xff
net/http.(*Client).do(0xc00033c030, 0xc00e8fac00, 0x0, 0x0, 0x0)
        /usr/local/go/src/net/http/client.go:718 +0x45f
net/http.(*Client).Do(...)
        /usr/local/go/src/net/http/client.go:586
// ...

And by digging into the goroutine stacks, we found:

  1. goroutineA(172922339) had acquired http2clientConnPool's mu and try to acquire http2ClientConn's mu which is hold in goroutineB(172921412).
  2. goroutineB(172921412) had acquired the http2ClientConn's mu, but got stuck within cc.writeHeaders(more specific, cc.bw.Flush()) so it won't release cc.mu forever.

so it's a dead lock now and all other goroutines got stuck in getClientConn, please see code below(all code based on go1.15.3):

// go1.15.3

// 1. goroutineA(172922339)
// /usr/local/go/src/net/http/h2_bundle.go:752
func (p *http2clientConnPool) getClientConn(req *Request, addr string, dialOnMiss bool) (*http2ClientConn, error) {
	if http2isConnectionCloseRequest(req) && dialOnMiss {
		// ... omit code here
	}
	p.mu.Lock()
	for _, cc := range p.conns[addr] {
    // goroutineA(172922339) got p.mu, but got stuck within cc.idleState()
		if st := cc.idleState(); st.canTakeNewRequest {
			if p.shouldTraceGetConn(st) {
				http2traceGetConn(req, addr)
			}
			p.mu.Unlock()
			return cc, nil
		}
	}

// /usr/local/go/src/net/http/h2_bundle.go:7217
func (cc *http2ClientConn) idleState() http2clientConnIdleState {
  // goroutineA(172922339) could not acquire cc.mu since it had been acquired by goroutineB(172921412)
	cc.mu.Lock()
	defer cc.mu.Unlock()
	return cc.idleStateLocked()
}

// 2. goroutineB(172921412)
// /usr/local/go/src/net/http/h2_bundle.go:7474
func (cc *http2ClientConn) roundTrip(req *Request) (res *Response, gotErrAfterReqBodyWrite bool, err error) {
	// ... omit code here

	cc.mu.Lock()
	if err := cc.awaitOpenSlotForRequest(req); err != nil {
		cc.mu.Unlock()
		return nil, false, err
	}

	// ... omit code here

	cc.wmu.Lock()
	endStream := !hasBody && !hasTrailers
	// goroutineB(172921412) got cc.mu, but got stuck within cc.writeHeaders so it won't release cc.mu forever
	werr := cc.writeHeaders(cs.ID, endStream, int(cc.maxFrameSize), hdrs)
	cc.wmu.Unlock()
	http2traceWroteHeaders(cs.trace)
	cc.mu.Unlock()

	// ... omit code here

}

goroutineA(172922339) stacks:

goroutine 172922339 [semacquire, 2 minutes]:
sync.runtime_SemacquireMutex(0xc00810b6d4, 0x0, 0x1)
        /usr/local/go/src/runtime/sema.go:71 +0x47
sync.(*Mutex).lockSlow(0xc00810b6d0)
        /usr/local/go/src/sync/mutex.go:138 +0x105
sync.(*Mutex).Lock(...)
        /usr/local/go/src/sync/mutex.go:81
net/http.(*http2ClientConn).idleState(0xc00810b680, 0xc00bcd0000)
        /usr/local/go/src/net/http/h2_bundle.go:7218 +0xad
net/http.(*http2clientConnPool).getClientConn(0xc0061edf50, 0xc010a40900, 0xc015603260, 0x19, 0x0, 0xc00ff16408, 0x7767c0, 0xc004b972c0)
        /usr/local/go/src/net/http/h2_bundle.go:765 +0x105
net/http.http2noDialClientConnPool.GetClientConn(0xc0061edf50, 0xc010a40900, 0xc015603260, 0x19, 0xc015603260, 0x19, 0x4)
        /usr/local/go/src/net/http/h2_bundle.go:953 +0x4e
net/http.(*http2Transport).RoundTripOpt(0xc004b972c0, 0xc010a40900, 0xfc4800, 0xc0061edf80, 0xc00ff16400, 0x5)
        /usr/local/go/src/net/http/h2_bundle.go:6940 +0x106
net/http.(*http2Transport).RoundTrip(...)
        /usr/local/go/src/net/http/h2_bundle.go:6908
net/http.http2noDialH2RoundTripper.RoundTrip(0xc004b972c0, 0xc010a40900, 0x12c0e80, 0xc004b972c0, 0xc013b91600)
        /usr/local/go/src/net/http/h2_bundle.go:9108 +0x3e
net/http.(*Transport).roundTrip(0xc002c56140, 0xc010a40900, 0x30, 0x7f821b56bd78, 0x150)
        /usr/local/go/src/net/http/transport.go:527 +0xd8c
net/http.(*Transport).RoundTrip(0xc002c56140, 0xc010a40900, 0xc002c56140, 0x0, 0x0)
        /usr/local/go/src/net/http/roundtrip.go:17 +0x35
net/http.send(0xc010a40900, 0x12bf860, 0xc002c56140, 0x0, 0x0, 0x0, 0xc000530e80, 0x203004, 0x1, 0x0)
        /usr/local/go/src/net/http/client.go:252 +0x453
net/http.(*Client).send(0xc00033c030, 0xc010a40900, 0x0, 0x0, 0x0, 0xc000530e80, 0x0, 0x1, 0x203000)
        /usr/local/go/src/net/http/client.go:176 +0xff
net/http.(*Client).do(0xc00033c030, 0xc010a40900, 0x0, 0x0, 0x0)
        /usr/local/go/src/net/http/client.go:718 +0x45f
net/http.(*Client).Do(...)
        /usr/local/go/src/net/http/client.go:586

goroutineB(172921412) stacks:

// goroutine 172921412
goroutine 172921412 [IO wait, 2 minutes]:
internal/poll.runtime_pollWait(0x7f821dc262a8, 0x77, 0x12c1140)
        /usr/local/go/src/runtime/netpoll.go:220 +0x55
internal/poll.(*pollDesc).wait(0xc00e4ae918, 0x77, 0x12c1100, 0x1b07d70, 0x0)
        /usr/local/go/src/internal/poll/fd_poll_runtime.go:87 +0x45
internal/poll.(*pollDesc).waitWrite(...)
        /usr/local/go/src/internal/poll/fd_poll_runtime.go:96
internal/poll.(*FD).Write(0xc00e4ae900, 0xc0090eb000, 0x334, 0x1000, 0x0, 0x0, 0x0)
        /usr/local/go/src/internal/poll/fd_unix.go:275 +0x2d0
net.(*netFD).Write(0xc00e4ae900, 0xc0090eb000, 0x334, 0x1000, 0x12e0a80, 0x13, 0x1161994)
        /usr/local/go/src/net/fd_posix.go:73 +0x4f
net.(*conn).Write(0xc00079e150, 0xc0090eb000, 0x334, 0x1000, 0x0, 0x0, 0x0)
        /usr/local/go/src/net/net.go:194 +0x8e
crypto/tls.(*Conn).write(0xc00f756a80, 0xc0090eb000, 0x334, 0x1000, 0xc00c022000, 0x31e, 0x1000)
        /usr/local/go/src/crypto/tls/conn.go:914 +0x162
crypto/tls.(*Conn).writeRecordLocked(0xc00f756a80, 0x17, 0xc00c022000, 0x31e, 0x1000, 0xc0012a4840, 0xc00c202b00, 0xc00c202b00)
        /usr/local/go/src/crypto/tls/conn.go:963 +0x2c5
crypto/tls.(*Conn).Write(0xc00f756a80, 0xc00c022000, 0x31e, 0x1000, 0x0, 0x0, 0x0)
        /usr/local/go/src/crypto/tls/conn.go:1130 +0x19a
net/http.http2stickyErrWriter.Write(0x7f821b573330, 0xc00f756a80, 0xc00810b7f0, 0xc00c022000, 0x31e, 0x1000, 0xc0037d4e00, 0x0, 0x0)
        /usr/local/go/src/net/http/h2_bundle.go:6871 +0x68
bufio.(*Writer).Flush(0xc004b654c0, 0x137f, 0xc00ecb1500)
        /usr/local/go/src/bufio/bufio.go:607 +0x7b
net/http.(*http2ClientConn).writeHeaders(0xc00810b680, 0x10000137f, 0x4000, 0xc00ecb1815, 0x0, 0x11b2, 0x0, 0x0)
        /usr/local/go/src/net/http/h2_bundle.go:7720 +0x1eb
net/http.(*http2ClientConn).roundTrip(0xc00810b680, 0xc005267600, 0x0, 0x0, 0x0, 0x0)
        /usr/local/go/src/net/http/h2_bundle.go:7537 +0x59d
net/http.(*http2Transport).RoundTripOpt(0xc004b972c0, 0xc005267600, 0xfc4800, 0xc0061edf80, 0xc00eb6f400, 0x5)
        /usr/local/go/src/net/http/h2_bundle.go:6947 +0x170
net/http.(*http2Transport).RoundTrip(...)
        /usr/local/go/src/net/http/h2_bundle.go:6908
net/http.http2noDialH2RoundTripper.RoundTrip(0xc004b972c0, 0xc005267600, 0x12c0e80, 0xc004b972c0, 0xc011863980)
        /usr/local/go/src/net/http/h2_bundle.go:9108 +0x3e
net/http.(*Transport).roundTrip(0xc002c56140, 0xc005267600, 0x30, 0x7f821b57cc40, 0x150)
        /usr/local/go/src/net/http/transport.go:527 +0xd8c
net/http.(*Transport).RoundTrip(0xc002c56140, 0xc005267600, 0xc002c56140, 0x0, 0x0)
        /usr/local/go/src/net/http/roundtrip.go:17 +0x35
net/http.send(0xc005267600, 0x12bf860, 0xc002c56140, 0x0, 0x0, 0x0, 0xc002b5e9b8, 0x203001, 0x1, 0x0)
        /usr/local/go/src/net/http/client.go:252 +0x453
net/http.(*Client).send(0xc00033c030, 0xc005267600, 0x0, 0x0, 0x0, 0xc002b5e9b8, 0x0, 0x1, 0x203000)
        /usr/local/go/src/net/http/client.go:176 +0xff
net/http.(*Client).do(0xc00033c030, 0xc005267600, 0x0, 0x0, 0x0)
        /usr/local/go/src/net/http/client.go:718 +0x45f
net/http.(*Client).Do(...)
        /usr/local/go/src/net/http/client.go:586

blocked within cc.bw.Flush():

// /usr/local/go/src/net/http/h2_bundle.go:7695
// requires cc.wmu be held
func (cc *http2ClientConn) writeHeaders(streamID uint32, endStream bool, maxFrameSize int, hdrs []byte) error {
	first := true // first frame written (HEADERS is first, then CONTINUATION)
	for len(hdrs) > 0 && cc.werr == nil {
		chunk := hdrs
		if len(chunk) > maxFrameSize {
			chunk = chunk[:maxFrameSize]
		}
		hdrs = hdrs[len(chunk):]
		endHeaders := len(hdrs) == 0
		if first {
			cc.fr.WriteHeaders(http2HeadersFrameParam{
				StreamID:      streamID,
				BlockFragment: chunk,
				EndStream:     endStream,
				EndHeaders:    endHeaders,
			})
			first = false
		} else {
			cc.fr.WriteContinuation(streamID, endHeaders, chunk)
		}
	}
	// TODO(bradfitz): this Flush could potentially block (as
	// could the WriteHeaders call(s) above), which means they
	// wouldn't respond to Request.Cancel being readable. That's
	// rare, but this should probably be in a goroutine.
	cc.bw.Flush()
	return cc.werr
}

So it maybe similar questions about:

And maybe fixed by this commit: all: update golang.org/x/net to pull in CL 353390 · golang/go@7109323 (github.com).

But as I mentioned before, since we could not reproduce it right now, so we are not sure.

And we found all other streams from this conn are trying to writeStreamReset after ctx.Done(), so maybe this conn had broken? Pick one of them:

goroutine 172918797 [semacquire, 2 minutes]:
sync.runtime_SemacquireMutex(0xc00810b7ec, 0xfeaac7400000200, 0x1)
        /usr/local/go/src/runtime/sema.go:71 +0x47
sync.(*Mutex).lockSlow(0xc00810b7e8)
        /usr/local/go/src/sync/mutex.go:138 +0x105
sync.(*Mutex).Lock(...)
        /usr/local/go/src/sync/mutex.go:81
net/http.(*http2ClientConn).writeStreamReset(0xc00810b680, 0x800001377, 0x0, 0x0)
        /usr/local/go/src/net/http/h2_bundle.go:8929 +0xbe
net/http.(*http2ClientConn).roundTrip(0xc00810b680, 0xc00ba4ea00, 0x0, 0x0, 0x0, 0x0)
        /usr/local/go/src/net/http/h2_bundle.go:7609 +0xde5
net/http.(*http2Transport).RoundTripOpt(0xc004b972c0, 0xc00ba4ea00, 0xfc4800, 0xc0061edf80, 0xc00875b180, 0x5)
        /usr/local/go/src/net/http/h2_bundle.go:6947 +0x170
net/http.(*http2Transport).RoundTrip(...)
        /usr/local/go/src/net/http/h2_bundle.go:6908
net/http.http2noDialH2RoundTripper.RoundTrip(0xc004b972c0, 0xc00ba4ea00, 0x12c0e80, 0xc004b972c0, 0xc0047f6800)
        /usr/local/go/src/net/http/h2_bundle.go:9108 +0x3e
net/http.(*Transport).roundTrip(0xc002c56140, 0xc00ba4ea00, 0x30, 0x7f821cfdf408, 0x150)
        /usr/local/go/src/net/http/transport.go:527 +0xd8c
net/http.(*Transport).RoundTrip(0xc002c56140, 0xc00ba4ea00, 0xc002c56140, 0x0, 0x0)
        /usr/local/go/src/net/http/roundtrip.go:17 +0x35
net/http.send(0xc00ba4ea00, 0x12bf860, 0xc002c56140, 0x0, 0x0, 0x0, 0xc000cae7a0, 0x203002, 0x1, 0x0)
        /usr/local/go/src/net/http/client.go:252 +0x453
net/http.(*Client).send(0xc00033c030, 0xc00ba4ea00, 0x0, 0x0, 0x0, 0xc000cae7a0, 0x0, 0x1, 0x203001)
        /usr/local/go/src/net/http/client.go:176 +0xff
net/http.(*Client).do(0xc00033c030, 0xc00ba4ea00, 0x0, 0x0, 0x0)
        /usr/local/go/src/net/http/client.go:718 +0x45f
net/http.(*Client).Do(...)
        /usr/local/go/src/net/http/client.go:586

So is there any suggestions or comments that why cc.bw.Flush() got stuck? And if you need more information, please let me know. Thanks in advanced.

@seankhliao
Copy link
Member

1.15 is outside of our support cycle, and given the fixes you've already found which may fix the issue, I don't think there's much to do unless you can provide a reproducer and verify if it works with the fixes above.

@seankhliao seankhliao changed the title http2 via Forward Proxy block on getClientConn because of deadlock net/http: http2 via Forward Proxy block on getClientConn because of deadlock Nov 5, 2021
@seankhliao seankhliao added the WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. label Nov 5, 2021
@zzkcode
Copy link
Contributor Author

zzkcode commented Nov 5, 2021

@seankhliao Thanks for your quick comment.

Yes, we are still capturing the input traffics which would trigger the output traffics via our http client. Once we can reproduce this issue, we will verify it.

It would take some time for us to upgrade from 1.15 to 1.16 or 1.17 on Production, so before this, we still need to investigate why it got stuck on cc.bw.Flush() and why this only occur via Forward-Proxy. And to see if it has some workarounds under 1.15.

We will update this when we have a reproducer. Thanks.

@neild
Copy link
Contributor

neild commented Nov 5, 2021

Note that you can use a more recent HTTP/2 client without upgrading to a newer Go release with something like this:

import "golang.org/x/net/http2"

if err := http2.ConfigureTransport(http.DefaultTransport.(*http.Transport)); err != nil {
        panic(err)
}

This will use the HTTP/2 client implementation in golang.org/x/net/http2 instead of the one vendored into net/http.

@zzkcode
Copy link
Contributor Author

zzkcode commented Nov 6, 2021

@neild Thanks for your great suggestion.

As I can tell from X Repositories and http2 package which says as below, it may not suitable for our Production Services, but I think we will definitely use it for verifying if it could resolve our issue once we have a reproducer. Great thanks again.

They are developed under looser compatibility requirements than the Go core. In general, they will support the previous two releases and tip.

This package is low-level and intended to be used directly by very few people.

@zzkcode
Copy link
Contributor Author

zzkcode commented Mar 6, 2022

We had avoided this issue by tuning some parameters on the Application and System level to abort 'abnormal' TCP connections asap, and so far so good.

So I'm going to close this issue and please help to remove the label 'WaitingForInfo' if needed. Last but not least, many thanks for all the suggestions.

@zzkcode zzkcode closed this as completed Mar 6, 2022
@golang golang locked and limited conversation to collaborators Mar 6, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided.
Projects
None yet
Development

No branches or pull requests

4 participants