x/net/http2, net/http: client can hang forever if headers' size exceeds connection's buffer size and server hangs past request time #23559
Comments
/cc @bradfitz @tombergan |
Working on a patch to fix this, I realized that using the context deadline to time out the I/O calls may not be appropriate. |
Addresses a transport hanging on blocking I/O. There are many scenarios where the round trip hangs on a write or read and won't be unblocked by the current cancelation mechanisms (context, Request.Cancel, ...). This adds read and write deadline support. The writer disables the read deadline and enables the write deadline; after the write succeeds, it disables the write deadline and re-enables the read deadline. The read loop also sets its read deadline after a successful read, since the next frame is not predictable. This guarantees that an I/O will not time out before IOTimeout, and that after a complete block it will time out after at least IOTimeout. See issue: golang/go#23559 Change-Id: If618a63857cc32d8c3175c0d9bef1f8bf83c89df
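A rough sketch of the deadline-toggling scheme that change describes, not the CL's actual code; deadlineConn and ioTimeout are illustrative names wrapped around a plain net.Conn.

// Hedged sketch only: before each write, clear the read deadline and arm a
// write deadline; after a successful write, clear the write deadline and
// re-arm the read deadline so the read loop times out if the peer goes silent.
package transportsketch

import (
	"net"
	"time"
)

type deadlineConn struct {
	net.Conn
	ioTimeout time.Duration
}

func (c *deadlineConn) Write(p []byte) (int, error) {
	// Disable the read deadline so a slow write isn't failed by a stale read timer.
	if err := c.Conn.SetReadDeadline(time.Time{}); err != nil {
		return 0, err
	}
	// Bound the write itself.
	if err := c.Conn.SetWriteDeadline(time.Now().Add(c.ioTimeout)); err != nil {
		return 0, err
	}
	n, err := c.Conn.Write(p)
	if err != nil {
		return n, err
	}
	// Write succeeded: disarm the write deadline and re-arm the read deadline.
	if err := c.Conn.SetWriteDeadline(time.Time{}); err != nil {
		return n, err
	}
	return n, c.Conn.SetReadDeadline(time.Now().Add(c.ioTimeout))
}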
The transport can hang forever if the server hangs after accepting the connection and the request headers (or control frames) exceed the connection's write buffer, regardless of the context's deadline. The ClientConn writes to the connection without setting a write deadline, so it blocks forever once the buffer has to be flushed to the socket. go test -v -run TestTransportTimeoutServerHangs ./http2
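For illustration only (this is not the referenced test), a standalone sketch of the underlying mechanism: a peer that accepts the connection but never reads will eventually block the writer forever once the kernel buffers fill, unless a write deadline is set.

// Standalone illustration: writes to a non-reading peer block once buffers
// fill; the write deadline is what eventually unblocks them.
package main

import (
	"fmt"
	"net"
	"time"
)

func main() {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		panic(err)
	}
	go func() {
		conn, err := ln.Accept()
		if err != nil {
			return
		}
		_ = conn    // accept the connection but never read from it
		select {}   // hang forever, like a stalled server
	}()

	conn, err := net.Dial("tcp", ln.Addr().String())
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	// Without this deadline, the loop below would block forever once the
	// socket buffers are full, just like the header write in roundTrip.
	conn.SetWriteDeadline(time.Now().Add(2 * time.Second))

	buf := make([]byte, 1<<20)
	for {
		if _, err := conn.Write(buf); err != nil {
			fmt.Println("write unblocked by deadline:", err)
			return
		}
	}
}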
Any update here? Keep in mind that there are 3 goroutines per Transport connection to a server:
I think the behavior we want is:
|
There are cases where this isn't working. I believe the graph and tests demonstrate it.
Unfortunately, the writes almost never time out. I don't know if this is specific to my particular use case or if buffering makes it take very long to happen. I run this CL in production for a very specific service that only makes requests to the APNS and Firebase push gateways. The patch helps to recover from stuck connections and kill them, but it is slow to converge, is incomplete, and may not work properly for the general use case.

What happens is that there are situations where the APNS gateway blocks the connection; it behaves like a firewall that drops packets and never resets the connection. Most of the time the requests time out correctly: the context unblocks the client stream goroutine and returns to the caller. However, the connection read loop never times out nor receives any other error, causing all requests to time out. (Also note that the APNS gateway doesn't allow pings.) What I see most often is that the writes are not timing out.

I believe the only way to guarantee timing out the connection is to have a way to unblock the read loop. The reads (ClientConn) are asynchronous to the writes (ClientStream), so we have to find a way to add a deadline on the reads that follow the writes (requests), or that follow a read (e.g. reading the rest of the body), from the client stream side. The CL takes a simple approach and avoids synchronization between the client stream and the connection read loop: it resets the read deadline and sets a write deadline before every write from the client stream, and after the write is done, it sets the read deadline.

I wanted to experiment with the following algorithm: upon waiting for something from the network on the ClientStream / caller side, push a deadline into a heap data structure (probably under the ClientConn.wmu lock), keeping track of the stream that pushed it. What do you think? |
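A rough sketch of that proposed "heap of pending deadlines" idea; the types and names here (pendingRead, deadlineHeap, nextReadDeadline) are hypothetical, not transport code.

// Each stream that starts waiting on the network pushes (streamID, deadline);
// the read loop can then arm conn.SetReadDeadline with the earliest one.
package main

import (
	"container/heap"
	"time"
)

type pendingRead struct {
	streamID uint32
	deadline time.Time
}

// deadlineHeap orders pending reads by earliest deadline first.
type deadlineHeap []pendingRead

func (h deadlineHeap) Len() int           { return len(h) }
func (h deadlineHeap) Less(i, j int) bool { return h[i].deadline.Before(h[j].deadline) }
func (h deadlineHeap) Swap(i, j int)      { h[i], h[j] = h[j], h[i] }

func (h *deadlineHeap) Push(x interface{}) { *h = append(*h, x.(pendingRead)) }
func (h *deadlineHeap) Pop() interface{} {
	old := *h
	n := len(old)
	x := old[n-1]
	*h = old[:n-1]
	return x
}

// nextReadDeadline is what the read loop would arm on the connection: the
// earliest deadline of any stream currently waiting on the network.
func nextReadDeadline(h *deadlineHeap) (time.Time, bool) {
	if h.Len() == 0 {
		return time.Time{}, false
	}
	return (*h)[0].deadline, true
}

func main() {
	h := &deadlineHeap{}
	heap.Init(h)
	heap.Push(h, pendingRead{streamID: 1, deadline: time.Now().Add(5 * time.Second)})
	heap.Push(h, pendingRead{streamID: 3, deadline: time.Now().Add(2 * time.Second)})
	if d, ok := nextReadDeadline(h); ok {
		_ = d // the read loop would call conn.SetReadDeadline(d)
	}
}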
This is actually worse than just a single connection hanging: it can cause the entire app to hang. The blocking behaviour in both code paths means a single connection can block all new requests, potentially forever. |
@stevenh Could you please share a bit more context? What does your app do, which kind of services does it talk to, and when does this kind of behavior happen? |
I'd have to defer to a colleague to answer that but the high-level basics are its making http2 calls and the entire app hung due to this. The relevant stacks from the hung app are:
|
I believe we just hit this same issue, or an extremely similar one, with go1.12.7. I have thread stacks from the "blocked" threads. As a result: Possible fixes:
Stripe Veneur periodically sends a set of HTTP/2 POSTs and waits for them to complete. When it is getting CPU overloaded, we have observed these calls to Client.Do getting "stuck" for ~10 minutes, even though the HTTP requests set a 10 second deadline. I think the following is happening:
// Acquire the connection-wide write lock and write the HEADERS frames.
cc.wmu.Lock()
endStream := !hasBody && !hasTrailers
// This write has no deadline: if the socket is blocked, it holds cc.wmu indefinitely.
werr := cc.writeHeaders(cs.ID, endStream, int(cc.maxFrameSize), hdrs)
cc.wmu.Unlock()
Stacks
The first thread is the writer, holding cc.wmu. The second is the OTHER writer, blocked waiting on cc.wmu. The third is the read loop, for completeness (this lets us see that the http2Framer belongs to the connection, so these are in fact the threads involved in this issue).
|
It looks like CL https://go-review.googlesource.com/c/net/+/181457/ mentioned on issue #32388 will probably address my issue. |
Due to lots of issues with x/net/http2, as well as the bundled h2_bundle.go in the Go runtime, HTTP/2 should be avoided for now. golang/go#23559 golang/go#42534 golang/go#43989 golang/go#33425 golang/go#29246 With a collection of such issues present, it makes sense to remove HTTP2 support for now.
* fix: metacache should only rename entries during cleanup (minio#11503)
  To avoid large delays in metacache cleanup, use rename instead of recursive delete calls; renames are cheaper. Move the content to minioMetaTmpBucket and then clean up this folder once in 24hrs instead. If the new cache can replace an existing one, we should let it replace it, since that is currently being saved anyway; this avoids a pile-up of 1000's of metacache entries for the same listing calls that are not necessary to be stored on disk.
* turn off http2 for TLS setups for now (minio#11523)
  Due to lots of issues with x/net/http2, as well as the bundled h2_bundle.go in the Go runtime, HTTP/2 should be avoided for now. golang/go#23559 golang/go#42534 golang/go#43989 golang/go#33425 golang/go#29246 With a collection of such issues present, it makes sense to remove HTTP2 support for now.
* fix: save ModTime properly in disk cache (minio#11522)
  fix minio#11414
* fix: osinfos incomplete in case of warnings (minio#11505)
  The function used for getting host information (host.SensorsTemperaturesWithContext) returns warnings in some cases. Returning with an error in such cases means we miss out on the other useful information already fetched (OS info). If the OS info has been successfully fetched, it should always be included in the output irrespective of whether the other data (CPU sensors, users) could be fetched or not.
* fix: avoid timed value for network calls (minio#11531)
  Additionally simplify timedValue to have RWMutex to avoid concurrent calls to DiskInfo() getting serialized; this has an effect on all calls that use GetDiskInfo() on the same disks, such as getOnlineDisks and getOnlineDisksWithoutHealing.
* fix: support IAM policy handling for wildcard actions (minio#11530)
  This PR fixes: allow s3:versionid as a valid conditional for Get, Put, Tags, Object Locking APIs; allow additional headers missing for object APIs; allow wildcard based action matching.
* fix: in MultiDelete API return MalformedXML upon empty input (minio#11532)
  To follow the S3 spec.
* Update yaml files to latest version RELEASE.2021-02-14T04-01-33Z
* fix: multiple pool reads parallelize when possible (minio#11537)
* Add support for remote tier management
  With this change MinIO's ILM supports transitioning objects to a remote tier. This change includes support for Azure Blob Storage, AWS S3 and Google Cloud Storage as remote tier storage backends.
Co-authored-by: Poorna Krishnamoorthy <poorna@minio.io>
Co-authored-by: Krishna Srinivas <krishna@minio.io>
Co-authored-by: Krishnan Parthasarathi <kp@minio.io>
Co-authored-by: Harshavardhana <harsha@minio.io>
Co-authored-by: Poorna Krishnamoorthy <poornas@users.noreply.github.com>
Co-authored-by: Shireesh Anjal <355479+anjalshireesh@users.noreply.github.com>
Co-authored-by: Anis Elleuch <vadmeste@users.noreply.github.com>
Co-authored-by: Minio Trusted <trusted@minio.io>
Co-authored-by: Krishna Srinivas <krishna.srinivas@gmail.com>
So we have a total deadlock scenario that hasn't been fixed for years now. This regularly breaks hundreds of requests for us: a whole process has to be killed, because new requests coming in just sit behind the lock and never get serviced, so we kill the process and abandon any in-progress requests in it in order to stop creating even more locked requests.

I could quickly fix this if sync.Mutex supported TryLock(), but that request (only a few lines of code) was rejected because a super strong use case for having it was required. Perhaps being able to fix a very serious bug that hasn't been fixable for years is a strong enough use case? (Or should we argue about whether the connection mutex should be replaced with a channel instead?) (TryLock proposal: #6123)

So, if there is a chance this would be accepted, I'm happy to submit a code change proposal that adds sync.Mutex.TryLock() and then uses it to avoid the deadlock when looking for a connection to reuse, by not considering connections that are currently locked and so really can't be reused with any guarantee of timeliness anyway (rough sketch below). |
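A hypothetical illustration of that proposal, using sync.Mutex.TryLock (which did land later, in Go 1.18); clientConn and its fields are stand-ins, not the real x/net/http2 types, and a later comment in this thread argues against this approach.

// Skip connections whose mutex is currently held (e.g. stuck behind a
// blocking header write) instead of queueing behind them.
package pool

import "sync"

type clientConn struct {
	mu              sync.Mutex
	canTakeRequests bool
}

func pickUsableConn(conns []*clientConn) *clientConn {
	for _, cc := range conns {
		if !cc.mu.TryLock() {
			continue // busy or wedged: do not wait on it
		}
		ok := cc.canTakeRequests
		cc.mu.Unlock()
		if ok {
			return cc
		}
	}
	return nil // caller would dial a new connection instead
}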
FWIW, I concur with the quibble about the name and would be happy to call it LockIfNoWait(). |
@TyeMcQueen what is this pertaining to? I'm dealing with a bug on my PC and I'm not exactly sure how to apply the correct code for the fix. |
FWIW, I am currently testing my 14-line fix for this problem (but it is not my first priority so that testing will have to be done as I can find time over the next while).
|
I have found two other places where cc.mu.Lock() can be called while p.mu.Lock() is in effect and will apply similar fixes. Though I don't have evidence of those particular flows causing lock-up in the wild. The logic is still sound for all 3 cases: Locked connections should just be skipped until the next time. |
Change https://golang.org/cl/347299 mentions this issue: |
I don't believe the TryLock approach is the right fix here. In the usual case, where the server is responsive and we do not exceed the server's per-connection stream limit, we want to create a single HTTP/2 connection and use it for all requests. Creating a new connection whenever we are unable to lock an existing connection means that we will create new connections much more aggressively. If, for example, we make ten simultaneous requests, we will be very likely to create several new connections as all the requests contend for the connection lock. The TryLock-based patch above also completely breaks the

Possibly we should have a way to create multiple HTTP/2 connections to a single server even before we exceed the stream limit on a single connection, but if so, this should be done more deliberately than creating new connections every time two requests briefly contend on a mutex.

The right fix to avoid connection pool deadlocks (#32388) is to avoid holding the connection mutex indefinitely during blocking operations, such as header writes. I have a proposed fix for that in CL 349594.

The right fix to avoid requests hanging past the request deadline when writing headers blocks (this bug, as distinct from #32388) is to watch for cancelation while writing. Unfortunately, this means we need to move header writes to a separate goroutine, since there is no simple way to bound a write to a |
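A minimal sketch of the "write in a goroutine and watch for cancelation" shape described above; writeFn and the channel are assumptions, not the code of the actual CLs.

// Perform the blocking write in its own goroutine and return early on
// cancelation; the connection must still be closed so the writer can exit.
package sketch

import "context"

func writeWithCancel(ctx context.Context, writeFn func() error) error {
	errc := make(chan error, 1)
	go func() { errc <- writeFn() }()
	select {
	case err := <-errc:
		return err
	case <-ctx.Done():
		// The write goroutine may still be blocked on the connection; the
		// caller must ensure the connection gets closed so it eventually returns.
		return ctx.Err()
	}
}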
Change https://golang.org/cl/353870 mentions this issue: |
@gopherbot please open backport issues. |
Backport issue(s) opened: #48822 (for 1.16), #48823 (for 1.17). Remember to create the cherry-pick CL(s) as soon as the patch is submitted to master, according to https://golang.org/wiki/MinorReleases. |
What version of Go are you using (go version)?

Does this issue reproduce with the latest release?
Yes

What operating system and processor architecture are you using (go env)?
Also happened on linux/amd64, see below.

What did you do?
A bug in production showed an http2 client hanging for more than 250 seconds without respect to the request context, which was set to time out after 5 seconds.
All possible timeouts are set on connection and TLS handshake and I didn't see many dials (they are monitored).
Latency graph in ms:
Clients are using net/http2.Transport directly.
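For reference, a minimal sketch of the client shape described here (names, methods, and the 5-second value are illustrative, not the real service's code): requests go through http2.Transport directly with a context deadline, which is the timeout the report says was not honored.

// Illustrative client only: RoundTrip is expected to respect ctx's 5s
// deadline; the bug report shows it can block far longer when the header
// write stalls on the connection.
package client

import (
	"context"
	"net/http"
	"time"

	"golang.org/x/net/http2"
)

func do(tr *http2.Transport, url string) (int, error) {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return 0, err
	}
	resp, err := tr.RoundTrip(req)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	return resp.StatusCode, nil
}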
What did you expect to see?
The requests should have timed out after 5s.
What did you see instead?
No or very long timeouts (I believe the server has reset the connection or TCP timed out).
The synchronous write cc.writeHeaders to the connection in ClientConn.roundTrip does not set any deadline on the connection, so it can block forever (or until TCP times out) if the server or network hangs: https://github.com/golang/net/blob/0ed95abb35c445290478a5348a7b38bb154135fd/http2/transport.go#L833
I wrote a test that demonstrates this:
gwik/net@e4c191a
Setting the write deadline fixes the test:
gwik/net@052de95
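A hedged sketch of the general idea behind that fix, deriving a write deadline from the request context before the blocking write; this is not the linked patch's actual diff, and the helper name is made up.

// Bound a single write with the request's context deadline, then reset the
// deadline so later frames on the shared connection are unaffected.
package sketch

import (
	"context"
	"net"
	"time"
)

func writeWithDeadline(ctx context.Context, conn net.Conn, p []byte) (int, error) {
	if deadline, ok := ctx.Deadline(); ok {
		if err := conn.SetWriteDeadline(deadline); err != nil {
			return 0, err
		}
		defer conn.SetWriteDeadline(time.Time{}) // clear once the write returns
	}
	return conn.Write(p)
}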
It seems to fail when the header value exceeds 1MB. I might be missing something here: the default buffer size of bufio.Writer is 4096 bytes, so I had expected to see it fail around that value; maybe compression and/or TCP buffers explain the difference. Also, I don't think it sent 1MB headers when it failed in production; something else must have filled the buffer.
The buffer is on the connection and shared among streams, so it can be filled by other requests on the same connection.
Besides this particular call, which is synchronous, no write or read to the connection has a deadline set. Can't this lead to goroutine leaks and http2 streams being stuck in the background?