Fixed processor dispose race condition #135


echistyakov

Fix race condition in RequestResponse (with processor Dispose())

Motivation:

I caught this bug while stress-testing an RSocket-based Client/Server implementation in Facebook Thrift: https://github.com/facebook/fbthrift/blob/main/thrift/lib/go/thrift/stress/server_test.go

It's a pretty simple stress test: 100K concurrent RequestResponse calls against a single server, with at most 100 concurrent RSocket connections at a time and a fresh connection created for each request.
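
For illustration, a minimal sketch of that load pattern (not the actual fbthrift test; makeRequest is a hypothetical helper) might look like this:

    package main

    import "sync"

    func main() {
        const totalRequests = 100_000
        const maxConcurrent = 100

        sem := make(chan struct{}, maxConcurrent) // bounds in-flight requests
        var wg sync.WaitGroup

        for i := 0; i < totalRequests; i++ {
            sem <- struct{}{} // acquire a slot
            wg.Add(1)
            go func() {
                defer wg.Done()
                defer func() { <-sem }() // release the slot
                makeRequest() // dial a fresh connection, issue one RequestResponse, close
            }()
        }
        wg.Wait()
    }

    func makeRequest() { /* hypothetical: one request over a fresh connection */ }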

The symptom of the stress-test failure was that a small portion of requests would fail with a use of closed network connection error:

2024/12/16 15:21:02 flush failed drain: flush failed: write tcp [::1]:43993->[::1]:40658: use of closed network connection
2024/12/16 15:21:02 flush failed drain: flush failed: write tcp [::1]:43993->[::1]:35820: use of closed network connection
2024/12/16 15:21:02 flush failed drain: flush failed: write tcp [::1]:43993->[::1]:36604: use of closed network connection
2024/12/16 15:21:02 flush failed drain: flush failed: write tcp [::1]:43993->[::1]:36780: use of closed network connection
2024/12/16 15:21:02 flush failed drain: flush failed: write tcp [::1]:43993->[::1]:35856: use of closed network connection

This is a race condition between two clients, and it goes like this:

  1. Client makes a RequestResponse request here:
    func (dc *DuplexConnection) RequestResponse(req payload.Payload) (res mono.Mono) {
  2. A mono processor is created by pulling it from a global processor pool:
    m, s, _ := mono.NewProcessor(dc.reqSche, onFinally)
  3. A callback handler is registered; the processor is assigned to its sink field:
    handler.sink = s
    dc.register(sid, handler)
  4. The client gets a response back from the server (no issues).
  5. The processor invokes the onFinally callback (since the Stream Sequence is now complete):
    onFinally := func(s reactor.SignalType, d reactor.Disposable) {
        common.TryRelease(handler.cache)
        d.Dispose()
        if s == reactor.SignalTypeCancel {
            dc.sendFrame(framing.NewWriteableCancelFrame(sid))
        }
        dc.unregister(sid)
    }
  6. The processor gets disposed on the following line (it is placed back into the global processor pool and becomes available for any other RSocket client to use):
    d.Dispose()
  7. Immediately after the above line, our current goroutine gets preempted. It does not get a chance to unregister the handler callback (which still holds a pointer to the sink we just released into the global pool):
    dc.unregister(sid)
  8. Another goroutine starts running.
    a. This goroutine creates a completely separate RSocket client to make a separate RequestResponse.
    b. This RSocket client happens to get the same sink/processor (the one we just disposed) from the global pool.
  9. We call Close() on our original client from the earlier steps (since the RequestResponse sequence had already completed).
    a. A destroyHandler method gets invoked:
    err := dc.GetError()
    if err == nil {
        dc.destroyHandler(errSocketClosed)
    } else {
        dc.destroyHandler(err)
    }

    func (dc *DuplexConnection) destroyHandler(err error) {

    b. It in turn invokes the stopWithError method of our handler (which we have not yet unregistered, because our goroutine was preempted in step 7):
    func (s requestResponseCallback) stopWithError(err error) {
        s.sink.Error(err)
        common.TryRelease(s.cache)
    }

    c. However, the sink is already in use by another client. We are sending an Error to a completely unrelated client! Race condition!
    d. The other (unrelated) client gets a false-positive error saying the socket is closed!
  10. At some point after Close() executes, the goroutine from step 7 is scheduled and is finally able to unregister the handler - but it's too late: the race condition has already occurred. (A distilled sketch of this pattern follows below.)
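
To make the hazard concrete, here is a distilled, self-contained sketch of the same pattern (hypothetical types, not rsocket-go's actual code): returning a pooled object before unregistering it leaves a window in which a stale registration still points at an object another goroutine may already be reusing.

    package main

    import (
        "fmt"
        "sync"
    )

    type sink struct{ owner string }

    var pool = sync.Pool{New: func() any { return new(sink) }}

    var (
        mu       sync.Mutex
        handlers = map[uint32]*sink{}
    )

    func main() {
        // Client A checks a sink out of the pool and registers a handler for it.
        s := pool.Get().(*sink)
        s.owner = "client A"
        mu.Lock()
        handlers[1] = s
        mu.Unlock()

        // Buggy order: dispose first - the sink is now up for grabs...
        pool.Put(s)

        // ...and client B grabs it before A has unregistered.
        s2 := pool.Get().(*sink) // likely the very same object as s
        s2.owner = "client B"

        // Close() on A's connection still finds the stale registration and
        // "errors" a sink that now belongs to client B.
        mu.Lock()
        stale := handlers[1]
        mu.Unlock()
        fmt.Println("sending error to sink owned by:", stale.owner) // client B

        // The unregister finally happens - too late, the cross-talk occurred.
        mu.Lock()
        delete(handlers, 1)
        mu.Unlock()
    }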

Modifications/Fix:

Reordered the relevant operations to avoid the race condition (see the sketch after this list):

  1. Unregister the handler callback (with its sink, i.e. the processor) first.
  2. Dispose of the sink (i.e. place the processor back into the global pool) last.
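
Assembled from the snippets above (the exact change lives in the diff), the reordered callback looks roughly like this:

    onFinally := func(s reactor.SignalType, d reactor.Disposable) {
        common.TryRelease(handler.cache)
        if s == reactor.SignalTypeCancel {
            dc.sendFrame(framing.NewWriteableCancelFrame(sid))
        }
        dc.unregister(sid) // 1. drop the registration first
        d.Dispose()        // 2. only then return the processor to the pool
    }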

Result:

The stress test succeeds after this change.
