Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

transport: Fix deadlock in transport caused by GOAWAY race with new stream creation #5652

Merged
merged 4 commits into from
Sep 21, 2022

Conversation

zasweq
Copy link
Contributor

@zasweq zasweq commented Sep 13, 2022

Fixes #5644.

RELEASE NOTES:

  • transport: Fix deadlock in transport caused by GOAWAY race with new stream creation (add add'l comment about how it only occurs with a non-well-behaved HTTP/2 server implementation)

@zasweq zasweq added this to the 1.50 Release milestone Sep 13, 2022
@zasweq zasweq requested a review from dfawley September 13, 2022 03:16
@zasweq zasweq force-pushed the fix-transport-deadlock branch 2 times, most recently from 9bd4117 to 05537c1 Compare September 13, 2022 05:04
Comment on lines 1236 to 1239
activeStreams := make(map[uint32]*Stream)
for streamID, stream := range t.activeStreams {
activeStreams[streamID] = stream
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use a slice; only add streams to be closed to it. Early exit if t.activeStreams is zero?

t.prevGoAwayID = id

if len(t.activeStreams) == 0 {
	t.mu.Unlock()
	t.Close(...)
	return
}

streamsToClose := make([]*Stream, 0, len(t.activeStreams))
for streamID, stream := range t.activeStreams {
	if streamID > id && streamID <= upperLimit {
		streamsToClose = append(streamsToClose, stream)
	}
}
t.mu.Unlock()
for _, stream := range(streamsToClose) {
	atomic.StoreUint32()
	t.closeStream()
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is it important to have the cap(streamsToClose) len(t.activeStreams)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, sorry, I tried this way and for some reason it still induced deadlock. I'm going to keep it as is. I couldn't figure out why your way wouldn't work.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Verified on my e2e_test I checked in and also on the transport test which I didn't check in since it wrote directly to framer.

@dfawley dfawley assigned zasweq and unassigned dfawley Sep 14, 2022
Copy link
Contributor Author

@zasweq zasweq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the comment. Sorry I couldn't get your solution working. The e2e test with the new helper class that actually induced it took everything out of me.

Comment on lines 1236 to 1239
activeStreams := make(map[uint32]*Stream)
for streamID, stream := range t.activeStreams {
activeStreams[streamID] = stream
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is it important to have the cap(streamsToClose) len(t.activeStreams)?

Comment on lines 1236 to 1239
activeStreams := make(map[uint32]*Stream)
for streamID, stream := range t.activeStreams {
activeStreams[streamID] = stream
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, sorry, I tried this way and for some reason it still induced deadlock. I'm going to keep it as is. I couldn't figure out why your way wouldn't work.

@zasweq zasweq assigned dfawley and unassigned zasweq Sep 15, 2022
Copy link
Member

@dfawley dfawley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks! Just a few really small things.

return
}

streamsToClose := make([]*Stream, 0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

var streamsToClose []*Stream

Or delete the ,0 but make is unnecessary for a slice either way if you aren't allocating anything.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Switched. You initially suggested make for the capacity length. Why was that?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you specify the capacity then it won't need to do reallocations as you add to it. In this case, since we don't have any idea how many active streams are before/after the ID (and it isn't worth computing it first) then this should be fine. It will need reallocations (Go doubles the slice capacity as you add so it's O(logN)), but this is a rare case and not worth optimizing.

t.mu.Unlock()
if active == 0 {
t.Close(connectionErrorf(true, nil, "received goaway and there are no active streams"))
for _, stream := range streamsToClose {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a comment about why this is called outside t.mu.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

test/clienttester.go Show resolved Hide resolved
t.Errorf("error in lis.Accept(): %v", err)
}
ct := newClientTester(t, conn)
ct.greet()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fold this into new?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, but I'm doing this very begrudingly. ServerTester is done this way, and I feel like it' made a lot clearer to the user of the clientTester struct that it gets to the point where the client is happy with it's HTTP2 connection establisment.

Comment on lines 8167 to 8170
ct, ok := val.(*clientTester)
if !ok {
t.Fatalf("value received not a clientTester")
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just let it panic and keep the code simpler. This is just a test and the panic is as informative as this error, and it's a coding error if it happens. ct := val.(*clientTester)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Comment on lines +8177 to +8179
if i == 10 {
<-goAwayWritten.Done()
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does this do that is different from blocking on <-goAwayWritten.Done() immediately after someStreamsCreate.Fire() instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I honestly was just playing around with this until it induced deadlock, it was hard to induce on master. This allows it to induce it because the only way to induce it is to concurrently create a stream (controlBuf.mu then transport.mu), where the closeStream was (transport.mu then controlBuf.mu). If I block where you suggested, there are no concurrent streams trying to be created while the goAway is being sent from server.

@dfawley dfawley assigned zasweq and unassigned dfawley Sep 21, 2022
Copy link
Contributor Author

@zasweq zasweq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the comments :D!

return
}

streamsToClose := make([]*Stream, 0)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Switched. You initially suggested make for the capacity length. Why was that?

Comment on lines 8167 to 8170
ct, ok := val.(*clientTester)
if !ok {
t.Fatalf("value received not a clientTester")
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Comment on lines +8177 to +8179
if i == 10 {
<-goAwayWritten.Done()
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I honestly was just playing around with this until it induced deadlock, it was hard to induce on master. This allows it to induce it because the only way to induce it is to concurrently create a stream (controlBuf.mu then transport.mu), where the closeStream was (transport.mu then controlBuf.mu). If I block where you suggested, there are no concurrent streams trying to be created while the goAway is being sent from server.

t.Errorf("error in lis.Accept(): %v", err)
}
ct := newClientTester(t, conn)
ct.greet()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, but I'm doing this very begrudingly. ServerTester is done this way, and I feel like it' made a lot clearer to the user of the clientTester struct that it gets to the point where the client is happy with it's HTTP2 connection establisment.

test/clienttester.go Show resolved Hide resolved
t.mu.Unlock()
if active == 0 {
t.Close(connectionErrorf(true, nil, "received goaway and there are no active streams"))
for _, stream := range streamsToClose {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@zasweq zasweq assigned dfawley and unassigned zasweq Sep 21, 2022
return
}

streamsToClose := make([]*Stream, 0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you specify the capacity then it won't need to do reallocations as you add to it. In this case, since we don't have any idea how many active streams are before/after the ID (and it isn't worth computing it first) then this should be fine. It will need reallocations (Go doubles the slice capacity as you add so it's O(logN)), but this is a rare case and not worth optimizing.

test/clienttester.go Show resolved Hide resolved
@dfawley dfawley changed the title transport: Fix deadlock in transport transport: Fix deadlock in transport caused by GOAWAY race with new stream creation Sep 21, 2022
@dfawley dfawley assigned zasweq and unassigned dfawley Sep 21, 2022
@zasweq zasweq merged commit b1d7f56 into grpc:master Sep 21, 2022
@zasweq
Copy link
Contributor Author

zasweq commented Oct 20, 2022

I clearly see this in the 1.50.x commit history: https://github.com/grpc/grpc-go/commits/v1.50.x. Scroll to 29 days ago.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Apr 19, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Deadlock on grpc transport
2 participants