Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

don't unlock the receive stream mutex for copying from STREAM frames #3290

Merged
merged 2 commits into from
Oct 19, 2021

Conversation

marten-seemann
Copy link
Member

@marten-seemann marten-seemann commented Oct 16, 2021

Fixes #3257. @arashpayan can you review this PR please? I'm planning to cut a new quic-go release early next week, and would love to get this fix in.

The problem here was the mutex in the receiveStream:
https://github.com/lucas-clemente/quic-go/blob/a22a9eae30618491bf6c40ad20918650e2a2ab7b/receive_stream.go#L169-L175
We unlocked this mutex when copying data from the STREAM frame to the b slice of the Read call. If during this time CancelRead is called:
https://github.com/lucas-clemente/quic-go/blob/a22a9eae30618491bf6c40ad20918650e2a2ab7b/receive_stream.go#L200-L209
This would call onStreamCompleted for the first time. onStreamCompleted removes the stream from the streams map.

If the STREAM frame copied in the Read call was the last STREAM frame of the stream (i.e. the one carrying the FIN bit), we'd call onStreamCompleted again, trying to delete the previously deleted stream, leading to an INTERNAL_ERROR.

This PR also adds an integration test which is capable of reproducing the bug reliable within just a few repetitions.

@codecov
Copy link

codecov bot commented Oct 16, 2021

Codecov Report

Merging #3290 (2c8b939) into master (bb8d484) will decrease coverage by 0.00%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #3290      +/-   ##
==========================================
- Coverage   85.69%   85.69%   -0.00%     
==========================================
  Files         132      132              
  Lines        9776     9774       -2     
==========================================
- Hits         8377     8375       -2     
  Misses       1024     1024              
  Partials      375      375              
Impacted Files Coverage Δ
receive_stream.go 95.95% <ø> (-0.05%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update bb8d484...2c8b939. Read the comment docs.

@arashpayan
Copy link
Contributor

I'll be able to test this out either tomorrow or Monday, @marten-seemann.

@marten-seemann
Copy link
Member Author

An alternative to this PR would be to check if the stream was reset after https://github.com/lucas-clemente/quic-go/blob/a22a9eae30618491bf6c40ad20918650e2a2ab7b/receive_stream.go#L169-L175
i.e. adding another

if s.canceledRead {
	return false, 0, s.cancelReadErr
}

The code, as well as a benchmark, are in this branch.
As far as I can see, unlocking and locking the mutex takes more time than just keeping the mutex locked throughout the copy:

benchmark                     old ns/op     new ns/op     delta
BenchmarkStreamRead128-16     5951574       4955524       -16.74%
BenchmarkStreamRead256-16     3375841       2981710       -11.68%
BenchmarkStreamRead512-16     2079287       1891235       -9.04%

benchmark                     old allocs     new allocs     delta
BenchmarkStreamRead128-16     40043          38129          -4.78%
BenchmarkStreamRead256-16     20042          20033          -0.04%
BenchmarkStreamRead512-16     10057          10031          -0.26%

benchmark                     old bytes     new bytes     delta
BenchmarkStreamRead128-16     3918941       3719895       -5.08%
BenchmarkStreamRead256-16     3281680       3258756       -0.70%
BenchmarkStreamRead512-16     3012187       2934542       -2.58%

where old is the version with the added canceledRead check new is the version proposed in this PR.

@arashpayan
Copy link
Contributor

I just tried this branch out and ran the problematic job with it for an hour. It never failed once. Usually it would fail 2-3 times in an hour, so this seems to have fixed it. Thank you!

@marten-seemann marten-seemann merged commit b935a54 into master Oct 19, 2021
@aschmahmann aschmahmann mentioned this pull request Dec 1, 2021
80 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

INTERNAL_ERROR while writing/reading to/from a stream
3 participants