Core dump with "double free detected in tcache 2" or assertion failure when using http2 client via @grpc/grpc-js #38964
Any chance you could provide a repro that doesn't include npm dependencies (or downloading a potentially sketchy zip file 🙃)?
Indeed, I can reproduce on macOS 11.4 with node v15.5.1 (that's what I last had 🙂).
❯ npm run client
> server@1.0.0 client
> node client.js
duplexEcho response: { message: 'Hello duplex world!' }
unaryEcho response: { message: 'Hello unary world!' }
duplexEcho error Error: 1 CANCELLED: Cancelled on client
at Object.callErrorFromStatus (/Users/me/Downloads/repro-node-crash/node_modules/@grpc/grpc-js/build/src/call.js:31:26)
at Object.onReceiveStatus (/Users/me/Downloads/repro-node-crash/node_modules/@grpc/grpc-js/build/src/client.js:390:49)
at Object.onReceiveStatus (/Users/me/Downloads/repro-node-crash/node_modules/@grpc/grpc-js/build/src/client-interceptors.js:299:181)
at /Users/me/Downloads/repro-node-crash/node_modules/@grpc/grpc-js/build/src/call-stream.js:145:78
at processTicksAndRejections (node:internal/process/task_queues:75:11)
duplexEcho status {
code: 1,
details: 'Cancelled on client',
metadata: Metadata { internalRepr: Map(0) {}, options: {} }
}
node(48717,0x104f70e00) malloc: *** error for object 0x104b31db0: pointer being freed was not allocated
node(48717,0x104f70e00) malloc: *** set a breakpoint in malloc_error_break to debug
zsh: abort npm run client
I'll investigate some more and when I come to a conclusion, I'll pull in some guys.
It seems like a bug in HTTP/2. cc @jasnell @addaleax because you're familiar with HTTP/2 in node. I think the line that causes the abort is https://github.com/grpc/grpc-node/blob/663fe77e7286910da0ae067cde5622dd2036bb30/packages/grpc-js/src/call-stream.ts#L679
Apologies, I thought the zip of the repro would be most appropriate. I've posted the repo at https://github.com/mdouglass/repro-node-crash as well if you'd rather get it there. I don't know enough about the @grpc/grpc-js internals to be able to create a repro without that lib quickly. If it turns out to be necessary to investigate it further, I can try to figure it out.
@mdouglass save yourself the time — I'm investigating right now
That reminded me that I should have tried to run this with the grpc-node tracing turned on via:
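A typical way to turn that tracing on, using @grpc/grpc-js's standard GRPC_TRACE and GRPC_VERBOSITY environment variables (shown here as a generic example; the exact values used may have differed):

```sh
GRPC_VERBOSITY=DEBUG GRPC_TRACE=all npm run client
```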
If I do that, the last log line out is indeed the one immediately before https://github.com/grpc/grpc-node/blob/663fe77e7286910da0ae067cde5622dd2036bb30/packages/grpc-js/src/call-stream.ts#L679 that @bl-ue identified. See https://gist.github.com/mdouglass/a4c521167daf8e5e0246bb862b419f36
Yep, I tried those vars too, which is what got me that line.
So, after the long (37-minute) build, I tried running the server, which worked as normal, and the client eventually started executing (this 393 MB node takes an extraordinarily long time to start up 😄) and printed this error instead of the core dump or the assertion failure (both of which I had gotten earlier and got a bunch more of on the next runs):
@jasnell does that look like a real problem or some unrelated issue caused by debugging? (node/deps/nghttp2/lib/nghttp2_stream.c, line 777, at e46c680)
ping @jasnell
I just built with |
Wow, that's a lot of cool output — I didn't get any of that 🤔
Hmm, that almost looks like a bug in nghttp2 — but I'm not experienced enough to say that. ping @jasnell, I just know you're the best person to take a look at this :)
Adding a bit more context: I turned on the nghttp2 debug logging as well and got this: https://gist.github.com/mdouglass/fe9a159b6d9b167ac1da14386bf773a5. I'm suspicious of this:
I'm seeing that the same stream was closed twice and that the address of that stream is within a short distance of where the heap-use-after-free occurred. The first close is right before I receive the unary response and the second close is right after I call cancel on the duplex stream. So to my eyes that looks like the two streams getting confused somehow. Will keep poking at it.
Continuing to poke at this as I can. I added logging at the top and bottom of the close handling. In the non-working version of the repro, we enter it a second time while the first call is still running. Bad version:
The second call never exits, as it ultimately crashes with the use-after-free seen above. I tried reading through the nghttp2 docs to see if this is allowed and can't tell whether it is or not; there is a comment that a single ...
edit: Ignore the "unknown" in the request close line; I am logging that before retrieving the stream pointer since I wanted to make sure the log line was as early as possible.
So I was trying to figure out how we ended up re-entrant. It looks like:
I can't tell (yet) what that maps to, but inside that code we seem to transfer control back to the JavaScript side, and we see the ...
I think this is probably as far as I can take it without finding out on the nghttp2 side whether or not the API is supposed to be re-entrant. If it is, then this becomes a bug in nghttp2. If it is not, then this seems like a flaw in node_http2.cc, which would need to specifically delay any callbacks to JavaScript code until after we finish processing the callback from nghttp2 to node_http2, in order to prevent any chance of JavaScript code triggering reentrancy. @jasnell I would really appreciate you weighing in or pointing me to someone appropriate to hound. Thanks 🙏
Last note on the reentrancy aspect -- both the good and bad repro cases make two calls to ... In the good case, the second call is not reentrant and the call to ... In the bad case, the second call is reentrant (it happens during the ...). If you want to support reentrancy, it seems like we need to mark the stream as in-process closing (maybe via ...).
So, according to nghttp2/nghttp2#1590 and https://nghttp2.org/documentation/programmers-guide.html#remarks, it is not legal in nghttp2 to call these functions from an nghttp2 callback: nghttp2_session_send, nghttp2_session_mem_send, nghttp2_session_recv and nghttp2_session_mem_recv. In node, when in the ... In this particular bug, I close an nghttp2 stream, which ultimately causes ...
A fix for the problem is a more complex issue. I believe the safest approach would be to set a flag when entering/exiting any of the nghttp2 callbacks and have the methods that use the prohibited functions (ConsumeHTTP2Data and SendPendingData) return early while that flag is set. An alternative would be to just address the identified problem with close and have ...
I'm happy to try to make this change, but I'd love someone to verify that there's not a better approach or something I'm missing before I get started.
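To make the flag-plus-early-return idea concrete, here is a small self-contained C++ sketch. Every name in it is hypothetical (it is not the actual node_http2.cc code); it only shows the shape of the approach: an RAII guard marks "we are inside an nghttp2 callback", and the two methods that would call the prohibited nghttp2 functions bail out while the flag is set.

```cpp
#include <cstdio>

// Hypothetical session wrapper; names are illustrative, not node_http2.cc's.
class Http2SessionLike {
 public:
  // RAII guard: marks the session as "inside an nghttp2 callback" for its lifetime.
  class CallbackScope {
   public:
    explicit CallbackScope(Http2SessionLike* session) : session_(session) {
      session_->in_nghttp2_callback_ = true;
    }
    ~CallbackScope() { session_->in_nghttp2_callback_ = false; }

   private:
    Http2SessionLike* session_;
  };

  // Would wrap nghttp2_session_mem_recv(); nghttp2 forbids calling it from a callback.
  void ConsumeHTTP2Data() {
    if (in_nghttp2_callback_) {
      std::puts("ConsumeHTTP2Data: skipped, inside an nghttp2 callback");
      return;  // defer instead of re-entering nghttp2
    }
    std::puts("ConsumeHTTP2Data: processing buffered input");
  }

  // Would wrap nghttp2_session_send(); the same rule applies.
  void SendPendingData() {
    if (in_nghttp2_callback_) {
      std::puts("SendPendingData: skipped, inside an nghttp2 callback");
      return;
    }
    std::puts("SendPendingData: flushing output");
  }

  // Simulates an nghttp2 callback (e.g. an on-stream-close callback) during
  // which JavaScript-triggered work calls back into the session.
  void OnStreamCloseCallback() {
    CallbackScope scope(this);
    SendPendingData();  // early-returns instead of re-entering nghttp2
  }

 private:
  bool in_nghttp2_callback_ = false;
};

int main() {
  Http2SessionLike session;
  session.SendPendingData();        // normal path: allowed
  session.OnStreamCloseCallback();  // re-entrant path: guarded
  return 0;
}
```

Using an RAII guard for the flag keeps it correct on every exit path from the callback, which is why a scope object is preferable to manually setting and clearing a boolean.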
The nghttp2 functions nghttp2_session_send, nghttp2_session_mem_send, nghttp2_session_recv and nghttp2_session_mem_recv are documented as not being able to be called from nghttp2 callback functions. Add a tracker so we can tell when we are within the scope of an nghttp2 callback and early return from ConsumeHTTP2Data and SendPendingData, which are the only two methods that use the prohibited methods. This prevents a use-after-free crash bug that can be triggered otherwise.
Fixes: nodejs#38964
Refs: https://nghttp2.org/documentation/programmers-guide.html#remarks
Hi @mdouglass, I still reproduce this issue.
Hi @kapouer, do you see it only with v12.22.4? The patch was backported to v12.x. I will test it with the repo.
@kumarak I see it with 12.21.0 and 12.22.4 when built for Debian (shared libs...) and with the official 12.22.4 binary.
Thanks @kapouer for reporting! I had a different version of the scripts causing the same double-free error and reported it. That got fixed, but it seems the issue still exists. The reason for the issue is the same: the handling of ...
So just to confirm, this is essentially stating that the release with the fix does not actually fix this?
Hi @tcullum-rh, the issue is not completely fixed with the release. There could still be a version of the scripts that can cause double-free issues. The previous fix had checks based on stream state, which are not perfect and can be bypassed. I also find that the stream states are not consistent across the different versions of node LTS. I think the best solution is to delay the handling of streams on receiving all rst_stream frames with the error code ... The reason I chose to handle ...
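As a rough self-contained C++ sketch of that kind of deferral (all names hypothetical; the real fix has to hook into node's event loop and nghttp2 session state rather than a plain queue): instead of closing the stream from inside the nghttp2 callback that reports the rst_stream, the work is queued and performed only after the callback, and the nghttp2_session_mem_recv() call around it, has returned.

```cpp
#include <cstdint>
#include <cstdio>
#include <queue>

// Hypothetical session wrapper: remember stream-close work requested from
// inside an nghttp2 callback and perform it only after the callback unwinds.
class SessionLike {
 public:
  // Called from inside an nghttp2 callback when an rst_stream arrives.
  void OnRstStreamReceived(int32_t stream_id) {
    pending_closes_.push(stream_id);
    std::printf("queued close for stream %d\n", static_cast<int>(stream_id));
  }

  // Called later, once we are outside all nghttp2 callbacks
  // (for example on the next turn of the event loop).
  void DrainPendingCloses() {
    while (!pending_closes_.empty()) {
      int32_t id = pending_closes_.front();
      pending_closes_.pop();
      std::printf("closing stream %d now\n", static_cast<int>(id));
      // ...here it is safe to call nghttp2_session_* functions again...
    }
  }

 private:
  std::queue<int32_t> pending_closes_;
};

int main() {
  SessionLike session;
  session.OnRstStreamReceived(65);  // arrives mid-callback
  session.DrainPendingCloses();     // handled once the callback has unwound
  return 0;
}
```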
PR-URL: nodejs#39423
Fixes: nodejs#38964
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Matteo Collina <matteo.collina@gmail.com>
I still see this crash in node 16.14.2, which includes the second fix that @kumarak did (#39622), when using grpc-js with TLS. (NB: only reproducible with TLS; the same loadtest with TLS turned off has no crashes.) The stack trace is either
or in other cases, just
I have no specific event orderings that reproduce it, but I can make it happen reasonably reliably by running a loadtest which involves a lot of grpc call/client work and churn.
@kumarak could you take a look?
@SimonWoolf, do you have a script that can help me reproduce the issue with v16.14.2? The fix delays the processing of ...
I'm afraid I don't have a script that I'm able to share, sorry (it's a loadtest of a non-open-source distributed system; I haven't been able to generate a minimal reproduction). But if you have some debug flags I could turn on, or a custom build of node with extra logging or something that you'd like me to run the tests against to get more information when the crash happens, I'd be happy to do so.
@SimonWoolf I'm looking into this issue right now. Is there any chance you could run the tests against a build with ASAN? The stack trace provided by it would give us a better clue of where the problematic ...
Sure - unfortunately I'm about to go on holiday; I'll be back and hope to be able to get you that info on the 7th/8th of June.
Sorry for the delay; some other things came up (plus I had some trouble compiling the ASAN build). Running the loadtest with the ASAN build of node, most of the instances quickly died with the following stack trace. This is a different error to the above; I have no idea whether this is the same bug being exposed earlier by the ASAN build or whether it is some other issue, but hopefully you do 🙂
@facutuesca I have managed to reproduce with an ASAN build, see:
@Matt-Esch any possibility of having code that reproduces the issue?
@santigimeno I can only reproduce this on a large proprietary codebase; I've not been able to isolate it to a simple repro, unfortunately. I do think the stack traces give us a lot of context. My feeling is that the stream should have been removed from the session in the first close call, and the second call to close (on the incoming RST) should not have been possible. So either the close callback is re-entrant (nghttp2 says explicitly not to do that), or the stream is not removed from the session, which raises the question of why the stream is not removed on RST even if the underlying data is freed.
Looking over the ... I am going to instrument these points to see if they are the root cause.
Logs from nghttp2 when segfault occurs on stream 65:
Looks like the transmission of a frame completes after the stream has already been closed.
What steps will reproduce the bug?
repro-node-crash.zip
Run the following steps on the same machine:
npm ci
npm run server
npm run client
How often does it reproduce? Is there a required condition?
On my machine this is 100% reproducible in node v16.0 and up. I have not tried pre-v16 versions.
What is the expected behavior?
It is expected that the client process runs to completion without error.
What do you see instead?
The client process dies with a core dump or an assertion failure:
Additional information
This is related to grpc/grpc-node#1464.
There is a workaround available in my example client.js if you comment out line 45 and uncomment line 46 (delaying stream.cancel via the use of setImmediate). It seems to require this particular pattern of calls in the @grpc/grpc-js library as well: