Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[NNG]: reap woker reaps nni_pipe twice. #1411

Closed
Nereuxofficial opened this issue Aug 7, 2023 · 22 comments · Fixed by nanomq/NanoNNG#707
Closed

[NNG]: reap woker reaps nni_pipe twice. #1411

Nereuxofficial opened this issue Aug 7, 2023 · 22 comments · Fixed by nanomq/NanoNNG#707
Assignees
Labels
bug feedback Waiting for futher feedbacks

Comments

@Nereuxofficial
Copy link

Nereuxofficial commented Aug 7, 2023

Describe the bug
While fuzzing the broker with my self-written fuzzer it quickly crashes with this end output:

Expected behavior
The broker should not crash upon receiving any message.

Actual Behavior
The Broker crashes. Here is the last output to console after crashing:

Warning: close pipe & kick client due to KeepAlive timeout!panic: pthread_mutex_lock: Invalid argumentThis message is indicative of a BUG.Report this at https://github.com/emqx/nanomqnanomq(nni_panic+0x117) [0x559ef6e85757]nanomq(nni_task_wait+0x21) [0x559ef6e8ac01]nanomq(+0x59ef5) [0x559ef6e99ef5]nanomq(+0x4583b) [0x559ef6e8583b]nanomq(+0x4639a) [0x559ef6e8639a]nanomq(+0x4afeb) [0x559ef6e8afeb]nanomq(+0x4d8a6) [0x559ef6e8d8a6]/usr/lib/libc.so.6(+0x8c9eb) [0x7f0db3e8c9eb]/usr/lib/libc.so.6(+0x11123c) [0x7f0db3f1123c]

To Reproduce
This is where it gets a bit funky. Since it is a locking bug it is not guaranteed to be reproducible on every run of the program. But when fuzzing i've always gotten it so far and with replaying it happens 6 times out of 10.

Here are two logs reproducing the behaviour(the second one with log_level debug):
https://gist.github.com/Nereuxofficial/2aef73e8403a6445cca91e6c34d3b296

And here is another log with fuzz mode from my server

Should be noted that last stdout and last stderr is the complete output of the broker. I've just forgotten to update the logging.

I would like to do a small C program here but due to this requiring to rewrite the fuzzer in C this is not an option. The easiest way to reproduce this is with the fuzzer:

After making sure your system nanomq is the newest version do this:

  1. Install Rust
  2. Clone rusty-FUME and checkout repro_nanomq_mutex_lock via:
git clone https://github.com/MCloudTT/rusty-FUME
cd rusty-FUME
git checkout repro_nanomq_mutex_lock
  1. Run: cargo run -r -- --broker-command "nanomq start" replay (Here you can also attach a log_level, though in my experience the crash rate was lower this way...)

If 3 doesn't work like on my server try to find the bug again using fuzzing:
3. Run: cargo run -r -- --broker-command "nanomq start" fuzz

** Environment Details **

  • NanoMQ version: NanoMQ Messaging Engine for Edge Computing & Messaging bus v0.19.5-2 (latest git)
  • Operating system and version: Garuda Linux with Linux 6.4.8-zen1-1-zen
  • Compiler and language used: GCC as built with nanomq-git
  • Fuzzing with the process described above

Additional context
If you run into problems with reproducing or using the fuzzer feel free to ask me for help and also if you have suggestions for improvements to the fuzzer. Also I'd be really interested in the fix and what caused it so it would be nice to know what caused this under what circumstances.(Also if you happen to stumble on other bugs with this)

@JaylinYu
Copy link
Member

JaylinYu commented Aug 8, 2023

Thanks for spotting the issue of NAnoMQ. But I am unable to compile & run your repo.
my rust version is
rustc 1.55.0-nightly (5a7834050 2021-06-23)

And I get
error: Found argument '-r' which wasn't expected, or isn't valid in this context

or

error: failed to parse manifest at /mnt/xvdc/data/Develop/github/rusty-FUME/Cargo.toml

Caused by:
feature edition2021 is required

consider adding cargo-features = ["edition2021"] to the manifest

while trying to run your test.

@JaylinYu JaylinYu added the Investigation quiestion remains ambiguous label Aug 8, 2023
@Nereuxofficial
Copy link
Author

Nereuxofficial commented Aug 8, 2023

Thanks for the detailed error logs.
rustup update(Or otherwise updating via the package manager) Should do the trick. edition2021 is stabilized in stable 1.56. As for -r, that should work in newer rust versions too but is just short for --release if that doesn't work

@JaylinYu
Copy link
Member

JaylinYu commented Aug 9, 2023

Forget about my Rust env issue.

After careful examination, The issue you report is really interesting, I found several other issues too.
But it should be an NNG problem, highly suspicious as a Bug.

For anyone whom may be concerned:
Somehow, the nni_pipe got destroyed before it was even created(Actually, destroy a pipe twice, put it to link node twice). This brings the whole NNG state machine down.
As for any other symptoms, they are all side effects of this problem.

As the log shows, 0x7fff900198e0 got destroyed twice.

2023-08-08 20:47:05 [164526] ERROR /mnt/xvdc/data/Develop/github/emqx/nanomq/nng/src/core/pipe.c:277  pipe_create: !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! pipe 0x7fff900198e0 created!!!!!!!!!!!!!!!!!!!!!!!!!
2023-08-08 20:47:05 [164535] ERROR /mnt/xvdc/data/Develop/github/emqx/nanomq/nng/src/core/pipe.c:72  pipe_destroy: !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! pipe 0x7fffac000f90 destroy !!!!!!!!!!!!!!!!!!!!!!!!!
2023-08-08 20:47:05 [164527] ERROR /mnt/xvdc/data/Develop/github/emqx/nanomq/nng/src/core/pipe.c:277  pipe_create: !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! pipe 0x7fffa4007460 created!!!!!!!!!!!!!!!!!!!!!!!!!
2023-08-08 20:47:05 [164535] ERROR /mnt/xvdc/data/Develop/github/emqx/nanomq/nng/src/core/pipe.c:72  pipe_destroy: !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! pipe 0x7fff900198e0 destroy !!!!!!!!!!!!!!!!!!!!!!!!!
2023-08-08 20:47:05 [164527] ERROR /mnt/xvdc/data/Develop/github/emqx/nanomq/nng/src/core/pipe.c:277  pipe_create: !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! pipe 0x7fffa400be10 created!!!!!!!!!!!!!!!!!!!!!!!!!
2023-08-08 20:47:05 [164535] ERROR /mnt/xvdc/data/Develop/github/emqx/nanomq/nng/src/core/pipe.c:72  pipe_destroy: !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! pipe 0x7fff900198e0 destroy !!!!!!!!!!!!!!!!!!!!!!!!!
free(): double free detected in tcache 2
2023-08-08 20:47:05 [164527] ERROR /mnt/xvdc/data/Develop/github/emqx/nanomq/nng/src/core/pipe.c:277  pipe_create: !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! pipe 0x7fffa4010ae0 created!!!!!!!!!!!!!!!!!!!!!!!!!

@Nereuxofficial May I ask about your fuzzing methodology? Are you only playing with MQTT? or also TCP part?
Because this looks like a low-level bug to me, the reaper worker is a single thread. This is not a vulnerability issue, not even a data racing issue.
Could you do the same fuzzing to the NNG TCP port? Thus I can find out the difference.

Gonna discuss with @gdamore about this.

Anyway, I will tag it as help wanted for now. This is not an easy issue to address. Need to work with upstream

@JaylinYu JaylinYu changed the title Bug: panic: pthread_mutex_lock: Invalid argument [NNG]: reap woker reaps nni_pipe before it was created Aug 9, 2023
@JaylinYu
Copy link
Member

JaylinYu commented Aug 9, 2023

ASAN also tells same story

==3056238==ERROR: AddressSanitizer: heap-use-after-free on address 0x61d000145b88 at pc 0x55cbbe4bb85c bp 0x7fb7243cac20 sp 0x7fb7243cac10
READ of size 8 at 0x61d000145b88 thread T38
    #0 0x55cbbe4bb85b in reap_worker /home/maoyi/github/emqx/nanomq/nng/src/core/reap.c:57
    #1 0x55cbbe4c7642 in nni_thr_wrap /home/maoyi/github/emqx/nanomq/nng/src/core/thread.c:94
    #2 0x55cbbe4d07eb in nni_plat_thr_main /home/maoyi/github/emqx/nanomq/nng/src/platform/posix/posix_thread.c:266
    #3 0x7fb73a094b42 in start_thread nptl/pthread_create.c:442
    #4 0x7fb73a1269ff  (/lib/x86_64-linux-gnu/libc.so.6+0x1269ff)

0x61d000145b88 is located 264 bytes inside of 2304-byte region [0x61d000145a80,0x61d000146380)
freed by thread T38 here:
    #0 0x7fb73a4b4537 in __interceptor_free ../../../../src/libsanitizer/asan/asan_malloc_linux.cpp:127
    #1 0x55cbbe4cbeb9 in nni_free /home/maoyi/github/emqx/nanomq/nng/src/platform/posix/posix_alloc.c:33
    #2 0x55cbbe4b9ac3 in pipe_destroy /home/maoyi/github/emqx/nanomq/nng/src/core/pipe.c:83
    #3 0x55cbbe4bb873 in reap_worker /home/maoyi/github/emqx/nanomq/nng/src/core/reap.c:58
    #4 0x55cbbe4c7642 in nni_thr_wrap /home/maoyi/github/emqx/nanomq/nng/src/core/thread.c:94
    #5 0x55cbbe4d07eb in nni_plat_thr_main /home/maoyi/github/emqx/nanomq/nng/src/platform/posix/posix_thread.c:266
    #6 0x7fb73a094b42 in start_thread nptl/pthread_create.c:442

previously allocated by thread T33 here:
    #0 0x7fb73a4b4a57 in __interceptor_calloc ../../../../src/libsanitizer/asan/asan_malloc_linux.cpp:154
    #1 0x55cbbe4cbe90 in nni_zalloc /home/maoyi/github/emqx/nanomq/nng/src/platform/posix/posix_alloc.c:26
    #2 0x55cbbe4ba24a in pipe_create /home/maoyi/github/emqx/nanomq/nng/src/core/pipe.c:252
    #3 0x55cbbe4badaa in nni_pipe_create_listener /home/maoyi/github/emqx/nanomq/nng/src/core/pipe.c:329
    #4 0x55cbbe4c293f in nni_listener_add_pipe /home/maoyi/github/emqx/nanomq/nng/src/core/socket.c:1661
    #5 0x55cbbe4b2a49 in listener_accept_cb /home/maoyi/github/emqx/nanomq/nng/src/core/listener.c:359
    #6 0x55cbbe4c628a in nni_taskq_thread /home/maoyi/github/emqx/nanomq/nng/src/core/taskq.c:50
    #7 0x55cbbe4c7642 in nni_thr_wrap /home/maoyi/github/emqx/nanomq/nng/src/core/thread.c:94
    #8 0x55cbbe4d07eb in nni_plat_thr_main /home/maoyi/github/emqx/nanomq/nng/src/platform/posix/posix_thread.c:266
    #9 0x7fb73a094b42 in start_thread nptl/pthread_create.c:442

Thread T38 created by T0 here:
    #0 0x7fb73a458685 in __interceptor_pthread_create ../../../../src/libsanitizer/asan/asan_interceptors.cpp:216
    #1 0x55cbbe4d091b in nni_plat_thr_init /home/maoyi/github/emqx/nanomq/nng/src/platform/posix/posix_thread.c:279
    #2 0x55cbbe4c78ee in nni_thr_init /home/maoyi/github/emqx/nanomq/nng/src/core/thread.c:121
    #3 0x55cbbe4bbaf4 in nni_reap_sys_init /home/maoyi/github/emqx/nanomq/nng/src/core/reap.c:110
    #4 0x55cbbe4b08ed in nni_init_helper /home/maoyi/github/emqx/nanomq/nng/src/core/init.c:36
    #5 0x55cbbe4d0cc0 in nni_plat_init /home/maoyi/github/emqx/nanomq/nng/src/platform/posix/posix_thread.c:422
    #6 0x55cbbe4b0960 in nni_init /home/maoyi/github/emqx/nanomq/nng/src/core/init.c:58
    #7 0x55cbbe50d5c0 in nng_mtx_alloc /home/maoyi/github/emqx/nanomq/nng/src/supplemental/util/platform.c:93
    #8 0x55cbbe4530b6 in log_init /home/maoyi/github/emqx/nanomq/nanomq/mqtt_api.c:226
    #9 0x55cbbe497ddc in broker_start /home/maoyi/github/emqx/nanomq/nanomq/apps/broker.c:1617
    #10 0x55cbbe453db7 in main /home/maoyi/github/emqx/nanomq/nanomq/nanomq.c:142
    #11 0x7fb73a029d8f in __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58

Thread T33 created by T0 here:
    #0 0x7fb73a458685 in __interceptor_pthread_create ../../../../src/libsanitizer/asan/asan_interceptors.cpp:216
    #1 0x55cbbe4d091b in nni_plat_thr_init /home/maoyi/github/emqx/nanomq/nng/src/platform/posix/posix_thread.c:279
    #2 0x55cbbe4c78ee in nni_thr_init /home/maoyi/github/emqx/nanomq/nng/src/core/thread.c:121
    #3 0x55cbbe4c65ac in nni_taskq_init /home/maoyi/github/emqx/nanomq/nng/src/core/taskq.c:95
    #4 0x55cbbe4c730c in nni_taskq_sys_init /home/maoyi/github/emqx/nanomq/nng/src/core/taskq.c:294
    #5 0x55cbbe4b08df in nni_init_helper /home/maoyi/github/emqx/nanomq/nng/src/core/init.c:35
    #6 0x55cbbe4d0cc0 in nni_plat_init /home/maoyi/github/emqx/nanomq/nng/src/platform/posix/posix_thread.c:422
    #7 0x55cbbe4b0960 in nni_init /home/maoyi/github/emqx/nanomq/nng/src/core/init.c:58
    #8 0x55cbbe50d5c0 in nng_mtx_alloc /home/maoyi/github/emqx/nanomq/nng/src/supplemental/util/platform.c:93
    #9 0x55cbbe4530b6 in log_init /home/maoyi/github/emqx/nanomq/nanomq/mqtt_api.c:226
    #10 0x55cbbe497ddc in broker_start /home/maoyi/github/emqx/nanomq/nanomq/apps/broker.c:1617
    #11 0x55cbbe453db7 in main /home/maoyi/github/emqx/nanomq/nanomq/nanomq.c:142
    #12 0x7fb73a029d8f in __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58

SUMMARY: AddressSanitizer: heap-use-after-free /home/maoyi/github/emqx/nanomq/nng/src/core/reap.c:57 in reap_worker
Shadow bytes around the buggy address:
  0x0c3a80020b20: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x0c3a80020b30: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c3a80020b40: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c3a80020b50: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x0c3a80020b60: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
=>0x0c3a80020b70: fd[fd]fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x0c3a80020b80: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x0c3a80020b90: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x0c3a80020ba0: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x0c3a80020bb0: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x0c3a80020bc0: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07 
  Heap left redzone:       fa
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
  Left alloca redzone:     ca
  Right alloca redzone:    cb
  Shadow gap:              cc
==3056238==ABORTING


@JaylinYu JaylinYu changed the title [NNG]: reap woker reaps nni_pipe before it was created [NNG]: reap woker reaps nni_pipe twice. Aug 9, 2023
@JaylinYu JaylinYu self-assigned this Aug 9, 2023
JaylinYu added a commit to nanomq/NanoNNG that referenced this issue Aug 9, 2023
Signed-off-by: JaylinYu <letrangerjaylin@gmail.com>
@Nereuxofficial
Copy link
Author

Nereuxofficial commented Aug 9, 2023

@Nereuxofficial May I ask about your fuzzing methodology? Are you only playing with MQTT? or also TCP part?
Because this looks like a low-level bug to me, the reaper worker is a single thread. This is not a vulnerability issue, not even a data racing issue.

That sounds really interesting. So the nni_pipe is reaped before it would be created and then trying to be destroyed? Sounds like memory corruption weirdness without memory corruption... Maybe something reads out of bounds in a union/struct and ASAN cannot detect it, which causes this weird behaviour?

Of course. I am only fuzzing the MQTT part(which is fuzzed via a Markov Model), though i cannot guarantee all connections are closed properly.
The fuzzing itself spawns 100 asynchronous tasks by default(configurable with --tasks num) and fuzzes the target given with --target.

It is possible to fuzz any TCP port, as long as if a new connection is established bytes can be sent, which is a small check implemented to prevent fuzzing invalid targets.(If anything turns out to be a problem though, I'll be happy to provide fixes). Also I'd be really interested in the fixes, as I'd like to include them in the report about my project :)

JaylinYu added a commit to nanomq/NanoNNG that referenced this issue Aug 15, 2023
    This is not final fix.

Signed-off-by: JaylinYu <letrangerjaylin@gmail.com>
JaylinYu added a commit to nanomq/NanoNNG that referenced this issue Aug 15, 2023
    This is not final fix.

Signed-off-by: JaylinYu <letrangerjaylin@gmail.com>
@Nereuxofficial
Copy link
Author

I tested this again and the prevent collateral damage fix did not prevent the issue from occuring on my system. Are there any updates on this?

@JaylinYu
Copy link
Member

JaylinYu commented Sep 7, 2023

I tested this again and the prevent collateral damage fix did not prevent the issue from occuring on my system. Are there any updates on this?

Unfortunately not. That PR is a WIP And Only fix side effect.
I Don't have time for this issue yet. We are busy in QUIC developing. I would suggest you do the same fuzzing on NNG/SP tcp protocol, this would be helpful.
Any helps from Community will be appreciated.

@Nereuxofficial
Copy link
Author

I tested this again and the prevent collateral damage fix did not prevent the issue from occuring on my system. Are there any updates on this?

Unfortunately not. That PR is a WIP And Only fix side effect. I Don't have time for this issue yet. We are busy in QUIC developing. I would suggest you do the same fuzzing on NNG/SP tcp protocol, this would be helpful. Any helps from Community will be appreciated.

I will try to troubleshoot the issue next week and maybe add additional information. Unfortunately this is not really my expertise

@JaylinYu
Copy link
Member

JaylinYu commented Sep 9, 2023

Thanks, it would be great to verify whether the issue is only with MQTT or a pre-existing bug in nng reap. at least set the right direction

@destroyer5656
Copy link

Just wanted to mention that I have also seen this issue now that we upgraded to .19.5 , it is intermittent though and our process scheduler was able to restart nanomq successfully.

@JaylinYu
Copy link
Member

JaylinYu commented Sep 27, 2023

Just wanted to mention that I have also seen this issue now that we upgraded to .19.5 , it is intermittent though and our process scheduler was able to restart nanomq successfully.

interesting, do you mean this doesn't occurs before 0.19.5??
May I ask about the previous version you using?

@destroyer5656
Copy link

destroyer5656 commented Sep 27, 2023

Previous version we were using was:
tag 0.8.3
commit sha:
ca3fd35

Also confirming we did not see this issue on the older tag/sha.

@JaylinYu
Copy link
Member

News update:
Thanks to @destroyer5656 ' hint, we finally resolved this issue. It's not NNG's fault, it is data racing on a session flag.
Thank you guys all for spotting this bug.

@JaylinYu JaylinYu added bug feedback Waiting for futher feedbacks and removed help wanted Hacktoberfest Investigation quiestion remains ambiguous labels Oct 10, 2023
@destroyer5656
Copy link

destroyer5656 commented Oct 10, 2023

Do I have to update the nanomq sha or can I just update the NNG repo sha to get this fix pulled in? Or even better if we could hotfix this into the release branch so we don't have to fork nanomq again.

@JaylinYu
Copy link
Member

Do I have to update the nanomq sha or can I just update the NNG repo sha to get this fix pulled in? Or even better if we could hotfix this into the release branch so we don't have to fork nanomq again.

I recommend to wait for the next release, which is coming in this month. We only do hotfixes for the LTS version. Currently, no LTS is activated (0.6.6 is outdated).

But this is not hard to backport the fix on your own. You only need to update the NanoNNG repo, or use the most updated main branch of nanomq and nanonng.

@destroyer5656
Copy link

Sounds good will wait for the next release for now.

@JaylinYu
Copy link
Member

@Nereuxofficial Looking forward to your feedback and verification also.

@Nereuxofficial
Copy link
Author

@Nereuxofficial Looking forward to your feedback and verification also.

Sorry, i tried it yesterday but i had to fix some stuff first in order to get the fuzzer to work again.

Fixed here too! Good work on the fix!

@destroyer5656
Copy link

Do you have a specific date for the release cut that will contain this fix @JaylinYu ?

@JaylinYu
Copy link
Member

Bug-fix release will be coming out at the end of Oct

@destroyer5656
Copy link

destroyer5656 commented Oct 24, 2023

Will it be out on 10/30?

@JaylinYu
Copy link
Member

Definitely

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug feedback Waiting for futher feedbacks
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants