segfault in paho_mqtt::async_client::Token::on_complete #46

Closed
ryankurte opened this issue Jul 16, 2019 · 11 comments
Labels
fix added A fix was added to an unreleased branch

Comments

@ryankurte
Contributor

Using a paho_mqtt::Client from an actix::Actor causes surprise segfaults when automatic_reconnect is specified and the initial connection fails.

Manually managing connection state appears to resolve the issue.
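For readers wondering what managing the connection state by hand might look like, here is a minimal sketch of a retry loop around the initial connect. The `Connect` trait and `FlakyBroker` type are hypothetical stand-ins invented for illustration; they are not part of paho_mqtt, and this is not the code from the affected project.

```rust
use std::thread::sleep;
use std::time::Duration;

// Hypothetical stand-in for the real client; NOT a paho_mqtt type.
trait Connect {
    fn try_connect(&mut self) -> Result<(), String>;
}

// Retry the initial connection ourselves instead of relying on automatic
// reconnect. Returns the number of attempts used, or the last error seen.
fn connect_with_retry<C: Connect>(
    client: &mut C,
    max_attempts: u32,
    base_delay: Duration,
) -> Result<u32, String> {
    let mut last_err = String::from("no attempts made");
    for attempt in 1..=max_attempts {
        match client.try_connect() {
            Ok(()) => return Ok(attempt),
            Err(e) => last_err = e,
        }
        if attempt < max_attempts {
            // Linear backoff: wait a little longer after each failure.
            sleep(base_delay * attempt);
        }
    }
    Err(last_err)
}

// Test double that fails a fixed number of times before succeeding.
struct FlakyBroker {
    failures_left: u32,
}

impl Connect for FlakyBroker {
    fn try_connect(&mut self) -> Result<(), String> {
        if self.failures_left > 0 {
            self.failures_left -= 1;
            Err("connection refused".to_string())
        } else {
            Ok(())
        }
    }
}

fn main() {
    let mut broker = FlakyBroker { failures_left: 2 };
    let result = connect_with_retry(&mut broker, 5, Duration::from_millis(10));
    println!("{:?}", result);
}
```

In a real application the trait implementation would wrap whatever connect call the client exposes, and the loop would run wherever the actor starts up.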

I'm not sure if this is an interaction with the actix/tokio runtime or a standalone issue with the library.

Backtrace from the offending thread(s):

03:43:27 [WARN] Token failure! 0x7698f8, 0x0
03:43:27 [DEBUG] paho_mqtt::async_client: Token completed with code: -1

Thread 2 "redacted" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x76b42040 (LWP 3050)]
__GI___pthread_mutex_lock (mutex=0x2820) at pthread_mutex_lock.c:67
67	pthread_mutex_lock.c: No such file or directory.
(gdb) thread apply all bt

Thread 3 (Thread 0x76341040 (LWP 3051)):
#0  0x76bfef10 in __lll_lock_wait (futex=futex@entry=0x6b3280 <mqttasync_mutex_store>, private=0) at lowlevellock.c:46
#1  0x76bf7bd4 in __GI___pthread_mutex_lock (mutex=0x6b3280 <mqttasync_mutex_store>) at pthread_mutex_lock.c:80
#2  0x004c58fe in MQTTAsync_lock_mutex (amutex=<optimized out>)
    at /home/ryan/.cargo/registry/src/github.com-1ecc6299db9ec823/paho-mqtt-sys-0.2.0/paho.mqtt.c/src/MQTTAsync.c:418
#3  0x004c660c in MQTTAsync_cycle (rc=0x763409dc, timeout=1000, sock=0x763409e0)
    at /home/ryan/.cargo/registry/src/github.com-1ecc6299db9ec823/paho-mqtt-sys-0.2.0/paho.mqtt.c/src/MQTTAsync.c:3029
#4  MQTTAsync_receiveThread (n=<optimized out>)
    at /home/ryan/.cargo/registry/src/github.com-1ecc6299db9ec823/paho-mqtt-sys-0.2.0/paho.mqtt.c/src/MQTTAsync.c:1816
#5  0x76bf4fc4 in start_thread (arg=0x76341040) at pthread_create.c:458
#6  0x76d13038 in ?? () at ../sysdeps/unix/sysv/linux/arm/clone.S:76 from /lib/arm-linux-gnueabihf/libc.so.6
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

Thread 2 (Thread 0x76b42040 (LWP 3050)):
#0  __GI___pthread_mutex_lock (mutex=0x2820) at pthread_mutex_lock.c:67
#1  0x004bf2f8 in std::sys::unix::mutex::Mutex::lock::hf596dabe1f241a1c (self=0x2820)
    at /rustc/a53f9df32fbb0b5f4382caaad8f1a46f36ea887c/src/libstd/sys/unix/mutex.rs:55
#2  std::sys_common::mutex::Mutex::raw_lock::h867ed2eb64a2c1c7 (self=0x2820)
    at /rustc/a53f9df32fbb0b5f4382caaad8f1a46f36ea887c/src/libstd/sys_common/mutex.rs:36
#3  std::sync::mutex::Mutex$LT$T$GT$::lock::heedfcbf47011c178 (self=<optimized out>)
    at /rustc/a53f9df32fbb0b5f4382caaad8f1a46f36ea887c/src/libstd/sync/mutex.rs:220
#4  paho_mqtt::async_client::Token::on_complete::h6aee954d4ca9c6d9 (self=0x769030, cli=0x280b, msgid=0, rc=<optimized out>, msg=...)
    at /home/ryan/.cargo/registry/src/github.com-1ecc6299db9ec823/paho-mqtt-0.5.0/src/async_client.rs:233
#5  0x004bf124 in paho_mqtt::async_client::Token::on_failure::h819733d949619332 (context=<optimized out>, rsp=<optimized out>)
    at /home/ryan/.cargo/registry/src/github.com-1ecc6299db9ec823/paho-mqtt-0.5.0/src/async_client.rs:225
#6  0x004c7f50 in MQTTAsync_processCommand ()
    at /home/ryan/.cargo/registry/src/github.com-1ecc6299db9ec823/paho-mqtt-sys-0.2.0/paho.mqtt.c/src/MQTTAsync.c:1417
#7  MQTTAsync_sendThread (n=<optimized out>)
    at /home/ryan/.cargo/registry/src/github.com-1ecc6299db9ec823/paho-mqtt-sys-0.2.0/paho.mqtt.c/src/MQTTAsync.c:1574
#8  0x76bf4fc4 in start_thread (arg=0x76b42040) at pthread_create.c:458
#9  0x76d13038 in ?? () at ../sysdeps/unix/sysv/linux/arm/clone.S:76 from /lib/arm-linux-gnueabihf/libc.so.6
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
@fpagliughi
Contributor

Which version of the crate are you using, the released v0.5? The latest 'develop' branch? Or something else?

The new code, which will hopefully be released soon as v0.6, is a big rewrite of the Token internals. They are now futures, which should interact well with tokio. But the API is fairly backward compatible as well.

I'm also looking into a bug in the underlying Paho C library which could segfault. A fix for that is due to be released shortly.

@ryankurte
Contributor Author

ahh sorry, release 0.5 from crates.io. i saw the futures on the way, looks exciting ^_^

> I'm also looking into a bug in the underlying Paho C library which could segfault. A fix for that is due to be released shortly.

interesting, when i find a moment i’ll have a go at a minimum reproduction to see whether i can determine if the issue occurs without tokio, just needed to get this submitted before i left the office.

@fpagliughi
Contributor

I just updated the code on GitHub (master and release) to use Paho C v1.3.1.
I'm hoping this fixes the issue. Let me know either way.

@fpagliughi
Contributor

I assume this got fixed in v0.6. Re-open if not.

@ryankurte
Contributor Author

i tried the updated version and it appears to segfault on connect, but, haven't had time to dig into whether it's the same or a different issue. will re-open or open a new issue when i have a chance ^_^

@fpagliughi
Contributor

If you're using the default persistence and there's a persistence subdirectory with a name something like <client_id>-<host>-<port_num>, delete it and try again.

It could be this:
eclipse-paho/paho.mqtt.c#745

@fpagliughi
Contributor

Oh. OK. I think I found it. It seems to be a problem with the pre-generated bindings on 32-bit targets. See #62.

You can do a quick test by adding the "build_bindgen" feature to the crate in your Cargo file. In the next rev, I'm going to move to target-specific bindings to fix this.
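A sketch of what enabling that feature might look like in Cargo.toml. The feature name comes from the comment above; the version number is an assumption based on the v0.6 release discussed in this thread, so adjust it to whatever you are actually using.

```toml
[dependencies]
# Version is assumed from this thread; "build_bindgen" regenerates the
# bindings at build time instead of using the pre-generated ones.
paho-mqtt = { version = "0.6", features = ["build_bindgen"] }
```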

As this is ongoing, I'll reopen the issue. Let us know how it works out.

@ryankurte
Contributor Author

ahh, important detail i completely forgot (sorry!) from the last comment is that this is cross built for armhf (from x64), also, thanks for all your help with this!

building with build_bindgen doesn't seem to help, but, it certainly does seem binding related. a backtrace from the faulted process:

[New Thread 0x76b42240 (LWP 6161)]
[New Thread 0x76341240 (LWP 6162)]
[New Thread 0x759ff240 (LWP 6163)]

Thread 3 "rf-eval-daemon-" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x76341240 (LWP 6162)]
strlen () at ../sysdeps/arm/armv6/strlen.S:26
26	../sysdeps/arm/armv6/strlen.S: No such file or directory.
(gdb) bt
#0  strlen () at ../sysdeps/arm/armv6/strlen.S:26
#1  0x00bc621c in std::ffi::c_str::CStr::from_ptr::ha5f44ec31b359b22 () at src/libstd/ffi/c_str.rs:974
#2  0x00666d18 in paho_mqtt::token::TokenInner::on_complete::hc7624b3a015f3367 (self=0xdb0f20, msgid=0, rc=0, err_msg=..., rsp=0x76340b68)
    at /root/.cargo/registry/src/github.com-1ecc6299db9ec823/paho-mqtt-0.6.0/src/token.rs:315
#3  0x00665fac in paho_mqtt::token::TokenInner::on_success::h4e604ef4231d4308 (context=0xdb0f20, rsp=0x76340b68)
    at /root/.cargo/registry/src/github.com-1ecc6299db9ec823/paho-mqtt-0.6.0/src/token.rs:250
#4  0x006793ce in MQTTAsync_receiveThread (n=0xdb008c) at /root/.cargo/registry/src/github.com-1ecc6299db9ec823/paho-mqtt-sys-0.2.1/paho.mqtt.c/src/MQTTAsync.c:2236
#5  0x76bf4fc4 in start_thread (arg=0x76341240) at pthread_create.c:458
#6  0x76d13038 in ?? () at ../sysdeps/unix/sysv/linux/arm/clone.S:76 from /lib/arm-linux-gnueabihf/libc.so.6
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

@fpagliughi
Contributor

Yeah, this definitely looks like the same binding problem I just had. Due to the incorrect word sizes, pointers are misaligned in the structs coming back from the C lib. If you look at the address of the string getting passed to strlen(), it could be something like 0x04, or some other nonsensical value for an address.
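To make the word-size mismatch concrete, here is a small sketch that simulates it with a byte buffer. The struct (`int token; const char *message; int code;`) is hypothetical and is not a real Paho type; the point is only that reading a 32-bit layout through offsets computed for 8-byte pointers picks up the wrong bytes.

```rust
// Hypothetical C struct, for illustration only (not a real Paho type):
//   struct Rsp { int token; const char *message; int code; };

// Field offsets on a 32-bit target, where a pointer is 4 bytes.
const OFF32_TOKEN: usize = 0;
const OFF32_MESSAGE: usize = 4;
const OFF32_CODE: usize = 8;

// Offset of `message` as bindings generated for 8-byte pointers would
// assume (the pointer is padded out to 8-byte alignment).
const OFF64_MESSAGE: usize = 8;

// Lay the struct out the way the 32-bit C library would. The buffer is
// 16 bytes: 12 bytes of struct plus 4 bytes of adjacent (zeroed) memory.
fn write_rsp_32(token: i32, message_addr: u32, code: i32) -> [u8; 16] {
    let mut buf = [0u8; 16];
    buf[OFF32_TOKEN..OFF32_TOKEN + 4].copy_from_slice(&token.to_le_bytes());
    buf[OFF32_MESSAGE..OFF32_MESSAGE + 4].copy_from_slice(&message_addr.to_le_bytes());
    buf[OFF32_CODE..OFF32_CODE + 4].copy_from_slice(&code.to_le_bytes());
    buf
}

// Read `message` with the correct 32-bit layout.
fn read_message_32(buf: &[u8; 16]) -> u64 {
    let mut raw = [0u8; 4];
    raw.copy_from_slice(&buf[OFF32_MESSAGE..OFF32_MESSAGE + 4]);
    u64::from(u32::from_le_bytes(raw))
}

// Read `message` the way bindings with the wrong word size would.
fn read_message_64(buf: &[u8; 16]) -> u64 {
    let mut raw = [0u8; 8];
    raw.copy_from_slice(&buf[OFF64_MESSAGE..OFF64_MESSAGE + 8]);
    u64::from_le_bytes(raw)
}

fn main() {
    let buf = write_rsp_32(0, 0x00db_1000, 4);
    println!("correct layout: {:#x}", read_message_32(&buf)); // 0xdb1000
    println!("wrong layout:   {:#x}", read_message_64(&buf)); // 0x4
}
```

With the wrong layout the "pointer" is really the `code` field, so a call like strlen() on it would dereference address 0x4.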

I just went through and generated a bunch of bindings for different targets, including 32- and 64-bit ARM boards, but I was kinda lazy and just did them all natively on the different boards.

I wonder if the problem is from cross-compiling. Maybe bindgen is using the word size of the host and not the target?

I'll look into that.

@fpagliughi reopened this Jan 8, 2020
@fpagliughi added the "fix added" label (A fix was added to an unreleased branch) Apr 26, 2020
@fpagliughi
Contributor

This should be fixed by the target-specific bindings going into v0.7.

@fpagliughi
Contributor

I think there's still an inherent problem cross-compiling with bindgen, but when cross-compiling using the pre-built bindings, it does now get the word size correct, which was the main initial problem here. So far it has been working pretty well.
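As a rough illustration of how a build script can pick bindings by target rather than by host, here is a minimal sketch. Cargo does set `TARGET` and `CARGO_CFG_TARGET_POINTER_WIDTH` for build scripts, but the `bindings_file` helper and the file naming scheme are hypothetical and are not the layout the crate actually uses.

```rust
use std::env;

// Hypothetical naming scheme for pre-generated bindings files; the real
// crate may organize its bindings differently.
fn bindings_file(target: &str, pointer_width: &str) -> String {
    format!("bindings/bindings_{}_{}bit.rs", target, pointer_width)
}

fn main() {
    // In a build script, Cargo sets both variables to describe the TARGET,
    // not the host doing the compiling. The fallbacks below only exist so
    // this sketch can run standalone.
    let target = env::var("TARGET")
        .unwrap_or_else(|_| "armv7-unknown-linux-gnueabihf".to_string());
    let width = env::var("CARGO_CFG_TARGET_POINTER_WIDTH")
        .unwrap_or_else(|_| "32".to_string());

    println!("would use {}", bindings_file(&target, &width));
}
```

The key design point is that nothing here consults the host's word size, so cross-building from x64 to armhf would still select the 32-bit file.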

If something else comes up, feel free to re-open this or start another issue.
