Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Seg Fault in cmdMessageIDCompare #880

Closed
atomcraft-th opened this issue May 7, 2020 · 7 comments
Closed

Seg Fault in cmdMessageIDCompare #880

atomcraft-th opened this issue May 7, 2020 · 7 comments
Milestone

Comments

@atomcraft-th
Copy link

Describe the bug
I am seeing a segmentation fault, it isn't easy to hit, although the more threads I have sending through mqtt the more likely it seems to be.
It always seems to occur when two threads are trying to send messages concurrently.
It is always in cmdMessageIDCompare, always with a=0x0 and always called by ListFindItem.

Callstack for seg fault:
#0 0x001fd980 in cmdMessageIDCompare (a=0x0, b=0x7effef70) at /home/pi/dev/build/libraries/mqtt/src_c/src/MQTTAsync.c:3442
#1 0x00211b88 in ListFindItem (aList=0x317dc0, content=0x7effef70, callback=0x1fd960 ) at /home/pi/dev/build/libraries/mqtt/src_c/src/LinkedList.c:178
#2 0x001fdabc in MQTTAsync_assignMsgId (m=0x4019d0) at /home/pi/dev/build/libraries/mqtt/src_c/src/MQTTAsync.c:3470
#3 0x001fe8a0 in MQTTAsync_send (handle=0x4019d0, destinationName=0x471af0 "status/driver/06/32/status", payloadlen=4, payload=0x43c6ec, qos=1, retained=0, response=0x7efff008)
at /home/pi/dev/build/libraries/mqtt/src_c/src/MQTTAsync.c:3740
#4 0x001fecd4 in MQTTAsync_sendMessage (handle=0x4019d0, destinationName=0x471af0 "status/driver/06/32/status", message=0x415c14, response=0x7efff008) at /home/pi/dev/build/libraries/mqtt/src_c/src/MQTTAsync.c:3827
#5 0x001d525c in mqtt::async_client::publish (this=0x40169c, msg=std::shared_ptr (use count 2, weak count 0) = {...}) at /home/pi/dev/build/libraries/mqtt/src/src/async_client.cpp:565
#6 0x001d4e8c in mqtt::async_client::publish (this=0x40169c, topic=..., payload=0x7efff144, n=4, qos=1, retained=false) at /home/pi/dev/build/libraries/mqtt/src/src/async_client.cpp:539
#7 ...

Call stack for additional send thread
#0 __lll_lock_wait (futex=futex@entry=0x28e4b0 <mqttasync_mutex_store>, private=0) at lowlevellock.c:43
#1 0x76f84f44 in __GI___pthread_mutex_lock (mutex=0x28e4b0 <mqttasync_mutex_store>) at pthread_mutex_lock.c:80
#2 0x0020df58 in Thread_lock_mutex (mutex=0x28e4b0 <mqttasync_mutex_store>) at /home/pi/dev/build/libraries/mqtt/src_c/src/Thread.c:121
#3 0x001f5648 in MQTTAsync_lock_mutex (amutex=0x28e4b0 <mqttasync_mutex_store>) at /home/pi/dev/build/libraries/mqtt/src_c/src/MQTTAsync.c:541
#4 0x001fda34 in MQTTAsync_assignMsgId (m=0x4019d0) at /home/pi/dev/build/libraries/mqtt/src_c/src/MQTTAsync.c:3465
#5 0x001fe8a0 in MQTTAsync_send (handle=0x4019d0, destinationName=0x73fc55f0 "driver/13/01/measurements/current", payloadlen=1, payload=0x73f7867c, qos=1, retained=0, response=0x750fd8b0)
at /home/pi/dev/build/libraries/mqtt/src_c/src/MQTTAsync.c:3740
#6 0x001fecd4 in MQTTAsync_sendMessage (handle=0x4019d0, destinationName=0x73fc55f0 "driver/13/01/measurements/current", message=0x73f05b24, response=0x750fd8b0) at /home/pi/dev/build/libraries/mqtt/src_c/src/MQTTAsync.c:3827
#7 0x001d525c in mqtt::async_client::publish (this=0x40169c, msg=std::shared_ptr (use count 2, weak count 0) = {...}) at /home/pi/dev/build/libraries/mqtt/src/src/async_client.cpp:565
#8 0x001d4e8c in mqtt::async_client::publish (this=0x40169c, topic=..., payload=0x750fda3c, n=1, qos=1, retained=false) at /home/pi/dev/build/libraries/mqtt/src/src/async_client.cpp:539
#9 ...

To Reproduce
I'm struggling to reproduce with tracing enabled, I'll keep trying with different options.

Expected behavior
Not a seg fault

** Environment (please complete the following information):**

  • OS: raspian10
  • Version 1.3.2 ( with paho-mqtt-c++ 1.1)
@icraggs
Copy link
Contributor

icraggs commented May 8, 2020

There is a fix in the develop branch for #867 which may affect this.

@atomcraft-th
Copy link
Author

I'll give it a try on Monday and see if it works (Bank Holiday today). Thanks.

@atomcraft-th
Copy link
Author

I applied this patch:
01cbe20

Unfortunately, I still see the issue. I have attached a protocol trace. I can make other traces if necessary.

mqtt_protocol.txt

@mtrensch
Copy link
Contributor

Maybe I can offer an idea, where to check. I have not had this issue myself, but I like to find possible races ;)

#0 0x001fd980 in cmdMessageIDCompare (a=0x0, b=0x7effef70) at /home/pi/dev/build/libraries/mqtt/src_c/src/MQTTAsync.c:3442

For me this looks like the content of a list element (probably the command queue) has already been freed, but not unlinked from list.

Looking at the function MQTTAsync_assignMsgId it might be possible that either command queue or response queue might be affected. I think 1.3.2 did not include the outbound message queue yet:
https://github.com/eclipse/paho.mqtt.c/blob/146d210addda572870ee56accbce6cc7bb5f129b/src/MQTTAsync.c#L3474

Probably also locking the "mqttcommand_mutex" which seems to guard the command list could help.

My assumption is that one thread is currently unlinking an item from the List and after it is being set to NULL it is either interrupted (probably here: https://github.com/eclipse/paho.mqtt.c/blob/146d210addda572870ee56accbce6cc7bb5f129b/src/LinkedList.c#L227) or another thread reaches this element while checking for active commands. Probably another race like in the referenced issue #867.

@icraggs
Copy link
Contributor

icraggs commented May 13, 2020

Well the simplest approaches are to a) @atomcraft-th supply me with a test program or b) @atomcraft-th to try with the command mutex locked. I just tried a build with that lock included, and that passed the regression tests.

@atomcraft-th
Copy link
Author

Thanks, I'll try that.

FYI, when I studied the call stack a little, the 'a' parameter which is null comes from
https://github.com/eclipse/paho.mqtt.c/blob/146d210addda572870ee56accbce6cc7bb5f129b/src/LinkedList.c#L178
The current->content value which is the source of the null a value is not null at the time of the segfault. This is a very close race condition and it looks like it is just too late in updating the value.
I noticed one of my other threads was waiting on a sent message token, so I am also looking at the functions
MQTTAsync_getPendingTokens
and
MQTTAsync_isComplete

@mtrensch
Copy link
Contributor

mtrensch commented May 15, 2020

https://github.com/eclipse/paho.mqtt.c/blob/146d210addda572870ee56accbce6cc7bb5f129b/src/LinkedList.c#L178

The current->content value which is the source of the null a value is not null at the time of the segfault. This is a very close race condition and it looks like it is just too late in updating the value.

That is what I was mentioning above. The List entry is going to be removed. Its content is already set to NULL before being dequeued. When during this time another thread dereferences this element (what cmdMessageIDCompare is obviously doing) it segfaults as it accesses a NULL pointer.

The main issue with the suggested fix (c7f6161) with also taking the command lock, is that the other lists (responses and outboundmsgs) are still unguarded. I don't know if this is a real problem but in theory it is.

Maybe a generic per-list-lock inside List Operations would come in handy.
Like this: mtrensch@25b5a5c
NOTE that it did not pass all tests on my local machine, but I am not sure if this comes from the patch or my setup, as I have not yet run any testcases.
NOTE2: It is a little hacky at some places, especially when creating the lock/mutex

@icraggs icraggs added this to the 1.3.3 milestone May 26, 2020
@icraggs icraggs closed this as completed Aug 6, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants