
Stack smashing #745 (Closed)

fpagliughi opened this issue Oct 11, 2019 · 15 comments
@fpagliughi (Contributor)

I've been having a recurring issue with a crash of my sample apps in Linux (Ubuntu 18.04 at the moment) where it complains about stack smashing.

*** stack smashing detected ***: <unknown> terminated
Aborted (core dumped)

I believe that it happens when I switch between different versions of the Paho C library during development, so I don't think it's a problem most users would commonly see.

I believe that it has something to do with the default file persistence used by the library. One guess is that if the library tries to read a persistence file created by a different version of the lib, it gets some bad data and things go wrong. If that's the case, then perhaps the file persistence could write a version tag to the file?
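A version tag along those lines could be sketched as follows. This is a hypothetical header format; the magic string, version constant, and function names are all illustrative and are not part of the actual Paho persistence layout:

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical header for a persistence file: a magic string plus a
 * library version number, so that a newer or older library can detect
 * a mismatch and discard the file instead of misreading the payload. */
#define PST_MAGIC   "PAHO-PST"
#define PST_VERSION 1u

/* Write the magic string and version at the start of the file.
 * Returns 0 on success, -1 on a write error. */
int pst_write_header(FILE *f)
{
    unsigned int v = PST_VERSION;
    if (fwrite(PST_MAGIC, 1, sizeof(PST_MAGIC), f) != sizeof(PST_MAGIC))
        return -1;
    if (fwrite(&v, sizeof v, 1, f) != 1)
        return -1;
    return 0;
}

/* Returns 0 if the header matches the running library's version,
 * -1 on a magic/version mismatch (caller should discard the file). */
int pst_check_header(FILE *f)
{
    char magic[sizeof(PST_MAGIC)];
    unsigned int v = 0;
    if (fread(magic, 1, sizeof magic, f) != sizeof magic)
        return -1;
    if (memcmp(magic, PST_MAGIC, sizeof magic) != 0)
        return -1;
    if (fread(&v, sizeof v, 1, f) != 1 || v != PST_VERSION)
        return -1;
    return 0;
}
```

A library upgrade would bump `PST_VERSION`, turning "bad data and things go wrong" into a clean header-mismatch error at open time.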

Here's a maximum-level trace dump when trying to publish a message:

Publishing a message on the 'test' topic
20191011 145742.921 (2722629376)  (1)> MQTTAsync_cycle:3742
20191011 145742.921 (2722629376)   (2)> Socket_getReadySocket:237
20191011 145742.921 (2731022080)  (1)< MQTTAsync_checkTimeouts:1852
20191011 145742.921 (2731022080)  (1)> Thread_wait_cond:413
20191011 145742.921 (2753394816) (0)> MQTTAsync_sendMessage:3546
20191011 145742.921 (2753394816)  (1)> MQTTAsync_send:3477
20191011 145742.921 (2753394816)   (2)> UTF8_validateString:157
20191011 145742.921 (2753394816)    (3)> UTF8_validate:131
20191011 145742.921 (2753394816)    (3)< UTF8_validate:143 (1)
20191011 145742.921 (2753394816)   (2)< UTF8_validateString:159 (1)
20191011 145742.921 Allocating 160 bytes in heap at file /home/fmp/mqtt/paho-rust/paho-mqtt-sys/paho.mqtt.c/src/MQTTAsync.c line 3509 ptr 0x564484006b30

20191011 145742.921 Allocating 16 bytes in heap at file /home/fmp/mqtt/paho-rust/paho-mqtt-sys/paho.mqtt.c/src/MQTTProtocolClient.c line 874 ptr 0x564484006eb0

20191011 145742.921 (2753394816)   (2)> MQTTStrncpy:851
20191011 145742.921 (2753394816)   (2)< MQTTStrncpy:861
20191011 145742.921 Allocating 16 bytes in heap at file /home/fmp/mqtt/paho-rust/paho-mqtt-sys/paho.mqtt.c/src/MQTTAsync.c line 3527 ptr 0x564484006f40

20191011 145742.921 (2753394816)   (2)> MQTTAsync_addCommand:1050
20191011 145742.921 Allocating 32 bytes in heap at file /home/fmp/mqtt/paho-rust/paho-mqtt-sys/paho.mqtt.c/src/LinkedList.c line 95 ptr 0x564484007270

20191011 145742.921 (2753394816)    (3)> MQTTAsync_persistCommand:723
20191011 145742.921 Allocating 32 bytes in heap at file /home/fmp/mqtt/paho-rust/paho-mqtt-sys/paho.mqtt.c/src/MQTTAsync.c line 794 ptr 0x5644840069b0

20191011 145742.921 Allocating 64 bytes in heap at file /home/fmp/mqtt/paho-rust/paho-mqtt-sys/paho.mqtt.c/src/MQTTAsync.c line 795 ptr 0x564484006d50

20191011 145742.921 (2753394816)     (4)> pstput:173
20191011 145742.921 Allocating 48 bytes in heap at file /home/fmp/mqtt/paho-rust/paho-mqtt-sys/paho.mqtt.c/src/MQTTPersistenceDefault.c line 181 ptr 0x56448400d110

20191011 145742.921 Freeing 48 bytes in heap at file /home/fmp/mqtt/paho-rust/paho-mqtt-sys/paho.mqtt.c/src/MQTTPersistenceDefault.c line 203, heap use now 2032 bytes

20191011 145742.921 (2753394816)     (4)< pstput:206 (0)
20191011 145742.921 Freeing 32 bytes in heap at file /home/fmp/mqtt/paho-rust/paho-mqtt-sys/paho.mqtt.c/src/MQTTAsync.c line 847, heap use now 1984 bytes

20191011 145742.921 Freeing 64 bytes in heap at file /home/fmp/mqtt/paho-rust/paho-mqtt-sys/paho.mqtt.c/src/MQTTAsync.c line 849, heap use now 1952 bytes

20191011 145742.921 (2753394816)    (3)< MQTTAsync_persistCommand:850 (0)
*** stack smashing detected ***: <unknown> terminated
Aborted (core dumped)

Note that if I remove the persistence directory, 'myclientid-localhost-1883', then the app runs fine.

@icraggs (Contributor) commented Nov 26, 2019

I don't think there is a difference in data written to persistence over the different versions, except for MQTT 3.1.1 vs MQTT 5. I have some tests written for that case, but there could be gaps.

@icraggs (Contributor) commented Mar 2, 2020

In the fix for #815 I realized that this might have been caused by the WebSockets support, which altered the persisted data if it was going to be sent over WebSockets. I have changed it so that no longer happens, but we do need to check that reading persisted data works properly when using WebSockets. And that you can persist messages with a WebSocket connection and then resend them later over TCP, and vice versa; it shouldn't matter.

icraggs added this to the 1.3.3 milestone Mar 30, 2020

@icraggs (Contributor) commented Mar 30, 2020

Test persistence left by websockets and used for a non-websockets connection, and vice versa.

@fpagliughi (Contributor, Author)

I will start testing this evening to see what I can find.

@icraggs (Contributor) commented Apr 27, 2020

@fpagliughi did you manage to do any testing?

@fpagliughi (Contributor, Author)

There is definitely an issue with different versions of the persistence file after an upgrade. I haven't tracked it down yet, but at the same time, I didn't consider this bad enough to prevent the Rust release from going forward. I doubt anyone really needs to keep their persistence directory intact through a library upgrade. But I would like to get to the bottom of it. (And apologies if it turns out to be something in the derived libs).

BTW, it was a simple async TCP publisher example.

But I also have another example app in the Rust library that just publishes lots of messages, maybe 10 or 100 per second (configurable). It will run fine for days and doesn't seem to leak any memory, but then crashes out with a "stack smashing" error. I suspect this one is on the Rust side of things, but I will put together a comparable C example just to make sure.

@icraggs (Contributor) commented Apr 27, 2020

Barring any other bugs which I will discover in the process of testing the persistence, the issue I fixed in 1.3.2 with persistence and WebSockets would have caused the persistence data to be unrecognized if it was written by 1.3.1 with WebSockets. That data would not be readable by 1.3.0 (no WebSockets) or 1.3.2 (WebSockets or not). But as I said, I'll be trying it out.

@frett27 commented Jun 27, 2020

I'm facing the same issue: a process that runs for days, then gets a stack smashing error. Using gdb and a backtrace, the error involves the publish.

Once the process crashes, relaunching it also smashes immediately during persistence reading. Relaunching again, it then works for more days or weeks. I lost the latest backtrace logs, but will follow up with more detailed feedback.

The process used is http://github.com/frett27/iotmonitor , with a static link to MQTT Paho.

I hope to provide better feedback and analysis.

@fpagliughi (Contributor, Author) commented Jun 27, 2020

Yes, sorry I haven't followed up on this. We need a minimal example to show this, if it's still happening. Something that just publishes data in a fairly tight loop (like a few hundred or thousand messages per second). Leave it running for a day or two and see what happens. I'll try to get to it soon, but if anyone beats me to it, I won't complain. :-)

@icraggs (Contributor) commented Jun 27, 2020

It just occurred to me that maybe it's #571

@icraggs (Contributor) commented Jun 29, 2020

I just reproduced this symptom quickly by reducing PERSISTENCE_MAX_KEY_LENGTH to a smaller value, so I think that confirms it's #571. I've added a fix to the develop branch.

One way to avoid this at the moment is not to use persistence (for testing purposes).
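The mechanism behind that reproduction can be sketched in isolation. This is an illustrative fragment, not the actual Paho code: a persistence key formatted with `sprintf()` into a fixed-size stack buffer overruns the buffer once the key outgrows it, clobbering the stack canary, and glibc then aborts with "stack smashing detected". The `snprintf()` pattern below, with a deliberately tiny, hypothetical `MAX_KEY_LEN` and a hypothetical `format_key` helper, detects the would-be overflow instead:

```c
#include <stdio.h>
#include <string.h>

/* Illustrative only: a key-buffer size deliberately shrunk so that
 * realistic keys no longer fit, mimicking what happens when the
 * formatted key grows past the real library's fixed constant. */
#define MAX_KEY_LEN 12

/* Format "<prefix>-<seqno>" into a fixed-size buffer.
 * Returns 0 on success, -1 if the key would not fit; with sprintf()
 * the second case would silently write past the end of the buffer. */
int format_key(char key[MAX_KEY_LEN], unsigned long seqno, const char *prefix)
{
    int n = snprintf(key, MAX_KEY_LEN, "%s-%lu", prefix, seqno);
    if (n < 0 || n >= MAX_KEY_LEN)
        return -1;  /* truncated: sprintf() would have overflowed here */
    return 0;
}
```

Checking the `snprintf()` return value (the length the full string *would* have had) is what turns a silent stack overwrite into a recoverable error, which matches the shape of the fix described above: keys that can exceed the buffer must be bounds-checked rather than assumed to fit.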

@frett27 commented Jun 29, 2020

I reproduced the issue today; I don't know yet if it's related:

Thread 2 "iotmonitor" received signal SIGABRT, Aborted.
[Switching to Thread 0x7ffff79a2700 (LWP 248847)]
__GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
50 ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) backtrace
#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1 0x00007ffff7d5e859 in __GI_abort () at abort.c:79
#2 0x00007ffff7dc93ee in __libc_message (action=action@entry=do_abort, fmt=fmt@entry=0x7ffff7ef307c "*** %s ***: terminated\n")
at ../sysdeps/posix/libc_fatal.c:155
#3 0x00007ffff7e6b9ba in __GI___fortify_fail (msg=msg@entry=0x7ffff7ef3064 "stack smashing detected") at fortify_fail.c:26
#4 0x00007ffff7e6b986 in __stack_chk_fail () at stack_chk_fail.c:24
#5 0x0000000000273238 in MQTTPersistence_persistQueueEntry ()
#6 0x000000000025b536 in Protocol_processPublication ()
#7 0x00000000002682b8 in MQTTProtocol_handlePublishes ()
#8 0x000000000025e946 in MQTTClient_cycle ()
#9 0x000000000025a719 in MQTTClient_run ()
#10 0x00007ffff7bd0609 in start_thread (arg=) at pthread_create.c:477
#11 0x00007ffff7e5b103 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

relaunching the process:

Thread 2 "iotmonitor" received signal SIGABRT, Aborted.
[Switching to Thread 0x7ffff79a2700 (LWP 277005)]
__GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
50 ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) backtrace
#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1 0x00007ffff7d5e859 in __GI_abort () at abort.c:79
#2 0x00007ffff7dc93ee in __libc_message (action=action@entry=do_abort, fmt=fmt@entry=0x7ffff7ef307c "*** %s ***: terminated\n")
at ../sysdeps/posix/libc_fatal.c:155
#3 0x00007ffff7e6b9ba in __GI___fortify_fail (msg=msg@entry=0x7ffff7ef3064 "stack smashing detected") at fortify_fail.c:26
#4 0x00007ffff7e6b986 in __stack_chk_fail () at stack_chk_fail.c:24
#5 0x0000000000274a22 in MQTTPersistence_unpersistQueueEntry ()
#6 0x000000000025ae95 in MQTTClient_run ()
#7 0x00007ffff7bd0609 in start_thread (arg=) at pthread_create.c:477
#8 0x00007ffff7e5b103 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

@icraggs (Contributor) commented Jun 29, 2020

@frett27 yours is a different case from @fpagliughi's, as you are using the MQTTClient API rather than MQTTAsync. The same effect occurs for qentry_seqno at lines 665 and 668 of MQTTPersistence.c.

@frett27 commented Jun 29, 2020

Thanks @icraggs, I just pulled the fix from develop and made the change for the MQTTClient API. I've kept gdb running to do more investigation on my side. Thanks so much.

icraggs added a commit that referenced this issue Jun 30, 2020
@frett27 commented Jul 8, 2020

Feedback: this seems to fix the issue for me. My small server was launched a couple of days ago, and there has been no more stack smashing after correcting the MQTTClient API.
Thanks to icraggs; I see the fix has been included in the develop branch! Works for me.
