
Segmentation Fault in Socket_getReadySocket #10

Closed
jpwsutton opened this issue Feb 4, 2016 · 0 comments

Comments

@jpwsutton
Contributor

migrated from Bugzilla #447672
status RESOLVED severity normal in component MQTT-C for 1.1
Reported in version 1.0 on platform PC
Assigned to: Ian Craggs

Original attachment names and IDs:

On 2014-10-17 05:51:15 -0400, Ian Craggs wrote:

I'm still working on implementing an MQTT-SN gateway using the Paho 'MQTT
C Client for Posix and Windows', and I have now hit another problem. This
time I think it is a concurrency bug in the MQTT.C library:

My gateway ran for more than a week and then crashed with a segmentation
fault. My guess is that the segmentation fault is due to a race
condition caused by a missing lock (mutex) for s.cur_clientsds inside
Socket.c.

Below I provide the information and reasoning behind my guess:

Info of the segmentation fault:

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0xb4509460 (LWP 17845)]
0x00080750 in Socket_getReadySocket (more_work=0, tp=0xb4508d88)
at source/lib/Mqtt.C/src/Socket.c:278

the code there looks like this:

274     if (s.cur_clientsds == NULL)
275         rc = 0;
276     else
277     {
278         rc = ((int)(s.cur_clientsds->content));
279         ListNextElement(s.clientsds, &s.cur_clientsds);
280     }

the backtrace of the crash is:

(gdb) bt
#0  0x00080750 in Socket_getReadySocket (more_work=0, tp=0xb4508d88)
    at source/lib/Mqtt.C/src/Socket.c:278
#1  0x00077518 in MQTTClient_cycle (sock=0xb4508dc4, timeout=1000, rc=0xb4508dc8)
    at source/lib/Mqtt.C/src/MQTTClient.c:1547
#2  0x00074a90 in MQTTClient_run (n=0x12a9b4)
    at source/lib/Mqtt.C/src/MQTTClient.c:483
#3  0xb6e40bfc in start_thread (arg=0xb4509460)
    at pthread_create.c:306
#4  0xb6dd5968 in ?? ()
    at ../ports/sysdeps/unix/sysv/linux/arm/nptl/../clone.S:116

and the variables are:

(gdb) p s
$1 = {rset = {__fds_bits = {2048, 0 <repeats 31 times>}}, rset_saved = {
__fds_bits = {1024, 0 <repeats 31 times>}}, maxfdp1 = 11,
clientsds = 0x12a774, cur_clientsds = 0x0, connect_pending = 0x12a804,
write_pending = 0x12a894, pending_wset = {__fds_bits = {
0 <repeats 32 times>}}}
(gdb) p s.cur_clientsds
$2 = (ListElement *) 0x0

The segmentation fault occurs because s.cur_clientsds is NULL when line
278 is executed. Since line 278 is inside the else branch of the
(s.cur_clientsds == NULL) check, I guess that s.cur_clientsds must
have been changed by another thread between the check (line 274) and
the access (line 278). So I searched for where s.cur_clientsds is changed
and found that the Socket_close() function sets it to NULL:

void Socket_close(int socket)
{
    FUNC_ENTRY;
    Socket_close_only(socket);
    FD_CLR(socket, &(s.rset_saved));
    if (FD_ISSET(socket, &(s.pending_wset)))
        FD_CLR(socket, &(s.pending_wset));
    if (s.cur_clientsds != NULL && (int)(s.cur_clientsds->content) == socket)
        s.cur_clientsds = s.cur_clientsds->next;
    ...

Therefore s.cur_clientsds is accessed and changed by multiple threads:
by the library's 'worker thread' during MQTTClient_run(), and by a
client thread when the client calls MQTTClient_disconnect(), since this
calls Socket_close(). So I checked the call trees for mutex locks:
Socket_close() is called inside a Thread_lock_mutex(mqttclient_mutex),
but Socket_getReadySocket() is not inside any mutex lock, since
MQTTClient_cycle() is explicitly called outside of the
mqttclient_mutex:

Thread_unlock_mutex(mqttclient_mutex);
pack = MQTTClient_cycle(&sock, timeout, &rc); ==> Socket_getReadySocket()
Thread_lock_mutex(mqttclient_mutex);

Therefore I think my guess is correct that a race condition, due to
missing synchronization of s.cur_clientsds, caused the segmentation
fault.
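
To make the suspected interleaving explicit, here is an illustration written
as a C comment (this is not code from the library, only a timeline of the
suspected race):

/*
 * Suspected interleaving (illustration only):
 *
 *   worker thread (no mutex held)          application thread (holds mqttclient_mutex)
 *   MQTTClient_run -> MQTTClient_cycle     MQTTClient_disconnect -> Socket_close
 *     -> Socket_getReadySocket
 *
 *   274: s.cur_clientsds == NULL is false
 *                                          Socket_close advances s.cur_clientsds;
 *                                          the next list element is NULL
 *   278: rc = ((int)(s.cur_clientsds->content))
 *        dereferences NULL -> SIGSEGV at Socket.c:278
 */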

My quick solution would be to introduce a new socket_mutex inside
Socket.c to protect accesses to:

/**
 * Structure to hold all socket data for the module
 */
Sockets s;

but I would need to invest some more time to fully grasp the locking
strategy of the library. So should I try to invest some time to
understand the internals and introduce this mutex to create a patch?
Or is it not worth my effort because one of the committers can quickly
confirm and fix this problem?
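
For illustration, a minimal sketch of the kind of socket_mutex fix suggested
here (the mutex name, its placement, and the use of plain pthread calls rather
than the library's Thread_lock_mutex wrappers are assumptions, not the change
that was actually committed):

#include <pthread.h>

/* Sketch only: one module-level mutex guarding the shared Sockets structure s. */
static pthread_mutex_t socket_mutex = PTHREAD_MUTEX_INITIALIZER;

int Socket_getReadySocket(int more_work, struct timeval* tp)
{
    int rc = 0;
    /* select() handling as in the existing function, omitted here */
    pthread_mutex_lock(&socket_mutex);
    if (s.cur_clientsds == NULL)
        rc = 0;
    else
    {
        rc = ((int)(s.cur_clientsds->content));
        ListNextElement(s.clientsds, &s.cur_clientsds);
    }
    pthread_mutex_unlock(&socket_mutex);
    return rc;
}

void Socket_close(int socket)
{
    pthread_mutex_lock(&socket_mutex);
    /* existing close logic, including the update of s.cur_clientsds, omitted here */
    pthread_mutex_unlock(&socket_mutex);
}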

On 2014-10-17 05:52:34 -0400, Ian Craggs wrote:

I'm implementing a fix, like you suggested. Thanks for the analysis!

On 2014-10-17 05:58:06 -0400, Franz Schnyder wrote:

*** Bug 447673 has been marked as a duplicate of this bug. ***

On 2014-10-17 06:30:59 -0400, Ian Craggs wrote:

Franz, thanks. I was just about to mark this as a duplicate of yours.

I've just pushed a proposed fix to the development branch.

On 2015-04-24 04:53:23 -0400, Ian Craggs wrote:

*** Bug 458099 has been marked as a duplicate of this bug. ***

On 2015-12-11 09:38:47 -0500, Juergen Kosel wrote:

Created attachment 258620
Simple test program for parallel connections

The test program opens 2 connections to an existing broker
and 6 connections which will always fail, because there is no responding broker.
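
For readers without access to the attachment, a minimal sketch of a test along
these lines might look as follows (the broker addresses, client-id scheme, and
use of the MQTTAsync API are assumptions, not the attached program itself):

#include <stdio.h>
#include "MQTTAsync.h"

#define GOOD_BROKER "tcp://localhost:1883"  /* assumed reachable broker */
#define DEAD_BROKER "tcp://192.0.2.1:1883"  /* unroutable address: connects always fail */
#define NUM_CLIENTS 8

int main(void)
{
    MQTTAsync clients[NUM_CLIENTS];
    int i;

    for (i = 0; i < NUM_CLIENTS; ++i)
    {
        char clientid[32];
        const char* uri = (i < 2) ? GOOD_BROKER : DEAD_BROKER;
        MQTTAsync_connectOptions opts = MQTTAsync_connectOptions_initializer;

        sprintf(clientid, "par_test_%d", i);
        MQTTAsync_create(&clients[i], uri, clientid,
                MQTTCLIENT_PERSISTENCE_NONE, NULL);
        opts.connectTimeout = 5;
        MQTTAsync_connect(clients[i], &opts);  /* the DEAD_BROKER connects never succeed */
    }

    /* Let the library's background threads run the (failing) connect attempts
     * while the application is free to do unrelated socket work; press Enter
     * to clean up and exit. */
    getchar();

    for (i = 0; i < NUM_CLIENTS; ++i)
    {
        MQTTAsync_disconnectOptions dopts = MQTTAsync_disconnectOptions_initializer;
        MQTTAsync_disconnect(clients[i], &dopts);
        MQTTAsync_destroy(&clients[i]);
    }
    return 0;
}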

On 2015-12-11 09:42:14 -0500, Juergen Kosel wrote:

Hello,

please reopen this bug for version 1.3, for platform Windows, built with Visual Studio 2010.

My application crashes in Socket_getReadySocket() and Socket_close()
because the pointer s.cur_clientsds becomes 0xdddddddd, which is obviously an
invalid pointer.
It usually happens while some other (non-MQTT) part of my application does
some other socket operation (listen, accept connection, release connection, ...).

I have tried to write a simple test program which does something similar
to my application, but so far I have not reproduced the crash with the attached
test application. (But maybe you could use it to improve the test collection.)

Greetings
Juergen

On 2015-12-11 09:56:54 -0500, Ian Craggs wrote:

Hi. Can you get a client trace of the situation by setting the environment variable

MQTT_C_CLIENT_TRACE=ON (or a filename)?

I presume the "some other socket operation" is on another socket, not one used by the MQTT client library?

On 2015-12-11 10:35:02 -0500, Juergen Kosel wrote:

Created attachment 258621
trace file shortly before the segmentation fault

Appended to the trace you find some more information from the
debugger.

Greetings
Juergen

On 2015-12-11 12:06:27 -0500, Ian Craggs wrote:

So the trace shows a socket being closed because of some sort of socket error or timeout. I assume this is expected?

I don't see why an operation on another socket should have any bearing on this situation unless it causes this socket error/close sequence somehow. Do you have any idea?

On 2015-12-14 03:41:32 -0500, Juergen Kosel wrote:

(In reply to Ian Craggs from comment # 9)

> So the trace shows a socket being closed because of some sort of socket
> error or timeout. I assume this is expected?

Yes,
as in test_AsyncParrallel.c there are several connect attempts to
unreachable brokers.

> I don't see why an operation on another socket should have any bearing on
> this situation unless it causes this socket error/close sequence somehow.
> Do you have any idea?

The same for me. It seems that the additional task switches / wake-ups
make it easier to trigger this bug.

From the pointer list in paho-trace.log after the crash, it looks like a
use-after-free problem (0xdddddddd is the pattern the Visual C++ debug heap
writes into freed memory). But I have also seen crashes occur in
Socket_close():

if (s.cur_clientsds != NULL && (int)(s.cur_clientsds->content) == socket)

When it occurred at this location, s.cur_clientsds was also an invalid
pointer (0xdddddddd).

Greetings
Juergen

On 2015-12-14 05:55:00 -0500, Juergen Kosel wrote:

Created attachment 258646
Another trace of a segmentation fault - with traces of mutex lock/unlock.

Another trace of a segmentation fault.
This time with additional trace output in the async mutex lock and unlock functions.

On 2015-12-14 05:58:19 -0500, Juergen Kosel wrote:

Created attachment 258647
Patch to trace MQTTAsync_lock_mutex

Patch which was used to generate the trace file of attachment 258646.

On 2015-12-14 07:50:36 -0500, Juergen Kosel wrote:

Created attachment 258650
Fix for the outstanding part of bug 447672

With this patch applied, the segmentation fault does not occur anymore
in my test-setup.

On 2015-12-14 18:28:18 -0500, Ian Craggs wrote:

Thanks Juergen. The same fix is apparently needed in MQTTAsync.c as was applied to MQTTClient.c for this bug. I want to track the fixes separately so I've raised a new bug, 484363. Your fix will work, but will also hold a lock longer than necessary.

Thanks for the traces and diagnostics.
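
As an illustration of the lock-scope point only (the change actually made is
tracked under bug 484363 and may differ), the trade-off is roughly between
locking around the whole cycle and locking only around the shared-state
access; socket_mutex is again an assumed name:

/* Coarse: correct, but the lock is held across the whole cycle,
 * including the select() wait. */
Thread_lock_mutex(socket_mutex);
pack = MQTTAsync_cycle(&sock, timeout, &rc);
Thread_unlock_mutex(socket_mutex);

/* Narrower alternative: leave the cycle call unlocked and take socket_mutex
 * only inside Socket_getReadySocket()/Socket_close(), around the few
 * statements that read or advance s.cur_clientsds. */
pack = MQTTAsync_cycle(&sock, timeout, &rc);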

@frett27 mentioned this issue Jun 29, 2020
Lazzaretti pushed a commit to ce-rust/paho.mqtt.c that referenced this issue Nov 8, 2020
Properly destroy the underlying MQTTAsync handle by calling MQTTAsync_destroy in the implementation of AsyncClient::drop.
Also, we will drop the boxed ClientPersistence trait object inside of the ffi::MQTTClient_persistence. This currently leaks because it is forgotten (via Box::into_raw) during AsyncClient::new(), and it is not freed on the C side.

Signed-off-by: Jarred Nicholls <jarred.nicholls@gmail.com>