
Segmentation Fault in Socket_getReadySocket #10

Closed
jpwsutton opened this issue Feb 4, 2016 · 0 comments

Comments

@jpwsutton
Contributor

migrated from Bugzilla #447672
status RESOLVED severity normal in component MQTT-C for 1.1
Reported in version 1.0 on platform PC
Assigned to: Ian Craggs

Original attachment names and IDs:

On 2014-10-17 05:51:15 -0400, Ian Craggs wrote:

I'm still working on implementing an MQTT-SN gateway using the Paho 'MQTT
C Client for Posix and Windows', and I have now hit another problem. This
time I think it is a concurrency bug in the MQTT.C library:

My gateway ran for more than a week and then crashed with a segmentation
fault. My guess is that the segmentation fault is due to a race
condition caused by a missing lock (mutex) for s.cur_clientsds inside
Socket.c.

Below I provide the information and reasoning behind my guess:

Info of the segmentation fault:

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0xb4509460 (LWP 17845)]
0x00080750 in Socket_getReadySocket (more_work=0, tp=0xb4508d88)
at source/lib/Mqtt.C/src/Socket.c:278

the code there looks like this:

274     if (s.cur_clientsds == NULL)
275         rc = 0;
276     else
277     {
278         rc = ((int)(s.cur_clientsds->content));
279         ListNextElement(s.clientsds, &s.cur_clientsds);
280     }

the backtrace of the crash is:

(gdb) bt
#0  0x00080750 in Socket_getReadySocket (more_work=0, tp=0xb4508d88)
    at source/lib/Mqtt.C/src/Socket.c:278
#1  0x00077518 in MQTTClient_cycle (sock=0xb4508dc4, timeout=1000, rc=0xb4508dc8)
    at source/lib/Mqtt.C/src/MQTTClient.c:1547
#2  0x00074a90 in MQTTClient_run (n=0x12a9b4)
    at source/lib/Mqtt.C/src/MQTTClient.c:483
#3  0xb6e40bfc in start_thread (arg=0xb4509460)
    at pthread_create.c:306
#4  0xb6dd5968 in ?? ()
    at ../ports/sysdeps/unix/sysv/linux/arm/nptl/../clone.S:116

and the variables are:

(gdb) p s
$1 = {rset = {__fds_bits = {2048, 0 <repeats 31 times>}}, rset_saved = {
__fds_bits = {1024, 0 <repeats 31 times>}}, maxfdp1 = 11,
clientsds = 0x12a774, cur_clientsds = 0x0, connect_pending = 0x12a804,
write_pending = 0x12a894, pending_wset = {__fds_bits = {
0 <repeats 32 times>}}}
(gdb) p s.cur_clientsds
$2 = (ListElement *) 0x0

The segmentation fault occurs because s.cur_clientsds is NULL when line
278 is executed. Since line 278 is inside the else branch of the
(s.cur_clientsds == NULL) check, I guess that s.cur_clientsds must
have been changed by another thread between the check (line 274) and
the access (line 278). So I searched for where s.cur_clientsds is changed
and found that the Socket_close() function sets it to NULL:

void Socket_close(int socket)
{
    FUNC_ENTRY;
    Socket_close_only(socket);
    FD_CLR(socket, &(s.rset_saved));
    if (FD_ISSET(socket, &(s.pending_wset)))
        FD_CLR(socket, &(s.pending_wset));
    if (s.cur_clientsds != NULL && (int)(s.cur_clientsds->content) == socket)
        s.cur_clientsds = s.cur_clientsds->next;
    ...

Therefore s.cur_clientsds is accessed and changed by multiple threads:
by the library's 'worker thread' during MQTTClient_run(), and by a
client thread when the client calls MQTTClient_disconnect(), since this
calls Socket_close(). So I checked the call trees for mutex locks:
Socket_close() is called inside a Thread_lock_mutex(mqttclient_mutex),
but Socket_getReadySocket() is not inside any mutex lock, since
MQTTClient_cycle() is explicitly called outside of the
mqttclient_mutex:

Thread_unlock_mutex(mqttclient_mutex);
pack = MQTTClient_cycle(&sock, timeout, &rc); ==> Socket_getReadySocket()
Thread_lock_mutex(mqttclient_mutex);

Therefore I think my guess is correct that a race condition, due to
missing synchronization of s.cur_clientsds, caused the segmentation
fault.
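
To make the suspected interleaving explicit, here is an illustration written
as a C comment (this is not code from the library, only a timeline of the
suspected race):

/*
 * Suspected interleaving (illustration only):
 *
 *   worker thread (no mutex held)          application thread (holds mqttclient_mutex)
 *   MQTTClient_run -> MQTTClient_cycle     MQTTClient_disconnect -> Socket_close
 *     -> Socket_getReadySocket
 *
 *   274: s.cur_clientsds == NULL is false
 *                                          Socket_close advances s.cur_clientsds;
 *                                          the next list element is NULL
 *   278: rc = ((int)(s.cur_clientsds->content))
 *        dereferences NULL -> SIGSEGV at Socket.c:278
 */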

My quick solution would be to introduce a new socket_mutex inside
Socket.c to protect accesses to:

/**
 * Structure to hold all socket data for the module
 */
Sockets s;

but I would need to invest some more time to fully grasp the locking
strategy of the library. So should I try to invest some time to
understand the internals and introduce this mutex to create a patch?
Or is it not worth my effort because one of the committers can quickly
confirm and fix this problem?
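
For illustration, a minimal sketch of the kind of socket_mutex fix suggested
here (the mutex name, its placement, and the use of plain pthread calls rather
than the library's Thread_lock_mutex wrappers are assumptions, not the change
that was actually committed):

#include <pthread.h>

/* Sketch only: one module-level mutex guarding the shared Sockets structure s. */
static pthread_mutex_t socket_mutex = PTHREAD_MUTEX_INITIALIZER;

int Socket_getReadySocket(int more_work, struct timeval* tp)
{
    int rc = 0;
    /* select() handling as in the existing function, omitted here */
    pthread_mutex_lock(&socket_mutex);
    if (s.cur_clientsds == NULL)
        rc = 0;
    else
    {
        rc = ((int)(s.cur_clientsds->content));
        ListNextElement(s.clientsds, &s.cur_clientsds);
    }
    pthread_mutex_unlock(&socket_mutex);
    return rc;
}

void Socket_close(int socket)
{
    pthread_mutex_lock(&socket_mutex);
    /* existing close logic, including the update of s.cur_clientsds, omitted here */
    pthread_mutex_unlock(&socket_mutex);
}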

On 2014-10-17 05:52:34 -0400, Ian Craggs wrote:

I'm implementing a fix, like you suggested. Thanks for the analysis!

On 2014-10-17 05:58:06 -0400, Franz Schnyder wrote:

*** Bug 447673 has been marked as a duplicate of this bug. ***

On 2014-10-17 06:30:59 -0400, Ian Craggs wrote:

Franz, thanks. I was just about to mark this as a duplicate of yours.

I've just pushed a proposed fix to the development branch.

On 2015-04-24 04:53:23 -0400, Ian Craggs wrote:

*** Bug 458099 has been marked as a duplicate of this bug. ***

On 2015-12-11 09:38:47 -0500, Juergen Kosel wrote:

Created attachment 258620
Simple test program for parallel connections

The test program opens 2 connections to an existing broker
and 6 connections which will always fail, because there is no responding broker.
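
For readers without access to the attachment, a minimal sketch of a test along
these lines might look as follows (the broker addresses, client-id scheme, and
use of the MQTTAsync API are assumptions, not the attached program itself):

#include <stdio.h>
#include "MQTTAsync.h"

#define GOOD_BROKER "tcp://localhost:1883"  /* assumed reachable broker */
#define DEAD_BROKER "tcp://192.0.2.1:1883"  /* unroutable address: connects always fail */
#define NUM_CLIENTS 8

int main(void)
{
    MQTTAsync clients[NUM_CLIENTS];
    int i;

    for (i = 0; i < NUM_CLIENTS; ++i)
    {
        char clientid[32];
        const char* uri = (i < 2) ? GOOD_BROKER : DEAD_BROKER;
        MQTTAsync_connectOptions opts = MQTTAsync_connectOptions_initializer;

        sprintf(clientid, "par_test_%d", i);
        MQTTAsync_create(&clients[i], uri, clientid,
                MQTTCLIENT_PERSISTENCE_NONE, NULL);
        opts.connectTimeout = 5;
        MQTTAsync_connect(clients[i], &opts);  /* the DEAD_BROKER connects never succeed */
    }

    /* Let the library's background threads run the (failing) connect attempts
     * while the application is free to do unrelated socket work; press Enter
     * to clean up and exit. */
    getchar();

    for (i = 0; i < NUM_CLIENTS; ++i)
    {
        MQTTAsync_disconnectOptions dopts = MQTTAsync_disconnectOptions_initializer;
        MQTTAsync_disconnect(clients[i], &dopts);
        MQTTAsync_destroy(&clients[i]);
    }
    return 0;
}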

On 2015-12-11 09:42:14 -0500, Juergen Kosel wrote:

Hello,

please reopen this bug for version 1.3, for platform Windows, built with Visual Studio 2010.

My application crashes in Socket_getReadySocket() and Socket_close()
because the pointer s.cur_clientsds becomes 0xdddddddd, which is obviously an
invalid pointer.
It usually happens while some other (non-MQTT) part of my application does
some other socket operation (listen, accept connection, release connection, ...).

I have tried to write a simple test program which does something similar
to my application, but so far I have not reproduced the crash with the attached
test application. (But maybe you could use it to improve the test collection.)

Greetings
Juergen

On 2015-12-11 09:56:54 -0500, Ian Craggs wrote:

Hi. Can you get a client trace of the situation by setting the environment variable

MQTT_C_CLIENT_TRACE=ON (or a filename)?

I presume the "some other socket operation" is on another socket, not one used by the MQTT client library?

On 2015-12-11 10:35:02 -0500, Juergen Kosel wrote:

Created attachment 258621
trace file shortly before the segmentation fault

Appended to the trace you find some more information from the
debugger.

Greetings
Juergen

On 2015-12-11 12:06:27 -0500, Ian Craggs wrote:

So the trace shows a socket being closed because of some sort of socket error or timeout. I assume this is expected?

I don't see why an operation on another socket should have any bearing on this situation unless it causes this socket error/close sequence somehow. Do you have any idea?

On 2015-12-14 03:41:32 -0500, Juergen Kosel wrote:

(In reply to Ian Craggs from comment # 9)

> So the trace shows a socket being closed because of some sort of socket
> error or timeout. I assume this is expected?

Yes,
as in test_AsyncParrallel.c there are several connect attempts to
unreachable brokers.

> I don't see why an operation on another socket should have any bearing on
> this situation unless it causes this socket error/close sequence somehow.
> Do you have any idea?

The same for me. It seems that the additional task switches / wake-ups
make it easier to trigger this bug.

From the pointer list in paho-trace.log after the crash, it looks like a
use-after-free problem (0xdddddddd is the pattern the Visual C++ debug heap
writes into freed memory). But I have also seen crashes occur in
Socket_close():

if (s.cur_clientsds != NULL && (int)(s.cur_clientsds->content) == socket)

When it occurred at this location, s.cur_clientsds was also an invalid
pointer (0xdddddddd).

Greetings
Juergen

On 2015-12-14 05:55:00 -0500, Juergen Kosel wrote:

Created attachment 258646
Another trace of a segmentation fault - with traces of mutex lock/unlock.

Another trace of a segmentation fault.
This time with additional trace output in the async mutex lock and unlock functions.

On 2015-12-14 05:58:19 -0500, Juergen Kosel wrote:

Created attachment 258647
Patch to trace MQTTAsync_lock_mutex

Patch which was used to generate the trace file of attachment 258646.

On 2015-12-14 07:50:36 -0500, Juergen Kosel wrote:

Created attachment 258650
Fix for the outstanding part of bug 447672

With this patch applied, the segmentation fault does not occur anymore
in my test-setup.

On 2015-12-14 18:28:18 -0500, Ian Craggs wrote:

Thanks Juergen. The same fix is apparently needed in MQTTAsync.c as was applied to MQTTClient.c for this bug. I want to track the fixes separately so I've raised a new bug, 484363. Your fix will work, but will also hold a lock longer than necessary.

Thanks for the traces and diagnostics.
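
As an illustration of the lock-scope point only (the change actually made is
tracked under bug 484363 and may differ), the trade-off is roughly between
locking around the whole cycle and locking only around the shared-state
access; socket_mutex is again an assumed name:

/* Coarse: correct, but the lock is held across the whole cycle,
 * including the select() wait. */
Thread_lock_mutex(socket_mutex);
pack = MQTTAsync_cycle(&sock, timeout, &rc);
Thread_unlock_mutex(socket_mutex);

/* Narrower alternative: leave the cycle call unlocked and take socket_mutex
 * only inside Socket_getReadySocket()/Socket_close(), around the few
 * statements that read or advance s.cur_clientsds. */
pack = MQTTAsync_cycle(&sock, timeout, &rc);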

@frett27 mentioned this issue Jun 29, 2020
Lazzaretti pushed a commit to ce-rust/paho.mqtt.c that referenced this issue Nov 8, 2020
Properly destroy the underlying MQTTAsync handle by calling MQTTAsync_destroy in the implementation of AsyncClient::drop.
Also, we will drop the boxed ClientPersistence trait object inside of the ffi::MQTTClient_persistence. This currently leaks because it is forgotten (via Box::into_raw) during AsyncClient::new(), and it is not freed on the C side.

Signed-off-by: Jarred Nicholls <jarred.nicholls@gmail.com>