-
Notifications
You must be signed in to change notification settings - Fork 317
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
iio_buffer_destroy() sometimes hangs when items are enqueued #1211
Comments
The problem seems to be that iiod_rw_all() is not returning. @pcercuei do you have a quick idea how to fix this? |
The underlying problem seems to be that recv() is blocking |
|
I guess that only matters for cancellable operations. If that flag is set to true, do_cancel() should do the trick. But if we are not getting into poll(2), then that cancel_fd should not matter. But in that case MSG_DONTWAIT should be set and recv(2) should not block... |
|
it seems like |
something with the cancel_fd is not working |
Yeah, it seems there's some subtle bug in that logic... Because |
The problem is read...
network_recv...
cancellable = 1
wait_cancellable...
destroying...
iio_buffer_cancel...
task_stop
task stop: 0 0 id: 28940624
do cancel...
do cancel done
iio_buffer_cancel done
free_buffer...
free1
pfd[1].revents = 0
task stop: 0 1 id: 28626704 |
something is wrong with the cancel_fd, it has wrong value when poll() is called after cancelling the buffer. |
the other cancel_fd is from the other buffer that my application has open, so thats not the problem. I have to debug a bit more... |
its a bit difficult to debug, because every iio context and every buffer has its own thread which all use functions from network.c and there is no real documentation about the internals of libiio ^^ |
in c++ this would just be a std::vector and so much easier to debug |
@pcercuei [2024-11-06 11:34:23.757] [control_center::iio_buf_reader] [warning] --- dequeue: timed out ---
thread = 1778366016, adding reader 179795776
list item 0 = 181190880
list item 1 = 180965088
list item 2 = 179790880
list item 3 = 180897360
list item 4 = 180966064
list item 5 = 180047712
list item 6 = 180763824
list item 7 = 179200736
list item 8 = 180901488
list item 9 = 181200016
list item 10 = 183120608
list item 11 = 178781568
list item 12 = 181129104
list item 13 = 180895728
list item 14 = 179365760
list item 15 = 181301104
thread = 1786758720, wait_cancellable done
b ts: 0 1
c ts: 0 1
x1
thread = 1795151424, poll cancel_fd 20
thread = 1795151424, pfd[0].revents = 4, pfd[1].revents = 0, ret = 1
thread = 1786758720, network_recv done
thread = 1786758720, network_recv cancellable = 1
thread = 1786758720, wait_cancellable...
thread = 1786758720, poll cancel_fd 20
thread = 1786758720, pfd[0].revents = 1, pfd[1].revents = 0, ret = 1
thread = 1786758720, wait_cancellable done
thread = 1786758720, network_recv done
thread = 1786758720, network_recv cancellable = 1
thread = 1786758720, wait_cancellable...
thread = 1786758720, poll cancel_fd 20
thread = 1786758720, pfd[0].revents = 1, pfd[1].revents = 0, ret = 1
thread = 1786758720, wait_cancellable done
thread = 1786758720, network_recv done
thread = 1786758720, network_recv cancellable = 1
thread = 1786758720, wait_cancellable...
thread = 1786758720, poll cancel_fd 20
thread = 1786758720, pfd[0].revents = 1, pfd[1].revents = 0, ret = 1
thread = 1786758720, wait_cancellable done
x2
task stop: 0 1 id: 31090272
thread = 1786758720, network_recv done
thread = 1786758720, network_recv cancellable = 1
thread = 1786758720, wait_cancellable...
thread = 1786758720, poll cancel_fd 20
thread = 1786758720, pfd[0].revents = 1, pfd[1].revents = 0, ret = 1
thread = 1786758720, wait_cancellable done
thread = 1786758720, network_recv done
thread = 1786758720, read done
thread = 1786758720, read...
thread = 1786758720, network_recv cancellable = 1
thread = 1786758720, wait_cancellable...
thread = 1786758720, poll cancel_fd 20
thread = 1778366016, adding reader 179795776
list item 0 = 181190880
list item 1 = 180965088
list item 2 = 179790880
list item 3 = 180897360
list item 4 = 180966064
list item 5 = 180047712
list item 6 = 180763824
list item 7 = 179200736
list item 8 = 180901488
list item 9 = 181200016
list item 10 = 183120608
list item 11 = 178781568
list item 12 = 181129104
list item 13 = 180895728
list item 14 = 179365760
list item 15 = 181301104
list item 16 = 178759280
thread = 1778366016, discard reader 181190880
[2024-11-06 11:34:23.758] [control_center::iio_buf_reader] [warning] --- dequeue: timed out ---
b ts: 0 1
c ts: 0 1
x1
thread = 1795151424, poll cancel_fd 20
thread = 1795151424, pfd[0].revents = 4, pfd[1].revents = 0, ret = 1
thread = 1778366016, adding reader 181190880
list item 0 = 180965088
list item 1 = 179790880
list item 2 = 180897360
list item 3 = 180966064
list item 4 = 180047712
list item 5 = 180763824
list item 6 = 179200736
list item 7 = 180901488
list item 8 = 181200016
thread = 1795151424, poll cancel_fd 20
thread = 1795151424, pfd[0].revents = 4, pfd[1].revents = 0, ret = 1
x2
list item 9 = 183120608
list item 10 = 178781568
list item 11 = 181129104
task stop: 0 1 id: 31090272
list item 12 = 180895728
list item 13 = 179365760
list item 14 = 181301104
list item 15 = 178759280
list item 16 = 179795776
list item 17 = 179795776
list item 18 = 179795776
list item 19 = 179795776
list item 20 = 179795776
list item 21 = 179795776
list item 22 = 179795776
list item 23 = 179795776
list item 24 = 179795776
list item 25 = 179795776
list item 26 = 179795776
list item 27 = 179795776
list item 28 = 179795776
list item 29 = 179795776
list item 30 = 179795776
list item 31 = 179795776
list item 32 = 179795776
list item 33 = 179795776
list item 34 = 179795776
list item 35 = 179795776
list item 36 = 179795776
list item 37 = 179795776
list item 38 = 179795776
list item 39 = 179795776
list item 40 = 179795776
list item 41 = 179795776
list item 42 = 179795776
list item 43 = 179795776
list item 44 = 179795776
list item 45 = 179795776
list item 46 = 179795776
list item 47 = 179795776
list item 48 = 179795776
list item 49 = 179795776
list item 50 = 179795776
list item 51 = 179795776
list item 52 = 179795776
list item 53 = 179795776
list item 54 = 179795776
list item 55 = 179795776
list item 56 = 179795776
list item 57 = 179795776
list item 58 = 179795776
list item 59 = 179795776
list item 60 = 179795776
list item 61 = 179795776
list item 62 = 179795776
list item 63 = 179795776
list item 64 = 179795776
list item 65 = 179795776
list item 66 = 179795776
list item 67 = 179795776
list item 68 = 179795776
list item 69 = 179795776
list item 70 = 179795776
list item 71 = 179795776
list item 72 = 179795776
list item 73 = 179795776
list item 74 = 179795776
list item 75 = 179795776
list item 76 = 179795776
list item 77 = 179795776
list item 78 = 179795776
list item 79 = 179795776
list item 80 = 179795776
list item 81 = 179795776
list item 82 = 179795776
list item 83 = 179795776
list item 84 = 179795776
list item 85 = 179795776
list item 86 = 179795776
list item 87 = 179795776
list item 88 = 179795776
list item 89 = 179795776
list item 90 = 179795776
list item 91 = 179795776
list item 92 = 179795776
list item 93 = 179795776
list item 94 = 179795776
list item 95 = 179795776
list item 96 = 179795776
list item 97 = 179795776
list item 98 = 179795776
list item 99 = 179795776
list item 100 = 179795776 /* Add it to the readers list */
printf("thread = %u, adding reader %d\n", pthread_self(), io);
if (!priv->readers) {
priv->readers = io;
} else {
unsigned int i = 0;
for (tmp = priv->readers; tmp->r_next; ) {
printf("list item %d = %d\n", i++,tmp);
tmp = tmp->r_next;
if (i > 100) exit(1);
}
tmp->r_next = io;
} |
Interesting findings... We can obviously add a protection so we error out gracefully (or just ignore the entry) if we add an entry that already exists (I guess that should not be possible) even though this is only a real problem if the repeated entries are sequential so that entry->next = entry (which pretty much means list_empty - but not in this case I guess). But more importantly would be to find out why we get the same entry added twice... @catkira thanks for debugging this |
yes I tried this work around, but then it still hangs sometimes at other locations (probably because of the same underlying reason) /* Add it to the readers list */
printf("thread = %u, adding reader %d\n", pthread_self(), io);
bool found = false;
if (priv->readers) {
for (tmp = priv->readers; tmp->r_next; ) {
if (tmp == io)
{
printf("reader is already in list!!\n");
found = true;
}
tmp = tmp->r_next;
}
}
if (!found) {
if (!priv->readers) {
priv->readers = io;
} else {
unsigned int i = 0;
for (tmp = priv->readers; tmp->r_next; ) {
printf("list item %d = %d\n", i++,tmp);
tmp = tmp->r_next;
if (i > 100) exit(1);
}
tmp->r_next = io;
}
} |
/* Add it to the readers list */
printf("thread = %u, adding reader %d\n", pthread_self(), io);
bool found = false;
if (priv->readers) {
for (tmp = priv->readers; tmp->r_next; ) {
if (tmp == io)
{
printf("reader is already in list!!\n");
found = true;
exit(1);
}
if (tmp == tmp->r_next) {
printf("loop detected!!!");
exit(1);
}
tmp = tmp->r_next;
}
}
if (!found) {
if (!priv->readers) {
priv->readers = io;
} else {
unsigned int i = 0;
for (tmp = priv->readers; tmp->r_next; ) {
printf("list item %d = %d\n", i++,tmp);
tmp = tmp->r_next;
if (i > 100) exit(1);
}
tmp->r_next = io;
}
} thread = 2539071040, adding reader 182425776
loop detected!!!
thread = 2588911168, pfd[0].revents = 1, pfd[1].revents = 0, ret = 1
thread = 2588911168, wait_cancellable done
thread = 2588911168, network_recv done
thread = 2588911168, read done
thread = 2588911168, try lock...
thread = 2588911168, locked
thread = 2588911168, discard reader 182140576
thread = 2588911168, read...
thread = 2588911168, network_recv cancellable = 1
thread = 2588911168, wait_cancellable...
thread = 2588911168, poll cancel_fd 15
thread = 2588911168, pfd[0].revents = 1, pfd[1].revents = 0, ret = 1
thread = 2588911168, wait_cancellable done
thread = 2588911168, network_recv done
thread = 2588911168, read done
thread = 2588911168, read...
thread = 2588911168, network_recv cancellable = 1
thread = 2588911168, wait_cancellable...
thread = 2588911168, poll cancel_fd 15
thread = 3178092864, adding reader 182140576
b ts: 0 1
c ts: 0 1
x1
thread = 2597303872, poll cancel_fd 15
thread = 2597303872, pfd[0].revents = 4, pfd[1].revents = 0, ret = 1
x2
task stop: 0 1 id: 32557408
thread = 2588911168, pfd[0].revents = 1, pfd[1].revents = 0, ret = 1
thread = 2588911168, wait_cancellable done
thread = 2588911168, network_recv done
thread = 2588911168, read done
thread = 2588911168, try lock...
thread = 2588911168, locked
thread = 2588911168, discard reader 182140576
thread = 2588911168, read...
thread = 2588911168, network_recv cancellable = 1
thread = 2588911168, wait_cancellable...
thread = 2588911168, poll cancel_fd 15
thread = 2588911168, pfd[0].revents = 1, pfd[1].revents = 0, ret = 1
thread = 2588911168, wait_cancellable done
thread = 2588911168, network_recv done
thread = 2588911168, read done
thread = 2588911168, read...
thread = 2588911168, network_recv cancellable = 1
thread = 2588911168, wait_cancellable...
thread = 2588911168, poll cancel_fd 15
terminate called without an active exception
Aborted (core dumped) it must be memory corruption |
Auch... Might very well be but let's hop it's not :). I see your |
BTW, I'm not sure if the code you pasted is the one you used but if it is, I think we need to check for an existing object before doing So |
I think io->r_next is set to NULL, because it will always be inserted at the end of the list. |
When I change the code that adds items to the reader list like this, it does not hang anymore. /* Add it to the readers list */
printf("thread = %u, adding reader %d\n", pthread_self(), io);
bool found = false;
if (priv->readers) {
for (tmp = priv->readers; tmp->r_next; ) {
if (tmp == io)
{
printf("reader is already in list!!\n");
found = true;
exit(1);
}
if (tmp == tmp->r_next) {
printf("loop detected!!!\n");
exit(1);
}
tmp = tmp->r_next;
}
}
if (!found) {
if (!priv->readers) {
priv->readers = io;
} else {
unsigned int i = 0;
for (tmp = priv->readers; tmp->r_next; ) {
printf("list item %d = %d\n", i++,tmp);
tmp = tmp->r_next;
if (i > 100) exit(1);
}
tmp->r_next = io;
printf("io->r_next = %d\n", io->r_next);
io->r_next = NULL;
}
} |
you can see my change in the PR, the second to last commit contains all the debug printfs. |
If you suspect memory corruption, try to compile with GCC's built-in address sanitizer. If there is any use-after-free, or a buffer overflow, it will detect it. |
I tried it. It freezes without any sanitizer output. |
Sure, it marks the end of the list. My point is that if you're adding an object that already exists on the list and do
And this really seems to suggest we're adding an object that is already on the list
Likely a nice thing to add to cmake on a debug build... |
Yeah that's something that is definitely not supported (nor should it be). The real question then, is why is it added twice. Note that there is one I suspect that you are getting a race in the main "iiod_responder", maybe by calling iio_buffer_cancel() while another thread in your program is busy reading an attribute; or by calling iio_buffer_cancel() in a signal handler (that case is explicitely written as unsupported in the doc). From what I can read in the code, iiod_responder does have some multi-thread protection in place, but it does not protect against two threads issueing two requests on the same iiod_io. |
yes I also think that it is some kind of race condition, because I only encounter this when I run the code on a fast CPU. |
I tried to do it with mutex locks, but it did not work, my code understanding was probably not deep enough. However getting a deeper code understanding without having programmed it can easily take a week or more. The next problem is that this bug is not easy to reproduce, even on my setup I have to let my application run for about 1 min to trigger it. If you want it fix it, you probably have to create a test app first that reliably triggers the bug. |
I have removed all the debug stuff, the change is pretty small now and well isolated. I admit that it is not the most elegant solution possible BUT it's much better than hidden multi-threading bugs. |
Yeah, I was wondering the same. If we serialize it, there will always be a penalty associated (even if we can make the locking better than your quick tests) it. Not sure if @pcercuei meant something else with the locking approach. And FWIW I have the same issue than you regarding deep understanding of the code (I'm mostly working on the kernel side of things).
We don't do squash commits, sorry. In theory, only fast forward merges would be best (IMO) but github does not allows it (on the gui). But I'm still not sure we have a proper solution for merging... See more below.
Yeah, my bad.
I guess that's true but still not very elegant... But the real problem (not sure if it is in practice) with any pool alternative is the client_id thing. If i'm not missing nothing, now that means that we can, in theory, have multiple threads using the same client_id (0). Hence, I would say (again, in theory) that we can mess up badly in here: https://github.com/analogdevicesinc/libiio/blob/main/iiod-responder.c#L288 I mean, can we safely know which iiod_io should we awake for the incoming reply?! I don't think so... And to make it work with an unique id, I think bigger changes are needed (also on iiod side).
It should not be. The way I understand only readers matter for now so the logical place for me would be after removing the IO out of the readers list which should mean we're done with it. FWIW, I guess the main issue your app demonstrated is that accessing an IIO context for reading/writing attributes from different threads can be problematic. Is that correct? Is the mutex hang that bad? Or would it be feasible a workaround on the app side (like a context per thread - as bad as that sounds)? I mean, I ask it because anything more than serializing default_io, is likely more difficult (and so needs time). |
I had another crash (hang) :/ so my last fix apparently does not work. What seems to work is adding just |
In my app I am continuously reading buffers and also read and set some attributes at the same time. Serializing it with a mutex on the app side would not work, if that should also include the buffer read/writes. If I just had to serialize the attribute access that would be ok. |
My approach to fix this properly would be to slowly migrate the project to c++ and rewrite one compilation unit at a time using std containers and pointers and possibly also c++ mutex and condition variables and also classes instead of priv structs where possible. It would make onboarding new people into the code much easier and faster. |
Could you try that? From what I understood only those accesses are using the default_io which always has client_id = 0. For buffering (and events), I think we have an iiod_io per block (being the block index the IO client_id) so I don't think that would be an issue. |
I am thinking now that the default_io usage is not the only multithreading problem, because otherwise my last fix attempt should have worked. |
What would the "penalty associated" be? Given that right now if you try to do parallel access you get all kinds of issues, you already need to serialize the API calls, so I don't see how adding some locking to enforce everything goes smooth would come with a penalty.
Microcontrollers generally don't support C++. |
c++ is getting more and more popular in the uC community. All my uC code is c++ and it works fine. Microcontrollers just run asm instructions and a c++ compiler can generate asm instructions. What exactly do you mean by "generally don't support c++"? Do you mean there is not a c++ compiler for every uC? |
You're knocking at he wrong door :). You would need to convince others than me for that. There's also plenty of ways to shoot yourself in the foot with C++. Even more when thinking about classes and the way c++ makes it easy for inheritance rather than composition... I know, I know it's trendy to say this but I would consider and internally propose rust if it was supported in more archs. But again, let's not make this an argument between languages :)
@catkira saw his app hanging sometimes... Could we read/write attrs from different threads without any issue in libiio 0.25? If so, this definitely feels like a regression which is never nice for users
That might very well be.... But I'm still not convinced your solution was race free. |
I think sometimes what you can do is very limited and you can't really use all c++ features. But yeah, things might have changed. I don't plain with bare uC for some time now. But I think we're bikeshedding here... I'm fairly sure that rewriting libiio is off the table. |
code without standard containers or wrapped complexity is just hard to read for me, ie this block /* Discard the entry from the readers list */
if (io == priv->readers) {
priv->readers = io->r_next;
} else if (priv->readers) {
for (tmp = priv->readers; tmp->r_next; tmp = tmp->r_next) {
if (tmp->r_next == io) {
tmp->r_next = io->r_next;
break;
}
}
} There is a comment, so if I just assume that the code does what the comment says then its easy. But if I want to verify the code then it becomes difficult. |
Yes. And it is still possible, when talking to a 0.25 IIOD server. This is enforced by a mutex, Now we need to have a similar locking mechanism, but that works at the iiod_io level instead.
Can we see the diff? |
Yeah, I tend to agree. This could be wrapped in some list_add() / list_del() thingy. But this is a simple linked_list so not that messy. And again, arguable... For me, when I look at modern c++ code, it's very hard to understand what it's doing.
Alright, if it was like this before, then the same serialization should not be a problem. And, FWIW, we want to be really fast when streaming...
I think it's somewhere in this discussion... But it's the pool based stuff we're talking about |
Can you test pcercuei@85e8499? |
it immediately goes into a deadlock |
Ok, this one should work better: pcercuei@524e979 |
Yeah, the question is which thread wakes up first when |
The last discard item code block that I posted is pretty clear after looking at it for 1 min, but this code confuses me. First I would put the next = io->r_next into the for loop header or does /* Discard the entry from the readers list */
for (io = priv->readers; io; io = next) {
next = io->r_next;
iiod_responder_signal_io(io, priv->thrd_err_code);
} |
it occasionally hangs |
It absolutely does not matter, does it? |
What if the thread waiting on |
Yes, then change the check in iiod_io_get_response_async() to |
I don't know if its relevant. The hangs in my app occur when the block read request is waiting (with timeout) because my sdr is not sending data. In the meantime I am doing attr rw stuff from another thread. When my sdr is sending data, I never got the hangs. So it probably also has something todo with it. The hangs also seem to be associated with bufffer_cancel() calls, which I occasionally do when there is no data from sdr (because I need to clear the dma buffers). Cancelling a buffer that is in the poll() call waiting for data while at the same time doing rw attr stuff seems to trigger the problem. |
I assume that after your The "free block" and "destroy buffer" commands are done on the iiod-responder's default iiod_io, so those will race against an attr rw command, and that's why it is relevant. |
The problem is that unit tests (if there are any) do not cover these abort cases, because there is no SDR source that emulates this behavior and the normal SDRs like pluto also don't show this behavior, because they either send data or no data at all. A dummy iio device that can be configured for different behavior would be nice to have. It's super difficult and slow to develop code that the developer cannot immediately test. |
Yeah, that should do it but I think your first suggestion for a new busy mutex would likely be easier. Only thing with the mutex approach is that, in theory, an app can mess up in a way that one thread locks the IO busy lock and another one unlocks it which is far from ideal (eg: an app using a thread for enqueuing blocks and another thread for dequeuing - as odd as that looks?!)
There's already a dummy iio device in the kernel but tbh, I'm not sure what you can do with it. Definitely not emulating high speed DMA buffers. |
I think we should integrate IIO-EMU into libiio. |
It could be re-implemented as a Libiio backend. |
This is caused by iiod_responder_wait_done() in iiod-responder.c. The thread is probably not stopped correctly and therefore cannot join.
Testscenario: latest master libiio, network context to sdr.
The text was updated successfully, but these errors were encountered: