Fix deadlock between do_work, reveive_timeout and receive_command_ack #590

hdiethelm · 2018-11-15T20:43:15Z

Same as #589 for develop branch

core/mavlink_commands.cpp

hdiethelm · 2018-11-20T12:49:50Z

Sorry about my force-push, you where to fast... ;-) I had to remove
lock.lock();
because a unique_lock locks it's mutex in the constructor.

hdiethelm · 2018-11-20T13:24:05Z

Hmm, I just checked the code again, the problem is even more complex and lies here:
https://github.com/Dronecode/DronecodeSDK/blob/develop/core/locked_queue.h#L45
This now unlocks and relocks the queue for just a really short time, which allows that an other process waiting at borrow_front to continue and creating the problems.

By using _state_mutex in the way I did, this problem is indirectly solved. And by inverting the order back as it was before, you can get a deadlock where one process has locked _state_mutex and is blocked at borrow_front and the other has locked the queue by calling borrow_front and then blocks when wanting to call borrow_front . So in this case, state_mutex is the correct name but the order locking state_mutex and then calling borrow_front has to be always the same to be deadlock prof.

However, I think that's not all of the problem. I think the actions are:
-receive_command_ack -> borrow_front
-receive_command_ack -> _state_mutex.lock()
-do_work waits in borrow_front
-receive_command_ack -> _work_queue.pop_front();
-pop_front unlocks the queue mutex
-do_work can continue in borrow_front until _state_mutex.lock() and waits there
-receive_command_ack -> _work_queue.pop_front(); waits at lock(_mutex); because this is locked by borrow_front from do_work

Result:
do_work locked at:
https://github.com/Dronecode/DronecodeSDK/blob/develop/core/mavlink_commands.cpp#L287
receive_command_ack locked at
https://github.com/Dronecode/DronecodeSDK/blob/develop/core/mavlink_commands.cpp#L151
in:
https://github.com/Dronecode/DronecodeSDK/blob/develop/core/locked_queue.h#L47

This is exactly the state I get from my backtrace, I just mapped the line numbers to develop.

I'll suppose a better way of implementing locked_queue as soon as I find time.

julianoes · 2018-11-20T14:39:32Z

This now unlocks and relocks the queue for just a really short time, which allows that an other process waiting at borrow_front to continue and creating the problems.

Agreed, that's actually silly!

you can get a deadlock where one process has locked _state_mutex and is blocked at borrow_front and the other has locked the queue by calling borrow_front and then blocks when wanting to call ~~borrow_front~~ pop_front

Right ?

I'll suppose a better way of implementing locked_queue as soon as I find time.

I wonder if we can just drop the silly relocking on pop_front? I'll make a PR.

hdiethelm · 2018-11-20T15:24:39Z

you can get a deadlock where one process has locked _state_mutex and is blocked at borrow_front and the other has locked the queue by calling borrow_front and then blocks when wanting to call borrow_front pop_front

Right ?

No, borrow_front locks the queue. Here I described what would happen if you don't apply my pull request and just fix locked_queue.

Bad (mixed order):
a: lock _state_mutex -> locked
b: borrow_front -> locked
a: borrow front -> locked already -> wait
b: lock _state_mutex -> locked already -> wait -> deadlock

Good (same order):
a: lock _state_mutex -> locked
b: lock _state_mutex -> locked already -> wait
a: borrow_front -> locked
a: pop_front -> unlocked
a: unlock _state_mutex -> unlocked
b: lock _state_mutex -> continues -> locked again
b: borrow_front -> locked
b: pop_front -> unlocked
b: unlock _state_mutex -> unlocked
-> fine

I'll suppose a better way of implementing locked_queue as soon as I find time.

I wonder if we can just drop the silly relocking on pop_front? I'll make a PR.

It's actually not so simple. In pop_front(), you have to check if the mutex is already locked. If not, something went wrong.
I think also https://github.com/Dronecode/DronecodeSDK/blob/develop/core/locked_queue.h#L38
went in because there was a problem with locking.
Might be there is the possibility that borrow_front() returns a unique_lock() which then needs to be supplied to pop_front() or is automatically deleted at the end of the function calling borrow_front(). So it will be save.

julianoes · 2018-11-20T16:10:58Z

No, borrow_front locks the queue. Here I described what would happen if you don't apply my pull request and just fix locked_queue.

You wrote borrow front and then again borrow front which confused me.

It's actually not so simple.

You're right. What if we make it simple and instead of using this locked queue just have a normal queue and one mutex for both the state and the queue?

hdiethelm · 2018-11-20T18:41:13Z

You're right. What if we make it simple and instead of using this locked queue just have a normal queue and one mutex for both the state and the queue?

That's sounds also good! This way you can always use lock_guards and unique_locks. The way it is implemented now if you forget one return_front or pop_front, the queue is locked forever.

hdiethelm · 2018-11-20T19:17:27Z

I was thinking about some other possibilities, but assuming that return_front can be called from different processes without calling borrow_front first, none of them will work surely. So I think using a normal queue and renaming _state_mutex to _mutex or so is really the best way.

If you make sure borrow_front is always called first and later pop or return, this should also work:
hdiethelm@c62f6e6

However, if you change the assert in return_front to:
if(_queue_borrowed){
_mutex.unlock();
}
one process can lock the queue by calling borrow_front and another can unlock it by calling return_front without borrow_front which is not so nice.

julianoes · 2018-11-21T11:10:40Z

I think we should include this change anyway: hdiethelm/DronecodeSDK@c62f6e6 and adapt all places where it is used.

So I think using a normal queue and renaming _state_mutex to _mutex or so is really the best way.

The question is actually, do we use a mutex because we want to protect the _state field or because we want to not interfere in general, and I think it's the latter so just one mutex makes more sense.

julianoes · 2018-11-21T11:20:44Z

So, for now I'm gonna merge this as is because you convinced me that this fixes the problem for now.
Then, we can always improve it in of the ways that you suggested.

julianoes · 2018-11-21T12:25:38Z

Followed up in #601.

Hannes Diethelm added 2 commits November 15, 2018 21:15

Fix deadlock between do_work, reveive_timeout and receive_command_ack

9ac5101

Fix style

d0fdf1c

hdiethelm mentioned this pull request Nov 16, 2018

Fix deadlock between do_work, reveive_timeout and receive_command_ack #589

Closed

julianoes requested changes Nov 19, 2018

View reviewed changes

core/mavlink_commands.cpp Outdated Show resolved Hide resolved

julianoes previously approved these changes Nov 20, 2018

View reviewed changes

mavlink_commands: Use unique_lock which can be unlocked early

a7948ce

hdiethelm dismissed julianoes’s stale review via a7948ce November 20, 2018 12:48

hdiethelm force-pushed the develop branch from 4549431 to a7948ce Compare November 20, 2018 12:48

julianoes approved these changes Nov 20, 2018

View reviewed changes

julianoes merged commit e671354 into mavlink:develop Nov 21, 2018

julianoes mentioned this pull request Nov 21, 2018

Improve locked queue deadlocks #601

Merged

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix deadlock between do_work, reveive_timeout and receive_command_ack #590

Fix deadlock between do_work, reveive_timeout and receive_command_ack #590

hdiethelm commented Nov 15, 2018

hdiethelm commented Nov 20, 2018

hdiethelm commented Nov 20, 2018

julianoes commented Nov 20, 2018 •

edited

Loading

hdiethelm commented Nov 20, 2018

julianoes commented Nov 20, 2018

hdiethelm commented Nov 20, 2018

hdiethelm commented Nov 20, 2018

julianoes commented Nov 21, 2018

julianoes commented Nov 21, 2018

julianoes commented Nov 21, 2018

Fix deadlock between do_work, reveive_timeout and receive_command_ack #590

Fix deadlock between do_work, reveive_timeout and receive_command_ack #590

Conversation

hdiethelm commented Nov 15, 2018

hdiethelm commented Nov 20, 2018

hdiethelm commented Nov 20, 2018

julianoes commented Nov 20, 2018 • edited Loading

hdiethelm commented Nov 20, 2018

julianoes commented Nov 20, 2018

hdiethelm commented Nov 20, 2018

hdiethelm commented Nov 20, 2018

julianoes commented Nov 21, 2018

julianoes commented Nov 21, 2018

julianoes commented Nov 21, 2018

julianoes commented Nov 20, 2018 •

edited

Loading