Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix deadlock between do_work, reveive_timeout and receive_command_ack #590

Merged
merged 3 commits into from
Nov 21, 2018

Conversation

hdiethelm
Copy link

Same as #589 for develop branch

core/mavlink_commands.cpp Outdated Show resolved Hide resolved
julianoes
julianoes previously approved these changes Nov 20, 2018
@hdiethelm
Copy link
Author

Sorry about my force-push, you where to fast... ;-) I had to remove
lock.lock();
because a unique_lock locks it's mutex in the constructor.

@hdiethelm
Copy link
Author

Hmm, I just checked the code again, the problem is even more complex and lies here:
https://github.com/Dronecode/DronecodeSDK/blob/develop/core/locked_queue.h#L45
This now unlocks and relocks the queue for just a really short time, which allows that an other process waiting at borrow_front to continue and creating the problems.

By using _state_mutex in the way I did, this problem is indirectly solved. And by inverting the order back as it was before, you can get a deadlock where one process has locked _state_mutex and is blocked at borrow_front and the other has locked the queue by calling borrow_front and then blocks when wanting to call borrow_front . So in this case, state_mutex is the correct name but the order locking state_mutex and then calling borrow_front has to be always the same to be deadlock prof.

However, I think that's not all of the problem. I think the actions are:
-receive_command_ack -> borrow_front
-receive_command_ack -> _state_mutex.lock()
-do_work waits in borrow_front
-receive_command_ack -> _work_queue.pop_front();
-pop_front unlocks the queue mutex
-do_work can continue in borrow_front until _state_mutex.lock() and waits there
-receive_command_ack -> _work_queue.pop_front(); waits at lock(_mutex); because this is locked by borrow_front from do_work

Result:
do_work locked at:
https://github.com/Dronecode/DronecodeSDK/blob/develop/core/mavlink_commands.cpp#L287
receive_command_ack locked at
https://github.com/Dronecode/DronecodeSDK/blob/develop/core/mavlink_commands.cpp#L151
in:
https://github.com/Dronecode/DronecodeSDK/blob/develop/core/locked_queue.h#L47

This is exactly the state I get from my backtrace, I just mapped the line numbers to develop.

I'll suppose a better way of implementing locked_queue as soon as I find time.

@julianoes
Copy link
Collaborator

julianoes commented Nov 20, 2018

This now unlocks and relocks the queue for just a really short time, which allows that an other process waiting at borrow_front to continue and creating the problems.

Agreed, that's actually silly!

you can get a deadlock where one process has locked _state_mutex and is blocked at borrow_front and the other has locked the queue by calling borrow_front and then blocks when wanting to call borrow_front pop_front

Right ?

I'll suppose a better way of implementing locked_queue as soon as I find time.

I wonder if we can just drop the silly relocking on pop_front? I'll make a PR.

@hdiethelm
Copy link
Author

you can get a deadlock where one process has locked _state_mutex and is blocked at borrow_front and the other has locked the queue by calling borrow_front and then blocks when wanting to call borrow_front pop_front

Right ?

No, borrow_front locks the queue. Here I described what would happen if you don't apply my pull request and just fix locked_queue.

Bad (mixed order):
a: lock _state_mutex -> locked
b: borrow_front -> locked
a: borrow front -> locked already -> wait
b: lock _state_mutex -> locked already -> wait -> deadlock

Good (same order):
a: lock _state_mutex -> locked
b: lock _state_mutex -> locked already -> wait
a: borrow_front -> locked
a: pop_front -> unlocked
a: unlock _state_mutex -> unlocked
b: lock _state_mutex -> continues -> locked again
b: borrow_front -> locked
b: pop_front -> unlocked
b: unlock _state_mutex -> unlocked
-> fine

I'll suppose a better way of implementing locked_queue as soon as I find time.

I wonder if we can just drop the silly relocking on pop_front? I'll make a PR.

It's actually not so simple. In pop_front(), you have to check if the mutex is already locked. If not, something went wrong.
I think also https://github.com/Dronecode/DronecodeSDK/blob/develop/core/locked_queue.h#L38
went in because there was a problem with locking.
Might be there is the possibility that borrow_front() returns a unique_lock() which then needs to be supplied to pop_front() or is automatically deleted at the end of the function calling borrow_front(). So it will be save.

@julianoes
Copy link
Collaborator

No, borrow_front locks the queue. Here I described what would happen if you don't apply my pull request and just fix locked_queue.

You wrote borrow front and then again borrow front which confused me.

It's actually not so simple.

You're right. What if we make it simple and instead of using this locked queue just have a normal queue and one mutex for both the state and the queue?

@hdiethelm
Copy link
Author

You're right. What if we make it simple and instead of using this locked queue just have a normal queue and one mutex for both the state and the queue?

That's sounds also good! This way you can always use lock_guards and unique_locks. The way it is implemented now if you forget one return_front or pop_front, the queue is locked forever.

@hdiethelm
Copy link
Author

I was thinking about some other possibilities, but assuming that return_front can be called from different processes without calling borrow_front first, none of them will work surely. So I think using a normal queue and renaming _state_mutex to _mutex or so is really the best way.

If you make sure borrow_front is always called first and later pop or return, this should also work:
hdiethelm@c62f6e6

However, if you change the assert in return_front to:
if(_queue_borrowed){
_mutex.unlock();
}
one process can lock the queue by calling borrow_front and another can unlock it by calling return_front without borrow_front which is not so nice.

@julianoes
Copy link
Collaborator

I think we should include this change anyway: hdiethelm/DronecodeSDK@c62f6e6 and adapt all places where it is used.

So I think using a normal queue and renaming _state_mutex to _mutex or so is really the best way.

The question is actually, do we use a mutex because we want to protect the _state field or because we want to not interfere in general, and I think it's the latter so just one mutex makes more sense.

@julianoes
Copy link
Collaborator

So, for now I'm gonna merge this as is because you convinced me that this fixes the problem for now.
Then, we can always improve it in of the ways that you suggested.

@julianoes julianoes merged commit e671354 into mavlink:develop Nov 21, 2018
@julianoes julianoes mentioned this pull request Nov 21, 2018
3 tasks
@julianoes
Copy link
Collaborator

Followed up in #601.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants