
[Tizen] Fix deadlock caused by mutex lock inversion #25573

Merged: 1 commit into project-chip:master on Mar 13, 2023

Conversation

arkq
Contributor

@arkq arkq commented Mar 8, 2023

Problem

The Tizen mDNS API uses callbacks to notify the caller about particular events, such as resolving a service. Unfortunately, the Tizen API internally takes a global recursive mutex for every API call, and that mutex is not released before the callbacks are invoked.

The problem arises when the caller uses its own mutex that needs to be locked inside a callback function passed to the mDNS API. The following scenario can then occur:

  • in thread 1, we are running a callback function that holds the mDNS API internal lock, and inside that callback we need to take our own lock
  • in thread 2, the mDNS API is called under the caller's lock, and internally it tries to take its internal lock
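The inversion described above can be reproduced in miniature. Below is a minimal, self-contained C++ sketch with hypothetical names: `dnssd_mutex` and `stack_mutex` stand in for the DNSSD library's internal mutex and the caller's (matter stack) mutex, and `std::timed_mutex` is used so the demonstration times out instead of hanging the process.

```cpp
#include <atomic>
#include <chrono>
#include <mutex>
#include <thread>

// Hypothetical stand-ins for the two locks from the report: the DNSSD
// library's internal mutex and the caller's (matter stack) mutex.
std::timed_mutex dnssd_mutex;
std::timed_mutex stack_mutex;

// Returns true when the inversion is observed: each thread holds one lock
// and times out waiting for the other.
bool InversionTimesOut()
{
    std::atomic<int> holding{0}; // threads that hold their first lock
    std::atomic<int> done{0};    // threads that finished their try_lock
    bool t1_timed_out = false, t2_timed_out = false;

    auto worker = [&](std::timed_mutex & first, std::timed_mutex & second, bool & timed_out) {
        first.lock();       // thread 1: library lock / thread 2: caller's lock
        ++holding;
        while (holding < 2) // wait until both threads hold their first lock
            std::this_thread::yield();
        timed_out = !second.try_lock_for(std::chrono::milliseconds(200));
        if (!timed_out)
            second.unlock();
        ++done;
        while (done < 2)    // keep first lock held until both tries finished
            std::this_thread::yield();
        first.unlock();
    };

    std::thread t1(worker, std::ref(dnssd_mutex), std::ref(stack_mutex), std::ref(t1_timed_out));
    std::thread t2(worker, std::ref(stack_mutex), std::ref(dnssd_mutex), std::ref(t2_timed_out));
    t1.join();
    t2.join();
    return t1_timed_out && t2_timed_out;
}
```

With real (non-timed) mutexes the same schedule would block both threads forever, which is exactly the reported deadlock.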

Changes

As a workaround, the mDNS callback function no longer locks the matter stack mutex; instead, it only gathers all data and schedules further processing onto a glib idle source callback. Fixes #25466
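The workaround pattern can be sketched independently of GLib. In the sketch below, a plain queue plus an explicit drain function stands in for the GLib idle source (the `g_idle_add`-style scheduling used by the real patch); all names are hypothetical.

```cpp
#include <deque>
#include <mutex>
#include <string>

// Hypothetical model of the workaround: the library callback must not take
// the (matter) stack lock, so it only copies the result and enqueues it.
// A separate "idle" step, running with the library lock NOT held, later
// drains the queue and does the real processing under the stack lock.

struct ResolveResult
{
    std::string name;
    std::string address;
};

std::mutex queue_mutex;            // protects only the queue
std::deque<ResolveResult> pending; // results gathered inside the library callback

std::mutex stack_mutex;            // stand-in for the matter stack lock
int processed = 0;                 // work done by the idle step

// Called by the library with its internal lock held: gather data only.
void OnResolve(const ResolveResult & r)
{
    std::lock_guard<std::mutex> q(queue_mutex);
    pending.push_back(r); // no stack lock taken here, so no inversion
}

// Scheduled later (a glib idle source in the real patch): the library lock
// is not held here, so taking the stack lock cannot invert against it.
void IdleProcess()
{
    std::deque<ResolveResult> batch;
    {
        std::lock_guard<std::mutex> q(queue_mutex);
        batch.swap(pending);
    }
    std::lock_guard<std::mutex> s(stack_mutex);
    processed += static_cast<int>(batch.size());
}
```

The key property is that no code path ever holds the library lock and the stack lock at the same time.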

Testing

Tested using the test case provided in #25466

@github-actions github-actions bot added platform tizen For Tizen platform labels Mar 8, 2023
@github-actions

github-actions bot commented Mar 8, 2023

PR #25573: Size comparison from 2086355 to 09621f8

Full report (1 build for cc32xx)
| platform | target | config | section | 2086355 | 09621f8 | change | % change |
|---|---|---|---|---|---|---|---|
| cc32xx | lock | CC3235SF_LAUNCHXL | | 0 | 0 | 0 | 0.0 |
| | | | (read only) | 644425 | 644425 | 0 | 0.0 |
| | | | (read/write) | 203688 | 203688 | 0 | 0.0 |
| | | | .ARM.attributes | 44 | 44 | 0 | 0.0 |
| | | | .ARM.exidx | 8 | 8 | 0 | 0.0 |
| | | | .bss | 197088 | 197088 | 0 | 0.0 |
| | | | .comment | 194 | 194 | 0 | 0.0 |
| | | | .data | 1480 | 1480 | 0 | 0.0 |
| | | | .debug_abbrev | 930235 | 930235 | 0 | 0.0 |
| | | | .debug_aranges | 87344 | 87344 | 0 | 0.0 |
| | | | .debug_frame | 300044 | 300044 | 0 | 0.0 |
| | | | .debug_info | 20267469 | 20267469 | 0 | 0.0 |
| | | | .debug_line | 2659766 | 2659766 | 0 | 0.0 |
| | | | .debug_loc | 2802853 | 2802853 | 0 | 0.0 |
| | | | .debug_ranges | 282960 | 282960 | 0 | 0.0 |
| | | | .debug_str | 3024079 | 3024079 | 0 | 0.0 |
| | | | .ramVecs | 780 | 780 | 0 | 0.0 |
| | | | .resetVecs | 64 | 64 | 0 | 0.0 |
| | | | .rodata | 105929 | 105929 | 0 | 0.0 |
| | | | .shstrtab | 232 | 232 | 0 | 0.0 |
| | | | .stab | 204 | 204 | 0 | 0.0 |
| | | | .stabstr | 441 | 441 | 0 | 0.0 |
| | | | .stack | 2048 | 2048 | 0 | 0.0 |
| | | | .strtab | 378571 | 378571 | 0 | 0.0 |
| | | | .symtab | 256624 | 256624 | 0 | 0.0 |
| | | | .text | 536372 | 536372 | 0 | 0.0 |

@msandstedt
Contributor

msandstedt commented Mar 8, 2023

Can't we just acquire the stack lock at the top of each iteration of https://github.com/project-chip/connectedhomeip/blob/master/src/platform/Tizen/MainLoop.h? This would ensure that locks are always acquired in the same order in all threads.

This would also ensure that any calls from Tizen's MainLoop.h into the larger body of sdk code are always thread safe.
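The suggestion above is the textbook cure for lock inversion: pick one global acquisition order and use it in every thread. A minimal sketch, with hypothetical mutex names rather than SDK code:

```cpp
#include <mutex>
#include <thread>

std::mutex stack_mutex;    // stand-in for the matter stack lock
std::mutex internal_mutex; // stand-in for the library's internal lock
int work_done = 0;

// If every thread takes stack_mutex first and internal_mutex second, the
// circular wait required for deadlock can never form.
void DoWorkInFixedOrder()
{
    std::lock_guard<std::mutex> s(stack_mutex);
    std::lock_guard<std::mutex> i(internal_mutex);
    ++work_done;
}

// Two threads hammering the same pair of locks complete without deadlock
// because both use the same acquisition order.
bool RunTwoThreads()
{
    std::thread t1([] { for (int k = 0; k < 1000; ++k) DoWorkInFixedOrder(); });
    std::thread t2([] { for (int k = 0; k < 1000; ++k) DoWorkInFixedOrder(); });
    t1.join();
    t2.join();
    return work_done == 2000;
}
```

The difficulty discussed in the following comments is that the library, not the application, decides when its internal lock is taken, so this ordering cannot be imposed from outside without wrapping the dispatch loop.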

@arkq
Contributor Author

arkq commented Mar 8, 2023

> This would ensure that locks are always acquired in the same order in all threads.

I don't see how that would happen, unless we somehow add matter locking into the Tizen dns-sd library... The problem here is that the Tizen library calls the callback function (OnResolve() in our case) with the DNSSD internal mutex locked. And there is no way to take the matter lock before that, unless we rewrite the g_main_loop_run() function so that the matter stack lock is acquired before calling g_main_dispatch(). Of course, we could write a custom g_main_loop_run in the Tizen platform, but IMHO that's overkill. The true problem here is not with the matter SDK code, but with the Tizen DNS-SD library. The true fix should be in the library, so that the callback is not called with the mutex locked... I guess that someone didn't know how to solve this problem correctly, so instead of proper internal locking we've got a recursive mutex... which works in most cases, but from time to time someone will have to face lock inversion...

@msandstedt
Contributor

>> This would ensure that locks are always acquired in the same order in all threads.
>
> I don't see how that would happen, unless we somehow add matter locking into the Tizen dns-sd library... The problem here is that the Tizen library calls the callback function (OnResolve() in our case) with the DNSSD internal mutex locked. And there is no way to take the matter lock before that, unless we rewrite the g_main_loop_run() function so that the matter stack lock is acquired before calling g_main_dispatch(). Of course, we could write a custom g_main_loop_run in the Tizen platform, but IMHO that's overkill. The true problem here is not with the matter SDK code, but with the Tizen DNS-SD library. The true fix should be in the library, so that the callback is not called with the mutex locked... I guess that someone didn't know how to solve this problem correctly, so instead of proper internal locking we've got a recursive mutex... which works in most cases, but from time to time someone will have to face lock inversion...

Isn't the g_main_loop thread that's dispatching the callbacks started in https://github.com/project-chip/connectedhomeip/blob/master/src/platform/Tizen/MainLoop.h? If so, can't we wrap the g_poll() function with one that takes the matter stack mutex? Better yet, can't we execute Tizen's main loop from the sdk's event loop / scheduler? glib should be amenable to either of these approaches.

More generally, if the Tizen platform layer is starting its own event loop on a thread separate from the sdk's main event loop + thread, it seems like this should have built-in thread safety for calling back into the sdk, as I think many of the tasks, timers and callbacks executing from Tizen/MainLoop are going to be doing just this. And to do so safely, the matter stack mutex must be acquired. Have we audited everything that executes from Tizen/MainLoop to verify that all calls into the main sdk properly acquire the lock? If Tizen/MainLoop just always did this on its own, we could be much more confident.
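One way to read the "wrap g_poll()" idea: a loop that waits for events without the stack lock held, then takes the stack lock before dispatching each callback, so everything dispatched from the loop calls back into the sdk in a consistent lock order. The sketch below models that with a condition variable standing in for g_poll; all names are hypothetical, not the actual MainLoop.h code.

```cpp
#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>

std::mutex stack_mutex; // stand-in for the matter stack lock
std::mutex queue_mutex;
std::condition_variable cv;
std::deque<std::function<void()>> events;
bool stop = false;

// Called from any thread to schedule work onto the loop.
void PostEvent(std::function<void()> fn)
{
    {
        std::lock_guard<std::mutex> q(queue_mutex);
        events.push_back(std::move(fn));
    }
    cv.notify_one();
}

// One iteration: wait (the g_poll analogue) WITHOUT the stack lock, then
// dispatch the callback WITH the stack lock held, so every callback that
// calls back into the sdk already runs under the stack lock.
void RunLoopUntilStopped()
{
    for (;;)
    {
        std::function<void()> fn;
        {
            std::unique_lock<std::mutex> q(queue_mutex);
            cv.wait(q, [] { return stop || !events.empty(); });
            if (events.empty())
                return; // stop requested and nothing left to dispatch
            fn = std::move(events.front());
            events.pop_front();
        }
        std::lock_guard<std::mutex> s(stack_mutex); // acquired before dispatch
        fn();
    }
}
```

The trade-off arkq raises later applies here too: holding the stack lock across every dispatch serializes the loop against the matter thread, which partly defeats the purpose of a separate thread.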

@arkq
Contributor Author

arkq commented Mar 9, 2023

> More generally, if the Tizen platform layer is starting its own event loop on a thread separate from the sdk's main event loop + thread, it seems like this should have built-in thread safety for calling back into the sdk

Yes, I can see your point. Some time ago I did some refactoring of the glib main thread in the Linux platform. The Tizen platform might need similar refactoring, because right now, instead of running a single dedicated thread with the main event loop, a new thread is spawned every time async processing is needed. I know that it would be even better to incorporate glib main event loop processing into the matter event loop. I started such refactoring in Linux, but I had to set it aside because of other important tasks.

PS.
Did you "somehow" approve this PR? pullapprove removed the "review - pending" tag (it's possible to merge it), but this PR has only one approval... However, I'd like to wait for @gharveymn's confirmation

@arkq arkq marked this pull request as draft March 9, 2023 07:03
Contributor

@dh79pyun dh79pyun left a comment


Looks good.

Before refactoring the glib main thread, we can apply this to resolve the critical deadlock issue.
And we will prepare a Tizen SDK modification to remove the root cause of this issue.

@arkq arkq marked this pull request as ready for review March 13, 2023 08:26
@arkq
Contributor Author

arkq commented Mar 13, 2023

@msandstedt What do you think about merging it as is? Currently I'm working on the Tizen glib thread refactoring, but it will take some time. First I will unify all glib threads into one thread, and later I'll try to add proper locking when calling back into the matter stack. However, as I think about it right now, adding the matter stack lock directly into the glib main event loop would defeat the purpose of a separate thread, because all actions would be serialized with the matter thread. Anyway, I'll try to think of something...

PS. When done with the Tizen platform, Linux will also have to be "fixed", because it also runs a separate thread for the glib main event loop.

@msandstedt
Contributor

> @msandstedt What do you think about merging it as is? Currently I'm working on the Tizen glib thread refactoring, but it will take some time. First I will unify all glib threads into one thread, and later I'll try to add proper locking when calling back into the matter stack. However, as I think about it right now, adding the matter stack lock directly into the glib main event loop would defeat the purpose of a separate thread, because all actions would be serialized with the matter thread. Anyway, I'll try to think of something...
>
> PS. When done with the Tizen platform, Linux will also have to be "fixed", because it also runs a separate thread for the glib main event loop.

That sounds fine. The logic looks correct, and I understand the challenge with attempting a larger refactor.

@arkq arkq merged commit d1604b3 into project-chip:master Mar 13, 2023
@arkq arkq deleted the tizen-fix-nsd-lock branch March 13, 2023 16:34
lecndav pushed a commit to lecndav/connectedhomeip that referenced this pull request Mar 22, 2023
mwswartwout pushed a commit to mwswartwout/connectedhomeip that referenced this pull request Mar 27, 2023
Successfully merging this pull request may close these issues.

[BUG] [Tizen] Calling ChipDnssdResolve with the CHIP stack locked can cause deadlock
4 participants