Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Get the SDK's Threading model sorted out #6841

Closed
mrjerryjohns opened this issue May 14, 2021 · 15 comments
Closed

Get the SDK's Threading model sorted out #6841

mrjerryjohns opened this issue May 14, 2021 · 15 comments
Assignees

Comments

@mrjerryjohns
Copy link
Contributor

As #6441 has demonstrated, it's clear that there isn't a clear expectation of how the threading model should be for the SDK amongst all the developers working on the SDK.

We should get this sorted out and more importantly, documented so that new developers & reviewers know the expectation well.

@mrjerryjohns
Copy link
Contributor Author

mrjerryjohns commented May 14, 2021

My suggestion:

  1. Unless stipulated otherwise, the default assumption for all SDK APIs shall be that they are run on a single threading context (i.e the CHIP thread context). If this is not the case, it shall be clearly documented in the description of the API call.
  2. If any of the SDK APIs are to be invoked from a different threading context, they shall be called with the CHIP lock held (invoked through the PlatformManager::LockChipStack call).
  3. This lock should be a recursive lock to reduce the possibility of dead-locks, and to ensure a clear ownership model for the threads doing locking/unlocking.
  4. It should be possible to 'post' to the CHIP threading context to make it call closures that have to be run in the CHIP threading context, either as an alternative to the grab-lock-and-call option, or to split up work into a bottom-half such that it reduces the stack depth.

@mrjerryjohns
Copy link
Contributor Author

FYI @msandstedt @andreilitvin @bzbarsky-apple

@msandstedt
Copy link
Contributor

msandstedt commented May 15, 2021

Thanks for filing this. The suggested four tenets seem very straight forward.

WRT 3: If I'm being asked, I also vote for recursive locks. I've heard many times 'these are bad', 'these lead to messy locking practices', etc. I've found it to be quite the opposite and can lead to clean and maintainable patterns.

Edit: Setting aside the debate about whether recursive locks are an inherent anti-pattern, if some of our desired target platforms don't provide a recursive locking primitive, then we obviously can't use such a thing.

Also, the point is well taken that we do not need recursive locks if the event loop holds the single platform lock during each iteration and there is a guarantee that callbacks are always either executed with the platform lock held or the platform lock not held. It just has to be uniform. Currently, on platforms where the sdk does lock (most I believe), callbacks are executed with the lock held.

@andy31415
Copy link
Contributor

All suggestioons seem reasonable. I would rename the LockChipStack to LockChipStackRecursive and also provide some RAII helpers for this to make code cleaner and not require exit labels.

We now also have a 'assertStackIsLocked' to help detecting thread issues. I would suggest to add this in more places as possible and eventually move it to a forcing assert (with chipDie) instead of the log only assert we currently have once threading model is more stable.

@tcarmelveilleux
Copy link
Contributor

tcarmelveilleux commented May 19, 2021

I think a plan of action would be as follows:

  • Make sure that the PlatformManager::LockChipStack is recursive on all platforms and correctly records the owner thread/task for the assertion
  • Rename PlatformManager::LockChipStack to PlatformManager::LockChipStackRecursive()
  • Introduce an RAII lock helper for PlatformManager::LockChipStackRecursive()
  • Settle on the external synchronization mechanisms to use for the Matter main event loop, for when a caller decides not to post with ScheduleWork.
  • Make sure assertStackIsLocked works on all platforms (see lwip's sys_check_core_locking for a good example)
  • Introduce assertStackIsLocked to core resource acquisition helpers at the very least (e.g. ExchangeManager::RegisterUMH, BitMapObjectPool::*, etc), as a first step
  • Document explicitly the concurrency management model in a top level document (which?)
  • Add examples of using PlatformManager::ScheduleWork to delegate some work to the main event loop
  • Add version of PlatformManager::ScheduleWork that can work with wrapped non-capturing lambdas (e.g. std::function-like calls)
  • Delegate orderly init/shutdown to the main event loop, rather than distributed sets of unsafe calls. This also enables the main event loop to finish orderly work before Shutdown occurs (helped by Add CHIP stack module. #6967).

Any more to the "first step" plan to get ahead of this?

@yufengwangca
Copy link
Contributor

We have leveraged a big chunk of code base from Weave with single thread mode in mind, but we didn't adopt the thread mode from Weave as we port Message layer to CHIP stack and merge with existing Transport layer. We need to sort it out.

Change LockChipStack to be recursive will hide some problems and make things extremely hard to debug later.

In general, I have some concerns to reuse big chip lock for other purpose, this lock is only used to protect event processing in Weave, and has similar role in CHIP currently, the CHIP event loop takes this lock around each event it processes.

If we have to use this lock to protect other stuff, we need to be extremely careful and make sure we only take this lock from none chip code in a non chip main thread.

@andy31415
Copy link
Contributor

I believe we currently assume 'we are single threaded'. Supporting multi-threaded operation is probably a non-trivial change.

I believe the lock should be shareable between threads for the purposes of saying 'only one piece of chip code is running at one time' or the ask people to only run things inside the chip event system.

@tcarmelveilleux
Copy link
Contributor

The lock is only to protect app-thread code trying to call IM and secure message API related to Matter stack. It should not be used for everything. There is no practical alternative at this moment, short of re-write with different concurrency style. Today, there is unsufficient protection and lots of shared data between multiple threads (including CHIP app client code in one thread, and main CHIP event processing loop), so we only stand to improve things if we do it safely.

@mspang
Copy link
Contributor

mspang commented May 25, 2021

I am the person who removed the recursive locking and so obviously I disagree with reintroducing it.

Why do you want to use a recursive lock?

  • Recursive locks will allow arbitrary reentrancy, which is error prone
  • e.g., Invariants protected by the lock can easily be violated with reentrant control flow.
  • Some codebases / platforms / applications may not support them or may not permit them.
    i.e., It could be a barrier to adoption. Specifically, this includes Google projects with which we need to integrate.
  • Recursive locks means we can not be sure that unlock will actually unlock.
    e.g., you can't wait on a condition variable without risking a deadlock
  • Recursive locking can mask errors (e.g. lock inversions) in the common cases leaving only a low level of difficult to debug deadlocks.

Recursive locking is invasive and hard to remove. It took the Linux kernel years to remove their recursive lock (BKL). This is not a path to follow in a new library / application.

#1 is actually quite false today. CHIP is not single threaded on POSIX and moreover there is currently no straightforward way to integrate CHIP into existing single threaded applications. #5556 will help with this and then it should be possible to share a single thread with the application. I think what you mean to say in #1 is that almost all SDK APIs assume external synchronization.

Re #2. External synchronization is also the correct description here. It does not matter how you do it or how many threads you have as long as races are prevented through synchronization.

If you use the "device layer", you do get a global lock in order to support mutual exclusion with its "event loop task" (IO thread). This is actually not a "core" API but layered in the middle between application code and low level parts of the SDK (as yufeng says, it was added for mutual exclusion between the application & event processing task).

The creator of POSIX recursive locks has some strong thoughts on the matter [1]

[1] https://groups.google.com/g/comp.programming.threads/c/tcrTKnfP8HI/m/me2K7_byNdgJ

@andy31415
Copy link
Contributor

@mspang the case for the recursive lock is to be able to ensure things are in the "IO loop" - we seem to be using a lot of resources within the CHIP stack without any locking (all the pools and shared bits). We either have to ensure locking on each of them or ensure a single 'CHIP thread/lock'.

Are you against the recursive bit (i.e. we need to think careful which methods are to be called from external threads and which not) or against a big global lock in general and require granular locking?

Ideally our data structures should be thread safe, however this will take a bit of work and has some minor resource cost as we need to allocate space for all those mutexes and extra lock/unlock code - assuming more places for those than the assertions on the one big lock.

@mspang
Copy link
Contributor

mspang commented May 25, 2021

We are not close to having fine grained locking and I'd say that it's unnecessary and likely even infeasible to do that for 1.0. The clear alternative is to require all accesses to SDK objects to be serialized. Arguably, that's already the design, and the races we've been seeing are from not following / enforcing it.

Serializing access does not necessarily imply locking; ensuring all accesses happen on a single thread is also sufficient. Coarse locks and sharing one thread achieve exactly the same thing. Currently if you enable the device layer you must own the lock it provides over all of your API calls and while holding any pointers, or else the IO thread will stomp on you. This is the property we have been forgetting to enforce. If you don't use device layer (in-tree this is only Android) you can use another strategy as long as concurrent calls are avoided, but currently Android uses exactly the same strategy implemented in the JNI layer rather than in the device layer.

@yufengwangca
Copy link
Contributor

yufengwangca commented May 26, 2021

#7117 is posted to prove the concept with following proposed Thread Model:

  1. All access to CHIP Service APIs must be performed via events posted to the CHIP event loop.
  2. Some CHIP Library APIs can be directly called from other threads need dedicated mutex locking.
  3. The boundary of application domain and CHIP stack domain is Protocol Layer.
  4. Hide all of the event dispatching including thread context switch within Protocol Layer.

This proposed thread model can be used to integrate CHIP stack with application using a shared thread (recommended) or using a separate thread to drive CHIP event loop (such as integrate CHIP to existing apps).

@msandstedt
Copy link
Contributor

@yufengwangca ,

This proposed thread model can be used to integrate CHIP stack with application using a shared thread (recommended) or using a separate thread to drive CHIP event loop (such as integrate CHIP to existing apps).

I would respectfully request that we not codify this recommendation here, in comments, documentation or anywhere else. This would seem to relegate the latter approach to a second-class status and that's just not reasonable. There is not consensus that a single-shared thread is necessary or better. Also, we know this won't be the approach on many, many platforms. Even the project repository itself does not follow this.

We know that, without question, the sdk must be supportable in multithreaded environments. The job here is to define how safe synchronization will occur across thread boundaries. This has to work in 100% of cases. It isn't optional and can't be less robust than an implementation that exists entirely within a single event loop.

@yufengwangca
Copy link
Contributor

I try to make a summary of our discussions above.

  1. We should not mandate how third-party integrate CHIP stack, either with a shared thread or using separate thread.
  2. Recursive locking should not be used within CHIP stack.
  3. It is hard to protect the whole CHIP stack with locking and make it thread-safe, a more realistic solution is to make all access to CHIP Service APIs be performed via events posted to the CHIP event loop.

@yufengwangca
Copy link
Contributor

We have moved to the single thread mode with various PRs landed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants