Get the SDK's Threading model sorted out #6841

mrjerryjohns · 2021-05-14T23:24:42Z

As #6441 has demonstrated, it's clear that there isn't a clear expectation of how the threading model should be for the SDK amongst all the developers working on the SDK.

We should get this sorted out and more importantly, documented so that new developers & reviewers know the expectation well.

mrjerryjohns · 2021-05-14T23:35:53Z

My suggestion:

Unless stipulated otherwise, the default assumption for all SDK APIs shall be that they are run on a single threading context (i.e the CHIP thread context). If this is not the case, it shall be clearly documented in the description of the API call.
If any of the SDK APIs are to be invoked from a different threading context, they shall be called with the CHIP lock held (invoked through the PlatformManager::LockChipStack call).
This lock should be a recursive lock to reduce the possibility of dead-locks, and to ensure a clear ownership model for the threads doing locking/unlocking.
It should be possible to 'post' to the CHIP threading context to make it call closures that have to be run in the CHIP threading context, either as an alternative to the grab-lock-and-call option, or to split up work into a bottom-half such that it reduces the stack depth.

mrjerryjohns · 2021-05-14T23:37:20Z

FYI @msandstedt @andreilitvin @bzbarsky-apple

msandstedt · 2021-05-15T16:07:14Z

Thanks for filing this. The suggested four tenets seem very straight forward.

WRT 3: If I'm being asked, I also vote for recursive locks. I've heard many times 'these are bad', 'these lead to messy locking practices', etc. I've found it to be quite the opposite and can lead to clean and maintainable patterns.

Edit: Setting aside the debate about whether recursive locks are an inherent anti-pattern, if some of our desired target platforms don't provide a recursive locking primitive, then we obviously can't use such a thing.

Also, the point is well taken that we do not need recursive locks if the event loop holds the single platform lock during each iteration and there is a guarantee that callbacks are always either executed with the platform lock held or the platform lock not held. It just has to be uniform. Currently, on platforms where the sdk does lock (most I believe), callbacks are executed with the lock held.

andy31415 · 2021-05-17T13:32:13Z

All suggestioons seem reasonable. I would rename the LockChipStack to LockChipStackRecursive and also provide some RAII helpers for this to make code cleaner and not require exit labels.

We now also have a 'assertStackIsLocked' to help detecting thread issues. I would suggest to add this in more places as possible and eventually move it to a forcing assert (with chipDie) instead of the log only assert we currently have once threading model is more stable.

tcarmelveilleux · 2021-05-19T21:10:28Z

yufengwangca · 2021-05-21T18:31:39Z

We have leveraged a big chunk of code base from Weave with single thread mode in mind, but we didn't adopt the thread mode from Weave as we port Message layer to CHIP stack and merge with existing Transport layer. We need to sort it out.

Change LockChipStack to be recursive will hide some problems and make things extremely hard to debug later.

In general, I have some concerns to reuse big chip lock for other purpose, this lock is only used to protect event processing in Weave, and has similar role in CHIP currently, the CHIP event loop takes this lock around each event it processes.

If we have to use this lock to protect other stuff, we need to be extremely careful and make sure we only take this lock from none chip code in a non chip main thread.

andy31415 · 2021-05-21T21:12:59Z

I believe we currently assume 'we are single threaded'. Supporting multi-threaded operation is probably a non-trivial change.

I believe the lock should be shareable between threads for the purposes of saying 'only one piece of chip code is running at one time' or the ask people to only run things inside the chip event system.

tcarmelveilleux · 2021-05-21T21:24:56Z

The lock is only to protect app-thread code trying to call IM and secure message API related to Matter stack. It should not be used for everything. There is no practical alternative at this moment, short of re-write with different concurrency style. Today, there is unsufficient protection and lots of shared data between multiple threads (including CHIP app client code in one thread, and main CHIP event processing loop), so we only stand to improve things if we do it safely.

mspang · 2021-05-25T18:22:14Z

I am the person who removed the recursive locking and so obviously I disagree with reintroducing it.

Why do you want to use a recursive lock?

Recursive locks will allow arbitrary reentrancy, which is error prone
e.g., Invariants protected by the lock can easily be violated with reentrant control flow.
Some codebases / platforms / applications may not support them or may not permit them.
i.e., It could be a barrier to adoption. Specifically, this includes Google projects with which we need to integrate.
Recursive locks means we can not be sure that unlock will actually unlock.
e.g., you can't wait on a condition variable without risking a deadlock
Recursive locking can mask errors (e.g. lock inversions) in the common cases leaving only a low level of difficult to debug deadlocks.

Recursive locking is invasive and hard to remove. It took the Linux kernel years to remove their recursive lock (BKL). This is not a path to follow in a new library / application.

#1 is actually quite false today. CHIP is not single threaded on POSIX and moreover there is currently no straightforward way to integrate CHIP into existing single threaded applications. #5556 will help with this and then it should be possible to share a single thread with the application. I think what you mean to say in #1 is that almost all SDK APIs assume external synchronization.

Re #2. External synchronization is also the correct description here. It does not matter how you do it or how many threads you have as long as races are prevented through synchronization.

If you use the "device layer", you do get a global lock in order to support mutual exclusion with its "event loop task" (IO thread). This is actually not a "core" API but layered in the middle between application code and low level parts of the SDK (as yufeng says, it was added for mutual exclusion between the application & event processing task).

The creator of POSIX recursive locks has some strong thoughts on the matter [1]

[1] https://groups.google.com/g/comp.programming.threads/c/tcrTKnfP8HI/m/me2K7_byNdgJ

andy31415 · 2021-05-25T18:33:04Z

@mspang the case for the recursive lock is to be able to ensure things are in the "IO loop" - we seem to be using a lot of resources within the CHIP stack without any locking (all the pools and shared bits). We either have to ensure locking on each of them or ensure a single 'CHIP thread/lock'.

Are you against the recursive bit (i.e. we need to think careful which methods are to be called from external threads and which not) or against a big global lock in general and require granular locking?

Ideally our data structures should be thread safe, however this will take a bit of work and has some minor resource cost as we need to allocate space for all those mutexes and extra lock/unlock code - assuming more places for those than the assertions on the one big lock.

mspang · 2021-05-25T21:54:43Z

We are not close to having fine grained locking and I'd say that it's unnecessary and likely even infeasible to do that for 1.0. The clear alternative is to require all accesses to SDK objects to be serialized. Arguably, that's already the design, and the races we've been seeing are from not following / enforcing it.

Serializing access does not necessarily imply locking; ensuring all accesses happen on a single thread is also sufficient. Coarse locks and sharing one thread achieve exactly the same thing. Currently if you enable the device layer you must own the lock it provides over all of your API calls and while holding any pointers, or else the IO thread will stomp on you. This is the property we have been forgetting to enforce. If you don't use device layer (in-tree this is only Android) you can use another strategy as long as concurrent calls are avoided, but currently Android uses exactly the same strategy implemented in the JNI layer rather than in the device layer.

yufengwangca · 2021-05-26T03:33:44Z

#7117 is posted to prove the concept with following proposed Thread Model:

All access to CHIP Service APIs must be performed via events posted to the CHIP event loop.
Some CHIP Library APIs can be directly called from other threads need dedicated mutex locking.
The boundary of application domain and CHIP stack domain is Protocol Layer.
Hide all of the event dispatching including thread context switch within Protocol Layer.

This proposed thread model can be used to integrate CHIP stack with application using a shared thread (recommended) or using a separate thread to drive CHIP event loop (such as integrate CHIP to existing apps).

msandstedt · 2021-05-26T14:11:05Z

@yufengwangca ,

This proposed thread model can be used to integrate CHIP stack with application using a shared thread (recommended) or using a separate thread to drive CHIP event loop (such as integrate CHIP to existing apps).

I would respectfully request that we not codify this recommendation here, in comments, documentation or anywhere else. This would seem to relegate the latter approach to a second-class status and that's just not reasonable. There is not consensus that a single-shared thread is necessary or better. Also, we know this won't be the approach on many, many platforms. Even the project repository itself does not follow this.

We know that, without question, the sdk must be supportable in multithreaded environments. The job here is to define how safe synchronization will occur across thread boundaries. This has to work in 100% of cases. It isn't optional and can't be less robust than an implementation that exists entirely within a single event loop.

yufengwangca · 2021-05-27T17:22:40Z

I try to make a summary of our discussions above.

We should not mandate how third-party integrate CHIP stack, either with a shared thread or using separate thread.
Recursive locking should not be used within CHIP stack.
It is hard to protect the whole CHIP stack with locking and make it thread-safe, a more realistic solution is to make all access to CHIP Service APIs be performed via events posted to the CHIP event loop.

yufengwangca · 2022-01-21T17:10:54Z

We have moved to the single thread mode with various PRs landed.

mrjerryjohns mentioned this issue May 14, 2021

Fix race condition during CRMP sendMessage #6441

Merged

andy31415 assigned yunhanw-google and yufengwangca and unassigned yunhanw-google May 21, 2021

yufengwangca mentioned this issue May 26, 2021

[Discussion] Post event in Interaction Model protocol to the event queue of CHIP thread to send message #7117

Closed

bzbarsky-apple mentioned this issue Jun 2, 2021

Crashes in chip-tool client after PR 7060 #7297

Closed

This was referenced Jun 8, 2021

Fix chip-tool's race conditions #7468

Closed

Fix all detectable thread races in chip-tool #7478

Merged

erjiaqing mentioned this issue Jul 19, 2021

[python] Add ChipStack.RunOnChipThread to run a python function on CHIP stack #8472

Merged

yufengwangca closed this as completed Jan 21, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Get the SDK's Threading model sorted out #6841

Get the SDK's Threading model sorted out #6841

mrjerryjohns commented May 14, 2021

mrjerryjohns commented May 14, 2021 •

edited

Loading

mrjerryjohns commented May 14, 2021

msandstedt commented May 15, 2021 •

edited

Loading

andy31415 commented May 17, 2021

tcarmelveilleux commented May 19, 2021 •

edited

Loading

yufengwangca commented May 21, 2021

andy31415 commented May 21, 2021

tcarmelveilleux commented May 21, 2021

mspang commented May 25, 2021 •

edited

Loading

andy31415 commented May 25, 2021

mspang commented May 25, 2021 •

edited

Loading

yufengwangca commented May 26, 2021 •

edited

Loading

msandstedt commented May 26, 2021

yufengwangca commented May 27, 2021

yufengwangca commented Jan 21, 2022

Get the SDK's Threading model sorted out #6841

Get the SDK's Threading model sorted out #6841

Comments

mrjerryjohns commented May 14, 2021

mrjerryjohns commented May 14, 2021 • edited Loading

mrjerryjohns commented May 14, 2021

msandstedt commented May 15, 2021 • edited Loading

andy31415 commented May 17, 2021

tcarmelveilleux commented May 19, 2021 • edited Loading

yufengwangca commented May 21, 2021

andy31415 commented May 21, 2021

tcarmelveilleux commented May 21, 2021

mspang commented May 25, 2021 • edited Loading

andy31415 commented May 25, 2021

mspang commented May 25, 2021 • edited Loading

yufengwangca commented May 26, 2021 • edited Loading

msandstedt commented May 26, 2021

yufengwangca commented May 27, 2021

yufengwangca commented Jan 21, 2022

mrjerryjohns commented May 14, 2021 •

edited

Loading

msandstedt commented May 15, 2021 •

edited

Loading

tcarmelveilleux commented May 19, 2021 •

edited

Loading

mspang commented May 25, 2021 •

edited

Loading

mspang commented May 25, 2021 •

edited

Loading

yufengwangca commented May 26, 2021 •

edited

Loading