-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Get the SDK's Threading model sorted out #6841
Comments
My suggestion:
|
Thanks for filing this. The suggested four tenets seem very straight forward.
Edit: Setting aside the debate about whether recursive locks are an inherent anti-pattern, if some of our desired target platforms don't provide a recursive locking primitive, then we obviously can't use such a thing. Also, the point is well taken that we do not need recursive locks if the event loop holds the single platform lock during each iteration and there is a guarantee that callbacks are always either executed with the platform lock held or the platform lock not held. It just has to be uniform. Currently, on platforms where the sdk does lock (most I believe), callbacks are executed with the lock held. |
All suggestioons seem reasonable. I would rename the We now also have a 'assertStackIsLocked' to help detecting thread issues. I would suggest to add this in more places as possible and eventually move it to a forcing assert (with chipDie) instead of the log only assert we currently have once threading model is more stable. |
I think a plan of action would be as follows:
Any more to the "first step" plan to get ahead of this? |
We have leveraged a big chunk of code base from Weave with single thread mode in mind, but we didn't adopt the thread mode from Weave as we port Message layer to CHIP stack and merge with existing Transport layer. We need to sort it out. Change LockChipStack to be recursive will hide some problems and make things extremely hard to debug later. In general, I have some concerns to reuse big chip lock for other purpose, this lock is only used to protect event processing in Weave, and has similar role in CHIP currently, the CHIP event loop takes this lock around each event it processes. If we have to use this lock to protect other stuff, we need to be extremely careful and make sure we only take this lock from none chip code in a non chip main thread. |
I believe we currently assume 'we are single threaded'. Supporting multi-threaded operation is probably a non-trivial change. I believe the lock should be shareable between threads for the purposes of saying 'only one piece of chip code is running at one time' or the ask people to only run things inside the chip event system. |
The lock is only to protect app-thread code trying to call IM and secure message API related to Matter stack. It should not be used for everything. There is no practical alternative at this moment, short of re-write with different concurrency style. Today, there is unsufficient protection and lots of shared data between multiple threads (including CHIP app client code in one thread, and main CHIP event processing loop), so we only stand to improve things if we do it safely. |
I am the person who removed the recursive locking and so obviously I disagree with reintroducing it. Why do you want to use a recursive lock?
Recursive locking is invasive and hard to remove. It took the Linux kernel years to remove their recursive lock (BKL). This is not a path to follow in a new library / application. #1 is actually quite false today. CHIP is not single threaded on POSIX and moreover there is currently no straightforward way to integrate CHIP into existing single threaded applications. #5556 will help with this and then it should be possible to share a single thread with the application. I think what you mean to say in #1 is that almost all SDK APIs assume external synchronization. Re #2. External synchronization is also the correct description here. It does not matter how you do it or how many threads you have as long as races are prevented through synchronization. If you use the "device layer", you do get a global lock in order to support mutual exclusion with its "event loop task" (IO thread). This is actually not a "core" API but layered in the middle between application code and low level parts of the SDK (as yufeng says, it was added for mutual exclusion between the application & event processing task). The creator of POSIX recursive locks has some strong thoughts on the matter [1] [1] https://groups.google.com/g/comp.programming.threads/c/tcrTKnfP8HI/m/me2K7_byNdgJ |
@mspang the case for the recursive lock is to be able to ensure things are in the "IO loop" - we seem to be using a lot of resources within the CHIP stack without any locking (all the pools and shared bits). We either have to ensure locking on each of them or ensure a single 'CHIP thread/lock'. Are you against the recursive bit (i.e. we need to think careful which methods are to be called from external threads and which not) or against a big global lock in general and require granular locking? Ideally our data structures should be thread safe, however this will take a bit of work and has some minor resource cost as we need to allocate space for all those mutexes and extra lock/unlock code - assuming more places for those than the assertions on the one big lock. |
We are not close to having fine grained locking and I'd say that it's unnecessary and likely even infeasible to do that for 1.0. The clear alternative is to require all accesses to SDK objects to be serialized. Arguably, that's already the design, and the races we've been seeing are from not following / enforcing it. Serializing access does not necessarily imply locking; ensuring all accesses happen on a single thread is also sufficient. Coarse locks and sharing one thread achieve exactly the same thing. Currently if you enable the device layer you must own the lock it provides over all of your API calls and while holding any pointers, or else the IO thread will stomp on you. This is the property we have been forgetting to enforce. If you don't use device layer (in-tree this is only Android) you can use another strategy as long as concurrent calls are avoided, but currently Android uses exactly the same strategy implemented in the JNI layer rather than in the device layer. |
#7117 is posted to prove the concept with following proposed Thread Model:
This proposed thread model can be used to integrate CHIP stack with application using a shared thread (recommended) or using a separate thread to drive CHIP event loop (such as integrate CHIP to existing apps). |
I would respectfully request that we not codify this recommendation here, in comments, documentation or anywhere else. This would seem to relegate the latter approach to a second-class status and that's just not reasonable. There is not consensus that a single-shared thread is necessary or better. Also, we know this won't be the approach on many, many platforms. Even the project repository itself does not follow this. We know that, without question, the sdk must be supportable in multithreaded environments. The job here is to define how safe synchronization will occur across thread boundaries. This has to work in 100% of cases. It isn't optional and can't be less robust than an implementation that exists entirely within a single event loop. |
I try to make a summary of our discussions above.
|
We have moved to the single thread mode with various PRs landed. |
As #6441 has demonstrated, it's clear that there isn't a clear expectation of how the threading model should be for the SDK amongst all the developers working on the SDK.
We should get this sorted out and more importantly, documented so that new developers & reviewers know the expectation well.
The text was updated successfully, but these errors were encountered: