-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix all detectable thread races in chip-tool #7478
Conversation
chip::DeviceLayer::PlatformMgr().StopEventLoopTask(); | ||
#endif | ||
|
||
chip::DeviceLayer::PlatformMgr().LockChipStack(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't stopping the event loop task mean the lock is no longer needed?
Also, as an aside (and for better or worse(), locking but not stopping the event loop task would not be safe because this would leave open the possibility of dereferencing destroyed objects with further in-flight callbacks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change to stop and lock and whatnot around controller shutdown should not be necessary after #7430, I think...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To Michael's first point, yes, you're right, the locks around Shutdown are not needed since the CHIP thread is no longer running.
I think I saw races before with the code that was there before @bzbarsky-apple 's change in #7430, hence why I did what I did. I'll take it out.
@bzbarsky-apple I've made some intentional pivots in DeviceController::Shutdown - see my comments later on here.
chip::DeviceLayer::PlatformMgr().StopEventLoopTask(); | ||
#endif | ||
|
||
chip::DeviceLayer::PlatformMgr().LockChipStack(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change to stop and lock and whatnot around controller shutdown should not be necessary after #7430, I think...
// great at all. | ||
ReturnErrorOnFailure(DeviceLayer::PlatformMgr().Shutdown()); | ||
// | ||
// We don't call the PlatformMgr's() Shutdown() API since that is a thread-safe API, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this still needed after #7430 ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Somebody is misunderstanding something here. It could well be me, so apologies if that's the case.
But on POSIX at least, PlatformMgr().Shutdown() is pthread_join() to the event loop thread. Not only do you not need the lock for this, you must NOT have the lock held when calling this.
Instead, by clearing the should-run flag and performing the pthread_join, you're safely pending on exit of the event loop thread. Once this occurs, there is no longer have another thread to contend with and subsequent shutdown can occur without the lock.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, it was me slightly misunderstanding. Yes, I see now that POSIX _StopEventLoopTask
has locking, and only holds the lock for this:
mShouldRunEventLoop.store(false, std::memory_order_relaxed);
if (mChipTask) {
SystemLayer.WakeSelect();
It would be nice to leverage what I think was the original intent, using atomics, and not have to hold the lock here. But that's an aside.
I have a larger question though. On POSIX at least:(1) the code can't safely call these subsequent shutdown and delete operations until the event loop thread is pthread_joined, and (2) it is obviously impossible to pthread_join the event loop thread from the event loop thread itself.
This all means that, at least on POSIX, both Controller shutdown and Platform shutdown (pthread_join) must be called from another thread. So at least on POSIX, it's unclear how the separation of TeardownChipStack
accomplishes anything. How does the caller leverage this refactor to make shutdown more sensible? It still isn't possible to dispatch these shutdown operations to the event loop itself.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, so the current situation seems to be that Controller and Platform shutdown needs to happen from some not-the-message-handling thread thread. My general point here is that if the first thing Controller shutdown does is Platform shutdown after that we are the only thread left and we should not need any locking here...
And yes, I agree we need to not be holding the lock at all while calling Platform shutdown.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, it is very possible that I am missing something here. This stuff is not very straightforward. :(
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One option is to have it only terminate the CHIP thread if called from a different thread AND mChipTask is set:
template <class ImplClass>
CHIP_ERROR GenericPlatformManagerImpl_POSIX<ImplClass>::_StopEventLoopTask()
{
int err = 0;
mShouldRunEventLoop.store(false, std::memory_order_relaxed);
if (mChipTask && (pthread_self() != mChipTask))
{
//
// We need to grab the lock to protect critical sections accessed by the WakeSelect() call within
// SystemLayer
//
Impl()->LockChipStack();
SystemLayer.WakeSelect();
Impl()->UnlockChipStack();
SuccessOrExit(err = pthread_join(mChipTask, nullptr));
}
exit:
return System::MapErrorPOSIX(err);
}
This makes it callable from any thread context, including the CHIP context as well. This will always stop the chip event queue, and additionally, stop the thread as well if one was created.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor nit: this needs to use pthread_equal
for the comparison to pthread_selfI(). Other than that, seems like a good solution and finally gets us to the uniformity we want to have.
I was trying to figure out if there are any safety issues around the mChipTask variable. I don't think there are any. If there is an attempt to call _StopEventLoopTask
simultaneously from both the non-event-loop thread and the event-loop thread (which one should not do), I believe this code would still work properly and often return ESRCH
from pthread_join (that is, the improper use would not lead to undefined behavior and would often produce a sane error).
As to reads of mChipTask
not being atomic, it appears that in this particular case, no problems can arise from that. I believe this to be the case because the non-event-loop thread is the only writer and the event-loop thread is strictly a reader, and the writer's writes are implicitly synchronized on pthread_create
and pthread_join
calls.
Looks great to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome! Will make the changes soon.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another small wrinkle!
A new example app causing a wrinkle: TestMdns.cpp
It spawns a thread to run the CHIP event loop on its own (doesn't call into StartEventLoopTask()
), and the main thread, arms a timer and when it fires, tears down the stack.
When the timer fires, calling StopEventLoopTask()
just sets the mShouldRunEventLoop
flag to false, but doesn't actually wait for the event queue to stop.
So this triggers a race condition detection error by tsan (good job tsan!), since the calling thread continues ahead to calling Shutdown, which races accessing the SDK with the CHIP thread that is still doing work and hasn't actually gotten around to stopping the event queue.
StopEventLoopTask()
doesn't wait for the event loop to stop (by waiting on pthread_join()
) since in this case, a thread was created externally outside of the PlatformMgr's awareness.
Suggestion: Set mChipTask
to point to the thread that is running the event loop when _RunEventLoop()
is called, and set a special flag that indicates whether it's 'externally managed' or 'internally managed' thread.
In this case, it would be externally managed.
In StopEventLoopTask()
, we check to see if pthread_sef()
is != mChipTask, if it's not the same, then check if it's externally or internally managed. If internal, do what we do today (signal wake, pthread_join).
If external, signal, then wait on a cond var till the event queue stops (which in turn, would signal back).
We shouldn't try to assume the thread will terminate if externally managed, hence why I think we shouldn't do pthread_join
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've made the requisite fixes to the POSIX implementation, and had to do some fairly major surgery there.
@msandstedt please check it out.
template <class ImplClass> | ||
CHIP_ERROR GenericPlatformManagerImpl_POSIX<ImplClass>::_Shutdown() | ||
{ | ||
int err = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we remove the goto exit pattern here?
int err = _StopEventLoopTask();
if (err != SYSTEM_NO_ERROR) {
err = GenericPlatformManagerImpl<ImplClass>::_TeardownChipStack();
}
return System::MapErrorPOSIX(err);
Maybe if _StopEvent* and _Teardown... return ints, the _Shutdown should do the same to not even need a MapErrorPOSIX and could use return macros:
ReturnErrorOnFailure(_StopEventLoopTask());
return GenericPlatformManagerImpl<ImplClass>::_TeardownChipStack();
although I am not sure if the CHIP_ERROR temporary in the macros will work well with the int :/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Altering the return values of those functions isn't the scope of this PR (I was trying to keep this very narrow to just fixing the threading issues), so given that, there is a need to do the error conversion.
Also, I think we should have a single return pointer in a given function to permit consolidated 'exit cleanup'?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Over time I have found that the 'consolidated exit cleanup' results in code concessions:
- variables need to be declared at top, hence increasing scope unnecessarely
- readability-wise a 'return' vs 'ExitNow' is functionally the same with the exit macros incurring extra mental effort (minimal though once you get used to the code)
- separates code into 'initialization', 'usage' and 'return/cleanup' that for longer functions is harder to read
As a general thing going forward I would want to discourage any usage of 'goto exit' at all because it hurts readability and C++ has better facilities for cleanup in the form of RAII. The only case I do not have a clean answer for is 'if (err) Log'.
Since this is newly written code, I expect it to not use goto exit
. It does have somewhat of a cleanup (mapping posix error to chip) but that seems solvable.
src/include/platform/internal/GenericPlatformManagerImpl_POSIX.cpp
Outdated
Show resolved
Hide resolved
*/ | ||
struct ExecutionContext | ||
{ | ||
ChipDeviceCommissioner * Commissioner; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggest having an explicit constructor to clarify instantiation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
POD structs intentionally contain public members, which mean you don't need to 'insulate' the contents of the structs as much since the members are 'public API'.
For something as simple as this, having a constructor seems like un-necessary boilerplate (especially since it's not paired with getters/setters).
If you still think it's valuable, let me know and I'll add one.
@@ -241,6 +241,21 @@ void GenericPlatformManagerImpl_FreeRTOS<ImplClass>::PostEventFromISR(const Chip | |||
template <class ImplClass> | |||
CHIP_ERROR GenericPlatformManagerImpl_FreeRTOS<ImplClass>::_Shutdown(void) | |||
{ | |||
VerifyOrDieWithMsg(false, DeviceLayer, "Shutdown is not implemented"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why assert here? And why is shutdown not implemented on FreeRTOS? Is there an issue tracking this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm asserting because in general, during shutdown, folks rarely check the return values of these calls (this should really be 'void' perhaps?), and since it's not implemented, the contractual expectation of this API is not held and should result in an assert to make it easier to discover this later when it gets thrown on embedded platforms.
OK, it's |
Looks like the main change there was that each command used to construct its own DeviceCommissioner and now it does not. And that constructing all that stuff just took time; DeviceCommissioner is not small. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for all the work this. This really is a big improvement.
} | ||
|
||
UpdateWaitForResponse(true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This does not look right to me. The order here will be:
- Take stack lock, send command, release stack lock.
- Take
cvWaitingForResponseMutex
. - Set
mWaitingForResponse
to true. - Release
cvWaitingForResponseMutex
. - Block on a condvar until
mWaitingForResponse
becomes false.
What if the response comes in between steps 1 and 2? Won't we end up blocking on a response that will never come (because it already did) until we time out?
Or am I missing something that makes this safe?
UpdateWaitForResponse
takes a lock, so we need to be a bit careful, but I think if we are consistent about only calling UpdateWaitForResponse
while the stack lock is held we will be OK because then the acquisition order will always be the same so we can't deadlock...
Alternately, if we do the initial UpdateWaitForResponse
before we ever send the request, which is what PairingCommand
does, for example. That might be simpler.
Note that Tsan won't detect this problem because access to mWaitingForResponse
is in fact synchronized so there is no data race...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will fix this across the board for all command sending logic.
} | ||
|
||
UpdateWaitForResponse(true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This has the same issue: response could come back before UpdateWaitForResponse(true)
runs, since we release the stack lock before we call this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will fix.
std::atomic_uint16_t mTestIndex; | ||
const uint16_t mTestCount = 5; | ||
uint16_t mTestIndex = 0; | ||
uint16_t mTestCount = 5; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
uint16_t mTestCount = 5; | |
const uint16_t mTestCount = 5; |
* @brief | ||
* Tears down the entirety of the stack, including destructing key objects in the system. | ||
* This is not a thread-safe API, and should be called with external synchronization. | ||
*/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe explicitly point to DeviceController::Shutdown
documentation here, since that provides more details?
* | ||
* Additionally, this stops the CHIP task if the following criteria are met: | ||
* 1. One was created earlier through a call to StartEventLoopTask | ||
* 2. This call isn't being made from that task. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a little weird to have this API be "stop task" when sometimes it does not, but I don't have a better suggestion... The docs will have to do, I guess.
src/include/platform/internal/GenericPlatformManagerImpl_POSIX.cpp
Outdated
Show resolved
Hide resolved
src/include/platform/internal/GenericPlatformManagerImpl_POSIX.h
Outdated
Show resolved
Hide resolved
Problem: This PR achieves the following to fix-up the various thread-races detected by tsan in chip-tool: Change: - Following the pattern of 'external synchronization', sprinkled LockChipStack() and UnlockChipStack() calls around key call sites that called into the stack from the various command logic in chip-tool - Removed usleep and global instance hacks. - Reverts changes in project-chip#7494 - Re-structured Command::Run to now have the bulk of the stack initialization and shutdown be managed before Run() is called in Commands::Run(), and an ExecutionContext object pointer be stashed inside the Command for convenient access. This reduces the changes of people writing new commands of getting stack initialization wrong. - Instead of sometimes using chip::Controller::DeviceController and sometimes DeviceCommissioner, just used the latter in all commands since that is the super-set class anyways. - Added a new 'StopEventLoopTask' that is thread-safe, that is needed to be called by application logic before DeviceController::Shutdown() can be called with external synchronization. - Pivots PlatformMgr::Shutdown() to not handle stopping the event queue, but only focus on cleaning up the stack objects. - Fixed up TestMdns as well along the way. Testing: - Enabled tsan using 'is_tsan' build arg and used that catch well over 10+ races, with not a single false-positive. - Ran through all the chip-tool command groups (pairing, IM, discover, testcluster, payload, etc) 10x each to ensure no regressions in functionality as well as ensuring clean shutdown with tsan.
9cfc555
to
7ee2771
Compare
This doesn't hold true on Darwin, issue #7557 tracks the problem.connectedhomeip/examples/chip-tool/commands/common/Commands.cpp Lines 88 to 98 in 7ee2771
This comment was generated by todo based on a
|
This doesn't hold true on Darwin, issue #7557 tracks the problem.connectedhomeip/examples/chip-tool/commands/common/Commands.cpp Lines 88 to 98 in 94bb4e3
This comment was generated by todo based on a
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The .zapt file needs to be updated to match the output we want.
This doesn't hold true on Darwin, issue #7557 tracks the problem.connectedhomeip/examples/chip-tool/commands/common/Commands.cpp Lines 88 to 98 in e60afe4
This comment was generated by todo based on a
|
* Fix thread races in chip-tool Problem: This PR achieves the following to fix-up the various thread-races detected by tsan in chip-tool: Change: - Following the pattern of 'external synchronization', sprinkled LockChipStack() and UnlockChipStack() calls around key call sites that called into the stack from the various command logic in chip-tool - Removed usleep and global instance hacks. - Reverts changes in project-chip#7494 - Re-structured Command::Run to now have the bulk of the stack initialization and shutdown be managed before Run() is called in Commands::Run(), and an ExecutionContext object pointer be stashed inside the Command for convenient access. This reduces the changes of people writing new commands of getting stack initialization wrong. - Instead of sometimes using chip::Controller::DeviceController and sometimes DeviceCommissioner, just used the latter in all commands since that is the super-set class anyways. - Added a new 'StopEventLoopTask' that is thread-safe, that is needed to be called by application logic before DeviceController::Shutdown() can be called with external synchronization. - Pivots PlatformMgr::Shutdown() to not handle stopping the event queue, but only focus on cleaning up the stack objects. - Fixed up TestMdns as well along the way. Testing: - Enabled tsan using 'is_tsan' build arg and used that catch well over 10+ races, with not a single false-positive. - Ran through all the chip-tool command groups (pairing, IM, discover, testcluster, payload, etc) 10x each to ensure no regressions in functionality as well as ensuring clean shutdown with tsan. * Restyler fixes * Forgot a file..
This enables TSAN checking when running the functional tests in CI. This is currently only enabled on Linux since the various races there were fixed when project-chip#7478 landed. When Darwin races are fixed, tsan can be enabled for those too.
This enables TSAN checking when running the functional tests in CI. This is currently only enabled on Linux since the various races there were fixed when project-chip#7478 landed. When Darwin races are fixed, tsan can be enabled for those too.
* Enable TSAN builds on CI for Functional Tests. This enables TSAN checking when running the functional tests in CI. This is currently only enabled on Linux since the various races there were fixed when #7478 landed. When Darwin races are fixed, tsan can be enabled for those too. * Fix-ups as per review feedback, including enabling it for Darwin * Increased timeout for tsan jobs since it takes some time! * Increased timeout for tsan jobs since it takes some time (real fix)!
* Fix thread races in chip-tool Problem: This PR achieves the following to fix-up the various thread-races detected by tsan in chip-tool: Change: - Following the pattern of 'external synchronization', sprinkled LockChipStack() and UnlockChipStack() calls around key call sites that called into the stack from the various command logic in chip-tool - Removed usleep and global instance hacks. - Reverts changes in project-chip#7494 - Re-structured Command::Run to now have the bulk of the stack initialization and shutdown be managed before Run() is called in Commands::Run(), and an ExecutionContext object pointer be stashed inside the Command for convenient access. This reduces the changes of people writing new commands of getting stack initialization wrong. - Instead of sometimes using chip::Controller::DeviceController and sometimes DeviceCommissioner, just used the latter in all commands since that is the super-set class anyways. - Added a new 'StopEventLoopTask' that is thread-safe, that is needed to be called by application logic before DeviceController::Shutdown() can be called with external synchronization. - Pivots PlatformMgr::Shutdown() to not handle stopping the event queue, but only focus on cleaning up the stack objects. - Fixed up TestMdns as well along the way. Testing: - Enabled tsan using 'is_tsan' build arg and used that catch well over 10+ races, with not a single false-positive. - Ran through all the chip-tool command groups (pairing, IM, discover, testcluster, payload, etc) 10x each to ensure no regressions in functionality as well as ensuring clean shutdown with tsan. * Restyler fixes * Forgot a file..
* Enable TSAN builds on CI for Functional Tests. This enables TSAN checking when running the functional tests in CI. This is currently only enabled on Linux since the various races there were fixed when project-chip#7478 landed. When Darwin races are fixed, tsan can be enabled for those too. * Fix-ups as per review feedback, including enabling it for Darwin * Increased timeout for tsan jobs since it takes some time! * Increased timeout for tsan jobs since it takes some time (real fix)!
Problem:
This PR fixes-up all detectable thread-races as detected by tsan in chip-tool.
Change Overview:
LockChipStack()
andUnlockChipStack()
calls around key call sites that called into the stack from the various command logic in chip-tool that runs from the main thread.usleep()
and global instance hacks.Command::Run()
to now have the bulk of the stack initialization and shutdown be managed similarly for all commands beforeRun()
is called inCommands::Run()
, and anExecutionContext
object pointer be stashed inside the Command for convenient access to stack objects. This reduces the chances of people writing new commands of getting stack initialization wrong.chip::Controller::DeviceController
and sometimesDeviceCommissioner
, just used the latter in all commands since that is the super-set class anyways.PlatformMgr::StopEventLoopTask()
that is thread-safe, that is needed to be called by application logic beforeDeviceController::Shutdown()
can be called with external synchronization.Testing: