Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix all detectable thread races in chip-tool #7478

Merged
merged 3 commits into from
Jun 14, 2021

Conversation

mrjerryjohns
Copy link
Contributor

@mrjerryjohns mrjerryjohns commented Jun 9, 2021

Problem:

This PR fixes-up all detectable thread-races as detected by tsan in chip-tool.

Change Overview:

  • Following the pattern of 'external synchronization' as outlined in Get the SDK's Threading model sorted out #6841, sprinkled LockChipStack() and UnlockChipStack() calls around key call sites that called into the stack from the various command logic in chip-tool that runs from the main thread.
  • Removed usleep() and global instance hacks.
  • Re-structured Command::Run() to now have the bulk of the stack initialization and shutdown be managed similarly for all commands before Run() is called in Commands::Run(), and an ExecutionContext object pointer be stashed inside the Command for convenient access to stack objects. This reduces the chances of people writing new commands of getting stack initialization wrong.
  • Instead of sometimes using chip::Controller::DeviceController and sometimes DeviceCommissioner, just used the latter in all commands since that is the super-set class anyways.
  • Added a new PlatformMgr::StopEventLoopTask() that is thread-safe, that is needed to be called by application logic before DeviceController::Shutdown() can be called with external synchronization.
  • Fixed up TestMdns as well along the way.

Testing:

  • Enabled tsan using 'is_tsan' build arg and used that catch well over 10+ races, with not a single false-positive.
  • Ran through all the chip-tool command groups (pairing, IM, discover, testcluster, payload, etc) 10x each to ensure no regressions in functionality as well as ensuring clean shutdown with tsan.

@mrjerryjohns mrjerryjohns changed the title Fix all thread races in chip-tool Fix all detectable thread races in chip-tool Jun 9, 2021
chip::DeviceLayer::PlatformMgr().StopEventLoopTask();
#endif

chip::DeviceLayer::PlatformMgr().LockChipStack();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't stopping the event loop task mean the lock is no longer needed?

Also, as an aside (and for better or worse(), locking but not stopping the event loop task would not be safe because this would leave open the possibility of dereferencing destroyed objects with further in-flight callbacks.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change to stop and lock and whatnot around controller shutdown should not be necessary after #7430, I think...

Copy link
Contributor Author

@mrjerryjohns mrjerryjohns Jun 9, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To Michael's first point, yes, you're right, the locks around Shutdown are not needed since the CHIP thread is no longer running.

I think I saw races before with the code that was there before @bzbarsky-apple 's change in #7430, hence why I did what I did. I'll take it out.

@bzbarsky-apple I've made some intentional pivots in DeviceController::Shutdown - see my comments later on here.

chip::DeviceLayer::PlatformMgr().StopEventLoopTask();
#endif

chip::DeviceLayer::PlatformMgr().LockChipStack();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change to stop and lock and whatnot around controller shutdown should not be necessary after #7430, I think...

examples/chip-tool/commands/tests/TestCommand.cpp Outdated Show resolved Hide resolved
// great at all.
ReturnErrorOnFailure(DeviceLayer::PlatformMgr().Shutdown());
//
// We don't call the PlatformMgr's() Shutdown() API since that is a thread-safe API,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this still needed after #7430 ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Somebody is misunderstanding something here. It could well be me, so apologies if that's the case.

But on POSIX at least, PlatformMgr().Shutdown() is pthread_join() to the event loop thread. Not only do you not need the lock for this, you must NOT have the lock held when calling this.

Instead, by clearing the should-run flag and performing the pthread_join, you're safely pending on exit of the event loop thread. Once this occurs, there is no longer have another thread to contend with and subsequent shutdown can occur without the lock.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, it was me slightly misunderstanding. Yes, I see now that POSIX _StopEventLoopTask has locking, and only holds the lock for this:

    mShouldRunEventLoop.store(false, std::memory_order_relaxed);
    if (mChipTask)    {
        SystemLayer.WakeSelect();

It would be nice to leverage what I think was the original intent, using atomics, and not have to hold the lock here. But that's an aside.

I have a larger question though. On POSIX at least:(1) the code can't safely call these subsequent shutdown and delete operations until the event loop thread is pthread_joined, and (2) it is obviously impossible to pthread_join the event loop thread from the event loop thread itself.

This all means that, at least on POSIX, both Controller shutdown and Platform shutdown (pthread_join) must be called from another thread. So at least on POSIX, it's unclear how the separation of TeardownChipStack accomplishes anything. How does the caller leverage this refactor to make shutdown more sensible? It still isn't possible to dispatch these shutdown operations to the event loop itself.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, so the current situation seems to be that Controller and Platform shutdown needs to happen from some not-the-message-handling thread thread. My general point here is that if the first thing Controller shutdown does is Platform shutdown after that we are the only thread left and we should not need any locking here...

And yes, I agree we need to not be holding the lock at all while calling Platform shutdown.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, it is very possible that I am missing something here. This stuff is not very straightforward. :(

Copy link
Contributor Author

@mrjerryjohns mrjerryjohns Jun 10, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One option is to have it only terminate the CHIP thread if called from a different thread AND mChipTask is set:

template <class ImplClass>
CHIP_ERROR GenericPlatformManagerImpl_POSIX<ImplClass>::_StopEventLoopTask()
{
    int err = 0;

    mShouldRunEventLoop.store(false, std::memory_order_relaxed);

    if (mChipTask && (pthread_self() != mChipTask))
    {
        //
        // We need to grab the lock to protect critical sections accessed by the WakeSelect() call within
        // SystemLayer
        //
        Impl()->LockChipStack();
        SystemLayer.WakeSelect();
        Impl()->UnlockChipStack();

        SuccessOrExit(err = pthread_join(mChipTask, nullptr));
    }

exit:
    return System::MapErrorPOSIX(err);
}

This makes it callable from any thread context, including the CHIP context as well. This will always stop the chip event queue, and additionally, stop the thread as well if one was created.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor nit: this needs to use pthread_equal for the comparison to pthread_selfI(). Other than that, seems like a good solution and finally gets us to the uniformity we want to have.

I was trying to figure out if there are any safety issues around the mChipTask variable. I don't think there are any. If there is an attempt to call _StopEventLoopTask simultaneously from both the non-event-loop thread and the event-loop thread (which one should not do), I believe this code would still work properly and often return ESRCH from pthread_join (that is, the improper use would not lead to undefined behavior and would often produce a sane error).

As to reads of mChipTask not being atomic, it appears that in this particular case, no problems can arise from that. I believe this to be the case because the non-event-loop thread is the only writer and the event-loop thread is strictly a reader, and the writer's writes are implicitly synchronized on pthread_create and pthread_join calls.

Looks great to me.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome! Will make the changes soon.

Copy link
Contributor Author

@mrjerryjohns mrjerryjohns Jun 10, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another small wrinkle!

A new example app causing a wrinkle: TestMdns.cpp

It spawns a thread to run the CHIP event loop on its own (doesn't call into StartEventLoopTask()), and the main thread, arms a timer and when it fires, tears down the stack.

When the timer fires, calling StopEventLoopTask() just sets the mShouldRunEventLoop flag to false, but doesn't actually wait for the event queue to stop.

So this triggers a race condition detection error by tsan (good job tsan!), since the calling thread continues ahead to calling Shutdown, which races accessing the SDK with the CHIP thread that is still doing work and hasn't actually gotten around to stopping the event queue.

StopEventLoopTask() doesn't wait for the event loop to stop (by waiting on pthread_join()) since in this case, a thread was created externally outside of the PlatformMgr's awareness.

Suggestion: Set mChipTask to point to the thread that is running the event loop when _RunEventLoop() is called, and set a special flag that indicates whether it's 'externally managed' or 'internally managed' thread.

In this case, it would be externally managed.

In StopEventLoopTask(), we check to see if pthread_sef() is != mChipTask, if it's not the same, then check if it's externally or internally managed. If internal, do what we do today (signal wake, pthread_join).
If external, signal, then wait on a cond var till the event queue stops (which in turn, would signal back).

We shouldn't try to assume the thread will terminate if externally managed, hence why I think we shouldn't do pthread_join

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've made the requisite fixes to the POSIX implementation, and had to do some fairly major surgery there.

@msandstedt please check it out.

src/include/platform/PlatformManager.h Outdated Show resolved Hide resolved
src/controller/CHIPDeviceController.h Outdated Show resolved Hide resolved
src/controller/CHIPDeviceController.h Outdated Show resolved Hide resolved
template <class ImplClass>
CHIP_ERROR GenericPlatformManagerImpl_POSIX<ImplClass>::_Shutdown()
{
int err = 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we remove the goto exit pattern here?

int err = _StopEventLoopTask();

if (err != SYSTEM_NO_ERROR) {
    err = GenericPlatformManagerImpl<ImplClass>::_TeardownChipStack();
}

return System::MapErrorPOSIX(err);

Maybe if _StopEvent* and _Teardown... return ints, the _Shutdown should do the same to not even need a MapErrorPOSIX and could use return macros:

ReturnErrorOnFailure(_StopEventLoopTask());
return GenericPlatformManagerImpl<ImplClass>::_TeardownChipStack();

although I am not sure if the CHIP_ERROR temporary in the macros will work well with the int :/

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Altering the return values of those functions isn't the scope of this PR (I was trying to keep this very narrow to just fixing the threading issues), so given that, there is a need to do the error conversion.

Also, I think we should have a single return pointer in a given function to permit consolidated 'exit cleanup'?

Copy link
Contributor

@andy31415 andy31415 Jun 10, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Over time I have found that the 'consolidated exit cleanup' results in code concessions:

  • variables need to be declared at top, hence increasing scope unnecessarely
  • readability-wise a 'return' vs 'ExitNow' is functionally the same with the exit macros incurring extra mental effort (minimal though once you get used to the code)
  • separates code into 'initialization', 'usage' and 'return/cleanup' that for longer functions is harder to read

As a general thing going forward I would want to discourage any usage of 'goto exit' at all because it hurts readability and C++ has better facilities for cleanup in the form of RAII. The only case I do not have a clean answer for is 'if (err) Log'.

Since this is newly written code, I expect it to not use goto exit. It does have somewhat of a cleanup (mapping posix error to chip) but that seems solvable.

*/
struct ExecutionContext
{
ChipDeviceCommissioner * Commissioner;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest having an explicit constructor to clarify instantiation

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

POD structs intentionally contain public members, which mean you don't need to 'insulate' the contents of the structs as much since the members are 'public API'.

For something as simple as this, having a constructor seems like un-necessary boilerplate (especially since it's not paired with getters/setters).

If you still think it's valuable, let me know and I'll add one.

@@ -241,6 +241,21 @@ void GenericPlatformManagerImpl_FreeRTOS<ImplClass>::PostEventFromISR(const Chip
template <class ImplClass>
CHIP_ERROR GenericPlatformManagerImpl_FreeRTOS<ImplClass>::_Shutdown(void)
{
VerifyOrDieWithMsg(false, DeviceLayer, "Shutdown is not implemented");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why assert here? And why is shutdown not implemented on FreeRTOS? Is there an issue tracking this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm asserting because in general, during shutdown, folks rarely check the return values of these calls (this should really be 'void' perhaps?), and since it's not implemented, the contractual expectation of this API is not held and should result in an assert to make it easier to discover this later when it gets thrown on embedded platforms.

@bzbarsky-apple
Copy link
Contributor

bzbarsky-apple commented Jun 11, 2021

OK, it's chip-tool starting up faster: before this change it takes ~34-35ms to go from top of main to whatever it is that logs _Init right before logging TransportMgr initialized (looks like BLEManagerImpl::_Init). After this change it takes 4-5ms.

@bzbarsky-apple
Copy link
Contributor

Looks like the main change there was that each command used to construct its own DeviceCommissioner and now it does not. And that constructing all that stuff just took time; DeviceCommissioner is not small.

Copy link
Contributor

@msandstedt msandstedt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for all the work this. This really is a big improvement.

examples/chip-tool/commands/pairing/PairingCommand.cpp Outdated Show resolved Hide resolved
Comment on lines 33 to 35
}

UpdateWaitForResponse(true);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does not look right to me. The order here will be:

  1. Take stack lock, send command, release stack lock.
  2. Take cvWaitingForResponseMutex.
  3. Set mWaitingForResponse to true.
  4. Release cvWaitingForResponseMutex.
  5. Block on a condvar until mWaitingForResponse becomes false.

What if the response comes in between steps 1 and 2? Won't we end up blocking on a response that will never come (because it already did) until we time out?

Or am I missing something that makes this safe?

UpdateWaitForResponse takes a lock, so we need to be a bit careful, but I think if we are consistent about only calling UpdateWaitForResponse while the stack lock is held we will be OK because then the acquisition order will always be the same so we can't deadlock...

Alternately, if we do the initial UpdateWaitForResponse before we ever send the request, which is what PairingCommand does, for example. That might be simpler.

Note that Tsan won't detect this problem because access to mWaitingForResponse is in fact synchronized so there is no data race...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will fix this across the board for all command sending logic.

Comment on lines 47 to 49
}

UpdateWaitForResponse(true);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has the same issue: response could come back before UpdateWaitForResponse(true) runs, since we release the stack lock before we call this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will fix.

std::atomic_uint16_t mTestIndex;
const uint16_t mTestCount = 5;
uint16_t mTestIndex = 0;
uint16_t mTestCount = 5;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
uint16_t mTestCount = 5;
const uint16_t mTestCount = 5;

src/controller/CHIPDeviceController.h Outdated Show resolved Hide resolved
* @brief
* Tears down the entirety of the stack, including destructing key objects in the system.
* This is not a thread-safe API, and should be called with external synchronization.
*/
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe explicitly point to DeviceController::Shutdown documentation here, since that provides more details?

*
* Additionally, this stops the CHIP task if the following criteria are met:
* 1. One was created earlier through a call to StartEventLoopTask
* 2. This call isn't being made from that task.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a little weird to have this API be "stop task" when sometimes it does not, but I don't have a better suggestion... The docs will have to do, I guess.

Problem:

This PR achieves the following to fix-up the various thread-races detected by tsan in chip-tool:

Change:

- Following the pattern of 'external synchronization', sprinkled LockChipStack() and UnlockChipStack() calls around key call sites that called into the stack from the various command logic in chip-tool
- Removed usleep and global instance hacks.
- Reverts changes in project-chip#7494
- Re-structured Command::Run to now have the bulk of the stack initialization and shutdown be managed before Run() is called in Commands::Run(), and an ExecutionContext object pointer be stashed inside the Command for convenient access. This reduces the changes of people writing new commands of getting stack initialization wrong.
- Instead of sometimes using chip::Controller::DeviceController and sometimes DeviceCommissioner, just used the latter in all commands since that is the super-set class anyways.
- Added a new 'StopEventLoopTask' that is thread-safe, that is needed to be called by application logic before DeviceController::Shutdown() can be called with external synchronization.
- Pivots PlatformMgr::Shutdown() to not handle stopping the event queue,
  but only focus on cleaning up the stack objects.
- Fixed up TestMdns as well along the way.

Testing:

- Enabled tsan using 'is_tsan' build arg and used that catch well over
  10+ races, with not a single false-positive.
- Ran through all the chip-tool command groups (pairing, IM, discover,
  testcluster, payload, etc) 10x each to ensure no regressions in
  functionality as well as ensuring clean shutdown with tsan.
@todo
Copy link

todo bot commented Jun 11, 2021

This doesn't hold true on Darwin, issue #7557 tracks the problem.

// TODO: This doesn't hold true on Darwin, issue #7557 tracks the problem.
//
mController.Shutdown();
return (err == CHIP_NO_ERROR) ? EXIT_SUCCESS : EXIT_FAILURE;
}
CHIP_ERROR Commands::RunCommand(NodeId localId, NodeId remoteId, int argc, char ** argv)
{
CHIP_ERROR err = CHIP_NO_ERROR;
std::map<std::string, CommandsVector>::iterator cluster;


This comment was generated by todo based on a TODO comment in 7ee2771 in #7478. cc @mrjerryjohns.

@todo
Copy link

todo bot commented Jun 11, 2021

This doesn't hold true on Darwin, issue #7557 tracks the problem.

// TODO: This doesn't hold true on Darwin, issue #7557 tracks the problem.
//
mController.Shutdown();
return (err == CHIP_NO_ERROR) ? EXIT_SUCCESS : EXIT_FAILURE;
}
CHIP_ERROR Commands::RunCommand(NodeId localId, NodeId remoteId, int argc, char ** argv)
{
CHIP_ERROR err = CHIP_NO_ERROR;
std::map<std::string, CommandsVector>::iterator cluster;


This comment was generated by todo based on a TODO comment in 94bb4e3 in #7478. cc @mrjerryjohns.

Copy link
Contributor

@bzbarsky-apple bzbarsky-apple left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The .zapt file needs to be updated to match the output we want.

@todo
Copy link

todo bot commented Jun 11, 2021

This doesn't hold true on Darwin, issue #7557 tracks the problem.

// TODO: This doesn't hold true on Darwin, issue #7557 tracks the problem.
//
mController.Shutdown();
return (err == CHIP_NO_ERROR) ? EXIT_SUCCESS : EXIT_FAILURE;
}
CHIP_ERROR Commands::RunCommand(NodeId localId, NodeId remoteId, int argc, char ** argv)
{
CHIP_ERROR err = CHIP_NO_ERROR;
std::map<std::string, CommandsVector>::iterator cluster;


This comment was generated by todo based on a TODO comment in e60afe4 in #7478. cc @mrjerryjohns.

@mspang mspang merged commit 6cf91db into project-chip:master Jun 14, 2021
mkardous-silabs pushed a commit to mkardous-silabs/connectedhomeip that referenced this pull request Jun 14, 2021
* Fix thread races in chip-tool

Problem:

This PR achieves the following to fix-up the various thread-races detected by tsan in chip-tool:

Change:

- Following the pattern of 'external synchronization', sprinkled LockChipStack() and UnlockChipStack() calls around key call sites that called into the stack from the various command logic in chip-tool
- Removed usleep and global instance hacks.
- Reverts changes in project-chip#7494
- Re-structured Command::Run to now have the bulk of the stack initialization and shutdown be managed before Run() is called in Commands::Run(), and an ExecutionContext object pointer be stashed inside the Command for convenient access. This reduces the changes of people writing new commands of getting stack initialization wrong.
- Instead of sometimes using chip::Controller::DeviceController and sometimes DeviceCommissioner, just used the latter in all commands since that is the super-set class anyways.
- Added a new 'StopEventLoopTask' that is thread-safe, that is needed to be called by application logic before DeviceController::Shutdown() can be called with external synchronization.
- Pivots PlatformMgr::Shutdown() to not handle stopping the event queue,
  but only focus on cleaning up the stack objects.
- Fixed up TestMdns as well along the way.

Testing:

- Enabled tsan using 'is_tsan' build arg and used that catch well over
  10+ races, with not a single false-positive.
- Ran through all the chip-tool command groups (pairing, IM, discover,
  testcluster, payload, etc) 10x each to ensure no regressions in
  functionality as well as ensuring clean shutdown with tsan.

* Restyler fixes

* Forgot a file..
mrjerryjohns added a commit to mrjerryjohns/connectedhomeip that referenced this pull request Jun 22, 2021
This enables TSAN checking when running the functional tests in CI.

This is currently only enabled on Linux since the various races there
were fixed when project-chip#7478 landed.

When Darwin races are fixed, tsan can be enabled for those too.
mrjerryjohns added a commit to mrjerryjohns/connectedhomeip that referenced this pull request Jun 22, 2021
This enables TSAN checking when running the functional tests in CI.

This is currently only enabled on Linux since the various races there
were fixed when project-chip#7478 landed.

When Darwin races are fixed, tsan can be enabled for those too.
bzbarsky-apple pushed a commit that referenced this pull request Jun 23, 2021
* Enable TSAN builds on CI for Functional Tests.

This enables TSAN checking when running the functional tests in CI.

This is currently only enabled on Linux since the various races there
were fixed when #7478 landed.

When Darwin races are fixed, tsan can be enabled for those too.

* Fix-ups as per review feedback, including enabling it for Darwin

* Increased timeout for tsan jobs since it takes some time!

* Increased timeout for tsan jobs since it takes some time (real fix)!
nikita-s-wrk pushed a commit to nikita-s-wrk/connectedhomeip that referenced this pull request Sep 23, 2021
* Fix thread races in chip-tool

Problem:

This PR achieves the following to fix-up the various thread-races detected by tsan in chip-tool:

Change:

- Following the pattern of 'external synchronization', sprinkled LockChipStack() and UnlockChipStack() calls around key call sites that called into the stack from the various command logic in chip-tool
- Removed usleep and global instance hacks.
- Reverts changes in project-chip#7494
- Re-structured Command::Run to now have the bulk of the stack initialization and shutdown be managed before Run() is called in Commands::Run(), and an ExecutionContext object pointer be stashed inside the Command for convenient access. This reduces the changes of people writing new commands of getting stack initialization wrong.
- Instead of sometimes using chip::Controller::DeviceController and sometimes DeviceCommissioner, just used the latter in all commands since that is the super-set class anyways.
- Added a new 'StopEventLoopTask' that is thread-safe, that is needed to be called by application logic before DeviceController::Shutdown() can be called with external synchronization.
- Pivots PlatformMgr::Shutdown() to not handle stopping the event queue,
  but only focus on cleaning up the stack objects.
- Fixed up TestMdns as well along the way.

Testing:

- Enabled tsan using 'is_tsan' build arg and used that catch well over
  10+ races, with not a single false-positive.
- Ran through all the chip-tool command groups (pairing, IM, discover,
  testcluster, payload, etc) 10x each to ensure no regressions in
  functionality as well as ensuring clean shutdown with tsan.

* Restyler fixes

* Forgot a file..
nikita-s-wrk pushed a commit to nikita-s-wrk/connectedhomeip that referenced this pull request Sep 23, 2021
* Enable TSAN builds on CI for Functional Tests.

This enables TSAN checking when running the functional tests in CI.

This is currently only enabled on Linux since the various races there
were fixed when project-chip#7478 landed.

When Darwin races are fixed, tsan can be enabled for those too.

* Fix-ups as per review feedback, including enabling it for Darwin

* Increased timeout for tsan jobs since it takes some time!

* Increased timeout for tsan jobs since it takes some time (real fix)!
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants