Crashes in chip-tool client after PR 7060 #7297

bzbarsky-apple · 2021-06-02T02:40:33Z

Problem

C++ chip-tool intermittently crashes when sending commands, in a build with PR #7060 in it. Attached is a log showing the crash plus some debugging information, but the upshot is that the following things happen, in order:

DeviceController::Shutdown is called on thread 1
We try to send a message on thread 2 and crash because things are shut down and hence the SecureSessionMgr has a null mTransportMgr.

log.txt

What's happening here is that we send a command and get a response, which we process on thread 2. That response ends up in CommandSender::OnMessageReceived. This processes the response, and, after the PR #7060 changes, calls the consumer's callback. The callback in this case is one of the functions defined in examples/chip-tool/commands/common/Commands.h, which calls SetCommandExitStatus. That notifies the condvar that thread 1 (which sent the command) is waiting for.

At this point the two threads are racing against each other. Thread 2 proceeds to finish processing the message and close the exchange context, which flushes out the CRMP/MRP ack. That's the second stack in the attached log, leading to the message send. At the same time, Thread 1 is running and eventually calling Shutdown. The question is who wins the race: do we get our message off before we are shut down or not? If we do, we're fine. If not, we crash.
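To make the two outcomes concrete, here is a minimal single-threaded model of the race. It is only a sketch: `SketchSessionMgr` and `AckSentSafely` are hypothetical names, the `transportAvailable` flag stands in for a non-null mTransportMgr, and a `false` return marks the spot where the real code would crash.

```cpp
#include <cassert>

// Hypothetical stand-in for SecureSessionMgr; not the real CHIP class.
struct SketchSessionMgr
{
    bool transportAvailable = true; // models mTransportMgr != nullptr

    // Models what Shutdown does to the session manager's transport.
    void Shutdown() { transportAvailable = false; }

    // Returns false where the real code would dereference a null transport.
    bool SendAck() const { return transportAvailable; }
};

// Replays the two possible interleavings after thread 2 has notified the
// condvar. The flag picks which thread gets to run first.
inline bool AckSentSafely(bool shutdownWinsRace)
{
    SketchSessionMgr mgr;
    if (shutdownWinsRace)
    {
        mgr.Shutdown();       // thread 1: Shutdown runs first
        return mgr.SendAck(); // thread 2: ack flush finds no transport
    }
    bool sent = mgr.SendAck(); // thread 2: ack goes out first
    mgr.Shutdown();            // thread 1: Shutdown is now harmless
    return sent;
}

// Usage sketch: only the ordering where the ack goes out first is safe.
inline void DemoRace()
{
    assert(AckSentSafely(false));
    assert(!AckSentSafely(true));
}
```

Nothing in the current code forces the first ordering, which is why the crash is intermittent.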

Proposed Solution

Not have a race here. At the very least, we should not be able to enter shutdown while we are still in the middle of message processing. This is generally related to #6841.

Before PR #7060 I don't think this was a problem because we always closed the exchange and flushed out the pending ack before notifying our consumer. But we don't really want to be forced to do that....

@erjiaqing @pan-apple @andy31415 @arunbharadwaj


erjiaqing · 2021-06-02T02:53:03Z

IMHO, this case is quite tricky, since in most apps we won't call Shutdown and this won't be a problem (i.e. Shutdown is never called by the example apps, since we terminate them without gracefully shutting down).

One possible solution would be:

Stop the main loop, so no new packets will be received and no new exchanges will be opened.
Set the exchange manager to some Stopping state, so no new exchanges will be created.
Wait for a while so existing exchanges can finish their work.
After a grace period, force-reset the remaining exchanges, or report an unclean shutdown.
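A rough sketch of those steps, assuming a standalone class rather than the real exchange manager. `SketchExchangeManager`, its method names, and the `bool` return that reports a clean versus forced shutdown are all hypothetical.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <cstddef>
#include <mutex>

// Hypothetical stand-in; this is not the CHIP ExchangeManager API.
class SketchExchangeManager
{
public:
    // Returns false once shutdown has begun, so no new exchange is created.
    bool OpenExchange()
    {
        std::lock_guard<std::mutex> lock(mMutex);
        if (mState != State::Running)
            return false;
        ++mOpenExchanges;
        return true;
    }

    // Called when an exchange finishes its work (e.g. after its ack is flushed).
    void CloseExchange()
    {
        std::lock_guard<std::mutex> lock(mMutex);
        if (mOpenExchanges > 0)
            --mOpenExchanges;
        if (mOpenExchanges == 0)
            mDrained.notify_all();
    }

    // Enters Stopping, waits up to `grace` for open exchanges to finish, then
    // force-resets whatever is left. Returns true for a clean shutdown and
    // false if exchanges had to be dropped.
    bool Shutdown(std::chrono::milliseconds grace)
    {
        std::unique_lock<std::mutex> lock(mMutex);
        mState     = State::Stopping;
        bool clean = mDrained.wait_for(lock, grace, [this] { return mOpenExchanges == 0; });
        mOpenExchanges = 0; // force reset of any leftover exchanges
        mState         = State::Stopped;
        return clean;
    }

private:
    enum class State { Running, Stopping, Stopped };

    std::mutex mMutex;
    std::condition_variable mDrained;
    std::size_t mOpenExchanges = 0;
    State mState               = State::Running;
};

// Usage sketch: a drained manager shuts down cleanly, a busy one does not.
inline void DemoGracefulShutdown()
{
    SketchExchangeManager drained;
    assert(drained.OpenExchange());
    drained.CloseExchange();
    assert(drained.Shutdown(std::chrono::milliseconds(10)));
    assert(!drained.OpenExchange()); // refused after shutdown

    SketchExchangeManager busy;
    assert(busy.OpenExchange());
    assert(!busy.Shutdown(std::chrono::milliseconds(10))); // grace period expired
}
```

The sketch leaves out stopping the main loop, and it does not decide what an unclean shutdown should report to the caller.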

bzbarsky-apple · 2021-06-02T14:13:54Z

Just to be clear, on iOS we do in fact call Shutdown. In practice we end up having to call it a good bit, so we're interested in it working properly. ;)

The right solution here seems to be to:

Make Shutdown async, with notification when it's done.
Dispatch the actual shutdown to the "chip thread" (whatever thread it is we are doing the message processing on).
When shutdown runs on that thread, it can do the sort of steps that the previous comment on this issue (#7297) proposes: stop accepting new messages, flush out pending acks, etc. Not sure what this should do with pending MRP resends.
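As a sketch of the shape this could take, assuming a simple FIFO work queue in place of the real "chip thread": `SketchChipThread`, `Post`, and `ShutdownAsync` are hypothetical names, and MRP resends are not modeled. Because shutdown is just another queued task, any message processing queued before it completes first.

```cpp
#include <cassert>
#include <condition_variable>
#include <deque>
#include <functional>
#include <future>
#include <mutex>
#include <thread>
#include <utility>

// Hypothetical single work thread with a FIFO task queue.
class SketchChipThread
{
public:
    SketchChipThread() : mThread([this] { Run(); }) {}

    ~SketchChipThread()
    {
        ShutdownAsync(nullptr); // no-op if shutdown was already requested
        mThread.join();
    }

    // Queues work for the thread. Returns false once shutdown was requested.
    bool Post(std::function<void()> task)
    {
        {
            std::lock_guard<std::mutex> lock(mMutex);
            if (mStopping)
                return false;
            mQueue.push_back(std::move(task));
        }
        mCv.notify_one();
        return true;
    }

    // Queues the shutdown itself behind all pending work; onDone is invoked
    // on the work thread once shutdown has run.
    bool ShutdownAsync(std::function<void()> onDone)
    {
        {
            std::lock_guard<std::mutex> lock(mMutex);
            if (mStopping)
                return false;
            mStopping = true;
            mQueue.push_back([this, onDone] {
                mExit = true; // real teardown would go here
                if (onDone)
                    onDone();
            });
        }
        mCv.notify_one();
        return true;
    }

private:
    void Run()
    {
        while (!mExit) // mExit is only touched on this thread
        {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(mMutex);
                mCv.wait(lock, [this] { return !mQueue.empty(); });
                task = std::move(mQueue.front());
                mQueue.pop_front();
            }
            task();
        }
    }

    std::mutex mMutex;
    std::condition_variable mCv;
    std::deque<std::function<void()>> mQueue;
    bool mStopping = false;
    bool mExit     = false;
    std::thread mThread; // declared last so the queue exists before Run() starts
};

// Usage sketch: the "ack flush" posted before shutdown always runs first.
inline bool DemoAsyncShutdown()
{
    bool ackFlushed = false;
    std::promise<bool> done;
    std::future<bool> finished = done.get_future();
    SketchChipThread chipThread;
    chipThread.Post([&ackFlushed] { ackFlushed = true; });
    chipThread.ShutdownAsync([&] { done.set_value(ackFlushed); });
    bool ackWentOutFirst = finished.get();
    bool rejectsNewWork  = !chipThread.Post([] {});
    assert(ackWentOutFirst && rejectsNewWork);
    return ackWentOutFirst && rejectsNewWork;
}
```

The caller learns that shutdown finished through the callback, which is the "notification when it's done" part.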

bzbarsky-apple · 2021-06-03T22:22:43Z

For things that land us inside Device::OnMessageReceived (i.e. go via Device::SendMessage, like attr writes), we have the exact same issue even without PR 7060: Device::OnMessageReceived does all the message processing, then closes the exchange, and at that point we are racing shutdown.

Before this fix we would tear down some things (importantly the secure session manager) before starting shutdown of the network layer. This would lead to a window of time during which we can still receive messages while in a partially torn down state, which would lead to crashes. Moving network layer shutdown, and in particular platform manager shutdown, to be first in the shutdown sequence ensures this can't happen by shutting down the message processing thread before we tear down any other state. Fixes project-chip#7297
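The ordering described above can be sketched as follows. This is an illustration only, not the actual change: `SketchController` and `SketchTransport` are hypothetical, with the `unique_ptr` standing in for state such as the secure session manager's transport.

```cpp
#include <atomic>
#include <cassert>
#include <memory>
#include <thread>

// Hypothetical transport that the message thread uses to send acks.
struct SketchTransport
{
    int sent = 0;
    void Send() { ++sent; }
};

class SketchController
{
public:
    SketchController() : mTransport(new SketchTransport), mRunning(true), mThread([this] { Loop(); }) {}

    ~SketchController() { Shutdown(); }

    void Shutdown()
    {
        // 1. Stop and join the message processing thread first, so nothing
        //    can observe partially torn down state.
        mRunning = false;
        if (mThread.joinable())
            mThread.join();
        // 2. Only now tear down the state that thread was using.
        mTransport.reset();
    }

    bool IsTornDown() const { return mTransport == nullptr; }

private:
    void Loop()
    {
        while (mRunning.load())
        {
            assert(mTransport != nullptr); // holds because teardown follows join
            mTransport->Send();
            std::this_thread::yield();
        }
    }

    std::unique_ptr<SketchTransport> mTransport;
    std::atomic<bool> mRunning;
    std::thread mThread; // declared last so it starts after the other members
};

// Usage sketch: Shutdown is safe to call while the message thread is busy.
inline void DemoShutdownOrdering()
{
    SketchController controller;
    assert(!controller.IsTornDown());
    controller.Shutdown();
    assert(controller.IsTornDown());
}
```

Swapping steps 1 and 2 in `Shutdown` would reintroduce the window in which the loop can run against a null transport.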


bzbarsky-apple mentioned this issue Jun 2, 2021

Move serialization of chip::Device earlier. #7218

Merged

bzbarsky-apple mentioned this issue Jun 4, 2021

Create Tests workflow, will house certification tests #7381

Merged

bzbarsky-apple mentioned this issue Jun 7, 2021

Fix shutdown ordering in DeviceController. #7430

Merged

mspang closed this as completed in #7430 Jun 8, 2021

erjiaqing mentioned this issue Jun 11, 2021

Revert "Revert "[controller] Move callbacks to CommandResponseStatus since timing related issue is resolved."" #7542

Merged
