Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Crashes in chip-tool client after PR 7060 #7297

Closed
bzbarsky-apple opened this issue Jun 2, 2021 · 3 comments · Fixed by #7430
Closed

Crashes in chip-tool client after PR 7060 #7297

bzbarsky-apple opened this issue Jun 2, 2021 · 3 comments · Fixed by #7430

Comments

@bzbarsky-apple
Copy link
Contributor

Problem

C++ chip-tool intermittently crashes when sending commands, in a build with PR #7060 in it. Attached is a log showing the crash plus some debugging information, but the upshot is that the following things happen, in order:

  1. DeviceController::Shutdown is called on thread 1
  2. We try to send a message on thread 2 and crash because things are shut down and hence the SecureSessionMgr has a null mTransportMgr.

log.txt

What's happening here is that we send a command and get a response, which we process on thread 2. That response ends up in CommandSender::OnMessageReceived. This processes the response, and, after the PR #7060 changes, calls the consumer's callback. The callback in this case is one of the functions defined in examples/chip-tool/commands/common/Commands.h, which calls SetCommandExitStatus. That notifies the condvar that thread 1 (which sent the command) is waiting for.

At this point the two threads are racing against each other. Thread 2 proceeds to finish processing the message and close the exchange context, which flushes out the CRMP/MRP ack. That's the second stack in the attached log, leading to the message send. At the same time, Thread 1 is running and eventually calling Shutdown. The question is who wins the race: do we get our message off before we are shut down or not? If we do, we're fine. If not, we crash.

Proposed Solution

Not have a race here. At the very least, we should not be able to enter shutdown while we are still in the middle of message processing. This is generally related to #6841.

Before PR #7060 I don't think this was a problem because we always closed the exchange and flushed out the pending ack before notifying our consumer. But we don't really want to be forced to do that....

@erjiaqing @pan-apple @andy31415 @arunbharadwaj

@erjiaqing
Copy link
Contributor

IMHO, this case is quite tricky…… since in most apps we won't call shotdown and this won't be a problem (i.e. the Shutdown is never called by example apps since we terminates it without gracefully shutting down)

One possible solution would be:

  1. Stop mainloop, so no new package will be received -- no new exchange will be open.
  2. Set the exchange manager to some Stopping state, thus no exchange will be created.
  3. Wait for a while so existing exchange can finish their job
  4. After a graceperiod, force reset the exchanges, or notify any unclean shutdown.

@bzbarsky-apple
Copy link
Contributor Author

Just to be clear, on iOS we do in fact call Shutdown. In practice we end up having to call it a good bit, so we're interested in it working properly. ;)

The right solution here seems to be to:

  1. Make Shutdown async, with notification when it's done.
  2. Dispatch the actual shutdown to the "chip thread" (whatever thread it is we are doing the message processing on).
  3. When shutdown runs on that thread it can do the sort of steps Crashes in chip-tool client after PR 7060 #7297 (comment) proposes: stop accepting new messages, flush out pending acks, etc. Not sure what this should do with pending MRP resends.

@bzbarsky-apple
Copy link
Contributor Author

For things that land us inside Device::OnMessageReceived (i.e. go via Device::SendMessage, like attr writes), we have the exact same issue even without PR 7060: Device::OnMessageReceived does all the message processing, then closes the exchange, and at that point we are racing shutdown.

bzbarsky-apple added a commit to bzbarsky-apple/connectedhomeip that referenced this issue Jun 7, 2021
Before this fix we would tear down some things (importantly the secure
session manager) before starting shutdown of the network layer.  This
would lead to a window of time during which we can still receive
messages while in a partially torn down state, which would leave to
crashes.

Moving network layer shutdown, and in particular platform manager
shutdown, to be first in the shutdown sequence ensures this can't
happen by shutting down the message processing thread before we tear
down any other state.

Fixes project-chip#7297
bzbarsky-apple added a commit to bzbarsky-apple/connectedhomeip that referenced this issue Jun 8, 2021
Before this fix we would tear down some things (importantly the secure
session manager) before starting shutdown of the network layer.  This
would lead to a window of time during which we can still receive
messages while in a partially torn down state, which would leave to
crashes.

Moving network layer shutdown, and in particular platform manager
shutdown, to be first in the shutdown sequence ensures this can't
happen by shutting down the message processing thread before we tear
down any other state.

Fixes project-chip#7297
mspang pushed a commit that referenced this issue Jun 8, 2021
Before this fix we would tear down some things (importantly the secure
session manager) before starting shutdown of the network layer.  This
would lead to a window of time during which we can still receive
messages while in a partially torn down state, which would leave to
crashes.

Moving network layer shutdown, and in particular platform manager
shutdown, to be first in the shutdown sequence ensures this can't
happen by shutting down the message processing thread before we tear
down any other state.

Fixes #7297
andy31415 pushed a commit that referenced this issue Jun 9, 2021
Before this fix we would tear down some things (importantly the secure
session manager) before starting shutdown of the network layer.  This
would lead to a window of time during which we can still receive
messages while in a partially torn down state, which would leave to
crashes.

Moving network layer shutdown, and in particular platform manager
shutdown, to be first in the shutdown sequence ensures this can't
happen by shutting down the message processing thread before we tear
down any other state.

Fixes #7297
doublemis1 pushed a commit to doublemis1/connectedhomeip that referenced this issue Jul 7, 2021
Before this fix we would tear down some things (importantly the secure
session manager) before starting shutdown of the network layer.  This
would lead to a window of time during which we can still receive
messages while in a partially torn down state, which would leave to
crashes.

Moving network layer shutdown, and in particular platform manager
shutdown, to be first in the shutdown sequence ensures this can't
happen by shutting down the message processing thread before we tear
down any other state.

Fixes project-chip#7297
nikita-s-wrk pushed a commit to nikita-s-wrk/connectedhomeip that referenced this issue Sep 23, 2021
Before this fix we would tear down some things (importantly the secure
session manager) before starting shutdown of the network layer.  This
would lead to a window of time during which we can still receive
messages while in a partially torn down state, which would leave to
crashes.

Moving network layer shutdown, and in particular platform manager
shutdown, to be first in the shutdown sequence ensures this can't
happen by shutting down the message processing thread before we tear
down any other state.

Fixes project-chip#7297
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants